solaris.data API reference

solaris.data.coco COCO label format management

solaris.data.coco.coco_categories_dict_from_df(df, category_id_col, category_name_col, supercategory_col=None)

Extract category IDs, category names, and supercategory names from df.

Parameters
  • df (pandas.DataFrame) – A pandas.DataFrame of records to filter for category info.

  • category_id_col (str) – The name for the column in df that contains category IDs.

  • category_name_col (str) – The name for the column in df that contains category names.

  • supercategory_col (str, optional) – The name for the column in df that contains supercategory names, if one exists. If not provided, supercategory will be left out of the output.

Returns

A list of dicts that contain category records per the COCO dataset specification.

Return type

list of dicts
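
As a rough illustration of the output shape, the sketch below builds COCO-style category records from plain Python row dicts rather than a pandas.DataFrame. The helper name categories_from_rows and the sample rows are hypothetical, and this is not the solaris implementation. It only mirrors the documented behavior that supercategory is left out when no supercategory column is given. Keeping one record per category ID is an assumption.

```python
def categories_from_rows(rows, category_id_col, category_name_col,
                         supercategory_col=None):
    """Hypothetical stand-in: collect unique COCO category records."""
    seen = set()
    categories = []
    for row in rows:
        cat_id = row[category_id_col]
        if cat_id in seen:
            continue  # assumption: keep one record per category ID
        seen.add(cat_id)
        record = {"id": cat_id, "name": row[category_name_col]}
        if supercategory_col is not None:
            record["supercategory"] = row[supercategory_col]
        categories.append(record)
    return categories


# Hypothetical sample records; the column names are made up for this sketch.
rows = [
    {"cat_id": 1, "cat_name": "building", "super": "structure"},
    {"cat_id": 2, "cat_name": "road", "super": "transport"},
    {"cat_id": 1, "cat_name": "building", "super": "structure"},
]
with_super = categories_from_rows(rows, "cat_id", "cat_name", "super")
without_super = categories_from_rows(rows, "cat_id", "cat_name")
```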

solaris.data.coco.df_to_coco_annos(df, output_path=None, geom_col='geometry', image_id_col=None, category_col=None, score_col=None, preset_categories=None, supercategory_col=None, include_other=True, starting_id=1, verbose=0)

Extract COCO-formatted annotations from a pandas DataFrame.

This function assumes that annotations are already in pixel coordinates. If this is not the case, you can transform them using solaris.vector.polygon.geojson_to_px_gdf().

Note that this function generates annotations formatted per the COCO object detection specification. For additional information, see the COCO dataset specification.

Parameters
  • df (pandas.DataFrame) – A pandas.DataFrame containing geometries to store as annotations.

  • image_id_col (str, optional) – The column containing image IDs. If not provided, all annotations are assumed to be in the same image, which will be assigned the ID 1.

  • geom_col (str, optional) – The name of the column in df that contains geometries. The geometries should be either shapely.geometry.Polygon objects or WKT strings. Defaults to "geometry".

  • category_col (str, optional) – The name of the column that specifies categories for each object. If not provided, all objects will be placed in a single category named "other".

  • score_col (str, optional) – The name of the column that specifies the output confidence of a model. If not provided, scores will not be included in the output.

  • preset_categories (list of dicts, optional) – A pre-set list of categories to use for labels. These categories should be formatted per the COCO category specification.

  • starting_id (int, optional) – The number to start numbering annotation IDs at. Defaults to 1.

  • verbose (int, optional) – Verbose text output. By default, none is provided; if True or 1, information-level outputs are provided; if 2, extremely verbose text is output.

Returns

output_dict – A dictionary containing COCO-formatted annotation and category entries per the COCO dataset specification.

Return type

dict
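
To make the annotation format concrete, the following sketch converts one polygon, already in pixel coordinates, into a record with the fields defined by the COCO object detection specification. The helper polygon_to_coco_annotation is hypothetical and is not taken from solaris; the exact set of keys that df_to_coco_annos() emits may differ.

```python
def polygon_to_coco_annotation(coords, anno_id, image_id, category_id):
    """Hypothetical helper: one polygon (pixel coords) -> one COCO record."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # Shoelace formula gives twice the signed polygon area (square pixels).
    n = len(coords)
    twice_area = sum(
        coords[i][0] * coords[(i + 1) % n][1]
        - coords[(i + 1) % n][0] * coords[i][1]
        for i in range(n)
    )
    return {
        "id": anno_id,
        "image_id": image_id,
        "category_id": category_id,
        # COCO polygons are flat [x1, y1, x2, y2, ...] lists.
        "segmentation": [[value for xy in coords for value in xy]],
        # COCO boxes are [x_min, y_min, width, height].
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": abs(twice_area) / 2,
        "iscrowd": 0,
    }


# A 20 x 30 pixel rectangle; IDs start at 1, as in the documented defaults.
square = [(10, 20), (30, 20), (30, 50), (10, 50)]
anno = polygon_to_coco_annotation(square, anno_id=1, image_id=1, category_id=1)
```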

solaris.data.coco.geojson2coco(image_src, label_src, output_path=None, image_ext='.tif', matching_re=None, category_attribute=None, score_attribute=None, preset_categories=None, include_other=True, info_dict=None, license_dict=None, recursive=False, override_crs=False, explode_all_multipolygons=False, remove_all_multipolygons=False, verbose=0)

Generate COCO-formatted labels from one or multiple geojsons and images.

This function ingests optionally georegistered polygon labels in geojson format alongside image(s) and generates .json files per the COCO dataset specification. Some models, like many Mask R-CNN implementations, require labels to be in this format. The function assumes you’re providing image file(s) and geojson file(s) to create the dataset. If you provide more than one image and more than one geojson (e.g. with a SpaceNet dataset), you must also provide a regex pattern that extracts matching substrings from the filenames so that images can be matched to label files.

Parameters
  • image_src (str or list or dict) –

    Source image(s) to use in the dataset. This can be:

    1. a string path to an image,
    2. the path to a directory containing multiple images,
    3. a list of image paths,
    4. a dictionary corresponding to COCO-formatted image records, or
    5. a string path to a COCO JSON containing image records.
    

    If a directory, the recursive flag will be used to determine whether or not to descend into sub-directories.

  • label_src (str or list) – Source labels to use in the dataset. This can be a string path to a geojson, the path to a directory containing multiple geojsons, or a list of geojson file paths. If a directory, the recursive flag will determine whether or not to descend into sub-directories.

  • output_path (str, optional) – The path to save the JSON-formatted COCO records to. If not provided, the records will only be returned as a dict, and not saved to file.

  • image_ext (str, optional) – The string to use to identify images when searching directories. Only has an effect if image_src is a directory path. Defaults to ".tif".

  • matching_re (str, optional) – A regular expression pattern to match filenames between image_src and label_src if both are directories of multiple files. This has no effect if those arguments do not both correspond to directories or lists of files. Will raise a ValueError if multiple files are provided for both image_src and label_src but no matching_re is provided.

  • category_attribute (str, optional) – The name of an attribute in the geojson that specifies which category a given instance corresponds to. If not provided, it’s assumed that only one class of object is present in the dataset, which will be termed "other" in the output json.

  • score_attribute (str, optional) – The name of an attribute in the geojson that specifies the prediction confidence of a model.

  • preset_categories (list of dicts, optional) – A pre-set list of categories to use for labels. These categories should be formatted per the COCO category specification. Example: [{'id': 1, 'name': 'Fighter Jet', 'supercategory': 'plane'}, {'id': 2, 'name': 'Military Bomber', 'supercategory': 'plane'}, …]

  • include_other (bool, optional) – If set to True, and preset_categories is provided, objects that don’t fall into the specified categories will not be removed from the dataset. They will instead be passed into a category named "other" with its own associated category id. If False, objects whose categories don’t match a category from preset_categories will be dropped.

  • info_dict (dict, optional) –

    A dictionary with the following key-value pairs:

    - "year" (int): year of creation
    - "version" (str): version of the dataset
    - "description" (str): string description of the dataset
    - "contributor" (str): who contributed the dataset
    - "url" (str): URL where the dataset can be found
    - "date_created" (datetime.datetime): when the dataset was created
    

  • license_dict (dict, optional) –

    A dictionary containing the licensing information for the dataset, with the following key-value pairs:

    - "name" (str): the name of the license.
    - "url" (str): a link to the dataset's license.
    

    Note: This implementation assumes that all of the data uses one license. If multiple licenses are provided, the image records will not be assigned a license ID.

  • recursive (bool, optional) – If image_src and/or label_src are directories, setting this flag to True will induce solaris to descend into subdirectories to find files. By default, solaris does not traverse the directory tree.

  • explode_all_multipolygons (bool, optional) – Explode the multipolygons into individual geometries using sol.utils.geo.split_multi_geometries. Be sure to inspect which geometries are multigeometries, because the individual geometries within them may represent artifacts rather than true labels.

  • remove_all_multipolygons (bool, optional) – Filters MultiPolygons and GeometryCollections out of each tile geodataframe. Alternatively, you can manually edit each multigeometry into a single polygon before converting to COCO format.

  • verbose (int, optional) – Verbose text output. By default, none is provided; if True or 1, information-level outputs are provided; if 2, extremely verbose text is output.
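
The dictionary-valued arguments described above can be assembled as in the following sketch. All values (years, names, URLs, paths) are hypothetical placeholders; only the keys come from the parameter descriptions. The call itself is left commented out so the block runs without solaris installed, and the paths in it are assumptions.

```python
import datetime

# Dataset-level metadata, using the keys listed for info_dict.
info_dict = {
    "year": 2020,
    "version": "1.0",
    "description": "Hypothetical building footprint dataset",
    "contributor": "Example Contributor",
    "url": "https://example.com/dataset",
    "date_created": datetime.datetime(2020, 1, 1),
}

# A single license for the whole dataset, as the implementation assumes.
license_dict = {
    "name": "Example License",
    "url": "https://example.com/license",
}

# Categories formatted per the COCO category specification.
preset_categories = [
    {"id": 1, "name": "Fighter Jet", "supercategory": "plane"},
    {"id": 2, "name": "Military Bomber", "supercategory": "plane"},
]

# Hypothetical call (paths are placeholders, not real files):
# from solaris.data.coco import geojson2coco
# coco_dataset = geojson2coco(
#     "path/to/image.tif", "path/to/labels.geojson",
#     preset_categories=preset_categories, include_other=True,
#     info_dict=info_dict, license_dict=license_dict,
# )
```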

Returns

coco_dataset – A dictionary following the COCO dataset specification. Depending on arguments provided, it may or may not include license and info metadata.

Return type

dict
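
When several images and several geojsons are supplied, matching_re is what ties each image to its label file. The sketch below shows one plausible way such matching can work: extract the first substring matching the pattern from each filename and pair files whose substrings are equal. The helper pair_by_regex and the filenames are hypothetical, and the actual matching logic inside solaris may differ.

```python
import re


def pair_by_regex(image_paths, label_paths, matching_re):
    """Hypothetical helper: pair files that share the same regex match."""
    pattern = re.compile(matching_re)

    def key(path):
        # The first substring matching the pattern identifies the tile.
        match = pattern.search(path)
        return match.group(0) if match else None

    labels_by_key = {}
    for label_path in label_paths:
        label_key = key(label_path)
        if label_key is not None:
            labels_by_key[label_key] = label_path

    pairs = {}
    for image_path in image_paths:
        image_key = key(image_path)
        if image_key in labels_by_key:
            pairs[image_path] = labels_by_key[image_key]
    return pairs


# Made-up filenames in the style of a tiled dataset.
images = ["images/RGB_img101.tif", "images/RGB_img102.tif"]
labels = ["labels/buildings_img102.geojson", "labels/buildings_img101.geojson"]
pairs = pair_by_regex(images, labels, r"img\d+")
```

With these inputs, each image is paired with the geojson that shares its img<number> substring, regardless of the order in which the files are listed.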

solaris.data.coco.make_coco_image_dict(image_ref, license_id=None)

Take a dict of image_fname: image_id pairs and make a coco dict.

Note that this creates a relatively limited version of the standard COCO image record, which only contains the following keys:

* id (int)
* width (int)
* height (int)
* file_name (str)
* license (int), optional

Parameters
  • image_ref (dict) – A dictionary of image_fname: image_id key-value pairs.

  • license_id (int, optional) – The license ID number for the relevant license. If not provided, no license information will be included in the output.

Returns

coco_images – A list of COCO-formatted image records ready for export to json.

Return type

list
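
The record layout listed above can be sketched as follows. The helper image_records and its sizes argument are hypothetical: sizes stands in for reading each image's dimensions from disk, which this sketch skips so that it stays self-contained. It is not the solaris implementation.

```python
def image_records(image_ref, sizes, license_id=None):
    """Hypothetical helper: build limited COCO image records.

    image_ref maps image_fname -> image_id, as documented above.
    sizes maps image_fname -> (width, height); this argument is an
    assumption that replaces opening the image files.
    """
    records = []
    for file_name, image_id in image_ref.items():
        width, height = sizes[file_name]
        record = {
            "id": image_id,
            "width": width,
            "height": height,
            "file_name": file_name,
        }
        if license_id is not None:
            record["license"] = license_id  # only present when provided
        records.append(record)
    return records


# Made-up filenames and dimensions.
image_ref = {"tile_1.tif": 1, "tile_2.tif": 2}
sizes = {"tile_1.tif": (650, 650), "tile_2.tif": (900, 900)}
coco_images = image_records(image_ref, sizes, license_id=1)
no_license = image_records(image_ref, sizes)
```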