Creating reference CSVs for model training and inference

When you train models with solaris, it uses reference CSV files to find images and their matching labels. Let’s go through what those files are and what they should include. You’ll create up to three reference files: a training data CSV, an optional validation data CSV, and an inference data CSV.

Training Data CSV

Your training data CSV must have two columns with the exact names below:

  • image: The image column defines the path to each image file used during training, one path per row. You can use either an absolute path or a path relative to the directory you run your code from; we recommend absolute paths for consistency.

  • label: The label column defines the paths to the label (mask) files. If you need to create masks first, check out the Python API tutorial or the CLI tutorial.

The image and label in each row must match! This is how solaris matches your training images to the expected outputs.
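As a minimal sketch of how such a file could be assembled, the snippet below pairs each image with the mask that shares its file name (without extension) and writes the two required columns. The helper name build_reference_csv, the directory layout, and the pairing-by-file-name rule are assumptions made for illustration; they are not part of solaris.

```python
import csv
from pathlib import Path


def build_reference_csv(image_dir, label_dir, out_csv):
    """Pair images with masks by file stem and write an image,label CSV.

    Hypothetical helper: assumes each mask shares its image's file name
    (ignoring the extension). Adapt the pairing rule to your own data.
    """
    image_dir = Path(image_dir).resolve()
    label_dir = Path(label_dir).resolve()
    # Index masks by file name without extension, e.g. "tile_001".
    labels = {p.stem: p for p in label_dir.iterdir() if p.is_file()}

    rows = []
    for image_path in sorted(image_dir.iterdir()):
        if not image_path.is_file():
            continue
        label_path = labels.get(image_path.stem)
        if label_path is None:
            # Fail loudly rather than write a row with a mismatched pair.
            raise FileNotFoundError(f"No mask found for {image_path.name}")
        rows.append({"image": str(image_path), "label": str(label_path)})

    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Because both directories are resolved first, every path written to the file is absolute, which follows the recommendation above.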

If you choose to have solaris split validation data out for you, it will randomly select a fraction of the rows for validation. That fraction is set in the config YAML file; for details, see the YAML config reference.

For more control over what data is used for training vs. validation, you can create a separate validation CSV.

Validation Data CSV

This CSV has the same format as the Training Data CSV, but specifies the images and masks to be used for epoch-wise validation. Make sure there’s no overlap between your training and validation sets, because you don’t want any data leaks! If you want solaris to split the validation data out of the training data automatically, you don’t need to provide this file.
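If you build the validation CSV yourself, one possible approach is to split a single list of image/label rows and confirm that the two sets are disjoint before writing them out. The sketch below is illustrative only: split_train_val and find_overlap are hypothetical helpers, and the seeded shuffle is not a description of how solaris performs its own automatic split.

```python
import random


def split_train_val(rows, val_frac=0.2, seed=0):
    """Shuffle rows and split them into disjoint train and validation lists.

    Hypothetical helper: rows is a list of dicts with "image" and "label" keys.
    """
    if not 0 < val_frac < 1:
        raise ValueError("val_frac must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


def find_overlap(train_rows, val_rows):
    """Return image paths present in both sets; an empty set means no leak."""
    train_images = {row["image"] for row in train_rows}
    val_images = {row["image"] for row in val_rows}
    return train_images & val_images
```

Running find_overlap on the two lists before writing each one to its own CSV is a cheap way to catch an accidental leak.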

Inference Data CSV

This reference file points to the image files that you wish to make predictions on. It therefore only needs to contain one column: image.
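A single-column file like this can be written in a few lines. The helper below is a hypothetical sketch, not a solaris function; it assumes you already have a list of image paths and stores each one as an absolute path.

```python
import csv
from pathlib import Path


def write_inference_csv(image_paths, out_csv):
    """Write a one-column CSV (header: image) listing images to predict on.

    Hypothetical helper; paths are resolved to absolute paths before writing.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image"])
        for path in image_paths:
            writer.writerow([str(Path(path).resolve())])
```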

Using these files

Once you have made these reference files, provide the paths to them in your configuration file; they’ll automatically be loaded into your config when you call solaris.utils.config.parse().