mml.core.data_preparation.dset_creator

class DSetCreator[source]

Bases: object

The dataset creator handles all relevant steps to prepare the dataset on your device. This includes:
  • downloading the data and checking hashes

  • extracting the data from archives

  • alternatively, extracting data from existing PyTorch datasets

  • storing the data at the correct spots (unlabeled, train and test data)

  • optionally transforming masks of segmentation data

Main usage:
  1. Based on use case:
    1. call .download() / .kaggle_download() / .verify_pre_download() function to download/register files

    2. call .unpack_and_store() to unpack files and move them to the correct spot

    OR:
    1. create / import a pytorch dataset

    2. call .extract_from_pytorch_datasets() to extract that data

  2. (optionally) turn the masks of segmentation tasks to the correct format with .transform_masks()
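The first (download-based) workflow above can be sketched as follows. This is a minimal sketch, not reference usage from the library: the helper name `prepare_from_url` and any URL or file name passed to it are hypothetical, and only the method names and keyword arguments (`download`, `unpack_and_store`, `url`, `file_name`, `md5`) are taken from the signatures documented on this page. The creator object is passed in, so the sketch does not depend on how DSetCreator is imported or configured.

```python
from pathlib import Path
from typing import Any, Optional


def prepare_from_url(creator: Any, url: str, file_name: str,
                     md5: Optional[str] = None) -> Path:
    """Hypothetical helper: run the download-based workflow on a
    DSetCreator-like object (anything offering the two methods below)."""
    # Step 1: register and download the archive; md5 is forwarded so the
    # creator can check the hash of the downloaded file.
    creator.download(url=url, file_name=file_name, md5=md5)
    # Step 2: unpack all registered archives and move the data in place.
    dset_path = creator.unpack_and_store()
    # The returned dataset root path is what a TaskCreator consumes later.
    return dset_path
```

With the real class, the call would presumably look like `prepare_from_url(DSetCreator(dset_name="my_dset"), url, file_name)`, where `"my_dset"` is a placeholder name.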

__init__(dset_name: str, download_path: Path | None = None, dset_path: Path | None = None)[source]

Creator class for datasets.

Parameters:
  • dset_name – name of the dataset (should be short, since it is used in directory names)

  • download_path – (optional) a path to already downloaded files

  • dset_path – (optional) a path to an already created dset folder

download(url: str, file_name: str, data_kind: DataKind = DataKind.MIXED, md5: str | None = None) DataArchive[source]

Downloads files from the web to your local drive.

Parameters:
  • url (str) – URL to download from

  • file_name (str) – name of the file

  • data_kind (DataKind) – (optional) type of the data

  • md5 (Optional[str]) – (optional) md5 sum of the downloaded obj

Returns:

a reference to the created archive, may be used to modify keep_top_dir before extraction
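To obtain a value for the md5 parameter, or to check a file by hand, the MD5 sum can be computed with the standard library. The sketch below is independent of the library's own hash check, whose implementation is not shown on this page; the helper names `md5_of_file` and `matches_md5` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks so that
    large archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: Path, expected: str) -> bool:
    """Compare the file's digest to an expected hex string, ignoring case."""
    return md5_of_file(path) == expected.lower()
```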

extract_from_pytorch_datasets(datasets: Dict[str, Dataset], task_type: TaskType, allow_transforms: bool = False, class_names: List[str] | None = None) Path[source]

Can be used to store an existing dataset (e.g. imported from some other repository or from torchvision). Expects __getitem__ to return (image, class) tuples for classification and (image, mask) tuples for semantic segmentation. An unlabeled dataset is expected to return only the image, not a tuple.

Parameters:
  • datasets – dict of datasets to be stored, keys must be from [training, testing, unlabeled]

  • task_type – task type of the dataset

  • allow_transforms – if False, an error is raised when a dataset's transform attribute is present and not None

  • class_names – optional list of class names to store data with class name directories

Returns:

the dataset root path (to be used by TaskCreator)
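The expected item shapes and dictionary keys can be illustrated with plain Python stand-ins. This is a sketch under assumptions: the class names are hypothetical, the nested lists merely stand in for images, and in real use the objects would be torch Dataset instances as the signature requires. Only the keys `training`, `testing` and `unlabeled` and the tuple conventions come from the description above.

```python
from typing import Any, Dict, List, Tuple


class ToyClassificationDataset:
    """Stand-in for a labeled dataset: items are (image, class) tuples."""

    def __init__(self, images: List[Any], labels: List[int]) -> None:
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, index: int) -> Tuple[Any, int]:
        return self.images[index], self.labels[index]


class ToyUnlabeledDataset:
    """Stand-in for an unlabeled dataset: items are the image only."""

    def __init__(self, images: List[Any]) -> None:
        self.images = images

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, index: int) -> Any:
        return self.images[index]


# Placeholder "images" (2x2 nested lists), for illustration only.
blank = [[0, 0], [0, 0]]
datasets: Dict[str, Any] = {
    "training": ToyClassificationDataset([blank, blank], [0, 1]),
    "testing": ToyClassificationDataset([blank], [1]),
    "unlabeled": ToyUnlabeledDataset([blank, blank, blank]),
}
```

A dictionary of this shape is what the datasets parameter describes; the accompanying task_type value (a TaskType member) is omitted here because its members are not listed on this page.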

kaggle_download(competition: str | None = None, dataset: str | None = None, data_kind: DataKind = DataKind.MIXED) List[DataArchive][source]

Downloads all the data of a Kaggle competition or dataset. Specify either the competition or the dataset parameter, but not both. Currently only a single Kaggle download is supported per dataset.

Parameters:
  • competition – (optional) kaggle competition identifier, mutually exclusive with dataset parameter

  • dataset – (optional) kaggle dataset identifier, mutually exclusive with competition parameter

  • data_kind (DataKind) – (optional) type of the data

Returns:

a list of references to the created archives, may be used to modify keep_top_dir before extraction
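The "exactly one of the two" rule for competition and dataset can be expressed as a small guard. This is only an illustration of the constraint; the function name `check_kaggle_source` is hypothetical, and how the library itself reacts to a violation is not stated on this page.

```python
from typing import Optional


def check_kaggle_source(competition: Optional[str] = None,
                        dataset: Optional[str] = None) -> str:
    """Return which source was given, enforcing that exactly one is set."""
    # Both None or both set violates the mutual exclusivity.
    if (competition is None) == (dataset is None):
        raise ValueError("specify exactly one of 'competition' or 'dataset'")
    return "competition" if competition is not None else "dataset"
```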

transform_masks(masks: List[Path], transform: Dict[tuple, int], load: str = 'rgb', train: bool = True, ignore: List[tuple] | None = None) Path[source]

Transforms masks to the framework's format (cv2-readable greyscale images). The transformed masks are written to dataset root -> training_labels / testing_labels -> transformed_masks -> same relative path as before (that of the mask within one of the data subfolders).

Parameters:
  • masks – list of paths to the mask files

  • transform – dict defining the transformation (mapping of mask values to classes)

  • load – mode defining the loading of a file

  • train – bool indicating if treated data is train or test

  • ignore – list of mask values that should be mapped to the ignored value of 255 (similar to transform)

Returns:

base folder for the transformed masks
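The roles of transform and ignore can be illustrated on a tiny in-memory mask. This is a pure-Python sketch of the value mapping only, not the library's implementation: the function name `remap_mask` is hypothetical, no files are read or written, and raising a KeyError for unmapped values is an assumption made for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[int, ...]
IGNORE_VALUE = 255  # the ignored value named in the parameter description


def remap_mask(mask: Sequence[Sequence[Pixel]],
               transform: Dict[Pixel, int],
               ignore: Optional[List[Pixel]] = None) -> List[List[int]]:
    """Map every pixel value (e.g. an RGB tuple) to a class index."""
    lookup = dict(transform)
    # Values listed in ignore are mapped to the ignored value.
    for value in ignore or []:
        lookup[value] = IGNORE_VALUE
    # Assumption: a value found in neither mapping is an error (KeyError).
    return [[lookup[tuple(pixel)] for pixel in row] for row in mask]


rgb_mask = [
    [(0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (255, 255, 255)],
]
transform = {(0, 0, 0): 0, (255, 0, 0): 1, (0, 255, 0): 2}
class_mask = remap_mask(rgb_mask, transform, ignore=[(255, 255, 255)])
# class_mask is [[0, 1], [2, 255]]
```

The result is a single-channel mask of class indices, which is the kind of greyscale image the description above refers to.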

unpack_and_store(clear_download_folder: bool = False) Path[source]

Unpacks all files and stores them at the correct spot.

Parameters:

clear_download_folder – if True deletes the download folder

Returns:

the dataset root path (to be used by TaskCreator)

verify_pre_download(file_name: str, instructions: str, data_kind: DataKind = DataKind.MIXED, md5: str | None = None) DataArchive[source]

Verifies that a previously downloaded file is present and adds it to the internal archive list. This is useful if, for example, downloading requires credentials or registration, or the data is non-public.

Parameters:
  • file_name (str) – name of the downloaded file or folder

  • instructions (str) – how to get the data

  • data_kind (DataKind) – (optional) kind of the data

  • md5 (Optional[str]) – (optional) md5 sum of the downloaded obj, only effective for non-folder files

Returns:

a reference to the created archive, may be used to modify keep_top_dir before extraction
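The kind of check described here can be sketched with the standard library. This is an illustration under assumptions, not the library's code: the function name `check_pre_download` is hypothetical, and surfacing the instructions in the error message when the file is missing is a guess at how that parameter is used. As stated above, the hash is only compared for regular files, not for folders.

```python
import hashlib
from pathlib import Path
from typing import Optional


def check_pre_download(path: Path, instructions: str,
                       md5: Optional[str] = None) -> Path:
    """Confirm that manually obtained data exists and, for files, matches md5."""
    if not path.exists():
        # Assumption: point the user to the instructions for getting the data.
        raise FileNotFoundError(
            f"{path} not found. To obtain the data: {instructions}")
    # The hash check is skipped for folders.
    if md5 is not None and path.is_file():
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != md5.lower():
            raise ValueError(f"md5 mismatch for {path}: got {actual}")
    return path
```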

mask_transform(mask_path: Path, dset_path: Path, out_base: Path, load: str, vmapper: Callable) bool[source]

Parallelizable part of the mask transformation. Parameter names match those used within DSetCreator.transform_masks.

Returns:

boolean indicating success of saving the transformed mask
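A per-file worker that returns a success flag lends itself to being mapped over many paths in parallel. The sketch below shows that general pattern with a stand-in worker; `fake_mask_transform` and `transform_all` are hypothetical names, the stand-in takes fewer parameters than the real function, and whether the library uses threads, processes, or something else is not stated on this page.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from pathlib import Path
from typing import Callable, List


def fake_mask_transform(mask_path: Path, out_base: Path,
                        vmapper: Callable[[int], int]) -> bool:
    """Stand-in worker: pretends to transform one mask, reports success.
    Here only .png paths count as successfully handled."""
    return mask_path.suffix == ".png"


def transform_all(masks: List[Path], out_base: Path,
                  vmapper: Callable[[int], int], workers: int = 4) -> List[bool]:
    """Run the worker over all masks; results keep the input order."""
    # Fix the arguments shared by all calls, leaving only the mask path open.
    worker = partial(fake_mask_transform, out_base=out_base, vmapper=vmapper)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, masks))


results = transform_all(
    [Path("a.png"), Path("b.txt"), Path("c.png")],
    out_base=Path("transformed_masks"),
    vmapper=lambda value: value,
)
failed = results.count(False)  # number of masks that were not saved
```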