mml.core.data_preparation.dset_creator

class DSetCreator

Bases: object

The dataset creator handles all relevant steps to prepare the dataset on your device. This includes:

- downloading the data and checking hashes
- extracting the data from archives
- alternatively, extracting data from existing PyTorch datasets
- storing the data at the correct spots (unlabeled, train and test data)
- optionally transforming masks of segmentation data

Main usage:

Based on use case:
1. call .download() / .kaggle_download() / .verify_pre_download() function to download/register files
2. call .unpack_and_store() to unpack files and move them to the correct spot
OR:
1. create / import a pytorch dataset
2. call .extract_from_pytorch_datasets() to extract that data
(optionally) convert the masks of segmentation tasks to the correct format with .transform_masks()
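
The two workflows above can be sketched as small helper functions. This is only an illustrative outline: the helpers prepare_from_download and prepare_from_pytorch are hypothetical names that are not part of mml, and they accept any object offering the methods documented on this page, so the sketch does not import the library itself.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def prepare_from_download(creator: Any, url: str, file_name: str,
                          md5: Optional[str] = None) -> Path:
    """Download-based workflow on a DSetCreator-like object (hypothetical helper)."""
    # Step 1: fetch the file; the documented return value is a DataArchive reference.
    creator.download(url=url, file_name=file_name, md5=md5)
    # Step 2: unpack all registered archives and move the data into place.
    return creator.unpack_and_store()


def prepare_from_pytorch(creator: Any, datasets: Dict[str, Any], task_type: Any) -> Path:
    """PyTorch-based workflow on a DSetCreator-like object (hypothetical helper)."""
    # The dict keys are expected to come from: training, testing, unlabeled.
    return creator.extract_from_pytorch_datasets(datasets=datasets, task_type=task_type)
```

With mml installed, creator would presumably be an instance such as DSetCreator(dset_name="my_dset"), where the dataset name, URL and file name are placeholders you supply. Both helpers return the dataset root path that the page says is passed on to TaskCreator.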

__init__(dset_name: str, download_path: Path | None = None, dset_path: Path | None = None)

Creator class for datasets.

Parameters:

dset_name – name of the dataset (should be short, since it is used in directory names)
download_path – (optional) a path to already downloaded files
dset_path – (optional) a path to an already created dset folder

download(url: str, file_name: str, data_kind: DataKind = DataKind.MIXED, md5: str | None = None) → DataArchive

Downloads files from the web to your local drive.

Parameters:

url (str) – URL to download from
file_name (str) – name of the file
data_kind (DataKind) – (optional) type of the data
md5 (Optional[str]) – (optional) md5 sum of the downloaded object

Returns:

a reference to the created archive, may be used to modify keep_top_dir before extraction
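
To make the role of the md5 parameter concrete, the following is a minimal sketch of how a downloaded file can be checked against an expected MD5 sum using only the standard library. It is not taken from the mml source; the function names md5_of_file and matches_md5 are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: Path, expected: str) -> bool:
    """Compare the file's digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()
```

Reading in chunks keeps memory use flat for large archives, which is why the sketch avoids loading the whole file at once.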

extract_from_pytorch_datasets(datasets: Dict[str, Dataset], task_type: TaskType, allow_transforms: bool = False, class_names: List[str] | None = None) → Path

Can be used to store an existing dataset (e.g. imported from some other repository or from torchvision). Each dataset's __getitem__ is expected to return (image, class) tuples for classification and (image, mask) tuples for semantic segmentation. An unlabeled dataset is expected to return only the image, not a tuple.

Parameters:

datasets – dict of datasets to be stored, keys must be from [training, testing, unlabeled]
task_type – task type of the dataset
allow_transforms – whether to accept datasets whose transform attribute is present and not None; if False, an error is raised for such datasets
class_names – optional list of class names to store data with class name directories

Returns:

the dataset root path (to be used by TaskCreator)
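
The expectations on the datasets argument can be illustrated with a small pre-flight check. This is a sketch under assumptions, not the library's own validation: ListDataset is a plain-Python stand-in for a map-style PyTorch dataset (so the example runs without torch), and check_datasets is a hypothetical helper.

```python
from typing import Any, Dict, List, Sequence

# Allowed keys as listed in the parameter description.
ALLOWED_KEYS = ("training", "testing", "unlabeled")


class ListDataset:
    """Stand-in for a map-style dataset: supports len() and integer indexing."""

    def __init__(self, items: Sequence[Any]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Any:
        return self._items[index]


def check_datasets(datasets: Dict[str, Any]) -> List[str]:
    """Return a list of problems found by inspecting the first sample of each dataset."""
    problems: List[str] = []
    for key, dataset in datasets.items():
        if key not in ALLOWED_KEYS:
            problems.append(f"unexpected key: {key!r}")
            continue
        if len(dataset) == 0:
            continue  # nothing to inspect
        first = dataset[0]
        if key == "unlabeled":
            if isinstance(first, tuple):
                problems.append("unlabeled: expected a bare image, got a tuple")
        elif not (isinstance(first, tuple) and len(first) == 2):
            problems.append(f"{key}: expected an (image, target) tuple")
    return problems
```

Here the images are arbitrary placeholder objects; only the shape of what __getitem__ returns is checked.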

kaggle_download(competition: str | None = None, dataset: str | None = None, data_kind: DataKind = DataKind.MIXED) → List[DataArchive]

Downloads all the data of a Kaggle competition or dataset. Specify either the competition or the dataset parameter, but not both. Currently only a single Kaggle download is supported per dataset.

Parameters:

competition – (optional) kaggle competition identifier, mutually exclusive with dataset parameter
dataset – (optional) kaggle dataset identifier, mutually exclusive with competition parameter
data_kind (DataKind) – (optional) type of the data

Returns:

a list of references to the created archives, may be used to modify keep_top_dir before extraction
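
The mutual exclusivity of competition and dataset amounts to an exclusive-or check on the two arguments. A minimal sketch of such a check follows; check_kaggle_source is a hypothetical helper, and the exception type is a choice of this sketch rather than documented behavior.

```python
from typing import Optional


def check_kaggle_source(competition: Optional[str] = None,
                        dataset: Optional[str] = None) -> str:
    """Return which source was given, or raise if both or neither are set."""
    # Both None or both set: the two comparisons agree, which is the invalid case.
    if (competition is None) == (dataset is None):
        raise ValueError("Specify exactly one of 'competition' or 'dataset'.")
    return "competition" if competition is not None else "dataset"
```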

transform_masks(masks: List[Path], transform: Dict[tuple, int], load: str = 'rgb', train: bool = True, ignore: List[tuple] | None = None) → Path

Transforms masks to the framework's format (cv2-readable greyscale images). The transformed masks are written to: dataset root -> training_labels / testing_labels -> transformed_masks -> same relative path as before (the path to the mask within one of the data subfolders).

Parameters:

masks – list of paths to the mask files
transform – dict defining the transformation (mapping of mask values to classes)
load – mode defining the loading of a file
train – bool indicating if treated data is train or test
ignore – list of mask values that should be mapped to the ignored value of 255 (similar to transform)

Returns:

base folder for the transformed masks
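
The roles of transform and ignore can be illustrated on a tiny in-memory mask. The sketch below is an assumption-laden illustration, not the library's implementation: it represents a mask as nested lists of RGB tuples instead of an image loaded from disk, map_mask is a hypothetical name, and raising KeyError for unmapped values is a choice of this sketch.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Pixel = Tuple[int, ...]
IGNORE_VALUE = 255  # the documented value that ignored mask values are mapped to


def map_mask(mask: Sequence[Sequence[Pixel]],
             transform: Dict[Pixel, int],
             ignore: Optional[List[Pixel]] = None) -> List[List[int]]:
    """Map every pixel value to a class index, yielding a greyscale-style mask."""
    lookup = dict(transform)
    for value in ignore or []:
        lookup[value] = IGNORE_VALUE
    # A pixel value missing from both transform and ignore raises KeyError here.
    return [[lookup[tuple(pixel)] for pixel in row] for row in mask]


rgb_mask = [
    [(0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (0, 0, 0)],
]
class_mask = map_mask(
    rgb_mask,
    transform={(0, 0, 0): 0, (255, 0, 0): 1},
    ignore=[(0, 255, 0)],
)
print(class_mask)  # [[0, 1], [255, 0]]
```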

unpack_and_store(clear_download_folder: bool = False) → Path

Unpacks all files and stores them at the correct spot.

Parameters:

clear_download_folder – if True deletes the download folder

Returns:

the dataset root path (to be used by TaskCreator)

verify_pre_download(file_name: str, instructions: str, data_kind: DataKind = DataKind.MIXED, md5: str | None = None) → DataArchive

Verifies that a previously downloaded file is present and adds it to the internal archive list. This is useful if, e.g., downloading requires credentials or registration, or the data is non-public.

Parameters:

file_name (str) – name of the downloaded file or folder
instructions (str) – how to get the data
data_kind (DataKind) – (optional) kind of the data
md5 (str) – (optional) md5 sum of the downloaded object, only effective for non-folder files

Returns:

a reference to the created archive, may be used to modify keep_top_dir before extraction
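
The described behavior (check presence, show instructions otherwise, apply the MD5 check only to regular files) can be sketched as follows. This is a standalone illustration: verify_local_file is a hypothetical function, the download_dir argument is an assumption of the sketch, and the exception types are not taken from mml.

```python
import hashlib
from pathlib import Path
from typing import Optional


def verify_local_file(download_dir: Path, file_name: str, instructions: str,
                      md5: Optional[str] = None) -> Path:
    """Check that a manually obtained file or folder exists and return its path."""
    target = download_dir / file_name
    if not target.exists():
        # Surface the instructions so the user knows how to obtain the data.
        raise FileNotFoundError(f"{target} not found. {instructions}")
    # The checksum only applies to regular files, not to folders.
    if md5 is not None and target.is_file():
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        if actual != md5.strip().lower():
            raise ValueError(f"MD5 mismatch for {target}: got {actual}")
    return target
```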

mask_transform(mask_path: Path, dset_path: Path, out_base: Path, load: str, vmapper: Callable) → bool

Parallelizable part of the mask transformation. Parameters keep their names from within DSetCreator.transform_masks.

Returns:

boolean indicating success of saving the transformed mask
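
The pattern of a per-mask worker that reports success as a boolean, fanned out over many masks, might look like the sketch below. Several things here are assumptions made for illustration: transform_one and transform_all are hypothetical names, the worker operates on raw bytes instead of decoded images, the output path is simplified to the file name, and a thread pool is used although the library may parallelize differently.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from pathlib import Path
from typing import Callable, Iterable, List


def transform_one(mask_path: Path, out_base: Path,
                  vmapper: Callable[[bytes], bytes]) -> bool:
    """Transform a single file and report whether saving succeeded."""
    try:
        out_path = out_base / mask_path.name
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_bytes(vmapper(mask_path.read_bytes()))
        return True
    except OSError:
        # Covers missing inputs and failed writes alike.
        return False


def transform_all(mask_paths: Iterable[Path], out_base: Path,
                  vmapper: Callable[[bytes], bytes], workers: int = 4) -> List[bool]:
    """Run transform_one over all paths in parallel, preserving input order."""
    worker = partial(transform_one, out_base=out_base, vmapper=vmapper)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, mask_paths))
```

Because each call only depends on its own arguments and returns a plain boolean, the results can be collected in order and checked with all(...) afterwards.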