mml.core.data_preparation.dset_creator
- class DSetCreator
Bases: object
The dataset creator handles all relevant steps to prepare the dataset on your device. This includes:
- downloading the data and checking hashes
- extracting the data from archives
- alternatively extracting data from existing PyTorch datasets
- storing the data at the correct spots (unlabeled, train and test data)
- optionally transforming masks of segmentation data
- Main usage:
- Based on use case:
call .download() / .kaggle_download() / .verify_pre_download() to download or register files
call .unpack_and_store() to unpack files and move them to the correct spot
- OR:
create / import a PyTorch dataset
call .extract_from_pytorch_datasets() to extract that data
(optionally) convert the masks of segmentation tasks to the correct format with .transform_masks()
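The first usage path above can be sketched as a small helper. This is a minimal sketch, not code from mml: `prepare_from_archive` is a hypothetical name, and `creator` is assumed to be a `DSetCreator` instance (typed as `Any` here so the sketch does not require mml to be installed). Only the documented `download()` and `unpack_and_store()` calls are used.

```python
from pathlib import Path
from typing import Any, Optional


def prepare_from_archive(
    creator: Any, url: str, file_name: str, md5: Optional[str] = None
) -> Path:
    """Download one archive, then unpack it and return the dataset root.

    `creator` is assumed to behave like a DSetCreator (hypothetical helper).
    """
    # Step 1: download the archive; passing md5 lets the hash be checked.
    creator.download(url=url, file_name=file_name, md5=md5)
    # Step 2: unpack all registered files and move them to the correct spot.
    return creator.unpack_and_store()


# With mml installed, usage might look like this (names are placeholders):
#   from mml.core.data_preparation.dset_creator import DSetCreator
#   creator = DSetCreator(dset_name="my_dset")
#   root = prepare_from_archive(creator, "https://example.com/data.zip", "data.zip")
```

The returned path is the dataset root that the documentation says is to be used by TaskCreator.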
- __init__(dset_name: str, download_path: Path | None = None, dset_path: Path | None = None)
Creator class for datasets.
- Parameters:
dset_name – name of the dataset (should be short, since it is used in directory names)
download_path – (optional) a path to already downloaded files
dset_path – (optional) a path to an already created dset folder
- download(url: str, file_name: str, data_kind: DataKind = DataKind.MIXED, md5: str | None = None) → DataArchive
Downloads files from the web to your local drive.
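The `md5` parameter relates to the hash checking mentioned in the class description. How mml performs that check internally is not documented here; the following is a minimal sketch of how an MD5 check on a downloaded file is commonly done with the standard library. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path: Path, expected: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.strip().lower()
```

A helper like this can be used to compute the value to pass as `md5` once a trusted copy of the file is at hand.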
- extract_from_pytorch_datasets(datasets: Dict[str, Dataset], task_type: TaskType, allow_transforms: bool = False, class_names: List[str] | None = None) → Path
Stores an existing dataset (e.g. imported from some other repository or from torchvision). Expects __getitem__ to return (image, class) tuples for classification and (image, mask) tuples for semantic segmentation. An unlabeled dataset is expected to return only the image, not a tuple.
- Parameters:
datasets – dict of datasets to be stored, keys must be from [training, testing, unlabeled]
task_type – task type of the dataset
allow_transforms – controls whether an error is raised when a dataset's transform attribute is present and not None
class_names – optional list of class names to store data with class name directories
- Returns:
the dataset root path (to be used by TaskCreator)
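The return shapes described above can be illustrated with toy classes. This is a sketch under assumptions: the two classes are hypothetical stand-ins that only mimic the `__len__` / `__getitem__` protocol, whereas real inputs would be `torch.utils.data.Dataset` objects, and nested lists stand in for images.

```python
from typing import List, Tuple


class ToyLabelledDataset:
    """Stand-in for a classification dataset: items are (image, class) tuples."""

    def __init__(self, images: List[list], labels: List[int]) -> None:
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, index: int) -> Tuple[list, int]:
        return self.images[index], self.labels[index]


class ToyUnlabelledDataset:
    """Stand-in for an unlabeled dataset: items are the bare image, no tuple."""

    def __init__(self, images: List[list]) -> None:
        self.images = images

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, index: int) -> list:
        return self.images[index]


image = [[0, 0], [0, 0]]  # placeholder 2x2 "image"
datasets = {
    "training": ToyLabelledDataset([image, image], [0, 1]),
    "testing": ToyLabelledDataset([image], [1]),
    "unlabeled": ToyUnlabelledDataset([image]),
}
```

A dict of this shape, built from real PyTorch datasets, is what the `datasets` parameter expects; the `task_type` argument would be the matching `TaskType` member, whose values are not listed in this section.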
- kaggle_download(competition: str | None = None, dataset: str | None = None, data_kind: DataKind = DataKind.MIXED) → List[DataArchive]
Downloads all the data of a kaggle competition or dataset. Specify exactly one of the competition and dataset parameters, never both. Currently only a single kaggle download is supported per dataset.
- Parameters:
competition – (optional) kaggle competition identifier, mutually exclusive with dataset parameter
dataset – (optional) kaggle dataset identifier, mutually exclusive with competition parameter
data_kind (DataKind) – (optional) type of the data
- Returns:
a List of references to the created archives, may be used to modify keep_top_dir before extraction
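The mutual exclusivity of `competition` and `dataset` can be checked before calling the method. The helper below is a hypothetical sketch and not part of mml; how the library itself reacts to a violation is not documented in this section.

```python
from typing import Optional


def exactly_one_source(
    competition: Optional[str] = None, dataset: Optional[str] = None
) -> bool:
    """Return True if exactly one of the two identifiers is given (XOR)."""
    # Comparing the two "is None" flags for inequality implements XOR.
    return (competition is None) != (dataset is None)
```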
- transform_masks(masks: List[Path], transform: Dict[tuple, int], load: str = 'rgb', train: bool = True, ignore: List[tuple] | None = None) → Path
Transforms masks into the framework's format (cv2-readable greyscale images). The transformed masks are written to dataset root -> training_labels / testing_labels -> transformed_masks -> the same relative path the mask had within one of the data subfolders.
- Parameters:
masks – list of paths to the mask files
transform – dict defining the transformation (mapping of mask values to classes)
load – mode defining the loading of a file
train – bool indicating if treated data is train or test
ignore – list of mask values that should be mapped to the ignored value of 255 (similar to transform)
- Returns:
base folder for the transformed masks
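To make the roles of `transform` and `ignore` concrete, here is a minimal pure-Python sketch of the mapping idea on an in-memory mask. It is not mml's implementation: `map_mask` is a hypothetical function, the mask is a nested list of RGB tuples rather than an image file, and raising `KeyError` for values found in neither argument is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Sequence, Tuple

IGNORE_VALUE = 255  # value used for ignored pixels, as stated in the docs
Pixel = Tuple[int, ...]


def map_mask(
    mask: Sequence[Sequence[Pixel]],
    transform: Dict[Pixel, int],
    ignore: Optional[List[Pixel]] = None,
) -> List[List[int]]:
    """Map every pixel value of a mask to its class index (greyscale value)."""
    lookup = dict(transform)
    # Values listed in `ignore` are mapped to the ignored value 255.
    for value in ignore or []:
        lookup[value] = IGNORE_VALUE
    # Assumption: unknown pixel values raise KeyError instead of being guessed.
    return [[lookup[tuple(pixel)] for pixel in row] for row in mask]


rgb_mask = [
    [(0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (255, 255, 255)],
]
classes = map_mask(
    rgb_mask,
    transform={(0, 0, 0): 0, (255, 0, 0): 1, (0, 255, 0): 2},
    ignore=[(255, 255, 255)],
)
# classes == [[0, 1], [2, 255]]
```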
- unpack_and_store(clear_download_folder: bool = False) → Path
Unpacks all files and stores them at the correct spot.
- Parameters:
clear_download_folder – if True deletes the download folder
- Returns:
the dataset root path (to be used by TaskCreator)
- verify_pre_download(file_name: str, instructions: str, data_kind: DataKind = DataKind.MIXED, md5: str | None = None) → DataArchive
Verifies a file that has been previously downloaded is present and adds it to internal archive list. This is useful if e.g. downloading requires credentials / registering or the data is non-public.
- Parameters:
- Returns:
a reference to the created archive, may be used to modify keep_top_dir before extraction
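The idea of verifying a manually downloaded file can be sketched with the standard library. This is a hypothetical helper, not mml code; in particular, showing the `instructions` text in the error when the file is missing is an assumption about how such a parameter would plausibly be used.

```python
from pathlib import Path


def require_file(download_dir: Path, file_name: str, instructions: str) -> Path:
    """Return the path of a pre-downloaded file or fail with instructions."""
    candidate = Path(download_dir) / file_name
    if not candidate.is_file():
        # Assumption: the instructions tell the user how to obtain the file.
        raise FileNotFoundError(
            f"{file_name} not found in {download_dir}. {instructions}"
        )
    return candidate
```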