Adapt your dataset
NOTE:
mml handles preprocessing the data internally. No need to manually preprocess any data in advance. See preprocess mode.
Suppose you have previously used your dataset as a plain PyTorch Dataset. Migrating it to mml takes the following steps:
Step 1: Write your DSetCreator and TaskCreator
mml distinguishes between the concepts of “Datasets” and “Tasks”. A “Dataset” contains all data (plus possibly more: meta information, additional tasks on the same data, additional test samples, etc.), whereas a “Task” is only a description of which samples and labels of the “Dataset” belong to that specific task. There are many convenience functions to simplify this process.
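The distinction can be illustrated with a small, library-independent sketch. The classes `ToyDataset` and `ToyTask` below are hypothetical and are not part of mml; they only show the idea that the data is stored once while each task merely selects samples and a label.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in: holds every sample once, with all annotations."""
    samples: dict = field(default_factory=dict)


@dataclass
class ToyTask:
    """Hypothetical stand-in: references samples and names the label to use."""
    name: str
    label_key: str
    sample_ids: list

    def resolve(self, dataset: ToyDataset) -> list:
        # Look up the chosen label for every referenced sample.
        return [(sid, dataset.samples[sid][self.label_key]) for sid in self.sample_ids]


dataset = ToyDataset(
    samples={
        "img_001": {"organ": "larynx", "healthy": True},
        "img_002": {"organ": "larynx", "healthy": False},
        "img_003": {"organ": "stomach", "healthy": True},
    }
)

# Two tasks defined on the same data: different labels, different subsets.
organ_task = ToyTask("organ_classification", "organ", ["img_001", "img_002", "img_003"])
health_task = ToyTask("larynx_health", "healthy", ["img_001", "img_002"])

print(organ_task.resolve(dataset))
print(health_task.resolve(dataset))
```

Both tasks read from the same stored samples, which is why several tasks can share one dataset without duplicating the data.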
Example: Reusing your previous dataset definition
In this example we integrate a torchvision dataset into mml, but it can be replaced entirely with your existing dataset class.
from mml.api import (
    DSetCreator,
    TaskCreator,
    get_iterator_and_mapping_from_image_dataset,
    TaskType,
    Keyword,
    License,
    Modality,
    DataKind,
)
from torchvision.datasets import STL10
REFERENCE = """
@inproceedings{Coates2011AnAO,
title={An Analysis of Single-Layer Networks in Unsupervised Feature Learning},
author={Adam Coates and A. Ng and Honglak Lee},
booktitle={AISTATS},
year={2011}
}"""
dset_creator = DSetCreator(dset_name="STL_10_DEMO")
train = STL10(root=dset_creator.download_path, split="train", download=True)
test = STL10(root=dset_creator.download_path, split="test", download=True)
dset_path = dset_creator.extract_from_pytorch_datasets(
    datasets={"training": train, "testing": test}, task_type=TaskType.CLASSIFICATION, class_names=train.classes
)
task_creator = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="STL_10_DEMO",
    desc="STL-10 image recognition task",
    ref=REFERENCE,
    url="https://cs.stanford.edu/~acoates/stl10/",
    instr="downloaded via torchvision dataset (https://pytorch.org/vision/stable/generated/torchvision.datasets.STL10.html#torchvision.datasets.STL10)",
    lic=License.UNKNOWN,
    release="2011",
    keywords=[Keyword.NATURAL_OBJECTS],
)
train_iterator, idx_to_class = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "training_data", classes=train.classes
)
test_iterator, idx_to_class_2 = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "testing_data", classes=test.classes
)
# compare the full mappings (indices and class names), not only their keys
assert idx_to_class == idx_to_class_2, "train and test class mappings differ"
task_creator.find_data(train_iterator=train_iterator, test_iterator=test_iterator, idx_to_class=idx_to_class)
task_creator.auto_complete()
That’s all! From now on you can reference your task as stl10 (the value provided to alias= in the TaskCreator).
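When checking that the training and testing splits share the same class mapping, it helps to know what is actually being compared. The sketch below assumes that idx_to_class is a plain dict from index to class name (as it is in the hand-written iterator example further down); the helper `build_idx_to_class` and the class names are made up for illustration. Iterating over two dicts with zip only compares their keys, while comparing the dicts directly also compares the class names.

```python
def build_idx_to_class(classes):
    """Hypothetical helper: map positional index to class name."""
    return dict(enumerate(classes))


train_mapping = build_idx_to_class(["airplane", "bird", "car"])
# Same classes, but in a different order: indices no longer mean the same thing.
test_mapping = build_idx_to_class(["airplane", "car", "bird"])

# zip over two dicts iterates over keys only (0, 1, 2 in both cases).
keys_match = all(a == b for a, b in zip(train_mapping, test_mapping))

# Direct comparison checks keys and values.
mappings_match = train_mapping == test_mapping

print(keys_match, mappings_match)  # prints: True False
```

A key-only comparison would therefore not notice a reordered class list, whereas the direct comparison does.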
Example: DSetCreator when using public data
In this case we recommend implementing the DSetCreator from scratch, including the download of the data. This allows for better reproducibility. The following convenience functions are available so far:
- DSetCreator.download() to download given a URL
- DSetCreator.kaggle_download() to download given a kaggle dataset ID or competition ID
- DSetCreator.verify_pre_download() if parts of the data have to be downloaded manually (e.g. access only after registration)
- DSetCreator.unpack_and_store() simply call after any of the previous to extract the data from archive formats
- DSetCreator.transform_masks() if necessary, transform masks (e.g. from segmentation masks) to fit the mml requirements
Example: TaskCreator, writing your own data iterator
If get_iterator_and_mapping_from_image_dataset does not fit your data structure, you can write an iterator yourself, as in this example:
dset_creator = DSetCreator(dset_name="laryngeal_DEMO")
dset_creator.download(
    url="https://zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    file_name="laryngeal dataset.tar",
    data_kind=DataKind.TRAINING_DATA,
)
dset_path = dset_creator.unpack_and_store()
laryngeal_tissue = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="laryngeal_DEMO",
    desc="Laryngeal dataset for patches of healthy and early-stage cancerous laryngeal tissues",
    ref="...",
    url="https://nearlab.polimi.it/medical/dataset/",
    instr="download via zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    lic=License.CC_BY_NC_4_0,
    release="2017",
    keywords=[Keyword.MEDICAL, Keyword.LARYNGOSCOPY, Keyword.TISSUE_PATHOLOGY, Keyword.ENDOSCOPY],
)
classes = ["Hbv", "He", "IPCL", "Le"]
folds = ["FOLD 1", "FOLD 2", "FOLD 3"]
data_iterator = []
for fold in folds:
    root = dset_path / "training_data" / "laryngeal dataset" / f"{fold}"
    folders = [p.name for p in root.iterdir() if p.is_dir()]
    assert all([cl in folders for cl in classes]), "some class folder is not existent"
    for class_folder in root.iterdir():
        assert class_folder.is_dir()
        if class_folder.name not in classes:
            continue
        for img_path in class_folder.iterdir():
            data_iterator.append(
                {
                    Modality.SAMPLE_ID: img_path.stem,
                    Modality.IMAGE: img_path,
                    Modality.CLASS: classes.index(class_folder.name),
                }
            )
idx_to_class = {classes.index(cl): cl for cl in classes}
laryngeal_tissue.find_data(train_iterator=data_iterator, idx_to_class=idx_to_class)
laryngeal_tissue.auto_complete()
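The folder traversal above can be tried out without downloading anything. The following is a minimal sketch that mirrors the same fold/class layout in a temporary directory; plain string keys ("sample_id", "image", "class") stand in for the Modality members, and the file name patch_0.png is invented. It also builds the sample ID from fold, class and file stem, under the assumption (not confirmed by the text above) that sample IDs should be unique and that file names might repeat across folds.

```python
import tempfile
from collections import Counter
from pathlib import Path

CLASSES = ["Hbv", "He", "IPCL", "Le"]
FOLDS = ["FOLD 1", "FOLD 2"]


def build_toy_tree(root: Path) -> None:
    """Create <root>/<fold>/<class>/patch_0.png as empty placeholder files."""
    for fold in FOLDS:
        for cl in CLASSES:
            folder = root / fold / cl
            folder.mkdir(parents=True)
            (folder / "patch_0.png").touch()


def collect_samples(root: Path) -> list:
    """Walk folds and class folders, emitting one dict per image file."""
    samples = []
    for fold in FOLDS:
        for class_folder in sorted((root / fold).iterdir()):
            if not class_folder.is_dir() or class_folder.name not in CLASSES:
                continue
            for img_path in sorted(class_folder.iterdir()):
                samples.append(
                    {
                        # fold and class are part of the ID to avoid collisions
                        "sample_id": f"{fold.replace(' ', '_')}_{class_folder.name}_{img_path.stem}",
                        "image": img_path,
                        "class": CLASSES.index(class_folder.name),
                    }
                )
    return samples


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_toy_tree(root)
    samples = collect_samples(root)

ids = [s["sample_id"] for s in samples]
duplicates = [sid for sid, n in Counter(ids).items() if n > 1]
class_counts = Counter(s["class"] for s in samples)
print(len(samples), duplicates, dict(class_counts))
```

Checking for duplicate IDs and per-class counts before handing the list to find_data is a cheap way to catch a wrong folder layout early.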
Example: Multiple tasks per Dataset
This example will be added later.
Step 2: BONUS - automate the task creation
mml has a create mode to generate tasks automatically. If set up correctly, the above datasets are downloaded and prepared automatically when calling mml create tasks=example (assuming example.yaml is already provided). This is more convenient when working from within mml itself than when using it as a library; the library route is nevertheless possible, and it lets any other mml user who has installed your package quickly get started with your data and code.
- make your code installable via a package - you need a pyproject.toml and setup.cfg file for this
- decorate the DSetCreator with @register_dsetcreator and your TaskCreator with @register_taskcreator
- add an activate.py script to the root of your package’s source code
- import the module (file) that defines the creators within this file
- in your setup.cfg (or setup.py or pyproject.toml) provide the correct entry point for mml
[options.entry_points]
mml.plugins =
    some_key = your_package:activate
(replace some_key with a descriptive id and your_package with the name of your package; activate refers to the activate.py script described above).
These tasks are now always linked when calling mml task_list=[stl10,laryngeal_tissues] 🎉
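To make the two mechanisms of this step more tangible, here is a generic sketch of a decorator-based registry and of entry point discovery with the standard library. It is not mml's implementation: the registry `TASK_CREATORS`, the decorator `register_toy_taskcreator` and the function `create_toy_task` are hypothetical, and the assumption that plugins are looked up under the mml.plugins group is taken only from the setup.cfg snippet above.

```python
from importlib.metadata import entry_points

# Hypothetical registry; mml's real data structures may look different.
TASK_CREATORS = {}


def register_toy_taskcreator(alias):
    """Hypothetical decorator: store the decorated function under an alias."""
    def decorator(func):
        TASK_CREATORS[alias] = func
        return func
    return decorator


@register_toy_taskcreator(alias="toy_task")
def create_toy_task():
    # A real creator would download and prepare data here.
    return "toy task created"


def list_entry_points(group):
    """Return {name: target} for all installed entry points in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):      # Python 3.10 and later
        selected = eps.select(group=group)
    else:                           # older versions return a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


# Registration happens as a side effect of importing the defining module,
# which is why activate.py has to import the module with the creators.
print(TASK_CREATORS["toy_task"]())

# Lists whatever packages have declared under this group (may be empty).
plugins = list_entry_points("mml.plugins")
print(plugins)
```

After installing your package, running list_entry_points("mml.plugins") is a quick way to check that the entry point from setup.cfg was picked up.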
Step 3: BONUS - add your task(s) to a tasks config file
To refer to your task(s) later on, create a tasks config file in a configs folder that is linked to mml:
- if you cloned mml, just navigate into configs/tasks
- if you are writing your own package, create a configs folder, best at your package root level, and add a tasks folder inside
- create a new file example.yaml with the following content
# @package _global_
tasks:
  - 'stl10'
  - 'laryngeal_tissues'
pivot:
  name: False
  tags: ''
tagging:
  all: False
  variants: []
- add something like the following to your activate.py (see step before)
from hydra.core.config_search_path import ConfigSearchPath
from hydra.core.plugins import Plugins
from hydra.plugins.search_path_plugin import SearchPathPlugin


# register plugin configs
class MMLINSERTPLUGINNAMESearchPathPlugin(SearchPathPlugin):
    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        # Sets the search path for mml with copied config files
        search_path.append(
            provider="mml-???", path=f"pkg://mml_???.configs"
        )


Plugins.instance().register(MMLINSERTPLUGINNAMESearchPathPlugin)
Of course, you have to replace INSERTPLUGINNAME and mml-??? / mml_??? with the names that match your package.
These tasks are now always linked when calling mml tasks=example 🎉