Adapt your dataset

NOTE:

mml preprocesses the data internally, so there is no need to preprocess anything manually in advance. See preprocess mode.

Assume you have previously used your dataset as a plain PyTorch Dataset. Migrating it to mml takes the following steps:

Step 1: Write your DSetCreator and TaskCreator

mml distinguishes between “Datasets” and “Tasks”. A “Dataset” contains all the data (possibly along with meta information, additional tasks on the same data, additional test samples, etc.), whereas a “Task” only describes which samples and labels of the “Dataset” belong to that specific task. Several convenience functions simplify this process.
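
The distinction can be pictured with plain Python data structures. The sketch below is purely illustrative: ToyDataset, ToyTask and select are hypothetical names and not part of the mml API. It only shows a task acting as a view (sample IDs plus labels) onto a dataset that stores the data once.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: holds every sample exactly once."""

    name: str
    samples: Dict[str, Path] = field(default_factory=dict)  # sample ID -> image file


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task: refers to samples, stores no data itself."""

    name: str
    dataset: ToyDataset
    labels: Dict[str, int] = field(default_factory=dict)  # sample ID -> class index

    def select(self) -> List[Tuple[Path, int]]:
        # Resolve the task's sample IDs against the dataset it belongs to.
        return [(self.dataset.samples[sid], label) for sid, label in self.labels.items()]


dataset = ToyDataset(
    name="toy",
    samples={"a": Path("img/a.png"), "b": Path("img/b.png"), "c": Path("img/c.png")},
)
# Two tasks share one dataset but use different samples and different labels.
animal_task = ToyTask("animal_vs_object", dataset, labels={"a": 0, "b": 1})
color_task = ToyTask("color", dataset, labels={"a": 2, "c": 0})
```

Both toy tasks point at the same stored files, which is why additional tasks on the same data do not require a second copy of the dataset.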

Example: Reusing your previous dataset definition

In this example we integrate a torchvision dataset into mml, but you can replace it with your existing dataset class.

from mml.api import (
    DSetCreator,
    TaskCreator,
    get_iterator_and_mapping_from_image_dataset,
    TaskType,
    Keyword,
    License,
    Modality,
    DataKind,
)
from torchvision.datasets import STL10

REFERENCE = """
@inproceedings{Coates2011AnAO,
  title={An Analysis of Single-Layer Networks in Unsupervised Feature Learning},
  author={Adam Coates and A. Ng and Honglak Lee},
  booktitle={AISTATS},
  year={2011}
}"""

dset_creator = DSetCreator(dset_name="STL_10_DEMO")
train = STL10(root=dset_creator.download_path, split="train", download=True)
test = STL10(root=dset_creator.download_path, split="test", download=True)
dset_path = dset_creator.extract_from_pytorch_datasets(
    datasets={"training": train, "testing": test}, task_type=TaskType.CLASSIFICATION, class_names=train.classes
)
task_creator = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="STL_10_DEMO",
    desc="STL-10 image recognition task",
    ref=REFERENCE,
    url="https://cs.stanford.edu/~acoates/stl10/",
    instr="downloaded via torchvision dataset (https://pytorch.org/vision/stable/generated/torchvision.datasets.STL10.html#torchvision.datasets.STL10)",
    lic=License.UNKNOWN,
    release="2011",
    keywords=[Keyword.NATURAL_OBJECTS],
)
train_iterator, idx_to_class = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "training_data", classes=train.classes
)
test_iterator, idx_to_class_2 = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "testing_data", classes=test.classes
)
# both splits must yield the same index-to-class mapping (compares keys and values)
assert idx_to_class == idx_to_class_2
task_creator.find_data(train_iterator=train_iterator, test_iterator=test_iterator, idx_to_class=idx_to_class)
task_creator.auto_complete()

That’s it! From now on you can reference your task as stl10 (the value provided to alias= in the TaskCreator; the snippet above sets only name=, so the alias has to be passed as well).

Example: DSetCreator when using public data

In this case we recommend implementing the DSetCreator from scratch, including the download of the data, because this improves reproducibility. The following convenience functions are available so far:

DSetCreator.download() to download given a URL
DSetCreator.kaggle_download() to download given a Kaggle dataset ID or competition ID
DSetCreator.verify_pre_download() if parts of the data have to be downloaded manually (e.g. access only after registration)
DSetCreator.unpack_and_store() to extract the data from archive formats; call it after any of the previous functions
DSetCreator.transform_masks() to transform masks (e.g. segmentation masks), if necessary, so that they fit the mml requirements

Example: TaskCreator, writing your own data iterator

If get_iterator_and_mapping_from_image_dataset does not fit your data structure, you can write an iterator yourself, as in this example:

dset_creator = DSetCreator(dset_name="laryngeal_DEMO")
dset_creator.download(
    url="https://zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    file_name="laryngeal dataset.tar",
    data_kind=DataKind.TRAINING_DATA,
)
dset_path = dset_creator.unpack_and_store()
laryngeal_tissue = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="laryngeal_DEMO",
    desc="Laryngeal dataset for patches of healthy and early-stage cancerous laryngeal tissues",
    ref="...",
    url="https://nearlab.polimi.it/medical/dataset/",
    instr="download via zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    lic=License.CC_BY_NC_4_0,
    release="2017",
    keywords=[Keyword.MEDICAL, Keyword.LARYNGOSCOPY, Keyword.TISSUE_PATHOLOGY, Keyword.ENDOSCOPY],
)
classes = ["Hbv", "He", "IPCL", "Le"]
folds = ["FOLD 1", "FOLD 2", "FOLD 3"]
data_iterator = []
for fold in folds:
    root = dset_path / "training_data" / "laryngeal dataset" / fold
    folders = [p.name for p in root.iterdir() if p.is_dir()]
    assert all([cl in folders for cl in classes]), "some class folder is not existent"
    for class_folder in root.iterdir():
        assert class_folder.is_dir()
        if class_folder.name not in classes:
            continue
        for img_path in class_folder.iterdir():
            data_iterator.append(
                {
                    Modality.SAMPLE_ID: img_path.stem,
                    Modality.IMAGE: img_path,
                    Modality.CLASS: classes.index(class_folder.name),
                }
            )
idx_to_class = {classes.index(cl): cl for cl in classes}
laryngeal_tissue.find_data(train_iterator=data_iterator, idx_to_class=idx_to_class)
laryngeal_tissue.auto_complete()
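
Because the example uses the file stem as the sample ID, identical file names in different folds or class folders would produce identical IDs. Whether mml rejects or tolerates such collisions is not stated here, so a quick check before calling find_data can be worthwhile. The helper below is a hypothetical sketch and not part of mml; it uses the plain string key "sample_id" in place of Modality.SAMPLE_ID so that it runs without mml installed.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Iterable, List


def find_duplicate_ids(
    entries: Iterable[Dict[Hashable, Any]], id_key: Hashable = "sample_id"
) -> List[Any]:
    """Return every sample ID that occurs more than once, in first-seen order."""
    counts = Counter(entry[id_key] for entry in entries)
    return [sample_id for sample_id, n in counts.items() if n > 1]


# Toy entries shaped like the data_iterator above (paths are made up for illustration).
toy_iterator = [
    {"sample_id": "img_001", "image": "FOLD 1/He/img_001.png", "class": 1},
    {"sample_id": "img_002", "image": "FOLD 1/Le/img_002.png", "class": 3},
    {"sample_id": "img_001", "image": "FOLD 2/He/img_001.png", "class": 1},
]
duplicates = find_duplicate_ids(toy_iterator)
```

With the real iterator the call would be find_duplicate_ids(data_iterator, id_key=Modality.SAMPLE_ID). If duplicates show up, prefixing the ID with the fold name is one way to make it unique.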

Example: Multiple tasks per Dataset

This example will be added later.

Step 2: BONUS - automate the task creation

mml has a create mode that generates tasks automatically. If set up correctly, the above datasets are downloaded and prepared automatically when calling mml create tasks=example (assuming example.yaml is already provided). This is much more convenient when using mml as a command line tool rather than as a library (which is nevertheless possible), and it allows any other mml user who installed your package to quickly start on your data and code. The required steps are:

make your code installable as a package (you need a pyproject.toml and a setup.cfg file for this)
decorate the DSetCreator with @register_dsetcreator and your TaskCreator with @register_taskcreator
add an activate.py script to the root of your package’s source code
within this file, import the module (file) that defines the creators
in your setup.cfg (or setup.py or pyproject.toml) provide the correct entry point for mml:

[options.entry_points]
mml.plugins =
    some_key = your_package:activate

(Replace some_key with a descriptive ID and your_package with the name of your package.)
These tasks are now always available when calling mml task_list=[stl10,laryngeal_tissues] 🎉
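
To check that the entry point is visible after installing your package, the standard library can list all entry points of a group. This is a generic sketch based on importlib.metadata; how mml itself loads its plugins is not shown here, and list_entry_points is a hypothetical helper name.

```python
from importlib.metadata import entry_points
from typing import Dict


def list_entry_points(group: str = "mml.plugins") -> Dict[str, str]:
    """Map entry point names to their targets for all installed packages in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are filtered with select().
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


plugins = list_entry_points()
for name, target in plugins.items():
    print(f"{name} -> {target}")
```

If your key (some_key in the snippet above) does not appear in the output, the package is most likely not installed in the active environment or the entry point group is misspelled.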

Step 3: BONUS - add your task(s) to a tasks config file

To refer to your task(s) later on, create a tasks config file in a configs folder that is linked to mml:

if you cloned mml, just navigate into configs/tasks
if you are writing your own package, create a configs folder, ideally at your package root level, and add a tasks folder inside
create a new file example.yaml with the following content

# @package _global_

tasks:
  - 'stl10'
  - 'laryngeal_tissues'

pivot:
  name: False
  tags: ''

tagging:
  all: False
  variants: []

add something like the following to your activate.py (see the previous step)

from hydra.core.config_search_path import ConfigSearchPath
from hydra.core.plugins import Plugins
from hydra.plugins.search_path_plugin import SearchPathPlugin


# register plugin configs
class MMLINSERTPLUGINNAMESearchPathPlugin(SearchPathPlugin):
    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        # Sets the search path for mml with copied config files
        search_path.append(
            provider="mml-???", path="pkg://mml_???.configs"
        )


Plugins.instance().register(MMLINSERTPLUGINNAMESearchPathPlugin)

Of course, you have to replace INSERTPLUGINNAME and mml-??? / mml_??? with the names of your plugin and package.

These tasks are now always available when calling mml tasks=example 🎉
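
Hydra resolves tasks=example to a file named example.yaml inside the tasks config group, so within your package the file has to sit at configs/tasks/example.yaml. The following sketch lists the task config names found under a package root. list_task_configs and the temporary mml_demo layout are hypothetical and only illustrate the expected folder structure.

```python
import tempfile
from pathlib import Path
from typing import List


def list_task_configs(package_root: Path) -> List[str]:
    """Return the names usable as tasks=<name>, i.e. the stems of configs/tasks/*.yaml."""
    tasks_dir = package_root / "configs" / "tasks"
    if not tasks_dir.is_dir():
        return []
    return sorted(p.stem for p in tasks_dir.glob("*.yaml"))


# Build a throwaway package layout to demonstrate the expected structure.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mml_demo"
    (root / "configs" / "tasks").mkdir(parents=True)
    (root / "configs" / "tasks" / "example.yaml").write_text(
        "# @package _global_\n\ntasks:\n  - 'stl10'\n  - 'laryngeal_tissues'\n"
    )
    found = list_task_configs(root)
    missing = list_task_configs(root / "does_not_exist")
```

Pointing the helper at your real package root gives a quick overview of which tasks=... values your package provides.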