Adapt your dataset

NOTE:

mml handles preprocessing the data internally. No need to manually preprocess any data in advance. See preprocess mode.

Assume you have previously used your dataset as a plain PyTorch Dataset. Migrating to mml then takes the following steps:

Step 1: Write your DSetCreator and TaskCreator

mml distinguishes between the concepts of “Dataset” and “Task”. A “Dataset” contains all data (possibly with additional meta information, additional tasks on the same data, additional test samples, etc.), whereas a “Task” is only a description of which samples and labels of the “Dataset” belong to that specific task. Many convenience functions simplify this process.
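
To make the distinction concrete, here is a minimal sketch in plain Python. It does not use mml and does not reflect mml's internal storage format; the names dataset, make_task and the dictionary layout are hypothetical and only illustrate that several tasks can select different samples and labels from the same stored data.

```python
# Hypothetical illustration only: this is NOT mml's internal format.
# The "dataset" holds all stored samples plus their meta information.
dataset = {
    "img_001": {"path": "images/img_001.png", "organ": "liver", "quality": "good"},
    "img_002": {"path": "images/img_002.png", "organ": "kidney", "quality": "bad"},
    "img_003": {"path": "images/img_003.png", "organ": "liver", "quality": "good"},
}


def make_task(dataset, label_key, classes):
    """Describe one task: which samples belong to it and which class index each gets."""
    samples = {
        sample_id: classes.index(meta[label_key])
        for sample_id, meta in dataset.items()
        if meta[label_key] in classes  # samples with other labels are not part of the task
    }
    return {"label_key": label_key, "idx_to_class": dict(enumerate(classes)), "samples": samples}


# Two tasks defined on the same dataset; no image data is duplicated.
organ_task = make_task(dataset, "organ", ["kidney", "liver"])
quality_task = make_task(dataset, "quality", ["good"])

print(organ_task["samples"])    # {'img_001': 1, 'img_002': 0, 'img_003': 1}
print(quality_task["samples"])  # {'img_001': 0, 'img_003': 0}
```

The second task simply leaves out img_002, which shows that a task may cover only a subset of the dataset's samples.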

Example: Reusing your previous dataset definition

This example integrates a torchvision dataset into mml, but you can replace it with your existing dataset class.

from mml.api import (
    DSetCreator,
    TaskCreator,
    get_iterator_and_mapping_from_image_dataset,
    TaskType,
    Keyword,
    License,
    Modality,
    DataKind,
)
from torchvision.datasets import STL10

REFERENCE = """
@inproceedings{Coates2011AnAO,
  title={An Analysis of Single-Layer Networks in Unsupervised Feature Learning},
  author={Adam Coates and A. Ng and Honglak Lee},
  booktitle={AISTATS},
  year={2011}
}"""

# create the dataset and download the raw data into its download folder
dset_creator = DSetCreator(dset_name="STL_10_DEMO")
train = STL10(root=dset_creator.download_path, split="train", download=True)
test = STL10(root=dset_creator.download_path, split="test", download=True)
# extract the samples of both splits into the dataset folder
dset_path = dset_creator.extract_from_pytorch_datasets(
    datasets={"training": train, "testing": test}, task_type=TaskType.CLASSIFICATION, class_names=train.classes
)
task_creator = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="STL_10_DEMO",
    desc="STL-10 image recognition task",
    ref=REFERENCE,
    url="https://cs.stanford.edu/~acoates/stl10/",
    instr="downloaded via torchvision dataset (https://pytorch.org/vision/stable/generated/torchvision.datasets.STL10.html#torchvision.datasets.STL10)",
    lic=License.UNKNOWN,
    release="2011",
    keywords=[Keyword.NATURAL_OBJECTS],
)
# collect the extracted images and the mapping from class index to class name
train_iterator, idx_to_class = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "training_data", classes=train.classes
)
test_iterator, idx_to_class_2 = get_iterator_and_mapping_from_image_dataset(
    root=dset_path / "testing_data", classes=test.classes
)
# both splits must agree on the complete class mapping
assert idx_to_class == idx_to_class_2
# attach the samples to the task and finalize it
task_creator.find_data(train_iterator=train_iterator, test_iterator=test_iterator, idx_to_class=idx_to_class)
task_creator.auto_complete()

That’s it! From now on you can reference your task by the name passed to the TaskCreator (STL_10_DEMO in the code above).

Example: DSetCreator when using public data

In this case we recommend implementing the DSetCreator from scratch, including the download of the data, since this improves reproducibility. The following convenience functions are available so far:

  • DSetCreator.download() downloads data from a given URL

  • DSetCreator.kaggle_download() downloads data given a kaggle dataset ID or competition ID

  • DSetCreator.verify_pre_download() is for cases where parts of the data have to be downloaded manually (e.g. access only after registration)

  • DSetCreator.unpack_and_store() is called after any of the previous functions to extract the data from archive formats

  • DSetCreator.transform_masks() transforms masks (e.g. segmentation masks) to fit the mml requirements, if necessary

Example: TaskCreator, writing your own data iterator

If get_iterator_and_mapping_from_image_dataset does not fit your data structure, you can write an iterator yourself, as in this example:

dset_creator = DSetCreator(dset_name="laryngeal_DEMO")
dset_creator.download(
    url="https://zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    file_name="laryngeal dataset.tar",
    data_kind=DataKind.TRAINING_DATA,
)
dset_path = dset_creator.unpack_and_store()
laryngeal_tissue = TaskCreator(
    dset_path=dset_path,
    task_type=TaskType.CLASSIFICATION,
    name="laryngeal_DEMO",
    desc="Laryngeal dataset for patches of healthy and early-stage cancerous laryngeal tissues",
    ref="...",
    url="https://nearlab.polimi.it/medical/dataset/",
    instr="download via zenodo.org/record/1003200/files/laryngeal%20dataset.tar?download=1",
    lic=License.CC_BY_NC_4_0,
    release="2017",
    keywords=[Keyword.MEDICAL, Keyword.LARYNGOSCOPY, Keyword.TISSUE_PATHOLOGY, Keyword.ENDOSCOPY],
)
classes = ["Hbv", "He", "IPCL", "Le"]
folds = ["FOLD 1", "FOLD 2", "FOLD 3"]
# collect one dict per sample across all folds
data_iterator = []
for fold in folds:
    root = dset_path / "training_data" / "laryngeal dataset" / fold
    folders = [p.name for p in root.iterdir() if p.is_dir()]
    assert all(cl in folders for cl in classes), "some class folder is missing"
    for class_folder in root.iterdir():
        assert class_folder.is_dir()
        # ignore folders that do not belong to one of the task's classes
        if class_folder.name not in classes:
            continue
        for img_path in class_folder.iterdir():
            data_iterator.append(
                {
                    Modality.SAMPLE_ID: img_path.stem,
                    Modality.IMAGE: img_path,
                    Modality.CLASS: classes.index(class_folder.name),
                }
            )
# map each class index to its class name
idx_to_class = {classes.index(cl): cl for cl in classes}
laryngeal_tissue.find_data(train_iterator=data_iterator, idx_to_class=idx_to_class)
laryngeal_tissue.auto_complete()
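
If the labels are stored in a CSV file rather than encoded in folder names, the same list-of-dicts structure can be built with the standard library. The sketch below is a variation on the example above under stated assumptions: to stay self-contained it uses the plain string keys "sample_id", "image" and "class" as stand-ins for Modality.SAMPLE_ID, Modality.IMAGE and Modality.CLASS, and the helper name build_iterator_from_rows as well as the CSV columns file and label are hypothetical.

```python
import csv
import io
from pathlib import Path


def build_iterator_from_rows(rows, image_dir, classes):
    """Build a list of sample dicts and an index-to-class mapping.

    Hypothetical helper: `rows` is any iterable of dicts with the keys
    "file" and "label", e.g. a csv.DictReader.
    """
    data_iterator = []
    for row in rows:
        if row["label"] not in classes:
            continue  # skip labels that are not part of this task
        img_path = Path(image_dir) / row["file"]
        data_iterator.append(
            {
                "sample_id": img_path.stem,  # stand-in for Modality.SAMPLE_ID
                "image": img_path,  # stand-in for Modality.IMAGE
                "class": classes.index(row["label"]),  # stand-in for Modality.CLASS
            }
        )
    idx_to_class = dict(enumerate(classes))
    return data_iterator, idx_to_class


# In-memory CSV used for demonstration; a real file would be opened with open(path, newline="").
demo_csv = io.StringIO("file,label\na.png,He\nb.png,Le\nc.png,other\n")
data_iterator, idx_to_class = build_iterator_from_rows(
    csv.DictReader(demo_csv), image_dir="images", classes=["Hbv", "He", "IPCL", "Le"]
)
print([sample["class"] for sample in data_iterator])  # [1, 3]
```

When used with mml, the string keys would be swapped for the corresponding Modality members and the result passed to find_data as in the example above.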

Example: Multiple tasks per Dataset

This example will be added later.

Step 2: BONUS - automate the task creation

mml has a create mode to generate tasks automatically. If set up correctly, the above datasets are downloaded and prepared automatically when calling mml create tasks=example (assuming example.yaml is already provided). This is most convenient when working from within mml rather than using it as a library. It is nevertheless possible in the library case too, and it allows any other mml user who has installed your package to quickly get started with your data and code. The required steps are:

  • make your code installable via a package - you need a pyproject.toml and setup.cfg file for this

  • decorate the DSetCreator with @register_dsetcreator and your TaskCreator with @register_taskcreator

  • add an activate.py script to the root of your package’s source code

  • import the module (file) that defines the creators within this file

  • in your setup.cfg (or the equivalent section of setup.py or pyproject.toml) provide the correct entry point for mml:

[options.entry_points]
mml.plugins =
    some_key = your_package:activate
  • replace some_key with a descriptive id and your_package with the name of your package, so that the entry refers to the activate.py file created above

  • These tasks are now always linked when calling mml task_list=[stl10,laryngeal_tissues] 🎉
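
To check that the entry point is visible after installing your package, the standard library can list everything registered under a group. This is a generic sketch of entry point lookup, not a description of how mml loads its plugins internally; the helper name list_entry_points is hypothetical.

```python
from importlib.metadata import entry_points


def list_entry_points(group="mml.plugins"):
    """Map entry point names to their targets for one group (empty if none are installed)."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and newer: entry points can be filtered by group
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9: a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


print(list_entry_points())
```

With the setup.cfg entry from above installed, the printed dictionary should contain your chosen key; an empty dictionary suggests that the package was not (re)installed after the entry point was added.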

Step 3: BONUS - add your task(s) to a tasks config file

To refer to your task(s) later on, create a tasks config file in a configs folder that is linked to mml:

  • if you cloned mml, just navigate into configs/tasks

  • if you are writing your own package, create a configs folder, ideally at your package root level, and add a tasks folder inside

  • create a new file example.yaml with the following content:

# @package _global_

tasks:
  - 'stl10'
  - 'laryngeal_tissues'

pivot:
  name: False
  tags: ''

tagging:
  all: False
  variants: []
  • add something like the following to your activate.py (see the previous step):

from hydra.core.config_search_path import ConfigSearchPath
from hydra.core.plugins import Plugins
from hydra.plugins.search_path_plugin import SearchPathPlugin


# register plugin configs
class MMLINSERTPLUGINNAMESearchPathPlugin(SearchPathPlugin):
    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        # Sets the search path for mml with copied config files
        search_path.append(
            provider="mml-???", path="pkg://mml_???.configs"
        )


Plugins.instance().register(MMLINSERTPLUGINNAMESearchPathPlugin)
  • of course, you have to replace INSERTPLUGINNAME and mml-??? / mml_??? with names matching your own package
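
Since the part after pkg:// is a dotted Python package path, a quick sanity check is to confirm that the import system can locate it. The sketch below uses only the standard library; the helper name is_importable is hypothetical, and mml_example.configs is a placeholder for your own package path. Note that looking up a dotted path imports its parent package as a side effect.

```python
from importlib.util import find_spec


def is_importable(dotted_path):
    """Return True if the dotted package path can be located by the import system."""
    try:
        return find_spec(dotted_path) is not None
    except ModuleNotFoundError:
        # raised when a parent package of the dotted path does not exist
        return False


print(is_importable("json"))  # standard library module, used as a positive control
print(is_importable("mml_example.configs"))  # placeholder: replace with your own package path
```

If the check fails for your own path, it may help to verify that the package is installed and that the configs folder is shipped as part of it.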

These tasks are now always linked when calling mml tasks=example 🎉