mml.core.data_preparation.task_creator

class TaskCreator[source]

Bases: object

Usage:

  1. Creating a new task: (a) instantiate the class with all available meta information, (b) call find_data to locate the data of the task, (c) call split_folds (or use_existing_folds) to prepare folds, (d) call infer_stats to calculate and set the means, stds and sizes of the train data (use set_stats if they are already known), (e) call push_and_test to save.

  2. Modifying an existing task: (a) instantiate with the correct dset_path, (b) call load_existent with the existing task_path, (c) call any of the modification functions (optionally multiple times), (d) if necessary, call infer_stats to set the means, stds and sizes of the train data, (e) call push_and_test to save.

  3. Auto creation based on task tags: (a) instantiate with an arbitrary dset_path, (b) call auto_create_tagged with the full task name and the preprocessing, (c) the call raises a RuntimeError if auto creation is not possible; otherwise it creates the task and returns its path.
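
The first usage pattern can be sketched as a small helper. This is only an illustration of the call order, not code from MML: `create_task` and `RecordingCreator` are hypothetical names, and the stand-in class merely records calls so the sketch runs without the library installed. In real use the `creator` argument would be a `TaskCreator` instance and the sample dicts would be keyed by `Modality` members.

```python
from typing import Any, Dict, List


def create_task(creator: Any, train_samples: List[dict], idx_to_class: Dict[int, str]) -> Any:
    """Run steps (b) to (e) of usage pattern 1 on an already instantiated creator."""
    creator.find_data(train_iterator=train_samples, idx_to_class=idx_to_class)  # (b)
    creator.split_folds(n_folds=5, ensure_balancing=True)  # (c)
    creator.infer_stats()  # (d)
    return creator.push_and_test()  # (e)


class RecordingCreator:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def find_data(self, **kwargs: Any) -> None:
        self.calls.append("find_data")

    def split_folds(self, **kwargs: Any) -> None:
        self.calls.append("split_folds")

    def infer_stats(self, **kwargs: Any) -> None:
        self.calls.append("infer_stats")

    def push_and_test(self) -> str:
        self.calls.append("push_and_test")
        return "task.json"  # placeholder for the returned task path
```

Passing a `RecordingCreator` to `create_task` shows the expected order of calls; step (a), the instantiation with meta information, happens before the helper is invoked.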

__init__(dset_path: Path, name: str = 'default', task_type: TaskType = TaskType.UNKNOWN, desc: str = '', ref: str = '', url: str = '', instr: str = '', lic: License = License.UNKNOWN, release: str = '', keywords: List[Keyword] | None = None)[source]

Collects everything needed to create a task. Use a new instance for each new task.

Parameters:
  • dset_path – path to dataset root

  • name – name of the task to be created

  • task_type – task type of the task

  • desc – a short description of the task

  • ref – a reference (most likely some bibtex citation)

  • url – a URL linked to the task

  • instr – instructions to download the task (data)

  • lic – the license corresponding to the data of the task

  • release – either a release date or some version of the task

  • keywords – keywords associated with the task

auto_complete(device: device | None = None) Path[source]

Shortcut for finishing task creation.

Parameters:

device (Optional[torch.device]) – torch device that will be forwarded to infer_stats()

Returns:

the task path as returned by push_and_test().

static auto_create_tagged(full_alias: str, preprocessing: str = 'none') Path[source]

find_data(train_iterator: List[Dict[Modality, int | List[int] | List[float] | str]] | None = None, test_iterator: List[Dict[Modality, int | List[int] | List[float] | str]] | None = None, unlabeled_iterator: List[Dict[Modality, int | List[int] | List[float] | str]] | None = None, idx_to_class: Dict[int, str] | None = None) None[source]

Correctly identifies data tuples yielded by the provided data iterators.

Parameters:
  • train_iterator (Optional[List[SampleDescription]]) – all train data, provided as dicts with (optional) keys from mml.core.data_loading.task_attributes.Modality, where Modality.SAMPLE_ID is required and the corresponding value has to be unique. Further possible entries include Modality.CLASS with an int value and Modality.MASK with a Path to some (greyscale) image, both with values in idx_to_class.

  • test_iterator (Optional[List[SampleDescription]]) – test data iterator, same type as train_iterator. Note that MML expects test data to include labels (!); any label-free data should be provided via unlabeled_iterator.

  • unlabeled_iterator (Optional[List[SampleDescription]]) – unlabeled data iterator, same type as train_iterator, can be used to either provide additional unlabeled “train data” or alternatively as prediction data

  • idx_to_class (Optional[Dict[int, str]]) – dict mapping ints to class names (e.g. {0 -> background, 1 -> instrument}). The indices may be non-contiguous (e.g. {0 -> background, 3 -> instrument}) for subclassing, and several indices may map to the same class (e.g. {0 -> background, 1 -> instrument, 3 -> instrument}) for merging classes. Please use index 0 for background in segmentation tasks (all unused values will be mapped to 0). This argument is only optional if all data is unlabeled.

Returns:

None
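
The merging semantics of idx_to_class for segmentation masks can be illustrated with a few lines. This is a minimal sketch under the rules stated above (several indices may share a class, unused values fall back to index 0), not the way find_data processes masks internally; `resolve_mask_classes` is a hypothetical helper and the mask is a plain nested list rather than an image file.

```python
from typing import Dict, List


def resolve_mask_classes(mask: List[List[int]], idx_to_class: Dict[int, str]) -> List[List[str]]:
    """Translate raw mask values to class names, sending unused values to index 0."""
    background = idx_to_class[0]  # index 0 is reserved for background
    return [[idx_to_class.get(value, background) for value in row] for row in mask]


# Indices 1 and 3 are merged into one class; value 2 is unused.
idx_to_class = {0: "background", 1: "instrument", 3: "instrument"}
mask = [[0, 1], [2, 3]]
resolved = resolve_mask_classes(mask, idx_to_class)
```

Here both 1 and 3 resolve to "instrument", while the unused value 2 is treated as background.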

identity(*args) None[source]

Dummy tag to create identical instances of a task. In contrast to naming the same task twice in the task_list, which will load from the same .json task description, the identity-tagged version creates a fresh task description.

Parameters:

args – all args are ignored

Returns:

None

infer_stats(sizes: bool = True, mean_and_std: bool = True, const_size: bool = False, device: device | None = None) None[source]

Calculates the training data stats. Indicating that all images have a constant size speeds up the calculation.

Parameters:
  • sizes (bool) – if the image sizes should be gathered

  • mean_and_std (bool) – if the mean and standard deviation of color channels should be gathered

  • const_size (bool) – if images are known to have constant size, setting this flag improves speed

  • device (torch.device) – if provided use this device, otherwise infer device based on availability

Returns:

None
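
What "mean and standard deviation of color channels" refers to can be shown on a handful of pixels. The sketch below is an assumption-laden illustration, not the routine used by infer_stats: `channel_stats` is a hypothetical helper, pixels are plain RGB tuples scaled to [0, 1], and the population standard deviation is used, which may differ from the library's choice.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def channel_stats(pixels: Sequence[Tuple[float, float, float]]) -> Tuple[List[float], List[float]]:
    """Return per-channel means and (population) standard deviations."""
    channels = list(zip(*pixels))  # regroup pixel tuples into R, G and B sequences
    means = [fmean(channel) for channel in channels]
    stds = [pstdev(channel) for channel in channels]
    return means, stds


means, stds = channel_stats([(0.0, 0.2, 1.0), (1.0, 0.2, 1.0)])
```

If such values are already known for a dataset, they can be handed to set_stats instead of being recomputed.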

load_existent(task_path: Path) None[source]

Loads an existent task .json, useful prior to modifications (e.g. tagging or preprocessing).

Parameters:

task_path (Path) – path to task .json file

Returns:

None

map_tag(tag: str) Callable[source]

Correct way to resolve a task name tag to the corresponding modifier method.

Parameters:

tag (str) – a string that can be appended to a task name e.g. ‘identity’ (appended as ‘+identity’)

Returns:

the modifier method corresponding to the tag

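The idea of resolving a '+tag' suffix to a modifier method can be sketched as follows. This is not the lookup MML performs: `split_task_name` and `TagResolver` are hypothetical, 'my_task' is a made-up task name, and tags that carry arguments are deliberately left out because their syntax is not described here.

```python
from typing import Callable, List, Tuple


def split_task_name(full_name: str) -> Tuple[str, List[str]]:
    """Split 'base+tag1+tag2' into the base name and its list of tags."""
    base, *tags = full_name.split("+")
    return base, tags


class TagResolver:
    """Hypothetical stand-in exposing a single modifier method."""

    def identity(self, *args: str) -> None:
        return None  # all arguments are ignored

    def map_tag(self, tag: str) -> Callable:
        known = {"identity": self.identity}  # explicit table of supported tags
        if tag not in known:
            raise ValueError(f"unknown tag: {tag!r}")
        return known[tag]


base, tags = split_task_name("my_task+identity")
modifier = TagResolver().map_tag(tags[0])
```

An explicit lookup table keeps arbitrary attribute names from being resolved as tags.
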
nested_validation(fold_str: str, new_folds_str: str = '5') None[source]

This tag will create a nested task, useful for cross-validation techniques. It drops any previous test samples, and re-declares the specified fold as new test data. Afterward, the remaining train samples are re-shuffled and distributed into new folds according to the new_folds argument.

Parameters:
  • fold_str (str) – the fold to be re-declared as test data

  • new_folds_str (str) – the number of new folds created from the remaining train data

Returns:

None
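
The re-declaration described above can be sketched on plain lists of sample ids. This is an illustrative approximation, not the library's code: `nest_folds` is a hypothetical helper, the `seed` argument is an assumption (nested_validation itself takes no seed), and the round-robin distribution is just one simple way to form the new folds.

```python
import random
from typing import List, Tuple


def nest_folds(folds: List[List[str]], fold_str: str, new_folds_str: str = "5",
               seed: int = 42) -> Tuple[List[List[str]], List[str]]:
    """Turn one fold into the new test set and re-split the remaining samples."""
    test_fold = int(fold_str)
    n_new = int(new_folds_str)
    new_test = list(folds[test_fold])  # previous test samples are simply not carried over
    remaining = [s for i, fold in enumerate(folds) if i != test_fold for s in fold]
    random.Random(seed).shuffle(remaining)  # seeded, hence reproducible
    new_folds = [remaining[i::n_new] for i in range(n_new)]  # round-robin distribution
    return new_folds, new_test


old_folds = [["a", "b"], ["c", "d"], ["e", "f"]]
new_folds, new_test = nest_folds(old_folds, "1", "2")
```

Both arguments are strings because they arrive as parts of a task tag.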

protocol(msg: str) None[source]

Method to log any processing to the creation_protocol of the meta information.

Parameters:

msg (str) – message to be logged, will be formatted with datetime and appended to the creation_protocol

Returns:

None

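A timestamped protocol entry might look like the sketch below. The exact format used by MML is not documented here, so the bracketed ISO timestamp and the representation of creation_protocol as a list of strings are assumptions, as is the `now` argument that makes the example deterministic.

```python
from datetime import datetime
from typing import List, Optional


def protocol(creation_protocol: List[str], msg: str, now: Optional[datetime] = None) -> None:
    """Append msg, prefixed with a timestamp, to the given protocol."""
    stamp = now if now is not None else datetime.now()
    creation_protocol.append(f"[{stamp.isoformat(timespec='seconds')}] {msg}")


log: List[str] = []
protocol(log, "Split data into 5 folds.", now=datetime(2024, 1, 2, 3, 4, 5))
```
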
push_and_test() Path[source]

Final step of task creation. Flushes the created task description and runs a test to load it.

Returns:

the path of the written .json task description

Return type:

Path

set_stats(means: RGBInfo | None = None, stds: RGBInfo | None = None, sizes: Sizes | None = None) None[source]

Alternative to infer_stats() with provided means, stds and sizes. May also only set a subset of those (or even None).

Parameters:
  • means (Optional[RGBInfo]) – task RGB channel means

  • stds (Optional[RGBInfo]) – task RGB channel stds

  • sizes (Optional[Sizes]) – task image dimensions

Returns:

None

split_folds(n_folds: int = 5, ensure_balancing: bool = True, fold_0_fraction: float | None = None, seed: int = 42) None[source]

Splits the found data into folds for cross-validation. It is necessary to call either this method or use_existing_folds before infer_stats.

This method requires the following attributes to be set:
  • self.data[DataSplit.FULL_TRAIN]

  • self.current_meta.task_type

  • self.current_meta.class_occ

  • self.current_meta.idx_to_class

This method sets the following attributes:
  • self.current_meta.train_folds

  • self.current_meta.train_samples

  • self.current_meta.test_samples

  • self.current_meta.unlabeled_samples

  • self.data

WARNING: The splitting of folds is deterministic to ensure reproducibility. One implication is that tasks with an identical number of training samples (and identical values for n_folds) will also be split identically (with respect to the order of the samples in self.data[DataSplit.FULL_TRAIN]). For classification tasks this can be avoided by using ensure_balancing (since sampling then also happens at class level), or in general by using the seed parameter.

Parameters:
  • n_folds – number of folds to split into

  • ensure_balancing – indicates if classes should be balanced across folds (only for classification tasks)

  • fold_0_fraction (Optional[float]) – if set, the first fold (usually used as validation split) will receive that fraction of samples and the rest will be distributed evenly across the remaining folds. If None, all folds will have the same size. When a value is provided it must be within (0, 1) and chosen such that at least one sample (per class if ensure_balancing is active) is contained in each fold.

  • seed (int) – controls the determinism behind splitting, default: 42

Returns:

None
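
A seeded, class-balanced split can be sketched in a few lines. This is a simplified illustration of the idea behind ensure_balancing and seed, not the algorithm implemented in split_folds: `split_balanced` is a hypothetical helper, fold_0_fraction is not handled, and samples are given as a plain mapping from sample id to class index.

```python
import random
from collections import defaultdict
from typing import Dict, List


def split_balanced(sample_to_class: Dict[str, int], n_folds: int = 5, seed: int = 42) -> List[List[str]]:
    """Distribute samples over folds class by class, using a seeded shuffle."""
    by_class: Dict[int, List[str]] = defaultdict(list)
    for sample_id, cls in sample_to_class.items():
        by_class[cls].append(sample_id)
    rng = random.Random(seed)  # same seed and same input order give the same split
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for cls in sorted(by_class):
        ids = by_class[cls]
        rng.shuffle(ids)
        for position, sample_id in enumerate(ids):
            folds[position % n_folds].append(sample_id)  # round-robin within the class
    return folds


samples = {f"s{i}": i % 2 for i in range(10)}  # ten samples, two classes
folds = split_balanced(samples, n_folds=5)
```

Because shuffling happens per class, the resulting split depends on the class labels as well as on the number of samples, which is the effect the warning above attributes to ensure_balancing.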

use_existing_folds(fold_definition: List[List[str]]) None[source]

Replacement for the split_folds function in case there are already predefined folds.

Parameters:

fold_definition (List[List[str]]) – list of lists of data ids, each list within the main list represents one fold, data ids must match the ones provided to find_data

Returns:

None
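
Before handing predefined folds to use_existing_folds, a quick consistency check can help. The helper below is a hypothetical pre-check, not part of MML; the requirement that ids match those given to find_data comes from the description above, while rejecting ids that occur in more than one fold is an additional assumption.

```python
from typing import List, Set


def check_fold_definition(fold_definition: List[List[str]], known_ids: Set[str]) -> None:
    """Raise ValueError if a fold contains an unknown or repeated sample id."""
    seen: Set[str] = set()
    for fold_index, fold in enumerate(fold_definition):
        for sample_id in fold:
            if sample_id not in known_ids:
                raise ValueError(f"fold {fold_index}: unknown id {sample_id!r}")
            if sample_id in seen:
                raise ValueError(f"fold {fold_index}: id {sample_id!r} used twice")
            seen.add(sample_id)


known = {"a", "b", "c", "d"}
check_fold_definition([["a", "b"], ["c", "d"]], known)  # passes silently
```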

verify_modality_entry(modality: Modality, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]

Extendable method to verify that the entries of a modality are well formatted. Looks up a potential verifier in the global MODALITY_VERIFIER_MAP dictionary and runs it. To support new modalities, modify this global dictionary.

Parameters:
  • modality

  • value

  • idx_to_class

  • class_occ

Returns:

None

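The dictionary-based dispatch can be illustrated with a self-contained sketch. None of the names below are MML's own: `Modality` is a reduced stand-in enum, `VERIFIERS` plays the role of MODALITY_VERIFIER_MAP, the verifier omits the creator argument of the real functions, and counting class occurrences inside the verifier is an assumption.

```python
from enum import Enum
from typing import Any, Callable, Dict


class Modality(Enum):  # hypothetical stand-in for the library's Modality
    SAMPLE_ID = "sample_id"
    CLASS = "class"


def verify_class(value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) -> None:
    """Check that value is a known class index and count the occurrence."""
    if not isinstance(value, int) or value not in idx_to_class:
        raise ValueError(f"invalid class entry: {value!r}")
    name = idx_to_class[value]
    class_occ[name] = class_occ.get(name, 0) + 1


VERIFIERS: Dict[Modality, Callable[..., None]] = {Modality.CLASS: verify_class}


def verify_entry(modality: Modality, value: Any, idx_to_class: Dict[int, str],
                 class_occ: Dict[str, int]) -> None:
    """Run the registered verifier for a modality, if there is one."""
    verifier = VERIFIERS.get(modality)
    if verifier is not None:
        verifier(value, idx_to_class, class_occ)


occurrences: Dict[str, int] = {}
verify_entry(Modality.CLASS, 1, {0: "background", 1: "instrument"}, occurrences)
```

Supporting a further modality then amounts to registering one more function in the dictionary.
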
implements_action(action: TaskCreatorActions)[source]

This is a decorator to simplify state management of the task creator. It also adds a “secret” <ignore_state> kwarg to most task creator methods; if this is set, no state check is performed.

Parameters:

action – the action that the following function implements

Returns:

a decorator
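
How such a state-checking decorator might work is sketched below. This is a guess at the general pattern, not MML's implementation: the `Action` enum, the `REQUIRES` table and the `Creator` class are all hypothetical, and the real decorator may track state quite differently.

```python
import functools
from enum import Enum, auto
from typing import Any, Callable, Dict, Optional, Set


class Action(Enum):  # hypothetical stand-in for TaskCreatorActions
    FIND_DATA = auto()
    SET_FOLDS = auto()


# Hypothetical rule table: which action must have run before another one.
REQUIRES: Dict[Action, Optional[Action]] = {Action.FIND_DATA: None, Action.SET_FOLDS: Action.FIND_DATA}


def implements_action(action: Action) -> Callable:
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(self: Any, *args: Any, ignore_state: bool = False, **kwargs: Any) -> Any:
            required = REQUIRES[action]
            if not ignore_state and required is not None and required not in self.done:
                raise RuntimeError(f"{func.__name__} requires {required.name} first")
            result = func(self, *args, **kwargs)
            self.done.add(action)  # remember that this action has been performed
            return result
        return wrapper
    return decorator


class Creator:
    def __init__(self) -> None:
        self.done: Set[Action] = set()

    @implements_action(Action.FIND_DATA)
    def find_data(self) -> str:
        return "found"

    @implements_action(Action.SET_FOLDS)
    def split_folds(self) -> str:
        return "split"
```

Calling `Creator().split_folds()` directly raises a RuntimeError in this sketch, whereas `Creator().split_folds(ignore_state=True)` skips the check.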

verify_class_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
verify_classes_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
verify_mask_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
verify_softclasses_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
verify_value_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]