mml.core.data_preparation.task_creator
- class TaskCreator[source]
Bases: object
Usage:
Creating a new task: (a) instantiate class with all available meta information (b) call find_data to locate data of the task (c) call split_folds (or use_existing_folds) to prepare folds (d) call infer_stats to calc and set means, stds and sizes of train data (use set_stats if already known) (e) call push_and_test to save
Modifying an existing task: (a) instantiate with correct dset_path (b) call load_existent with existing task_path (c) call any of the modification functions (optionally also multiple times) (d) if necessary call infer_stats to set means, stds and sizes of train data (e) call push_and_test to save
Auto creation based on task tags: (a) instantiate with arbitrary dset_path (b) call auto_create_tagged with the full name and also the preprocessing (c) raises a RuntimeError if not possible, else will create the task and return the path
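The "Creating a new task" workflow above can be sketched as a small driver function. This is a minimal sketch, not part of MML: `create_task` and the `RecordingCreator` stand-in are hypothetical names, and only the method names and keyword arguments are taken from the signatures documented on this page. With a real `TaskCreator` instance the same call order should apply.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


def create_task(creator: Any,
                train: List[Dict[Any, Any]],
                test: Optional[List[Dict[Any, Any]]],
                idx_to_class: Dict[int, str],
                n_folds: int = 5) -> Path:
    """Run steps (b) to (e) of 'Creating a new task' on an instantiated creator."""
    creator.find_data(train_iterator=train, test_iterator=test,
                      idx_to_class=idx_to_class)   # (b) locate the data
    creator.split_folds(n_folds=n_folds)           # (c) prepare folds
    creator.infer_stats()                          # (d) means, stds and sizes
    return creator.push_and_test()                 # (e) save and test-load


class RecordingCreator:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def find_data(self, **kwargs: Any) -> None:
        self.calls.append("find_data")

    def split_folds(self, **kwargs: Any) -> None:
        self.calls.append("split_folds")

    def infer_stats(self, **kwargs: Any) -> None:
        self.calls.append("infer_stats")

    def push_and_test(self) -> Path:
        self.calls.append("push_and_test")
        return Path("task.json")  # placeholder, not a real MML location


demo = RecordingCreator()
demo_path = create_task(demo, train=[], test=None, idx_to_class={0: "background"})
print(demo.calls)
```

If the statistics are already known, step (d) would call `set_stats` instead of `infer_stats`.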
- __init__(dset_path: Path, name: str = 'default', task_type: TaskType = TaskType.UNKNOWN, desc: str = '', ref: str = '', url: str = '', instr: str = '', lic: License = License.UNKNOWN, release: str = '', keywords: List[Keyword] | None = None)[source]
Collects everything needed to create a task. Use a new instance for each new task.
- Parameters:
dset_path – path to dataset root
name – name of the task to be created
task_type – task type of the task
desc – a short description of the task
ref – a reference (most likely some bibtex citation)
url – a URL linked to the task
instr – instructions to download the task (data)
lic – the license corresponding to the data of the task
release – either a release date or some version of the task
keywords – keywords associated to the task
- auto_complete(device: device | None = None) Path[source]
Shortcut for finishing task creation.
- Parameters:
device (Optional[torch.device]) – torch device that will be forwarded to infer_stats()
- Returns:
the task path as returned by push_and_test().
- find_data(train_iterator: List[Dict[Modality, int | List[int] | List[float] | str]] | None = None, test_iterator: List[Dict[Modality, int | List[int] | List[float] | str]] | None = None, unlabeled_iterator: List[Dict[Modality, int | List[int] | List[float] | str]] | None = None, idx_to_class: Dict[int, str] | None = None) None[source]
Correctly identifies data tuples yielded by the provided data iterators.
- Parameters:
train_iterator (Optional[List[SampleDescription]]) – all train data, provided as dicts with (optional) keys from ~mml.core.data_loading.task_attributes.Modality, where Modality.SAMPLE_ID is required and the corresponding value has to be unique. Some further potential entries are Modality.CLASS with an int value, and Modality.MASK with a Path to some (greyscale) image, both with values in idx_to_class.
test_iterator (Optional[List[SampleDescription]]) – test data iterator, same type as train_iterator. Note that MML expects test data to include labels (!); any label-free data is provided with unlabeled_iterator.
unlabeled_iterator (Optional[List[SampleDescription]]) – unlabeled data iterator, same type as train_iterator, can be used to either provide additional unlabeled “train data” or alternatively as prediction data
idx_to_class (Optional[Dict[int, str]]) – dict mapping ints to class names (e.g. {0 -> background, 1 -> instrument}). The indices may be non-contiguous (e.g. {0 -> background, 3 -> instrument}) for subclassing, and several indices may map to the same class (e.g. {0 -> background, 1 -> instrument, 3 -> instrument}) for merging classes. Please use index 0 for background in segmentation tasks (all unused values will be mapped to 0). This argument is required unless only unlabeled data is provided.
- Returns:
None
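The sample descriptions and the idx_to_class mapping expected by find_data can be illustrated as follows. This is a minimal sketch under stated assumptions: the `Modality` enum here is a local stand-in for `mml.core.data_loading.task_attributes.Modality` (its member values are made up), and `duplicate_sample_ids` and `class_names` are hypothetical helpers, not MML functions. Real sample descriptions will typically carry further entries (for example a reference to the image itself) that are omitted here.

```python
from collections import Counter
from enum import Enum
from typing import Any, Dict, List


class Modality(Enum):
    """Local stand-in for MML's Modality enum (member values are assumptions)."""
    SAMPLE_ID = "sample_id"
    CLASS = "class"


def duplicate_sample_ids(samples: List[Dict[Modality, Any]]) -> List[str]:
    """Return every SAMPLE_ID that occurs more than once, in sorted order."""
    counts = Counter(sample[Modality.SAMPLE_ID] for sample in samples)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)


def class_names(idx_to_class: Dict[int, str]) -> List[str]:
    """Distinct class names; indices that share a name are merged into one class."""
    return sorted(set(idx_to_class.values()))


train = [
    {Modality.SAMPLE_ID: "img_001", Modality.CLASS: 0},
    {Modality.SAMPLE_ID: "img_002", Modality.CLASS: 1},
    {Modality.SAMPLE_ID: "img_003", Modality.CLASS: 3},
]
# Indices 1 and 3 are merged into the single class "instrument".
merged = {0: "background", 1: "instrument", 3: "instrument"}

print(duplicate_sample_ids(train))  # []
print(class_names(merged))          # ['background', 'instrument']
```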
- identity(*args) None[source]
Dummy tag to create identical instances of a task. In contrast to naming the same task twice in the task_list, which will load from the same .json task description, the identity-tagged version creates a fresh task description.
- Parameters:
args – all args are ignored
- Returns:
None
- infer_stats(sizes: bool = True, mean_and_std: bool = True, const_size: bool = False, device: device | None = None) None[source]
Calculates the training data stats. For improved speed indicating a constant size of images helps.
- Parameters:
sizes (bool) – if the image sizes should be gathered
mean_and_std (bool) – if the mean and standard deviation of color channels should be gathered
const_size (bool) – if images are known to have constant size, setting this flag improves speed
device (torch.device) – if provided use this device, otherwise infer device based on availability
- Returns:
None
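To make "mean and standard deviation of color channels" concrete, the following sketch accumulates per-channel sums over all pixels of all images. It is an illustration only, not the method infer_stats uses: `channel_stats` is a hypothetical helper, images are represented as plain lists of RGB tuples, and the choice of the population standard deviation is an assumption.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def channel_stats(images: Iterable[Sequence[Pixel]]) -> Tuple[List[float], List[float]]:
    """Per-channel mean and population std over all pixels of all images."""
    count = 0
    sums = [0.0, 0.0, 0.0]
    squares = [0.0, 0.0, 0.0]
    for image in images:
        for pixel in image:
            count += 1
            for c, value in enumerate(pixel):
                sums[c] += value
                squares[c] += value * value
    if count == 0:
        raise ValueError("no pixels provided")
    means = [s / count for s in sums]
    # Var[x] = E[x^2] - E[x]^2, clamped at 0 to guard against rounding error
    stds = [math.sqrt(max(sq / count - m * m, 0.0))
            for sq, m in zip(squares, means)]
    return means, stds


images = [
    [(0.0, 0.5, 1.0), (1.0, 0.5, 1.0)],  # one image with two pixels
    [(0.0, 0.5, 1.0), (1.0, 0.5, 1.0)],
]
means, stds = channel_stats(images)
print(means)  # [0.5, 0.5, 1.0]
print(stds)   # [0.5, 0.0, 0.0]
```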
- load_existent(task_path: Path) None[source]
Loads an existent task .json, useful prior to modifications (e.g. tagging or preprocessing).
- Parameters:
task_path (Path) – path to task .json file
- Returns:
None
- map_tag(tag: str) Callable[source]
Correct way to resolve a task name tag to the corresponding modifier method.
- Parameters:
tag (str) – a string that can be appended to a task name e.g. ‘identity’ (appended as ‘+identity’)
- Returns:
- nested_validation(fold_str: str, new_folds_str: str = '5') None[source]
This tag will create a nested task, useful for cross-validation techniques. It drops any previous test samples and re-declares the specified fold as new test data. Afterward, the remaining train samples are re-shuffled and distributed into new folds according to the new_folds_str argument.
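The nesting step described above can be sketched on plain lists of sample ids. This is a rough illustration, not MML's implementation: `nest_folds` is a hypothetical helper, the `seed` parameter and the round-robin distribution are assumptions, and the real method receives its arguments as strings.

```python
import random
from typing import List, Tuple


def nest_folds(folds: List[List[str]], fold: int, new_folds: int = 5,
               seed: int = 42) -> Tuple[List[List[str]], List[str]]:
    """Turn one fold into test data and re-split the rest into new folds."""
    new_test = list(folds[fold])
    remaining = [s for i, f in enumerate(folds) if i != fold for s in f]
    random.Random(seed).shuffle(remaining)  # seeded, so the result is repeatable
    # round-robin distribution keeps fold sizes within one sample of each other
    new_train = [remaining[i::new_folds] for i in range(new_folds)]
    return new_train, new_test


old_folds = [["a", "b"], ["c", "d"], ["e", "f"]]
train_folds, test_ids = nest_folds(old_folds, fold=0, new_folds=2)
print(test_ids)  # ['a', 'b']
```

Any previous test samples would simply be discarded before this step, as the description states.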
- protocol(msg: str) None[source]
Method to log any processing to the creation_protocol of the meta information.
- Parameters:
msg (str) – message to be logged, will be formatted with datetime and appended to the creation_protocol
- Returns:
- push_and_test() Path[source]
Final step of task creation. Flushes the created task description and runs a test to load it.
- Returns:
the path of the written .json task description
- Return type:
Path
- set_stats(means: RGBInfo | None = None, stds: RGBInfo | None = None, sizes: Sizes | None = None) None[source]
Alternative to infer_stats() with provided means, stds and sizes. May also set only a subset of those (or even None).
- split_folds(n_folds: int = 5, ensure_balancing: bool = True, fold_0_fraction: float | None = None, seed: int = 42) None[source]
Splits the found data into folds for cross validation. It is necessary to call either this or the use_existing_folds method before infer_stats.
- This method requires the following attributes to be set:
self.data[DataSplit.FULL_TRAIN]
self.current_meta.task_type
self.current_meta.class_occ
self.current_meta.idx_to_class
- This method sets the following attributes:
self.current_meta.train_folds
self.current_meta.train_samples
self.current_meta.test_samples
self.current_meta.unlabeled_samples
self.data
WARNING: The splitting of folds happens deterministically to ensure reproducibility. One implication of this is that tasks with an identical number of training samples (and identical values for n_folds) will also be split identically (with respect to the order of the samples in self.data[DataSplit.FULL_TRAIN]). For classification tasks this can be prevented by using ensure_balancing (since sampling then also happens at class level) or in general by using the seed parameter.
- Parameters:
n_folds – number of folds to split into
ensure_balancing – indicates if classes should be balanced across folds (only for classification tasks)
fold_0_fraction (Optional[float]) – if set, the first fold (usually used as validation split) will receive that fraction of samples, and the rest will be distributed evenly across the remaining folds. If None, all folds will have the same size. When a value is provided it must be within (0, 1), and chosen such that at least one sample (per class if ensure_balancing is active) is contained in each fold.
seed (int) – controls the determinism behind splitting, default: 42
- Returns:
None
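The determinism warning and the fold_0_fraction parameter can be illustrated with a seeded split over sample indices. This is a minimal sketch assuming a shuffle-then-slice strategy; `split_indices` is a hypothetical helper, it ignores class balancing, and MML's actual algorithm may differ.

```python
import random
from typing import List, Optional


def split_indices(n_samples: int, n_folds: int = 5,
                  fold_0_fraction: Optional[float] = None,
                  seed: int = 42) -> List[List[int]]:
    """Deterministically split range(n_samples) into n_folds index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    if fold_0_fraction is None:
        return [indices[i::n_folds] for i in range(n_folds)]
    if not 0 < fold_0_fraction < 1:
        raise ValueError("fold_0_fraction must be within (0, 1)")
    n_first = round(n_samples * fold_0_fraction)
    rest = indices[n_first:]
    # fold 0 gets its fraction, the remainder is spread over the other folds
    folds = [indices[:n_first]]
    folds += [rest[i::n_folds - 1] for i in range(n_folds - 1)]
    if any(len(fold) == 0 for fold in folds):
        raise ValueError("every fold needs at least one sample")
    return folds


a = split_indices(100, n_folds=5)
b = split_indices(100, n_folds=5)            # same size, same seed
c = split_indices(100, n_folds=5, seed=7)    # same size, different seed
d = split_indices(100, n_folds=5, fold_0_fraction=0.5)
print(a == b)                    # True
print([len(f) for f in d])       # [50, 13, 13, 12, 12]
```

Two tasks with the same number of samples and the same seed receive the same index split, which is the effect the warning describes; changing the seed yields a different one.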
- use_existing_folds(fold_definition: List[List[str]]) None[source]
Replacement for the split_folds function in case there are already predefined folds.
- Parameters:
fold_definition (List[List[str]]) – list of lists of data ids, each list within the main list represents one fold, data ids must match the ones provided to find_data
- Returns:
None
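Before passing predefined folds to use_existing_folds, it can help to check that they are consistent with the ids given to find_data. The following sketch uses a hypothetical helper, `check_fold_definition`; the rules it enforces (no unknown ids, no id in two folds, every id assigned) are assumptions about what a valid definition looks like, not documented MML behavior.

```python
from typing import Iterable, List


def check_fold_definition(fold_definition: List[List[str]],
                          known_ids: Iterable[str]) -> None:
    """Raise ValueError unless the folds partition known_ids exactly."""
    known = set(known_ids)
    seen = set()
    for fold_number, fold in enumerate(fold_definition):
        for sample_id in fold:
            if sample_id not in known:
                raise ValueError(f"fold {fold_number}: unknown id {sample_id!r}")
            if sample_id in seen:
                raise ValueError(f"id {sample_id!r} appears more than once")
            seen.add(sample_id)
    missing = known - seen
    if missing:
        raise ValueError(f"ids without a fold: {sorted(missing)}")


# Passes silently: three ids, two folds, no overlap.
check_fold_definition([["a", "b"], ["c"]], known_ids=["a", "b", "c"])
```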
- verify_modality_entry(modality: Modality, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
Extendable method to verify that the entries of a modality are well formatted. Extracts a potential verifier from the global MODALITY_VERIFIER_MAP dictionary and runs it. To support new modalities, modify this global dictionary.
- Parameters:
modality
value
idx_to_class
class_occ
- Returns:
- implements_action(action: TaskCreatorActions)[source]
This is a decorator to simplify state management of the task creator. It also adds a “secret” <ignore_state> kwarg to most task creator methods; if this is set, no state check is done.
- Parameters:
action – the action that the following function implements
- Returns:
a decorator
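A decorator factory of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Action` enum stands in for TaskCreatorActions, the `REQUIRES` table is an invented rule set, and the `done` attribute is a made-up way of tracking state. Only the idea of an action-tagged decorator with an `ignore_state` keyword comes from the description above.

```python
import functools
from enum import Enum, auto
from typing import Any, Callable


class Action(Enum):
    """Hypothetical stand-in for TaskCreatorActions."""
    FIND_DATA = auto()
    SET_FOLDS = auto()


# Invented rule set: which actions must already have happened.
REQUIRES = {Action.FIND_DATA: set(), Action.SET_FOLDS: {Action.FIND_DATA}}


def implements_action(action: Action) -> Callable:
    """Decorator factory: check prerequisites, run the method, record the action."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(self: Any, *args: Any, ignore_state: bool = False,
                    **kwargs: Any) -> Any:
            # skip the check entirely when ignore_state is set
            if not ignore_state and not REQUIRES[action] <= self.done:
                raise RuntimeError(f"{func.__name__} called too early")
            result = func(self, *args, **kwargs)
            self.done.add(action)
            return result
        return wrapper
    return decorator


class MiniCreator:
    """Tiny example class using the decorator."""

    def __init__(self) -> None:
        self.done = set()

    @implements_action(Action.FIND_DATA)
    def find_data(self) -> None:
        pass

    @implements_action(Action.SET_FOLDS)
    def split_folds(self) -> str:
        return "split"


creator = MiniCreator()
creator.find_data()
print(creator.split_folds())  # split
```

Calling `MiniCreator().split_folds()` without a prior `find_data()` raises a RuntimeError in this sketch, while `split_folds(ignore_state=True)` bypasses the check.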
- verify_class_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
- verify_classes_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
- verify_mask_modality(creator: TaskCreator, value: Any, idx_to_class: Dict[int, str], class_occ: Dict[str, int]) None[source]
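The three module-level verifiers share one signature, which fits the dispatch through MODALITY_VERIFIER_MAP described for verify_modality_entry. The sketch below shows that registry pattern in isolation. It is not MML's code: the `Modality` enum is a local stand-in, the body of `verify_class_modality` (including the update of class_occ) is an assumed behavior, and silently skipping modalities without a registered verifier is also an assumption.

```python
from enum import Enum
from typing import Any, Callable, Dict


class Modality(Enum):
    """Local stand-in for MML's Modality enum (only the member used here)."""
    CLASS = "class"


def verify_class_modality(creator: Any, value: Any, idx_to_class: Dict[int, str],
                          class_occ: Dict[str, int]) -> None:
    """Assumed check: the label must be a known index; count its class."""
    if value not in idx_to_class:
        raise ValueError(f"class index {value!r} is not in idx_to_class")
    name = idx_to_class[value]
    class_occ[name] = class_occ.get(name, 0) + 1


# Registry: supporting a new modality means adding an entry here.
MODALITY_VERIFIER_MAP: Dict[Modality, Callable[..., None]] = {
    Modality.CLASS: verify_class_modality,
}


def verify_modality_entry(creator: Any, modality: Modality, value: Any,
                          idx_to_class: Dict[int, str],
                          class_occ: Dict[str, int]) -> None:
    """Look up a verifier for the modality and run it if one is registered."""
    verifier = MODALITY_VERIFIER_MAP.get(modality)
    if verifier is not None:
        verifier(creator, value, idx_to_class, class_occ)


occurrences: Dict[str, int] = {}
verify_modality_entry(None, Modality.CLASS, 1,
                      {0: "background", 1: "instrument"}, occurrences)
print(occurrences)  # {'instrument': 1}
```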