mml.core.data_loading.file_manager

class MMLFileManager

Bases: Singleton

This class keeps track of the file structure of MML. It ensures a consistent checkpointing and loading strategy, provides aggregated listings, and handles requests for files.

GLOBAL_REUSABLE = '_TOP_LEVEL_'
__init__(data_path: Path, proj_path: Path, log_path: Path, reuse_cfg: DictConfig | ReuseConfig | None = None, remove_cfg: DictConfig | RemoveConfig | None = None)

The file manager is a singleton and is usually created only once. Afterward it can be accessed from anywhere via MMLFileManager.instance(); refer to Singleton for more details.

Parameters:
  • data_path (Path) – path to MML data

  • proj_path (Path) – path to current experiment root

  • log_path (Path) – path to current experiment run root

  • reuse_cfg (ReuseConfig) – (optional) configuration specifying which files of the project should be reused

  • remove_cfg (RemoveConfig) – (optional) configuration specifying which files of the project should be deleted
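
The create-once, access-anywhere pattern described above can be sketched as follows. Both classes are hypothetical stand-ins written for illustration and are not MML's implementation; only the instance() accessor name and the three path arguments come from the text. Raising on a second construction is an assumption of this sketch, and the example paths are made up.

```python
from pathlib import Path


class Singleton:
    """Hypothetical stand-in for a singleton base class (not MML's code)."""

    _instance = None

    def __init__(self) -> None:
        cls = type(self)
        # ASSUMPTION: a second construction is treated as an error in this sketch.
        if cls._instance is not None:
            raise RuntimeError(f"{cls.__name__} was already created")
        cls._instance = self

    @classmethod
    def instance(cls):
        # Return the one existing object without repeating constructor arguments.
        if cls._instance is None:
            raise RuntimeError(f"{cls.__name__} has not been created yet")
        return cls._instance


class FileManagerSketch(Singleton):
    """Hypothetical class that only stores the three documented paths."""

    def __init__(self, data_path: Path, proj_path: Path, log_path: Path) -> None:
        super().__init__()
        self.data_path = data_path
        self.proj_path = proj_path
        self.log_path = log_path


# Create the object once at start-up (the example paths are made up) ...
manager = FileManagerSketch(Path("data"), Path("results/proj"), Path("results/proj/run_0"))
# ... and fetch the very same object from anywhere else.
same_manager = FileManagerSketch.instance()
```

With the real class, the equivalent of the last line would be a call to MMLFileManager.instance() after the file manager has been initialised.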

classmethod add_assignment_path(obj_cls: type | None, key: str, path: Path | str, enable_numbering: bool = True, reusable: bool | str = False) → None

Adds a custom path assignment to the file manager. The assignment must be made before the file manager is initialized (note that this is a class method), for example just before starting main() in your code or in the ‘activate.py’ of your plugin. Once the assignment is made, a new path can be requested via the construct_saving_path() method. Path assignments also control the reuse functionality of the file manager.

Parameters:
  • obj_cls (Optional[type]) – the class of objects you want to create a path for. It is used for double-checking during usage; provide None to omit this check

  • key (str) – the key used to refer to the path. It must be unique; a KeyError is raised if the key is already in use

  • path (Union[Path, str]) – the desired path to store the data. It must start with either PROJ_PATH or TEMP_PATH, which will later be replaced with the file manager's proj_path or temp_data attribute respectively; otherwise a ValueError is raised. The path may use the following further special tokens, which are replaced during actual path creation: TASK_NAME and FILE_NAME. TASK_NAME is a placeholder that is replaced during construct_saving_path() and is necessary for the reuse functionality (see reusable below). FILE_NAME allows naming the file in the actual call to construct_saving_path() and is only allowed as the last part of the path.

  • enable_numbering (bool) – boolean deciding if the path is static or will dynamically increase when a file already exists. Default: True

  • reusable (Union[bool, str]) – if True, the paths are reusable. This allows setting reuse.key=project (or even reuse.key=[project1,project2] to load from multiple projects, where the last found artefact persists) when starting mml, which automatically attaches the latest path under PROJ_PATH/<ATTRIBUTE>/TASK_NAME to each task struct's paths dictionary (TaskStruct.paths) with key as the key. This requires path to fit the format PROJ_PATH/<ATTRIBUTE>/TASK_NAME/<some_file_name>, where <ATTRIBUTE> is required to be capitalized by convention; otherwise a ValueError is raised. If the string MMLFileManager.GLOBAL_REUSABLE is used instead, any found reusable is attached to the _TOP_LEVEL_ entry and can be reached via the global_reusables property. This does not require the path format specified above.

Raises:
  • KeyError – if key is already used for a path construction

  • ValueError – if any of the following applies:

      • path does not start with PROJ_PATH or TEMP_PATH
      • the path has no suffix to indicate a file type (unless FILE_NAME is the last path segment)
      • FILE_NAME is used as a non-final part of the path
      • the obj_cls argument is neither None nor a class
      • reusable=True but path does not match the described requirements
      • path contains .. or ~
      • one of several further checks fails

Returns:

None
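
Some of the path rules listed above can be illustrated with a small validator. This is a sketch only: check_assignment_path is a hypothetical helper, it covers just four of the documented ValueError conditions, and it is not how MML performs its checks internally.

```python
from pathlib import PurePosixPath

ROOT_TOKENS = ("PROJ_PATH", "TEMP_PATH")


def check_assignment_path(path) -> None:
    """Hypothetical validator mirroring a subset of the documented rules."""
    text = str(path)
    parts = PurePosixPath(text).parts
    # Rule: the path must start with one of the two root tokens.
    if not parts or parts[0] not in ROOT_TOKENS:
        raise ValueError("path must start with PROJ_PATH or TEMP_PATH")
    # Rule: no parent-directory or home-directory shortcuts.
    if ".." in parts or "~" in text:
        raise ValueError("'..' and '~' are not allowed in the path")
    # Rule: FILE_NAME may only be the final segment.
    if "FILE_NAME" in parts[:-1]:
        raise ValueError("FILE_NAME is only allowed as the last path segment")
    # Rule: without FILE_NAME at the end, a suffix must indicate the file type.
    if parts[-1] != "FILE_NAME" and PurePosixPath(text).suffix == "":
        raise ValueError("path needs a suffix indicating the file type")


# Patterns that satisfy the rules checked here (names are made up):
check_assignment_path("PROJ_PATH/MODELS/TASK_NAME/model.pt")
check_assignment_path("TEMP_PATH/cache/FILE_NAME")
```

The first pattern also follows the PROJ_PATH/<ATTRIBUTE>/TASK_NAME/<some_file_name> layout that reusable=True requires.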

add_to_task_index(path: Path) → None

Adds a single task path to the task index.

Parameters:

path – path to .json to be added

Returns:

None

construct_saving_path(obj: object, key: str, task_name: str | None = None, file_name: str | None = None) → Path

All file saving is organised here to avoid unwanted interactions between different applications.

Parameters:
  • obj (Any) – object to be saved (if the objects to be saved are files themselves, simply pass None)

  • key (str) – string, must be in the DEFAULT_ASSIGNMENTS or manually assigned previously (see add_assignment_path())

  • task_name (Optional[str]) – (optional) name of the task, only necessary if TASK_NAME is in the assignment pattern

  • file_name (Optional[str]) – (optional) file name, only necessary if FILE_NAME is in the assignment pattern

Returns:

path to save the object
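
How an assignment pattern might be turned into a concrete path can be sketched as below. resolve_template is a hypothetical helper, not MML's code. In particular, the numbering scheme used here (appending _1, _2, ... to the file stem) is an assumption, since the text only states that the path increases dynamically when a file already exists.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_template(template: str, proj_path: Path, temp_path: Path,
                     task_name: Optional[str] = None,
                     file_name: Optional[str] = None,
                     enable_numbering: bool = True) -> Path:
    """Hypothetical token replacement for an assignment pattern."""
    replacements = {
        "PROJ_PATH": str(proj_path),
        "TEMP_PATH": str(temp_path),
        "TASK_NAME": task_name,
        "FILE_NAME": file_name,
    }
    parts = []
    for part in Path(template).parts:
        if part in replacements:
            value = replacements[part]
            if value is None:
                raise ValueError(f"{part} is in the pattern but no value was given")
            parts.append(value)
        else:
            parts.append(part)
    path = Path(*parts)
    if not enable_numbering:
        return path  # static path, returned even if the file already exists
    # ASSUMPTION: made-up numbering scheme (stem_1, stem_2, ...).
    candidate, counter = path, 1
    while candidate.exists():
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    proj, temp = Path(tmp) / "proj", Path(tmp) / "temp"
    pattern = "PROJ_PATH/MODELS/TASK_NAME/model.pt"
    first = resolve_template(pattern, proj, temp, task_name="task_a")
    first.parent.mkdir(parents=True)
    first.touch()  # simulate saving a file at the first path
    second = resolve_template(pattern, proj, temp, task_name="task_a")
    first_name, second_name = first.name, second.name
```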

get_all_dset_names() → Dict[str, Dict[str, Path]]

Returns all found dataset names. Datasets are clustered by their preprocessing.

Returns:

dict mapping each preprocessing ID to a dict that maps dataset names to their root paths; note that the literal string none is used for data that is not preprocessed

Return type:

Dict[str, Dict[str, Path]]
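
The shape of the returned mapping can be illustrated with a small example. The dictionary below is hand-written sample data with made-up dataset names and paths, not output of the real method, and preprocessings_of is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, List

# Sample data in the documented shape: preprocessing ID -> dataset name -> root path.
# The key "none" stands for data that is not preprocessed.
dsets: Dict[str, Dict[str, Path]] = {
    "none": {"dset_a": Path("RAW/dset_a"), "dset_b": Path("RAW/dset_b")},
    "pp_x": {"dset_a": Path("PREPROCESSED/pp_x/dset_a")},
}


def preprocessings_of(all_dsets: Dict[str, Dict[str, Path]], name: str) -> List[str]:
    """List every preprocessing ID under which a dataset is available."""
    return sorted(pp for pp, entries in all_dsets.items() if name in entries)


available_a = preprocessings_of(dsets, "dset_a")
available_b = preprocessings_of(dsets, "dset_b")
```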

get_dataset_path(dset_name: str | None = None, raw_path: Path | None = None, preprocessing: str | None = None) → Path

Creates and/or returns a correct dataset directory for the given preprocessing ID. Note that this should only be called once for the raw case, but can be called multiple times for (even identical) preprocessing IDs.

Parameters:
  • dset_name (Optional[str]) – name of the dataset (provide this in case a new raw dataset is created)

  • raw_path (Optional[Path]) – existing path to the raw version of a dataset (provide this in case a preprocessing is desired)

  • preprocessing (Optional[str]) – preprocessing ID, None for raw data

Raises:
  • AssertionError – in case exactly one of preprocessing and raw_path is given

  • AssertionError – in case not exactly one of raw_path and dset_name is given

  • FileExistsError – in case dset_name has been given before

Returns:

path to store dataset files

Return type:

Path
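
The two argument checks listed under Raises leave exactly two valid call patterns: dset_name alone for a new raw dataset, or raw_path together with preprocessing. A minimal sketch of that logic, using a hypothetical helper dataset_request_kind that does not create any directories:

```python
from pathlib import Path
from typing import Optional


def dataset_request_kind(dset_name: Optional[str] = None,
                         raw_path: Optional[Path] = None,
                         preprocessing: Optional[str] = None) -> str:
    """Hypothetical check of the documented argument combinations."""
    # preprocessing and raw_path must be given together or not at all.
    if (preprocessing is None) != (raw_path is None):
        raise AssertionError("give both preprocessing and raw_path, or neither")
    # Exactly one of raw_path and dset_name must be given.
    if (raw_path is None) == (dset_name is None):
        raise AssertionError("give exactly one of raw_path and dset_name")
    return "raw" if raw_path is None else "preprocessed"


# The two valid call patterns (the names and paths are made up):
kind_raw = dataset_request_kind(dset_name="my_dataset")
kind_pp = dataset_request_kind(raw_path=Path("RAW/my_dataset"), preprocessing="pp_x")
```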

get_download_path(dset_name: str) → Path

Creates and returns a download path for the given dataset name. Calling this again with the same name returns the same download path, which makes it possible to detect existing downloads and to continue them.

Parameters:

dset_name (str) – string (ideally representing the dataset)

Returns:

empty directory to store downloaded data

Return type:

Path
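
The idea of a stable download location can be sketched with plain pathlib. download_dir and has_partial_download are hypothetical helpers, and where MML actually places its downloads is not specified here; the sketch only shows why returning the same directory for the same name lets a caller notice earlier downloads.

```python
import tempfile
from pathlib import Path


def download_dir(root: Path, dset_name: str) -> Path:
    """Return the same directory for the same name, creating it on first use."""
    path = root / dset_name
    path.mkdir(parents=True, exist_ok=True)
    return path


def has_partial_download(path: Path) -> bool:
    """A non-empty directory hints at an earlier (possibly unfinished) download."""
    return any(path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    first = download_dir(Path(tmp), "my_dataset")
    fresh = not has_partial_download(first)      # nothing downloaded yet
    (first / "archive.zip.part").touch()         # simulate an interrupted download
    second = download_dir(Path(tmp), "my_dataset")
    resumed = second == first and has_partial_download(second)
```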

get_pp_definition(preprocessing: str) → Path

Returns the path to the definition copy of a created preprocessing folder. This can be used to check whether a preprocessing definition has changed since its processing. If no file has been created so far, a new one will be created.

Parameters:

preprocessing – the ID of the preprocessing

Returns:

path to a file to store a preprocessing definition

get_task_info(task_name: str, preprocess: str) → dict

Locates (if possible) the preprocessed task with provided name. Falls back to raw data in case preprocess is not available. Returns dict that can be used to construct a TaskStruct object.

Parameters:
  • task_name (str) – name of the task

  • preprocess (str) – a preprocess id (e.g. ‘none’ for raw task)

Returns:

kwargs required for TaskStruct

get_task_path(dset_path: Path, task_alias: str) → Path

Creates and returns a correct task file path.

Parameters:
  • dset_path – dataset path of the task

  • task_alias – name of the task (is used as abbreviation internally and initialises dir name)

Returns:

path to put task .json file

property global_reusables: Dict[str, Path]

Global reusables are not attached to a specific task.

static load_task_description(path: Path) → TaskDescription

Returns TaskDescription of given task.

Parameters:

path (Path) – path to .json file

Returns:

loaded TaskDescription with all meta information

static load_task_description_header(path: Path) → TaskDescription

Similar to load_task_description(), but recovers only the header information necessary to construct a TaskStruct, without the details regarding folds and sample paths. Useful for large .json files to save time on MML initialisation.

Parameters:

path (Path) – path to .json file

Returns:

TaskDescription with only meta information

reload_task_index() → None

Scans self.raw_data and self.preprocessed_data for all available tasks and thereby builds an index of task aliases and their preprocessings. The task index is a dict of dicts with the hierarchy name -> preprocessing -> (relative) path.

Returns:

None
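
The index hierarchy name -> preprocessing -> (relative) path, together with the fallback to raw data described for get_task_info(), can be sketched as follows. build_index and locate are hypothetical helpers, the scanned entries are made-up sample data, and using the key none for raw tasks inside the index is an assumption carried over from the preprocess parameter description.

```python
from pathlib import Path
from typing import Dict, Iterable, Tuple

TaskIndex = Dict[str, Dict[str, Path]]


def build_index(found: Iterable[Tuple[str, str, Path]]) -> TaskIndex:
    """Group scanned (name, preprocessing, relative path) entries by task name."""
    index: TaskIndex = {}
    for name, preprocessing, rel_path in found:
        index.setdefault(name, {})[preprocessing] = rel_path
    return index


def locate(index: TaskIndex, name: str, preprocessing: str) -> Tuple[str, Path]:
    """Prefer the requested preprocessing, fall back to the raw entry."""
    entries = index[name]
    if preprocessing in entries:
        return preprocessing, entries[preprocessing]
    # ASSUMPTION: raw tasks are stored under the key "none".
    return "none", entries["none"]


index = build_index([
    ("task_a", "none", Path("RAW/dset/task_a.json")),
    ("task_a", "pp_x", Path("PREPROCESSED/pp_x/dset/task_a.json")),
    ("task_b", "none", Path("RAW/dset/task_b.json")),
])
hit = locate(index, "task_a", "pp_x")       # preprocessed version exists
fallback = locate(index, "task_b", "pp_x")  # only raw data is available
```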

remove_intermediates() → None

Based on the clean_up settings of the reuse_config, deletes intermediate results within this project to minimize disk usage. Other intermediates, such as the checkpoints path and the temp path, are also cleared.

Returns:

None

property results_root: Path

The root path of the current system's results.

static undo_prefix(dir_name: str) → str

Reverts the addition of a task/dset prefix.

Parameters:

dir_name – string to remove the prefix from

Returns:

non-prefixed string

static write_task_description(path: Path, task_description: TaskDescription, omit_warning: bool = False) → None

Stores meta information of a task at the given path.

Parameters:
  • path (Path) – path to store .json file

  • task_description (TaskDescription) – TaskDescription of a task

  • omit_warning (bool) – if True, no warning is raised even if the file already exists

Returns:

None
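
The effect of omit_warning can be illustrated with the standard warnings module. write_description is a hypothetical stand-in that stores a plain dict instead of a TaskDescription; that an existing file is overwritten is an assumption of this sketch, as the text only states when a warning is raised.

```python
import json
import tempfile
import warnings
from pathlib import Path


def write_description(path: Path, description: dict, omit_warning: bool = False) -> None:
    """Hypothetical writer: warn about an existing file unless told otherwise."""
    if path.exists() and not omit_warning:
        warnings.warn(f"{path} already exists and will be overwritten")
    # ASSUMPTION: existing content is replaced.
    path.write_text(json.dumps(description, indent=2))


with tempfile.TemporaryDirectory() as tmp, warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    target = Path(tmp) / "task.json"
    write_description(target, {"name": "task_a"})                            # new file
    write_description(target, {"name": "task_a", "v": 2})                    # warns
    write_description(target, {"name": "task_a", "v": 3}, omit_warning=True)  # silent
    n_warnings = len(caught)
    stored = json.loads(target.read_text())
```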

class RemoveConfig

Bases: object

RemoveConfig(img_examples: bool = False, blueprint: bool = False, parameters: bool = False, pipeline: bool = False, predictions: bool = False, models: bool = False, sample_grid: bool = False, backup: bool = False)

__init__(img_examples: bool = False, blueprint: bool = False, parameters: bool = False, pipeline: bool = False, predictions: bool = False, models: bool = False, sample_grid: bool = False, backup: bool = False) None
backup: bool = False
blueprint: bool = False
img_examples: bool = False
models: bool = False
parameters: bool = False
pipeline: bool = False
predictions: bool = False
sample_grid: bool = False

class ReuseConfig

Bases: object

ReuseConfig(blueprint: Union[str, List[str], NoneType] = None, models: Union[str, List[str], NoneType] = None, parameters: Union[str, List[str], NoneType] = None)

__init__(blueprint: str | List[str] | None = None, models: str | List[str] | None = None, parameters: str | List[str] | None = None) None
blueprint: str | List[str] | None = None
models: str | List[str] | None = None
parameters: str | List[str] | None = None
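
Each ReuseConfig field accepts a single project name, a list of project names, or None. A minimal sketch of how such a value could be normalised and how results from several projects could be combined is given below. ReuseConfigSketch mirrors the documented fields but is a hypothetical stand-in; projects_for and merge_found are hypothetical helpers. Letting later projects override earlier ones is one reading of "the last found artefact persists", not confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ReuseConfigSketch:
    """Stand-in with the same three fields as the documented ReuseConfig."""
    blueprint: Union[str, List[str], None] = None
    models: Union[str, List[str], None] = None
    parameters: Union[str, List[str], None] = None


def projects_for(value: Union[str, List[str], None]) -> List[str]:
    """Normalise a field value to a list of project names."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def merge_found(per_project: Dict[str, Dict[str, str]], projects: List[str]) -> Dict[str, str]:
    """Combine task -> path mappings; later projects override earlier ones."""
    merged: Dict[str, str] = {}
    for project in projects:
        merged.update(per_project.get(project, {}))
    return merged


cfg = ReuseConfigSketch(models=["project1", "project2"], parameters="project1")
# Made-up artefacts found per project, keyed by task name.
found = {
    "project1": {"task_a": "project1/MODELS/task_a/model.pt",
                 "task_b": "project1/MODELS/task_b/model.pt"},
    "project2": {"task_a": "project2/MODELS/task_a/model.pt"},
}
model_paths = merge_found(found, projects_for(cfg.models))
```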