mml.core.data_loading.file_manager
- class MMLFileManager[source]
Bases: Singleton
This class keeps track of the file structure of MML. It ensures a consistent checkpointing and loading strategy, provides aggregated listings, and handles requests for files.
- GLOBAL_REUSABLE = '_TOP_LEVEL_'
- __init__(data_path: Path, proj_path: Path, log_path: Path, reuse_cfg: DictConfig | ReuseConfig | None = None, remove_cfg: DictConfig | RemoveConfig | None = None)[source]
The file manager is a singleton class and is usually created only once. Afterward it can be accessed from anywhere via MMLFileManager.instance(); refer to Singleton for more details.
- Parameters:
data_path (Path) – path to MML data
proj_path (Path) – path to current experiment root
log_path (Path) – path to current experiment run root
reuse_cfg (ReuseConfig) – (optional) a configuration on which files of the project should be reused
remove_cfg (RemoveConfig) – (optional) a configuration on which files of the project should be deleted
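The access pattern described above (create once, then fetch via instance()) can be sketched as follows. This is a minimal illustration only: the Singleton class below is a hypothetical stand-in, not MML's actual base class, and both FileManagerSketch and the choice to raise on a second construction are assumptions made for the example.

```python
from pathlib import Path


class Singleton:
    """Hypothetical stand-in for a singleton base class (not MML's implementation)."""

    _instances = {}

    def __new__(cls, *args, **kwargs):
        # Assumption for this sketch: creating a second instance is an error.
        if cls in Singleton._instances:
            raise RuntimeError(f"{cls.__name__} has already been created")
        obj = super().__new__(cls)
        Singleton._instances[cls] = obj
        return obj

    @classmethod
    def instance(cls):
        # Return the one existing instance, or fail if none was created yet.
        try:
            return Singleton._instances[cls]
        except KeyError:
            raise RuntimeError(f"{cls.__name__} has not been created yet") from None


class FileManagerSketch(Singleton):
    """Illustrative class holding the three paths named in the parameter list."""

    def __init__(self, data_path, proj_path, log_path):
        self.data_path = Path(data_path)
        self.proj_path = Path(proj_path)
        self.log_path = Path(log_path)


# Created once at start-up ...
fm = FileManagerSketch("data", "proj", "proj/run")
# ... and retrieved anywhere else without passing the object around.
same = FileManagerSketch.instance()
```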
- classmethod add_assignment_path(obj_cls: type | None, key: str, path: Path | str, enable_numbering: bool = True, reusable: bool | str = False) None[source]
Adds a custom path assignment to the file manager. The assignment must be made before the file manager is initialized (note that this is a class method), for example just before starting main() inside your code or in the 'activate.py' of your plugin. Once the assignment is done, a new path can be requested via the construct_saving_path() method. Furthermore, the path assignments control the reuse functionality of the file manager.
- Parameters:
obj_cls (Optional[type]) – the class of objects you want to create a path for, this is used for double-checking during usage, provide None if you want to omit this step
key (str) – the key you want to refer to your path by; it must be unique, otherwise a KeyError is raised
path (Union[Path, str]) – the desired path to store the data. It must start with either PROJ_PATH or TEMP_PATH, which will later be replaced with the file manager's proj_path or temp_data attribute, respectively; otherwise a ValueError is raised. The path may use the following further special tokens, which are replaced during actual path creation: TASK_NAME and FILE_NAME. TASK_NAME is a placeholder that is replaced during construct_saving_path() and is necessary for the reuse functionality (see reusable below). FILE_NAME allows naming the file during the actual call to construct_saving_path() and is only allowed as the last part of the path.
enable_numbering (bool) – decides whether the path is static or will dynamically increase when a file already exists. Default: True
reusable (Union[bool, str]) – if True, the paths should be reusable. This allows setting reuse.key=project (or even reuse.key=[project1,project2] to load from multiple projects, where the last found artefact persists) when starting mml, which automatically attaches the latest path under PROJ_PATH/<ATTRIBUTE>/TASK_NAME to each task struct's paths dictionary (mml.core.data_loading.task_struct.TaskStruct.paths) with key as a key. This requires path to fit the format PROJ_PATH/<ATTRIBUTE>/TASK_NAME/<some_file_name>, where <ATTRIBUTE> is required to be capitalized by convention; otherwise a ValueError is raised. If the string GLOBAL_REUSABLE (mml.core.data_loading.file_manager.MMLFileManager.GLOBAL_REUSABLE) is used instead, any found reusable is attached to the _TOP_LEVEL_ entry and can be reached via the global_reusables property. This does not require the path format previously specified.
- Raises:
KeyError – if key is already used for a path construction
ValueError – if any of the following holds:
* path does not start with PROJ_PATH or TEMP_PATH
* the path has no suffix to indicate a file type (except if FILE_NAME is the last path segment)
* FILE_NAME is used as a non-final part of the path
* the obj_cls argument is neither None nor a class
* reusable=True but path does not match the described requirements
* .. or ~ appears in path
Some further checks may also raise a ValueError.
- Returns:
None
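To make the path rules above concrete, here is a minimal standalone sketch of some of the documented checks. It does not call MML; check_assignment_path and is_valid are hypothetical helpers written for this example, and reading "capitalized" as "upper case" is an assumption.

```python
from pathlib import Path

ROOT_TOKENS = ("PROJ_PATH", "TEMP_PATH")


def check_assignment_path(path, reusable=False):
    """Hypothetical re-implementation of some of the documented path checks."""
    parts = Path(path).parts
    if not parts or parts[0] not in ROOT_TOKENS:
        raise ValueError("path must start with PROJ_PATH or TEMP_PATH")
    if ".." in parts or any("~" in part for part in parts):
        raise ValueError("'..' and '~' are not allowed in the path")
    if "FILE_NAME" in parts[:-1]:
        raise ValueError("FILE_NAME is only allowed as the last part of the path")
    if parts[-1] != "FILE_NAME" and not Path(parts[-1]).suffix:
        raise ValueError("path needs a suffix indicating the file type")
    if reusable is True:
        # Documented format: PROJ_PATH/<ATTRIBUTE>/TASK_NAME/<some_file_name>
        # Assumption: "capitalized" is interpreted as str.isupper().
        fits = (
            len(parts) == 4
            and parts[0] == "PROJ_PATH"
            and parts[1].isupper()
            and parts[2] == "TASK_NAME"
        )
        if not fits:
            raise ValueError("reusable paths must fit PROJ_PATH/<ATTRIBUTE>/TASK_NAME/<file>")
    return Path(path)


def is_valid(path, reusable=False):
    """Convenience wrapper returning True/False instead of raising."""
    try:
        check_assignment_path(path, reusable=reusable)
    except ValueError:
        return False
    return True
```

Under these assumptions, a template such as PROJ_PATH/MODELS/TASK_NAME/model.pth passes with reusable=True, while PROJ_PATH/FILE_NAME/model.pth is rejected because FILE_NAME is not the final segment.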
- add_to_task_index(path: Path) None[source]
Adds a single task path to the task index.
- Parameters:
path – path to .json to be added
- Returns:
None
- construct_saving_path(obj: object, key: str, task_name: str | None = None, file_name: str | None = None) Path[source]
All file saving is organised here to avoid unwanted interactions between different applications.
- Parameters:
obj (Any) – object to be saved (if the object itself consists of files, simply give None)
key (str) – string, must be in the DEFAULT_ASSIGNMENTS or manually assigned previously (see add_assignment_path())
task_name (Optional[str]) – (optional) name of the task, only necessary if TASK_NAME is in the assignment pattern
file_name (Optional[str]) – (optional) file name, only necessary if FILE_NAME is in the assignment pattern
- Returns:
path to save the object
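The token replacement and numbering behaviour described for assignment paths can be illustrated with a small standalone sketch. It is not MML's implementation: fill_template and next_free_path are hypothetical helpers, and the "_<n>" numbering scheme is an assumption chosen for the example.

```python
import tempfile
from pathlib import Path


def fill_template(template, roots, task_name=None, file_name=None):
    """Replace the special tokens in an assignment template (hypothetical helper)."""
    parts = []
    for part in Path(template).parts:
        if part in roots:  # PROJ_PATH or TEMP_PATH
            part = str(roots[part])
        elif part == "TASK_NAME":
            if task_name is None:
                raise ValueError("template contains TASK_NAME but no task_name was given")
            part = task_name
        elif part == "FILE_NAME":
            if file_name is None:
                raise ValueError("template contains FILE_NAME but no file_name was given")
            part = file_name
        parts.append(part)
    return Path(*parts)


def next_free_path(path):
    """Return path itself, or a numbered variant if it already exists.

    The '_<n>' suffix scheme is an assumption made for this sketch.
    """
    candidate, n = path, 0
    while candidate.exists():
        n += 1
        candidate = path.with_name(f"{path.stem}_{n}{path.suffix}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    roots = {"PROJ_PATH": Path(tmp) / "proj", "TEMP_PATH": Path(tmp) / "temp"}
    template = "PROJ_PATH/MODELS/TASK_NAME/model.pth"

    # First request: nothing exists yet, so the plain name is returned.
    first = next_free_path(fill_template(template, roots, task_name="task_a"))
    first.parent.mkdir(parents=True)
    first.touch()

    # Second request for the same template: the name is numbered instead.
    second = next_free_path(fill_template(template, roots, task_name="task_a"))
    print(first.name, second.name)
```

With numbering disabled, the equivalent of next_free_path would simply be skipped, so repeated requests return the same static path.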
- get_all_dset_names() Dict[str, Dict[str, Path]][source]
Returns all found dataset names. Datasets are clustered by their preprocessing.
- get_dataset_path(dset_name: str | None = None, raw_path: Path | None = None, preprocessing: str | None = None) Path[source]
Creates and/or returns a correct dataset directory for the given preprocessing ID. Note that this should only be called once for the raw case, but can be called multiple times for (even identical) preprocessing IDs.
- Parameters:
- Raises:
AssertionError – in case exactly one of preprocessing and raw_path is given
AssertionError – in case not exactly one of raw_path and dset_name is given
FileExistsError – in case dset_name has been given before
- Returns:
path to store dataset files
- Return type:
Path
- get_download_path(dset_name: str) Path[source]
Creates and returns a download path for the given dataset name. Repeated calls point to the same download path, which makes it possible to detect existing downloads and to continue interrupted ones.
- get_pp_definition(preprocessing: str) Path[source]
Returns a definition copy of a created preprocessing folder. Can be used to check whether a preprocessing definition has changed since its processing. If no file has been created so far, a new one will be created.
- Parameters:
preprocessing – the ID of the preprocessing
- Returns:
path to a file to store a preprocessing definition
- get_task_info(task_name: str, preprocess: str) dict[source]
Locates (if possible) the preprocessed task with the provided name. Falls back to raw data in case preprocess is not available. Returns a dict that can be used to construct a TaskStruct object.
- get_task_path(dset_path: Path, task_alias: str) Path[source]
Creates and returns a correct task file path.
- Parameters:
dset_path – dataset path of the task
task_alias – name of the task (is used as abbreviation internally and initialises dir name)
- Returns:
path to put task .json file
- static load_task_description(path: Path) TaskDescription[source]
Returns TaskDescription of given task.
- Parameters:
path (Path) – path to .json file
- Returns:
loaded TaskDescription with all meta information
- static load_task_description_header(path: Path) TaskDescription[source]
Similar to load_meta, but recovers only the header information necessary to construct TaskStruct, without the details regarding folds and sample paths. Useful for large .json files to save time on MML initialisation.
- Parameters:
path (Path) – path to .json file
- Returns:
TaskDescription with only meta information
- reload_task_index() None[source]
Scans self.raw_data and self.preprocessed_data for all available tasks and builds an index of task aliases and their preprocessings. The task index is a dict of dicts with the hierarchy name -> preprocessing -> (relative) path.
- Returns:
None
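The name -> preprocessing -> (relative) path hierarchy can be pictured with a small sketch. The entries, the "raw" key used for unprocessed data, and both helper functions are hypothetical; the lookup mirrors the fallback to raw data described for get_task_info().

```python
from pathlib import Path


def build_task_index(entries):
    """Build a nested dict: task name -> preprocessing ID -> relative path."""
    index = {}
    for name, preprocessing, rel_path in entries:
        index.setdefault(name, {})[preprocessing] = Path(rel_path)
    return index


def lookup(index, name, preprocessing, fallback="raw"):
    """Return the path for a preprocessing, falling back to the raw entry.

    The 'raw' key is an assumption made for this sketch.
    """
    by_preprocessing = index[name]
    return by_preprocessing.get(preprocessing, by_preprocessing[fallback])


# Purely illustrative entries, as a directory scan might produce them.
entries = [
    ("task_a", "raw", "RAW/dset_1/task_a.json"),
    ("task_a", "size224", "PREPROCESSED/size224/dset_1/task_a.json"),
    ("task_b", "raw", "RAW/dset_2/task_b.json"),
]
index = build_task_index(entries)
```

Keeping the preprocessing ID as the second level makes it cheap to ask which preprocessings exist for a task (the keys of index[name]) without touching the file system again.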
- remove_intermediates() None[source]
Deletes intermediate results within this project, based on the clean_up settings of the reuse_config, to minimize disk usage. Other intermediates, such as the checkpoints path and the temp path, are also cleared.
- Returns:
None
- static undo_prefix(dir_name: str) str[source]
Reverts the addition of a task/dset prefix.
- Parameters:
dir_name – string to remove the prefix from
- Returns:
non-prefixed string
- static write_task_description(path: Path, task_description: TaskDescription, omit_warning: bool = False) None[source]
Stores meta information of a task at the given path.
- Parameters:
path (Path) – path to store .json file
task_description (TaskDescription) – TaskDescription of a task
omit_warning (bool) – if True will raise no warning even if the file already exists
- Returns:
None
- class RemoveConfig[source]
Bases: object
Configuration for removing certain artefacts at the end of the run. During runtime this is an OmegaConf DictConfig and may contain additional entries! See remove_intermediates() on how this is used.
- class ReuseConfig[source]
Bases: object
Configuration for reusing certain artefacts from other projects or previous runs. During runtime this is an OmegaConf DictConfig and may contain additional entries! See _find_reusables() on how this is used.