Hydra parameters
These are all the supported Hydra arguments that can be modified via configuration files or overridden directly in the CLI.
CLI usage
You can override any argument by using the ++arg_name=arg_value syntax in the command line.
python -m geoarches.main_hydra ++arg_name=arg_value
Note
If you only need to remember two arguments, the most important are:
modeselects between training (mode=train) and evaluation (mode=test).nameis a unique identifier for your run, make it meaningful (and readable)!
Pipeline
These arguments are used to configure the training or evaluation pipeline.
| arg_name | Default value | Description |
|---|---|---|
mode |
'train' |
train launches training (i.e. LightningModule.fit())test launches evaluation (i.e. LightningModule.test()) |
accumulate_grad_batches |
1 | Number of batches to accumulate before stepping the optimizer (see Lightning API). |
batch_size |
1 | Batch size for train, validation and test dataloaders. |
limit_train_batcheslimit_val_batcheslimit_test_batches |
1.0, Optional | Limit the number of batches loaded in dataloaders (see Lightning API). |
log_freq |
100 | How often to log metrics (in steps). |
max_steps |
300000 | Maximum number of training steps. |
seed |
0 | Seed used for reproducibility. Set via L.seed_everything(seed). |
Checkpointing
These arguments are used to configure how checkpoints are saved and loaded during training or evaluation.
| arg_name | Default value | Description |
|---|---|---|
name |
'default-run' |
A unique ID for your run. Used in checkpointing, W&B logging, etc. We recommend to change it for each new run. |
exp_dir |
'modelstore/${name}' |
Directory for saving/loading checkpoints and configs. Training will resume if a run exist. Evaluation will load checkpoint and config. By default, loads the latest checkpoint, unless ckpt_filename_match is specified. We strongly advise to not change this argument and instead change name for each new run. |
resume |
True |
Whether to resume training from a checkpoint when mode=train. If checkpoint does not exist, a new run will start. If set to False, a new run will always start. |
ckpt_filename_match |
Optional | If specified, loads the checkpoint file whose name contains this substring. If multiple matches are found, the latest checkpoint will be loaded. Not compatible with load_ckpt. |
load_ckpt |
Optional | Path to a PyTorch Lightning checkpoint file to load for evaluation or inference only (does not resume training). Not compatible with ckpt_filename_match. |
save_step_frequency |
50000 | How often to save checkpoints (in steps). |
Logging
These arguments are used to configure experiment logging.
Warning
Currently only supports W&B logging. See User Guide for more information.
| arg_name | Default value | Description |
|---|---|---|
log |
False |
Whether to enable logging. Set to True to log metrics. |
cluster.wandb_mode |
'offline' |
online logs directly to W&B and requires internet connection.offline saves locally and needs manual syncing. |
entity |
Optional | W&B entity, usually your username or team. |
project |
Optional | W&B project name. By default, inferred from '${module.project}'. |
Module and backbones arguments
These arguments define your model configuration, including Lightning modules and backbones. For a comprehensive list and detailed documentation, refer to the source code and class docstrings. To get started, you can review existing configuration files in configs/module/.
Dataloader arguments
These arguments define your dataloader configuration. For a comprehensive list and detailed documentation, refer to the source code and class docstrings. To get started, you can review existing configuration files in configs/dataloader/.
Cluster arguments
These arguments are used to configure the cluster environment for training or evaluation. They are typically set in the configs/cluster/ directory.
| arg_name | Default value | Description |
|---|---|---|
cluster.cpus |
1 | Number of CPUs to use. Used for dataloader multi-threading. |
cluster.gpus |
1 | Number of GPUs to use. Set to 0 for CPU-only training. |
cluster.precision |
'16-mixed' | Lightning precision |
cluster.use_custom_requeue |
False |
Set True to handle job prematurely prempting on computing node. Before exiting, it will save checkpoint and re-enqueue node. |
We use submitit to submit and manage jobs on a SLURM cluster. If you need specific SLURM options, use launcher.arg_name in your configuration file. Here is a small curated list of the most commonly used arguments:
| arg_name | Default value | Description |
|---|---|---|
launcher.cpus_per_task |
1 | Number of CPUs to use. Used for dataloader multi-threading. |
launcher.gpus_per_node |
1 | Number of GPUs to use. |
launcher.nodes |
1 | Number of nodes to use. |
launcher.tasks_per_node |
1 | Number of tasks per node. |
launcher.timeout_min |
60 | Maximum duration of the job in minutes. Used for SLURM --time option. |
launcher.slurm_signal_delay_s |
60 | Delay before exiting after receiving a signal. Used for requeuing jobs. |
launcher.slurm_additional_parameters |
dict, Optional | Additional SLURM parameters to pass, e.g. hint=nomultithread. |
launcher.slurm_setup |
array, Optional | Additional commands to pass before srun, e.g. module loading. |
launcher.slurm_srun_args |
array, Optional | Additional arguments to pass to srun. |