Hydra parameters

These are all the supported Hydra arguments that can be modified via configuration files or overridden directly in the CLI.

CLI usage

You can override any argument by using the ++arg_name=arg_value syntax in the command line.

python -m geoarches.main_hydra ++arg_name=arg_value

Note

If you only need to remember two arguments, the most important are:

  1. mode selects between training (mode=train) and evaluation (mode=test).
  2. name is a unique identifier for your run. Make it meaningful (and readable)!
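
As an illustration of how several overrides combine into one command, the sketch below flattens a (possibly nested) dictionary into ++key=value strings. The helper to_overrides and the run name my-first-run are hypothetical and not part of geoarches. In Hydra, the ++ prefix overrides a key if it already exists in the config and adds it otherwise.

```python
import shlex


def to_overrides(params, prefix=""):
    """Flatten a (possibly nested) dict into Hydra-style ++key=value strings.

    Hypothetical helper for illustration; not part of geoarches.
    """
    overrides = []
    for key, value in params.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested groups become dotted keys, e.g. cluster.gpus.
            overrides.extend(to_overrides(value, prefix=f"{dotted}."))
        else:
            overrides.append(f"++{dotted}={value}")
    return overrides


# Example values only: evaluate a run named "my-first-run" on one GPU.
params = {"mode": "test", "name": "my-first-run", "cluster": {"gpus": 1}}
command = ["python", "-m", "geoarches.main_hydra", *to_overrides(params)]
print(shlex.join(command))
```

The printed line can be pasted into a shell. Values containing spaces or special characters would need quoting that this sketch leaves to shlex.join.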

Pipeline

These arguments are used to configure the training or evaluation pipeline.

| arg_name | Default value | Description |
| --- | --- | --- |
| mode | 'train' | train launches training (i.e. Trainer.fit()); test launches evaluation (i.e. Trainer.test()). |
| accumulate_grad_batches | 1 | Number of batches to accumulate before stepping the optimizer (see Lightning API). |
| batch_size | 1 | Batch size for train, validation and test dataloaders. |
| limit_train_batches, limit_val_batches, limit_test_batches | 1.0, Optional | Limit the number of batches loaded in dataloaders (see Lightning API). |
| log_freq | 100 | How often to log metrics (in steps). |
| max_steps | 300000 | Maximum number of training steps. |
| seed | 0 | Seed used for reproducibility. Set via L.seed_everything(seed). |
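
To see how batch_size and accumulate_grad_batches interact, the sketch below computes how many samples contribute to one optimizer step. It assumes data-parallel training in which every GPU loads its own batch. The helper name effective_batch_size and the multiplication by the number of GPUs are illustrative assumptions, not geoarches code.

```python
def effective_batch_size(batch_size=1, accumulate_grad_batches=1, gpus=1):
    """Samples contributing to one optimizer step (hypothetical helper).

    Assumes data-parallel training: each GPU loads its own batch of
    `batch_size` samples. With gpus=0 (CPU-only), one process is assumed.
    """
    num_processes = max(gpus, 1)
    return batch_size * accumulate_grad_batches * num_processes


# Default values: one sample per optimizer step.
print(effective_batch_size())
# Per-GPU batch of 2, accumulated over 8 batches, on 4 GPUs.
print(effective_batch_size(batch_size=2, accumulate_grad_batches=8, gpus=4))
```

Under these assumptions, raising accumulate_grad_batches is a way to reach a larger effective batch when a bigger batch_size does not fit in GPU memory.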

Checkpointing

These arguments are used to configure how checkpoints are saved and loaded during training or evaluation.

| arg_name | Default value | Description |
| --- | --- | --- |
| name | 'default-run' | A unique ID for your run. Used in checkpointing, W&B logging, etc. We recommend changing it for each new run. |
| exp_dir | 'modelstore/${name}' | Directory for saving/loading checkpoints and configs. Training resumes if a run already exists there. Evaluation loads the checkpoint and config. By default, the latest checkpoint is loaded, unless ckpt_filename_match is specified. We strongly advise against changing this argument; change name for each new run instead. |
| resume | True | Whether to resume training from a checkpoint when mode=train. If no checkpoint exists, a new run starts. If set to False, a new run always starts. |
| ckpt_filename_match | Optional | If specified, loads the checkpoint file whose name contains this substring. If multiple matches are found, the latest checkpoint is loaded. Not compatible with load_ckpt. |
| load_ckpt | Optional | Path to a PyTorch Lightning checkpoint file to load for evaluation or inference only (does not resume training). Not compatible with ckpt_filename_match. |
| save_step_frequency | 50000 | How often to save checkpoints (in steps). |
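
The selection rules described for exp_dir, ckpt_filename_match and load_ckpt can be sketched as follows. This is an illustration of the documented behaviour, not the geoarches implementation. The function select_checkpoint is hypothetical, and it assumes that checkpoints are *.ckpt files under exp_dir and that "latest" means most recently modified.

```python
import os
import tempfile
from pathlib import Path


def select_checkpoint(exp_dir, ckpt_filename_match=None, load_ckpt=None):
    """Pick a checkpoint following the documented rules (illustrative only)."""
    if ckpt_filename_match is not None and load_ckpt is not None:
        raise ValueError("ckpt_filename_match and load_ckpt are not compatible")
    if load_ckpt is not None:
        # An explicit path is used as-is.
        return Path(load_ckpt)
    # Assumption: checkpoints are *.ckpt files somewhere under exp_dir.
    candidates = list(Path(exp_dir).rglob("*.ckpt"))
    if ckpt_filename_match is not None:
        candidates = [p for p in candidates if ckpt_filename_match in p.name]
    if not candidates:
        return None  # nothing to load: training would start a new run
    # Assumption: "latest" means most recently modified.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo on a throwaway directory with two fake checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for age, filename in enumerate(["step_100000.ckpt", "step_50000.ckpt"]):
        path = Path(tmp) / filename
        path.touch()
        # Give the second file an older modification time than the first.
        os.utime(path, (1_000_000 - age, 1_000_000 - age))
    latest = select_checkpoint(tmp).name
    matched = select_checkpoint(tmp, ckpt_filename_match="50000").name

print(latest, matched)
```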

Logging

These arguments are used to configure experiment logging.

Warning

Only W&B logging is currently supported. See User Guide for more information.

| arg_name | Default value | Description |
| --- | --- | --- |
| log | False | Whether to enable logging. Set to True to log metrics. |
| cluster.wandb_mode | 'offline' | online logs directly to W&B and requires an internet connection; offline saves locally and needs manual syncing. |
| entity | Optional | W&B entity, usually your username or team. |
| project | Optional | W&B project name. By default, inferred from '${module.project}'. |

Module and backbones arguments

These arguments define your model configuration, including Lightning modules and backbones. For a comprehensive list and detailed documentation, refer to the source code and class docstrings. To get started, you can review existing configuration files in configs/module/.

Dataloader arguments

These arguments define your dataloader configuration. For a comprehensive list and detailed documentation, refer to the source code and class docstrings. To get started, you can review existing configuration files in configs/dataloader/.

Cluster arguments

These arguments are used to configure the cluster environment for training or evaluation. They are typically set in the configs/cluster/ directory.

| arg_name | Default value | Description |
| --- | --- | --- |
| cluster.cpus | 1 | Number of CPUs to use. Used for dataloader multi-threading. |
| cluster.gpus | 1 | Number of GPUs to use. Set to 0 for CPU-only training. |
| cluster.precision | '16-mixed' | Lightning precision. |
| cluster.use_custom_requeue | False | Set to True to handle premature job preemption on the compute node. Before exiting, the job saves a checkpoint and is requeued. |

We use submitit to submit and manage jobs on a SLURM cluster. If you need specific SLURM options, use launcher.arg_name in your configuration file. Here is a small curated list of the most commonly used arguments:

| arg_name | Default value | Description |
| --- | --- | --- |
| launcher.cpus_per_task | 1 | Number of CPUs to use. Used for dataloader multi-threading. |
| launcher.gpus_per_node | 1 | Number of GPUs to use. |
| launcher.nodes | 1 | Number of nodes to use. |
| launcher.tasks_per_node | 1 | Number of tasks per node. |
| launcher.timeout_min | 60 | Maximum duration of the job in minutes. Used for SLURM --time option. |
| launcher.slurm_signal_delay_s | 60 | Delay before exiting after receiving a signal. Used for requeuing jobs. |
| launcher.slurm_additional_parameters | dict, Optional | Additional SLURM parameters to pass, e.g. hint=nomultithread. |
| launcher.slurm_setup | array, Optional | Additional commands to pass before srun, e.g. module loading. |
| launcher.slurm_srun_args | array, Optional | Additional arguments to pass to srun. |
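
When sizing a job, it can help to work out what a given combination of launcher arguments requests in total. The sketch below does this arithmetic under the usual SLURM convention that CPUs are allocated per task and GPUs per node. The LauncherRequest class is hypothetical and belongs to neither geoarches nor submitit; its defaults simply mirror the table above.

```python
from dataclasses import dataclass


@dataclass
class LauncherRequest:
    """Hypothetical container mirroring a few launcher.* arguments."""

    cpus_per_task: int = 1
    gpus_per_node: int = 1
    nodes: int = 1
    tasks_per_node: int = 1
    timeout_min: int = 60

    @property
    def total_tasks(self):
        # Tasks are requested per node.
        return self.nodes * self.tasks_per_node

    @property
    def total_cpus(self):
        # CPUs are requested per task.
        return self.total_tasks * self.cpus_per_task

    @property
    def total_gpus(self):
        # GPUs are requested per node.
        return self.nodes * self.gpus_per_node

    @property
    def max_gpu_hours(self):
        # Upper bound on GPU time if the job runs until the timeout.
        return self.total_gpus * self.timeout_min / 60


# Example values only: 2 nodes with 4 GPUs each, one task per GPU.
request = LauncherRequest(
    cpus_per_task=8, gpus_per_node=4, nodes=2, tasks_per_node=4, timeout_min=120
)
print(request.total_tasks, request.total_cpus, request.total_gpus, request.max_gpu_hours)
```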