Experiments and Runs
Defining an Experiment
In the DRYTorch framework, an experiment is a fully reproducible execution of code defined entirely by a configuration file. For example, this design implies that:
a result obtained by modifying the configuration file (e.g., changing the optimizer) constitutes a new experiment instance.
a parameter sweep (or grid search), when fully described within the configuration file, is considered a single experiment.
To define an experiment, subclass the DRYTorch’s Experiment class specifying the required specification.
The Experiment class needs a name unique for each instance and accepts tags and a directory for logging as optional arguments, which is communicated to trackers.
! uv pip install drytorch
import dataclasses
from drytorch import Experiment as GenericExperiment
@dataclasses.dataclass(frozen=True)
class SimpleConfig:
"""A simple configuration."""
batch_size: int
class MyExperiment(GenericExperiment[SimpleConfig]):
"""Class for Simple Experiments."""
my_config = SimpleConfig(32)
first_experiment = MyExperiment(
my_config,
name='FirstExp',
par_dir='experiments/',
tags=[],
)
Starting a Run
In the DRYTorch framework, a run is a single execution instance of an experiment’s code. Multiple runs of the same experiment are used to replicate and validate results, often using different seeds for the pseudo number generator. There can only be an active run at once.
You initiate a run instance using the Experiment.create_run method. This instance serves as a context manager for the experiment’s execution code.
The run’s ID is a timestamp by default, but you can specify a unique, descriptive name.
You can resume a run by specifying its name in create_run. If a name is not provided, DRYTorch attempts to resume the last recorded run.
Note: DRYTorch maintains a run registry on the local disk to track and manage all run IDs and states. It also attempts to record the last commit hash when git is available.
def implement_experiment() -> None:
"""Here should be the code for the experiment."""
with first_experiment.create_run() as run:
first_id = run.id
implement_experiment()
with first_experiment.create_run(resume=True) as run:
second_id = run.id
implement_experiment()
if first_id != second_id:
raise AssertionError('The resumed run should keep the id.')
Notebooks
For convenience, especially in interactive environments like notebooks, you can directly start and stop a run, avoiding the context manager.
Alternatively, you can use the Run.start and Run.stop methods directly.
To do this, use the Run.start and method and ensure you explicitly call Run.stop when finished.
It is recommended to stop the run explicitly, otherwise DRYTorch will attempt to clean up the metadata and exit gracefully when the Python session terminates.
run = first_experiment.create_run()
run.start()
run.stop()
Global configuration
It is possible to access the configuration file directly from the Experiment class when a run is on. If no experiment is running, the operation will throw an exception.
from drytorch.core import exceptions
def get_batch() -> int:
"""Retrieve the batch size setting."""
return MyExperiment.get_config().batch_size
with first_experiment.create_run():
get_batch()
try:
get_batch()
except (exceptions.AccessOutsideScopeError, exceptions.NoActiveExperimentError):
pass
else:
raise AssertionError('Configuration accessed when no run is on.')
Registration
Register model
DRYTorch discourages information leakage between runs to ensure reproducibility.
The framework explicitly prevents the construction of a Model instance based on a module registered in a previous run.
This isolation ensures that each run starts from a clean state defined solely by its configuration.
The registration to the current run happens during the Model instantiation. If no experiment is running, the Model class will not be instantiated.
To use the same module, you must first unregister it.
from torch import nn
from drytorch import Model
from drytorch.core import exceptions, registering
second_experiment = MyExperiment(
my_config,
name='SecondExp',
par_dir='experiments/',
tags=[],
)
module = nn.Linear(1, 1)
with first_experiment.create_run():
first_model = Model(module)
try:
second_model = Model(module)
except exceptions.NoActiveExperimentError:
pass
else:
raise AssertionError('Model instantiated when no experiment is running.')
with second_experiment.create_run():
try:
second_model = Model(module)
except exceptions.ModuleAlreadyRegisteredError:
pass
else:
raise AssertionError('Module registered through two Model instances.')
with second_experiment.create_run():
registering.unregister_model(first_model)
second_model = Model(first_model.module)
Register actor
An actor is an object, like a trainer or a test class, that acts upon a model or produces logging from it.
Registration checks that the model and the actor belong to the same experiment. Actors from the library implementation register themselves when called.
import torch
from torch.utils.data import Dataset
from typing_extensions import override
from drytorch.lib.loading import DataLoader
from drytorch.lib.runners import ModelRunner
class MyDataset(Dataset[tuple[torch.Tensor, torch.Tensor]]):
"""Example dataset containing tensor with value one."""
def __init__(self) -> None:
"""Initialize some dummy attributes."""
super().__init__()
self.empty_container = []
self.none = None
def __len__(self) -> int:
"""Size of the dataset."""
return 1
@override
def __getitem__(self, index) -> tuple[torch.Tensor, torch.Tensor]:
return torch.ones(1), torch.ones(1)
one_dataset: Dataset[tuple[torch.Tensor, torch.Tensor]] = MyDataset()
with second_experiment.create_run(resume=True): # correctly resuming run
loader = DataLoader(one_dataset, batch_size=1)
model_caller = ModelRunner(second_model, loader=loader)
model_caller()
with second_experiment.create_run(): # new run
loader = DataLoader(one_dataset, batch_size=1)
model_caller = ModelRunner(second_model, loader=loader)
try:
model_caller()
except exceptions.ModuleNotRegisteredError:
pass
else:
raise AssertionError('Model not registered in the current run')
Metadata Extraction
DRYTorch automatically documents the models and actors during registration by extracting a readable representation at runtime. The metadata is then handled by the tracker. By default, when PyAML 6.0 or later is installed, metadata is dumped in YAML format.
To better visualize it, we create an adhoc tracker for this tutorial.
import functools
import pprint
from drytorch.core import log_events
from drytorch.core.tracking import Tracker
class MetadataVisualizer(Tracker):
"""Tracker that prints the metadata on the console."""
@functools.singledispatchmethod
@override
def notify(self, event: log_events.Event) -> None:
return super().notify(event)
@notify.register
def _(self, event: log_events.ModelRegistrationEvent) -> None:
pprint.pp(event.architecture_repr)
return super().notify(event)
@notify.register
def _(self, event: log_events.ActorRegistrationEvent) -> None:
pprint.pp(event.metadata)
return super().notify(event)
third_experiment = MyExperiment(
my_config,
name='ThirdExp',
par_dir='experiments/',
tags=[],
)
third_experiment.trackers.subscribe(MetadataVisualizer())
Model metadata
The readable representation of a Model is simply the native
representation of the wrapped nn.Module.
with third_experiment.create_run(): # correctly resuming run
third_model = Model(nn.Linear(1, 1))
'Linear(in_features=1, out_features=1, bias=True)'
Actor Metadata
The readable representation of an actor not only documents the actor object but all the public attributes recursively.
Private attributes, that is, attributes that start with an underscore, are ignored, and so are attributes that have not been initialized (evaluating to None, or an empty container).
with third_experiment.create_run(resume=True): # correctly resuming run
loader = DataLoader(one_dataset, batch_size=1)
model_caller = ModelRunner(third_model, loader=loader)
model_caller()
{'class': 'ModelRunner',
'loader': {'class': 'DataLoader',
'batch_size': 1,
'dataset': 'MyDataset',
'dataset_len': 1,
'sampler': {'class': 'RandomSampler',
'data_source': 'range(0, 1)',
'replacement': False}},
'model': {'class': 'Model',
'checkpoint': 'LocalCheckpoint',
'epoch': 0,
'exec_module': {'class': 'Linear',
'in_features': 1,
'out_features': 1,
'training': True},
'mixed_precision': False}}