ML Experiment Tracking: Complete Guide to W&B and Hydra
One of the least-taught skills in machine learning is how to manage and track experiments effectively. Once you move beyond beginner-level projects into serious projects or research, experiment tracking and management becomes one of the most crucial parts of your work.
However, no course teaches you how to manage your experiments in depth, so here I am trying to fill that gap and share how I track and manage my experiments effectively across all my projects and Kaggle competitions.
In this post, I would like to share knowledge gained from working on several ML and DL projects, covering:
- Need for experiment tracking
- Conventional ways for experiment tracking and configuration management
- Trackables in a machine learning project
- Experiment tracking using Weights and Biases
- Configuration Management with Hydra
These are the tracking practices I have developed over the years. Some of them might work for you and some won’t, so read the complete article and incorporate the ideas you like most and think would benefit your project.
Let’s get started.
Why do we need experiment tracking?
You might ask: why do we need experiment tracking in the first place?
The simple answer is that, as machine learning practitioners, we invest a significant amount of time and effort in improving our solutions: we iteratively change the model architecture, dataset parameters, evaluation metrics, and so on. You can remember the changes you made for at most 4-5 experiments; after that, it becomes a complete mess. It's hard to recall all the configurations that went into a particular run and which changes improved or hurt your evaluation metrics (unless you are a superhuman :P). Modern deep learning architectures have hundreds of parameters that the user has to define, making them much harder to remember.
Thus, there’s a need for a system that can manage and track all your experiments, so that you can focus on the task at hand instead of remembering and worrying about everything else.
What are the trackables in a machine learning project?
After understanding the importance of tracking experiments, the next question you might have is: what exactly should we track in a machine learning project?
Here are some of the things that you might want to track:
- Parameters: model architecture, hyperparameters
- Jobs/Routines: pre-processing routine, training and validation routine, post-processing routine, etc
- Artifacts: datasets, pre-processed datasets, model checkpoints, etc
- Metrics: training and evaluation loss, metrics
- Model-specific parts: optimizer state, gradients over the course of training, etc
- Metadata: experiment name, type of experiment, artifact locations (e.g. S3 bucket), run summary, configuration, etc
- Code: the state of the code when the experiment was run
- Hardware: type of CPU or GPU instance, CPU/GPU usage, network usage, etc
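To make the list concrete, here is a minimal sketch of what a single run's record might look like if you captured a few of these trackables by hand as a plain Python dictionary; every field name and value below is illustrative rather than taken from a real project:

```python
# Illustrative, hand-rolled "run record" covering a few of the trackables above.
# All names and values are made up for demonstration purposes.
run_record = {
    "metadata": {
        "experiment_name": "0_roberta-base_grad-accum2",
        "run_summary": "baseline with gradient accumulation",
        "artifact_location": "s3://my-bucket/experiments/0/",
    },
    "parameters": {
        "model": "roberta-base",
        "learning_rate": 1e-3,
        "train_batch_size": 16,
        "gradient_accumulation_steps": 2,
    },
    "artifacts": {
        "train_data": "/path/to/my/train.csv",
        "checkpoint": "checkpoints/epoch_3.pt",
    },
    "metrics": {"train_loss": 0.42, "val_loss": 0.47},
    "code": {"git_commit": "abc1234"},
    "hardware": {"gpu": "1x NVIDIA T4"},
}
```

Keeping such records by hand is exactly the kind of manual bookkeeping the rest of this post tries to eliminate.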
Before diving into the core of this post, I would like to discuss the problems with the conventional ways of tracking experiments that people generally follow, so that you can clearly see the benefits of the approaches discussed later.
Experiment tracking - conventional techniques
- Writing down with a pen and paper: The most obvious way is to write the configuration and changes down on paper. However, this approach has clear disadvantages: you might make errors while writing, and finding a particular experiment in a pile of pages is anything but easy. It also involves a lot of manual work, and keeping notes for hundreds of experiments by hand quickly becomes impractical.
- Using Excel or Google Sheets: This is a more structured way of tracking your experiments than the previous one, but it still suffers from the same problem: errors creep in while typing or copy-pasting the exact configurations into a spreadsheet. Although this stores all your experiments' configurations digitally and makes them easily searchable, it involves a lot of manual work to type or copy-paste each configuration and structure it in the spreadsheet, which wastes time and pulls you away from the actual task at hand.
Configuration Management - conventional techniques
Tracking experiments is one part; how you pass the configuration to your codebase is another. Configuration management is one of the most crucial parts of managing and tracking experiments effectively. Across most of your machine learning experiments the code stays almost the same, while the changes are mostly in the model parameters, dataset parameters, and so on.
Since there can be hundreds of parameters that you need to pass into your codebase to run a particular experiment, it's a good idea to separate your configuration (all the tunable and changeable parameters) from the actual code. I have seen many people make the mistake of hard-coding parameters and data paths in the code itself. While this might work for smaller projects, it becomes a complete mess as your project and team grow. Your code should be portable to any system with minimal changes.
There are many ways of managing configuration so that it is separated from the actual code, which lets you easily change parameters to run different experiments.
Here I discuss some of the most commonly used ways of handling configuration in a project, the problems associated with them, and how those problems can be solved by Hydra, an open-source package from Meta Research.
Argparse
One of the most common ways to pass configuration to your codebase is with the built-in Python module `argparse`. This module makes it easy to write user-friendly command-line interfaces. Argparse works well for passing configuration via the command line in small projects; however, it becomes quite complex and hard to maintain as the project grows, and writing out all the arguments on the command line to run an experiment quickly becomes a pain. For instance, see the code snippet below, taken directly from the PyTorch ImageNet Training Example. Although the example is quite minimal, the number of command-line flags is already high. Ideally, some of these arguments would be grouped together logically, but there's no easy and quick way to do that.
Snippet from PyTorch ImageNet Training Example
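In case the original snippet doesn't render here, the sketch below gives an abridged, hypothetical flavor of that kind of argparse setup; the flag names are typical of image-classification training scripts rather than copied verbatim from the PyTorch example:

```python
import argparse

# Sketch of an argparse-based configuration for an image-classification
# training script. The flags pile up quickly, and argparse offers no
# built-in way to group them (data vs. model vs. optimizer) or to save
# and reuse a complete experiment preset.
parser = argparse.ArgumentParser(description="Image classification training (sketch)")
parser.add_argument("data", metavar="DIR", help="path to dataset")
parser.add_argument("--arch", default="resnet18", type=str, help="model architecture")
parser.add_argument("--epochs", default=90, type=int, help="number of total epochs to run")
parser.add_argument("--batch-size", default=256, type=int, help="mini-batch size")
parser.add_argument("--lr", default=0.1, type=float, help="initial learning rate")
parser.add_argument("--momentum", default=0.9, type=float, help="momentum")
parser.add_argument("--weight-decay", default=1e-4, type=float, help="weight decay")
parser.add_argument("--workers", default=4, type=int, help="number of data loading workers")
parser.add_argument("--resume", default="", type=str, help="path to latest checkpoint")
parser.add_argument("--evaluate", action="store_true", help="evaluate model on validation set")

args = parser.parse_args()
```

Running an experiment then means spelling out every flag on the command line, e.g. something like `python train.py /data/imagenet --arch resnet50 --lr 0.01 --batch-size 128`, and you soon end up maintaining a bash script per experiment.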
YAML files
Another commonly used way to specify all your run arguments is to write them in a YAML file. If you don't know about YAML, it's just a human-readable data serialization language. To learn more about YAML you can refer to this article.
Here, you can instantly see some advantages compared to the argparse-based configuration management above. The first is that we don’t need to type all the configuration again and again on the command line or repeatedly write bash scripts to run different experiment settings. Another is that we can group similar kinds of items together.
For instance, see the snippet given below: all the data-related stuff goes into the `data` key, all the model-related stuff goes into the `model` key, and so on. Furthermore, we can even nest keys: `model` holds all the model-related settings, and within it we have separate keys for the `encoder` and `decoder`.
You can define pretty much any data structure within YAML files, and when read into Python it’s just a dictionary of dictionaries (nested dictionaries). You can refer to this article to learn how to use the different data types in YAML files.
```yaml
data:
  train_path: '/path/to/my/train.csv'
  valid_path: '/path/to/my/valid.csv'

model:
  encoder:
    dimension: 256
    dropout: 0.2
  decoder:
    dimension: 512
    dropout: 0.1

trainer:
  learning_rate: 1e-3
  train_batch_size: 16
  val_batch_size: 16
```
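Reading such a file into Python takes just a couple of lines. Below is a minimal sketch using the PyYAML package, assuming the snippet above is saved as `config.yaml` (the filename is illustrative):

```python
import yaml  # PyYAML: pip install pyyaml

# Load the YAML file into nested Python dictionaries.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(config["model"]["encoder"]["dimension"])  # 256
print(config["trainer"]["train_batch_size"])    # 16

# Caveat: PyYAML's default resolver reads 1e-3 as the string '1e-3';
# write it as 1.0e-3 if you want it parsed as a float.
```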
However, this is not the end of the story for YAML files. Suppose you want to run lots of experiments and need a way to remember or save the YAML file used for each one. One way to track all the configuration files is to name your experiments meaningfully and name each YAML file after its experiment.
For example, let’s say you are training transformer-based models for some NLP problem; then you might name the YAML file for a particular run `experiment-number_model-name_short-run-description.yaml`.
Suppose I am training a roberta-base model with gradient accumulation steps of 2; then I can name the experiment file `0_roberta-base_grad-accum2.yaml`. This might seem like a good approach at first, but you end up with a pile of YAML files, and most of them become useless as you continuously update your code. You can’t keep updating old configuration files to stay in sync with your current codebase. Also, it’s often hard to tell these files apart, since most of them look alike in name and content, and it’s hard to determine the change associated with each of them.
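In practice, this approach is usually wired up with a single command-line flag that points at the chosen file. Here is a sketch of that glue code, with illustrative paths and defaults:

```python
import argparse

import yaml  # PyYAML

# Sketch: select one of the per-experiment YAML files from the command line.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--config",
    default="configs/0_roberta-base_grad-accum2.yaml",
    help="path to the experiment's YAML configuration file",
)
args = parser.parse_args()

with open(args.config) as f:
    config = yaml.safe_load(f)

print(f"Loaded configuration from {args.config}")
print(config["trainer"])
```

Every new experiment then means copying a previous YAML file, tweaking a few values, and giving it a new name, which is exactly how the pile of near-identical files builds up.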
This was the approach I used in the past, and honestly, I faced a lot of problems with it. But luckily for us, there's a better and cleaner way to deal with configuration files, and that is Hydra, an open-source framework by Meta Research (formerly Facebook Research) that aims to simplify configuring complex applications.
First, let’s see how configuration can be managed elegantly with the help of Hydra, and then we will see how the experiments can be tracked and managed with Weights and Biases.