Experiment Tracking with DVC

Leveraging DVC for Streamlined ML Experimentation

[Image: DVC Experiments View]

In the realm of machine learning, managing and tracking experiments can quickly become a complex task, especially as projects scale. Tools like DVC (Data Version Control) [1] allow developers and data scientists to automate and streamline this process. Through a practical example involving a Random Forest Classifier trained on the Iris dataset, we'll explore how DVC facilitates a seamless experiment tracking system.


Why Track Experiments?

Imagine you're working on a machine learning project. As your experiments grow in number, so does the difficulty of tracking each one's parameters, metrics, and outcomes. Without proper management, this complexity can lead to significant overhead and confusion, potentially stalling project progress.

DVC offers a solution to this problem by integrating code and data version control, making it simpler to reproduce experiments and track changes across project iterations. The dvclive module, in particular, enables automatic metric logging during model training, which is crucial for comparing experiments and understanding model performance over time.

Basic Tracking with DVCLive

DVCLive [2] is a lightweight Python library that integrates seamlessly with DVC to log metrics during model training. By adding a few lines of code to your training script, you can automatically capture key metrics such as accuracy, precision, and recall, and visualize them in the DVC Experiments View.
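Before diving into the full script, here is the core logging pattern in isolation. This is a minimal sketch, not taken from the example below: it assumes an iterative training loop with a placeholder loss value and uses DVCLive's log_param, log_metric, and next_step calls:

from dvclive import Live

# Minimal DVCLive sketch: log a hyperparameter once, then one metric per step.
with Live() as live:
    live.log_param("epochs", 3)  # record a hyperparameter
    for epoch in range(3):
        train_loss = 1.0 / (epoch + 1)  # placeholder standing in for a real training loss
        live.log_metric("train/loss", train_loss)
        live.next_step()  # advance the step counter so metrics can be plotted per step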

app.py
from dvclive import Live
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
 
 
def evaluate(model: RandomForestClassifier, X, y, split: str, live: Live):
    y_pred = model.predict(X)
    live.log_metric(f"{split}/accuracy", accuracy_score(y, y_pred))
    live.log_metric(
        f"{split}/average_precision",
        # average_precision_score expects binary label indicators, so the
        # multiclass labels are one-hot encoded before scoring
        average_precision_score(
            label_binarize(y, classes=model.classes_),
            model.predict_proba(X),
            average="macro",
        ),
    )
    live.log_metric(f"{split}/recall", recall_score(y, y_pred, average="macro"))
 
 
X, y = datasets.load_iris(as_frame=True, return_X_y=True)
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
with Live("experiments") as live:
    model = RandomForestClassifier(n_estimators=5, max_depth=1, random_state=42)
    model.fit(X_train, y_train)
    evaluate(model, X_train, y_train, "train", live)
    evaluate(model, X_test, y_test, "test", live)

We're training a RandomForestClassifier on the Iris dataset, evaluating its performance, and logging the results using the dvclive library. Key components of the code include:

  • Loading the Iris dataset: Utilizes sklearn.datasets.load_iris to fetch the Iris dataset, which is divided into training and testing sets
  • Model Training: A RandomForestClassifier is instantiated with specific hyperparameters (n_estimators=5, max_depth=1) and trained on the training data
  • Evaluation and Logging: The evaluate function is called for both training and test sets. It logs metrics (accuracy, average precision, and recall) using the dvclive library
  • dvclive Integration: The Live context manager from dvclive is used to initialize a logging environment, which automatically tracks and saves the evaluation metrics

You can run this script using the command python app.py and view the results with:

dvc exp show --md

| Experiment | train.accuracy | train.average_precision | train.recall | test.accuracy | test.average_precision | test.recall |
|---|---|---|---|---|---|---|
| workspace | 0.675 | 0.66667 | 0.66667 | 0.63333 | 0.66667 | 0.66667 |
| master | - | - | - | - | - | - |
| └── c3b1f3b [bitty-walk] | 0.675 | 0.66667 | 0.66667 | 0.63333 | 0.66667 | 0.66667 |

The table contains our first experiment's results, showing the accuracy, average precision, and recall scores for both the training and test sets.
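Under the hood, DVCLive persists these metrics as plain files that DVC reads when building the table above. As a quick sanity check, here is a small sketch that prints the logged values. It assumes DVCLive's default layout, in which Live("experiments") writes a summary file to experiments/metrics.json:

import json
from pathlib import Path

# Assumes DVCLive's default layout: the summary written by Live("experiments")
# ends up in experiments/metrics.json after the `with` block exits.
metrics = json.loads(Path("experiments/metrics.json").read_text())
print(json.dumps(metrics, indent=2))  # e.g. {"train": {"accuracy": ...}, "test": {...}}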

Let's run another experiment by changing the model's hyperparameters in app.py, for example increasing the number and depth of the trees:

model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=42)

Let's look at the results:

dvc exp show --md

| Experiment | train.accuracy | train.average_precision | train.recall | test.accuracy | test.average_precision | test.recall |
|---|---|---|---|---|---|---|
| workspace | 0.95 | 0.91136 | 0.94913 | 1 | 1 | 1 |
| master | - | - | - | - | - | - |
| ├── 7397a99 [irony-awns] | 0.95 | 0.91136 | 0.94913 | 1 | 1 | 1 |
| └── c3b1f3b [bitty-walk] | 0.675 | 0.66667 | 0.66667 | 0.63333 | 0.66667 | 0.66667 |

Our second experiment, with different hyperparameters, shows improved metrics. You can commit these changes to git.

Create a Pipeline

To streamline the experiment tracking process, we can create a DVC pipeline that automates the training and evaluation steps. Another great feature of pipelines is that they allow you to run multiple experiments with different hyperparameters in a single command. This is particularly useful when performing hyperparameter tuning or grid search.

Let's start by creating a params.yaml file to store our model hyperparameters:

params.yaml
model:
  n_estimators: 10
  max_depth: 2

Next, we'll modify our training script to read these hyperparameters from the file:

app.py
from pathlib import Path
 
from box import ConfigBox
from dvclive import Live
from ruamel.yaml import YAML
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
 
config = ConfigBox(YAML(typ="safe").load(Path("params.yaml").open(encoding="utf-8")))
 
 
def evaluate(model: RandomForestClassifier, X, y, split: str, live: Live):
    y_pred = model.predict(X)
    live.log_metric(f"{split}/accuracy", accuracy_score(y, y_pred))
    live.log_metric(
        f"{split}/average_precision",
        # average_precision_score expects binary label indicators, so the
        # multiclass labels are one-hot encoded before scoring
        average_precision_score(
            label_binarize(y, classes=model.classes_),
            model.predict_proba(X),
            average="macro",
        ),
    )
    live.log_metric(f"{split}/recall", recall_score(y, y_pred, average="macro"))
 
 
X, y = datasets.load_iris(as_frame=True, return_X_y=True)
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
with Live("experiments") as live:
    model = RandomForestClassifier(
        n_estimators=config.model.n_estimators,
        max_depth=config.model.max_depth,
        random_state=42,
    )
    model.fit(X_train, y_train)
    evaluate(model, X_train, y_train, "train", live)
    evaluate(model, X_test, y_test, "test", live)
 

The script now reads the hyperparameters from the params.yaml file and uses them to train the model. To run the pipeline, we need to define a stage in a dvc.yaml file:

dvc.yaml
stages:
  train_model:
    cmd: python app.py
    deps:
      - app.py
    params:
      - model

This stage runs the app.py script, which trains the model using the hyperparameters specified in params.yaml. To execute the pipeline, run:

dvc exp run

| Experiment | train.accuracy | train.average_precision | train.recall | test.accuracy | test.average_precision | test.recall | model.n_estimators | model.max_depth |
|---|---|---|---|---|---|---|---|---|
| workspace | 0.95 | 0.98784 | 0.94955 | 1 | 1 | 1 | 10 | 2 |
| master | - | - | - | - | - | - | 10 | 2 |
| └── 04301de [funny-cyma] | 0.95 | 0.98784 | 0.94955 | 1 | 1 | 1 | 10 | 2 |

Note that the model hyperparameters are now stored in the experiment metadata, making it easier to track and compare different experiments. Let's try to run another experiment with different hyperparameters:

dvc exp run -S model.n_estimators=30 -S model.max_depth=3

| Experiment | train.accuracy | train.average_precision | train.recall | test.accuracy | test.average_precision | test.recall | model.n_estimators | model.max_depth |
|---|---|---|---|---|---|---|---|---|
| workspace | 0.96667 | 0.9972 | 0.96665 | 1 | 1 | 1 | 30 | 3 |
| master | - | - | - | - | - | - | 10 | 2 |
| ├── 83c1c6b [choky-door] | 0.96667 | 0.9972 | 0.96665 | 1 | 1 | 1 | 30 | 3 |
| └── 04301de [funny-cyma] | 0.95 | 0.98784 | 0.94955 | 1 | 1 | 1 | 10 | 2 |

Grid Search

Another powerful feature of DVC is the ability to run grid search experiments. With the --queue flag, you can queue up multiple experiments and then execute them all in one go. Here's an example:

dvc exp run --queue -S model.n_estimators=8,16,32 -S model.max_depth=2,3,5

This command will queue nine experiments with different combinations of n_estimators and max_depth. To run all the queued experiments, use:

dvc exp run --run-all
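
If you prefer to build the grid programmatically, for example to skip invalid combinations, the same queueing can be driven from Python. This sketch simply shells out to the dvc commands shown above, using the same parameter values as the grid in this section:

import itertools
import subprocess

# Queue one experiment per (n_estimators, max_depth) combination,
# mirroring `dvc exp run --queue -S ...` from above.
for n_estimators, max_depth in itertools.product([8, 16, 32], [2, 3, 5]):
    subprocess.run(
        [
            "dvc", "exp", "run", "--queue",
            "-S", f"model.n_estimators={n_estimators}",
            "-S", f"model.max_depth={max_depth}",
        ],
        check=True,
    )

# Execute everything in the queue.
subprocess.run(["dvc", "exp", "run", "--run-all"], check=True)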

Here's a summary of the grid search results:

| Experiment | train.accuracy | train.average_precision | train.recall | test.accuracy | test.average_precision | test.recall | model.n_estimators | model.max_depth |
|---|---|---|---|---|---|---|---|---|
| workspace | 0.96667 | 0.9972 | 0.96665 | 1 | 1 | 1 | 30 | 3 |
| master | - | - | - | - | - | - | 10 | 2 |
| ├── d37b551 [above-play] | 1 | 1 | 1 | 1 | 1 | 1 | 32 | 5 |
| ├── 00d0021 [nicer-cere] | 0.975 | 0.9968 | 0.97478 | 1 | 1 | 1 | 32 | 3 |
| ├── 2a77fa6 [wacky-bind] | 0.95833 | 0.99328 | 0.9581 | 1 | 1 | 1 | 32 | 2 |
| ├── 004a3e9 [podgy-arcs] | 1 | 1 | 1 | 1 | 1 | 1 | 16 | 5 |
| ├── 8236073 [oared-snow] | 0.96667 | 0.99719 | 0.96623 | 1 | 1 | 1 | 16 | 3 |
| ├── e3edf66 [flown-nosh] | 0.95 | 0.98885 | 0.94955 | 1 | 1 | 1 | 16 | 2 |
| ├── db393d5 [lento-bice] | 0.99167 | 1 | 0.99145 | 1 | 1 | 1 | 8 | 5 |
| ├── 861f1f0 [sedgy-over] | 0.95833 | 0.99562 | 0.95852 | 1 | 1 | 1 | 8 | 3 |
| ├── 9503b13 [fetal-grub] | 0.95833 | 0.98593 | 0.95852 | 1 | 1 | 1 | 8 | 2 |
| ├── 83c1c6b [choky-door] | 0.96667 | 0.9972 | 0.96665 | 1 | 1 | 1 | 30 | 3 |
| └── 04301de [funny-cyma] | 0.95 | 0.98784 | 0.94955 | 1 | 1 | 1 | 10 | 2 |
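
If you'd rather analyze a grid like this with pandas instead of reading the markdown table, recent DVC versions (3.x) expose dvc.api.exp_show(), which returns the same information as dvc exp show as a list of dictionaries. A minimal sketch, assuming that API is available in your installed DVC version:

import pandas as pd

import dvc.api

# Each element is a dict of parameters and metrics for one experiment,
# mirroring the columns of `dvc exp show`.
experiments = dvc.api.exp_show()
df = pd.DataFrame(experiments)

# Column names should mirror the table above (e.g. "test.accuracy",
# "model.n_estimators"), but the exact keys depend on your DVC version.
print(df.head())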

You can read about other approaches to hyperparameter tuning, such as using ranges as parameter values, in the official DVC documentation.

Conclusion

The shift towards using tools like DVC for experiment tracking represents a significant advancement in the field of machine learning. It not only simplifies the management of experiments but also enhances reproducibility and collaboration among team members. By integrating these tools into your workflow, you can dramatically reduce the time spent on experiment management and focus more on innovation and model improvement.


References

  1. DVC: https://dvc.org

  2. DVCLive: https://dvc.org/doc/dvclive