
Machine Learning Experimentation: An Introduction to Weights & Biases

Learn how to structure, log, and analyze your machine learning experiments using Weights & Biases.
Jul 2023  · 9 min read

You Can't Analyze What You Don't Measure


Have you ever looked back on a machine learning project and wondered why you made specific decisions, or even what the code is actually doing? For example, why did you choose this subset of features, which version of data did you train the final model on, or how did different sets of model hyperparameters affect performance? Sounds like you need to adopt experimentation best practices into your machine learning projects!

Machine learning experimentation is the process of designing, building, logging, and evaluating a machine learning pipeline to identify the model that achieves a desired performance in terms of a given metric or metrics, e.g., F1 score, RMSE, etc. It is a crucial step in developing machine learning models both prior to deployment and as part of retraining. Just like any other experiment, it should be designed to:

  • Be repeatable
  • Minimize bias
  • Track all required files and artifacts

Keeping these goals in mind makes it easy to find what you need, such as a model's hyperparameters or a version of a dataset, and to share information, such as model performance and source code.
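To make these goals concrete, here is a hand-rolled sketch of what "tracking" means at its simplest: saving the hyperparameters, the metrics, and a fingerprint of the dataset next to each other. It uses only the standard library and is not how Weights & Biases stores anything internally; the function names, file names, and values are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(out_path: Path, data_path: Path, params: dict, metrics: dict) -> dict:
    """Save hyperparameters, metrics, and a dataset fingerprint as JSON."""
    manifest = {
        "dataset": data_path.name,
        "dataset_sha256": file_sha256(data_path),
        "params": params,
        "metrics": metrics,
    }
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


# Demo with a throwaway file standing in for a real dataset
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "toy.csv"
    data_file.write_text("x,y\n1,2\n3,4\n")
    manifest = write_manifest(
        Path(tmp) / "run_manifest.json",
        data_file,
        params={"alpha": 0.1},      # made-up hyperparameter
        metrics={"rmse": 1.23},     # made-up metric
    )
    reloaded = json.loads((Path(tmp) / "run_manifest.json").read_text())
```

If the dataset changes by even one byte, the digest changes too, which is what lets you tell two versions of the data apart later.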

Luckily, this is where Weights & Biases comes in! 

Creating a Machine Learning Pipeline With Weights & Biases

Weights & Biases is a platform and Python library for machine learning engineers to run experiments, log artifacts, automate hyperparameter tuning, and evaluate performance. This tutorial will introduce the core functionality by demonstrating a machine learning pipeline where the key elements are logged and subsequently examined in a dashboard!

Installation and login

# Install Python library
!pip install wandb

To log in, you'll need to sign up for an account.

Once your account is created, visit the authorize page to access an API key that you can use to log in. Make a note of this, but don't share it with anyone else!

To log in, either open a new Terminal and enter wandb login or add a new cell to your notebook and execute !wandb login.

You will be prompted to provide your API key, which you can copy and paste. Your credentials will then be stored locally in ~/.netrc.

wandb login
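If you want to check whether credentials are already in place before starting a run, something like the following could work. It is a sketch under two assumptions: that the netrc entry is stored under the default cloud host api.wandb.ai, and that the WANDB_API_KEY environment variable is used as the alternative to a netrc entry. The helper name has_wandb_credentials is hypothetical, and the demo uses a temporary file rather than your real ~/.netrc.

```python
import netrc
import os
import tempfile
from pathlib import Path


def has_wandb_credentials(netrc_path: Path, env=os.environ, host: str = "api.wandb.ai") -> bool:
    """Return True if an API key is available via env var or a netrc file."""
    if env.get("WANDB_API_KEY"):
        return True
    if not netrc_path.exists():
        return False
    try:
        # authenticators() returns None when the host has no entry
        return netrc.netrc(str(netrc_path)).authenticators(host) is not None
    except netrc.NetrcParseError:
        return False


# Demo against a temporary file instead of the real ~/.netrc
with tempfile.TemporaryDirectory() as tmp:
    fake_netrc = Path(tmp) / "netrc"
    fake_netrc.write_text("machine api.wandb.ai\n  login user\n  password not-a-real-key\n")
    found_in_file = has_wandb_credentials(fake_netrc, env={})
    found_in_env = has_wandb_credentials(Path(tmp) / "missing", env={"WANDB_API_KEY": "not-a-real-key"})
    found_nowhere = has_wandb_credentials(Path(tmp) / "missing", env={})
```

In practice you would call has_wandb_credentials(Path.home() / ".netrc") and leave env at its default.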

Creating an experiment

Weights & Biases has several components including Experiments, Artifacts, Sweeps, and Models. Let's start by looking at the core of the workflow - creating and running experiments.

# Import required modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import wandb
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

In Weights & Biases an Experiment is also known as a project. To create a new project, call the wandb.init() function and pass the project name as a string to the project keyword argument. Assign the result to a variable called run so you can log important files, models, and metrics to the run.

Optionally, you can also provide a string to the job_type keyword argument, allowing you to easily distinguish between different steps such as EDA, feature engineering, or hyperparameter tuning. Additionally, the code you execute can be logged by setting save_code equal to True.

# Initialize a run
run = wandb.init(project="predicting_london_temperature",
                 job_type="Baseline modeling",
                 save_code=True)

Logging an artifact

You can log any important information, such as datasets, variables, or files, as an Artifact. To do this, first create a variable by calling wandb.Artifact(), providing the name and type. Then log this using the wandb.log_artifact() function. Let's start by logging the original dataset.

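original_data = "london_weather.csv"
# Create an Artifact (type is a free-form label; "dataset" is assumed here)
data_artifact = wandb.Artifact(name="original_dataset",
                               type="dataset",
                               description="London weather dataset")
# Add the file to the Artifact
data_artifact.add_file(original_data)
# Log the Artifact as part of the run
wandb.log_artifact(data_artifact)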

Exploratory data analysis

Exploratory Data Analysis (EDA) is a crucial step in the machine learning workflow, helping you to ensure data conforms to required data types and is free from missing values and outliers. It is also an opportunity to examine distributions and relationships between features.

The added benefit of using Weights & Biases is that we can log the data as we perform preprocessing, along with any visualizations produced.

# Read in the data
london = pd.read_csv(original_data)

# Preview the first five rows
london.head()


# Visualize distributions


# View data types and non-null counts
london.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15341 entries, 0 to 15340
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              15341 non-null  int64  
 1   cloud_cover       15322 non-null  float64
 2   sunshine          15341 non-null  float64
 3   global_radiation  15322 non-null  float64
 4   max_temp          15335 non-null  float64
 5   mean_temp         15305 non-null  float64
 6   min_temp          15339 non-null  float64
 7   precipitation     15335 non-null  float64
 8   pressure          15337 non-null  float64
 9   snow_depth        13900 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 1.2 MB

# Convert date to datetime
london["date"] = pd.to_datetime(london["date"], format="%Y%m%d")

# Add a month column
london["month"] = london["date"].dt.month

# Drop missing values
london = london.dropna()
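The cleaning steps above can be checked on a small, made-up frame before running them on the full dataset. This is only an illustration: the rows are invented, and the dates are cast to text before parsing, which is an extra precaution rather than something the tutorial's code does.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the dataset's layout (integer dates, some gaps)
toy = pd.DataFrame({
    "date": [19790101, 19790102, 19790203],
    "mean_temp": [-4.1, np.nan, 2.3],
    "snow_depth": [9.0, 8.0, np.nan],
})

# Integer dates such as 19790101 are cast to text, then parsed as YYYYMMDD
toy["date"] = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")
toy["month"] = toy["date"].dt.month

missing_per_column = toy.isna().sum()  # gaps per column before cleaning
cleaned = toy.dropna()                 # keep only rows with no missing values
```

Comparing len(toy) with len(cleaned) shows how many rows the drop costs, which is worth knowing before committing to it on real data.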

Now that the data has been cleaned, you can log it as an Artifact using wandb.Artifact().

# Save and log the new version of the data
london.to_csv("london_weather_preprocessed.csv", index=False)
preprocessed_data = "london_weather_preprocessed.csv"
# Create an Artifact (type is a free-form label; "dataset" is assumed here)
preprocessed_data_artifact = wandb.Artifact(name="london_weather_preprocessed",
                               type="dataset",
                               description="London weather dataset - preprocessed")
# Add the file to the Artifact
preprocessed_data_artifact.add_file(preprocessed_data)
# Log the Artifact as part of the run
wandb.log_artifact(preprocessed_data_artifact)

<wandb.sdk.artifacts.local_artifact.Artifact at 0x15bfe6710>

Feature selection

The next step in the workflow is to identify potential features for modeling while avoiding multicollinearity. To do this, you can first produce a correlation heatmap, then use Ridge Regression to quantify feature importance in predicting "mean_temp".

# Correlation heatmap
sns.heatmap(london.corr(), annot=True, center=0)


You can log this plot using wandb.Image()!

# Log the heatmap
wandb.log({"correlation_heatmap": wandb.Image(plt)})
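A heatmap is easy to read by eye, but the same multicollinearity check can be done programmatically. The sketch below lists column pairs whose absolute Pearson correlation exceeds a threshold. The helper name correlated_pairs, the 0.8 cut-off, and the synthetic data (including the made-up link between sunshine and global_radiation) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return (col_a, col_b, r) for column pairs with |Pearson r| above threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs


# Synthetic data: one near-duplicate feature and one unrelated feature
rng = np.random.default_rng(0)
sunshine = rng.normal(size=200)
toy = pd.DataFrame({
    "sunshine": sunshine,
    "global_radiation": 2 * sunshine + rng.normal(scale=0.1, size=200),
    "pressure": rng.normal(size=200),
})
flagged = correlated_pairs(toy)
```

Each flagged pair is a candidate for dropping one of the two columns before modeling.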

# Split the data
X = london.drop(columns=["mean_temp", "min_temp", "max_temp", "date"])
y = london["mean_temp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Build a Ridge regression model
ridge = Ridge(alpha=0.1).fit(X_train, y_train)
y_pred = ridge.predict(X_test)

# Visualize feature importance using the coefficients
fig = plt.figure()
sns.barplot(x=X_train.columns, y=ridge.coef_)
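The coefficients can also be read directly instead of from the bar plot. The sketch below fits a Ridge model on synthetic data and keeps the features whose coefficient is above zero, the same rule applied in the Training and logging section. The column names a, b, and c and the data-generating formula are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic data: "a" pushes the target up, "b" pushes it down, "c" is noise
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y = 3 * X["a"] - 2 * X["b"] + rng.normal(scale=0.1, size=300)

ridge = Ridge(alpha=0.1).fit(X, y)

# Pair each coefficient with its feature name, then filter by sign
coefs = pd.Series(ridge.coef_, index=X.columns)
selected = coefs[coefs > 0].index.tolist()
```

Note that a sign-based rule leaves out features with a strong negative relationship, such as "b" here, so ranking by absolute coefficient size is a common alternative when those features matter.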


Let's save the model and feature importance plot using pickle.dump() and wandb.log().

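import pickle
# Save the model, closing the file once it is written
with open("feature_selection_model.pkl", "wb") as f:
    pickle.dump(ridge, f)
feature_selection_model = wandb.Artifact("feature_selection_model", type="model")
feature_selection_model.add_file("feature_selection_model.pkl")
wandb.log_artifact(feature_selection_model)
wandb.log({"feature_selection_plot": wandb.Image(fig)})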

<Figure size 640x480 with 0 Axes>
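Before relying on a pickled model, it can be worth confirming that it loads back and predicts the same values. Here is a small round-trip check on a toy Ridge model; the data and the file name are made up, and a temporary directory is used so nothing is left on disk.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import Ridge

# Tiny made-up dataset, just enough to fit a model
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Ridge(alpha=0.1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    with path.open("wb") as f:
        pickle.dump(model, f)
    with path.open("rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.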

Training and logging

You made it! Time to train your baseline model using the identified important features, which are any with a Ridge model coefficient of more than zero. You'll also log the new training and test data.

X = london[["cloud_cover", "global_radiation", "precipitation", "month"]]
y = london["mean_temp"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

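def log_data(X_train, y_train, X_test, y_test):
    # Log the training and test data
    # (type is a free-form label; "dataset" is assumed here)
    training_features = wandb.Artifact(name="X_train", type="dataset",
                               description="Training features")
    X_train.to_csv("X_train.csv", index=False)
    training_features.add_file("X_train.csv")
    wandb.log_artifact(training_features)

    training_labels = wandb.Artifact(name="y_train", type="dataset",
                               description="Training labels")
    y_train.to_csv("y_train.csv", index=False)
    training_labels.add_file("y_train.csv")
    wandb.log_artifact(training_labels)

    test_features = wandb.Artifact(name="X_test", type="dataset",
                               description="Test features")
    X_test.to_csv("X_test.csv", index=False)
    test_features.add_file("X_test.csv")
    wandb.log_artifact(test_features)

    test_labels = wandb.Artifact(name="y_test", type="dataset",
                               description="Test labels")
    y_test.to_csv("y_test.csv", index=False)
    test_labels.add_file("y_test.csv")
    wandb.log_artifact(test_labels)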

log_data(X_train, y_train, X_test, y_test)

Let's train and evaluate a Decision Tree Regression model, using Root Mean Squared Error (RMSE) and R-Squared as performance metrics.

# Fit and evaluate the model
dt_reg = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
y_pred = dt_reg.predict(X_test)
# Taking the square root explicitly also works on scikit-learn versions without the squared argument
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r_squared = r2_score(y_test, y_pred)
print(rmse, r_squared)

3.8627385194713346 0.5399282872125385
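To see what the two numbers measure, it helps to compute them by hand once. RMSE is the square root of the mean squared residual, and R-Squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The values below are invented and unrelated to the London dataset.

```python
import math

# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

# Residual sum of squares: squared gaps between targets and predictions
residuals = [t - p for t, p in zip(y_true, y_pred)]
ss_res = sum(r ** 2 for r in residuals)

# Total sum of squares: squared gaps between targets and their mean
mean_true = sum(y_true) / len(y_true)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

rmse = math.sqrt(ss_res / len(y_true))
r_squared = 1 - ss_res / ss_tot
```

Here the residuals are 1, 0, -1, and -2, so the residual sum of squares is 6, the total sum of squares is 20, RMSE is the square root of 1.5 (about 1.22), and R-Squared is 0.7. RMSE is in the same units as the target, while R-Squared is unitless.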

# Save and log the model and metrics
with open("baseline_decision_tree.pkl", "wb") as f:
    pickle.dump(dt_reg, f)
baseline_model = wandb.Artifact("baseline_decision_tree", type="model")
baseline_model.add_file("baseline_decision_tree.pkl")
wandb.log_artifact(baseline_model)
wandb.log({"rmse": rmse, "r_squared": r_squared})

You can drill down further on model performance by visualizing the model's residuals. Weights & Biases has a function in their sklearn module to produce this!

# Plot training residuals
wandb.log({"baseline_residuals_plot": wandb.sklearn.plot_residuals(dt_reg, X_train, y_train)})


Training and evaluation is complete, so let's finish the current run.

# Finish run
run.finish()


Hyperparameter tuning

Now you've created and logged a baseline model with its associated performance metrics, let’s look at how you can perform hyperparameter tuning and version your models using Weights & Biases and scikit-learn.

We'll use GridSearchCV to search for the best performing model across a hyperparameter space, based on RMSE.
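Before launching a grid search it helps to know how much work it involves. This quick count uses the grid values and the five folds from this section; only the variable names n_candidates and n_fits are new.

```python
from itertools import product

# The hyperparameter grid used in this section
params = {
    "max_depth": [2, 5, 8, 12, 20],
    "min_samples_split": [2, 5, 8, 12, 20],
    "max_features": [1, 2, 3, 4],
    "random_state": [42],
}
n_splits = 5  # folds in the KFold object

# Every combination of one value per hyperparameter
candidates = [dict(zip(params, values)) for values in product(*params.values())]
n_candidates = len(candidates)

# Each candidate is trained once per fold
n_fits = n_candidates * n_splits
```

That gives 100 candidate models and 500 fits, plus one final refit of the best candidate on the full training set, which GridSearchCV does by default.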

# Initialize a new run
run_v2 = wandb.init(project="predicting_london_temperature",
                 job_type="Hyperparameter tuning",
                 save_code=True)

# Split the data and create a KFold object
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Create the hyperparameter space
params = {"max_depth":[2, 5, 8, 12, 20],
          "min_samples_split": [2, 5, 8, 12, 20],
          "max_features": [1, 2, 3, 4],
          "random_state": [42]}

# Create a GridSearchCV object
dt_reg = DecisionTreeRegressor()
dt_reg_cv = GridSearchCV(dt_reg,
                         param_grid=params,
                         scoring="neg_root_mean_squared_error",
                         cv=kf)

# Fit the data
dt_reg_cv.fit(X_train, y_train)

# Evaluate and log performance
## GridSearchCV uses negative RMSE
## so you can convert it to positive
print(f"Best RMSE: {-1 * dt_reg_cv.best_score_}")
print(f"Best model: {dt_reg_cv.best_estimator_}")

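# Save and log the model, metrics, and parameters
with open("tuned_decision_tree.pkl", "wb") as f:
    pickle.dump(dt_reg_cv, f)
tuned_decision_tree = wandb.Artifact("tuned_decision_tree", type="model")
tuned_decision_tree.add_file("tuned_decision_tree.pkl")
wandb.log_artifact(tuned_decision_tree)
wandb.log({"rmse": -1 * dt_reg_cv.best_score_})
run_v2.config.update(dt_reg_cv.best_params_)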
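The sign convention is easy to trip over, so here is a small self-contained run on synthetic data. scikit-learn scorers follow a "higher is better" rule, which is why the "neg_root_mean_squared_error" scorer reports RMSE as a negative number. The data, the reduced grid, and the summary dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 5, 8]},
    scoring="neg_root_mean_squared_error",  # higher (closer to zero) is better
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

# Flip the sign to recover an ordinary, positive RMSE
best_rmse = -search.best_score_

# One flat dictionary holding the metric and the winning hyperparameters
summary = {"rmse": best_rmse, **search.best_params_}
```

A flat dictionary like summary is convenient to pass to a logging call, since the metric and the settings that produced it stay together.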



Visiting the dashboard

To examine runs and artifacts, go to <your-profile-name>/projects on the Weights & Biases website, then select the project you have created.

Once inside, you will see each run that you've created, along with any logged media, such as plots.


By clicking on the Runs icon, you can see all your runs and key information, including metrics logged like RMSE and R-Squared.


Next Steps

Weights & Biases has a lot more functionality that you can explore, such as using existing artifacts as part of new runs, a Command Line Interface tool, and Prompts for the development of LLMs!

To reinforce your knowledge about machine learning experimentation and the wider techniques of MLOps, you might want to check out:

  • MLOps Concepts - a conceptual course covering the machine learning lifecycle from design through to deployment and maintenance, along with highlighting key concepts and stakeholders involved throughout!
  • Introduction to MLflow - this Python course teaches you how to use the four core components of MLflow: Tracking, Models, Projects, and Model Registry.

George Boorman

George is a Curriculum Manager at DataCamp. He holds a PGDip in Exercise for Health and BSc (Hons) in Sports Science and has experience in project management across public health, applied research, and not-for-profit sectors. George is passionate about sports, tech for good, and all things data science.

