Introduction to Data Versioning with DVC
Explore Data Version Control for ML data management. Master setup, automate pipelines, and evaluate models seamlessly.
3 hours, 12 videos, 35 exercises
Course Description
This course offers a comprehensive introduction to Data Version Control (DVC), a tool designed for efficient management and versioning of machine learning data. You will get an understanding of the machine learning product lifecycle, differentiating data versioning from code versioning and exploring DVC’s features and use cases.
Exploring DVC features
You will understand the motivations behind data versioning, the machine learning lifecycle, and DVC’s distinct features and use cases. You will also learn about DVC setup, covering installation, repository initialization, and the .dvcignore file. You will explore DVC cache and staging files, learn to add and remove files, manage caches, and understand the underlying mechanisms. You will learn about DVC remotes and how they differ from Git remotes, then add, list, and modify them. You will learn to interact with remotes, push and pull data, check out specific versions, and fetch data to the cache.
Automate and evaluate
You will learn why automating ML pipelines matters, with an emphasis on modularizing code and creating a configuration file. You will be introduced to DVC pipelines as directed acyclic graphs, with hands-on experience in adding stages and their inputs and outputs. You will practice executing these pipelines efficiently to enable different use cases in machine learning model training. The course concludes with a focus on evaluation, showcasing how metrics and plots are tracked in DVC.
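As a rough illustration of what comparing tracked metrics involves, the sketch below diffs two sets of metrics, such as those a pipeline might write out for two versions of a model. This is plain Python and not DVC's implementation; the function name `metrics_diff` and all metric names and values are made up for the example.

```python
def metrics_diff(old, new):
    """Return {metric: (old, new, change)} for metrics present in both runs."""
    return {
        name: (old[name], new[name], round(new[name] - old[name], 6))
        for name in sorted(old.keys() & new.keys())
    }


# Hypothetical metrics from a baseline run and a later experiment.
baseline = {"accuracy": 0.81, "f1": 0.74}
experiment = {"accuracy": 0.85, "f1": 0.73, "auc": 0.9}

diff = metrics_diff(baseline, experiment)
for name, (before, after, change) in diff.items():
    # Show each shared metric with its signed change.
    print(f"{name}: {before} -> {after} ({change:+})")
```

Metrics that appear in only one run (here `auc`) are skipped in this sketch; a real comparison might report them as added or removed instead.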
In the following Tracks
Machine Learning Engineer
Machine Learning in Production in Python
1. Introduction to DVC
Free
This chapter provides a comprehensive introduction to Data Version Control (DVC), a tool essential for data versioning in machine learning. Learners will explore the motivation behind data versioning, understand its differences from code versioning, and experiment with a simple classification problem. They will review basic Git commands, learn about DVC, and practice setting up a repository. The chapter concludes with an overview of DVC’s features and use cases, including versioning data and models, CI/CD for machine learning, experiment tracking, pipelines, and more.
Data Versioning Motivation (50 xp)
Anatomy of a Machine Learning Model (100 xp)
Differences Between Data and Code Versioning (50 xp)
Understanding Hyperparameters (50 xp)
Introduction to DVC (50 xp)
Working with Git CLI (100 xp)
Review DVC CLI (50 xp)
DVC features and use cases (50 xp)
DVC pipelines (50 xp)
CI/CD for machine learning (50 xp)
2. DVC Configuration and Data Management
This chapter covers DVC setup: installation, repository initialization, and the .dvcignore file. It then explores the DVC cache and staging files, showing how to add and remove files, manage caches, and understand the underlying mechanism, which is based on MD5 hashes. The chapter also explains DVC remotes, how they differ from Git remotes, and how to add, list, and modify them. Finally, it shows how to interact with these remotes by pushing and pulling data, checking out specific versions, and fetching data to the cache.
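The MD5-based mechanism mentioned above can be illustrated with a short sketch. It assumes, as a simplification, that a tracked file is identified by the MD5 hash of its contents and stored in the cache under a directory named after the first two hex characters of that hash; the exact cache layout differs between DVC versions. The helper names `file_md5` and `cache_relpath` are hypothetical and not part of DVC.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex):
    """Hypothetical helper: split a digest into '<first 2 chars>/<rest>'."""
    return md5_hex[:2] + "/" + md5_hex[2:]


# Demo: identical contents give identical hashes, hence the same cache entry.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "data_v1.csv")
    copy = os.path.join(tmp, "data_copy.csv")
    changed = os.path.join(tmp, "data_v2.csv")
    contents = [(first, b"a,b\n1,2\n"), (copy, b"a,b\n1,2\n"), (changed, b"a,b\n1,3\n")]
    for path, data in contents:
        with open(path, "wb") as handle:
            handle.write(data)
    same_hash = file_md5(first) == file_md5(copy)
    different_hash = file_md5(first) != file_md5(changed)
    example_path = cache_relpath(file_md5(first))

print(same_hash, different_hash, example_path)
```

Addressing files by content hash means a renamed or duplicated file does not need to be stored twice, while any change to the contents produces a new cache entry.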
DVC Setup and Initialization (50 xp)
Setting up DVC (100 xp)
.dvcignore Patterns (100 xp)
DVC Cache and Staging Files (50 xp)
Working with DVC Cache (100 xp)
Understanding .dvc Files (50 xp)
Configuring DVC Remotes (50 xp)
Purpose of DVC Remotes (50 xp)
Setup a DVC Remote (100 xp)
Interacting with DVC Remotes (50 xp)
Versioning Data using DVC Remote (100 xp)
Checking out Versioned Data (100 xp)
3. Pipelines in DVC
This chapter focuses on automating ML pipelines using DVC. Learners create a configuration file containing settings and hyperparameters. They also learn about pipeline visualization using directed acyclic graphs and use commands to describe dependencies, commands, and outputs. Execution of DVC pipelines is covered, including local model training and how Git tracks DVC metadata. Additionally, learners explore metrics and plots tracking in DVC, including how to print metrics, create plot files, and compare metrics and plots across different pipeline stages.
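To make the idea of a pipeline as a directed acyclic graph more concrete, here is a minimal sketch in plain Python (not DVC's own API). The stage names, commands, and file names are hypothetical. Each stage lists its dependencies and outputs, and one stage is taken to depend on another when it consumes one of that stage's outputs.

```python
from graphlib import TopologicalSorter

# Hypothetical stages, each with a command, dependencies, and outputs.
stages = {
    "preprocess": {
        "cmd": "python preprocess.py",
        "deps": ["raw.csv"],
        "outs": ["clean.csv"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["clean.csv", "params.yaml"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["model.pkl", "clean.csv"],
        "outs": ["metrics.json"],
    },
}


def execution_order(stages):
    """Order stages so each one runs after the stages that produce its inputs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the upstream stages it depends on.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Files such as `raw.csv` and `params.yaml` have no producing stage in this sketch, so they act as external inputs to the graph. The sketch uses `graphlib`, which is in the standard library from Python 3.9.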
Code organization and refactoring (50 xp)
Understanding parameter files in DVC (50 xp)
Write a parameter file (100 xp)
Writing and visualizing DVC pipelines (50 xp)
Designing a DVC pipeline (100 xp)
Visualizing a DVC pipeline (100 xp)
Executing DVC pipelines (50 xp)
DVC pipeline execution concepts (100 xp)
Execute a ML model training pipeline (100 xp)
Evaluation: Metrics and plots in DVC (50 xp)
Tracking DVC Metrics (100 xp)
Adding plots to dvc.yaml (100 xp)
Congratulations! (50 xp)
Collaborators
Ravi Bhadauria
Senior Machine Learning Engineer
Ravi is a senior ML Engineer at Etsy, where he focuses on solving problems at the intersection of Machine Learning and Distributed Systems. Previously, he worked in the healthcare and computational lithography domains. He holds a PhD specializing in Computational Chemical Physics and Mechanical Engineering.