
Data versioning and data pipelines with dvc

At SMS digital, we work in Scrum teams consisting of about 3-5 developers from different backgrounds. These include backend and frontend developers, data engineers, and data scientists. For our data engineers and data scientists, it is essential to share not only code but also data in order to work on products collaboratively.

Sharing code works well using established technologies such as Git. Developers work on a local copy of a repository on their machines and push their code to GitHub if they are satisfied with the quality of their work. On GitHub, they open pull requests to obtain reviews on their work from other developers who, in turn, can pull the code to review it or add more features. When the pull request has been approved, the code of the feature branch is merged into the main branch.

Sharing and collaborating become much less straightforward when it comes to data. One significant difference from script files is size: data files can easily reach several gigabytes, whereas script files are usually a few kilobytes. This makes data files inconvenient and expensive to store on GitHub.

Consequently, some questions are left unanswered: Where do we store data? How do we version data files? How can we share large data files? Who has what kind of access to the data files?

In the following article, we will answer these questions and give you insights into how we solved these challenges at SMS digital.

Problems we faced before introducing dvc

When working with data files (e.g., text files, .csv files, labeled images) in development teams, it is not uncommon to have multiple versions of the same file. Previously, the data files that were necessary for our model development were stored on a Nextcloud server. There, they could be modified by almost anyone with access to the server – intentionally or unintentionally. Just opening a file and modifying one cell could corrupt the whole .csv file (e.g., by adding an "unnamed: 0" column). Furthermore, creating data pipelines was a hodgepodge of different Python scripts, Jupyter Notebooks, and data files.

Collaborating on data pipelines was difficult, and keeping track of file versions was nearly impossible.

Two developers who thought they had used the same version of scripts and data files could get different results from the same pipeline. It was almost impossible to reproduce data handling steps, and sharing the results from data pipelines was difficult. Furthermore, data science experiments could not be tracked in a structured way. This included the tracking of hyperparameters or model outputs (e.g., model weights). Therefore, we decided that we had to find a solution that met our needs.

Solution

To start keeping track of data and to build data pipelines, we decided to use dvc (https://dvc.org/features).

At first, we used dvc for versioning the data files. This helped us to keep track of the versions of our data files. We introduced AWS S3 as the shared remote data storage, where each project has its own bucket for file storage. The advantages of S3 are its low price and that it fits into our IT infrastructure landscape, as we already use other AWS products. Thanks to the versioned data in the S3 buckets, it is possible to control who has access to the data files as well as to share files with colleagues. Later, we also started to use dvc to build data pipelines using its pipeline functionality.

These pipelines eliminated the patchwork of scripts and files and made data handling transparent and reproducible.

Basically, dvc is to data what git is to code. It is an open-source tool developed by iterative (https://iterative.ai). The project is hosted on GitHub (https://github.com/iterative/dvc).

We were confronted with some issues during the implementation of dvc in our teams. We raised those on the dvc GitHub issue tracker or started a discussion on the dvc Discord server. The dvc developers always responded quickly and helped us to find a solution. One particular issue we had was the integration of session tokens when using an S3 bucket as remote storage. Together with the dvc developers, we were able to solve this problem and add the fix to the dvc code base.

A brief overview of how dvc works

dvc works very similarly to git and even uses similar commands. Like git, dvc has to be initialized in a git repo before it is used for the first time. This step only has to be done once and can be performed with the following command:

dvc init

Now, files can be tracked with dvc using:

dvc add data.csv

In this case, the data.csv file is tracked by dvc, and an additional file, data.csv.dvc, is created. This newly created file contains information about the data file in the form of an md5 hash and acts as its placeholder. It serves as the link between git and the data storage (see Figure 1):

Figure 1: Data and code versioning and storage.

The md5 hash is unique to this file and version. If the content of the data file changes, dvc detects this change and creates a new md5 hash when the dvc add data.csv command is run again. The .dvc file itself is only a few kB in size, so it can easily be tracked by git and shared via GitHub. Additionally, data.csv is added to the .gitignore file so that it is ignored by git.
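
To give an idea of what this placeholder looks like, the contents of data.csv.dvc are roughly as follows (the hash is made up for illustration, and the exact fields vary slightly between dvc versions):

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: data.csv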

The data and code versions are linked through the git commits, as shown in Figure 2:

Figure 2: Commit history including data, code, and model.

By checking out previous commits, it is possible to restore an older version of the data and model.
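
As a sketch, restoring the data that belongs to an earlier commit could look like this (the commit hash is a placeholder):

# check out the commit containing the desired code and .dvc files
git checkout 7d3b2a1
# sync the local data files with the checked-out .dvc files
dvc checkout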

The data.csv.dvc file is versioned with git using the following commands:

git add data.csv.dvc .gitignore
git commit -m 'Add data'

The data.csv file can now be pushed to a remote for large file storage, while the data.csv.dvc file can be pushed to GitHub.

In our case, data.csv is pushed to an S3 bucket with the following command:

dvc push
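
For dvc push to know where to upload the data, the S3 bucket has to be registered as a dvc remote once per repository. A rough sketch of this one-time configuration (the bucket name and region are placeholders, not our actual setup):

# add an S3 bucket as the default remote for this repository
dvc remote add -d s3storage s3://my-project-bucket/dvc-storage
# optionally pin the bucket's region
dvc remote modify s3storage region eu-central-1
# the remote configuration lives in .dvc/config and is versioned with git
git add .dvc/config
git commit -m 'Configure S3 remote'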

In order for developers in our teams to use dvc, the following prerequisites must be satisfied:

  • A git repository is located locally on the developer's machine and in a remote (currently we use GitHub as the remote).
  • The AWS CLI is installed locally.
  • The developer has access to the project's S3 bucket. This access is controlled centrally by our infrastructure team. Since the S3 buckets are project-specific, access to the data can be controlled for every developer at the project level.

We use dvc for several purposes. Its capability to version data files is used in every machine-learning project: whenever we need to keep track of data files, we add them to dvc and push them to the central storage to share them with colleagues. This is directly linked to the second purpose for which we use dvc: sharing data. When all data files are versioned and stored in a central location, it is easy to keep track of modifications and to control access to the data. Access to a project's data can be regulated simply by granting or revoking a developer's access to the S3 bucket. Sharing between data scientists is visualized in Figure 3:

Figure 3: Sharing of code and data between data scientists.
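
In practice, a colleague who wants to work with the shared data only needs the git repository and read access to the project's S3 bucket. A minimal sketch of the workflow could look like this:

# developer A versions the data and uploads it to the remote storage
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m 'Add data'
dvc push
git push

# developer B fetches the code and then the matching data
git pull
dvc pull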

We also use dvc's functionality to build data processing pipelines. These can include one or more stages such as importing data, data preprocessing, modeling, and postprocessing results. They may also include multiple input files, intermediate results files, and output files. An example of a data pipeline definition file is shown in Figure 4:

Figure 4: Sample pipeline definition. Two stages are shown here: `importing data` and `preprocessing`. The `data/dataset.json` file is generated during the `importing data` stage and tracked by dvc. It is then used as an input in the `preprocessing` stage.
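
We cannot reproduce the actual pipeline file here, but as a rough sketch, the two stages from Figure 4 could be defined on the command line like this (the script names and paths are illustrative; dvc writes the resulting stage definitions to dvc.yaml):

# stage 1: import the raw data and let dvc track the output
dvc stage add -n importing_data \
    -d src/import_data.py \
    -o data/dataset.json \
    python src/import_data.py

# stage 2: preprocess the imported data
dvc stage add -n preprocessing \
    -d src/preprocess.py -d data/dataset.json \
    -o data/preprocessed.json \
    python src/preprocess.py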

By using dvc, the intermediate results and the output files are automatically versioned and saved in the remote storage. This means we can always keep track of those files.

The hyperparameters that we use for each run of the pipeline are also stored. Another advantage of the dvc pipeline is caching: stages of the pipeline whose inputs have not changed are skipped and their results are loaded from the cache; only the stages whose inputs have changed are rerun. This makes iterating on the pipeline fast and avoids unnecessary computation. A sample data pipeline is shown in Figure 5:

Figure 5: Directed acyclic graph of the pipeline shown in Figure 4 (with additional stages).

In the first stage, the data is imported. It is then preprocessed in the second stage, and the results of the preprocessing are plotted in the plotting stage. After preprocessing, part of the data (test data that is not involved in training) is sent directly to the predict stage, while the rest is sent to the training stage.
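
Reproducing and inspecting such a pipeline then only takes two commands; stages whose inputs have not changed are restored from the cache instead of being executed again:

# rerun the pipeline, skipping stages whose inputs are unchanged
dvc repro
# print the directed acyclic graph of the stages (as in Figure 5)
dvc dag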

Up to now, we have used dvc predominantly during the scoping phase of projects. For one of our products, we defined a benchmark pipeline with dvc. This pipeline enables us to run the benchmark for every potential customer in a fast and standardized way. As a result, we can quickly evaluate the benefits of this particular product for the customer.

Introducing dvc to all teams is part of our AI tool stack project. In this project, we compile a set of recommendations with tools for our data scientists to use. The AI tool stack project will be covered in more detail in a future blog post.

Another purpose for which we use dvc is the data registry. A data registry is a separate git repository that contains the raw data for a project. Each project has its own data registry, so access to the raw data can be controlled for each individual project. The data registry serves as the central storage location for raw data, from which other project repos can import data. This is especially useful for projects with multiple git repositories. We only store raw data in the data registry; the preprocessing of data is done in the project repos. The advantage of the data registry is that the raw data is stored in a central place, which means that updates to the raw data only have to be applied once.
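
To sketch how this looks in practice, a project repository can pull a file from the registry repository with dvc import (the repository URL and file path below are placeholders):

# import a raw data file from the data registry into the current project
dvc import https://github.com/our-org/data-registry data/raw_measurements.csv -o data/raw_measurements.csv
# later, pick up a new version of the file from the registry
dvc update data/raw_measurements.csv.dvc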

Conclusion

In this blog post, we have given you an insight into how we version and track our data files. We decided to use dvc for this purpose. The data is now stored in a central place and can be shared easily among colleagues.

We also use dvc to build data preprocessing pipelines. This helps us to make our machine learning experiments reproducible and to track and store all intermediate results.

Dr. Christoph Kirmse
Senior Data Scientist
SMS digital Group