
Data Versioning

DVC (tracking data, pipelines, experiments), LakeFS (git-like branching for data), data lineage, reproducibility, and dataset registries

~40 min

Data Versioning: Git for Your Data

Code versioning with Git is standard practice, but data changes too. A model trained on January's data produces different results than one trained on March's data. If you cannot reproduce the exact dataset a model was trained on, you cannot debug production issues, compare experiments fairly, or satisfy audit requirements.

This lesson covers DVC (the most popular open-source data versioning tool), LakeFS, and best practices for data reproducibility.

Why Git Cannot Version Data

Git stores a full copy of every version of every file, and large binary files neither diff nor compress well. A 5GB dataset with 100 versions would therefore require roughly 500GB of Git storage (and make clone, push, and pull unusable). DVC solves this by storing data in external storage (S3, GCS, or a local cache) and tracking only lightweight metadata files (`.dvc`) in Git. The `.dvc` file contains a hash pointer to the actual data, similar to Git LFS but designed for ML workflows.
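The pointer idea can be sketched in a few lines of Python: content is stored once per unique hash, and only the tiny pointer is versioned. (A simplified illustration of content addressing, not DVC's actual on-disk format; `track` and `store` are hypothetical names.)

```python
import hashlib

# Content-addressed store: data is keyed by its hash, so identical
# content is stored exactly once no matter how often it is tracked.
store = {}

def track(data: bytes) -> dict:
    """Return a lightweight pointer (like a .dvc file) for `data`."""
    digest = hashlib.md5(data).hexdigest()
    store[digest] = data                        # "upload" to external storage
    return {"md5": digest, "size": len(data)}   # this small dict goes into Git

pointer_v1 = track(b"row1,row2,row3")
pointer_v2 = track(b"row1,row2,row3,row4")

# Git only ever sees the tiny pointers; the store holds the bytes.
print(pointer_v1["md5"] != pointer_v2["md5"])  # content changed -> new hash
print(store[pointer_v1["md5"]])                # old version still retrievable
```

Because the store is keyed by hash, re-tracking unchanged data costs nothing, and every historical version remains retrievable through its pointer.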

DVC (Data Version Control)

DVC is an open-source tool built on top of Git that adds data and model versioning, pipeline management, and experiment tracking.

Core Concepts

| Command | Description |
| --- | --- |
| `dvc init` | Initialize DVC in a Git repository |
| `dvc add` | Track a data file or directory (creates a `.dvc` file) |
| `dvc push` | Upload data to remote storage (S3, GCS, Azure, SSH) |
| `dvc pull` | Download data from remote storage |
| `dvc checkout` | Sync local data with the current Git commit's `.dvc` files |
| `dvc repro` | Reproduce a pipeline (re-run only changed stages) |
| `dvc dag` | Visualize the pipeline as a directed acyclic graph |

How DVC Tracking Works

```bash
$ dvc add data/training.csv
```

This creates `data/training.csv.dvc`:

```yaml
outs:
- md5: a1b2c3d4e5f6...
  size: 52428800
  path: training.csv
```

It also adds `data/training.csv` to `.gitignore`, so the data itself never enters Git history. You commit the small `.dvc` file to Git; DVC stores the actual data in its cache and, after `dvc push`, in remote storage.

When you change the data and run `dvc add` again, the hash changes. Committing the new `.dvc` file creates a new version. You can switch between versions with `git checkout <commit> && dvc checkout`.

```python
# === DVC Workflow Simulation ===
# (Full DVC requires the CLI; this simulates the concepts in Python)

import hashlib
import json
import os

import numpy as np
import pandas as pd

class SimpleDVC:
    """Minimal DVC simulation for learning purposes."""

    def __init__(self, repo_dir="/tmp/dvc_demo"):
        self.repo_dir = repo_dir
        self.cache_dir = os.path.join(repo_dir, ".dvc_cache")
        self.tracking_dir = os.path.join(repo_dir, "tracking")
        os.makedirs(self.cache_dir, exist_ok=True)
        os.makedirs(self.tracking_dir, exist_ok=True)
        self.versions = {}  # Simulated Git commits

    def _compute_hash(self, data_bytes):
        return hashlib.md5(data_bytes).hexdigest()

    def add(self, name, dataframe):
        """Track a DataFrame (like 'dvc add')."""
        data_bytes = dataframe.to_csv(index=False).encode()
        data_hash = self._compute_hash(data_bytes)

        # Store data in cache (keyed by hash)
        cache_path = os.path.join(self.cache_dir, data_hash)
        with open(cache_path, "wb") as f:
            f.write(data_bytes)

        # Create .dvc tracking file
        dvc_meta = {
            "name": name,
            "md5": data_hash,
            "size": len(data_bytes),
            "rows": len(dataframe),
            "columns": list(dataframe.columns),
        }

        tracking_path = os.path.join(self.tracking_dir, f"{name}.dvc")
        with open(tracking_path, "w") as f:
            json.dump(dvc_meta, f, indent=2)

        print(f"Tracked '{name}': hash={data_hash[:12]}..., "
              f"size={len(data_bytes)/1024:.1f}KB, "
              f"rows={len(dataframe)}")
        return data_hash

    def commit(self, version_name):
        """Simulate git commit (snapshot all .dvc files)."""
        snapshot = {}
        for fname in os.listdir(self.tracking_dir):
            with open(os.path.join(self.tracking_dir, fname)) as f:
                snapshot[fname] = json.load(f)
        self.versions[version_name] = snapshot
        print(f"Committed version '{version_name}' "
              f"({len(snapshot)} tracked files)")

    def checkout(self, version_name):
        """Restore .dvc files from a version."""
        if version_name not in self.versions:
            print(f"Version '{version_name}' not found!")
            return None
        snapshot = self.versions[version_name]
        print(f"Checked out '{version_name}':")
        for name, meta in snapshot.items():
            print(f"  {meta['name']}: hash={meta['md5'][:12]}..., "
                  f"rows={meta['rows']}")
        return snapshot

    def get_data(self, version_name, dataset_name):
        """Retrieve actual data for a version (like 'dvc pull')."""
        snapshot = self.versions.get(version_name, {})
        dvc_file = f"{dataset_name}.dvc"
        if dvc_file not in snapshot:
            return None
        data_hash = snapshot[dvc_file]["md5"]
        cache_path = os.path.join(self.cache_dir, data_hash)
        return pd.read_csv(cache_path)

    def diff(self, v1, v2):
        """Compare two versions."""
        s1 = self.versions.get(v1, {})
        s2 = self.versions.get(v2, {})
        print(f"\nDiff: '{v1}' vs '{v2}'")
        all_files = set(list(s1.keys()) + list(s2.keys()))
        for f in sorted(all_files):
            m1 = s1.get(f, {})
            m2 = s2.get(f, {})
            if m1.get("md5") == m2.get("md5"):
                print(f"  {f}: unchanged")
            elif f not in s1:
                print(f"  {f}: ADDED (rows={m2.get('rows')})")
            elif f not in s2:
                print(f"  {f}: DELETED")
            else:
                print(f"  {f}: CHANGED "
                      f"(rows: {m1.get('rows')} -> {m2.get('rows')})")


# === Demo ===
dvc = SimpleDVC()

# Version 1: Initial dataset
np.random.seed(42)
df_v1 = pd.DataFrame({
    "feature_1": np.random.randn(1000),
    "feature_2": np.random.randn(1000),
    "label": np.random.randint(0, 2, 1000),
})
dvc.add("training_data", df_v1)
dvc.commit("v1-initial")

# Version 2: More data collected
df_v2 = pd.concat([df_v1, pd.DataFrame({
    "feature_1": np.random.randn(500),
    "feature_2": np.random.randn(500),
    "label": np.random.randint(0, 2, 500),
})], ignore_index=True)
dvc.add("training_data", df_v2)
dvc.commit("v2-more-data")

# Version 3: Data cleaned
mask = df_v2["feature_1"].abs() < 3  # Remove outliers
df_v3 = df_v2[mask].reset_index(drop=True)
dvc.add("training_data", df_v3)
dvc.commit("v3-cleaned")

# Compare versions
dvc.diff("v1-initial", "v3-cleaned")

# Checkout and retrieve old data
dvc.checkout("v1-initial")
old_data = dvc.get_data("v1-initial", "training_data")
print(f"\nRetrieved v1 data: {old_data.shape}")
```

DVC Pipelines

DVC pipelines define multi-stage ML workflows. Each stage specifies its dependencies (inputs), its outputs, and the command to run. DVC tracks all of these and can intelligently re-run only the stages that need updating.

```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - results.json:
          cache: false
```

Running `dvc repro` re-executes only the stages whose dependencies have changed, which saves time in iterative development.
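The skip logic behind `dvc repro` boils down to dependency hashing: a stage re-runs only if the hash of any dependency differs from the hash recorded after the last successful run. A simplified sketch of that check (real DVC also tracks outputs and parameters, and persists hashes in `dvc.lock`; `needs_rerun` and `lock` are illustrative names):

```python
import hashlib

def file_hash(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

# Hashes recorded after the last successful run of the stage.
lock = {
    "src/train.py": file_hash(b"train v1"),
    "data/processed.csv": file_hash(b"rows v1"),
}

def needs_rerun(deps: dict) -> bool:
    """A stage is stale if any dependency's current hash differs from the lock."""
    return any(lock.get(path) != file_hash(content)
               for path, content in deps.items())

# Nothing changed -> stage is skipped
print(needs_rerun({"src/train.py": b"train v1",
                   "data/processed.csv": b"rows v1"}))  # False

# A dependency changed -> stage re-runs
print(needs_rerun({"src/train.py": b"train v1",
                   "data/processed.csv": b"rows v2"}))  # True
```

Because the check is per-stage, editing only `src/evaluate.py` would leave `preprocess` and `train` untouched while `evaluate` re-runs.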

LakeFS: Git-like Branching for Data

LakeFS provides Git-like operations (branch, commit, merge, diff) directly on your data lake (S3, GCS, Azure Blob). Unlike DVC, which tracks pointers in a Git repository, LakeFS operates at the storage layer itself.

| Feature | DVC | LakeFS |
| --- | --- | --- |
| Approach | Track metadata in Git | Git-like operations on object storage |
| Branching | Via Git branches | Native data branches |
| Merge | Manual | Automatic merge with conflict detection |
| Atomicity | File-level | Commit-level (all or nothing) |
| Best for | ML teams, small-medium data | Data engineering, large data lakes |
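A data branch is cheap because it is just a new pointer map over the same stored objects; writes on the branch never touch the source branch until a merge. A minimal copy-on-write sketch of that idea (illustrative only, not the LakeFS API; the merge here naively takes the source's pointers and omits conflict detection):

```python
import hashlib

objects = {}                 # content-addressed object store (shared by all branches)
branches = {"main": {}}      # branch name -> {object key: content hash}

def put(branch, key, data):
    """Write an object on a branch; other branches keep their old pointer."""
    digest = hashlib.sha1(data).hexdigest()
    objects[digest] = data
    branches[branch][key] = digest

def create_branch(new, source):
    # Zero-copy branching: duplicate the small pointer map, not the data.
    branches[new] = dict(branches[source])

def merge(source, dest):
    # All-or-nothing: apply every source pointer to dest in one step.
    branches[dest].update(branches[source])

put("main", "raw/events.csv", b"a,b\n1,2\n")
create_branch("experiment", "main")
put("experiment", "raw/events.csv", b"a,b\n1,2\n3,4\n")  # invisible to main

print(branches["main"]["raw/events.csv"]
      == branches["experiment"]["raw/events.csv"])       # False: branches diverged
merge("experiment", "main")
print(objects[branches["main"]["raw/events.csv"]])       # main now sees the new data
```

This is why LakeFS merges are atomic at the commit level: a merge swaps pointer maps, so readers see either the old state or the new state, never a half-applied mix.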

Data Lineage

Data lineage tracks the full provenance chain: which raw data was transformed by which code to produce which features that trained which model. DVC pipelines provide lineage automatically. For custom pipelines, log the Git commit hash, the data version hash, and all hyperparameters alongside your model artifacts. This makes any model fully reproducible.
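For custom pipelines, that provenance record can be as simple as a JSON file saved next to the model artifact. A sketch with illustrative field names (adapt them to your tracking setup; the commit hash shown is a placeholder):

```python
import hashlib
import json

def lineage_record(git_commit, data_bytes, hyperparams):
    """Bundle everything needed to reproduce a training run."""
    return {
        "git_commit": git_commit,                         # exact code version
        "data_md5": hashlib.md5(data_bytes).hexdigest(),  # exact dataset version
        "hyperparams": hyperparams,                       # exact configuration
    }

record = lineage_record(
    git_commit="3f9a2c1",  # e.g. from `git rev-parse --short HEAD`
    data_bytes=b"feature_1,label\n0.5,1\n",
    hyperparams={"lr": 0.01, "epochs": 10},
)
# Saved alongside models/model.pkl; reproducing the model means checking out
# the commit, pulling the dataset with that hash, and re-running with these
# hyperparameters.
print(json.dumps(record, indent=2))
```

Writing this record at training time costs a few lines of code; reconstructing it after the fact, once the data has moved on, is often impossible.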