Data Versioning: Git for Your Data
Code versioning with Git is standard practice, but data changes too. A model trained on January's data produces different results than one trained on March's data. If you cannot reproduce the exact dataset a model was trained on, you cannot debug production issues, compare experiments fairly, or satisfy audit requirements.
This lesson covers DVC (the most popular open-source data versioning tool), LakeFS, and best practices for data reproducibility.
Why Git Cannot Version Data
Git was designed for small text files. It keeps every version of every file in the repository history, so committing a multi-gigabyte dataset bloats the repo permanently, and every clone downloads that full history. Git also diffs line by line, which is meaningless for binary formats like Parquet, and most hosting services reject large files outright (GitHub caps individual files at 100 MB). The standard solution is to keep data in object storage and version lightweight pointers to it in Git.
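A back-of-envelope calculation shows how quickly repository size compounds when every data version is stored whole (the sizes here are hypothetical, chosen only to illustrate the scale):

```python
# Hypothetical scenario: a 500 MB training set, revised weekly for a year.
# Git cannot delta-compress most binary formats well, so each revision is
# effectively stored in full -- and every clone downloads all of them.
dataset_mb = 500
versions_per_year = 52

repo_size_gb = dataset_mb * versions_per_year / 1024
print(f"Repo size after one year: ~{repo_size_gb:.1f} GB")  # ~25.4 GB
```

A pointer-based tool stores one small text file per version in Git instead, keeping the repository kilobytes in size regardless of how large the data grows.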
DVC (Data Version Control)
DVC is an open-source tool built on top of Git that adds data and model versioning, pipeline management, and experiment tracking.
Core Concepts
| Command | Description |
|---|---|
| `dvc init` | Initialize DVC in a Git repository |
| `dvc add` | Track a data file or directory (creates a `.dvc` file) |
| `dvc push` | Upload data to remote storage (S3, GCS, Azure, SSH) |
| `dvc pull` | Download data from remote storage |
| `dvc checkout` | Sync local data with the current Git commit's `.dvc` files |
| `dvc repro` | Reproduce a pipeline (re-run only changed stages) |
| `dvc dag` | Visualize the pipeline as a directed acyclic graph |
How DVC Tracking Works
Running `dvc add data/training.csv` creates `data/training.csv.dvc`:

```yaml
outs:
- md5: a1b2c3d4e5f6...
  size: 52428800
  path: training.csv
```

It also adds `data/training.csv` to `.gitignore`, so the raw data never enters Git. You commit the small `.dvc` file to Git; DVC stores the actual data in its cache and remote storage.
When you change the data and run `dvc add` again, the hash changes. Committing the new `.dvc` file creates a new version. You can switch between versions with `git checkout` followed by `dvc checkout`.
```python
# === DVC Workflow Simulation ===
# (Full DVC requires the CLI; this simulates the concepts in Python)

import hashlib
import json
import os

import numpy as np
import pandas as pd


class SimpleDVC:
    """Minimal DVC simulation for learning purposes."""

    def __init__(self, repo_dir="/tmp/dvc_demo"):
        self.repo_dir = repo_dir
        self.cache_dir = os.path.join(repo_dir, ".dvc_cache")
        self.tracking_dir = os.path.join(repo_dir, "tracking")
        os.makedirs(self.cache_dir, exist_ok=True)
        os.makedirs(self.tracking_dir, exist_ok=True)
        self.versions = {}  # Simulated Git commits

    def _compute_hash(self, data_bytes):
        return hashlib.md5(data_bytes).hexdigest()

    def add(self, name, dataframe):
        """Track a DataFrame (like 'dvc add')."""
        data_bytes = dataframe.to_csv(index=False).encode()
        data_hash = self._compute_hash(data_bytes)

        # Store data in cache (keyed by hash)
        cache_path = os.path.join(self.cache_dir, data_hash)
        with open(cache_path, "wb") as f:
            f.write(data_bytes)

        # Create .dvc tracking file
        dvc_meta = {
            "name": name,
            "md5": data_hash,
            "size": len(data_bytes),
            "rows": len(dataframe),
            "columns": list(dataframe.columns),
        }

        tracking_path = os.path.join(self.tracking_dir, f"{name}.dvc")
        with open(tracking_path, "w") as f:
            json.dump(dvc_meta, f, indent=2)

        print(f"Tracked '{name}': hash={data_hash[:12]}..., "
              f"size={len(data_bytes)/1024:.1f}KB, "
              f"rows={len(dataframe)}")
        return data_hash

    def commit(self, version_name):
        """Simulate git commit (snapshot all .dvc files)."""
        snapshot = {}
        for fname in os.listdir(self.tracking_dir):
            with open(os.path.join(self.tracking_dir, fname)) as f:
                snapshot[fname] = json.load(f)
        self.versions[version_name] = snapshot
        print(f"Committed version '{version_name}' "
              f"({len(snapshot)} tracked files)")

    def checkout(self, version_name):
        """Restore .dvc files from a version."""
        if version_name not in self.versions:
            print(f"Version '{version_name}' not found!")
            return None
        snapshot = self.versions[version_name]
        print(f"Checked out '{version_name}':")
        for name, meta in snapshot.items():
            print(f"  {meta['name']}: hash={meta['md5'][:12]}..., "
                  f"rows={meta['rows']}")
        return snapshot

    def get_data(self, version_name, dataset_name):
        """Retrieve actual data for a version (like 'dvc pull')."""
        snapshot = self.versions.get(version_name, {})
        dvc_file = f"{dataset_name}.dvc"
        if dvc_file not in snapshot:
            return None
        data_hash = snapshot[dvc_file]["md5"]
        cache_path = os.path.join(self.cache_dir, data_hash)
        return pd.read_csv(cache_path)

    def diff(self, v1, v2):
        """Compare two versions."""
        s1 = self.versions.get(v1, {})
        s2 = self.versions.get(v2, {})
        print(f"\nDiff: '{v1}' vs '{v2}'")
        all_files = set(list(s1.keys()) + list(s2.keys()))
        for f in sorted(all_files):
            m1 = s1.get(f, {})
            m2 = s2.get(f, {})
            if m1.get("md5") == m2.get("md5"):
                print(f"  {f}: unchanged")
            elif f not in s1:
                print(f"  {f}: ADDED (rows={m2.get('rows')})")
            elif f not in s2:
                print(f"  {f}: DELETED")
            else:
                print(f"  {f}: CHANGED "
                      f"(rows: {m1.get('rows')} -> {m2.get('rows')})")


# === Demo ===
dvc = SimpleDVC()

# Version 1: Initial dataset
np.random.seed(42)
df_v1 = pd.DataFrame({
    "feature_1": np.random.randn(1000),
    "feature_2": np.random.randn(1000),
    "label": np.random.randint(0, 2, 1000),
})
dvc.add("training_data", df_v1)
dvc.commit("v1-initial")

# Version 2: More data collected
df_v2 = pd.concat([df_v1, pd.DataFrame({
    "feature_1": np.random.randn(500),
    "feature_2": np.random.randn(500),
    "label": np.random.randint(0, 2, 500),
})], ignore_index=True)
dvc.add("training_data", df_v2)
dvc.commit("v2-more-data")

# Version 3: Data cleaned
mask = df_v2["feature_1"].abs() < 3  # Remove outliers
df_v3 = df_v2[mask].reset_index(drop=True)
dvc.add("training_data", df_v3)
dvc.commit("v3-cleaned")

# Compare versions
dvc.diff("v1-initial", "v3-cleaned")

# Checkout and retrieve old data
dvc.checkout("v1-initial")
old_data = dvc.get_data("v1-initial", "training_data")
print(f"\nRetrieved v1 data: {old_data.shape}")
```

DVC Pipelines
DVC pipelines define multi-stage ML workflows. Each stage specifies its dependencies (inputs), outputs, and the command to run. DVC tracks all of these and can intelligently re-run only the stages that need updating.
```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - results.json:
          cache: false
```
Running `dvc repro` re-executes only the stages whose dependencies have changed, which saves time in iterative development.
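The skip logic can be sketched in a few lines: a stage re-runs only when the combined hash of its dependencies differs from the hash recorded on the last run. This is a simplification of DVC's actual lock-file mechanism, and the function names here are illustrative:

```python
import hashlib

def fingerprint(deps: dict) -> str:
    """Hash the contents of all dependencies together."""
    h = hashlib.md5()
    for name in sorted(deps):
        h.update(name.encode())
        h.update(deps[name].encode())
    return h.hexdigest()

def repro(stage, deps, lock, run):
    """Re-run `stage` only if its dependency fingerprint changed."""
    fp = fingerprint(deps)
    if lock.get(stage) == fp:
        print(f"Stage '{stage}' didn't change, skipping")
        return False
    run()
    lock[stage] = fp  # record the fingerprint, like dvc.lock does
    return True

lock = {}
deps = {"src/preprocess.py": "v1", "data/raw.csv": "a,b\n1,2"}
repro("preprocess", deps, lock, lambda: print("running preprocess"))
repro("preprocess", deps, lock, lambda: print("running preprocess"))  # skipped
deps["data/raw.csv"] = "a,b\n1,2\n3,4"  # data changed
repro("preprocess", deps, lock, lambda: print("running preprocess"))  # re-runs
```

Real DVC fingerprints file contents on disk and chains stages through the DAG, so a change in `data/raw.csv` also invalidates every downstream stage that consumes `data/processed.csv`.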
LakeFS: Git-like Branching for Data
LakeFS provides Git-like operations (branch, commit, merge, diff) directly on your data lake (S3, GCS, Azure Blob). Unlike DVC, which tracks pointer files in a Git repository, LakeFS operates at the storage layer itself.
| Feature | DVC | LakeFS |
|---|---|---|
| Approach | Track metadata in Git | Git-like operations on object storage |
| Branching | Via Git branches | Native data branches |
| Merge | Manual | Automatic merge with conflict detection |
| Atomicity | File-level | Commit-level (all or nothing) |
| Best for | ML teams, small-medium data | Data engineering, large data lakes |
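The branching model can be illustrated with a toy in-memory object store. This is a simulation for intuition only; the real system exposes these operations through an S3-compatible API and the `lakectl` CLI, and tracks merge bases per commit rather than per branch as done here:

```python
class ToyLakeFS:
    """Toy in-memory model of LakeFS-style branching (illustration only)."""

    def __init__(self):
        self.branches = {"main": {}}  # branch -> {object key: content}
        self.bases = {}               # branch -> snapshot at branch time

    def branch(self, name, source="main"):
        # Branching is a metadata operation: no objects are copied
        snapshot = dict(self.branches[source])
        self.branches[name] = dict(snapshot)
        self.bases[name] = snapshot

    def put(self, branch, key, content):
        self.branches[branch][key] = content

    def merge(self, source, dest):
        # Three-way merge against the snapshot taken at branch time;
        # a key changed on BOTH sides to different values is a conflict
        src, dst = self.branches[source], self.branches[dest]
        base = self.bases[source]
        changed = {k: v for k, v in src.items() if base.get(k) != v}
        conflicts = [k for k in changed
                     if dst.get(k) != base.get(k) and dst.get(k) != changed[k]]
        if conflicts:
            raise ValueError(f"merge conflicts on: {conflicts}")
        dst.update(changed)

lake = ToyLakeFS()
lake.put("main", "data/train.csv", "v1")
lake.branch("experiment")                       # instant, zero-copy
lake.put("experiment", "data/train.csv", "v2")  # main is unaffected
print(lake.branches["main"]["data/train.csv"])  # -> v1
lake.merge("experiment", "main")
print(lake.branches["main"]["data/train.csv"])  # -> v2
```

The key property is that writing on `experiment` never touches `main`: readers in production keep seeing the old data until the merge commits the change atomically, and a failed experiment is discarded by simply deleting the branch.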