
Data Versioning

DVC (tracking data, pipelines, experiments), LakeFS (git-like branching for data), data lineage, reproducibility, and dataset registries

~40 min

Data Versioning: Git for Your Data

Code versioning with Git is standard practice, but data changes too. A model trained on January's data produces different results than one trained on March's data. If you cannot reproduce the exact dataset a model was trained on, you cannot debug production issues, compare experiments fairly, or satisfy audit requirements.

This lesson covers DVC (the most popular open-source data versioning tool), LakeFS, and best practices for data reproducibility.

Why Git Cannot Version Data

Git stores a full copy of every version of every file, and large binary files neither diff nor compress well. A 5GB dataset with 100 versions would therefore require roughly 500GB of Git storage (and make clone, push, and pull unusable). DVC solves this by storing data in external storage (S3, GCS, or a local cache) and tracking only lightweight metadata files (`.dvc`) in Git. The `.dvc` file contains a hash pointer to the actual data, similar to Git LFS but designed for ML workflows.
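The pointer idea can be sketched in a few lines of Python: content is stored once per unique hash, and only the tiny pointer is versioned. (A simplified illustration of content addressing, not DVC's actual on-disk format; `track` and `store` are hypothetical names.)

```python
import hashlib

# Content-addressed store: data is keyed by its hash, so identical
# content is stored exactly once no matter how often it is tracked.
store = {}

def track(data: bytes) -> dict:
    """Return a lightweight pointer (like a .dvc file) for `data`."""
    digest = hashlib.md5(data).hexdigest()
    store[digest] = data                        # "upload" to external storage
    return {"md5": digest, "size": len(data)}   # this small dict goes into Git

pointer_v1 = track(b"row1,row2,row3")
pointer_v2 = track(b"row1,row2,row3,row4")

# Git only ever sees the tiny pointers; the store holds the bytes.
print(pointer_v1["md5"] != pointer_v2["md5"])  # content changed -> new hash
print(store[pointer_v1["md5"]])                # old version still retrievable
```

Because the store is keyed by hash, re-tracking unchanged data costs nothing, and every historical version remains retrievable through its pointer.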

DVC (Data Version Control)

DVC is an open-source tool built on top of Git that adds data and model versioning, pipeline management, and experiment tracking.

Core Concepts

| Command | Description |
| --- | --- |
| `dvc init` | Initialize DVC in a Git repository |
| `dvc add` | Track a data file or directory (creates a `.dvc` file) |
| `dvc push` | Upload data to remote storage (S3, GCS, Azure, SSH) |
| `dvc pull` | Download data from remote storage |
| `dvc checkout` | Sync local data with the current Git commit's `.dvc` files |
| `dvc repro` | Reproduce a pipeline (re-run only changed stages) |
| `dvc dag` | Visualize the pipeline as a directed acyclic graph |

How DVC Tracking Works

```bash
$ dvc add data/training.csv
```

This creates `data/training.csv.dvc`:

```yaml
outs:
- md5: a1b2c3d4e5f6...
  size: 52428800
  path: training.csv
```

It also adds `data/training.csv` to `.gitignore`, so the data itself never enters Git history. You commit the small `.dvc` file to Git; DVC stores the actual data in its cache and, after `dvc push`, in remote storage.

When you change the data and run `dvc add` again, the hash changes. Committing the new `.dvc` file creates a new version. You can switch between versions with `git checkout <commit> && dvc checkout`.

```python
# === DVC Workflow Simulation ===
# (Full DVC requires the CLI; this simulates the concepts in Python)

import hashlib
import json
import os

import numpy as np
import pandas as pd

class SimpleDVC:
    """Minimal DVC simulation for learning purposes."""

    def __init__(self, repo_dir="/tmp/dvc_demo"):
        self.repo_dir = repo_dir
        self.cache_dir = os.path.join(repo_dir, ".dvc_cache")
        self.tracking_dir = os.path.join(repo_dir, "tracking")
        os.makedirs(self.cache_dir, exist_ok=True)
        os.makedirs(self.tracking_dir, exist_ok=True)
        self.versions = {}  # Simulated Git commits

    def _compute_hash(self, data_bytes):
        return hashlib.md5(data_bytes).hexdigest()

    def add(self, name, dataframe):
        """Track a DataFrame (like 'dvc add')."""
        data_bytes = dataframe.to_csv(index=False).encode()
        data_hash = self._compute_hash(data_bytes)

        # Store data in cache (keyed by hash)
        cache_path = os.path.join(self.cache_dir, data_hash)
        with open(cache_path, "wb") as f:
            f.write(data_bytes)

        # Create .dvc tracking file
        dvc_meta = {
            "name": name,
            "md5": data_hash,
            "size": len(data_bytes),
            "rows": len(dataframe),
            "columns": list(dataframe.columns),
        }

        tracking_path = os.path.join(self.tracking_dir, f"{name}.dvc")
        with open(tracking_path, "w") as f:
            json.dump(dvc_meta, f, indent=2)

        print(f"Tracked '{name}': hash={data_hash[:12]}..., "
              f"size={len(data_bytes)/1024:.1f}KB, "
              f"rows={len(dataframe)}")
        return data_hash

    def commit(self, version_name):
        """Simulate git commit (snapshot all .dvc files)."""
        snapshot = {}
        for fname in os.listdir(self.tracking_dir):
            with open(os.path.join(self.tracking_dir, fname)) as f:
                snapshot[fname] = json.load(f)
        self.versions[version_name] = snapshot
        print(f"Committed version '{version_name}' "
              f"({len(snapshot)} tracked files)")

    def checkout(self, version_name):
        """Restore .dvc files from a version."""
        if version_name not in self.versions:
            print(f"Version '{version_name}' not found!")
            return None
        snapshot = self.versions[version_name]
        print(f"Checked out '{version_name}':")
        for name, meta in snapshot.items():
            print(f"  {meta['name']}: hash={meta['md5'][:12]}..., "
                  f"rows={meta['rows']}")
        return snapshot

    def get_data(self, version_name, dataset_name):
        """Retrieve actual data for a version (like 'dvc pull')."""
        snapshot = self.versions.get(version_name, {})
        dvc_file = f"{dataset_name}.dvc"
        if dvc_file not in snapshot:
            return None
        data_hash = snapshot[dvc_file]["md5"]
        cache_path = os.path.join(self.cache_dir, data_hash)
        return pd.read_csv(cache_path)

    def diff(self, v1, v2):
        """Compare two versions."""
        s1 = self.versions.get(v1, {})
        s2 = self.versions.get(v2, {})
        print(f"\nDiff: '{v1}' vs '{v2}'")
        all_files = set(list(s1.keys()) + list(s2.keys()))
        for f in sorted(all_files):
            m1 = s1.get(f, {})
            m2 = s2.get(f, {})
            if m1.get("md5") == m2.get("md5"):
                print(f"  {f}: unchanged")
            elif f not in s1:
                print(f"  {f}: ADDED (rows={m2.get('rows')})")
            elif f not in s2:
                print(f"  {f}: DELETED")
            else:
                print(f"  {f}: CHANGED "
                      f"(rows: {m1.get('rows')} -> {m2.get('rows')})")


# === Demo ===
dvc = SimpleDVC()

# Version 1: Initial dataset
np.random.seed(42)
df_v1 = pd.DataFrame({
    "feature_1": np.random.randn(1000),
    "feature_2": np.random.randn(1000),
    "label": np.random.randint(0, 2, 1000),
})
dvc.add("training_data", df_v1)
dvc.commit("v1-initial")

# Version 2: More data collected
df_v2 = pd.concat([df_v1, pd.DataFrame({
    "feature_1": np.random.randn(500),
    "feature_2": np.random.randn(500),
    "label": np.random.randint(0, 2, 500),
})], ignore_index=True)
dvc.add("training_data", df_v2)
dvc.commit("v2-more-data")

# Version 3: Data cleaned
mask = df_v2["feature_1"].abs() < 3  # Remove outliers
df_v3 = df_v2[mask].reset_index(drop=True)
dvc.add("training_data", df_v3)
dvc.commit("v3-cleaned")

# Compare versions
dvc.diff("v1-initial", "v3-cleaned")

# Checkout and retrieve old data
dvc.checkout("v1-initial")
old_data = dvc.get_data("v1-initial", "training_data")
print(f"\nRetrieved v1 data: {old_data.shape}")
```

DVC Pipelines

DVC pipelines define multi-stage ML workflows. Each stage specifies its dependencies (inputs), its outputs, and the command to run. DVC tracks all of these and can intelligently re-run only the stages that need updating.

```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw.csv
    outs:
      - data/processed.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed.csv
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test.csv
    metrics:
      - results.json:
          cache: false
```

Running `dvc repro` re-executes only the stages whose dependencies have changed, which saves time in iterative development.
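The skip logic behind `dvc repro` boils down to dependency hashing: a stage re-runs only if the hash of any dependency differs from the hash recorded after the last successful run. A simplified sketch of that check (real DVC also tracks outputs and parameters, and persists hashes in `dvc.lock`; `needs_rerun` and `lock` are illustrative names):

```python
import hashlib

def file_hash(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

# Hashes recorded after the last successful run of the stage.
lock = {
    "src/train.py": file_hash(b"train v1"),
    "data/processed.csv": file_hash(b"rows v1"),
}

def needs_rerun(deps: dict) -> bool:
    """A stage is stale if any dependency's current hash differs from the lock."""
    return any(lock.get(path) != file_hash(content)
               for path, content in deps.items())

# Nothing changed -> stage is skipped
print(needs_rerun({"src/train.py": b"train v1",
                   "data/processed.csv": b"rows v1"}))  # False

# A dependency changed -> stage re-runs
print(needs_rerun({"src/train.py": b"train v1",
                   "data/processed.csv": b"rows v2"}))  # True
```

Because the check is per-stage, editing only `src/evaluate.py` would leave `preprocess` and `train` untouched while `evaluate` re-runs.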

LakeFS: Git-like Branching for Data

LakeFS provides Git-like operations (branch, commit, merge, diff) directly on your data lake (S3, GCS, Azure Blob). Unlike DVC, which tracks pointers in a Git repository, LakeFS operates at the storage layer itself.

| Feature | DVC | LakeFS |
| --- | --- | --- |
| Approach | Track metadata in Git | Git-like operations on object storage |
| Branching | Via Git branches | Native data branches |
| Merge | Manual | Automatic merge with conflict detection |
| Atomicity | File-level | Commit-level (all or nothing) |
| Best for | ML teams, small-medium data | Data engineering, large data lakes |
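A data branch is cheap because it is just a new pointer map over the same stored objects; writes on the branch never touch the source branch until a merge. A minimal copy-on-write sketch of that idea (illustrative only, not the LakeFS API; the merge here naively takes the source's pointers and omits conflict detection):

```python
import hashlib

objects = {}                 # content-addressed object store (shared by all branches)
branches = {"main": {}}      # branch name -> {object key: content hash}

def put(branch, key, data):
    """Write an object on a branch; other branches keep their old pointer."""
    digest = hashlib.sha1(data).hexdigest()
    objects[digest] = data
    branches[branch][key] = digest

def create_branch(new, source):
    # Zero-copy branching: duplicate the small pointer map, not the data.
    branches[new] = dict(branches[source])

def merge(source, dest):
    # All-or-nothing: apply every source pointer to dest in one step.
    branches[dest].update(branches[source])

put("main", "raw/events.csv", b"a,b\n1,2\n")
create_branch("experiment", "main")
put("experiment", "raw/events.csv", b"a,b\n1,2\n3,4\n")  # invisible to main

print(branches["main"]["raw/events.csv"]
      == branches["experiment"]["raw/events.csv"])       # False: branches diverged
merge("experiment", "main")
print(objects[branches["main"]["raw/events.csv"]])       # main now sees the new data
```

This is why LakeFS merges are atomic at the commit level: a merge swaps pointer maps, so readers see either the old state or the new state, never a half-applied mix.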

Data Lineage

Data lineage tracks the full provenance chain: which raw data was transformed by which code to produce which features that trained which model. DVC pipelines provide lineage automatically. For custom pipelines, log the Git commit hash, the data version hash, and all hyperparameters alongside your model artifacts. This makes any model fully reproducible.
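For custom pipelines, that provenance record can be as simple as a JSON file saved next to the model artifact. A sketch with illustrative field names (adapt them to your tracking setup; the commit hash shown is a placeholder):

```python
import hashlib
import json

def lineage_record(git_commit, data_bytes, hyperparams):
    """Bundle everything needed to reproduce a training run."""
    return {
        "git_commit": git_commit,                         # exact code version
        "data_md5": hashlib.md5(data_bytes).hexdigest(),  # exact dataset version
        "hyperparams": hyperparams,                       # exact configuration
    }

record = lineage_record(
    git_commit="3f9a2c1",  # e.g. from `git rev-parse --short HEAD`
    data_bytes=b"feature_1,label\n0.5,1\n",
    hyperparams={"lr": 0.01, "epochs": 10},
)
# Saved alongside models/model.pkl; reproducing the model means checking out
# the commit, pulling the dataset with that hash, and re-running with these
# hyperparameters.
print(json.dumps(record, indent=2))
```

Writing this record at training time costs a few lines of code; reconstructing it after the fact, once the data has moved on, is often impossible.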