Caching & Skip Detection¶
Pivot's cache ensures you never re-run a stage whose inputs haven't changed. It is content-addressed (identical file contents share one cache entry), per-file (not per-directory), and uses a three-tier skip algorithm to reach a decision as cheaply as possible.
Content-Addressable Cache¶
Every output file is hashed with xxhash64 and stored under
.pivot/cache/<hash[:2]>/<hash[2:]> (a two-level directory structure using
the first two hex characters as a prefix). The per-stage lock file records a
manifest mapping each output path to its content hash.
.pivot/
    cache/
        a1/
            b2c3d4e5f6a7b8    # cached file (content-addressed)
        ...
    stages/
        clean.lock            # lock file for "clean" stage
        train.lock            # lock file for "train" stage
    state.db                  # LMDB database
Because the cache is content-addressed:
- Identical files are stored once, regardless of which stage produced them
- Reverting a parameter change can restore outputs from cache without re-executing
- The cache is shared across all pipelines in a project
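The layout and the deduplication property above can be illustrated with a minimal sketch. It is not Pivot's implementation: `content_hash`, `cache_path`, and `store_file` are hypothetical names, and a 64-bit `blake2b` digest stands in for xxhash64 so that the example needs only the standard library.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return a 16-hex-character digest of the file's bytes.

    Stand-in: blake2b truncated to 64 bits. The text says Pivot uses xxhash64.
    """
    digest = hashlib.blake2b(digest_size=8)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_root: Path, file_hash: str) -> Path:
    """Map a hash to <cache_root>/<hash[:2]>/<hash[2:]>."""
    return cache_root / file_hash[:2] / file_hash[2:]


def store_file(cache_root: Path, path: Path) -> str:
    """Copy a file into the cache unless its content is already stored."""
    file_hash = content_hash(path)
    target = cache_path(cache_root, file_hash)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return file_hash


# Demo: two files with identical content occupy a single cache entry.
workdir = Path(tempfile.mkdtemp())
cache_root = workdir / ".pivot" / "cache"
(workdir / "a.csv").write_text("x,y\n1,2\n")
(workdir / "b.csv").write_text("x,y\n1,2\n")
hash_a = store_file(cache_root, workdir / "a.csv")
hash_b = store_file(cache_root, workdir / "b.csv")
cached_files = [p for p in cache_root.rglob("*") if p.is_file()]
```

Because the cache key depends only on the bytes, the second `store_file` call finds the entry already present and copies nothing.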
Per-Stage Lock Files¶
Each stage has a .lock file that records everything about its last
successful run:
| Lock file field | Contents |
|---|---|
| code_manifest | Fingerprint dict (key → hash) |
| params | Serialised parameters dict |
| dep_hashes | Input file hashes (path → xxhash64) |
| output_hashes | Output file hashes (path → xxhash64) |
This is the ground truth for "what did the stage look like last time it ran?" The skip algorithm compares current state against these fields.
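To make the shape of that record concrete, here is a hedged sketch of the four fields as a Python dataclass. The class name `LockRecord` and the JSON serialisation are assumptions for illustration only; the text does not state the on-disk format of a lock file.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class LockRecord:
    """Hypothetical in-memory view of one stage's lock file."""

    code_manifest: dict[str, str] = field(default_factory=dict)   # key -> hash
    params: dict[str, object] = field(default_factory=dict)       # serialised parameters
    dep_hashes: dict[str, str] = field(default_factory=dict)      # input path -> hash
    output_hashes: dict[str, str] = field(default_factory=dict)   # output path -> hash

    def dumps(self) -> str:
        # sort_keys keeps the serialised form stable across runs (JSON is an assumption).
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def loads(cls, text: str) -> "LockRecord":
        return cls(**json.loads(text))


# Demo values are made up; real hashes would be xxhash64 digests.
record = LockRecord(
    code_manifest={"func:train": "aa11"},
    params={"learning_rate": 0.01},
    dep_hashes={"clean.csv": "bb22"},
    output_hashes={"model.pkl": "cc33"},
)
round_tripped = LockRecord.loads(record.dumps())
```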
Three-Tier Skip Algorithm¶
When a stage is about to execute, the worker decides whether to skip it. The algorithm is designed to answer as quickly as possible, falling through to more expensive checks only when needed.
Tier 1: Generation Tracking — O(1)¶
StateDB maintains a monotonic generation counter for every output
file. When a stage runs and produces output.csv, the generation for
that path is incremented. The stage also records the generation of each
dependency at the time it ran.
On the next run, the worker checks: "are my dependency generations the same as when I last ran?" This is a handful of integer comparisons — no file I/O at all.
If generations match and the code fingerprint and params haven't changed, the stage is skipped immediately.
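The generation part of this check can be sketched with a toy in-memory tracker. `GenerationTracker` and its method names are hypothetical, plain dicts stand in for StateDB, and the code fingerprint and parameter comparison that tier 1 also requires is left out to keep the example short.

```python
from __future__ import annotations


class GenerationTracker:
    """Toy in-memory stand-in for the generation data the text places in StateDB."""

    def __init__(self) -> None:
        self._generation: dict[str, int] = {}        # output path -> counter
        self._seen: dict[str, dict[str, int]] = {}   # stage -> dep path -> generation at last run

    def bump(self, output_path: str) -> int:
        """Increment the counter for an output that was just written."""
        self._generation[output_path] = self._generation.get(output_path, 0) + 1
        return self._generation[output_path]

    def record_run(self, stage: str, deps: list[str]) -> None:
        """Remember the generation of each dependency as of this run."""
        self._seen[stage] = {dep: self._generation.get(dep, 0) for dep in deps}

    def deps_unchanged(self, stage: str, deps: list[str]) -> bool:
        """True only if the stage ran before and every dep generation matches."""
        seen = self._seen.get(stage)
        if seen is None or set(seen) != set(deps):
            return False
        return all(self._generation.get(dep, 0) == gen for dep, gen in seen.items())


tracker = GenerationTracker()
tracker.bump("clean.csv")                      # the "clean" stage writes clean.csv
tracker.record_run("train", ["clean.csv"])     # "train" runs against that generation
first_check = tracker.deps_unchanged("train", ["clean.csv"])

tracker.bump("clean.csv")                      # "clean" re-runs and rewrites its output
second_check = tracker.deps_unchanged("train", ["clean.csv"])
```

The check touches only integers already held in memory, which is why it can answer without hashing or reading any dependency.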
Tier 2: Lock File Comparison — O(n)¶
If generation tracking can't confirm a skip (first run, StateDB cleared, generations diverged), the worker falls through to a full comparison:
- Recompute the code fingerprint manifest
- Serialise current parameters
- Hash all dependency files (xxhash64, with StateDB-cached results)
- Compare all three against the lock file
If everything matches, the stage is skipped. The worker also restores any missing output files from the cache at this point.
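A minimal sketch of that comparison follows, assuming the lock file has already been loaded into a plain dict with the field names from the table earlier. `lock_mismatches` and `missing_outputs` are hypothetical helpers, and the hash values are made up.

```python
from __future__ import annotations


def lock_mismatches(lock: dict, code_manifest: dict, params: dict, dep_hashes: dict) -> list[str]:
    """Return the names of lock fields that differ from the current state."""
    current = {
        "code_manifest": code_manifest,
        "params": params,
        "dep_hashes": dep_hashes,
    }
    return [name for name, value in current.items() if lock.get(name) != value]


def missing_outputs(lock: dict, existing_paths: set[str]) -> list[str]:
    """Outputs recorded in the lock that would need restoring from cache."""
    return sorted(p for p in lock.get("output_hashes", {}) if p not in existing_paths)


lock = {
    "code_manifest": {"func:train": "aa11"},
    "params": {"learning_rate": 0.01},
    "dep_hashes": {"clean.csv": "bb22"},
    "output_hashes": {"model.pkl": "cc33"},
}

# Nothing changed: the stage can be skipped.
unchanged = lock_mismatches(lock, {"func:train": "aa11"}, {"learning_rate": 0.01}, {"clean.csv": "bb22"})

# One parameter changed: tier 2 reports a mismatch and falls through to tier 3.
changed = lock_mismatches(lock, {"func:train": "aa11"}, {"learning_rate": 0.005}, {"clean.csv": "bb22"})

# The stage is up to date but model.pkl was deleted from the workspace.
to_restore = missing_outputs(lock, existing_paths=set())
```

Returning the list of mismatched fields, rather than a bare boolean, is one way a tool could also explain why a stage is stale.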
Tier 3: Run Cache — O(1) lookup¶
If tier 2 says the stage should run (something changed), there's one more chance to skip. Pivot computes an input hash from the combination of:
- Code fingerprint
- Current parameters
- Dependency hashes
- Output path specs
This input hash is looked up in the run cache
(runcache:<stage>:<input_hash> in StateDB). If a previous run with
identical inputs exists, Pivot restores its outputs from cache and writes
a new lock file — without executing the stage function.
This handles cases like: "I changed a parameter, ran the pipeline, then changed it back." The original outputs are still in cache and can be restored.
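The revert scenario can be traced with a small sketch. Only the key shape `runcache:<stage>:<input_hash>` comes from the text; the canonical-JSON encoding, the use of `sha256` as the digest, the helper names, and the dict standing in for StateDB are all assumptions.

```python
from __future__ import annotations

import hashlib
import json


def input_hash(code_manifest: dict, params: dict, dep_hashes: dict, output_paths: list[str]) -> str:
    """Digest the four inputs in a canonical JSON form (stand-in hash: sha256)."""
    payload = json.dumps(
        {"code": code_manifest, "params": params, "deps": dep_hashes, "outs": sorted(output_paths)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_cache_key(stage: str, digest: str) -> str:
    """Key shape taken from the text: runcache:<stage>:<input_hash>."""
    return f"runcache:{stage}:{digest}"


run_cache: dict[str, dict[str, str]] = {}   # key -> output path -> content hash

code = {"func:train": "aa11"}
deps = {"clean.csv": "bb22"}
outs = ["model.pkl"]

# Run 1 uses learning_rate=0.01, run 2 uses 0.005; each records its outputs.
key_a = run_cache_key("train", input_hash(code, {"learning_rate": 0.01}, deps, outs))
run_cache[key_a] = {"model.pkl": "cc33"}
key_b = run_cache_key("train", input_hash(code, {"learning_rate": 0.005}, deps, outs))
run_cache[key_b] = {"model.pkl": "dd44"}

# Reverting to 0.01 reproduces the first key, so the lookup hits.
key_reverted = run_cache_key("train", input_hash(code, {"learning_rate": 0.01}, deps, outs))
restored = run_cache.get(key_reverted)
```

The entry stores output hashes rather than file contents, so a hit still relies on those hashes being present in the content-addressed cache.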
Decision Flow¶
┌─────────────────────────┐
│ Tier 1: Generation check│──match──→ SKIP
│ (O(1), no file I/O)     │
└────────┬────────────────┘
         │ miss
┌────────▼────────────────┐
│ Tier 2: Lock comparison │──match──→ SKIP
│ (O(n), hash deps)       │
└────────┬────────────────┘
         │ changed
┌────────▼────────────────┐
│ Tier 3: Run cache       │──hit────→ SKIP (restore outputs)
│ (O(1), StateDB lookup)  │
└────────┬────────────────┘
         │ miss
         ▼
      EXECUTE
File Change Detection¶
Pivot uses a two-level strategy for detecting file changes:
- Stat check: compare (mtime_ns, size, inode) against the StateDB cache. If all three match, reuse the cached hash without reading the file.
- Content hash: if the stat differs, read the file and compute its xxhash64. Store the new stat and hash in StateDB for next time.
This gives O(1) skip detection for unchanged files while remaining correct when files do change (content hash is the source of truth).
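The two levels can be sketched as a memoised hash function. This is an illustration under assumptions: a module-level dict stands in for StateDB, `blake2b` stands in for xxhash64, and `file_hash` is a hypothetical name. The `reads` counter exists only to show when the file content is actually read.

```python
from __future__ import annotations

import hashlib
import os
import tempfile
from pathlib import Path

# path -> ((mtime_ns, size, inode), hash); a dict stands in for StateDB here.
stat_cache: dict[str, tuple[tuple[int, int, int], str]] = {}
reads = 0  # number of full content reads, tracked for the demo only


def file_hash(path: Path) -> str:
    """Return the content hash, reusing the cached value when the stat matches."""
    global reads
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size, st.st_ino)
    cached = stat_cache.get(str(path))
    if cached is not None and cached[0] == key:
        return cached[1]                      # level 1: stat matches, no read
    reads += 1                                # level 2: read and hash the content
    digest = hashlib.blake2b(path.read_bytes(), digest_size=8).hexdigest()
    stat_cache[str(path)] = (key, digest)
    return digest


path = Path(tempfile.mkdtemp()) / "data.csv"
path.write_text("a,b\n1,2\n")
h1 = file_hash(path)      # first call reads the file
h2 = file_hash(path)      # second call is answered from the stat cache

path.write_text("a,b\n1,2\n3,4\n")   # size changes, so the stat no longer matches
h3 = file_hash(path)      # forces a re-read and a new hash
```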
Checkout Modes¶
When restoring outputs from cache, Pivot supports three strategies, tried in order of preference:
| Mode | How it works | Trade-off |
|---|---|---|
| hardlink | Hard link to cache file | Zero copy, but editing the file modifies the cache |
| symlink | Symbolic link to cache file | Zero copy, visibly a link |
| copy | Full file copy | Safe but uses disk space |
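Trying the modes in order of preference can be sketched as a simple fallback loop. The function name `checkout` and its `modes` parameter are hypothetical and do not describe Pivot's internals; the sketch assumes a mode is abandoned whenever the operating system raises `OSError` (for example, a hard link across devices).

```python
from __future__ import annotations

import os
import shutil
import tempfile
from pathlib import Path


def checkout(cache_file: Path, dest: Path, modes: tuple[str, ...] = ("hardlink", "symlink", "copy")) -> str:
    """Materialise cache_file at dest and return the mode that succeeded."""
    if os.path.lexists(dest):
        raise FileExistsError(dest)           # this sketch never overwrites
    dest.parent.mkdir(parents=True, exist_ok=True)
    for mode in modes:
        try:
            if mode == "hardlink":
                os.link(cache_file, dest)
            elif mode == "symlink":
                os.symlink(cache_file, dest)
            elif mode == "copy":
                shutil.copy2(cache_file, dest)
            else:
                raise ValueError(f"unknown checkout mode: {mode}")
            return mode
        except OSError:
            continue                          # try the next mode in the list
    raise OSError(f"could not check out {dest} with any of {modes}")


root = Path(tempfile.mkdtemp())
cache_file = root / "cache" / "a1" / "b2c3"
cache_file.parent.mkdir(parents=True)
cache_file.write_text("model bytes")

used_default = checkout(cache_file, root / "out" / "model.pkl")
used_copy = checkout(cache_file, root / "out" / "model_copy.pkl", modes=("copy",))
```

Which mode `used_default` reports depends on the filesystem, which is the point of the fallback: the caller gets the cheapest strategy that the platform supports.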
The default order is hardlink → symlink → copy. Configure it in
.pivot/config.yaml:
Or override per-command:
CLI Commands¶
pivot status --explain¶
Shows why each stage would run or skip, breaking down the three-tier decision:
$ pivot status --explain
train: stale (params changed: learning_rate 0.01 → 0.005)
clean: up to date (generation match)
pivot checkout¶
Restore tracked files and stage outputs from cache:
$ pivot checkout # restore all
$ pivot checkout train # restore specific stage outputs
$ pivot checkout --only-missing # skip files that already exist
$ pivot checkout --force # overwrite existing files
pivot commit¶
Record current outputs in lock files and cache without re-running:
Relationship to Other Concepts¶
The skip algorithm combines three input signals:
- Code — tracked by fingerprinting
- Parameters — tracked via parameters
- Data — tracked via dependency hashes (see dependencies)
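A stage is skipped only when all three match a recorded run. A change to any one of them makes the stage stale relative to its lock file, and it re-executes unless the run cache already holds outputs for the new combination of inputs.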