Clew Module (stx.clew)
Hash-based provenance tracking for reproducible science. Clew (Ariadne’s
thread) records file hashes during @stx.session runs and traces
dependency chains back to source.
How It Works
- @stx.session starts a tracking session
- stx.io.load() records input file hashes
- stx.io.save() records output file hashes
- Session close computes a combined hash of all inputs/outputs
- Later, stx.clew can verify nothing has changed
import scitex as stx
# Automatic -- just use @stx.session + stx.io
@stx.session
def main():
    data = stx.io.load("input.csv")  # Tracked as input
    result = process(data)
    stx.io.save(result, "output.png")  # Tracked as output
    return 0
# Verify later
stx.clew.status() # Like git status
stx.clew.run("session_id") # Verify by hash
stx.clew.chain("output.png") # Trace to source
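The session-close step can be pictured with a short standard-library sketch. It is an illustration only, not scitex's implementation: the helper names file_digest and combined_digest, the "in"/"out" role labels, and the choice to sort entries before hashing are all assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes (hypothetical helper)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def combined_digest(inputs, outputs):
    """Fold per-file digests into one digest (assumed scheme, not Clew's)."""
    h = hashlib.sha256()
    for role, paths in (("in", inputs), ("out", outputs)):
        # Sort so the result does not depend on the order files were touched.
        for p in sorted(str(p) for p in paths):
            h.update(f"{role}:{Path(p).name}:{file_digest(p)}\n".encode())
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "input.csv")
    out = Path(tmp, "output.txt")
    src.write_text("a,b\n1,2\n")
    out.write_text("3\n")

    before = combined_digest([src], [out])
    unchanged = combined_digest([src], [out])

    out.write_text("4\n")  # simulate an output that changed after the run
    after = combined_digest([src], [out])

print(before == unchanged, before == after)
```

Recomputing the digest over untouched files reproduces the same value, while any edit to an input or output changes it, which is the property a later hash check relies on.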
CLI Commands
scitex clew status # Show changed files
scitex clew list # List all tracked runs
scitex clew run <session_id> # Verify a specific run
scitex clew chain <file> # Trace dependency chain
scitex clew stats # Database statistics
Verification Levels
CACHE – Hash comparison only (fast). Checks if files match stored hashes.
RERUN – Re-execute scripts and compare outputs (thorough). Catches logic errors.
# Fast: hash comparison
result = stx.clew.run("session_id")
# Thorough: re-execute and compare
result = stx.clew.run("session_id", from_scratch=True)
Dependency Chains
Clew traces parent_session links to build a DAG from final output
back to original source:
chain = stx.clew.chain("final_figure.png")
# Shows: source.py → intermediate.csv → analysis.py → final_figure.png
# Visualize as Mermaid DAG
stx.clew.mermaid("session_id")
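As a rough picture of how following parent_session links yields such a chain, the sketch below walks a hand-built session table from the final session back to the root. The record layout (script, output, parent_session keys) and the trace_chain helper are hypothetical and chosen for this example; the real database schema may differ.

```python
def trace_chain(session_id, sessions):
    """Walk parent_session links from a session back to the root."""
    steps = []
    seen = set()
    while session_id is not None:
        if session_id in seen:
            raise ValueError(f"cycle detected at {session_id}")
        seen.add(session_id)
        record = sessions[session_id]
        steps.append((record["script"], record["output"]))
        session_id = record["parent_session"]
    steps.reverse()  # report source first, final output last
    return steps


# Hypothetical two-session pipeline.
sessions = {
    "s1": {"script": "source.py", "output": "intermediate.csv", "parent_session": None},
    "s2": {"script": "analysis.py", "output": "final_figure.png", "parent_session": "s1"},
}

chain = trace_chain("s2", sessions)
print(" → ".join(name for step in chain for name in step))
```

A single parent per session gives a linear chain; with several parents per session the same walk would branch and produce a DAG.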
Verification Statuses
- VERIFIED – Files match expected hashes
- MISMATCH – Files differ from stored hashes
- MISSING – Files no longer exist
- UNKNOWN – No prior tracking data
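The four statuses can be illustrated with a hash-only check written against the standard library. This is a sketch under stated assumptions, not Clew's code: the classify and sha256_of helpers, the plain dict standing in for the tracking database, and the order in which the conditions are tested are all made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """SHA-256 hex digest of a file (hypothetical helper)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def classify(path, stored_hashes):
    """Map a file to one of the four statuses by hash comparison only."""
    path = Path(path)
    expected = stored_hashes.get(path.name)
    if expected is None:
        return "UNKNOWN"   # no prior tracking data
    if not path.exists():
        return "MISSING"   # tracked, but gone from disk
    return "VERIFIED" if sha256_of(path) == expected else "MISMATCH"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("kept.csv", "edited.csv", "deleted.csv"):
        (root / name).write_text(f"contents of {name}\n")
    stored = {p.name: sha256_of(p) for p in root.iterdir()}

    (root / "edited.csv").write_text("changed\n")      # will not match
    (root / "deleted.csv").unlink()                    # will be missing
    (root / "new.csv").write_text("never tracked\n")   # has no stored hash

    statuses = {
        name: classify(root / name, stored)
        for name in ("kept.csv", "edited.csv", "deleted.csv", "new.csv")
    }

print(statuses)
```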
Key Functions
- status() – Show changed items (like git status)
- run(session_id) – Verify a specific run
- chain(target_file) – Trace dependency chain
- list_runs(limit, status) – List tracked runs
- stats() – Database statistics
API Reference
scitex-clew — Hash-based verification for reproducible science.
Standalone package. Zero dependencies (pure stdlib + sqlite3). When used with scitex, integration is automatic via @stx.session + stx.io.
Public API:
import scitex_clew as clew
# Verification
clew.status() # git-status-like overview
clew.run(session_id) # verify one run (hash check)
clew.chain(target_file) # trace file → source chain
clew.dag(targets) # verify full DAG
clew.rerun(target) # re-execute & compare (sandbox)
clew.rerun_dag(targets) # rerun full DAG in topo order
clew.rerun_claims() # rerun all claim-backing sessions
clew.list_runs(limit=100) # list tracked runs
clew.stats() # database statistics
# Claims
clew.add_claim(...) # register manuscript assertion
clew.list_claims(...) # list registered claims
clew.verify_claim(...) # verify a specific claim
# Stamping
clew.stamp(...) # create temporal proof
clew.list_stamps(...) # list stamps
clew.check_stamp(...) # verify a stamp
# Hashing
clew.hash_file(path) # SHA256 of a file
clew.hash_directory(path) # SHA256 of all files in dir
# Visualization
clew.mermaid(...) # generate Mermaid DAG diagram
# Examples
clew.init_examples(dest) # scaffold example pipeline
# Session lifecycle hooks (invoked by @scitex.session)
clew.on_session_start(session_id) # open a tracked run
clew.on_session_close(status=...) # finalize run + combined hash
- scitex.clew.dag(targets=None, claims=False, strict=False)
Verify the DAG for multiple targets or all claims.
- Parameters:
targets (list of str or Path, optional) – Target files to verify (mutually exclusive with claims).
claims (bool, optional) – If True, build the DAG from every registered claim.
strict (bool, optional) – If True (F2), return a failure-attribution dict with failed_node/root_cause/invalidated_claims/still_valid_claims instead of a DAGVerification.
- scitex.clew.rerun(target, timeout=300, cleanup=True)
Re-execute a session in a sandbox and compare outputs.
- scitex.clew.rerun_dag(targets=None, timeout=300, cleanup=True)
Rerun-verify an entire DAG in topological order.
Each session is re-executed in a sandbox against its ORIGINAL stored inputs (not freshly rerun outputs from upstream), then compared to the original outputs.
- Parameters:
targets (list of str, optional) – Target output files whose upstream DAG should be rerun. If None, all runs in the database are used and their output files become the targets.
timeout (int, optional) – Maximum execution time per session in seconds (default: 300).
cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun.
- Returns:
Unified verification result for the entire DAG.
- Return type:
DAGVerification
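"Topological order" here means every session is visited only after the sessions it depends on. A minimal sketch using the standard library's graphlib shows the idea; the session names and the depends_on mapping are invented for illustration, and nothing here is taken from Clew's internals.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each session maps to the set of sessions it depends on.
depends_on = {
    "plot": {"analyze"},
    "analyze": {"clean_a", "clean_b"},
    "clean_a": {"download"},
    "clean_b": {"download"},
    "download": set(),
}

# static_order() yields each node only after all of its predecessors.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

The relative order of independent siblings (clean_a and clean_b here) is not fixed by the dependency rule, so only the upstream-before-downstream constraint should be relied on.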
- scitex.clew.rerun_claims(file_path=None, claim_type=None, timeout=300, cleanup=True)
Rerun-verify all sessions that produced files referenced by claims.
Collects unique source files from matching claims, then delegates to rerun_dag with those files as targets.
- Parameters:
file_path (str, optional) – Filter claims by manuscript file path.
claim_type (str, optional) – Filter claims by type (statistic, figure, table, text, value).
timeout (int, optional) – Maximum execution time per session in seconds (default: 300).
cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun.
- Returns:
Unified verification result for the upstream DAG of all source files referenced by the matching claims.
- Return type:
DAGVerification
- scitex.clew.add_claim(file_path, claim_type, line_number=None, claim_value=None, source_file=None, source_session=None)
Register a claim linking a manuscript assertion to the verification chain.
- Parameters:
file_path (str) – Path to the manuscript file (e.g., paper.tex).
claim_type (str) – One of: statistic, figure, table, text, value.
line_number (int, optional) – Line number in the manuscript.
claim_value (str, optional) – The asserted value (e.g., “p = 0.003”).
source_file (str, optional) – Path to the source file that produced this claim.
source_session (str, optional) – Session ID that produced the source.
- Returns:
The registered claim object.
- Return type:
Claim
- scitex.clew.list_claims(file_path=None, claim_type=None, status=None, limit=100)
List registered claims with optional filters.
- scitex.clew.verify_claim(claim_id_or_location)
Verify a specific claim by checking its source against the verification chain.
- scitex.clew.export_claims_json(path=None, *, file_path_filter=None, read_only=True)
Export every registered claim to a canonical JSON artifact.
The exported file is the single human-readable + machine-consumable view of the claims table in db.sqlite. The DB remains the source of truth; this JSON is a regenerable artifact.
Path resolution (mirrors scitex_clew._db._core._default_db_path()):
1. Explicit path argument.
2. $SCITEX_CLEW_CLAIMS_JSON env var (escape hatch).
3. <project_root>/.scitex/clew/runtime/claims.json (project root = nearest ancestor dir with .git or pyproject.toml; falls back to cwd if none found).
- Parameters:
path (str | Path, optional) – Override the resolved path. Useful for tests / one-off dumps.
file_path_filter (str, optional) – When set, only claims registered against this manuscript file path are exported. Default: every claim in the DB.
read_only (bool, optional) – After writing, chmod 0o444 the file so accidental edits fail loudly at the OS layer. Default True (the file IS derived). Set False for tests that need to mutate the file.
- Returns:
The path the artifact was written to (absolute).
- Return type:
Path
Examples
>>> import scitex_clew as clew
>>> clew.add_claim("paper.tex", "value", 42, "0.94", source_file="r.csv")
>>> # claims.json now auto-exported under ./.scitex/clew/runtime/
>>> clew.export_claims_json()  # idempotent: re-emit on demand
PosixPath('.../.scitex/clew/runtime/claims.json')
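The three-step path resolution described for export_claims_json can be sketched with pathlib. The functions find_project_root and resolve_claims_path below are hypothetical stand-ins written for this example (they are not the package's private helpers), and the env and cwd parameters exist only so the sketch can be exercised without touching the real environment.

```python
import os
import tempfile
from pathlib import Path


def find_project_root(start):
    """Nearest ancestor (including start) holding .git or pyproject.toml."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists() or (candidate / "pyproject.toml").exists():
            return candidate
    return start  # no marker found: fall back to the starting directory


def resolve_claims_path(path=None, env=None, cwd=None):
    """Apply the three documented rules in order (illustrative only)."""
    env = os.environ if env is None else env
    if path is not None:                            # 1. explicit argument
        return Path(path)
    override = env.get("SCITEX_CLEW_CLAIMS_JSON")   # 2. environment variable
    if override:
        return Path(override)
    root = find_project_root(cwd if cwd is not None else Path.cwd())
    return root / ".scitex" / "clew" / "runtime" / "claims.json"  # 3. project root


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp, "proj").resolve()
    nested = project / "src" / "pkg"
    nested.mkdir(parents=True)
    (project / "pyproject.toml").write_text("")

    # Resolving from a nested directory climbs to the pyproject.toml marker.
    from_root = resolve_claims_path(env={}, cwd=nested)
    expected = project / ".scitex" / "clew" / "runtime" / "claims.json"

from_arg = resolve_claims_path("dump.json", env={})
from_env = resolve_claims_path(env={"SCITEX_CLEW_CLAIMS_JSON": "custom/claims.json"})
print(from_root == expected, from_arg, from_env)
```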
- scitex.clew.register_intermediate(name, value, supports=None, session_id=None, claim_type='value')
Register a computed intermediate as a Clew claim.
Use this from inside a @stx.session script (or from an agent loop) to record any non-trivial intermediate value with explicit upstream support. The claim becomes part of the DAG and can be queried via clew.chain, clew.dag, or the MCP clew_chain / clew_dag tools.
- Parameters:
name (str) – Descriptive identifier (e.g. “acute_n_sig_pathways”). Avoid generic names like “result_3”; the id is the only handle a future inspector has on the value.
value (Any) – The computed result. Coerced to string for storage; the hash chain sees repr(value) so types matter.
supports (Optional[List[str]]) – List of upstream claim ids or session ids that this value depends on. Stored as JSON in the claim’s value field for retrieval. None means no explicit upstream (use sparingly).
session_id (Optional[str]) – The session this value belongs to. If None, read from the SCITEX_SESSION_ID env var that @stx.session sets at start.
claim_type (str) – One of statistic, figure, table, text, value. Defaults to value since intermediates are usually scalar / categorical results.
- Returns:
The registered claim object.
- Return type:
Claim
- Raises:
ValueError – If no session_id can be determined (env var unset and not passed).
Examples
Inside a @stx.session script:
>>> from scitex_clew import register_intermediate
>>> n_sig = sum(1 for p in pathways if p.padj < 0.05)
>>> register_intermediate(
...     name="chronic_r2_n_sig_pathways",
...     value=n_sig,
...     supports=["chronic_r2_min_pvals", "reactome_pathways_v2024"],
... )
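Two details of register_intermediate are easy to misread: the session_id fallback to SCITEX_SESSION_ID, and the remark that types matter because the hash sees repr(value). The sketch below illustrates both with standard-library code. The helpers resolve_session_id and value_digest are hypothetical, and hashing repr(value) with SHA-256 is an assumption for the example, not a statement about how Clew stores values.

```python
import hashlib
import os


def resolve_session_id(session_id=None, env=None):
    """Prefer the explicit argument, then the environment variable."""
    env = os.environ if env is None else env
    resolved = session_id or env.get("SCITEX_SESSION_ID")
    if not resolved:
        raise ValueError("session_id not given and SCITEX_SESSION_ID is unset")
    return resolved


def value_digest(value):
    """Digest of repr(value), so 3 and "3" hash differently (assumed scheme)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


print(resolve_session_id(env={"SCITEX_SESSION_ID": "2024_demo"}))
# str() erases the int/str distinction; repr() keeps it.
print(str(3) == str("3"), value_digest(3) == value_digest("3"))
```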
- scitex.clew.stamp(backend='file', service_url=None, session_ids=None, output_dir=None)
Record root hash with external timestamp.
- Parameters:
backend (str) – One of: file, rfc3161, zenodo.
service_url (str, optional) – URL for RFC 3161 TSA or Zenodo API.
session_ids (list of str, optional) – Specific sessions to stamp. If None, stamps all successful runs.
output_dir (str, optional) – Directory for file-based stamps (default: <db_dir>/stamps, i.e. .scitex/clew/runtime/stamps/).
- Returns:
The timestamp proof record.
- Return type:
Stamp
- scitex.clew.hash_file(path, algorithm='sha256', chunk_size=8192)
Compute hash of a file.
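- Parameters:
path (str or Path) – File to hash.
algorithm (str, optional) – Hash algorithm name (default: sha256).
chunk_size (int, optional) – Number of bytes read per chunk (default: 8192).
- Returns:
Hexadecimal hash string (first 32 characters)
- Return type:
str
Examples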
>>> hash_file("data.csv")
'a1b2c3d4e5f6...'
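The signature suggests chunked reading followed by truncation of the hex digest. A standard-library sketch of that technique follows; sketch_hash_file is a hypothetical re-implementation written for illustration and may differ from the packaged function in its details.

```python
import hashlib
import tempfile
from pathlib import Path


def sketch_hash_file(path, algorithm="sha256", chunk_size=8192):
    """Hash a file in fixed-size chunks and keep the first 32 hex characters."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter(callable, sentinel) stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:32]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "data.csv")
    payload = b"x,y\n" + b"1,2\n" * 5000  # larger than one chunk
    target.write_bytes(payload)
    short = sketch_hash_file(target)

print(len(short), short == hashlib.sha256(payload).hexdigest()[:32])
```

Reading in chunks keeps memory use constant for large files while producing the same digest as hashing the whole content at once.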
- scitex.clew.hash_directory(path, pattern='*', recursive=True, algorithm='sha256')
Compute hashes for all files in a directory.
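- Parameters:
path (str or Path) – Directory to hash.
pattern (str, optional) – Glob pattern selecting files (default: *).
recursive (bool, optional) – Whether to include subdirectories (default: True).
algorithm (str, optional) – Hash algorithm name (default: sha256).
- Returns:
Mapping of relative paths to hashes
- Return type:
dict
Examples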
>>> hash_directory("./data/")
{'input.csv': 'a1b2...', 'config.yaml': 'c3d4...'}
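A mapping of relative paths to hashes can be built with pathlib globbing. The following is a sketch, not the packaged implementation: sketch_hash_directory is a hypothetical name, and it stores full hex digests rather than assuming any truncation.

```python
import hashlib
import tempfile
from pathlib import Path


def sketch_hash_directory(path, pattern="*", recursive=True, algorithm="sha256"):
    """Return {relative_path: hex_digest} for files matching pattern."""
    root = Path(path)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return {
        p.relative_to(root).as_posix(): hashlib.new(algorithm, p.read_bytes()).hexdigest()
        for p in sorted(matches)
        if p.is_file()  # skip directories that also match the pattern
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "input.csv").write_text("a,b\n")
    (root / "sub").mkdir()
    (root / "sub" / "config.yaml").write_text("k: v\n")

    deep = sketch_hash_directory(root)
    shallow = sketch_hash_directory(root, recursive=False)

print(sorted(deep), sorted(shallow))
```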
- scitex.clew.mermaid(session_id=None, target_file=None, target_files=None, claims=False, grouper=None, **kwargs)
Generate a Mermaid DAG diagram.
- Parameters:
session_id (str, optional) – Start from this session.
target_file (str, optional) – Start from the session that produced this file.
target_files (list of str, optional) – Multiple target files (multi-target DAG).
claims (bool, optional) – If True, build DAG from all registered claims.
grouper (callable | dict | None, optional) – File grouping strategy. Callable or JSON/dict spec (see scitex_clew.groupers.resolve_spec). If None, falls back to .scitex/clew/config.yaml (key grouper) if present.
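For readers unfamiliar with Mermaid, the sketch below turns a list of dependency edges into flowchart text. It only illustrates the output format; to_mermaid and its node-naming scheme are hypothetical, and the diagram that clew.mermaid actually emits may be laid out differently.

```python
def to_mermaid(edges):
    """Render (source, target) pairs as a top-down Mermaid flowchart."""
    ids = {}

    def node_id(name):
        # Assign stable short ids (n0, n1, ...) in first-seen order.
        return ids.setdefault(name, f"n{len(ids)}")

    lines = ["graph TD"]
    for src, dst in edges:
        a, b = node_id(src), node_id(dst)
        lines.append(f'    {a}["{src}"] --> {b}["{dst}"]')
    return "\n".join(lines)


# Hypothetical edges for a small pipeline.
edges = [
    ("source.py", "intermediate.csv"),
    ("intermediate.csv", "analysis.py"),
    ("analysis.py", "final_figure.png"),
]
diagram = to_mermaid(edges)
print(diagram)
```

Quoting the labels lets file names containing dots or underscores render without being parsed as Mermaid syntax.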
- scitex.clew.init_examples(dest, variant='sequential', *, find_examples_dir=<function _find_examples_dir>)
Copy Clew example scripts to a destination directory.
Copies only the runnable scripts (.py, .sh) and README, not the output directories. Users run 00_run_all.sh themselves to generate outputs and populate the verification database.
- Parameters:
dest (str or Path) – Destination directory. Created if it does not exist. Existing script files are overwritten.
variant (str, optional) – Example variant: “sequential” (default) or “multi_parent”.
find_examples_dir (callable, optional) – Locator callable (variant: str) -> Optional[Path] used to resolve the bundled examples source. Production callers should not pass this; it is the canonical PA-306 §1 DI seam: tests inject a hand-rolled fake that returns a tmp_path-rooted directory or None.
- Returns:
{"path": str, "files": list[str], "file_count": int, "variant": str}
- Return type:
dict
- Raises:
FileNotFoundError – If the bundled examples cannot be located.
ValueError – If variant is not recognized.
- scitex.clew.on_session_start(session_id, script_path=None, parent_session=None, verbose=False, metadata=None)
Hook called when a session starts.
- Parameters:
session_id (str) – Unique session identifier
script_path (str, optional) – Path to the script being run
parent_session (str, optional) – Parent session ID for chain tracking
verbose (bool, optional) – Whether to log status messages
metadata (dict, optional) – Additional metadata (e.g. notebook_path, cell_index)
- Return type:
- scitex.clew.on_session_close(status='success', exit_code=0, verbose=False, register=None)
Hook called when a session closes.
- Parameters:
status (str, optional) – Final status (success, failed, error)
exit_code (int, optional) – Exit code of the script
verbose (bool, optional) – Whether to log status messages
register (bool, optional) – If True, register session hashes with remote Clew Registry. If None, checks SCITEX_AUTO_REGISTER environment variable.
- Return type: