Clew Module (`stx.clew`)

Hash-based provenance tracking for reproducible science. Clew (Ariadne’s thread) records file hashes during @stx.session runs and traces dependency chains back to source.

How It Works

1. @stx.session starts a tracking session.
2. stx.io.load() records input file hashes.
3. stx.io.save() records output file hashes.
4. Session close computes a combined hash of all inputs and outputs.
5. Later, stx.clew can verify that nothing has changed.

import scitex as stx

# Automatic -- just use @stx.session + stx.io
@stx.session
def main():
    data = stx.io.load("input.csv")      # Tracked as input
    result = process(data)
    stx.io.save(result, "output.png")     # Tracked as output
    return 0

# Verify later
stx.clew.status()                         # Like git status
stx.clew.run("session_id")                # Verify by hash
stx.clew.chain("output.png")              # Trace to source
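
The bookkeeping behind this flow can be sketched with the standard library alone. This is a minimal illustration, not Clew's implementation: the SessionRecord class, its method names, and the way per-file hashes are folded into one combined digest are all assumptions made for the example.

```python
import hashlib


class SessionRecord:
    """Toy stand-in for one tracked run (hypothetical, not Clew's schema)."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.inputs = {}    # file name -> SHA-256 hex digest
        self.outputs = {}   # file name -> SHA-256 hex digest
        self.combined = None

    def record_input(self, name, data: bytes):
        self.inputs[name] = hashlib.sha256(data).hexdigest()

    def record_output(self, name, data: bytes):
        self.outputs[name] = hashlib.sha256(data).hexdigest()

    def close(self):
        # Sort entries so the digest does not depend on load/save order.
        digest = hashlib.sha256()
        for role, files in (("in", self.inputs), ("out", self.outputs)):
            for name in sorted(files):
                digest.update(f"{role}:{name}:{files[name]}\n".encode())
        self.combined = digest.hexdigest()
        return self.combined


def run(session_id, output_bytes):
    """Simulate one session: one input, one output, then close."""
    record = SessionRecord(session_id)
    record.record_input("input.csv", b"x,y\n1,2\n")
    record.record_output("output.txt", output_bytes)
    return record.close()


original = run("run-1", b"sum=3\n")
repeated = run("run-2", b"sum=3\n")   # same bytes, same combined hash
tampered = run("run-3", b"sum=4\n")   # edited output, different hash
```

Because the combined digest depends only on file names and contents in this sketch, two runs over identical bytes agree, and any edited file changes the result.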

CLI Commands

scitex clew status                  # Show changed files
scitex clew list                    # List all tracked runs
scitex clew run <session_id>        # Verify a specific run
scitex clew chain <file>            # Trace dependency chain
scitex clew stats                   # Database statistics

Verification Levels

CACHE – Hash comparison only (fast). Checks if files match stored hashes.
RERUN – Re-execute scripts and compare outputs (thorough). Catches outputs that the recorded script no longer reproduces.

# Fast: hash comparison
result = stx.clew.run("session_id")

# Thorough: re-execute and compare
result = stx.clew.run("session_id", from_scratch=True)

Dependency Chains

Clew traces parent_session links to build a DAG from final output back to original source:

chain = stx.clew.chain("final_figure.png")
# Shows: source.py → intermediate.csv → analysis.py → final_figure.png

# Visualize as Mermaid DAG
stx.clew.mermaid("session_id")
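
Walking parent_session links can be sketched on an in-memory table of runs. The RUNS layout and the trace_chain helper below are hypothetical, chosen only to reproduce the chain shown above; Clew's actual storage is its SQLite database.

```python
# Hypothetical run table: session id -> script, parent link, files touched.
RUNS = {
    "s1": {
        "script": "source.py",
        "parent_session": None,
        "inputs": [],
        "outputs": ["intermediate.csv"],
    },
    "s2": {
        "script": "analysis.py",
        "parent_session": "s1",
        "inputs": ["intermediate.csv"],
        "outputs": ["final_figure.png"],
    },
}


def trace_chain(target, runs):
    """Return [script, file, script, file, ...] from the root run to target."""
    # Find the session that produced the target file, if any.
    sid = next((s for s, r in runs.items() if target in r["outputs"]), None)

    # Follow parent links back to the root, guarding against cycles.
    lineage, seen = [], set()
    while sid is not None and sid not in seen:
        seen.add(sid)
        lineage.append(sid)
        sid = runs[sid]["parent_session"]
    lineage.reverse()  # root first

    chain = []
    for i, current in enumerate(lineage):
        chain.append(runs[current]["script"])
        if i + 1 < len(lineage):
            # Files handed from this run to the next one in the lineage.
            consumed = set(runs[lineage[i + 1]]["inputs"])
            chain.extend(f for f in runs[current]["outputs"] if f in consumed)
        else:
            chain.append(target)
    return chain


chain = trace_chain("final_figure.png", RUNS)
untracked = trace_chain("unknown.png", RUNS)
```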

Verification Statuses

VERIFIED – Files match expected hashes
MISMATCH – Files differ from stored hashes
MISSING – Files no longer exist
UNKNOWN – No prior tracking data
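
As a rough illustration of how a hash-only (CACHE-level) check could map onto these four statuses, the sketch below compares stored digests with the files currently on disk. The classify helper and the order in which the conditions are tested are assumptions, not Clew's code.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def classify(path, stored_hash):
    """Return a status string for one file using hash comparison only."""
    if stored_hash is None:
        return "UNKNOWN"      # no prior tracking data
    if not Path(path).exists():
        return "MISSING"      # tracked, but the file is gone
    return "VERIFIED" if sha256_of(path) == stored_hash else "MISMATCH"


names = ("kept.csv", "edited.csv", "deleted.csv", "new.csv")

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in names:
        (root / name).write_text(f"contents of {name}\n")

    # Hashes recorded at session time; new.csv was never tracked.
    stored = {name: sha256_of(root / name) for name in names[:3]}

    (root / "edited.csv").write_text("changed\n")
    (root / "deleted.csv").unlink()

    statuses = {name: classify(root / name, stored.get(name)) for name in names}
```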

Key Functions

status() – Show changed items (like git status)
run(session_id) – Verify a specific run
chain(target_file) – Trace dependency chain
list_runs(limit, status) – List tracked runs
stats() – Database statistics

API Reference

scitex-clew — Hash-based verification for reproducible science.

Standalone package. Zero third-party dependencies (pure stdlib, including sqlite3). When used with scitex, integration is automatic via @stx.session + stx.io.

Public API:

import scitex_clew as clew

# Verification
clew.status()                      # git-status-like overview
clew.run(session_id)               # verify one run (hash check)
clew.chain(target_file)            # trace file → source chain
clew.dag(targets)                  # verify full DAG
clew.rerun(target)                 # re-execute & compare (sandbox)
clew.rerun_dag(targets)            # rerun full DAG in topo order
clew.rerun_claims()                # rerun all claim-backing sessions
clew.list_runs(limit=100)          # list tracked runs
clew.stats()                       # database statistics

# Claims
clew.add_claim(...)                # register manuscript assertion
clew.list_claims(...)              # list registered claims
clew.verify_claim(...)             # verify a specific claim
clew.verify_all_claims(...)        # verify every claim -> fail-loud code

# Stamping
clew.stamp(...)                    # create temporal proof
clew.list_stamps(...)              # list stamps
clew.check_stamp(...)              # verify a stamp

# Hashing
clew.hash_file(path)               # SHA256 of a file
clew.hash_directory(path)          # SHA256 of all files in dir

# Visualization
clew.mermaid(...)                  # generate Mermaid DAG diagram

# Examples
clew.init_examples(dest)           # scaffold example pipeline

# Session lifecycle hooks (invoked by @scitex.session)
clew.on_session_start(session_id)  # open a tracked run
clew.on_session_close(status=...)  # finalize run + combined hash

Implementation note (audit-all §10 cold-start):

This module uses the PEP 562 __getattr__ lazy-import pattern. All submodules and re-exports below __version__ are loaded on first access only, so import scitex_clew stays well under the 500 ms cold-start threshold. The public attribute names listed above (and in __all__) resolve exactly as before, with no caller-visible change.
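
For readers unfamiliar with PEP 562, the pattern can be demonstrated in a few lines. The sketch builds a throwaway module object so that it runs standalone; in a real package the function would simply be named __getattr__ at the top level of __init__.py. The lazy_demo module, the _LAZY_ATTRS table, and the choice of json.dumps as the lazily loaded attribute are illustrative assumptions.

```python
import importlib
import types

# Public name -> (module to import, attribute to fetch). Hypothetical table.
_LAZY_ATTRS = {"dumps": ("json", "dumps")}
resolved = []  # records each lazy resolution, for demonstration only

lazy_demo = types.ModuleType("lazy_demo")


def _lazy_getattr(name):
    """Module-level __getattr__: called only when normal lookup fails."""
    try:
        module_name, attr = _LAZY_ATTRS[name]
    except KeyError:
        raise AttributeError(
            f"module 'lazy_demo' has no attribute {name!r}"
        ) from None
    value = getattr(importlib.import_module(module_name), attr)
    resolved.append(name)
    setattr(lazy_demo, name, value)  # cache so later lookups skip this hook
    return value


lazy_demo.__getattr__ = _lazy_getattr

first = lazy_demo.dumps([1, 2])   # triggers the import
second = lazy_demo.dumps([3])     # served from the cached attribute
has_missing = hasattr(lazy_demo, "missing")
```

Caching the resolved attribute on the module means the hook runs at most once per name, which is what keeps the first import cheap without slowing later access.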

scitex.clew.status(): Get verification status summary (like git status).

scitex.clew.run(session_id: str, from_scratch: bool = False)

Verify a specific run.

Parameters:

session_id (str) – Session identifier
from_scratch (bool, optional) – If True, re-execute the script and verify outputs (slow but thorough). If False, only compare hashes (fast).

scitex.clew.chain(target: str): Verify the dependency chain for a target file.

scitex.clew.dag(targets=None, claims=False, strict=False)

Verify the DAG for multiple targets or all claims.

Parameters:

targets (list of str or Path, optional) – Target files to verify (mutually exclusive with claims).
claims (bool, optional) – If True, build the DAG from every registered claim.
strict (bool, optional) – If True (F2), return a failure-attribution dict with failed_node / root_cause / invalidated_claims / still_valid_claims instead of a DAGVerification.

scitex.clew.rerun(target, timeout: int = 300, cleanup: bool = True)

Re-execute a session in a sandbox and compare outputs.

Parameters:

target (str or list[str]) – Session ID, script path, or artifact path.
timeout (int, optional) – Maximum execution time in seconds (default: 300).
cleanup (bool, optional) – Remove sandbox outputs after verification (default: True).

scitex.clew.rerun_dag(targets=None, timeout=300, cleanup=True)

Rerun-verify an entire DAG in topological order.

Each session is re-executed in a sandbox against its ORIGINAL stored inputs (not freshly rerun outputs from upstream), then compared to the original outputs.

Parameters:

targets (list of str, optional) – Target output files whose upstream DAG should be rerun. If None, all runs in the database are used and their output files become the targets.
timeout (int, optional) – Maximum execution time per session in seconds (default: 300).
cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun.

Returns:

Unified verification result for the entire DAG.

Return type:

DAGVerification
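
The ordering step can be sketched with graphlib from the standard library (Python 3.9+). The PARENTS mapping and both helpers are hypothetical; the sketch only shows how an upstream subgraph might be collected and sorted so that parents come before children. It does not execute anything, and in Clew each session is still compared against its original stored inputs rather than fresh upstream outputs.

```python
from graphlib import TopologicalSorter

# Hypothetical session graph: session id -> set of parent session ids.
PARENTS = {
    "clean": set(),
    "fit": {"clean"},
    "plot": {"fit"},
    "table": {"fit"},
}


def upstream(targets, parents):
    """Collect the target sessions plus everything they depend on."""
    needed, stack = set(), list(targets)
    while stack:
        sid = stack.pop()
        if sid not in needed:
            needed.add(sid)
            stack.extend(parents.get(sid, ()))
    return needed


def rerun_order(targets, parents):
    """Return the upstream sessions with every parent before its children."""
    needed = upstream(targets, parents)
    graph = {sid: parents.get(sid, set()) & needed for sid in needed}
    return list(TopologicalSorter(graph).static_order())


single = rerun_order(["plot"], PARENTS)
both = rerun_order(["plot", "table"], PARENTS)
```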

scitex.clew.rerun_claims(file_path=None, claim_type=None, timeout=300, cleanup=True)

Rerun-verify all sessions that produced files referenced by claims.

Collects unique source files from matching claims, then delegates to rerun_dag with those files as targets.

Parameters:

file_path (str, optional) – Filter claims by manuscript file path.
claim_type (str, optional) – Filter claims by type (statistic, figure, table, text, value).
timeout (int, optional) – Maximum execution time per session in seconds (default: 300).
cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun.

Returns:

Unified verification result for the upstream DAG of all source files referenced by the matching claims.

Return type:

DAGVerification

scitex.clew.list_runs(limit: int = 100, status: str = None): List tracked runs.

scitex.clew.stats(): Get database statistics.

scitex.clew.add_claim(file_path, claim_type, line_number=None, claim_value=None, source_file=None, source_session=None)

Register a claim linking a manuscript assertion to the verification chain.

Parameters:

file_path (str) – Path to the manuscript file (e.g., paper.tex).
claim_type (str) – One of: statistic, figure, table, text, value.
line_number (int, optional) – Line number in the manuscript.
claim_value (str, optional) – The asserted value (e.g., “p = 0.003”).
source_file (str, optional) – Path to the source file that produced this claim.
source_session (str, optional) – Session ID that produced the source.

Returns:

The registered claim object.

Return type:

Claim

scitex.clew.list_claims(file_path=None, claim_type=None, status=None, limit=100)

List registered claims with optional filters.

Parameters:

file_path (str, optional) – Filter by manuscript file path.
claim_type (str, optional) – Filter by claim type.
status (str, optional) – Filter by verification status.
limit (int) – Maximum number of claims to return.

Return type:

list of Claim

scitex.clew.verify_claim(claim_id_or_location)

Verify a specific claim by checking its source against the verification chain.

Parameters:

claim_id_or_location (str) – Either a claim_id or a location string like “paper.tex:L42”.

Returns:

Verification result with claim details and chain status.

Return type:

dict
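
One way an argument of this kind could be told apart is sketched below. The regular expression, the parse_claim_ref helper, and the returned dict layout are assumptions for illustration; only the “paper.tex:L42” location format comes from the description above, and the claim id used in the demo is made up.

```python
import re

# A location looks like "<file>:L<line>", e.g. "paper.tex:L42".
_LOCATION = re.compile(r"^(?P<file>.+):L(?P<line>\d+)$")


def parse_claim_ref(ref):
    """Classify a reference as a manuscript location or a plain claim id."""
    match = _LOCATION.match(ref)
    if match:
        return {
            "kind": "location",
            "file_path": match["file"],
            "line_number": int(match["line"]),
        }
    return {"kind": "claim_id", "claim_id": ref}


by_location = parse_claim_ref("paper.tex:L42")
by_id = parse_claim_ref("claim_0007")  # hypothetical id format
```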

scitex.clew.verify_all_claims(file_path=None, claim_type=None, *, strict=False, config=None)

Verify every registered claim and reduce to a fail-loud result.

This is the reusable core behind clew verify (claim-set mode). It re-verifies each claim (re-hashing its source and, in strict mode, checking upstream @stx.session lineage), updates each claim’s stored status as a side effect (via verify_claim()), and reduces the per-claim outcomes to a single VerificationResult.

Parameters:

file_path (str, optional) – Restrict to claims registered against this manuscript path.
claim_type (str, optional) – Restrict to claims of this type.
strict (bool, optional) – When True, a claim only passes if its source additionally has upstream computation lineage (its provenance chain verifies). A hand-written leaf (no @stx.session behind it) fails with NO_LINEAGE even though its hash matches. strict also promotes NO_LINEAGE to ERROR severity regardless of config. Default False.
config (str or pathlib.Path, optional) – Explicit .scitex/clew config file/dir overriding the resolved user/project severity map (see scitex_clew._core._config).

Returns:

Structured outcome. result.exit_code == 0 (result.ok) is the DONE-gate; any nonzero code means the agent MUST abstain honestly instead of claiming success. Per-pattern severity (configurable via .scitex/clew) decides which fired patterns are errors (fail) vs warnings (tolerated). See scitex_clew._cli._exit_codes.

Return type:

VerificationResult
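
The reduction from fired patterns to a single exit code can be sketched as follows. This is a hedged illustration only: the default severity map, the pattern names other than NO_LINEAGE, the dict returned, and the use of 1 as the failing code are assumptions. The real codes live in scitex_clew._cli._exit_codes.

```python
# Hypothetical default severities; configurable in the real tool.
DEFAULT_SEVERITY = {
    "MISMATCH": "error",
    "MISSING": "error",
    "NO_LINEAGE": "warning",
}


def reduce_outcomes(fired, severity=None, strict=False):
    """Collapse the set of fired patterns into one pass/fail outcome."""
    severity = dict(DEFAULT_SEVERITY if severity is None else severity)
    if strict:
        # strict promotes NO_LINEAGE to an error regardless of config.
        severity["NO_LINEAGE"] = "error"

    fired = set(fired)
    # Unknown patterns are treated as errors so that failures stay loud.
    errors = sorted(p for p in fired if severity.get(p, "error") == "error")
    warnings = sorted(fired - set(errors))
    return {
        "exit_code": 1 if errors else 0,  # assumed: any nonzero means failure
        "errors": errors,
        "warnings": warnings,
    }


lenient = reduce_outcomes(["NO_LINEAGE"])
strict = reduce_outcomes(["NO_LINEAGE"], strict=True)
clean = reduce_outcomes([])
```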

scitex.clew.export_claims_json(path=None, *, file_path_filter=None, read_only=True)

Export every registered claim to a canonical JSON artifact.

The exported file is the single human-readable + machine-consumable view of the claims table in db.sqlite. The DB remains the source of truth; this JSON is a regenerable artifact.

Path resolution (mirrors scitex_clew._db._core._default_db_path()):

1. Explicit path argument.
2. $SCITEX_CLEW_CLAIMS_JSON env var (escape hatch).
3. <project_root>/.scitex/clew/runtime/claims.json (project root = nearest ancestor dir with .git or pyproject.toml; falls back to cwd if none found).
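
The three-step precedence can be sketched as below. The helper names and the environ / cwd parameters are hypothetical (added so the example is testable without touching the real environment); only the order of precedence, the env var name, and the default location are taken from the list above.

```python
import os
import tempfile
from pathlib import Path


def find_project_root(start):
    """Nearest ancestor containing .git or pyproject.toml, else start itself."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists() or (candidate / "pyproject.toml").exists():
            return candidate
    return start


def resolve_claims_path(path=None, environ=None, cwd=None):
    environ = os.environ if environ is None else environ
    if path is not None:                      # 1. explicit argument wins
        return Path(path)
    override = environ.get("SCITEX_CLEW_CLAIMS_JSON")
    if override:                              # 2. env var escape hatch
        return Path(override)
    root = find_project_root(Path.cwd() if cwd is None else cwd)
    return root / ".scitex" / "clew" / "runtime" / "claims.json"  # 3. default


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    nested = project / "scripts" / "analysis"
    nested.mkdir(parents=True)
    (project / "pyproject.toml").write_text("")

    explicit = resolve_claims_path("dump.json", environ={}, cwd=nested)
    from_env = resolve_claims_path(
        environ={"SCITEX_CLEW_CLAIMS_JSON": "custom/claims.json"}, cwd=nested
    )
    default = resolve_claims_path(environ={}, cwd=nested)
    expected_default = project / ".scitex" / "clew" / "runtime" / "claims.json"
```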

Parameters:

path (str | Path, optional) – Override the resolved path. Useful for tests / one-off dumps.
file_path_filter (str, optional) – When set, only claims registered against this manuscript file path are exported. Default: every claim in the DB.
read_only (bool, optional) – After writing, chmod 0o444 the file so accidental edits fail loudly at the OS layer. Default True (the file IS derived). Set False for tests that need to mutate the file.

Returns:

The path the artifact was written to (absolute).

Return type:

Path

Examples

>>> import scitex_clew as clew
>>> clew.add_claim("paper.tex", "value", 42, "0.94", source_file="r.csv")
>>> # claims.json now auto-exported under ./.scitex/clew/runtime/
>>> clew.export_claims_json()  # idempotent — re-emit on demand
PosixPath('.../.scitex/clew/runtime/claims.json')

scitex.clew.register_intermediate(name, value, supports=None, session_id=None, claim_type='value')

Register a computed intermediate as a Clew claim.

Use this from inside a @stx.session script (or from an agent loop) to record any non-trivial intermediate value with explicit upstream support. The claim becomes part of the DAG and can be queried via clew.chain, clew.dag, or the MCP clew_chain / clew_dag tools.

Parameters:

name (str) – Descriptive identifier (e.g. “acute_n_sig_pathways”). Avoid generic names like “result_3” — the id is the only handle a future inspector has on the value.
value (Any) – The computed result. Coerced to string for storage; the hash chain sees repr(value) so types matter.
supports (Optional[List[str]]) – List of upstream claim ids or session ids that this value depends on. Stored as JSON in the claim’s value field for retrieval. None means no explicit upstream (use sparingly).
session_id (Optional[str]) – The session this value belongs to. If None, read from the SCITEX_SESSION_ID env var that @stx.session sets at start.
claim_type (str) – One of statistic, figure, table, text, value. Defaults to value since intermediates are usually scalar / categorical results.

Returns:

The registered claim object.

Return type:

Claim

Raises:

ValueError – If no session_id can be determined (env var unset and not passed).
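
The session_id fallback and the ValueError case can be sketched as follows. Both helpers are hypothetical, and the JSON payload layout in particular is an assumption: the description above says only that supports is stored as JSON in the claim's value field, not what that JSON looks like.

```python
import json
import os


def resolve_session_id(session_id=None, environ=None):
    """Prefer the explicit argument, then the SCITEX_SESSION_ID env var."""
    environ = os.environ if environ is None else environ
    resolved = session_id or environ.get("SCITEX_SESSION_ID")
    if not resolved:
        raise ValueError("no session_id passed and SCITEX_SESSION_ID is unset")
    return resolved


def encode_intermediate(value, supports=None):
    """Hypothetical payload: value coerced to string plus upstream ids."""
    return json.dumps({"value": str(value), "supports": list(supports or [])})


env = {"SCITEX_SESSION_ID": "sess-001"}
from_env = resolve_session_id(environ=env)
explicit = resolve_session_id("sess-002", environ=env)
payload = encode_intermediate(17, ["chronic_r2_min_pvals"])

try:
    resolve_session_id(environ={})
    missing_raises = False
except ValueError:
    missing_raises = True
```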

Examples

Inside a @stx.session script:

>>> from scitex_clew import register_intermediate
>>> n_sig = sum(1 for p in pathways if p.padj < 0.05)
>>> register_intermediate(
...     name="chronic_r2_n_sig_pathways",
...     value=n_sig,
...     supports=["chronic_r2_min_pvals", "reactome_pathways_v2024"],
... )

scitex.clew.stamp(backend='file', service_url=None, session_ids=None, output_dir=None)

Record root hash with external timestamp.

Parameters:

backend (str) – One of: file, rfc3161, zenodo.
service_url (str, optional) – URL for RFC 3161 TSA or Zenodo API.
session_ids (list of str, optional) – Specific sessions to stamp. If None, stamps all successful runs.
output_dir (str, optional) – Directory for file-based stamps (default: <db_dir>/stamps, i.e. .scitex/clew/runtime/stamps/).

Returns:

The timestamp proof record.

Return type:

Stamp

scitex.clew.list_stamps(limit=20)

List all stamps.

Return type:

List[Stamp]

scitex.clew.check_stamp(stamp_id=None)

Verify a stamp against current verification state.

Parameters:

stamp_id (str, optional) – Specific stamp to check. If None, checks the latest stamp.

Returns:

{stamp, current_root_hash, matches, details}

Return type:

dict
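
The idea of checking a stamp against the current state can be sketched as below. The root-hash construction, the stamp record fields, and the wording of details are assumptions; only the four keys of the returned dict follow the description above.

```python
import hashlib
from datetime import datetime, timezone


def root_hash(session_hashes):
    """Fold per-session hashes into one digest (hypothetical scheme)."""
    digest = hashlib.sha256()
    for sid in sorted(session_hashes):
        digest.update(f"{sid}:{session_hashes[sid]}\n".encode())
    return digest.hexdigest()


def make_stamp(session_hashes):
    """Record the root hash together with a UTC timestamp."""
    return {
        "root_hash": root_hash(session_hashes),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def check(stamp, session_hashes):
    """Compare a stored stamp with the root hash of the current state."""
    current = root_hash(session_hashes)
    matches = current == stamp["root_hash"]
    return {
        "stamp": stamp,
        "current_root_hash": current,
        "matches": matches,
        "details": "unchanged since stamp" if matches else "state differs from stamp",
    }


state = {"s1": "aaa111", "s2": "bbb222"}
stamp = make_stamp(state)
unchanged = check(stamp, state)
drifted = check(stamp, {**state, "s2": "ccc333"})
```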

scitex.clew.hash_file(path, algorithm='sha256', chunk_size=8192)

Compute hash of a file.

Parameters:

path (str or Path) – Path to the file to hash
algorithm (str, optional) – Hash algorithm (default: sha256)
chunk_size (int, optional) – Size of chunks to read (default: 8192)

Returns:

Hexadecimal hash string (first 32 characters)

Return type:

str

Examples

>>> hash_file("data.csv")
'a1b2c3d4e5f6...'
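
A chunked hash with the same parameters can be written with hashlib alone. This is a sketch of the general technique, not the library source. The function name and the length parameter are hypothetical, with length added only to mirror the 32-character return value described above.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file_sketch(path, algorithm="sha256", chunk_size=8192, length=32):
    """Hash a file in fixed-size chunks and return a truncated hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


data = bytes(range(256)) * 100  # 25,600 bytes, spans several chunks

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.bin"
    target.write_bytes(data)
    short = hash_file_sketch(target)

one_shot = hashlib.sha256(data).hexdigest()[:32]
```

Reading in chunks keeps memory use constant regardless of file size, and the result is identical to hashing the whole file at once.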

scitex.clew.hash_directory(path, pattern='*', recursive=True, algorithm='sha256')

Compute hashes for all files in a directory.

Parameters:

path (str or Path) – Directory path
pattern (str, optional) – Glob pattern for files (default: “*”)
recursive (bool, optional) – Whether to search recursively (default: True)
algorithm (str, optional) – Hash algorithm (default: sha256)

Returns:

Mapping of relative paths to hashes

Return type:

dict

Examples

>>> hash_directory("./data/")
{'input.csv': 'a1b2...', 'config.yaml': 'c3d4...'}

Notes

Transparently accepts a compressed session archive: if path is a <dir>.tar.gz file (or a directory whose <dir>.tar.gz sibling exists because it was archived away), the members are hashed in place and returned with the same {relpath: hash} shape a loose dir would yield.
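
The equivalence between a loose directory and its archive can be illustrated with pathlib and tarfile. Both helpers are hypothetical, and the sketch assumes the archive contains a single top-level directory whose name is stripped from each member path. How Clew actually lays out its session archives is not stated here.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def hash_bytes(data):
    return hashlib.sha256(data).hexdigest()[:32]


def hash_dir(path, pattern="*", recursive=True):
    """Map each file's path relative to the directory to its hash."""
    root = Path(path)
    found = root.rglob(pattern) if recursive else root.glob(pattern)
    return {
        p.relative_to(root).as_posix(): hash_bytes(p.read_bytes())
        for p in sorted(found)
        if p.is_file()
    }


def hash_archive(archive):
    """Hash the regular-file members of a .tar.gz without unpacking to disk."""
    hashes = {}
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                # Assumed layout: "<dir>/<relpath>"; drop the top-level dir.
                relpath = member.name.split("/", 1)[1]
                hashes[relpath] = hash_bytes(tar.extractfile(member).read())
    return hashes


with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "session"
    (session / "sub").mkdir(parents=True)
    (session / "input.csv").write_text("x,y\n1,2\n")
    (session / "sub" / "config.yaml").write_text("alpha: 0.05\n")

    archive = Path(tmp) / "session.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(session, arcname=session.name)

    loose = hash_dir(session)
    packed = hash_archive(archive)
```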

scitex.clew.mermaid(session_id=None, target_file=None, target_files=None, claims=False, grouper=None, **kwargs)

Generate a Mermaid DAG diagram.

Parameters:

session_id (str, optional) – Start from this session.
target_file (str, optional) – Start from the session that produced this file.
target_files (list of str, optional) – Multiple target files (multi-target DAG).
claims (bool, optional) – If True, build DAG from all registered claims.
grouper (callable | dict | None, optional) – File grouping strategy. Callable or JSON/dict spec (see scitex_clew.groupers.resolve_spec). If None, falls back to .scitex/clew/config.yaml (key grouper) if present.
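
For orientation, a Mermaid flowchart is plain text, so a DAG diagram can be produced from a list of edges. The to_mermaid helper, the node id scheme, and the top-down graph direction are assumptions for this sketch; the output format of clew.mermaid may differ.

```python
def to_mermaid(edges):
    """Render (source, destination) pairs as a top-down Mermaid flowchart."""
    ids = {}

    def node_id(label):
        # Assign short stable ids (n0, n1, ...) in order of first appearance.
        if label not in ids:
            ids[label] = f"n{len(ids)}"
        return ids[label]

    lines = ["graph TD"]
    for src, dst in edges:
        a, b = node_id(src), node_id(dst)
        lines.append(f'    {a}["{src}"] --> {b}["{dst}"]')
    return "\n".join(lines)


diagram = to_mermaid(
    [
        ("source.py", "intermediate.csv"),
        ("intermediate.csv", "analysis.py"),
        ("analysis.py", "final_figure.png"),
    ]
)
```

Quoting each label inside the brackets lets file names with dots or spaces pass through without clashing with Mermaid's node id syntax.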

scitex.clew.init_examples(dest, variant='sequential', *, find_examples_dir=<function _find_examples_dir>)

Copy Clew example scripts to a destination directory.

Copies only the runnable scripts (.py, .sh) and README — not the output directories. Users run 00_run_all.sh themselves to generate outputs and populate the verification database.

Parameters:

dest (str or Path) – Destination directory. Created if it does not exist. Existing script files are overwritten.
variant (str, optional) – Example variant: “sequential” (default) or “multi_parent”.
find_examples_dir (callable, optional) – Locator callable (variant: str) -> Optional[Path] used to resolve the bundled examples source. Production callers should not pass this; it is the canonical PA-306 §1 DI seam — tests inject a hand-rolled fake that returns a tmp_path-rooted directory or None.

Returns:

{"path": str, "files": list[str], "file_count": int, "variant": str}

Return type:

dict

Raises:

FileNotFoundError – If the bundled examples cannot be located.
ValueError – If variant is not recognized.

scitex.clew.on_session_start(session_id, script_path=None, parent_session=None, verbose=False, metadata=None)

Hook called when a session starts.

Parameters:

session_id (str) – Unique session identifier
script_path (str, optional) – Path to the script being run
parent_session (str, optional) – Parent session ID for chain tracking
verbose (bool, optional) – Whether to log status messages
metadata (dict, optional) – Additional metadata (e.g. notebook_path, cell_index)

Return type:

None

scitex.clew.on_session_close(status='success', exit_code=0, verbose=False, register=None)

Hook called when a session closes.

Parameters:

status (str, optional) – Final status (success, failed, error)
exit_code (int, optional) – Exit code of the script
verbose (bool, optional) – Whether to log status messages
register (bool, optional) – If True, register session hashes with remote Clew Registry. If None, checks SCITEX_AUTO_REGISTER environment variable.

Return type:

None
