Published on 21/12/2025
Run Logs and Reproducibility That Hold Up: Scripted Builds, Environment Hashes, and Parameter Files Done Right
Outcome-aligned reproducibility: why scripted builds and evidence-grade run logs matter in US/UK/EU reviews
Define “reproducible” the way inspectors do
To a regulator, reproducibility isn’t an academic virtue—it’s operational proof that the same inputs, code, and assumptions generate the same numbers on demand. In clinical submissions, that means a scripted build with zero hand edits, a run log that captures decisions and versions at execution time, parameter files controlling every knob humans might forget, and environment hashes that fingerprint the computational stack. When a reviewer points to a number, you should traverse output → run log → parameters → program → lineage in seconds and regenerate the value without improvisation.
State one compliance backbone—once, then reuse everywhere
Anchor your reproducibility posture with a portable paragraph and paste it across plans, shells, and reviewer guides: inspection expectations align with FDA BIMO; electronic records/signatures comply with 21 CFR Part 11 and map to EU Annex 11; oversight follows ICH E6(R3); estimands and analysis labeling reflect ICH E9(R1); safety data exchange respects ICH E2B(R3); public transparency is consistent with ClinicalTrials.gov and EU-CTR status in CTIS.
Three outcome targets (and a stopwatch)
Publish measurable goals that you can demonstrate at will: (1) Traceability—two-click drill from a number to the program, parameters, and dataset lineage; (2) Reproducibility—byte-identical rebuild for the same cut, parameters, and environment; (3) Retrievability—ten results drilled and re-run in ten minutes. Run the stopwatch drill once a quarter and file the artifacts so teams practice retrieval under time pressure and inspectors see a living control, not an aspirational policy.
Regulatory mapping: US-first clarity with EU/UK portability
US (FDA) angle—event → evidence in minutes
US assessors start from an output value and ask: which script produced it, which parameter file controlled the windows and populations, what versions of libraries were in play, and where the proof of an identical rerun lives. They expect deterministic retrieval and role attribution in run logs. If your build is button-based or manual, you’ll burn time proving negative facts (“we did not change anything”). A scripted pipeline with explicit logs flips the default: you show what did happen, not what didn’t.
EU/UK (EMA/MHRA) angle—same truth, local wrappers
EU/UK reviewers pull the same thread but probe accessibility (plain-language footnotes), governance (who approved parameter changes and when), and alignment with registered narratives. The reproducibility engine is the same; wrappers differ. Keep a translation table for labels (e.g., IRB → REC/HRA) so the same facts travel cross-region without edits to the underlying scripts or logs.
| Dimension | US (FDA) | EU/UK (EMA/MHRA) |
|---|---|---|
| Electronic records | Part 11 validation; role attribution in logs | Annex 11 controls; supplier qualification |
| Transparency | Consistency with ClinicalTrials.gov narratives | EU-CTR status via CTIS; UK registry alignment |
| Privacy | Minimum necessary; PHI minimization | GDPR/UK GDPR minimization & residency notes |
| Re-run proof | Script + params + env hash → identical outputs | Same, plus governance minutes on parameter changes |
| Inspection lens | Event→evidence speed; reproducible math | Completeness & portability of rationale |
Process & evidence: build once, run anywhere, prove everything
Scripted builds beat checklists
Replace manual sequences with a single orchestrator script for each build target (ADaM, listings, TLFs). The orchestrator loads a parameter file, prints a header with environment fingerprint and seed values, runs unit/integration tests, generates artifacts, and writes a trailer with row counts and output hashes. The script should fail fast if preconditions aren’t met (missing parameters, illegal windows, absent seeds), and it should emit human-readable, grep-friendly lines for investigators and QA.
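A minimal orchestrator sketch, in Python for illustration (the required parameter keys, file names, and output set are hypothetical stand-ins, not the article's actual build targets):

```python
import hashlib
import json
import platform
import sys
import time
from pathlib import Path

# Illustrative required keys -- a real build would list every tunable choice.
REQUIRED = ("analysis_set", "baseline_window", "seeds")

def env_fingerprint() -> str:
    """Short hash of an interpreter/OS manifest (stand-in for a full lockfile)."""
    manifest = f"python={platform.python_version()};os={platform.system()}"
    return hashlib.sha256(manifest.encode()).hexdigest()[:7]

def load_params(path: Path) -> dict:
    """Load the parameter file and fail fast on missing keys."""
    params = json.loads(path.read_text())
    missing = [k for k in REQUIRED if k not in params]
    if missing:
        sys.exit(f"[FAIL] missing parameters: {missing}")
    return params

def run_build(params: dict, outputs: dict) -> dict:
    """Print a header with environment fingerprint and echoed parameters,
    then a trailer with output hashes -- grep-friendly, one fact per line."""
    print(f"[START] {time.strftime('%Y-%m-%dT%H:%M:%S')} | env={env_fingerprint()}")
    print(f"Params: {json.dumps(params, sort_keys=True)}")
    hashes = {name: hashlib.sha256(data).hexdigest()[:8]
              for name, data in outputs.items()}
    print(f"[END] outputs={hashes} | Status=SUCCESS")
    return hashes
```

The point of the sketch is the shape: parameters load before anything runs, the failure path is explicit, and the header/trailer lines mirror the paste-ready templates later in this article.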
Environment hashing prevents “works on my machine”
Fingerprint your computational environment with a lockfile or manifest that lists interpreter/compiler versions, package names and versions, and OS details. Compute a short hash of the manifest and print it into the run log and output footers. When a new server image or container rolls out, the manifest—and therefore the hash—changes, creating visible evidence of the upgrade. If results shift, you can tie the change to a specific environment delta rather than chasing ghosts.
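A hedged sketch of the fingerprinting step (the lockfile contents are invented for the example): hash the manifest text, and any single version bump produces a visibly different short hash.

```python
import hashlib

def manifest_hash(manifest_text: str, length: int = 7) -> str:
    """Short, deterministic fingerprint of an environment manifest."""
    return hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()[:length]

# Hypothetical lockfiles: v2 differs from v1 by one patched package.
lock_v1 = "R=4.3.2\ndplyr=1.1.4\nhaven=2.5.4\n"
lock_v2 = "R=4.3.2\ndplyr=1.1.5\nhaven=2.5.4\n"
```

Because the hash is a pure function of the manifest text, printing it in run logs and output footers gives you the "single variable to examine" property described below: identical hash, identical stack.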
Parameter files externalize memory
All human-tunable choices—analysis sets, windows, reference dates, censoring rules, dictionary versions, seeds—belong in a version-controlled parameter file, not hard-coded inside macros. The orchestrator should echo parameter values verbatim into the run log and provenance footers. A formal change record should connect parameter edits to governance minutes so reviewers see who changed what, when, why, and with what effect.
- Create an orchestrator script per build target (ADaM, listings, TLFs) with start/end banners.
- Hash the environment; print the manifest and hash into the run log and output footers.
- Load parameters from a single file; echo all values into the run log.
- Seed all random processes; print seeds and PRNG details.
- Fail fast on missing/illegal parameters and out-of-date manifests.
- Run unit tests before building; abort on failures with explicit messages.
- Emit row counts and summary stats; record output file hashes for integrity.
- Archive run logs, parameters, and manifests together for two-click retrieval.
- Tag releases semantically (MAJOR.MINOR.PATCH); summarize changes at the top of logs.
- File artifacts to the TMF with cross-references from CTMS portfolio tiles.
Decision Matrix: pick the right path for reruns, upgrades, and late-breaking changes
| Scenario | Option | When to choose | Proof required | Risk if wrong |
|---|---|---|---|---|
| Minor parameter tweak (e.g., visit window ±1 day) | Parameter-only rerun | Logic unchanged; governance approved | Run log shows new params; unchanged code/env hash | Hidden logic drift if code was edited informally |
| Library/security patch upgrade | Environment refresh + validation rerun | Manifest changed; code/params stable | Before/after output hashes; validation report | Unexplained numeric drift; audit finding |
| Algorithm clarification (e.g., baseline selection rule) | Code change with targeted tests | Spec amended; impact scoped | Unit tests added/updated; diff exhibit | Widespread rework if change undocumented |
| Late database cut (new subjects) | Full rebuild | Inputs changed materially | Fresh manifest/params; new output hashes | Partial rebuild creating mismatched outputs |
| Macro upgrade across studies | Branch & compare; staged rollout | Portfolio-wide impact likely | Golden study comparison; rollout minutes | Cross-study inconsistency; query spike |
Document decisions where inspectors actually look
Maintain a short “Reproducibility Decision Log”: scenario → chosen path → rationale → artifacts (run log IDs, parameter files, diff reports) → owner → effective date → measured effect (e.g., number of outputs impacted, time-to-rerun). File in Sponsor Quality and cross-link from specs and program headers so the path from a number to the change is obvious.
QC / Evidence Pack: the minimum, complete set that proves reproducibility
- Orchestrator scripts and wrappers with headers describing scope and dependencies.
- Environment manifest (package versions, interpreters, OS details) and the computed hash.
- Version-controlled parameter files (analysis sets, windows, dates, seeds, dictionaries).
- Run logs with start/end banners, parameter echoes, seeds, row counts, and output hashes.
- Unit and integration test reports; coverage by business rule, not just code lines.
- Change summaries for scripts, manifests, and parameters with governance references.
- Before/after exhibits when any numeric drift occurs (with agreed tolerances).
- Provenance footers on datasets and outputs echoing manifest hash and parameter file name.
- Stopwatch drill artifacts (timestamps, screenshots) for retrieval drills.
- TMF filing map with two-click retrieval from CTMS portfolio tiles.
Vendor oversight & privacy (US/EU/UK)
Qualify external programming teams against your scripting and logging standards; enforce least-privilege access; store interface logs and incident reports alongside build artifacts. For EU/UK subject-level debugging, document minimization, residency, and transfer safeguards; retain sample redactions and privacy review minutes with the evidence pack.
Templates reviewers appreciate: paste-ready run log headers, footers, and parameter tokens
Run log header (copy/paste)
[START] Build: TLF Bundle 2.4 | Study: ABC-123 | Cut: 2025-11-01T00:00 | User: j.smith | Host: build01
Manifest: env.lock hash=9f7c2a1 | Interpreter=R 4.3.2 | OS=Linux 5.15 | Packages: dplyr=1.1.4, haven=2.5.4, sas7bdat=0.5
Params: set=ITT; windows=baseline[-7,0],visit±3d; dict=MedDRA 26.1, WHODrug B3 Apr-2025; seeds=TLF=314159, NPB=271828
Run log footer (copy/paste)
[END] Duration=00:12:31 | ADaM: 14 datasets (rows=1,242,118) | Listings: 43 | Tables: 57 | Figures: 18
Output hashes: t_prim_eff.tab=4be1…; f_km_os.pdf=77c9…; l_ae_serious.csv=aa21…
Status=SUCCESS | Tests=passed:132 failed:0 skipped:6 | Notes=none | Filed=/tmf/builds/ABC-123/2025-11-01
Parameter file tokens (copy/paste)
analysis_set: ITT
baseline_window: [-7,0]
visit_window: ±3d
censoring_rule: admin_lock
dictionary_versions: meddra:26.1, whodrug:B3-Apr-2025
seeds: tlf:314159, bootstrap:271828
reference_dates: fpfv:2024-03-01, lpfv:2025-06-15, dbl:2025-10-20
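Tokens in this flat `key: value` style can be parsed and echoed verbatim with a few lines of code; a sketch in Python (the parser is deliberately naive—values stay as strings so the run log shows exactly what was in the file):

```python
def parse_params(text: str) -> dict:
    """Parse simple 'key: value' parameter tokens; values kept verbatim."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values like 'tlf:314159' survive.
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params

def echo_params(params: dict) -> str:
    """Render every parameter verbatim into a single run-log line."""
    return "Params: " + "; ".join(f"{k}={v}" for k, v in sorted(params.items()))
```

Echoing the raw strings, rather than any parsed or derived form, is what lets a reviewer compare the run log against the parameter file character for character.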
Operating cadence: version discipline, CI, and drills that keep you ahead of audits
Semantic versions with human-readable change notes
Apply semantic versioning to scripts, manifests, and parameter files. Require a top-of-file change summary (what changed, why with governance reference, how to retest). A one-line version bump without rationale is invisible debt; a brief narrative prevents archaeology during inspection and accelerates “why did this move?” conversations.
CI pipelines for clinical builds
Treat statistical builds like software: trigger on parameter or code changes, run tests, create artifacts in an isolated workspace, and publish a signed bundle with run logs and hashes. Promote bundles from dev → QA → release using the same scripts and parameters so you test the exact path you will use for submission.
Stopwatch and recovery drills
Schedule quarterly drills: (1) Trace—randomly pick five numbers and open scripts, parameters, and manifests in under five minutes; (2) Rebuild—rerun a prior cut and compare output hashes; (3) Recover—simulate a corrupted environment and rebuild from the manifest. File timestamps and lessons learned; convert repeat slowdowns into CAPA with effectiveness checks.
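The rebuild drill reduces to a hash comparison; a minimal sketch (output names and contents are illustrative):

```python
import hashlib

def file_hashes(outputs: dict) -> dict:
    """Map each output name to the SHA-256 of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in outputs.items()}

def compare_builds(original: dict, rebuilt: dict) -> list:
    """Return names of outputs whose hashes drifted between two builds."""
    return sorted(name for name in original if rebuilt.get(name) != original[name])
```

An empty drift list is the drill's pass criterion; a non-empty list is the exact scope of the investigation, which is why the drill produces evidence rather than anecdotes.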
Common pitfalls & quick fixes: stop reproducibility leaks before they become findings
Pitfall 1: hidden assumptions in code
Fix: move every human-tunable decision to a parameter file; check for undocumented constants with linters; add a failing test when a hard-coded value is detected. Echo parameters into run logs and footers so reviewers never guess what was in effect.
Pitfall 2: silent environment drift
Fix: forbid ad hoc library updates; require manifest changes via pull requests; compute and display environment hashes on every run. When output hashes shift, you now have a single variable to examine—the manifest—rather than hunting across code and data.
Pitfall 3: button-driven builds
Fix: replace GUIs with scripts; retain GUIs only as thin launchers that call the same scripts. If a person can click differently, they will—scripted execution ensures consistent steps and inspectable logs.
FAQs
What must every run log include to satisfy reviewers?
At minimum: start/end banners, study ID and cut date, user/host, environment manifest and hash, echoed parameter values, seed values, unit test results, row counts and summary stats, output filenames with integrity hashes, and the filing location. With those, a reviewer can reconstruct the build without calling engineering.
How do environment hashes help during inspection?
They fingerprint the computational stack—interpreter, packages, OS—so you can prove that a rerun used the same environment as the original. If numbers differ and the hash changed, you know to examine package changes; if the hash is identical, you focus on inputs or parameters. Hashes shrink the search space from “everything” to “one of three.”
What’s the best way to manage randomization or bootstrap seeds?
Set seeds in the parameter file and print them into the run log and output footers. Use deterministic PRNGs and record their algorithm/version. If a sensitivity requires multiple seeds, include a seed list and roll through them in a controlled loop, storing each run as a distinct bundle with its own hashes.
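A sketch of the seed-list loop, assuming a simple bootstrap of the mean (the statistic and bundle-tagging scheme are invented for illustration):

```python
import hashlib
import random

def bootstrap_mean(data, seed, n_rep=200):
    """Deterministic bootstrap of the mean: same seed, same estimate.
    Uses Python's Mersenne Twister; record the algorithm name with the seed."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_rep):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / n_rep

def run_seed_list(data, seeds):
    """One bundle per seed, each tagged with its own short hash."""
    bundles = {}
    for seed in seeds:
        estimate = bootstrap_mean(data, seed)
        tag = hashlib.sha256(f"{seed}:{estimate:.10f}".encode()).hexdigest()[:8]
        bundles[seed] = (estimate, tag)
    return bundles
```

Storing each seed's run as a distinct, hash-tagged bundle is what makes a multi-seed sensitivity analysis individually reproducible rather than reproducible only in aggregate.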
Do we need different run log formats for US vs EU/UK?
No. Keep one truth. You may add a short label translation sheet (e.g., IRB → REC/HRA) to your reviewer guides, but the log structure, parameters, and manifests remain identical. This avoids drift and simplifies cross-region maintenance.
How do we prove a number changed only due to a parameter tweak?
Show two run logs with identical environment hashes and code versions but different parameter files; display the diff on the parameter file and the before/after output hashes. Add a short narrative and governance reference to close the loop. That chain is usually sufficient to resolve the query.
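The parameter diff itself can be generated mechanically; a sketch using Python's standard `difflib` (the file labels are placeholders):

```python
import difflib

def param_diff(before: str, after: str) -> str:
    """Unified diff of two parameter files, suitable for the evidence pack."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="params_v1", tofile="params_v2"))
```

Attaching this diff next to the two run logs makes the "only the parameter changed" claim self-evident: identical environment hash, identical code version, one visible `-`/`+` pair.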
Where should run logs and manifests live?
Alongside the outputs in a predictable directory structure, cross-linked from CTMS portfolio tiles and filed to the TMF. Store the parameter file and manifest with each log so retrieval is two clicks: from output to its run bundle, then to the specific artifact (script, params, or manifest).
