Published on 21/12/2025
Run Logs and Reproducibility That Hold Up: Scripted Builds, Environment Hashes, and Parameter Files Done Right
Outcome-aligned reproducibility: why scripted builds and evidence-grade run logs matter in US/UK/EU reviews
Define “reproducible” the way inspectors do
To a regulator, reproducibility isn’t an academic virtue—it’s operational proof that the same inputs, code, and assumptions generate the same numbers on demand. In clinical submissions, that means a scripted build with zero hand edits, a run log that captures decisions and versions at execution time, parameter files controlling every knob humans might forget, and environment hashes that fingerprint the computational stack. When a reviewer points to a number, you should traverse output → run log → parameters → program → lineage in seconds and regenerate the value without improvisation.
State one compliance backbone—once, then reuse everywhere
Anchor your reproducibility posture with a portable paragraph and paste it across plans, shells, and reviewer guides: inspection expectations align with FDA BIMO; electronic records/signatures comply with 21 CFR Part 11 and map to EU Annex 11; oversight follows ICH E6(R3); estimands and analysis labeling reflect ICH E9(R1); safety data exchange respects ICH E2B(R3); public transparency is consistent with ClinicalTrials.gov and EU-CTR status in CTIS.
Three outcome targets (and a stopwatch)
Publish measurable goals that you can demonstrate at will: (1) Traceability—two-click drill from a number to the program, parameters, and dataset lineage; (2) Reproducibility—byte-identical rebuild for the same cut, parameters, and environment; (3) Retrievability—ten results drilled and re-run in ten minutes. Run the stopwatch drill once a quarter and file the artifacts so teams practice retrieval under time pressure and inspectors see a living control, not an aspirational policy.
Regulatory mapping: US-first clarity with EU/UK portability
US (FDA) angle—event → evidence in minutes
US assessors start from an output value and ask: which script produced it, which parameter file controlled the windows and populations, what versions of libraries were in play, and where the proof of an identical rerun lives. They expect deterministic retrieval and role attribution in run logs. If your build is button-based or manual, you’ll burn time proving negative facts (“we did not change anything”). A scripted pipeline with explicit logs flips the default: you show what did happen, not what didn’t.
EU/UK (EMA/MHRA) angle—same truth, local wrappers
EU/UK reviewers pull the same thread but probe accessibility (plain-language footnotes), governance (who approved parameter changes and when), and alignment with registered narratives. The reproducibility engine is the same; wrappers differ. Keep a translation table for labels (e.g., IRB → REC/HRA) so the same facts travel cross-region without edits to the underlying scripts or logs.
| Dimension | US (FDA) | EU/UK (EMA/MHRA) |
|---|---|---|
| Electronic records | Part 11 validation; role attribution in logs | Annex 11 controls; supplier qualification |
| Transparency | Consistency with ClinicalTrials.gov narratives | EU-CTR status via CTIS; UK registry alignment |
| Privacy | Minimum necessary; PHI minimization | GDPR/UK GDPR minimization & residency notes |
| Re-run proof | Script + params + env hash → identical outputs | Same, plus governance minutes on parameter changes |
| Inspection lens | Event→evidence speed; reproducible math | Completeness & portability of rationale |
Process & evidence: build once, run anywhere, prove everything
Scripted builds beat checklists
Replace manual sequences with a single orchestrator script for each build target (ADaM, listings, TLFs). The orchestrator loads a parameter file, prints a header with environment fingerprint and seed values, runs unit/integration tests, generates artifacts, and writes a trailer with row counts and output hashes. The script should fail fast if preconditions aren’t met (missing parameters, illegal windows, absent seeds), and it should emit human-readable, grep-friendly lines for investigators and QA.
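A minimal orchestrator sketch, in Python for illustration (the required parameter keys, file names, and output set are hypothetical stand-ins, not the article's actual build targets):

```python
import hashlib
import json
import platform
import sys
import time
from pathlib import Path

# Illustrative required keys -- a real build would list every tunable choice.
REQUIRED = ("analysis_set", "baseline_window", "seeds")

def env_fingerprint() -> str:
    """Short hash of an interpreter/OS manifest (stand-in for a full lockfile)."""
    manifest = f"python={platform.python_version()};os={platform.system()}"
    return hashlib.sha256(manifest.encode()).hexdigest()[:7]

def load_params(path: Path) -> dict:
    """Load the parameter file and fail fast on missing keys."""
    params = json.loads(path.read_text())
    missing = [k for k in REQUIRED if k not in params]
    if missing:
        sys.exit(f"[FAIL] missing parameters: {missing}")
    return params

def run_build(params: dict, outputs: dict) -> dict:
    """Print a header with environment fingerprint and echoed parameters,
    then a trailer with output hashes -- grep-friendly, one fact per line."""
    print(f"[START] {time.strftime('%Y-%m-%dT%H:%M:%S')} | env={env_fingerprint()}")
    print(f"Params: {json.dumps(params, sort_keys=True)}")
    hashes = {name: hashlib.sha256(data).hexdigest()[:8]
              for name, data in outputs.items()}
    print(f"[END] outputs={hashes} | Status=SUCCESS")
    return hashes
```

The point of the sketch is the shape: parameters load before anything runs, the failure path is explicit, and the header/trailer lines mirror the paste-ready templates later in this article.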
Environment hashing prevents “works on my machine”
Fingerprint your computational environment with a lockfile or manifest that lists interpreter/compiler versions, package names and versions, and OS details. Compute a short hash of the manifest and print it into the run log and output footers. When a new server image or container rolls out, the manifest—and therefore the hash—changes, creating visible evidence of the upgrade. If results shift, you can tie the change to a specific environment delta rather than chasing ghosts.
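A hedged sketch of the fingerprinting step (the lockfile contents are invented for the example): hash the manifest text, and any single version bump produces a visibly different short hash.

```python
import hashlib

def manifest_hash(manifest_text: str, length: int = 7) -> str:
    """Short, deterministic fingerprint of an environment manifest."""
    return hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()[:length]

# Hypothetical lockfiles: v2 differs from v1 by one patched package.
lock_v1 = "R=4.3.2\ndplyr=1.1.4\nhaven=2.5.4\n"
lock_v2 = "R=4.3.2\ndplyr=1.1.5\nhaven=2.5.4\n"
```

Because the hash is a pure function of the manifest text, printing it in run logs and output footers gives you the "single variable to examine" property described below: identical hash, identical stack.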
Parameter files externalize memory
All human-tunable choices—analysis sets, windows, reference dates, censoring rules, dictionary versions, seeds—belong in a version-controlled parameter file, not hard-coded inside macros. The orchestrator should echo parameter values verbatim into the run log and provenance footers. A formal change record should connect parameter edits to governance minutes so reviewers see who changed what, when, why, and with what effect.
- Create an orchestrator script per build target (ADaM, listings, TLFs) with start/end banners.
- Hash the environment; print the manifest and hash into the run log and output footers.
- Load parameters from a single file; echo all values into the run log.
- Seed all random processes; print seeds and PRNG details.
- Fail fast on missing/illegal parameters and out-of-date manifests.
- Run unit tests before building; abort on failures with explicit messages.
- Emit row counts and summary stats; record output file hashes for integrity.
- Archive run logs, parameters, and manifests together for two-click retrieval.
- Tag releases semantically (MAJOR.MINOR.PATCH); summarize changes at the top of logs.
- File artifacts to the TMF with cross-references from CTMS portfolio tiles.
Decision Matrix: pick the right path for reruns, upgrades, and late-breaking changes
| Scenario | Option | When to choose | Proof required | Risk if wrong |
|---|---|---|---|---|
| Minor parameter tweak (e.g., visit window ±1 day) | Parameter-only rerun | Logic unchanged; governance approved | Run log shows new params; unchanged code/env hash | Hidden logic drift if code was edited informally |
| Library/security patch upgrade | Environment refresh + validation rerun | Manifest changed; code/params stable | Before/after output hashes; validation report | Unexplained numeric drift; audit finding |
| Algorithm clarification (e.g., baseline selection rule) | Code change with targeted tests | Spec amended; impact scoped | Unit tests added/updated; diff exhibit | Widespread rework if change undocumented |
| Late database cut (new subjects) | Full rebuild | Inputs changed materially | Fresh manifest/params; new output hashes | Partial rebuild creating mismatched outputs |
| Macro upgrade across studies | Branch & compare; staged rollout | Portfolio-wide impact likely | Golden study comparison; rollout minutes | Cross-study inconsistency; query spike |
Document decisions where inspectors actually look
Maintain a short “Reproducibility Decision Log”: scenario → chosen path → rationale → artifacts (run log IDs, parameter files, diff reports) → owner → effective date → measured effect (e.g., number of outputs impacted, time-to-rerun). File in Sponsor Quality and cross-link from specs and program headers so the path from a number to the change is obvious.
QC / Evidence Pack: the minimum, complete set that proves reproducibility
- Orchestrator scripts and wrappers with headers describing scope and dependencies.
- Environment manifest (package versions, interpreters, OS details) and the computed hash.
- Version-controlled parameter files (analysis sets, windows, dates, seeds, dictionaries).
- Run logs with start/end banners, parameter echoes, seeds, row counts, and output hashes.
- Unit and integration test reports; coverage by business rule, not just code lines.
- Change summaries for scripts, manifests, and parameters with governance references.
- Before/after exhibits when any numeric drift occurs (with agreed tolerances).
- Provenance footers on datasets and outputs echoing manifest hash and parameter file name.
- Stopwatch drill artifacts (timestamps, screenshots) for retrieval drills.
- TMF filing map with two-click retrieval from CTMS portfolio tiles.
Vendor oversight & privacy (US/EU/UK)
Qualify external programming teams against your scripting and logging standards; enforce least-privilege access; store interface logs and incident reports alongside build artifacts. For EU/UK subject-level debugging, document minimization, residency, and transfer safeguards; retain sample redactions and privacy review minutes with the evidence pack.
Templates reviewers appreciate: paste-ready run log headers, footers, and parameter tokens
Run log header (copy/paste)
[START] Build: TLF Bundle 2.4 | Study: ABC-123 | Cut: 2025-11-01T00:00 | User: j.smith | Host: build01
Manifest: env.lock hash=9f7c2a1 | Interpreter=R 4.3.2 | OS=Linux 5.15 | Packages: dplyr=1.1.4, haven=2.5.4, sas7bdat=0.5
Params: set=ITT; windows=baseline[-7,0],visit±3d; dict=MedDRA 26.1, WHODrug B3 Apr-2025; seeds=TLF=314159, NPB=271828
Run log footer (copy/paste)
[END] Duration=00:12:31 | ADaM: 14 datasets (rows=1,242,118) | Listings: 43 | Tables: 57 | Figures: 18
Output hashes: t_prim_eff.tab=4be1…; f_km_os.pdf=77c9…; l_ae_serious.csv=aa21…
Status=SUCCESS | Tests=passed:132 failed:0 skipped:6 | Notes=none | Filed=/tmf/builds/ABC-123/2025-11-01
Parameter file tokens (copy/paste)
analysis_set: ITT
baseline_window: [-7,0]
visit_window: ±3d
censoring_rule: admin_lock
dictionary_versions: meddra:26.1, whodrug:B3-Apr-2025
seeds: tlf:314159, bootstrap:271828
reference_dates: fpfv:2024-03-01, lpfv:2025-06-15, dbl:2025-10-20
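Tokens in this flat `key: value` style can be parsed and echoed verbatim with a few lines of code; a sketch in Python (the parser is deliberately naive—values stay as strings so the run log shows exactly what was in the file):

```python
def parse_params(text: str) -> dict:
    """Parse simple 'key: value' parameter tokens; values kept verbatim."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values like 'tlf:314159' survive.
        key, _, value = line.partition(":")
        params[key.strip()] = value.strip()
    return params

def echo_params(params: dict) -> str:
    """Render every parameter verbatim into a single run-log line."""
    return "Params: " + "; ".join(f"{k}={v}" for k, v in sorted(params.items()))
```

Echoing the raw strings, rather than any parsed or derived form, is what lets a reviewer compare the run log against the parameter file character for character.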
Operating cadence: version discipline, CI, and drills that keep you ahead of audits
Semantic versions with human-readable change notes
Apply semantic versioning to scripts, manifests, and parameter files. Require a top-of-file change summary (what changed, why with governance reference, how to retest). A one-line version bump without rationale is invisible debt; a brief narrative prevents archaeology during inspection and accelerates “why did this move?” conversations.
CI pipelines for clinical builds
Treat statistical builds like software: trigger on parameter or code changes, run tests, create artifacts in an isolated workspace, and publish a signed bundle with run logs and hashes. Promote bundles from dev → QA → release using the same scripts and parameters so you test the exact path you will use for submission.
Stopwatch and recovery drills
Schedule quarterly drills: (1) Trace—randomly pick five numbers and open scripts, parameters, and manifests in under five minutes; (2) Rebuild—rerun a prior cut and compare output hashes; (3) Recover—simulate a corrupted environment and rebuild from the manifest. File timestamps and lessons learned; convert repeat slowdowns into CAPA with effectiveness checks.
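The rebuild drill reduces to a hash comparison; a minimal sketch (output names and contents are illustrative):

```python
import hashlib

def file_hashes(outputs: dict) -> dict:
    """Map each output name to the SHA-256 of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in outputs.items()}

def compare_builds(original: dict, rebuilt: dict) -> list:
    """Return names of outputs whose hashes drifted between two builds."""
    return sorted(name for name in original if rebuilt.get(name) != original[name])
```

An empty drift list is the drill's pass criterion; a non-empty list is the exact scope of the investigation, which is why the drill produces evidence rather than anecdotes.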
Common pitfalls & quick fixes: stop reproducibility leaks before they become findings
Pitfall 1: hidden assumptions in code
Fix: move every human-tunable decision to a parameter file; check for undocumented constants with linters; add a failing test when a hard-coded value is detected. Echo parameters into run logs and footers so reviewers never guess what was in effect.
Pitfall 2: silent environment drift
Fix: forbid ad hoc library updates; require manifest changes via pull requests; compute and display environment hashes on every run. When output hashes shift, you now have a single variable to examine—the manifest—rather than hunting across code and data.
Pitfall 3: button-driven builds
Fix: replace GUIs with scripts; retain GUIs only as thin launchers that call the same scripts. If a person can click differently, they will—scripted execution ensures consistent steps and inspectable logs.
FAQs
What must every run log include to satisfy reviewers?
At minimum: start/end banners, study ID and cut date, user/host, environment manifest and hash, echoed parameter values, seed values, unit test results, row counts and summary stats, output filenames with integrity hashes, and the filing location. With those, a reviewer can reconstruct the build without calling engineering.
How do environment hashes help during inspection?
They fingerprint the computational stack—interpreter, packages, OS—so you can prove that a rerun used the same environment as the original. If numbers differ and the hash changed, you know to examine package changes; if the hash is identical, you focus on inputs or parameters. Hashes shrink the search space from “everything” to “one of three.”
What’s the best way to manage randomization or bootstrap seeds?
Set seeds in the parameter file and print them into the run log and output footers. Use deterministic PRNGs and record their algorithm/version. If a sensitivity requires multiple seeds, include a seed list and roll through them in a controlled loop, storing each run as a distinct bundle with its own hashes.
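A sketch of the seed-list loop, assuming a simple bootstrap of the mean (the statistic and bundle-tagging scheme are invented for illustration):

```python
import hashlib
import random

def bootstrap_mean(data, seed, n_rep=200):
    """Deterministic bootstrap of the mean: same seed, same estimate.
    Uses Python's Mersenne Twister; record the algorithm name with the seed."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_rep):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / n_rep

def run_seed_list(data, seeds):
    """One bundle per seed, each tagged with its own short hash."""
    bundles = {}
    for seed in seeds:
        estimate = bootstrap_mean(data, seed)
        tag = hashlib.sha256(f"{seed}:{estimate:.10f}".encode()).hexdigest()[:8]
        bundles[seed] = (estimate, tag)
    return bundles
```

Storing each seed's run as a distinct, hash-tagged bundle is what makes a multi-seed sensitivity analysis individually reproducible rather than reproducible only in aggregate.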
Do we need different run log formats for US vs EU/UK?
No. Keep one truth. You may add a short label translation sheet (e.g., IRB → REC/HRA) to your reviewer guides, but the log structure, parameters, and manifests remain identical. This avoids drift and simplifies cross-region maintenance.
How do we prove a number changed only due to a parameter tweak?
Show two run logs with identical environment hashes and code versions but different parameter files; display the diff on the parameter file and the before/after output hashes. Add a short narrative and governance reference to close the loop. That chain is usually sufficient to resolve the query.
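The parameter diff itself can be generated mechanically; a sketch using Python's standard `difflib` (the file labels are placeholders):

```python
import difflib

def param_diff(before: str, after: str) -> str:
    """Unified diff of two parameter files, suitable for the evidence pack."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="params_v1", tofile="params_v2"))
```

Attaching this diff next to the two run logs makes the "only the parameter changed" claim self-evident: identical environment hash, identical code version, one visible `-`/`+` pair.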
Where should run logs and manifests live?
Alongside the outputs in a predictable directory structure, cross-linked from CTMS portfolio tiles and filed to the TMF. Store the parameter file and manifest with each log so retrieval is two clicks: from output to its run bundle, then to the specific artifact (script, params, or manifest).
