Published on 21/12/2025
Step-by-Step Guide to Preparing Clinical Trial Data for Public Repositories
Introduction: Why Proper Data Preparation Matters
As global regulations and journal policies increasingly demand open access to clinical trial data, researchers and sponsors must prepare datasets in formats suitable for public repositories. Improper or incomplete preparation can lead to regulatory delays, data misuse, or breaches of participant confidentiality. Data preparation is therefore not just a technical step; it is a regulatory, ethical, and scientific responsibility.
Preparing data for public sharing involves several critical activities: de-identification, metadata annotation, format conversion, documentation, and repository selection. This guide provides a detailed, compliant approach tailored to global expectations, including FDA, EMA, WHO, and ICMJE requirements.
Step 1: Define the Scope of Data for Sharing
The first step is identifying which components of the clinical trial dataset will be shared. Typical elements include:
- De-identified patient-level datasets (e.g., demographic, baseline, outcomes)
- Study protocol and statistical analysis plan (SAP)
- Case Report Forms (CRFs) or annotated CRFs
- Clinical Study Report (CSR)
- Data dictionaries and codebooks
- Data sharing plan and user guides
Ensure that shared data aligns with what was described in the trial’s data sharing statement and informed consent documents.
Step 2: Anonymize or De-Identify the Dataset
To protect participant confidentiality, apply de-identification techniques such as:
- Removing direct identifiers (e.g., name, phone number, social security number)
- Generalizing or binning date-of-birth, geographic location, or visit dates
- Replacing identifiers with subject IDs
- Using controlled randomization for sensitive categories (e.g., rare diseases)
De-identification must be irreversible. It’s best practice to document the method and date of anonymization in a separate file.
Sample De-Identification Table
| Original Field | De-Identification Method | Notes |
|---|---|---|
| Patient Name | Removed | Direct identifier |
| Date of Birth | Converted to age group | Avoids re-identification |
| City | Region only | Limits geographic precision |
| Visit Date | Offset by X days | Relative timeline preserved |
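The four methods in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`patient_name`, `date_of_birth`, `city`, `region`, `visit_date`), the ten-year age bands, and the helper names are hypothetical and not taken from any standard or repository requirement.

```python
from datetime import date, timedelta

# Hypothetical field names for direct identifiers; adapt to the real dataset.
DIRECT_IDENTIFIERS = {"patient_name", "phone", "ssn"}


def age_group(birth: date, reference: date, width: int = 10) -> str:
    """Convert a date of birth into an age band such as '40-49'."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    age = reference.year - birth.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: dict, subject_id: str, offset_days: int, reference: date) -> dict:
    """Return a new record with the table's four methods applied.

    offset_days should be the same for every visit of one subject,
    otherwise the relative timeline is not preserved.
    """
    # 1. Remove direct identifiers.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["subject_id"] = subject_id
    # 2. Replace date of birth with an age group.
    out["age_group"] = age_group(out.pop("date_of_birth"), reference)
    # 3. Drop the city and keep only the region.
    out.pop("city", None)
    # 4. Shift the visit date by the subject's offset.
    out["visit_date"] = out["visit_date"] + timedelta(days=offset_days)
    return out


record = {
    "patient_name": "Jane Doe",
    "date_of_birth": date(1980, 6, 15),
    "city": "Springfield",
    "region": "North",
    "visit_date": date(2024, 3, 1),
}
print(deidentify(record, "SUBJ-001", offset_days=-17, reference=date(2024, 1, 1)))
```

The offset and the mapping from participants to subject IDs would need to be stored securely or destroyed, depending on the anonymization method documented for the trial.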
Step 3: Format the Data for Compatibility
Public repositories often require datasets in specific formats. Common formats include:
- CSV or TSV for tabular datasets
- XML or JSON for structured submissions (e.g., to the Clinical Trials Registry - India, CTRI)
- SAS XPORT or CDISC-compliant SDTM/ADaM files for FDA submissions
All files should be checked for readability and encoding compatibility (e.g., UTF-8), and must not contain macros or embedded formulas.
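A check of this kind can be automated for CSV files. The sketch below uses only the Python standard library; the function name `check_csv_bytes` is hypothetical, and treating any cell that starts with `=` as a formula is a simplifying assumption that may flag legitimate text.

```python
import csv
import io


def check_csv_bytes(raw: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of a CSV file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    problems = []
    reader = csv.reader(io.StringIO(text, newline=""))
    for row_no, row in enumerate(reader, start=1):
        for col_no, cell in enumerate(row, start=1):
            # Assumption: a leading '=' marks a spreadsheet formula.
            if cell.startswith("="):
                problems.append(
                    f"row {row_no}, column {col_no}: possible formula {cell!r}"
                )
    return problems


print(check_csv_bytes(b"AGE,TOTAL\n34,=SUM(A1:A2)\n"))
```

An empty list means the file decoded as UTF-8 and no cell looked like a formula; it does not prove the file is free of other problems.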
Step 4: Create a Comprehensive Data Dictionary
A data dictionary explains every variable in the dataset, including its format, possible values, units, and logic. It ensures data usability for secondary researchers. A basic structure might include:
| Variable Name | Description | Type | Permissible Values |
|---|---|---|---|
| AGE | Age in years | Numeric | 18–99 |
| SEX | Biological sex | Text | Male, Female, Other |
| AE_SEV | Adverse event severity | Ordinal | 1=Mild, 2=Moderate, 3=Severe |
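A data dictionary is also useful for checking the dataset itself. The following sketch encodes the three example variables from the table as Python rules and validates one record against them. The structure of `DATA_DICTIONARY` and the function `validate_row` are illustrative assumptions, not a repository or CDISC format.

```python
# The three example variables from the table, expressed as simple rules.
DATA_DICTIONARY = {
    "AGE": {"type": int, "check": lambda v: 18 <= v <= 99},
    "SEX": {"type": str, "check": lambda v: v in {"Male", "Female", "Other"}},
    "AE_SEV": {"type": int, "check": lambda v: v in {1, 2, 3}},
}


def validate_row(row: dict) -> list[str]:
    """Return the problems found when checking one record against the dictionary."""
    errors = []
    for name, rule in DATA_DICTIONARY.items():
        if name not in row:
            errors.append(f"{name}: missing")
            continue
        value = row[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif not rule["check"](value):
            errors.append(f"{name}: value {value!r} is outside the permissible values")
    # Variables present in the data but absent from the dictionary.
    for name in sorted(row.keys() - DATA_DICTIONARY.keys()):
        errors.append(f"{name}: not documented in the data dictionary")
    return errors


print(validate_row({"AGE": 45, "SEX": "Female", "AE_SEV": 2}))
print(validate_row({"AGE": 17, "SEX": "Female", "WEIGHT": 70}))
```

Running the same rules over every record before release helps keep the published dictionary and the published data in agreement.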
Step 5: Prepare Metadata and Documentation
Metadata is machine-readable information that describes the dataset. It includes trial identifiers, data collection dates, responsible parties, and sharing conditions. Recommended metadata standards include:
- Dublin Core: for basic bibliographic metadata
- DataCite: for DOI-based repositories
- Clinical Data Interchange Standards Consortium (CDISC): for FDA/EMA submissions
Also include README files explaining file structure, naming conventions, and how to interpret the dataset.
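As a rough illustration, a minimal machine-readable metadata record could be written as JSON. The keys below are Dublin Core element names (`title`, `identifier`, `date`, `creator`, `rights`); mapping trial identifiers, collection dates, responsible parties, and sharing conditions onto them in this way is an assumption, as are the `REQUIRED` list, the function name, and all placeholder values. The target repository's own schema takes precedence.

```python
import json

# Assumed minimum set of fields; check the target repository's schema.
REQUIRED = ("title", "identifier", "date", "creator", "rights")


def build_metadata(**fields: str) -> str:
    """Serialize metadata fields to JSON, refusing incomplete records."""
    missing = [name for name in REQUIRED if not fields.get(name)]
    if missing:
        raise ValueError(f"missing metadata fields: {', '.join(missing)}")
    return json.dumps(fields, indent=2, sort_keys=True)


# All values are placeholders, not real trial details.
print(
    build_metadata(
        title="Example trial: de-identified participant-level dataset",
        identifier="TRIAL-ID-PLACEHOLDER",
        date="2020-01-01/2021-06-30",
        creator="Example Sponsor",
        rights="Controlled access; data use agreement required",
    )
)
```

Saving this output next to the README gives the repository and secondary researchers a single file to parse.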
Step 6: Review Legal, Ethical, and Policy Considerations
Before uploading, review institutional, national, and funder requirements. Confirm that:
- Ethics Committee/IRB approval covers data sharing
- Participant informed consent permits secondary use
- Any data transfer agreements (DTAs) are executed if required
- Embargoes or publication rights are respected
Include a plain language data sharing statement in the documentation pack.
Step 7: Choose and Upload to the Appropriate Repository
Repository selection depends on the trial type, sponsor policy, and access model:
- Open Repositories: Dryad, Figshare, Zenodo
- Controlled Repositories: Vivli, YODA Project, EMA Data Portal
- Regulatory Registries: ClinicalTrials.gov, EU CTR, ISRCTN
Ensure that files are uploaded with the correct metadata, license, and access controls. For example, CSVs should be accompanied by data dictionaries and README files.
Step 8: Assign Persistent Identifiers and License
Assigning a DOI (Digital Object Identifier) ensures that your dataset can be cited and tracked. Choose an appropriate license such as:
- CC BY 4.0: Permits sharing and reuse with attribution
- CC0: Public domain dedication
- Restricted use: With justified embargoes
Use repositories that support DOI minting and license tagging.
Step 9: Validate Data Before Submission
Perform internal validation checks to ensure data completeness, readability, and compliance:
- File naming matches SOP convention
- No missing columns or variables
- Consistency with the Clinical Study Report
- Compatibility with statistical software (e.g., R, SAS)
Include a final checklist in the submission folder for review before public release.
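Parts of such a checklist can be scripted. The sketch below checks file names against a naming pattern and confirms that supporting files are present. Both the pattern (lowercase letters, digits, and underscores with a short list of extensions) and the required file names are hypothetical stand-ins for whatever the sponsor's SOP actually specifies.

```python
import re

# Hypothetical SOP convention: lowercase letters, digits, underscores.
NAME_PATTERN = re.compile(r"^[a-z0-9_]+\.(csv|txt|pdf|md)$")
# Hypothetical supporting files expected in every submission folder.
REQUIRED_FILES = {"readme.md", "data_dictionary.csv"}


def check_folder(file_names: list[str]) -> dict[str, bool]:
    """Return pass/fail results for two basic pre-submission checks."""
    return {
        "names follow convention": all(
            NAME_PATTERN.match(name) is not None for name in file_names
        ),
        "required files present": REQUIRED_FILES <= set(file_names),
    }


print(check_folder(["readme.md", "data_dictionary.csv", "outcomes.csv"]))
print(check_folder(["Final Data.csv"]))
```

Checks that need judgment, such as consistency with the Clinical Study Report, still require manual review.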
Conclusion: Building a Culture of Responsible Data Sharing
Well-prepared datasets enable meaningful secondary research, reinforce transparency, and meet growing global expectations. By integrating good data stewardship practices into clinical trial workflows, sponsors and investigators contribute to reproducibility, ethical research use, and patient trust. Following the steps above helps ensure that data is not only shared, but shared responsibly and usefully for global health advancement.
