WHO data transparency – Clinical Research Made Simple

How to Prepare Data for Public Sharing Repositories in Clinical Trials

digi — Sun, 24 Aug 2025 15:54:22 +0000

Step-by-Step Guide to Preparing Clinical Trial Data for Public Repositories

Introduction: Why Proper Data Preparation Matters

As global regulations and journal policies increasingly demand open access to clinical trial data, researchers and sponsors must prepare datasets in formats suitable for public repositories. Improper or incomplete preparation can lead to regulatory delays, data misuse, or breaches of participant confidentiality. Therefore, data preparation is not just a technical step — it’s a regulatory, ethical, and scientific responsibility.

Preparing data for public sharing involves several critical activities: de-identification, metadata annotation, format conversion, documentation, and repository selection. This guide provides a detailed, compliant approach tailored to global expectations, including FDA, EMA, WHO, and ICMJE requirements.

Step 1: Define the Scope of Data for Sharing

The first step is identifying which components of the clinical trial dataset will be shared. Typical elements include:

De-identified patient-level datasets (e.g., demographic, baseline, outcomes)
Study protocol and statistical analysis plan (SAP)
Case Report Forms (CRFs) or annotated CRFs
Clinical Study Report (CSR)
Data dictionaries and codebooks
Data sharing plan and user guides

Ensure that shared data aligns with what was described in the trial’s data sharing statement and informed consent documents.

Step 2: Anonymize or De-Identify the Dataset

To comply with privacy regulations like GDPR and HIPAA, data must be fully anonymized or de-identified. Techniques include:

Removing direct identifiers (e.g., name, phone number, social security number)
Generalizing or binning date-of-birth, geographic location, or visit dates
Replacing identifiers with subject IDs
Using controlled randomization for sensitive categories (e.g., rare diseases)

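For public release, de-identification should be irreversible from the recipient’s side: any key that links subject IDs to original identities must never be included in the shared files. It’s best practice to document the method and date of anonymization in a separate file.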

Sample De-Identification Table

Original Field	De-Identification Method	Notes
Patient Name	Removed	Direct identifier
Date of Birth	Converted to age group	Avoids re-identification
City	Region only	Limits geographic precision
Visit Date	Offset by X days	Relative timeline preserved
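
The four rules in the table can be sketched in a few lines of Python. This is a minimal illustration, not a validated anonymization procedure: the field names, the `CITY_TO_REGION` lookup, the 10-year age bands, and the `-17` day offset are all assumptions made for the example. In practice the offset would typically be chosen per participant and kept out of the shared files.

```python
from datetime import date, timedelta

# Hypothetical lookup from city to a coarser region; a real study defines its own.
CITY_TO_REGION = {"Lyon": "Western Europe", "Pune": "South Asia"}


def age_group(birth: date, reference: date, width: int = 10) -> str:
    """Convert a date of birth into an age band such as '40-49'."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    age = reference.year - birth.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: dict, subject_id: str, offset_days: int, reference: date) -> dict:
    """Apply the four rules from the sample table to one participant record."""
    return {
        # Patient name is dropped; only the study-assigned ID is kept.
        "subject_id": subject_id,
        # Date of birth becomes an age group.
        "age_group": age_group(record["date_of_birth"], reference),
        # City is generalized to a region; unknown cities fall back to "Other".
        "region": CITY_TO_REGION.get(record["city"], "Other"),
        # Every visit date is shifted by the same offset, so intervals are preserved.
        "visit_dates": [d + timedelta(days=offset_days) for d in record["visit_dates"]],
    }


raw = {
    "patient_name": "Jane Example",
    "date_of_birth": date(1980, 6, 15),
    "city": "Lyon",
    "visit_dates": [date(2024, 1, 10), date(2024, 2, 7)],
}
shared = deidentify(raw, subject_id="SUBJ-0001", offset_days=-17, reference=date(2024, 1, 10))
print(shared)
```

Because both visit dates move by the same number of days, the 28-day gap between visits is unchanged even though neither true date appears in the output.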

Step 3: Format the Data for Compatibility

Public repositories often require datasets in specific formats. Common formats include:

CSV or TSV for tabular datasets
XML or JSON for structured submissions (e.g., to CTRI)
SAS XPORT or CDISC-compliant SDTM/ADaM files for FDA submissions

All files should be checked for readability and encoding compatibility (e.g., UTF-8), and they must not contain macros or embedded formulas.
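
As an illustration of these checks, the sketch below tests whether a CSV file decodes as UTF-8, whether every row has the same number of columns as the header, and whether any cell looks like a spreadsheet formula. The function name `check_csv_bytes` and the choice of `=` and `@` as formula prefixes are assumptions for the example; a real pipeline may need a broader rule.

```python
import csv
import io

# Leading characters that spreadsheet tools may interpret as a formula (simplified).
FORMULA_PREFIXES = ("=", "@")


def check_csv_bytes(raw: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of a CSV file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8 at byte {exc.start}"]

    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ["file is empty"]

    problems = []
    width = len(rows[0])  # the header row defines the expected column count
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {row_no}: expected {width} columns, found {len(row)}")
        for cell in row:
            if cell.startswith(FORMULA_PREFIXES):
                problems.append(f"row {row_no}: cell {cell!r} looks like a formula")
    return problems


print(check_csv_bytes(b"AGE,SEX\n=SUM(A1:A9),Female\n"))
```

An empty result means none of these three checks failed; it does not by itself show that the file is fit for release.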

Step 4: Create a Comprehensive Data Dictionary

A data dictionary explains every variable in the dataset, including its format, possible values, units, and logic. It ensures data usability for secondary researchers. A basic structure might include:

Variable Name	Description	Type	Permissible Values
AGE	Age in years	Numeric	18–99
SEX	Biological sex	Text	Male, Female, Other
AE_SEV	Adverse event severity	Ordinal	1=Mild, 2=Moderate, 3=Severe
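
A data dictionary is most useful when it can also be checked by a program. The sketch below encodes the three example variables as Python rules and validates one record against them. The `DATA_DICTIONARY` structure and the `validate_record` helper are hypothetical; they are one possible encoding, not a standard format.

```python
# Machine-readable form of the example dictionary above (structure is an assumption).
DATA_DICTIONARY = {
    "AGE": {"type": int, "check": lambda v: 18 <= v <= 99},
    "SEX": {"type": str, "check": lambda v: v in {"Male", "Female", "Other"}},
    "AE_SEV": {"type": int, "check": lambda v: v in {1, 2, 3}},
}


def validate_record(record: dict) -> list[str]:
    """Return the ways in which one record violates the data dictionary."""
    errors = []
    for name, rule in DATA_DICTIONARY.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif not rule["check"](value):
            errors.append(f"{name}: value {value!r} not permitted")
    # Variables that appear in the data but are not documented are also reported.
    for name in sorted(record.keys() - DATA_DICTIONARY.keys()):
        errors.append(f"{name}: not in data dictionary")
    return errors


print(validate_record({"AGE": 17, "SEX": "F", "AE_SEV": 2}))
```

Reporting undocumented variables matters for public sharing, since a column missing from the dictionary may also have been missed during de-identification.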

Step 5: Prepare Metadata and Documentation

Metadata is machine-readable information that describes the dataset. It includes trial identifiers, data collection dates, responsible parties, and sharing conditions. Recommended metadata standards include:

Dublin Core: for basic bibliographic metadata
DataCite: for DOI-based repositories
Clinical Data Interchange Standards Consortium (CDISC) standards such as Define-XML: for FDA/EMA submissions

Also include README files explaining file structure, naming conventions, and how to interpret the dataset.
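
To show what a machine-readable record might look like, the sketch below builds a small metadata file using element names from the Dublin Core element set. The `REQUIRED` subset, the `build_metadata` helper, the JSON serialization, and all field values (including the placeholder identifier `NCT00000000`) are assumptions for illustration. Each repository publishes its own schema, which takes precedence.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical house rule: these elements must always be present.
REQUIRED = {"title", "creator", "date", "identifier", "rights"}


def build_metadata(**fields: str) -> str:
    """Validate the supplied fields and return them as a JSON document."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    missing = REQUIRED - set(fields)
    if missing:
        raise ValueError(f"missing required elements: {sorted(missing)}")
    return json.dumps(fields, indent=2, sort_keys=True)


metadata_json = build_metadata(
    title="Example trial: de-identified participant-level dataset",
    creator="Example Sponsor",
    date="2025-08-24",
    identifier="NCT00000000",  # placeholder, not a real registration number
    rights="CC BY 4.0",
    format="text/csv",
)
print(metadata_json)
```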

Step 6: Review Legal, Ethical, and Policy Considerations

Before uploading, review institutional, national, and funder requirements. Confirm that:

Ethics Committee/IRB approval covers data sharing
Participant informed consent permits secondary use
Any data transfer agreements (DTAs) are executed if required
Embargoes or publication rights are respected

Include a plain language data sharing statement in the documentation pack.

Step 7: Choose and Upload to the Appropriate Repository

Repository selection depends on the trial type, sponsor policy, and access model:

Open Repositories: Dryad, Figshare, Zenodo
Controlled Repositories: Vivli, YODA Project, EMA Data Portal
Regulatory Registries: ClinicalTrials.gov, EU CTR, ISRCTN

Ensure that files are uploaded with the correct metadata, license, and access controls. For example, CSVs should be accompanied by data dictionaries and README files.
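
A pre-upload check for companion files can be automated. The sketch below assumes a flat list of file names and a hypothetical naming convention in which the dictionary for `outcomes.csv` is called `outcomes_dictionary.csv`; neither the convention nor the `missing_companions` helper comes from any repository's requirements.

```python
from pathlib import PurePosixPath


def missing_companions(filenames: list[str]) -> list[str]:
    """List missing README and data dictionary files for an upload package."""
    names = set(filenames)
    problems = []

    # At least one file whose name starts with README (any extension) is expected.
    if not any(PurePosixPath(n).name.upper().startswith("README") for n in names):
        problems.append("no README file")

    for name in sorted(names):
        path = PurePosixPath(name)
        is_data_csv = path.suffix.lower() == ".csv" and not path.stem.endswith("_dictionary")
        if is_data_csv:
            expected = f"{path.stem}_dictionary.csv"
            if expected not in names:
                problems.append(f"{name}: missing {expected}")
    return problems


print(missing_companions(["outcomes.csv"]))
```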

Step 8: Assign Persistent Identifiers and License

Assigning a DOI (Digital Object Identifier) ensures that your dataset can be cited and tracked. Choose an appropriate license such as:

CC BY 4.0: Permits sharing and reuse with attribution
CC0: Public domain dedication
Restricted use: With justified embargoes

Use repositories that support DOI minting and license tagging.

Step 9: Validate Data Before Submission

Perform internal validation checks to ensure data completeness, readability, and compliance:

File naming matches SOP convention
No missing columns or variables
Consistency with the Clinical Study Report
Compatibility with statistical software (e.g., R, SAS)

Include a final checklist in the submission folder for review before public release.
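
Part of such a checklist can be generated automatically. The sketch below covers only the file naming and column checks from the list above. The naming pattern (`STUDYID_domain_vN.csv`) stands in for whatever the sponsor's SOP actually specifies, and `run_checklist` is a hypothetical helper.

```python
import re

# Hypothetical SOP naming convention, e.g. "TRIAL01_outcomes_v1.csv".
NAME_PATTERN = re.compile(r"^[A-Z0-9]+_[a-z]+_v\d+\.csv$")


def run_checklist(filename: str, header: list[str], expected_columns: list[str]) -> dict[str, bool]:
    """Return a pass/fail result for each automated checklist item."""
    return {
        "file name matches convention": bool(NAME_PATTERN.match(filename)),
        "no missing columns": set(expected_columns) <= set(header),
        "no unexpected columns": set(header) <= set(expected_columns),
    }


result = run_checklist(
    "TRIAL01_outcomes_v1.csv",
    header=["SUBJID", "AGE", "SEX"],
    expected_columns=["SUBJID", "AGE", "SEX", "AE_SEV"],
)
for item, passed in result.items():
    print(f"[{'x' if passed else ' '}] {item}")
```

Consistency with the Clinical Study Report and compatibility with statistical software still need human or tool-specific review; this sketch does not attempt them.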

Conclusion: Building a Culture of Responsible Data Sharing

Well-prepared datasets enable meaningful secondary research, reinforce transparency, and meet growing global expectations. By integrating good data stewardship practices into clinical trial workflows, sponsors and investigators contribute to reproducibility, ethical research use, and patient trust. Following the steps above helps ensure that data is not only shared but shared responsibly and usefully for global health advancement.

Importance of Open Data in Clinical Trial Transparency

digi — Sun, 24 Aug 2025 00:53:47 +0000

Why Open Data Is Critical for Trust and Transparency in Clinical Trials

Introduction: The Need for Transparency in Clinical Research

Open access to clinical trial data is a cornerstone of scientific integrity and public trust. In recent years, regulatory agencies, journal editors, and patient advocacy groups have increasingly emphasized the importance of making clinical trial data publicly available. Open data promotes reproducibility, allows secondary analyses, and exposes selective reporting or misconduct.

Without open data, results may remain inaccessible or selectively published, skewing evidence for clinicians, regulators, and policymakers. Transparency reduces bias and enhances accountability in research practices, especially when trials inform public health interventions or global treatment guidelines.

Defining Open Data in Clinical Trials

Open data in the context of clinical trials refers to anonymized, de-identified datasets and trial-level metadata that are made publicly accessible. These may include:

Protocol and statistical analysis plans (SAPs)
Baseline characteristics of enrolled participants
Outcome measures and raw data files (e.g., CSV, XML)
Adverse event logs
Supplementary analysis results

Trial-level information is typically posted in registries such as ClinicalTrials.gov, while participant-level datasets are hosted in repositories such as Vivli or the YODA Project.

Regulatory Drivers for Open Data Mandates

Several global regulatory frameworks now mandate or strongly encourage trial data sharing. For instance:

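EMA Policy 0070: Provides for proactive publication of clinical data submitted in regulatory dossiers. The first phase covers clinical reports such as clinical overviews, clinical summaries, and CSRs, with personal data anonymized; publication of individual patient data is foreseen for a later phase.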
FDAAA 801 Final Rule (42 CFR Part 11): Requires registration and submission of summary results information for applicable clinical trials on ClinicalTrials.gov.
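NIH Data Management and Sharing Policy: Effective January 25, 2023, this policy requires researchers seeking NIH funding for work that generates scientific data to submit a data management and sharing plan and to comply with the approved plan.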

These frameworks aim to uphold principles of accountability, public benefit, and efficient scientific progress.

Scientific Value of Open Data: Reproducibility and Meta-Analysis

Open datasets allow for independent verification of results, which is critical in an era of reproducibility crises across medical disciplines. For example, a 2021 meta-analysis re-analyzed 38 open-access cancer trial datasets and found that 18% had significant deviations from published outcomes, including inconsistent statistical interpretations.

Moreover, large-scale meta-analyses and network meta-analyses (NMA) rely on access to granular data from multiple studies. These pooled analyses shape global health guidelines and payer decisions.

Ethical Justification: Public Right to Access Research Data

Trial participants contribute their data altruistically, often at personal risk. Ethically, researchers and sponsors have a responsibility to ensure that the knowledge derived benefits society. Open data enables this by ensuring the broadest possible use of trial outcomes — for academic research, innovation, policy development, and educational use.

Transparency also supports patient advocacy. Groups representing rare disease populations or underrepresented communities use open data to campaign for targeted research and better access to therapies.

Open Data and Informed Consent: Ethical Balancing

While data sharing supports transparency, it must not compromise participant confidentiality. Informed consent documents must now incorporate clauses explaining how and where data may be shared. Ethical review boards must assess data sharing plans to ensure:

Risks of re-identification are minimized
Consent is voluntary and revocable
Shared data adheres to applicable laws like GDPR or HIPAA

Institutions often use data transfer agreements (DTAs) and controlled-access models for sensitive data types.
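
One common way to put a number on re-identification risk is k-anonymity, a measure not named above but widely used in de-identification work: a dataset is k-anonymous if every combination of quasi-identifier values is shared by at least k participants. The sketch below computes the smallest such group. The field names and the `smallest_group` helper are assumptions for the example, and an acceptable value of k is a policy decision, not something the code can determine.

```python
from collections import Counter


def smallest_group(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        return 0
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


records = [
    {"age_group": "40-49", "region": "Western Europe", "AE_SEV": 1},
    {"age_group": "40-49", "region": "Western Europe", "AE_SEV": 3},
    {"age_group": "70-79", "region": "South Asia", "AE_SEV": 2},
]
k = smallest_group(records, ["age_group", "region"])
print(f"smallest group size: {k}")
```

Here the single participant aged 70-79 in South Asia forms a group of one, so this toy dataset would need further generalization or suppression before open release, or a controlled-access model.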

Practical Tools and Repositories for Open Data Submission

Several repositories support open data access:

Repository	Scope	Access Type
ClinicalTrials.gov	Interventional and observational studies	Open
Vivli.org	Industry- and academic-sponsored trials	Controlled
Dryad	General scientific data	Open
EU Clinical Trials Register	EU-regulated studies	Open

Some sponsors also maintain institutional repositories with anonymized datasets linked to publication DOI numbers.

FAIR Principles and Trial Data Management

The FAIR data principles (Findable, Accessible, Interoperable, and Reusable) guide modern data sharing strategies. Clinical trial data should be labeled with appropriate metadata, coded using standard terminologies (e.g., CDISC Controlled Terminology, MedDRA), and stored in machine-readable formats to facilitate downstream use.

Compliance with FAIR enhances the utility and visibility of datasets, enabling integration with electronic health records (EHRs), registries, and AI models for trial design prediction.

Case Study: Open Data Impact in COVID-19 Research

During the COVID-19 pandemic, rapid sharing of trial protocols, interim analyses, and patient-level data enabled real-time decision-making. The Solidarity Trial, launched by WHO, made trial updates and outcomes publicly available across countries. This transparency accelerated regulatory approvals, public acceptance, and international collaboration.

Similarly, open access to data from vaccine trials enabled multiple secondary analyses related to efficacy in subpopulations, safety across age groups, and long-term effects.

Risks and Concerns Associated with Open Data

Despite its benefits, open data sharing poses risks such as:

Data misuse or misinterpretation by non-experts
Competitive disadvantage for sponsors sharing proprietary data
Legal exposure from privacy breaches

Risk mitigation strategies include data anonymization protocols, controlled access models, and clear data use agreements (DUAs).

Conclusion: Open Data as a Pillar of Research Integrity

Open data is not just a regulatory expectation — it is a moral and scientific imperative. By promoting reproducibility, enhancing public trust, and enabling innovation, it strengthens the credibility of the clinical research enterprise. Institutions, investigators, and sponsors must align their policies and systems to ensure seamless, ethical, and effective data sharing. In doing so, they uphold the social contract between science and society.