trial data curation – Clinical Research Made Simple

How to Prepare Data for Public Sharing Repositories in Clinical Trials

digi — Sun, 24 Aug 2025 15:54:22 +0000

Step-by-Step Guide to Preparing Clinical Trial Data for Public Repositories

Introduction: Why Proper Data Preparation Matters

As global regulations and journal policies increasingly demand open access to clinical trial data, researchers and sponsors must prepare datasets in formats suitable for public repositories. Improper or incomplete preparation can lead to regulatory delays, data misuse, or breaches of participant confidentiality. Data preparation is therefore not just a technical step; it is a regulatory, ethical, and scientific responsibility.

Preparing data for public sharing involves several critical activities: de-identification, metadata annotation, format conversion, documentation, and repository selection. This guide provides a detailed, compliant approach tailored to global expectations, including FDA, EMA, WHO, and ICMJE requirements.

Step 1: Define the Scope of Data for Sharing

The first step is identifying which components of the clinical trial dataset will be shared. Typical elements include:

De-identified patient-level datasets (e.g., demographic, baseline, outcomes)
Study protocol and statistical analysis plan (SAP)
Case Report Forms (CRFs) or annotated CRFs
Clinical Study Report (CSR)
Data dictionaries and codebooks
Data sharing plan and user guides

Ensure that shared data aligns with what was described in the trial’s data sharing statement and informed consent documents.

Step 2: Anonymize or De-Identify the Dataset

To comply with privacy regulations like GDPR and HIPAA, data must be fully anonymized or de-identified. Techniques include:

Removing direct identifiers (e.g., name, phone number, social security number)
Generalizing or binning date-of-birth, geographic location, or visit dates
Replacing identifiers with subject IDs
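Suppressing, aggregating, or perturbing values in sensitive or small categories (e.g., rare diseases)

De-identification must be irreversible for anyone who receives the shared data: any key linking subject IDs to real identities, and any date offsets applied, must stay out of the shared package. It’s best practice to document the method and date of anonymization in a separate file.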

Sample De-Identification Table

Original Field	De-Identification Method	Notes
Patient Name	Removed	Direct identifier
Date of Birth	Converted to age group	Avoids re-identification
City	Region only	Limits geographic precision
Visit Date	Offset by X days	Relative timeline preserved
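The methods in the table above can be sketched in a few lines of Python. This is a minimal illustration, not a validated de-identification tool: the field names (`mrn`, `dob`, `city`, `visit_date`), the `CITY_TO_REGION` lookup, the 10-year age bands, and the 30-day maximum shift are all assumptions made for the example.

```python
import random
from datetime import date, timedelta

# Hypothetical lookup; a real project would maintain its own city-to-region map.
CITY_TO_REGION = {"Lyon": "Auvergne-Rhone-Alpes", "Pune": "Maharashtra"}


def age_group(dob, ref, width=10):
    """Convert a date of birth into an age band such as '40-49'."""
    age = ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(records, ref_date, max_shift_days=30, seed=None):
    """Return shareable rows; the ID key and date offsets stay inside this function."""
    rng = random.Random(seed)
    subject_ids, offsets = {}, {}
    shared = []
    for rec in records:
        key = rec["mrn"]  # hypothetical internal record number
        if key not in subject_ids:
            subject_ids[key] = f"S{len(subject_ids) + 1:04d}"
            # One offset per patient keeps the intervals between visits intact.
            offsets[key] = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        shared.append({
            "subject_id": subject_ids[key],
            "age_group": age_group(rec["dob"], ref_date),
            "region": CITY_TO_REGION.get(rec["city"], "Unknown"),
            "visit_date": rec["visit_date"] + offsets[key],
        })
    return shared


# Made-up example records for one patient with two visits.
raw = [
    {"mrn": "A17", "name": "Jane Doe", "dob": date(1980, 5, 1),
     "city": "Lyon", "visit_date": date(2024, 3, 1)},
    {"mrn": "A17", "name": "Jane Doe", "dob": date(1980, 5, 1),
     "city": "Lyon", "visit_date": date(2024, 3, 15)},
]
shared_rows = deidentify(raw, ref_date=date(2025, 1, 1), seed=42)
```

Because the offset is drawn once per patient, the two visits remain 14 days apart in the shared rows. Note that subject IDs assigned in order of appearance can leak enrollment order; shuffling the records first is one way to avoid that.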

Step 3: Format the Data for Compatibility

Public repositories often require datasets in specific formats. Common formats include:

CSV or TSV for tabular datasets
XML or JSON for structured submissions (e.g., to CTRI)
SAS XPORT or CDISC-compliant SDTM/ADaM files for FDA submissions

All files should be checked for readability and encoding compatibility (e.g., UTF-8), and must not contain macros or embedded formulas.
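As a rough illustration of these checks for a CSV file, the sketch below decodes the bytes as UTF-8, confirms that every row has the same number of fields, and flags cells that begin with `=`, which spreadsheet software would treat as a formula. The function name `check_csv_bytes` and the choice of checks are assumptions for the example; a repository may impose further requirements.

```python
import csv
import io


def check_csv_bytes(raw):
    """Return a list of human-readable problems found in CSV content (bytes)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ["file is empty"]

    problems = []
    width = len(rows[0])  # the header row defines the expected field count
    for line_no, row in enumerate(rows, start=1):
        if len(row) != width:
            problems.append(f"row {line_no}: expected {width} fields, found {len(row)}")
        for cell in row:
            if cell.startswith("="):
                problems.append(f"row {line_no}: cell looks like a formula: {cell!r}")
    return problems


clean = b"subject_id,age_group\nS0001,40-49\n"
with_formula = b"subject_id,total\nS0001,=SUM(A1:A2)\n"
print(check_csv_bytes(clean))
print(check_csv_bytes(with_formula))
```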

Step 4: Create a Comprehensive Data Dictionary

A data dictionary explains every variable in the dataset, including its format, possible values, units, and logic. It ensures data usability for secondary researchers. A basic structure might include:

Variable Name	Description	Type	Permissible Values
AGE	Age in years	Numeric	18–99
SEX	Biological sex	Text	Male, Female, Other
AE_SEV	Adverse event severity	Ordinal	1=Mild, 2=Moderate, 3=Severe
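A data dictionary is most useful when it is also machine-checkable. The sketch below encodes the three example variables from the table as a Python dictionary and validates a row against it. The structure of `DATA_DICTIONARY` and the function `validate_row` are illustrative assumptions, not a standard format.

```python
# The three variables from the example table, expressed as checkable rules.
DATA_DICTIONARY = {
    "AGE": {"type": "numeric", "min": 18, "max": 99},
    "SEX": {"type": "text", "allowed": {"Male", "Female", "Other"}},
    "AE_SEV": {"type": "ordinal", "allowed": {1, 2, 3}},
}


def validate_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of rule violations for one record (a dict of variable -> value)."""
    errors = []
    for name, rule in dictionary.items():
        if name not in row:
            errors.append(f"{name}: variable missing from row")
            continue
        value = row[name]
        if value is None:
            continue  # missing values are a separate question from invalid ones
        if rule["type"] == "numeric":
            in_range = isinstance(value, (int, float)) and rule["min"] <= value <= rule["max"]
            if not in_range:
                errors.append(f"{name}: {value!r} outside {rule['min']}-{rule['max']}")
        elif value not in rule["allowed"]:
            errors.append(f"{name}: {value!r} is not a permissible value")
    for name in sorted(set(row) - set(dictionary)):
        errors.append(f"{name}: not described in the data dictionary")
    return errors


print(validate_row({"AGE": 45, "SEX": "Female", "AE_SEV": 2}))
print(validate_row({"AGE": 17, "SEX": "F", "AE_SEV": 2, "SITE": "01"}))
```

Running the same check over every row before upload catches both out-of-range values and variables that were never documented.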

Step 5: Prepare Metadata and Documentation

Metadata is machine-readable information that describes the dataset. It includes trial identifiers, data collection dates, responsible parties, and sharing conditions. Recommended metadata standards include:

Dublin Core: for basic bibliographic metadata
DataCite: for DOI-based repositories
Clinical Data Interchange Standards Consortium (CDISC) standards, such as Define-XML: for FDA/EMA submissions

Also include README files explaining file structure, naming conventions, and how to interpret the dataset.
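To make the idea of machine-readable metadata concrete, here is a minimal sketch of a metadata record serialized as JSON, with a completeness check. The field names are illustrative only and do not reproduce the official Dublin Core, DataCite, or CDISC schemas; all values, including the trial identifier, are placeholders.

```python
import json

# Illustrative field names only; check the target repository's schema before use.
REQUIRED_FIELDS = (
    "title", "trial_identifier", "creators", "publisher", "publication_year",
    "collection_start", "collection_end", "access_conditions", "license",
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "title": "De-identified participant-level data for an example trial",
    "trial_identifier": "NCT00000000",   # placeholder, not a real registration
    "creators": ["Example Sponsor"],      # responsible parties
    "publisher": "Example Repository",
    "publication_year": 2025,
    "collection_start": "2022-01-10",     # data collection dates
    "collection_end": "2023-06-30",
    "access_conditions": "Controlled access on approved request",
    "license": "CC BY 4.0",
}

print(json.dumps(record, indent=2))
print(missing_metadata(record))
```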

Step 6: Review Legal, Ethical, and Policy Considerations

Before uploading, review institutional, national, and funder requirements. Confirm that:

Ethics Committee/IRB approval covers data sharing
Participant informed consent permits secondary use
Any data transfer agreements (DTAs) are executed if required
Embargoes or publication rights are respected

Include a plain language data sharing statement in the documentation pack.

Step 7: Choose and Upload to the Appropriate Repository

Repository selection depends on the trial type, sponsor policy, and access model:

Open Repositories: Dryad, Figshare, Zenodo
Controlled Repositories: Vivli, YODA Project, EMA Data Portal
Regulatory Registries: ClinicalTrials.gov, EU CTR, ISRCTN

Ensure that files are uploaded with the correct metadata, license, and access controls. For example, CSVs should be accompanied by data dictionaries and README files.

Step 8: Assign Persistent Identifiers and License

Assigning a DOI (Digital Object Identifier) ensures that your dataset can be cited and tracked. Choose an appropriate license such as:

CC BY 4.0: Permits sharing and reuse with attribution
CC0: Public domain dedication
Restricted use: With justified embargoes

Use repositories that support DOI minting and license tagging.

Step 9: Validate Data Before Submission

Perform internal validation checks to ensure data completeness, readability, and compliance:

File naming matches SOP convention
No missing columns or variables
Consistency with the Clinical Study Report
Compatibility with statistical software (e.g., R, SAS)

Include a final checklist in the submission folder for review before public release.
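Part of such a checklist can be automated. The sketch below checks file names against a naming convention and confirms that the expected items are present. The convention shown (`<studyid>_<content>_v<version>.<ext>`) and the required items are hypothetical; substitute whatever your own SOP specifies.

```python
import re

# Hypothetical SOP convention: <studyid>_<content>_v<version>.<ext>, all lowercase.
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9]+_v\d+\.(csv|pdf|txt)$")
REQUIRED_CONTENT = {"data", "dictionary", "readme"}


def check_package(filenames):
    """Return naming violations and missing required items for a submission folder."""
    valid = [name for name in filenames if NAME_PATTERN.match(name)]
    issues = [f"bad file name: {name}" for name in filenames if name not in valid]
    present = {name.split("_")[1] for name in valid}
    issues += [f"missing required item: {item}" for item in sorted(REQUIRED_CONTENT - present)]
    return issues


good = ["abc123_data_v1.csv", "abc123_dictionary_v1.csv", "abc123_readme_v1.txt"]
bad = ["abc123_data_v1.csv", "Final Data (2).xlsx"]
print(check_package(good))
print(check_package(bad))
```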

Conclusion: Building a Culture of Responsible Data Sharing

Well-prepared datasets enable meaningful secondary research, reinforce transparency, and meet growing global expectations. By integrating good data stewardship practices into clinical trial workflows, sponsors and investigators contribute to reproducibility, ethical research use, and patient trust. Following the steps above helps ensure that data is shared responsibly and in a form that is useful for global health research.

Understanding Types of Missing Data in Clinical Trials

digi — Mon, 21 Jul 2025 13:45:09 +0000

Types of Missing Data in Clinical Trials: MCAR, MAR, and MNAR Explained

Missing data is an unavoidable issue in clinical trials. Whether due to patient dropouts, missed visits, or data entry errors, incomplete datasets can significantly impact the reliability of statistical results. Understanding the types of missing data is crucial for developing appropriate handling strategies and ensuring data integrity.

In clinical research, missing data can be classified into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). Each type carries different implications for analysis and interpretation. This tutorial offers clear guidance on recognizing these types and integrating effective strategies in alignment with regulatory expectations from bodies such as the US FDA.

Why It’s Critical to Address Missing Data in Clinical Trials

Incomplete data can:

Introduce bias and reduce statistical power
Complicate efficacy and safety assessments
Lead to invalid conclusions and regulatory setbacks
Trigger additional scrutiny during regulatory reviews

Proactively identifying the type of missing data allows statisticians to implement effective imputation and analysis techniques. These practices should be well-documented in the Statistical Analysis Plan (SAP) and standard operating procedures (SOPs).

1. Missing Completely at Random (MCAR):

MCAR means that the probability of data being missing is unrelated to any observed or unobserved data. In other words, the missingness occurs entirely by chance and does not depend on patient characteristics, treatment, or outcomes.

Example:

A lab sample was lost in transit randomly and has no relation to the patient’s health or treatment.

Implications:

MCAR is the least problematic missing data type
Statistical analyses remain unbiased if cases with missing data are excluded (complete-case analysis), although the smaller sample reduces statistical power
Very rare in real-world clinical trials

2. Missing at Random (MAR):

MAR occurs when the probability of missing data depends on observed data but, once those observed data are accounted for, not on the missing values themselves. This allows the missingness to be predicted and modeled using existing variables.

Example:

Patients with higher baseline blood pressure are more likely to miss follow-up visits, but baseline blood pressure was recorded for every patient, so it can be used to model the missing follow-up values.

Implications:

MAR is more common and manageable using statistical methods like multiple imputation
Valid inferences can be drawn if the missingness mechanism is modeled correctly
Requires careful planning and transparent documentation in the SAP

Incorporating auxiliary variables during imputation can improve accuracy under MAR assumptions, which benefits both interim and final analyses.
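A small simulation can show why modeling with observed variables works under MAR. In the sketch below, the outcome is more likely to be missing when the baseline value is high, so the complete-case mean is biased, while a regression on the fully observed baseline recovers the mean. All numbers (effect size, missingness probabilities, sample size) are invented for illustration, and the single regression imputation used here is a simplified stand-in for multiple imputation, not a replacement for it.

```python
import random
from statistics import fmean


def mar_demo(n=20000, seed=1):
    """Simulate MAR missingness and compare two estimates of the outcome mean."""
    rng = random.Random(seed)
    baseline, outcome, seen = [], [], []
    for _ in range(n):
        b = rng.gauss(0, 1)
        y = 0.8 * b + rng.gauss(0, 0.6)
        p_missing = 0.6 if b > 0 else 0.1  # depends only on the observed baseline
        baseline.append(b)
        outcome.append(y)
        seen.append(rng.random() >= p_missing)

    # Ordinary least squares of outcome on baseline, fitted on complete cases.
    xs = [b for b, s in zip(baseline, seen) if s]
    ys = [y for y, s in zip(outcome, seen) if s]
    mx, my = fmean(xs), fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx

    # Fill each missing outcome with its predicted value, then average.
    completed = [y if s else intercept + slope * b
                 for b, y, s in zip(baseline, outcome, seen)]
    return {
        "true_mean": fmean(outcome),
        "complete_case": fmean(ys),
        "model_based": fmean(completed),
    }


result = mar_demo()
print(result)
```

The model-based estimate lands close to the true mean, while the complete-case estimate is pulled downward. Single imputation like this understates uncertainty; multiple imputation addresses that by creating several completed datasets and combining the results.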

3. Missing Not at Random (MNAR):

MNAR occurs when the probability of missing data is related to the unobserved (missing) value itself. This creates significant bias because the reason for the missing data is inherently linked to the data itself.

Example:

Patients experiencing severe side effects may be more likely to drop out, and their adverse event data is missing.

Implications:

Most challenging to handle because standard models may produce biased estimates
Requires sensitivity analyses or modeling the missingness mechanism explicitly (e.g., selection models, pattern-mixture models)
Often subject to regulatory concern if not addressed properly
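One simple form of sensitivity analysis is a delta adjustment, in the spirit of pattern-mixture models: assume that missing values differ from observed ones by a fixed amount (delta) and see how the estimate changes as delta varies. The sketch below is a deliberately simplified version that works on a single mean; the scores and the delta values are hypothetical.

```python
from statistics import fmean


def delta_adjusted_mean(observed, n_missing, delta):
    """Mean after imputing each missing value as (observed mean + delta)."""
    imputed_value = fmean(observed) + delta
    total = sum(observed) + n_missing * imputed_value
    return total / (len(observed) + n_missing)


observed_scores = [10.0, 12.0, 14.0]  # hypothetical observed outcomes
for delta in (0.0, -2.0, -4.0):
    # delta = 0 corresponds to assuming dropouts resemble completers.
    print(delta, delta_adjusted_mean(observed_scores, n_missing=1, delta=delta))
```

If the study conclusion holds across the range of deltas considered clinically plausible, the result is more robust to an MNAR mechanism; the delta at which the conclusion changes is often called the tipping point.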

Visual Summary of Missing Data Types

Type	Missingness Depends On	Analytical Approach
MCAR	Neither observed nor unobserved data	Complete-case analysis (listwise deletion)
MAR	Observed data	Multiple imputation, mixed-effects models
MNAR	Unobserved (missing) data	Sensitivity analysis, modeling missingness explicitly
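The three rows of the table can be compared side by side in a simulation. The sketch below generates the same kind of data under each mechanism and reports how far the complete-case mean drifts from the mean of the full data. The probabilities and effect sizes are invented for illustration.

```python
import random
from statistics import fmean


def complete_case_bias(mechanism, n=20000, seed=0):
    """Return (complete-case mean - full-data mean) under a missingness mechanism."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        baseline = rng.gauss(0, 1)
        outcome = 0.8 * baseline + rng.gauss(0, 0.6)
        if mechanism == "MCAR":
            p_missing = 0.3                             # unrelated to any data
        elif mechanism == "MAR":
            p_missing = 0.6 if baseline > 0 else 0.1    # depends on observed baseline
        elif mechanism == "MNAR":
            p_missing = 0.6 if outcome > 0 else 0.1     # depends on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        full.append(outcome)
        if rng.random() >= p_missing:
            observed.append(outcome)
    return fmean(observed) - fmean(full)


for name in ("MCAR", "MAR", "MNAR"):
    print(name, round(complete_case_bias(name), 3))
```

Under MCAR the difference is close to zero, while under MAR and MNAR the complete-case mean is clearly shifted. The difference between the last two is that the MAR shift can be corrected using the observed baseline, whereas the MNAR shift cannot be corrected from the observed data alone.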

Identifying Missing Data Mechanisms

Statistical methods help assess the type of missingness, although MAR and MNAR cannot be told apart from the observed data alone, so the final classification always rests partly on assumptions:

Little’s MCAR test: Tests for MCAR, available in R and SPSS
Descriptive analysis: Compare missing vs. non-missing groups across baseline variables
Graphical diagnostics: Heatmaps, pattern plots, and missing data matrices

These assessments should be included in trial data review plans and referenced in validation master plans or similar documentation.
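The descriptive comparison mentioned above can be as simple as a standardized mean difference between participants with and without a missing outcome. The sketch below uses invented baseline blood pressure values; the function name and the data are assumptions for illustration.

```python
from math import sqrt
from statistics import fmean, variance


def standardized_mean_difference(values, is_missing):
    """Difference in a baseline variable between missing and non-missing groups, in pooled SD units."""
    missing = [v for v, m in zip(values, is_missing) if m]
    present = [v for v, m in zip(values, is_missing) if not m]
    pooled_sd = sqrt((variance(missing) + variance(present)) / 2)
    return (fmean(missing) - fmean(present)) / pooled_sd


# Hypothetical baseline systolic blood pressure and follow-up missingness flags.
baseline_sbp = [150, 160, 170, 120, 130, 140]
followup_missing = [True, True, True, False, False, False]
smd = standardized_mean_difference(baseline_sbp, followup_missing)
print(smd)
```

A large difference suggests the data are not MCAR, because missingness is associated with an observed variable. It does not, however, indicate whether the mechanism is MAR or MNAR.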

Regulatory Expectations for Missing Data

Agencies such as CDSCO and EMA expect sponsors to:

Define missing data handling strategies in the protocol and SAP
Use appropriate imputation techniques based on missingness type
Conduct sensitivity analyses to assess robustness of results
Discuss limitations of missing data in Clinical Study Reports

The ICH E9(R1) guideline encourages clear definition of the estimand, particularly considering intercurrent events that cause missing data. This clarity is vital for trials involving patient-reported outcomes or long-term survival endpoints.

Best Practices in Handling Missing Data

Plan for missing data at the design stage, not post hoc
Collect auxiliary variables that may predict missingness
Avoid excessive imputation; apply methods suited to data type
Use software packages (e.g., R’s mice, SAS PROC MI, Stata’s mi) validated for imputation
Document all assumptions in the SAP and in line with applicable SOPs

Conclusion

Missing data is a complex but manageable challenge in clinical trials. By understanding the three types—MCAR, MAR, and MNAR—researchers can adopt informed statistical methods that minimize bias and maintain regulatory credibility. Clear planning, proper diagnostics, and transparency in documentation are essential for trustworthy trial results. With rigorous handling, missing data need not compromise the integrity or success of your study.