Nam Nguyen
02 Jun 2026
Public omics repositories like NCBI GEO hold millions of samples with rich clinical annotations, but leveraging this data for large-scale cancer genomics meta-analysis is still hard. Three persistent problems stand in the way:
er_status, ER Status, estrogen_receptor, and ER_IHC all refer to the same estrogen receptor status.
In this post we walk through a config-driven pipeline that solves these problems for cancer genomics cohorts, using breast cancer (BRCA-mini) as a worked example.
The pipeline operates in five stages, cleanly separating cohort-specific logic (regex patterns, schema YAML) from generic, reusable tooling (fetchers, harmonizer, review UI):
The full implementation is open source: omicslab-datasets on GitHub.
The entry point is a cohort-specific Python script that queries OmicIDX Parquet files via DuckDB to identify relevant GEO studies. For BRCA-mini, it filters for:
The regex sets are cohort-specific and live next to the script (cohorts/BRCA-mini/scripts/filtering_datasets.py). An abridged view:
# cohorts/BRCA-mini/scripts/filtering_datasets.py
BREAST_CANCER_RE = (
    r"breast\s?cancer|breast\s?carcinoma|breast\s?tumou?r"
    r"|mammary\s?cancer|mammary\s?carcinoma"
    r"|triple.negative\s?breast|invasive\s?breast"
    r"|ductal\s?carcinoma|lobular\s?carcinoma"
    r"|TNBC|BRCA.mutant|BRCA1|BRCA2"
)
TREATMENT_RE = (
    r"treatment|chemotherapy|adjuvant|neoadjuvant"
    r"|tamoxifen|trastuzumab|radiotherapy"
    r"|hormone\s?therapy|aromatase\s?inhibitor"
    r"|pembrolizumab|immunotherapy"
)
SURVIVAL_RE = (
    r"survival|outcome|prognosis|recurrence"
    r"|overall\s?survival|disease\s?free"
    r"|kaplan\s?meier|cox\s?regression|hazard\s?ratio"
    r"|death|mortality|relapse|metastasis"
)
The full sets (with around 30 drug names, 20 survival terms, and 40 cell-line patterns) are combined into a single DuckDB query that joins the geo_series and geo_samples Parquet files. The query returns 785 GSE accessions, which are written to gse_ids.txt.
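To make the filtering logic concrete, here is a minimal sketch in plain Python using `re`, not the DuckDB query the script actually runs. The pattern sets are shortened versions of the ones above. The `matches_cohort` helper, the case-insensitive matching, the rule that all three sets must match, and the sample records are assumptions for illustration only.

```python
import re

# Shortened versions of the pattern sets above, to keep the sketch compact.
BREAST_CANCER_RE = r"breast\s?cancer|breast\s?carcinoma|breast\s?tumou?r|TNBC"
TREATMENT_RE = r"treatment|chemotherapy|adjuvant|neoadjuvant|tamoxifen"
SURVIVAL_RE = r"survival|outcome|prognosis|recurrence|relapse"

# Compile once; case-insensitive matching is an assumption of this sketch.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (BREAST_CANCER_RE, TREATMENT_RE, SURVIVAL_RE)
]


def matches_cohort(text: str) -> bool:
    """Return True when the text matches every pattern set (assumed AND logic)."""
    return all(p.search(text) for p in PATTERNS)


# Hypothetical study records: (placeholder accession, title plus summary).
studies = [
    ("GSE_A", "Neoadjuvant chemotherapy response and survival in breast cancer"),
    ("GSE_B", "Breast cancer cell line drug screen"),
    ("GSE_C", "Prognosis after tamoxifen in lung adenocarcinoma"),
]

selected = [gse for gse, text in studies if matches_cohort(text)]
print(selected)
```

Only the first record mentions the cancer type, a treatment term, and a survival term, so it is the only one kept.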
Filtering on a single platform helps harmonize the datasets into a coherent cohort by reducing technical batch effects.
Two generic scripts consume the filtered GSE list and pull raw clinical data:
fetch_platforms.py — Resolves platform accession, technology, and manufacturer for each GSE from the OmicIDX Parquet store. Outputs gse_platforms.csv with platform frequency summaries across the cohort. Also supports optional platform-based filtering via --platform-ids (comma-separated GPL accessions) or --platform-pattern (regex matching platform title), writing a filtered gse_ids_filtered.txt.
fetch_clinical.py — Downloads raw clinical metadata for all GSM samples in each GSE, writing per-study CSV files (<GSE>_clinical.csv). Uses DuckDB's HTTPFS extension to query remote Parquet files directly without local downloads.
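The optional platform filter can be pictured with a short sketch. This is not the code in fetch_platforms.py: the `filter_by_platform` function, the row layout, the placeholder GSE names, and the choice to combine the two options with OR are all assumptions. Only the idea of filtering by a comma-separated list of GPL accessions or by a regex on the platform title comes from the description above.

```python
import re
from collections import Counter

# Hypothetical rows shaped like gse_platforms.csv: (gse, gpl, platform title).
rows = [
    ("GSE_A", "GPL570", "[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array"),
    ("GSE_B", "GPL96", "[HG-U133A] Affymetrix Human Genome U133A Array"),
    ("GSE_C", "GPL570", "[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array"),
    ("GSE_D", "GPL10558", "Illumina HumanHT-12 V4.0 expression beadchip"),
]


def filter_by_platform(rows, platform_ids=None, platform_pattern=None):
    """Keep GSEs whose GPL is listed or whose platform title matches the regex."""
    ids = {p.strip() for p in platform_ids.split(",")} if platform_ids else set()
    pattern = re.compile(platform_pattern, re.IGNORECASE) if platform_pattern else None
    if not ids and pattern is None:
        return [gse for gse, _, _ in rows]  # no filter requested: keep everything
    kept = []
    for gse, gpl, title in rows:
        if gpl in ids or (pattern is not None and pattern.search(title)):
            kept.append(gse)
    return kept


# Platform frequency summary across the (toy) cohort.
platform_counts = Counter(gpl for _, gpl, _ in rows)
print(platform_counts.most_common())

print(filter_by_platform(rows, platform_ids="GPL570"))
print(filter_by_platform(rows, platform_pattern=r"illumina"))
```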
The LLM auto-curator is not yet published on the repository.
Curation transforms raw, inconsistent column names and values into a standardized format. The pipeline supports two paths:
This section covers harmonizing the metadata. Combining the omics measurements themselves introduces batch effects and requires a separate step (ComBat, limma, or a deep-learning method). We will cover batch-effect correction in the next post in this series.
The pipeline ships two complementary tools for the harmonization step:
Interactive Review UI (curation_app.py) — a Streamlit dashboard with two pages:
Harmonizer (embedded in curation_app.py as build_after_df) — consumes the LLM-generated column_mappings.json and produces standardized output datasets. It:
A typical column_mappings.json for one curated dataset looks like this (excerpt from GSE103091.json):
{
  "file": "GSE103091_clinical.csv",
  "n_rows": 238,
  "n_cols": 16,
  "columns": {
    "adjuvant chemotherapy": {
      "curation": {
        "action": "rename",
        "maps_to": "chemotherapy",
        "type": "categorical",
        "value_mapping": { "1.0": "yes", "0.0": "no" }
      }
    },
    "age at diag": {
      "curation": {
        "action": "rename",
        "maps_to": "age_at_diagnosis",
        "type": "numeric"
      }
    },
    "os (days)": {
      "curation": {
        "action": "rename",
        "maps_to": "os_time_months",
        "type": "numeric",
        "post_process": "to_months"
      }
    },
    "er-ihc": {
      "curation": {
        "action": "rename",
        "maps_to": "er_status",
        "type": "categorical",
        "value_mapping": { "0": "negative", "1": "positive" }
      }
    }
  }
}
The action, maps_to, and value_mapping keys are the contract between the LLM auto-curator and the build_after_df function in curation_app.py — schemas for new cancer types only need to declare new target variables, not new mapping logic.
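As a rough illustration of that contract, the sketch below applies a subset of the mapping above to one raw row. It is not the actual build_after_df implementation. The `apply_mapping` helper, the handling of unmapped columns and unknown values, and the days-to-months factor of 30.4375 (365.25 / 12) are assumptions.

```python
DAYS_PER_MONTH = 30.4375  # assumed conversion factor (365.25 / 12)

# Subset of the column mapping shown above.
mapping = {
    "adjuvant chemotherapy": {"curation": {
        "action": "rename", "maps_to": "chemotherapy", "type": "categorical",
        "value_mapping": {"1.0": "yes", "0.0": "no"}}},
    "os (days)": {"curation": {
        "action": "rename", "maps_to": "os_time_months", "type": "numeric",
        "post_process": "to_months"}},
    "er-ihc": {"curation": {
        "action": "rename", "maps_to": "er_status", "type": "categorical",
        "value_mapping": {"0": "negative", "1": "positive"}}},
}


def apply_mapping(row: dict, mapping: dict) -> dict:
    """Apply rename, value_mapping, and post_process rules to one raw row."""
    out = {}
    for raw_name, raw_value in row.items():
        rule = mapping.get(raw_name, {}).get("curation")
        if rule is None or rule.get("action") != "rename":
            continue  # unmapped columns are dropped in this sketch
        value = raw_value
        if "value_mapping" in rule:
            # Values missing from the mapping become None rather than being guessed.
            value = rule["value_mapping"].get(str(raw_value))
        elif rule.get("type") == "numeric":
            value = float(raw_value)
            if rule.get("post_process") == "to_months":
                value = round(value / DAYS_PER_MONTH, 2)
        out[rule["maps_to"]] = value
    return out


raw_row = {"adjuvant chemotherapy": "1.0", "os (days)": "730.5", "er-ihc": "0", "notes": "n/a"}
harmonized = apply_mapping(raw_row, mapping)
print(harmonized)
# {'chemotherapy': 'yes', 'os_time_months': 24.0, 'er_status': 'negative'}
```

Because the rules live in the JSON, a new cohort would reuse the same function and only change the mapping file.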
Automated quality reports provide:
The full BRCA-mini pipeline (785 datasets, 238K+ samples) produces a harmonized cohort with per-variable coverage shown below:
Each cancer type defines a YAML schema with variable groups. The BRCA schema covers:
| Group | Variables |
|---|---|
| Receptor Status | ER, PR, HER2 status |
| Survival Endpoints | OS, DFS, DMFS, RFS (event plus time in months) |
| Tumor Characteristics | Histological grade, tumor size (mm), AJCC stage, T/N/M stage, lymph node status |
| Treatment | Chemotherapy, hormone therapy, radiotherapy, pathologic complete response |
| Demographics | Age at diagnosis, menopausal status, gender |
| Molecular | PAM50 subtype, Ki67 proliferation index |
| Other | Histological type, p53 status, tissue origin |
The YAML schema lives at cohorts/BRCA-mini/config/brca_schema.yaml and is the single source of truth for both the LLM curator and the harmonizer.
Underpinning the curation pipeline is OmicIDX, a cloud-native replacement for the legacy SRAdb and GEOmetadb. It comprises:
omicidx-parsers: type-safe XML and SOFT format parsers for NCBI SRA, GEO, BioSample, and PubMed, producing Pydantic v2 models.
omicidx-etl: extract-transform-load pipelines that convert raw NCBI data to partitioned Parquet files on S3-compatible storage (Cloudflare R2), accessible via DuckDB with HTTPFS.
omicidx-api: read-only FastAPI REST API deployed at api-omicidx.cancerdatasci.org with keyset cursor pagination and consistent response envelopes.
omicidx-dagster: Dagster assets scheduling daily ETL runs, syncing to both DuckDB (for querying) and PostgreSQL (for the API).
This architecture enables querying 80M+ SRA runs and 8M+ GEO samples in milliseconds using local DuckDB, without requiring a running database server, which is ideal for offline and reproducible research workflows.
git clone --recursive https://github.com/omicslab/omicslab-datasets.git
cd omicslab-datasets
pixi shell
# 1. Filter datasets for breast cancer (cohort-specific)
pixi run python cohorts/BRCA-mini/scripts/filtering_datasets.py \
    --parquet-dir cohorts/BRCA-mini/parquet \
    --output cohorts/BRCA-mini/datasets/gse_ids.txt
# 2. Fetch platform and clinical data
pixi run python apps/fetch_platforms.py \
    --parquet-dir cohorts/BRCA-mini/parquet \
    --gse-ids cohorts/BRCA-mini/datasets/gse_ids.txt \
    --output-dir cohorts/BRCA-mini/datasets/raw
pixi run python apps/fetch_clinical.py \
    -i cohorts/BRCA-mini/datasets/gse_ids.txt \
    -o cohorts/BRCA-mini/datasets/raw \
    --parquet-dir cohorts/BRCA-mini/parquet
# 3. LLM auto-curation (requires OPENAI_API_KEY)
pixi run python apps/agent_curate.py BRCA-mini
# 4. Launch interactive curation review UI
pixi run streamlit run apps/curation_app.py -- \
    --raw-dir cohorts/BRCA-mini/datasets/raw \
    --curated-dir cohorts/BRCA-mini/datasets/curated_json \
    --output-dir cohorts/BRCA-mini/output/json
To add support for a new cancer type, for example LUAD (lung adenocarcinoma):
cohorts/cancers.yaml with regex patterns for cancer type, treatment keywords, survival keywords, and exclusion patterns.
cohorts/LUAD/config/luad_schema.yaml defining target variables and value mappings.
cohorts/LUAD/scripts/filtering_datasets.py with the LUAD-specific regex filters.
This pipeline is a good fit if you are:
It is not the right tool if you only need a single dataset, or if your data is already in a clean tabular format with a known schema.
Curation maps raw, dataset-specific column names to standardized variable names defined in the cohort schema (for example, er-ihc to er_status). Harmonization then normalizes the values themselves (for example, 0/1 to negative/positive) and converts units (for example, days to months). The pipeline keeps these as two distinct stages so the LLM can focus on the harder mapping problem while the deterministic harmonizer handles unit conversions.
DuckDB reads Parquet files directly via HTTPFS without loading them into a database server, which means the entire 8M-sample GEO index is queryable from a laptop in seconds. The same query that takes minutes in PostgreSQL runs in under a second in DuckDB for this workload. PostgreSQL is still used for the OmicIDX API tier, where transactional pagination matters more than raw scan speed.
Yes. The curator reads its configuration from the schema YAML's llm_config block (provider, model, api_key_env, max_columns_per_batch). Swap the provider and model to point at OpenAI, Anthropic, or a local model — the contract is the column-mapping JSON, which is model-agnostic.
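A small sketch of how such a block might be consumed is shown below. The four key names come from the description above. The values, the Python dict standing in for the parsed YAML, the `batch_columns` helper, and the reading of max_columns_per_batch as "columns sent per LLM request" are assumptions.

```python
import os

# Hypothetical llm_config block, as it might look after loading the schema YAML.
llm_config = {
    "provider": "openai",
    "model": "example-model",  # placeholder, not a real model name
    "api_key_env": "OPENAI_API_KEY",
    "max_columns_per_batch": 3,
}


def batch_columns(columns, batch_size):
    """Split column names into consecutive batches of at most batch_size."""
    return [columns[i:i + batch_size] for i in range(0, len(columns), batch_size)]


# The key is looked up through the environment variable named in the config,
# so the secret itself never has to appear in the YAML file.
api_key = os.environ.get(llm_config["api_key_env"])

columns = ["adjuvant chemotherapy", "age at diag", "os (days)", "er-ihc"]
batches = batch_columns(columns, llm_config["max_columns_per_batch"])
print(batches)
```

Switching provider would then mean editing the config values only, while the batching code and the output JSON stay the same.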
On a single laptop the five stages take roughly: filtering 30 seconds, platform fetch 1 minute, clinical fetch 5 to 10 minutes (depending on batch size), LLM curation 10 to 30 minutes (depending on token limits and approval rate), and harmonization 1 minute. Most of the time is the LLM step, which can be parallelized across datasets.
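One way to parallelize the LLM step across datasets is a thread pool, sketched below. The `curate_dataset` function is a placeholder that stands in for a real curation call, and the accessions and worker count are made up. This is not how agent_curate.py is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def curate_dataset(gse_id: str) -> dict:
    """Placeholder for one LLM curation call; returns a stub result."""
    return {"gse": gse_id, "status": "curated"}


gse_ids = ["GSE_A", "GSE_B", "GSE_C"]  # hypothetical accessions

# Threads suit this step because each call would mostly wait on network I/O.
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input list.
    results = list(pool.map(curate_dataset, gse_ids))

print([r["gse"] for r in results])
```

In practice the worker count would be bounded by the provider's rate limits rather than by local CPU.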
This pipeline focuses on metadata harmonization, not measurement-level batch correction. Combining the omics matrices themselves (for example, expression counts) requires additional steps such as ComBat, limma, or a more recent deep-learning method.
The next post in this series will cover how to harmonize the omics data (expression matrices, methylation beta values, and so on) across the curated cohort — addressing batch-effect removal without losing biological signal, preserving group structure for downstream AI/ML modeling, and producing analysis-ready merged datasets.
Yes. The harmonized schema includes os_status and os_time_months, dfs_status and dfs_time_months, dmfs_status and dmfs_time_months, and rfs_status and rfs_time_months in the standard (event, time) format expected by lifelines, the R survival package, and scikit-survival.
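To show what the (event, time) format looks like in use, here is a minimal pure-Python Kaplan-Meier estimator applied to toy values. The data are invented, the coding of os_status as 1 = event and 0 = censored is an assumption, and a real analysis would use one of the libraries named above rather than this sketch.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Count events and all subjects leaving the risk set at time t.
        deaths = sum(1 for tt, e in pairs if tt == t and e == 1)
        leaving = sum(1 for tt, _ in pairs if tt == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
        i += leaving
    return curve


# Toy harmonized columns: os_time_months and os_status (assumed 1 = event).
os_time_months = [6.0, 12.0, 12.0, 24.0, 36.0]
os_status = [1, 1, 0, 1, 0]

curve = kaplan_meier(os_time_months, os_status)
for t, s in curve:
    print(f"{t:>5.1f} months: S(t) = {s:.3f}")
```

Censored samples (status 0) never lower the curve. They only shrink the risk set for later time points.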