Foundation Models for Chronic Disease

Transformer-based models trained on large-scale electronic health record data to predict disease trajectories in Type 2 Diabetes and COPD — towards a reusable AI foundation for chronic disease management.
Keywords

foundation model, EHR, electronic health records, Type 2 Diabetes, COPD, transformer, DARE, deep learning, SIDIAP, clinical AI, UPC, Ministerio de Ciencia

Funding: Ministerio de Ciencia e Innovación (AEI)
Programme: Proyectos de Generación de Conocimiento / Proyectos I+D+i
Status: Active
Data source: SIDIAP (Catalan primary care database)
Clinical focus: Type 2 Diabetes · COPD
B2SLab PI: Alexandre Perera Lluna

Context: the scale problem in chronic disease AI

Chronic diseases — diabetes, chronic obstructive pulmonary disease (COPD), heart failure — account for the majority of healthcare costs, hospitalisations, and premature deaths across Europe. The clinical information needed to predict, prevent, and manage these conditions exists: health systems generate massive longitudinal records documenting each patient’s diagnoses, prescriptions, laboratory values, care interactions, and comorbidities over years or decades.

The challenge is making that information actionable. Traditional risk scores (FINDRISC, GOLD staging, Charlson index) are hand-crafted from a handful of variables and capture only a fraction of the signal embedded in a full patient record. Machine learning models trained for specific tasks can improve on these scores, but they are brittle — each new prediction target requires a new model, new labels, and a new training run.

Foundation models offer a different approach: pre-train once on the full longitudinal record to learn rich, general representations of patient state, then adapt those representations efficiently to many downstream clinical questions. This is the approach that transformed language understanding (BERT, GPT) and is now being applied to clinical sequence data.


The SIDIAP database

The data source for this research is SIDIAP (Sistema d’Informació per al Desenvolupament de la Investigació en Atenció Primària), one of Europe’s largest real-world health databases, covering primary care records for the entire population of Catalonia (~7 million individuals) from 2006 onwards.

Data scale

  • Over 200,000 patients with Type 2 Diabetes, with longitudinal records spanning diagnosis through complications
  • Over 150,000 patients with COPD, including linked hospitalisation and mortality data
  • Diagnoses, prescriptions, laboratory values, specialist referrals, and healthcare utilisation — all time-stamped and individual-linked
  • Annual follow-up extending to 15+ years for older cohorts

This scale and longitudinal depth are what make foundation model training feasible: the model can learn from the temporal ordering, co-occurrence, and absence of clinical events in a way that would not be possible with the sample sizes available in most hospital cohorts.


Research outputs

DARE: transformer encoder for Type 2 Diabetes

DARE (Diabetes Adaptive Risk Encoder) is a transformer-based encoder pre-trained on the SIDIAP Type 2 Diabetes cohort. It represents each patient as a contextualised embedding of their full clinical sequence — diagnoses, medications, lab values, care interactions — and can be fine-tuned for downstream prediction tasks with minimal additional labelled data.
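
To make the idea of a "clinical sequence" concrete, the following is a minimal sketch of how a patient timeline might be turned into the integer token sequence that a transformer encoder consumes. It is not DARE's actual preprocessing: the `Event` class, the `kind:code` token format, the special tokens, and the example codes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical special tokens; a real tokenizer may use different ones.
PAD, UNK, CLS = "[PAD]", "[UNK]", "[CLS]"


@dataclass(frozen=True)
class Event:
    day: date   # when the event was recorded
    kind: str   # e.g. "dx" (diagnosis), "rx" (prescription), "lab"
    code: str   # e.g. an ICD-10 code, an ATC code, or a binned lab value


def build_vocab(patients):
    """Map every distinct 'kind:code' token to an integer id."""
    vocab = {PAD: 0, UNK: 1, CLS: 2}
    for events in patients:
        for ev in events:
            vocab.setdefault(f"{ev.kind}:{ev.code}", len(vocab))
    return vocab


def encode(events, vocab, max_len=8):
    """Return (token_ids, day_offsets, attention_mask), each padded to max_len."""
    keep = max_len - 1                               # one slot is used by [CLS]
    ordered = sorted(events, key=lambda ev: ev.day)  # chronological order
    ordered = ordered[-keep:] if keep > 0 else []    # keep the most recent events
    start = ordered[0].day if ordered else None
    ids, offsets = [vocab[CLS]], [0]
    for ev in ordered:
        ids.append(vocab.get(f"{ev.kind}:{ev.code}", vocab[UNK]))
        offsets.append((ev.day - start).days)        # days since first kept event
    mask = [1] * len(ids)                            # 1 = real token, 0 = padding
    pad = max_len - len(ids)
    return ids + [vocab[PAD]] * pad, offsets + [0] * pad, mask + [0] * pad


# Illustrative patient, not real data.
patient = [
    Event(date(2015, 3, 1), "dx", "E11"),          # Type 2 diabetes (ICD-10)
    Event(date(2015, 3, 1), "rx", "A10BA02"),      # metformin (ATC)
    Event(date(2016, 1, 10), "lab", "HbA1c:7-8"),  # binned lab value
]
vocab = build_vocab([patient])
token_ids, day_offsets, attention_mask = encode(patient, vocab)
```

The day offsets are kept alongside the token ids because the timing of events, not just their order, carries clinical signal; how a given model injects that timing (positional encodings, age embeddings, or otherwise) is a separate design choice.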

Evaluated on held-out patients from the SIDIAP cohort, DARE achieved state-of-the-art performance in predicting clinical outcomes (HbA1c trajectory, microvascular complications, hospitalisation). Its attention mechanism indicates which clinical events contribute most to each prediction, giving an interpretable view of what the model relies on.

Published: Expert Systems with Applications, 2025.

Deep survival analysis for COPD

This work applies joint modelling of competing risks to the SIDIAP COPD cohort, using a deep learning approach to predict hospitalisation and mortality simultaneously. Modelling the two outcomes together captures the clinical reality that they interact: a patient who has been hospitalised has a different subsequent risk profile from one who has not.

The model was trained and validated on 150,000+ COPD patients, achieving substantially better calibration and discrimination than standard prognostic scores while identifying patient subgroups with disproportionate risk that conventional stratification misses.
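
As an illustration of the competing-risks idea, the sketch below converts per-interval cause-specific hazards into cumulative incidence curves, the quantity a discrete-time competing-risks model typically reports. It is a generic textbook calculation, not the published model: the function name and the hazard values are hypothetical, and treating hospitalisation and death as competing first events is a simplifying assumption.

```python
def cumulative_incidence(hazards):
    """Turn per-interval cause-specific hazards into cumulative incidence.

    hazards[t][k] is the probability that event k occurs in interval t,
    given that no event has occurred before t. Returns (cif, survival):
    cif[t][k] is the probability that event k has occurred by the end of
    interval t, and survival[t] is the probability of no event by then.
    """
    n_causes = len(hazards[0])
    running = [0.0] * n_causes   # cumulative incidence so far, per cause
    event_free = 1.0             # probability of still being event-free
    cif, survival = [], []
    for h in hazards:
        if min(h) < 0 or sum(h) > 1:
            raise ValueError("hazards must be >= 0 and sum to <= 1 per interval")
        # An event of cause k in this interval requires being event-free so far.
        running = [c + event_free * hk for c, hk in zip(running, h)]
        event_free *= 1.0 - sum(h)
        cif.append(list(running))
        survival.append(event_free)
    return cif, survival


# Hypothetical yearly hazards for (hospitalisation, death); not real estimates.
hazards = [(0.10, 0.02), (0.12, 0.03), (0.15, 0.05)]
cif, survival = cumulative_incidence(hazards)
```

Because every patient is either event-free or has experienced exactly one first event, the cumulative incidences and the event-free probability sum to one at every interval. This is the coupling between outcomes that two independently fitted single-outcome models would ignore.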

Published: arXiv, 2025 (under review).

BERT for EHR: language-model pre-training on clinical sequences

This work adapts the BERT pre-training paradigm to clinical sequence data. Each patient’s medical history is treated as a “document” of clinical events, and masked-event pre-training (hiding a fraction of the events and training the model to recover them from context) is used to learn general representations of disease progression.
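
The masking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the function name, the token format, and the example history are hypothetical, and the 15% selection rate with an 80/10/10 replacement split is simply the original BERT recipe carried over to event tokens.

```python
import random

MASK = "[MASK]"


def mask_events(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style masking of a list of event tokens.

    Returns (inputs, labels). labels[i] holds the original token where
    position i was selected for prediction, and None elsewhere.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:      # position not selected
            inputs.append(tok)
            labels.append(None)
            continue
        labels.append(tok)                 # the model must recover this token
        roll = rng.random()
        if roll < 0.8:                     # 80%: replace with [MASK]
            inputs.append(MASK)
        elif roll < 0.9:                   # 10%: replace with a random token
            inputs.append(rng.choice(vocab))
        else:                              # 10%: leave unchanged
            inputs.append(tok)
    return inputs, labels


# Illustrative history, not real data; mask_prob is raised so the effect is visible.
history = ["dx:E11", "rx:A10BA02", "lab:HbA1c:7-8", "dx:I10", "rx:C09AA02"]
vocab = sorted(set(history))
inputs, labels = mask_events(history, vocab, mask_prob=0.5, rng=random.Random(0))
```

During pre-training, the loss would be computed only at positions where `labels` is not `None`, so the encoder learns to infer a hidden diagnosis, prescription, or lab result from the rest of the patient's history.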

Presented: EMBC (Annual International Conference of the IEEE Engineering in Medicine and Biology Society), 2024.


Current direction: diabetes foundation model

From task-specific models to a reusable foundation

The research programme is now consolidating these advances into a diabetes foundation model: a large pre-trained model that can represent any T2D patient’s clinical state and be adapted to a broad range of downstream applications without training a new model from scratch for each task.

Target capabilities include:

  • Trajectory prediction: HbA1c evolution, complication onset, hospitalisation risk over 1–5 year horizons
  • Treatment response: estimating likely outcomes under alternative medication strategies from observational data
  • Subgroup discovery: identifying clinically meaningful patient subtypes that correspond to different biological mechanisms or care needs
  • Transfer to new settings: adapting the pre-trained model to smaller hospital datasets that lack the scale for training from scratch
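
One common way to adapt a pre-trained model to a small dataset is a linear probe: the encoder is frozen, each patient is reduced to a fixed embedding, and only a small classifier is trained on top. The sketch below illustrates that pattern with a plain logistic regression. It is an assumption about how such a transfer could look, not the project's method; the function names, the two-dimensional embeddings, and the labels are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of the positive outcome for one embedding x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_linear_probe(embeddings, labels, lr=0.5, epochs=500):
    """Fit logistic regression on fixed embeddings with batch gradient descent."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = predict_proba(x, w, b) - y     # gradient of the log loss
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical embeddings from a frozen encoder, with a binary outcome per patient.
embeddings = [(-1.0, 0.2), (-0.8, -0.1), (-1.2, 0.4),
              (0.9, 0.1), (1.1, -0.3), (0.7, 0.3)]
labels = [0, 0, 0, 1, 1, 1]
w, b = fit_linear_probe(embeddings, labels)
probs = [predict_proba(x, w, b) for x in embeddings]
```

Only `dim + 1` parameters are learned here, which is why this kind of adaptation can work with far fewer labelled patients than training a sequence model from scratch. Full fine-tuning of the encoder is the heavier alternative when more local data is available.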

Team

This research line involves several current and former B2SLab members working at the intersection of deep learning and clinical informatics:

  • Enrico Manzini — led the DARE development and EHR pre-training work; completed PhD 2025
  • Joana Gelabert — metabolic disease trajectories and longitudinal modelling
  • Blanca Aleajos — foundation models for emergency medicine (industrial doctorate, Hospital Sant Joan de Déu)
  • Sergi Gosalvez — deep learning for clinical sequences