Foundation Models for Chronic Disease
foundation model, EHR, electronic health records, Type 2 Diabetes, COPD, transformer, DARE, deep learning, SIDIAP, clinical AI, UPC, Ministerio de Ciencia
| Field | Value |
|---|---|
| Funding | Ministerio de Ciencia e Innovación (AEI) |
| Programme | Proyectos de Generación de Conocimiento / Proyectos I+D+i |
| Status | Active |
| Data source | SIDIAP — Catalan primary care database |
| Clinical focus | Type 2 Diabetes · COPD |
| B2SLab PI | Alexandre Perera Lluna |
Context: the scale problem in chronic disease AI
Chronic diseases — diabetes, chronic obstructive pulmonary disease (COPD), heart failure — account for the majority of healthcare costs, hospitalisations, and premature deaths across Europe. The clinical information needed to predict, prevent, and manage these conditions exists: health systems generate massive longitudinal records documenting each patient’s diagnoses, prescriptions, laboratory values, care interactions, and comorbidities over years or decades.
The challenge is making that information actionable. Traditional risk scores (FINDRISC, GOLD staging, Charlson index) are hand-crafted from a handful of variables and capture only a fraction of the signal embedded in a full patient record. Machine learning models trained for specific tasks can improve on these scores, but they are brittle — each new prediction target requires a new model, new labels, and a new training run.
Foundation models offer a different architecture: train once on the full longitudinal record, learn rich general representations of patient state, then adapt those representations efficiently to any downstream clinical question. This is the approach that transformed language understanding (BERT, GPT) and is now being applied to clinical sequence data.
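The pretrain-once, adapt-cheaply pattern can be sketched in a few lines. Everything below is illustrative: the toy event codes, the random frozen embeddings, and the `encode` / `make_head` helpers are assumptions standing in for a real pre-trained transformer encoder, not any model described on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = {"E11": 0, "J44": 1, "A10BA02": 2, "HBA1C_HIGH": 3}  # toy event codes
EMBED = rng.normal(size=(len(VOCAB), 8))                      # frozen "pre-trained" embeddings

def encode(events):
    """Mean-pool frozen event embeddings into one patient representation.
    A real foundation model would replace this with a transformer forward pass."""
    idx = [VOCAB[e] for e in events]
    return EMBED[idx].mean(axis=0)

def make_head():
    """A new clinical question = a new lightweight head; the encoder is untouched."""
    return rng.normal(size=8)

patient = ["E11", "A10BA02", "HBA1C_HIGH"]
z = encode(patient)                           # shared representation, computed once
hosp_head, compl_head = make_head(), make_head()
p_hosp  = 1 / (1 + np.exp(-z @ hosp_head))    # task 1: hospitalisation risk
p_compl = 1 / (1 + np.exp(-z @ compl_head))   # task 2: complication risk
```

The point of the sketch is the asymmetry: the expensive representation is learned once, while each downstream task costs only a small head and a little labelled data.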
The SIDIAP database
The computational platform for this research is SIDIAP (Sistema d’Informació per al Desenvolupament de la Investigació en Atenció Primària) — one of Europe’s largest real-world health databases, covering primary care records for the entire population of Catalonia (~7 million individuals) from 2006 onwards.
This scale and longitudinal depth are what make foundation model training feasible: the model can learn from the temporal ordering, co-occurrence, and absence of clinical events in a way that would be impossible with the sample sizes available in most hospital cohorts.
Research outputs
DARE: transformer encoder for Type 2 Diabetes
DARE (Diabetes Adaptive Risk Encoder) is a transformer-based encoder pre-trained on the SIDIAP Type 2 Diabetes cohort. It represents each patient as a contextualised embedding of their full clinical sequence — diagnoses, medications, lab values, care interactions — and can be fine-tuned for downstream prediction tasks with minimal additional labelled data.
Validated on the SIDIAP cohort, DARE achieved state-of-the-art performance in predicting clinical outcomes (HbA1c trajectory, microvascular complications, hospitalisation) in held-out patient populations. Its attention mechanism provides interpretable insight into which clinical events the model weights most heavily for each prediction.
Published: Expert Systems with Applications, 2025.
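The attention-based interpretability mentioned above can be illustrated with a minimal self-attention computation. This is a generic single-head sketch with random weights and invented event names, not DARE's actual architecture or parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
events = ["E11", "metformin", "HbA1c=9.1", "retinopathy_screen"]  # toy sequence
X = rng.normal(size=(len(events), d))        # stand-in for learned event embeddings
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Q, K = X @ Wq, X @ Wk
scores = Q @ K.T / np.sqrt(d)                # scaled dot-product attention
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # each row is a distribution over events

# Attention received by each event, averaged over query positions,
# gives a crude per-event importance ranking for inspection:
importance = attn.mean(axis=0)
ranked = sorted(zip(events, importance), key=lambda t: -t[1])
```

Reading off which clinical events attract the most attention mass is one common way to turn a transformer's internals into clinician-facing explanations, though attention weights are a heuristic rather than a formal attribution.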
Deep survival analysis for COPD
Applying joint modelling of competing risks to the SIDIAP COPD cohort, this work developed a deep learning approach to simultaneously predict hospitalisation and mortality in COPD patients — capturing the clinical reality that these outcomes interact (a patient who is hospitalised has a different subsequent risk profile than one who is not).
The model was trained and validated on 150,000+ COPD patients, achieving substantially better calibration and discrimination than standard prognostic scores while identifying patient subgroups with disproportionate risk that conventional stratification misses.
Preprint: arXiv, 2025 (under review).
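The competing-risks logic can be sketched in discrete time: at each interval the model assigns probabilities to {no event, hospitalisation, death}, and the cumulative incidence of each cause accounts for the other, which a pair of independent binary classifiers would not. The hazard numbers below are invented for illustration, not taken from the paper:

```python
import numpy as np

hazards = np.array([          # per-interval [P(hosp), P(death)] for one patient
    [0.05, 0.01],
    [0.08, 0.02],
    [0.10, 0.03],
])

surv = 1.0                    # probability of being event-free so far
cif = np.zeros(2)             # cumulative incidence per cause
for h in hazards:
    cif += surv * h           # first event of cause k occurs in this interval
    surv *= 1.0 - h.sum()     # patient remains event-free

# cif[0] + cif[1] + surv == 1: the two causes and "still event-free"
# partition probability mass, so neither risk can be overstated.
```

This accounting is why competing-risks models calibrate better than separate per-outcome models: a patient who dies is no longer at risk of hospitalisation, and the probabilities must share one budget.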
BERT for EHR: language-model pre-training on clinical sequences
Adapting the BERT pre-training paradigm to clinical sequence data — treating each patient’s medical history as a “document” of clinical events — this work explored the use of masked-event pre-training to learn general representations of disease progression.
Presented: EMBC (IEEE Engineering in Medicine and Biology Conference), 2024.
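The masked-event objective borrowed from BERT is simple to sketch. The corruption step below is a generic illustration (toy codes, a hypothetical `mask_events` helper, the standard 15% masking rate), not the paper's exact recipe:

```python
import random

random.seed(42)
MASK = "[MASK]"

def mask_events(history, mask_prob=0.15):
    """BERT-style masking on a patient 'document' of clinical event codes.
    Returns the corrupted input and labels for the masked positions only."""
    inputs, labels = [], []
    for code in history:
        if random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(code)   # the model must recover the original code
        else:
            inputs.append(code)
            labels.append(None)   # position not scored in the loss
    return inputs, labels

history = ["E11", "A10BA02", "HBA1C_HIGH", "I10", "C09AA05", "J44"]
corrupted, targets = mask_events(history)
```

Training a model to fill in the masked codes from surrounding context forces it to internalise which events plausibly co-occur and in what order, which is exactly the "general representation of disease progression" the pre-training aims for.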
Current direction: diabetes foundation model
Team
This research line involves several current and former B2SLab members working at the intersection of deep learning and clinical informatics:
- Enrico Manzini — led the DARE development and EHR pre-training work; completed PhD 2025
- Joana Gelabert — metabolic disease trajectories and longitudinal modelling
- Blanca Aleajos — foundation models for emergency medicine (industrial doctorate, Hospital Sant Joan de Déu)
- Sergi Gosalvez — deep learning for clinical sequences