
CMS Readmission Predictor


Built a 30-day hospital readmission prediction pipeline on CMS DE-SynPUF synthetic Medicare claims data (Sample 1, approximately 67,000 admissions). The ETL layer uses DuckDB for out-of-core SQL joins across beneficiary summary and inpatient claims files, deriving the readmission label by windowing each discharge against subsequent admissions within 30 days.
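The labeling step can be sketched with a `LEAD` window function. This is a minimal illustration, not the project's actual query: it uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies, and the table name `inpatient` and columns `bene_id`, `admit_date`, and `discharge_date` are hypothetical rather than the DE-SynPUF field names.

```python
import sqlite3

# Hypothetical miniature of an inpatient claims table (illustrative rows only).
rows = [
    ("A", "2009-01-01", "2009-01-05"),
    ("A", "2009-01-20", "2009-01-25"),  # admitted 15 days after the prior discharge
    ("A", "2009-06-01", "2009-06-03"),  # more than 30 days after the prior discharge
    ("B", "2009-03-10", "2009-03-12"),  # single stay, so no readmission
]

LABEL_SQL = """
WITH ordered AS (
    SELECT
        bene_id,
        admit_date,
        discharge_date,
        -- Next admission for the same beneficiary, if any.
        LEAD(admit_date) OVER (
            PARTITION BY bene_id ORDER BY admit_date
        ) AS next_admit_date
    FROM inpatient
)
SELECT
    bene_id,
    admit_date,
    CASE
        WHEN next_admit_date IS NOT NULL
         AND julianday(next_admit_date) - julianday(discharge_date)
             BETWEEN 0 AND 30
        THEN 1 ELSE 0
    END AS readmit_30d
FROM ordered
ORDER BY bene_id, admit_date
"""


def label_readmissions(claims):
    """Flag each stay that is followed by another admission within 30 days."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE inpatient (bene_id TEXT, admit_date TEXT, discharge_date TEXT)"
    )
    con.executemany("INSERT INTO inpatient VALUES (?, ?, ?)", claims)
    out = con.execute(LABEL_SQL).fetchall()
    con.close()
    return out


labels = label_readmissions(rows)
```

Only the first stay for beneficiary A is flagged here. In DuckDB the same shape would likely apply, with a native date difference in place of `julianday`.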

The model is a LightGBM classifier with balanced class weights to offset the roughly 10% positive rate. Monotonic constraints enforce the clinically expected direction of effect for prior admissions, age, and length of stay, so the model cannot learn spurious inverse relationships from the synthetic data. MLflow tracks every experiment, logging hyperparameters, metrics, and serialized model artifacts.
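As a rough sketch of how such a configuration might be assembled, the snippet below builds a per-feature constraint vector and shows what "balanced" weights work out to at a 10% positive rate. The feature names and their order are hypothetical (the project's 17 columns are not listed here), and the assumption that all three constrained features are non-decreasing in risk is mine.

```python
from collections import Counter

# Hypothetical feature order; the real model uses 17 features not shown here.
FEATURES = ["prior_admissions", "age", "length_of_stay", "ckd", "copd", "chf"]

# Assumed direction: predicted risk should not decrease as these increase.
MONOTONE_INCREASING = {"prior_admissions", "age", "length_of_stay"}


def monotone_vector(features, increasing):
    """One entry per feature: 1 = non-decreasing, 0 = unconstrained."""
    return [1 if name in increasing else 0 for name in features]


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count), the usual 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


constraints = monotone_vector(FEATURES, MONOTONE_INCREASING)

# Toy label vector with a 10% positive rate.
weights = balanced_class_weights([1] * 10 + [0] * 90)

# Keyword arguments in the form LightGBM's scikit-learn wrapper accepts.
params = {
    "objective": "binary",
    "class_weight": "balanced",
    "monotone_constraints": constraints,
}
```

With these toy labels, each positive example weighs 5.0 and each negative about 0.56, so the two classes contribute equally in total. A dictionary like `params` could be unpacked into `lightgbm.LGBMClassifier(**params)`, provided the constraint vector matches the column order of the training matrix.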

Interpretability is built on SHAP TreeExplainer. The top predictive features are chronic kidney disease, COPD, and congestive heart failure, aligning with CMS Hospital Readmissions Reduction Program target conditions. An interactive Streamlit demo lets users input patient features and see both the predicted readmission probability and a per-patient SHAP waterfall explanation of the driving risk factors.
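To make the waterfall idea concrete, here is a brute-force Shapley computation on a toy model. It is not TreeExplainer, which uses a much faster tree-specific algorithm. The function `toy_log_odds`, its coefficients, and the all-zero baseline patient are invented for illustration and are not fitted to the project data.

```python
from itertools import combinations
from math import factorial


def toy_log_odds(ckd, copd, chf):
    # Made-up coefficients for illustration only; not fitted to any data.
    return -2.2 + 0.6 * ckd + 0.5 * copd + 0.4 * chf + 0.3 * ckd * chf


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the patient's value, the rest the baseline's.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(*mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


patient = (1, 0, 1)   # CKD and CHF present, COPD absent
baseline = (0, 0, 0)  # reference patient with none of the three conditions
contributions = shapley_values(toy_log_odds, patient, baseline)
base_value = toy_log_odds(*baseline)
```

The base value plus the per-feature contributions adds up to the patient's prediction, which is the property a waterfall plot visualizes. The interaction term is split evenly between CKD and CHF, and COPD contributes nothing because it matches the baseline.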

The final model reaches an AUC-ROC of 0.6648 on a stratified held-out test set, a realistic result for a claims-only feature set on synthetic data that lacks lab values, clinical notes, and social determinants of health.
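One way to read this number: AUC-ROC equals the probability that a randomly chosen readmitted stay receives a higher score than a randomly chosen non-readmitted stay, so 0.6648 means the model ranks such a pair correctly about two times in three. The sketch below computes that pairwise statistic directly. The function name `auc_roc` and the four-row example are illustrative only, and a real evaluation would more likely use a library routine such as scikit-learn's `roc_auc_score`.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    # Compare every positive score against every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

This pairwise form is quadratic in the number of rows, so it suits small checks rather than a full test set.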

Analysis

Feature importance ranked by mean absolute SHAP value. Chronic kidney disease, COPD, and CHF are the strongest predictors.
SHAP beeswarm plot showing how each feature pushes predictions toward or away from readmission.
Class distribution of 30-day readmissions. Approximately 10% positive rate.
Chronic condition prevalence across the study population.
Python, LightGBM, DuckDB, SHAP, MLflow, Streamlit, Pandas, scikit-learn

Key Metrics

  • AUC-ROC: 0.6648 on stratified held-out test set
  • 67K admissions processed via DuckDB out-of-core ETL
  • 17 features with 3 monotonic constraints enforcing clinical priors
  • Full per-patient SHAP interpretability