---
license: mit
language:
- en
metrics:
- mse
- r_squared
pipeline_tag: tabular-regression
tags:
- hospital
- LOS
---

# Hospital Length of Stay Predictor - XGBoost Pipeline

## Model Description

This XGBoost regression pipeline predicts hospital **Length of Stay (LOS)** in days for inpatient admissions across New York State hospitals. The model was trained on 2.3+ million de-identified hospital discharge records from the SPARCS (Statewide Planning and Research Cooperative System) 2017 dataset.

**Intended Use**: Support discharge planning, resource allocation, and patient expectation management by providing evidence-based LOS predictions with 95% confidence intervals.

### Model Details

- **Developed by**: Ajiboye Toluwalase
- **Model type**: XGBoost Regressor (Gradient Boosted Decision Trees)
- **Language**: English (US Healthcare)
- **License**: MIT
- **Model version**: 1.0.0
- **Framework**: XGBoost + Scikit-learn preprocessing pipeline
- **Model size**: ~15 MB (compressed)
- **Input features**: 13 (categorical and numerical)
- **Output**: Predicted LOS in days (continuous), with an approximate 95% interval

---

## Intended Use

### Primary Use Cases

✅ **Clinical Decision Support**
- Hospital discharge planning
- Bed capacity forecasting
- Post-acute care coordination
- Patient/family expectation setting

✅ **Healthcare Operations**
- Resource allocation and staffing
- Length of stay benchmarking
- Quality improvement initiatives
- Cost prediction modeling

✅ **Research & Analytics**
- Health services research
- Social determinants of health analysis
- Healthcare disparities investigation
- Policy impact evaluation

### Out-of-Scope Use Cases

❌ **NOT for**:
- Real-time clinical diagnosis
- Individual patient medical decision-making without clinician review
- Determining insurance coverage or payment
- Predictive policing or surveillance
- Any use that could harm patients or violate HIPAA

---

## Model Architecture

### Pipeline Components

```
Input (13 features)
    ↓
┌─────────────────────────────────────────┐
│  HospitalDataCleaner                    │
│  - MDC description → code mapping       │
│  - Target encoding (LOS_per_MDC)        │
│  - Target encoding (LOS_per_severity)   │
│  - One-hot encoding (categorical vars)  │
│  - Feature alignment (312 columns)      │
└─────────────────┬───────────────────────┘
                  ↓
          Encoded Features (312)
                  ↓
┌─────────────────────────────────────────┐
│  XGBoost Regressor                      │
│  - n_estimators: 100                    │
│  - max_depth: 6                         │
│  - learning_rate: 0.1                   │
│  - objective: reg:squarederror          │
└─────────────────┬───────────────────────┘
                  ↓
        Predicted LOS (days)
```

### Feature Engineering

**Target Encoding**:
- `LOS_per_MDC`: Median LOS grouped by Major Diagnostic Category
- `LOS_per_severity`: Median LOS grouped by severity level
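
The two target encodings above can be sketched with pandas as follows. This is a minimal illustration, not the code inside `HospitalDataCleaner`: the toy data, the `add_target_encodings` helper, and the fallback to the global median for unseen groups are all assumptions made for the example.

```python
import pandas as pd

# Toy training frame; column names follow the SPARCS schema used in this card.
train = pd.DataFrame({
    "APR MDC Code": [5, 5, 5, 4, 4],
    "APR Severity of Illness Code": [1, 2, 3, 2, 2],
    "Length of Stay": [2, 6, 10, 3, 5],
})

# Learn the encodings on training data only, so the target does not leak
# into validation or test rows.
mdc_median = train.groupby("APR MDC Code")["Length of Stay"].median()
severity_median = train.groupby("APR Severity of Illness Code")["Length of Stay"].median()
global_median = train["Length of Stay"].median()


def add_target_encodings(df: pd.DataFrame) -> pd.DataFrame:
    """Attach LOS_per_MDC and LOS_per_severity (hypothetical helper).

    Groups not seen in training fall back to the global median (an assumption).
    """
    out = df.copy()
    out["LOS_per_MDC"] = out["APR MDC Code"].map(mdc_median).fillna(global_median)
    out["LOS_per_severity"] = (
        out["APR Severity of Illness Code"].map(severity_median).fillna(global_median)
    )
    return out


new_rows = pd.DataFrame({
    "APR MDC Code": [5, 99],                 # 99 is deliberately unseen
    "APR Severity of Illness Code": [1, 4],  # 4 is deliberately unseen
})
encoded = add_target_encodings(new_rows)
print(encoded)
```

Fitting the medians on the training split alone keeps the encoded columns from carrying information about held-out outcomes.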

**One-Hot Encoding** applied to:
- Hospital County (62 counties)
- Facility Name (200+ hospitals)
- Age Group (5 categories)
- Gender (2 categories)
- Race (4+ categories)
- Ethnicity (4 categories)
- Type of Admission (6 types)
- Patient Disposition (20+ categories)
- APR MDC Description (26 diagnosis groups)
- APR Medical/Surgical (2 categories)
- Payment Type (10+ insurance types)
- Emergency Department Indicator (2 categories)

**Total Features After Encoding**: 312
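
One-hot encoding followed by feature alignment might look like the sketch below. It assumes the column list seen at training time was saved (the real pipeline expects 312 names; the four names here are made up for brevity) and that categories unseen in training are simply dropped, which may differ from what `HospitalDataCleaner` actually does.

```python
import pandas as pd

# Hypothetical column list captured at training time (shortened for the example).
training_columns = [
    "Gender_F",
    "Gender_M",
    "Type of Admission_Elective",
    "Type of Admission_Emergency",
]

# One incoming record; "Trauma" is not among the training columns above.
incoming = pd.DataFrame({"Gender": ["M"], "Type of Admission": ["Trauma"]})

# One-hot encode only the categories present in this batch.
one_hot = pd.get_dummies(incoming, columns=["Gender", "Type of Admission"], dtype=int)

# Align to the training layout: missing columns are filled with 0 and
# columns unknown to the model are discarded.
aligned = one_hot.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Alignment guarantees the model always receives the same columns in the same order, regardless of which categories appear in a given batch.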

---

## Training Data

### Dataset Information

**Source**: [Hospital Inpatient Discharges (SPARCS De-Identified) 2017](https://health.data.ny.gov/dataset/Hospital-Inpatient-Discharges-SPARCS-De-Identified/22g3-z7e7/about_data)

- **Provider**: New York State Department of Health
- **Records**: 2,346,894 inpatient discharges
- **Year**: 2017
- **Geography**: New York State (62 counties, 200+ hospitals)
- **Privacy**: De-identified (HIPAA compliant)

### Data Preprocessing

**Cleaning Steps**:
1. Removed records with unknown gender (`U`)
2. Converted LOS `120+` to numeric value `120`
3. Dropped 20 irrelevant columns (facility IDs, billing codes, etc.)
4. Handled missing values in categorical features
5. Applied target encoding (median LOS) for Major Diagnostic Category and severity level
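
Steps 1, 2, and 4 above can be sketched in pandas as shown here. The toy data is invented, and filling missing categoricals with an explicit `"Unknown"` label is an assumption; the card does not state which strategy was used.

```python
import pandas as pd

raw = pd.DataFrame({
    "Gender": ["M", "F", "U", "F"],
    "Length of Stay": ["3", "120+", "5", "12"],
    "Type of Admission": ["Emergency", None, "Elective", "Urgent"],
})

# Step 1: remove records with unknown gender.
cleaned = raw[raw["Gender"] != "U"].copy()

# Step 2: convert the capped value "120+" to the number 120.
cleaned["Length of Stay"] = (
    cleaned["Length of Stay"]
    .astype(str)
    .str.replace("+", "", regex=False)
    .str.strip()
    .astype(int)
)

# Step 4: handle missing categorical values (assumed strategy: explicit label).
cleaned["Type of Admission"] = cleaned["Type of Admission"].fillna("Unknown")

print(cleaned)
```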

**Data Split**:
- Training: 70% (~1.64M records)
- Validation: 15% (~352K records)
- Test: 15% (~352K records)
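
A 70/15/15 split like the one above can be produced with two calls to scikit-learn's `train_test_split`. The synthetic arrays and the `random_state` value are placeholders; the seed used for the published model is not documented here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the encoded features and the LOS target.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

# First hold out 30% of the rows, then split that holdout in half
# to get 15% validation and 15% test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```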

### Target Variable Distribution

```
Length of Stay Statistics (days):
- Mean: 5.2
- Median: 3.0
- Std Dev: 6.8
- Min: 1
- Max: 120
- 25th percentile: 2
- 75th percentile: 6
```

---

## Evaluation

### Metrics

| Metric | Training | Validation | Test |
|--------|----------|------------|------|
| **RMSE** | X.XX days | X.XX days | X.XX days |
| **MAE** | X.XX days | X.XX days | X.XX days |
| **R²** | 0.XX | 0.XX | 0.XX |
| **MAPE** | X.X% | X.X% | X.X% |

> **Note**: Evaluation results have not been filled in yet; the `X.XX` and `XX%` values in this section are placeholders.
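
The four metrics in the table can be computed with NumPy alone, as in this sketch. The `regression_report` helper and the sample numbers are illustrative and are not results from this model.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE for two equal-length sequences (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = float(1.0 - ss_res / ss_tot)
    # LOS is at least 1 day in this dataset, so dividing by y_true is safe.
    mape = float(np.mean(np.abs(err) / y_true) * 100.0)

    return {"RMSE": rmse, "MAE": mae, "R2": r2, "MAPE": mape}


# Made-up actual and predicted LOS values, in days.
report = regression_report([2, 4, 6, 8], [3, 4, 5, 10])
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Because the LOS distribution is right-skewed, RMSE will be pulled up by long stays more than MAE; reporting both makes that visible.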

### Performance by Subgroup

**By Severity Level**:

| Severity | MAE | Sample Size |
|----------|-----|-------------|
| 1 (Minor) | X.X days | ~800K |
| 2 (Moderate) | X.X days | ~900K |
| 3 (Major) | X.X days | ~500K |
| 4 (Extreme) | X.X days | ~150K |

**By Diagnosis Group (Top 5)**:

| MDC Description | MAE | Sample Size |
|-----------------|-----|-------------|
| Circulatory System | X.X | ~300K |
| Respiratory System | X.X | ~250K |
| Digestive System | X.X | ~220K |
| Nervous System | X.X | ~180K |
| Pregnancy/Childbirth | X.X | ~200K |
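
Subgroup tables like the two above can be generated with a single `groupby`. In this sketch the `results` frame, its `actual_los` and `predicted_los` column names, and all values are made up for illustration.

```python
import pandas as pd

# Made-up held-out results: one row per discharge.
results = pd.DataFrame({
    "APR Severity of Illness Code": [1, 1, 2, 2, 4],
    "actual_los": [2, 3, 4, 6, 20],
    "predicted_los": [2.5, 2.0, 5.0, 4.0, 14.0],
})

results["abs_error"] = (results["predicted_los"] - results["actual_los"]).abs()

# MAE and sample size per severity level.
by_severity = results.groupby("APR Severity of Illness Code")["abs_error"].agg(
    MAE="mean", sample_size="count"
)
print(by_severity)
```

Swapping the grouping column for `APR MDC Description` gives the per-diagnosis table.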

### Clinical Validation

**Concordance with Expert Judgment**:
- Predictions within ±1 day for XX% of routine admissions
- Identifies high-risk extended stays (>10 days) with XX% sensitivity
- False positive rate for long stays: XX%
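
The three figures above could be computed as sketched below. The sketch assumes they are measured against observed LOS on held-out data and, for simplicity, evaluates the ±1 day share over all rows rather than only routine admissions; the arrays are invented.

```python
import numpy as np

# Made-up observed and predicted LOS values, in days.
actual = np.array([2, 3, 12, 15, 4, 11, 5, 1])
predicted = np.array([2.6, 4.5, 11.0, 8.0, 10.5, 13.0, 5.2, 1.9])

# Share of predictions within one day of the observed stay.
within_one_day = np.mean(np.abs(predicted - actual) <= 1.0)

# Treat "extended stay" as more than 10 days, as in the text above.
LONG_STAY_DAYS = 10
actual_long = actual > LONG_STAY_DAYS
predicted_long = predicted > LONG_STAY_DAYS

# Sensitivity: flagged long stays among all true long stays.
sensitivity = np.sum(actual_long & predicted_long) / np.sum(actual_long)

# False positive rate: flagged long stays among all true short stays.
false_positive_rate = np.sum(~actual_long & predicted_long) / np.sum(~actual_long)

print(f"Within ±1 day: {within_one_day:.0%}")
print(f"Sensitivity (>10 days): {sensitivity:.0%}")
print(f"False positive rate: {false_positive_rate:.0%}")
```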

---

## How to Use

### Installation

```bash
pip install xgboost scikit-learn pandas numpy joblib
```

### Loading the Model

```python
import joblib
import pandas as pd

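# Load the full pipeline.
# Note: unpickling requires the custom HospitalDataCleaner class to be
# importable in the current Python session.
pipeline = joblib.load('xgb_hospital_full_pipeline.pkl')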

# Or load model + preprocessor separately
model = joblib.load('xgb_modelv1.pkl')
preprocessor = joblib.load('hospital_data_cleanerv1.pkl')
```

### Making Predictions

#### Option 1: Using the Full Pipeline

```python
import pandas as pd

# Prepare input data: one row per admission, using SPARCS column names
patient_data = pd.DataFrame([{
    'Hospital County': 'Manhattan',
    'Facility Name': 'Mount Sinai Hospital',
    'Age Group': '50 to 69',
    'Gender': 'M',
    'Race': 'White',
    'Ethnicity': 'Not Span/Hispanic',
    'Type of Admission': 'Emergency',
    'Patient Disposition': 'Home or Self Care',
    'APR MDC Code': 5,  # Circulatory system
    'APR MDC Description': 'Diseases and Disorders of the Circulatory System',
    'APR Severity of Illness Code': 3,
    'APR Medical Surgical Description': 'Medical',
    'Payment Typology 1': 'Medicare',
    'Emergency Department Indicator': 'Y'
}])

# Predict
predicted_los = pipeline.predict(patient_data)
print(f"Predicted LOS: {predicted_los[0]:.2f} days")
# Example output (illustrative; the value depends on the trained model):
# Predicted LOS: 4.47 days
```

#### Option 2: Step-by-Step

```python
# 1. Preprocess
X_processed = preprocessor.transform(patient_data)

# 2. Predict
predicted_los = model.predict(X_processed)

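# 3. Approximate 95% interval (heuristic: assumes a standard error equal to
#    15% of the prediction; this is not a calibrated confidence interval)
std_error = predicted_los[0] * 0.15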
confidence_low = max(1.0, predicted_los[0] - 1.96 * std_error)
confidence_high = predicted_los[0] + 1.96 * std_error

print(f"Prediction: {predicted_los[0]:.1f} days")
print(f"95% CI: [{confidence_low:.1f}, {confidence_high:.1f}] days")
```

### Batch Predictions

```python
# Load multiple patients
patients_df = pd.read_csv('patient_admissions.csv')

# Predict for all
predictions = pipeline.predict(patients_df)

# Add to dataframe
patients_df['predicted_los'] = predictions
patients_df.to_csv('predictions_output.csv', index=False)
```

### Feature Importance

```python
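import joblib
import matplotlib.pyplot as plt

# Load the model and the encoded feature names shipped with this repository
# (feature_names.pkl is assumed to be a joblib-saved list of column names)
model = joblib.load('xgb_modelv1.pkl')
feature_names = joblib.load('feature_names.pkl')

# Get importance scores
importance = model.feature_importances_

# Sort and plot top 20
top_n = min(20, len(importance))
indices = importance.argsort()[-top_n:][::-1]
plt.figure(figsize=(10, 6))
plt.barh(range(top_n), importance[indices])
plt.yticks(range(top_n), [feature_names[i] for i in indices])
plt.gca().invert_yaxis()  # show the most important feature at the top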
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Important Features for LOS Prediction')
plt.tight_layout()
plt.show()
```

---

## Limitations and Biases

### Known Limitations

⚠️ **Data Limitations**:
- **Single year snapshot** (2017) - may not reflect current practice patterns
- **Geography-specific**: Trained only on New York State hospitals
- **Missing features**: No data on comorbidities, lab values, or vital signs
- **Administrative data**: Based on billing records, not clinical EMR
- **Censoring**: LOS capped at 120 days (affects ~0.5% of cases)

⚠️ **Model Limitations**:
- **Point estimates**: Predictions are averages; individual variance is high
- **New categories**: Performance degrades for rare diagnosis/hospital combinations
- **Temporal drift**: Healthcare practices change; model requires periodic retraining
- **External validity**: Not validated outside New York State
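
Because unseen categories are a known weak point, it can help to check incoming data against the categories seen in training before predicting. The sketch below is one way to do that; `find_unseen_categories` and the `known_categories` dictionary are hypothetical and would need to be built from the actual training data.

```python
import pandas as pd


def find_unseen_categories(df: pd.DataFrame, known: dict) -> dict:
    """Return {column: [values not seen in training]} (hypothetical helper)."""
    unseen = {}
    for column, known_values in known.items():
        if column not in df.columns:
            continue
        extra = set(df[column].dropna().unique()) - set(known_values)
        if extra:
            unseen[column] = sorted(extra)
    return unseen


# Illustrative category lists; derive the real ones from the training data.
known_categories = {
    "Age Group": ["0 to 17", "18 to 29", "30 to 49", "50 to 69", "70 or Older"],
    "Type of Admission": [
        "Elective", "Emergency", "Newborn", "Not Available", "Trauma", "Urgent",
    ],
}

incoming = pd.DataFrame({
    "Age Group": ["50 to 69", "70+"],  # "70+" does not match the training label
    "Type of Admission": ["Emergency", "Urgent"],
})

unseen = find_unseen_categories(incoming, known_categories)
print(unseen)
```

Rows flagged this way can be routed for manual review instead of being scored silently.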

### Potential Biases

🔴 **Demographic Biases**:
- **Race/ethnicity**: Model may perpetuate historical disparities in healthcare access
  - Example: Underserved communities may have systematically different LOS due to social determinants
- **Insurance type**: Self-pay patients may have different discharge patterns
- **Age**: Older adults (70+) may have higher prediction variance

🔴 **Geographic Biases**:
- **Rural vs. urban**: Smaller rural hospitals may be underrepresented
- **Hospital resources**: Predictions reflect hospital capacity, not just patient needs
- **County-level effects**: High-crime or low-income areas may show systemic differences

🔴 **Clinical Biases**:
- **Diagnosis coding**: APR-DRG groupings may oversimplify complex conditions
- **Severity scoring**: APR severity is administrative, not clinical ground truth
- **Disposition planning**: Social factors (housing, family support) affect LOS but aren't captured

### Bias Mitigation Strategies

✅ **Implemented**:
- De-identified data reduces individual privacy risks
- Included race/ethnicity as features (with caution) to allow disparity analysis
- Confidence intervals communicate prediction uncertainty

⚠️ **Recommended for Production**:
- **Regular audits** for fairness across demographic groups
- **Clinician oversight** - never use predictions in isolation
- **Transparent communication** with patients about prediction limitations
- **Retraining cadence** (annually or when performance degrades)

---

## Ethical Considerations

### Responsible Use Guidelines

1. **Clinical Context Required**
   - Predictions are decision support tools, NOT diagnoses
   - Always review with qualified healthcare professionals
   - Consider patient-specific factors not in the model

2. **Transparency with Patients**
   - Explain predictions are estimates, not guarantees
   - Discuss confidence intervals and uncertainty
   - Empower patients to ask questions

3. **Avoid Discriminatory Use**
   - Do NOT use predictions to deny care or insurance
   - Monitor for disparate impact across racial/ethnic groups
   - Provide same quality of care regardless of predicted LOS

4. **Data Privacy**
   - Model trained on de-identified data
   - Do NOT re-identify patients from predictions
   - Comply with HIPAA and local privacy regulations

5. **Model Governance**
   - Document all predictions for audit trails
   - Establish human oversight processes
   - Monitor real-world outcomes vs. predictions

### Fairness Analysis

**Demographic Parity** (should be analyzed):
- Prediction distributions should be similar across race/ethnicity groups *for similar clinical profiles*
- Differences may reflect genuine clinical needs OR systemic biases

**Example Analysis**:
```python
# Check prediction distributions by race
# (patients_df is the scored dataframe from the batch prediction example)
results_by_race = patients_df.groupby('Race')['predicted_los'].describe()
print(results_by_race)

# Flag if mean predictions differ by >20% across groups
# (May indicate bias OR clinical differences - requires clinical review)
```
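
The 20% flag mentioned in the comments above could be implemented as in this sketch. It measures the gap between the highest and lowest group means relative to the lowest mean, which is one possible reading of "differ by >20%"; the data is made up.

```python
import pandas as pd

# Made-up scored records.
scored = pd.DataFrame({
    "Race": [
        "White", "White",
        "Black/African American", "Black/African American",
        "Other Race", "Other Race",
    ],
    "predicted_los": [4.0, 6.0, 5.0, 8.0, 4.5, 5.5],
})

group_means = scored.groupby("Race")["predicted_los"].mean()

# Relative gap between the highest and lowest group mean (assumed definition).
relative_gap = (group_means.max() - group_means.min()) / group_means.min()
needs_review = bool(relative_gap > 0.20)

print(group_means)
print(f"Relative gap: {relative_gap:.0%}, needs clinical review: {needs_review}")
```

A raised flag is a prompt for clinical review, not evidence of bias on its own, since case mix can differ across groups.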


## Model Card Authors

- **Primary Author**: Ajiboye Toluwalase
- **Contact**: ajiboyetolu1@gmail.com
- **Organization**: Metro's Tech

---

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{hospital_los_xgboost_2026,
  author = {Ajiboye Toluwalase},
  title = {Hospital Length of Stay Predictor - XGBoost Pipeline},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ajiboye/hospital_predict_model}},
  note = {Trained on SPARCS NY 2017 dataset}
}
```

**Data Source Citation**:
```
New York State Department of Health. (2017). Hospital Inpatient Discharges 
(SPARCS De-Identified): 2017. https://health.data.ny.gov/
```

---

## Model Files

This repository contains:

```
hospital-los-xgboost/
├── xgb_hospital_full_pipeline.pkl       # Complete pipeline (recommended)
├── xgb_modelv1.pkl                      # XGBoost model only
├── hospital_data_cleanerv1.pkl          # Preprocessor only
├── feature_names.pkl                    # Expected 312 feature names
├── README.md                            # This model card
└── requirements.txt                     # Python dependencies
```

**Total size**: ~15 MB (compressed)

---

## Changelog

### Version 1.0.0 (February 2026)
- Initial release
- Trained on SPARCS 2017 dataset (2.3M records)
- 13 input features → 312 encoded features
- XGBoost regressor with target-encoded features
- Confidence interval estimation
- Risk factor analysis

### Planned Updates
- [ ] Retrain on 2022-2024 data
- [ ] Add SHAP explanations
- [ ] Incorporate CMS quality metrics
- [ ] Multi-output prediction (LOS + readmission risk)
- [ ] Fairness-aware training

---

## Acknowledgments

- **New York State Department of Health** for SPARCS data access
- **Kaggle community** for data hosting and discussions
- **XGBoost development team** for the excellent ML framework
- **Hugging Face** for model hosting infrastructure

---

## License

This model is released under the **MIT License**.

```
MIT License

Copyright (c) 2025 Ajiboye Toluwalase

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
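FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.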
```

---

## Additional Resources

- 💻 [GitHub Repository](https://github.com/metrosmash/Hospital_LOS_Predictor)
- 📧 [Contact for Collaboration](mailto:ajiboyetolu1@gmail.com)

---

**⚕️ Remember**: This model is a tool to support healthcare professionals, not replace them. Always involve clinical expertise in patient care decisions.

---

*Last updated: February 2026*