---
language:
- en
license: mit
tags:
- sklearn
- tabular-regression
- health
- life-expectancy
- gradient-boosting
- scikit-learn
pipeline_tag: tabular-regression
library_name: sklearn
metrics:
- r2
- rmse
model-index:
- name: Life Expectancy Predictor
  results:
  - task:
      type: tabular-regression
      name: Tabular Regression
    metrics:
    - type: r2
      value: 0.87
      name: R² Score
    - type: rmse
      value: 4.0
      name: RMSE (years)
---

# Life Expectancy Predictor

A **Gradient Boosting Regressor** trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.

## Model Description

This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison.

| Artifact | File | Description |
|---|---|---|
| Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (472 KB) |
| Baseline model | `linear_model.pkl` | LinearRegression (4 KB) |
| Feature scaler | `scaler.pkl` | StandardScaler for all features |
| Categorical encoder | `preprocessor.pkl` | LabelEncoder mapping for categorical inputs |

## Intended Use

- **Research & education:** understanding which health factors most affect life expectancy.
- **Health-tech prototypes:** powering wellness apps or patient-facing dashboards.
- **Academic exploration:** studying gradient boosting on tabular health data.

**Not intended for:** clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.

## How to Use

### Install dependencies

```bash
pip install "scikit-learn>=1.5.0" joblib numpy
```

### Load and run inference

```python
import joblib
import numpy as np

# Load artifacts
model = joblib.load("gradient_boosting_model.pkl")
scaler = joblib.load("scaler.pkl")
preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders

# --- Prepare a sample input ---
# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
# Categorical features: Gender, Physical_Activity, Smoking_Status,
#                       Alcohol_Consumption, Diet, Blood_Pressure

def encode_and_predict(sample: dict) -> float:
    """
    sample keys (all required):
        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
        Diabetes, Hypertension, Heart_Disease, Asthma
    """
    categorical_cols = [
        "Gender", "Physical_Activity", "Smoking_Status",
        "Alcohol_Consumption", "Diet", "Blood_Pressure",
    ]
    row = dict(sample)                  # copy so the caller's dict is not mutated
    for col in categorical_cols:
        le = preprocessor[col]          # LabelEncoder for this column
        row[col] = le.transform([row[col]])[0]

    # Column order must match the order used when the scaler and model were fit
    feature_order = [
        "Gender", "Height", "Weight", "BMI",
        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
        "Diet", "Blood_Pressure", "Cholesterol",
        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
    ]
    X = np.array([[row[f] for f in feature_order]], dtype=float)
    X_scaled = scaler.transform(X)
    return float(model.predict(X_scaled)[0])


sample = {
    "Gender": "Male",
    "Height": 175,
    "Weight": 75,
    "BMI": 24.5,
    "Physical_Activity": "Medium",
    "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate",
    "Diet": "Good",
    "Blood_Pressure": "Normal",
    "Cholesterol": 190,
    "Diabetes": 0,
    "Hypertension": 0,
    "Heart_Disease": 0,
    "Asthma": 0,
}

prediction = encode_and_predict(sample)
print(f"Predicted life expectancy: {prediction:.1f} years")
```

### Download from the Hub

```python
from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="gradient_boosting_model.pkl")
)
scaler = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="scaler.pkl")
)
preprocessor = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="preprocessor.pkl")
)
```

## Input Features

| Feature | Type | Values / Range | Description |
|---|---|---|---|
| `Gender` | categorical | Male / Female | Biological sex |
| `Height` | numerical | cm | Body height |
| `Weight` | numerical | kg | Body weight |
| `BMI` | numerical | continuous | Body Mass Index |
| `Physical_Activity` | categorical | Low / Medium / High | Exercise level |
| `Smoking_Status` | categorical | Never / Former / Current | Smoking history |
| `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake |
| `Diet` | categorical | Poor / Average / Good | Overall diet quality |
| `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category |
| `Cholesterol` | numerical | mg/dL | Total cholesterol level |
| `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag |
| `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag |
| `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag |
| `Asthma` | binary | 0 / 1 | Asthma diagnosis flag |
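Because `LabelEncoder.transform` raises a `ValueError` on labels it has not seen, it can help to check a sample against the table above before encoding. The following is a minimal sketch, not part of the published artifacts: the helper name `validate_sample` and the constant names are hypothetical, and the allowed values are copied from the table rather than read from `preprocessor.pkl`.

```python
# Allowed values copied from the Input Features table (assumption: the
# encoders in preprocessor.pkl were fit on exactly these labels).
CATEGORICAL_VALUES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERIC_FEATURES = ["Height", "Weight", "BMI", "Cholesterol"]
BINARY_FEATURES = ["Diabetes", "Hypertension", "Heart_Disease", "Asthma"]


def validate_sample(sample):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, allowed in CATEGORICAL_VALUES.items():
        if name not in sample:
            problems.append(f"{name}: missing")
        elif sample[name] not in allowed:
            problems.append(f"{name}: unexpected value {sample[name]!r}")
    for name in NUMERIC_FEATURES:
        value = sample.get(name)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number")
    for name in BINARY_FEATURES:
        if sample.get(name) not in (0, 1):
            problems.append(f"{name}: expected 0 or 1")
    return problems


good_sample = {
    "Gender": "Male", "Height": 175, "Weight": 75, "BMI": 24.5,
    "Physical_Activity": "Medium", "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate", "Diet": "Good",
    "Blood_Pressure": "Normal", "Cholesterol": 190,
    "Diabetes": 0, "Hypertension": 0, "Heart_Disease": 0, "Asthma": 0,
}
print(validate_sample(good_sample))  # prints []
```

If the labels stored in the encoders differ from the table, the `classes_` attribute of each `LabelEncoder` is the more reliable source for the allowed values.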

## Output

A single continuous float representing **predicted life expectancy in years**.

## Training Details

### Dataset
- **Size:** 10,002 records
- **Split:** 68 % train / 10 % validation / 22 % test
- **Target variable:** `Age` (life expectancy in years)
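
One way to obtain a 68 / 10 / 22 split is to call `train_test_split` twice with absolute row counts. This is a sketch under assumptions: the toy arrays, the size `n = 1000`, and the seed are illustrative, and the original training script may have produced the split differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 1000                                  # toy size, not the real dataset
X = np.arange(n * 2).reshape(n, 2)        # placeholder features
y = np.arange(n)                          # placeholder target

# Use absolute counts so the three parts add up exactly to n.
n_test = round(0.22 * n)
n_val = round(0.10 * n)

# First carve off the test set, then take the validation set from the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # prints 680 100 220
```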

### Preprocessing
1. Fill missing categorical values with `"None"`.
2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`).
3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`).
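
The three steps above can be sketched on a toy table. The column values below are made up for illustration, only three of the 14 features are shown, and the variable names `columns` and `encoders` are hypothetical; the actual training code is not part of this card.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data: Python None marks a missing categorical value.
columns = {
    "Smoking_Status": ["Never", "Current", "Former", "Never"],
    "Alcohol_Consumption": ["Moderate", None, "Heavy", None],
    "Cholesterol": [190.0, 240.0, 210.0, 180.0],
}
categorical_cols = ["Smoking_Status", "Alcohol_Consumption"]

# Step 1: fill missing categorical values with the string "None".
for col in categorical_cols:
    columns[col] = ["None" if v is None else v for v in columns[col]]

# Step 2: fit one LabelEncoder per categorical column and keep each
# encoder so the same mapping can be reused at inference time.
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder().fit(columns[col])
    columns[col] = encoders[col].transform(columns[col])

# Step 3: standardize every column (encoded categoricals included).
X = np.column_stack([columns[c] for c in columns]).astype(float)
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(list(encoders["Alcohol_Consumption"].classes_))
```

In a real pipeline the encoders and scaler would be fit on the training split only and then saved, for example with `joblib.dump`.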

### Primary Model — GradientBoostingRegressor

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
```
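
To see how a model with these hyperparameters can be inspected, the sketch below fits it on random synthetic data and ranks `feature_importances_`. Everything about the data is an assumption made for illustration: the target is constructed to depend only on the `BMI` column, so the ranking says nothing about the published model or real health factors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

feature_order = [
    "Gender", "Height", "Weight", "BMI",
    "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
    "Diet", "Blood_Pressure", "Cholesterol",
    "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
]

# Synthetic stand-in data: the target depends only on column 3 ("BMI").
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(feature_order)))
y = 70.0 + 10.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
model.fit(X, y)

# Pair each importance with its feature name, largest first.
ranked = sorted(
    zip(feature_order, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:3]:
    print(f"{name}: {importance:.3f}")
```

The same ranking loop can be applied to the loaded `gradient_boosting_model.pkl`, provided `feature_order` matches the column order used at training time.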

### Baseline Model — LinearRegression

A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking.

## Performance

| Metric | Value |
|---|---|
| R² (test set) | 0.85 – 0.92 |
| RMSE | 3 – 5 years |
| Confidence score | 0.87 |

*Metrics are on the held-out test split (~22 % of 10 k records).*
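
For reference, R² and RMSE can be computed from predictions as shown below. This is a minimal sketch with four made-up values, not the evaluation code behind the table above; the helper name `r2_and_rmse` is hypothetical.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return float(r2), float(rmse)


# Made-up ages in years: residuals are -2, 2, -1, 2.
r2, rmse = r2_and_rmse([70, 80, 75, 85], [72, 78, 76, 83])
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f} years")  # R2 = 0.896, RMSE = 1.80 years
```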

## Limitations

- The model is trained on a **synthetic / illustrative dataset**; real-world generalization is not guaranteed.
- It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
- Label-encoded integers depend on the set of classes seen when each `LabelEncoder` was fit, and the codes carry no ordinal meaning. Always use the supplied `preprocessor.pkl` rather than re-encoding independently.
- Predictions for feature combinations far outside the training distribution may be unreliable.
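
The encoding caveat above can be illustrated with a small sketch. `LabelEncoder` assigns codes by the sorted order of the classes it was fit on, so refitting on different data changes the mapping. The labels here are taken from the `Physical_Activity` row of the feature table; the variable names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Encoder fit on all three labels versus one that never saw "High".
full = LabelEncoder().fit(["Low", "Medium", "High"])
partial = LabelEncoder().fit(["Low", "Medium"])

# classes_ is sorted, so the position of a label is its integer code.
full_codes = {str(c): i for i, c in enumerate(full.classes_)}
partial_codes = {str(c): i for i, c in enumerate(partial.classes_)}

print(full_codes)     # {'High': 0, 'Low': 1, 'Medium': 2}
print(partial_codes)  # {'Low': 0, 'Medium': 1}

# A label outside the fitted classes is rejected rather than guessed.
try:
    partial.transform(["High"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)  # True
```

Here "Low" maps to 1 under the first encoder and to 0 under the second, which is why a model trained with one mapping gives meaningless results when fed codes from another.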

## Ethical Considerations

- **Not a medical device.** Do not use predictions to make clinical, insurance, or policy decisions.
- **Fairness:** The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
- **Privacy:** No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{lebiraja2024lifeexpectancy,
  author       = {lebiraja},
  title        = {Life Expectancy Predictor},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
}
```

## License

MIT