---
language:
- en
license: mit
tags:
- sklearn
- tabular-regression
- health
- life-expectancy
- gradient-boosting
- scikit-learn
pipeline_tag: tabular-regression
library_name: sklearn
metrics:
- r2
- rmse
model-index:
- name: Life Expectancy Predictor
  results:
  - task:
      type: tabular-regression
      name: Tabular Regression
    metrics:
    - type: r2
      value: 0.87
      name: R² Score
    - type: rmse
      value: 4.0
      name: RMSE (years)
---

# Life Expectancy Predictor

A **Gradient Boosting Regressor** trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.

## Model Description

This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison.

| Artifact | File | Description |
|---|---|---|
| Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (472 KB) |
| Baseline model | `linear_model.pkl` | LinearRegression (4 KB) |
| Feature scaler | `scaler.pkl` | StandardScaler for all features |
| Categorical encoder | `preprocessor.pkl` | LabelEncoder mapping for categorical inputs |

## Intended Use

- **Research & education:** understanding which health factors most affect life expectancy.
- **Health-tech prototypes:** powering wellness apps or patient-facing dashboards.
- **Academic exploration:** studying gradient boosting on tabular health data.

**Not intended for:** clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.

## How to Use

### Install dependencies

```bash
pip install "scikit-learn>=1.5.0" joblib numpy huggingface_hub
```

### Load and run inference

```python
import joblib
import numpy as np

# Load artifacts
model = joblib.load("gradient_boosting_model.pkl")
scaler = joblib.load("scaler.pkl")
preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders

# --- Prepare a sample input ---
# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
# Categorical features: Gender, Physical_Activity, Smoking_Status,
#                       Alcohol_Consumption, Diet, Blood_Pressure

def encode_and_predict(sample: dict) -> float:
    """
    sample keys (all required):
        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
        Diabetes, Hypertension, Heart_Disease, Asthma
    """
    categorical_cols = [
        "Gender", "Physical_Activity", "Smoking_Status",
        "Alcohol_Consumption", "Diet", "Blood_Pressure",
    ]
    encoded = dict(sample)  # work on a copy so the caller's dict is not mutated
    for col in categorical_cols:
        le = preprocessor[col]          # LabelEncoder for this column
        encoded[col] = le.transform([sample[col]])[0]

    # Column order must match the order used when the scaler and model were fitted
    feature_order = [
        "Gender", "Height", "Weight", "BMI",
        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
        "Diet", "Blood_Pressure", "Cholesterol",
        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
    ]
    X = np.array([[encoded[f] for f in feature_order]], dtype=float)
    X_scaled = scaler.transform(X)
    return float(model.predict(X_scaled)[0])


sample = {
    "Gender": "Male",
    "Height": 175,
    "Weight": 75,
    "BMI": 24.5,
    "Physical_Activity": "Medium",
    "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate",
    "Diet": "Good",
    "Blood_Pressure": "Normal",
    "Cholesterol": 190,
    "Diabetes": 0,
    "Hypertension": 0,
    "Heart_Disease": 0,
    "Asthma": 0,
}

prediction = encode_and_predict(sample)
print(f"Predicted life expectancy: {prediction:.1f} years")
```

### Download from the Hub

```python
from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="gradient_boosting_model.pkl")
)
scaler = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="scaler.pkl")
)
preprocessor = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="preprocessor.pkl")
)
```

## Input Features

| Feature | Type | Values / Range | Description |
|---|---|---|---|
| `Gender` | categorical | Male / Female | Biological sex |
| `Height` | numerical | cm | Body height |
| `Weight` | numerical | kg | Body weight |
| `BMI` | numerical | continuous | Body Mass Index |
| `Physical_Activity` | categorical | Low / Medium / High | Exercise level |
| `Smoking_Status` | categorical | Never / Former / Current | Smoking history |
| `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake |
| `Diet` | categorical | Poor / Average / Good | Overall diet quality |
| `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category |
| `Cholesterol` | numerical | mg/dL | Total cholesterol level |
| `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag |
| `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag |
| `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag |
| `Asthma` | binary | 0 / 1 | Asthma diagnosis flag |
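
Because `LabelEncoder.transform` raises a `ValueError` on labels it has not seen, it can help to check a sample against the table above before encoding. The following is a minimal sketch that assumes the table is exhaustive; `validate_sample` and the constant names are hypothetical helpers, not part of the published artifacts.

```python
# Allowed values copied from the Input Features table (assumed exhaustive).
# Note that "None" for Alcohol_Consumption is the string "None", not Python's None.
ALLOWED_CATEGORIES = {
    "Gender": ("Male", "Female"),
    "Physical_Activity": ("Low", "Medium", "High"),
    "Smoking_Status": ("Never", "Former", "Current"),
    "Alcohol_Consumption": ("None", "Moderate", "Heavy"),
    "Diet": ("Poor", "Average", "Good"),
    "Blood_Pressure": ("Low", "Normal", "High"),
}
NUMERIC_FEATURES = ("Height", "Weight", "BMI", "Cholesterol")
BINARY_FEATURES = ("Diabetes", "Hypertension", "Heart_Disease", "Asthma")


def validate_sample(sample: dict) -> list:
    """Return a list of problems found; an empty list means the sample looks valid."""
    problems = []
    for name, allowed in ALLOWED_CATEGORIES.items():
        if name not in sample:
            problems.append(f"missing feature: {name}")
        elif sample[name] not in allowed:
            problems.append(f"{name}: {sample[name]!r} not in {list(allowed)}")
    for name in NUMERIC_FEATURES:
        if name not in sample:
            problems.append(f"missing feature: {name}")
        elif isinstance(sample[name], bool) or not isinstance(sample[name], (int, float)):
            problems.append(f"{name}: expected a number, got {sample[name]!r}")
    for name in BINARY_FEATURES:
        if name not in sample:
            problems.append(f"missing feature: {name}")
        elif sample[name] not in (0, 1):
            problems.append(f"{name}: expected 0 or 1, got {sample[name]!r}")
    return problems
```

Calling `validate_sample(sample)` before `encode_and_predict(sample)` turns an encoder error into a readable list of problems.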

## Output

A single continuous float representing **predicted life expectancy in years**.

## Training Details

### Dataset
- **Size:** ~10,002 records
- **Split:** 68 % train / 10 % validation / 22 % test
- **Target variable:** `Age` (life expectancy in years)
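
The card does not say how the split was produced. One way to obtain roughly these proportions is two successive calls to scikit-learn's `train_test_split`. The function name `split_68_10_22` and the seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_68_10_22(X, y, seed=42):
    """Split into roughly 68 % train, 10 % validation, 22 % test."""
    # First hold out the 22 % test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.22, random_state=seed
    )
    # Validation is 10 % of the whole, i.e. 10/78 of what remains.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.10 / 0.78, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```

Exact counts can differ by a record or so because scikit-learn rounds the requested fractions.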

### Preprocessing
1. Fill missing categorical values with `"None"`.
2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`).
3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`).
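
The three steps above could be reproduced along the following lines. This is a sketch, not the original training script: `fit_preprocessing` is a hypothetical helper, rows are assumed to be plain dicts, and missing numeric values are not handled because the card does not describe how they were treated.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler


def fit_preprocessing(rows, feature_order, categorical_cols):
    """Fit per-column encoders and one scaler; return (encoders, scaler, X_scaled)."""
    encoders = {}
    columns = []
    for name in feature_order:
        values = [row.get(name) for row in rows]
        if name in categorical_cols:
            # Step 1: fill missing categorical values with the string "None".
            values = ["None" if v is None else v for v in values]
            # Step 2: one LabelEncoder per categorical column.
            encoders[name] = LabelEncoder().fit(values)
            values = encoders[name].transform(values)
        columns.append(np.asarray(values, dtype=float))
    X = np.column_stack(columns)
    # Step 3: scale every column, encoded categoricals included.
    scaler = StandardScaler().fit(X)
    return encoders, scaler, scaler.transform(X)
```

The returned `encoders` dict and `scaler` play the roles of `preprocessor.pkl` and `scaler.pkl`, and could be persisted with `joblib.dump`.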

### Primary Model — GradientBoostingRegressor

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
```

### Baseline Model — LinearRegression

A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking.
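
To see how the two model types are fitted and compared, the sketch below trains both on synthetic stand-in data. The data, the non-linear target, and the 400/100 split are invented for illustration and are unrelated to the model's actual training set; only the hyperparameters come from this card.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the model's training set): 14 features and
# a target that depends non-linearly on the first one.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 14))
y = 70 + 5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=500)
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Same hyperparameters as the primary model described above.
gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=5,
    min_samples_split=5, min_samples_leaf=2, random_state=42,
).fit(X_train, y_train)
baseline = LinearRegression().fit(X_train, y_train)

# For scikit-learn regressors, .score() returns R² on the given data.
gbr_r2 = gbr.score(X_test, y_test)
baseline_r2 = baseline.score(X_test, y_test)
print(f"GradientBoosting R²: {gbr_r2:.3f} | LinearRegression R²: {baseline_r2:.3f}")
```

On a target like this the boosted model should score well above the linear baseline. The numbers say nothing about performance on the real dataset.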

## Performance

| Metric | Value |
|---|---|
| R² (test set) | 0.85 – 0.92 |
| RMSE | 3 – 5 years |
| Confidence score | 0.87 |

*Metrics are on the held-out test split (~22 % of 10 k records).*
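
For reference, both metrics can be computed from predictions with a few lines of NumPy. This sketch uses the standard definitions, R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean squared error); the function names `r2` and `rmse` are illustrative.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (years)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

For example, with true values `[70, 80, 90]` and predictions `[72, 78, 90]`, `r2` gives 0.96 and `rmse` gives about 1.63 years.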

## Limitations

- The model is trained on a **synthetic / illustrative dataset**; real-world generalization is not guaranteed.
- It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
- The integer code assigned to each category depends on the set of classes the `LabelEncoder` saw when it was fitted, so always use the supplied `preprocessor.pkl` rather than re-encoding independently.
- Predictions for feature combinations far outside the training distribution may be unreliable.
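
The encoding caveat can be seen in a small experiment. `LabelEncoder` sorts the classes it is fitted on, so an encoder refitted on data that lacks one category assigns different integers. The two encoders below are fitted on toy lists, not on the model's data.

```python
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on all three activity levels.
full = LabelEncoder().fit(["Low", "Medium", "High"])
# Encoder refitted on data that happens to contain no "High" rows.
partial = LabelEncoder().fit(["Low", "Medium"])

code_full = int(full.transform(["Low"])[0])        # classes: High, Low, Medium
code_partial = int(partial.transform(["Low"])[0])  # classes: Low, Medium

print(list(full.classes_), code_full)
print(list(partial.classes_), code_partial)
```

The same label `"Low"` maps to 1 under the first encoder and 0 under the second, so a model trained with one encoding would silently misread inputs encoded with the other. The refitted encoder also raises a `ValueError` if asked to transform `"High"`.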

## Ethical Considerations

- **Not a medical device.** Do not use predictions to make clinical, insurance, or policy decisions.
- **Fairness:** The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
- **Privacy:** No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{lebiraja2024lifeexpectancy,
  author       = {lebiraja},
  title        = {Life Expectancy Predictor},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
}
```

## License

MIT