lebiraja committed
Commit 6b8b156 · verified · 1 Parent(s): 18146cc

Add gradient boosting life expectancy model with preprocessing artifacts and model card

Files changed (5)
  1. README.md +239 -0
  2. gradient_boosting_model.pkl +3 -0
  3. linear_model.pkl +3 -0
  4. preprocessor.pkl +3 -0
  5. scaler.pkl +3 -0
README.md ADDED
@@ -0,0 +1,239 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - sklearn
+ - tabular-regression
+ - health
+ - life-expectancy
+ - gradient-boosting
+ - scikit-learn
+ pipeline_tag: tabular-regression
+ library_name: sklearn
+ metrics:
+ - r2
+ - rmse
+ model-index:
+ - name: Life Expectancy Predictor
+   results:
+   - task:
+       type: tabular-regression
+       name: Tabular Regression
+     metrics:
+     - type: r2
+       value: 0.87
+       name: R² Score
+     - type: rmse
+       value: 4.0
+       name: RMSE (years)
+ ---
+
+ # Life Expectancy Predictor
+
+ A **Gradient Boosting Regressor** trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.
+
+ ## Model Description
+
+ This model takes a snapshot of an individual's health profile (physical attributes, lifestyle habits, and medical history) and returns a predicted life expectancy in years. The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison.
+
+ | Artifact | File | Description |
+ |---|---|---|
+ | Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (~470 KB) |
+ | Baseline model | `linear_model.pkl` | LinearRegression (under 1 KB) |
+ | Feature scaler | `scaler.pkl` | StandardScaler for all features |
+ | Categorical encoder | `preprocessor.pkl` | LabelEncoder mapping for categorical inputs |
+
+ ## Intended Use
+
+ - **Research & education:** understanding which health factors most affect life expectancy.
+ - **Health-tech prototypes:** powering wellness apps or patient-facing dashboards.
+ - **Academic exploration:** studying gradient boosting on tabular health data.
+
+ **Not intended for:** clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.
+
+ ## How to Use
+
+ ### Install dependencies
+
+ ```bash
+ pip install "scikit-learn>=1.5.0" joblib numpy huggingface_hub
+ ```
+
+ ### Load and run inference
+
+ ```python
+ import joblib
+ import numpy as np
+
+ # Load artifacts
+ model = joblib.load("gradient_boosting_model.pkl")
+ scaler = joblib.load("scaler.pkl")
+ preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders
+
+ # --- Prepare a sample input ---
+ # Categorical columns and their LabelEncoders are stored in preprocessor.pkl
+ # Categorical features: Gender, Physical_Activity, Smoking_Status,
+ #                       Alcohol_Consumption, Diet, Blood_Pressure
+
+ def encode_and_predict(sample: dict) -> float:
+     """
+     sample keys (all required):
+         Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
+         Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
+         Diabetes, Hypertension, Heart_Disease, Asthma
+     """
+     categorical_cols = [
+         "Gender", "Physical_Activity", "Smoking_Status",
+         "Alcohol_Consumption", "Diet", "Blood_Pressure",
+     ]
+     encoded = dict(sample)  # copy so the caller's dict is not mutated
+     for col in categorical_cols:
+         le = preprocessor[col]  # LabelEncoder for this column
+         encoded[col] = le.transform([encoded[col]])[0]
+
+     # Column order must match the order used when the scaler was fitted
+     feature_order = [
+         "Gender", "Height", "Weight", "BMI",
+         "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
+         "Diet", "Blood_Pressure", "Cholesterol",
+         "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
+     ]
+     X = np.array([[encoded[f] for f in feature_order]])
+     X_scaled = scaler.transform(X)
+     return float(model.predict(X_scaled)[0])
+
+
+ sample = {
+     "Gender": "Male",
+     "Height": 175,
+     "Weight": 75,
+     "BMI": 24.5,
+     "Physical_Activity": "Medium",
+     "Smoking_Status": "Never",
+     "Alcohol_Consumption": "Moderate",
+     "Diet": "Good",
+     "Blood_Pressure": "Normal",
+     "Cholesterol": 190,
+     "Diabetes": 0,
+     "Hypertension": 0,
+     "Heart_Disease": 0,
+     "Asthma": 0,
+ }
+
+ prediction = encode_and_predict(sample)
+ print(f"Predicted life expectancy: {prediction:.1f} years")
+ ```
+
+ ### Download from the Hub
+
+ ```python
+ from huggingface_hub import hf_hub_download
+ import joblib
+
+ model = joblib.load(
+     hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
+                     filename="gradient_boosting_model.pkl")
+ )
+ scaler = joblib.load(
+     hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
+                     filename="scaler.pkl")
+ )
+ preprocessor = joblib.load(
+     hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
+                     filename="preprocessor.pkl")
+ )
+ ```
+
+ ## Input Features
+
+ | Feature | Type | Values / Units | Description |
+ |---|---|---|---|
+ | `Gender` | categorical | Male / Female | Biological sex |
+ | `Height` | numerical | cm | Body height |
+ | `Weight` | numerical | kg | Body weight |
+ | `BMI` | numerical | kg/m² | Body Mass Index |
+ | `Physical_Activity` | categorical | Low / Medium / High | Exercise level |
+ | `Smoking_Status` | categorical | Never / Former / Current | Smoking history |
+ | `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake |
+ | `Diet` | categorical | Poor / Average / Good | Overall diet quality |
+ | `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category |
+ | `Cholesterol` | numerical | mg/dL | Total cholesterol level |
+ | `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag |
+ | `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag |
+ | `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag |
+ | `Asthma` | binary | 0 / 1 | Asthma diagnosis flag |
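
Because the label encoders only know the categories they were fitted on, it can help to check a sample against the feature table before encoding it. The following is a minimal sketch, not part of the committed code: `validate_sample`, `bmi`, and the `example` dict are hypothetical names, and the allowed values are copied from the table on the assumption that it matches the fitted encoders exactly.

```python
# Allowed values copied from the Input Features table (assumed to match
# the categories the saved LabelEncoders were fitted on).
CATEGORICAL_VALUES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERICAL = ("Height", "Weight", "BMI", "Cholesterol")
BINARY = ("Diabetes", "Hypertension", "Heart_Disease", "Asthma")


def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems; an empty list means the sample looks usable."""
    problems = []
    for name, allowed in CATEGORICAL_VALUES.items():
        if sample.get(name) not in allowed:
            problems.append(f"{name}: expected one of {sorted(allowed)}")
    for name in NUMERICAL:
        value = sample.get(name)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number")
    for name in BINARY:
        if sample.get(name) not in (0, 1):
            problems.append(f"{name}: expected 0 or 1")
    return problems


def bmi(height_cm: float, weight_kg: float) -> float:
    """Body Mass Index: weight in kg divided by height in metres squared."""
    return weight_kg / (height_cm / 100) ** 2


# Hypothetical example input using values from the table.
example = {
    "Gender": "Male", "Height": 175, "Weight": 75, "BMI": 24.5,
    "Physical_Activity": "Medium", "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate", "Diet": "Good",
    "Blood_Pressure": "Normal", "Cholesterol": 190,
    "Diabetes": 0, "Hypertension": 0, "Heart_Disease": 0, "Asthma": 0,
}
problems = validate_sample(example)
print(problems)
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid field at once. The `bmi` helper is only a consistency check on `Height` and `Weight`; the model card does not say whether BMI was derived this way in the training data.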
+
+ ## Output
+
+ A single continuous float representing **predicted life expectancy in years**.
+
+ ## Training Details
+
+ ### Dataset
+ - **Size:** ~10,002 records
+ - **Split:** 68 % train / 10 % validation / 22 % test
+ - **Target variable:** `Age` (life expectancy in years)
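
To make the split proportions concrete, here is a rough sketch of how 10,002 records divide under a 68 / 10 / 22 split. The model card does not say how the split was actually produced, so the shuffling approach, the rounding, and the `SEED` value are all assumptions for illustration.

```python
import random

N_RECORDS = 10_002      # dataset size quoted in the model card
TEST_FRACTION = 0.22
VAL_FRACTION = 0.10
SEED = 42               # assumption: any fixed seed gives a reproducible split

# Shuffle row indices once, then carve off test and validation slices.
indices = list(range(N_RECORDS))
random.Random(SEED).shuffle(indices)

n_test = round(N_RECORDS * TEST_FRACTION)   # 2200 rows
n_val = round(N_RECORDS * VAL_FRACTION)     # 1000 rows
test_idx = indices[:n_test]
val_idx = indices[n_test:n_test + n_val]
train_idx = indices[n_test + n_val:]        # the remaining 6802 rows

print(len(train_idx), len(val_idx), len(test_idx))  # 6802 1000 2200
```

With these sizes the training share works out to 6802 / 10002, or about 68 %, which agrees with the stated split.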
+
+ ### Preprocessing
+ 1. Fill missing categorical values with `"None"`.
+ 2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`).
+ 3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`).
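
The three preprocessing steps above can be sketched roughly as follows. This is an illustration, not the author's training script: the toy `rows`, the reduced three-column `feature_order`, and the variable names are hypothetical, and only `LabelEncoder` and `StandardScaler` come from the model card.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy rows; the real training data is not part of this commit.
rows = [
    {"Gender": "Male", "Diet": "Good", "Height": 175.0},
    {"Gender": "Female", "Diet": None, "Height": 162.0},
    {"Gender": "Female", "Diet": "Poor", "Height": 168.0},
]
categorical_cols = ["Gender", "Diet"]
feature_order = ["Gender", "Height", "Diet"]

# Step 1: fill missing categorical values with the string "None".
for row in rows:
    for col in categorical_cols:
        if row[col] is None:
            row[col] = "None"

# Step 2: fit one LabelEncoder per categorical column and keep it in a dict,
# mirroring the "dict of LabelEncoders" layout described for preprocessor.pkl.
encoders = {}
for col in categorical_cols:
    values = [row[col] for row in rows]
    encoders[col] = LabelEncoder().fit(values)
    for row, code in zip(rows, encoders[col].transform(values)):
        row[col] = int(code)

# Step 3: standardise every column (encoded categoricals included).
X = np.array([[row[f] for f in feature_order] for row in rows], dtype=float)
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(list(encoders["Diet"].classes_))
print(X_scaled.round(2))
```

`LabelEncoder` assigns integer codes in sorted order of the observed labels, so the codes depend on which categories appeared during fitting. The fitted `encoders` dict and `scaler` could then be persisted with `joblib.dump` so that inference reuses exactly the same mappings.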
+
+ ### Primary Model: GradientBoostingRegressor
+
+ ```python
+ from sklearn.ensemble import GradientBoostingRegressor
+
+ model = GradientBoostingRegressor(
+     n_estimators=100,
+     learning_rate=0.1,
+     max_depth=5,
+     min_samples_split=5,
+     min_samples_leaf=2,
+     random_state=42,
+ )
+ ```
+
+ ### Baseline Model: LinearRegression
+
+ A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking.
+
+ ## Performance
+
+ | Metric | Value |
+ |---|---|
+ | R² (test set) | 0.85 – 0.92 |
+ | RMSE | 3 – 5 years |
+ | Confidence score | 0.87 |
+
+ *Metrics are on the held-out test split (~22 % of 10 k records).*
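
For readers unfamiliar with the two metrics, the sketch below computes R² and RMSE from their standard definitions in plain Python. The `y_true` and `y_pred` values are hypothetical and are not taken from the model's test set; `r_squared` and `rmse` are illustrative helper names.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (years)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical values in years, for illustration only.
y_true = [70.0, 75.0, 80.0, 85.0]
y_pred = [72.0, 74.0, 79.0, 87.0]

print(f"R2 = {r_squared(y_true, y_pred):.2f}")     # R2 = 0.92
print(f"RMSE = {rmse(y_true, y_pred):.2f} years")  # RMSE = 1.58 years
```

R² measures the share of variance in the target that the predictions explain, while RMSE reports the typical error size in years, which is why the model card quotes both.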
+
+ ## Limitations
+
+ - The model is trained on a **synthetic / illustrative dataset**; real-world generalization is not guaranteed.
+ - It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
+ - Categorical label encodings are order-sensitive, so always use the supplied `preprocessor.pkl` rather than re-encoding independently.
+ - Predictions for feature combinations far outside the training distribution may be unreliable.
+
+ ## Ethical Considerations
+
+ - **Not a medical device.** Do not use predictions to make clinical, insurance, or policy decisions.
+ - **Fairness:** The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
+ - **Privacy:** No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).
+
+ ## Citation
+
+ If you use this model in your research or application, please cite:
+
+ ```bibtex
+ @misc{lebiraja2024lifeexpectancy,
+   author = {lebiraja},
+   title = {Life Expectancy Predictor},
+   year = {2024},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
+ }
+ ```
+
+ ## License
+
+ MIT
gradient_boosting_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08121de88eb105e28708379fd78d2d22b2e1053b10147f439261c2ccd9d2304a
+ size 480904
linear_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fcbe475d91b16513ee1362ed17f85ca4b407dee32dd4f7806133756c0107e610
+ size 793
preprocessor.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8390e184303c5f9fbdb2b7dbf461b5f8a971bd4825a53420b7972f5a6512617f
+ size 1881
scaler.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1c3451a1910bcac7f9da1ef0fc91408af99131eb489951632aaff22f936a2b8
+ size 1351