Add gradient boosting life expectancy model with preprocessing artifacts and model card

Files changed (5)

README.md +239 -0
gradient_boosting_model.pkl +3 -0
linear_model.pkl +3 -0
preprocessor.pkl +3 -0
scaler.pkl +3 -0

README.md ADDED

	@@ -0,0 +1,239 @@

+---
+language:
+- en
+license: mit
+tags:
+- sklearn
+- tabular-regression
+- health
+- life-expectancy
+- gradient-boosting
+- scikit-learn
+pipeline_tag: tabular-regression
+library_name: sklearn
+metrics:
+- r2
+- rmse
+model-index:
+- name: Life Expectancy Predictor
+  results:
+  - task:
+      type: tabular-regression
+      name: Tabular Regression
+    metrics:
+    - type: r2
+      value: 0.87
+      name: R² Score
+    - type: rmse
+      value: 4.0
+      name: RMSE (years)
+---
+# Life Expectancy Predictor
+A **Gradient Boosting Regressor** trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.
+## Model Description
+This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison.
+| Artifact | File | Description |
+|---|---|---|
+| Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (~470 KB) |
+| Baseline model | `linear_model.pkl` | LinearRegression (<1 KB) |
+| Feature scaler | `scaler.pkl` | StandardScaler for all features |
+| Categorical encoder | `preprocessor.pkl` | Dict of per-column `LabelEncoder` objects for the categorical inputs |
+## Intended Use
+- **Research & education:** understanding which health factors most affect life expectancy.
+- **Health-tech prototypes:** powering wellness apps or patient-facing dashboards.
+- **Academic exploration:** studying gradient boosting on tabular health data.
+**Not intended for:** clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.
+## How to Use
+### Install dependencies
+```bash
+pip install "scikit-learn>=1.5.0" joblib numpy
+```
+### Load and run inference
+```python
+import joblib
+import numpy as np
+# Load artifacts
+model = joblib.load("gradient_boosting_model.pkl")
+scaler = joblib.load("scaler.pkl")
+preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders
+# --- Prepare a sample input ---
+# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
+# Categorical features: Gender, Physical_Activity, Smoking_Status,
+#                       Alcohol_Consumption, Diet, Blood_Pressure
+def encode_and_predict(sample: dict) -> float:
+    """
+    sample keys (all required):
+        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
+        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
+        Diabetes, Hypertension, Heart_Disease, Asthma
+    """
+    categorical_cols = [
+        "Gender", "Physical_Activity", "Smoking_Status",
+        "Alcohol_Consumption", "Diet", "Blood_Pressure",
+    ]
+    sample = dict(sample)               # copy so the caller's dict is not mutated
+    for col in categorical_cols:
+        le = preprocessor[col]          # LabelEncoder for this column
+        sample[col] = le.transform([sample[col]])[0]
+    feature_order = [
+        "Gender", "Height", "Weight", "BMI",
+        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
+        "Diet", "Blood_Pressure", "Cholesterol",
+        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
+    ]
+    X = np.array([[sample[f] for f in feature_order]])
+    X_scaled = scaler.transform(X)
+    return float(model.predict(X_scaled)[0])
+sample = {
+    "Gender": "Male",
+    "Height": 175,
+    "Weight": 75,
+    "BMI": 24.5,
+    "Physical_Activity": "Medium",
+    "Smoking_Status": "Never",
+    "Alcohol_Consumption": "Moderate",
+    "Diet": "Good",
+    "Blood_Pressure": "Normal",
+    "Cholesterol": 190,
+    "Diabetes": 0,
+    "Hypertension": 0,
+    "Heart_Disease": 0,
+    "Asthma": 0,
+}
+prediction = encode_and_predict(sample)
+print(f"Predicted life expectancy: {prediction:.1f} years")
+```
+### Download from the Hub
+```python
+# Requires the huggingface_hub package (pip install huggingface_hub)
+from huggingface_hub import hf_hub_download
+import joblib
+model = joblib.load(
+    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
+                    filename="gradient_boosting_model.pkl")
+)
+scaler = joblib.load(
+    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
+                    filename="scaler.pkl")
+)
+preprocessor = joblib.load(
+    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
+                    filename="preprocessor.pkl")
+)
+```
+## Input Features
+| Feature | Type | Values / Range | Description |
+|---|---|---|---|
+| `Gender` | categorical | Male / Female | Biological sex |
+| `Height` | numerical | cm | Body height |
+| `Weight` | numerical | kg | Body weight |
+| `BMI` | numerical | continuous | Body Mass Index |
+| `Physical_Activity` | categorical | Low / Medium / High | Exercise level |
+| `Smoking_Status` | categorical | Never / Former / Current | Smoking history |
+| `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake |
+| `Diet` | categorical | Poor / Average / Good | Overall diet quality |
+| `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category |
+| `Cholesterol` | numerical | mg/dL | Total cholesterol level |
+| `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag |
+| `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag |
+| `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag |
+| `Asthma` | binary | 0 / 1 | Asthma diagnosis flag |
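The table above can double as a validation schema. The sketch below checks a sample dict against it before encoding. `ALLOWED_VALUES`, `NUMERIC_FEATURES`, `BINARY_FEATURES`, and `validate_sample` are hypothetical names that are not part of this repository, and the category spellings are copied from the table on the assumption that they match the labels the encoders in `preprocessor.pkl` were fitted on.

```python
# Hypothetical helper: category values copied from the Input Features table.
ALLOWED_VALUES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERIC_FEATURES = ["Height", "Weight", "BMI", "Cholesterol"]
BINARY_FEATURES = ["Diabetes", "Hypertension", "Heart_Disease", "Asthma"]


def validate_sample(sample: dict) -> list:
    """Return a list of problems; an empty list means the sample looks valid."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        if name not in sample:
            problems.append(f"missing feature: {name}")
        elif sample[name] not in allowed:
            problems.append(f"{name}: unexpected category {sample[name]!r}")
    for name in NUMERIC_FEATURES:
        if name not in sample:
            problems.append(f"missing feature: {name}")
        elif isinstance(sample[name], bool) or not isinstance(sample[name], (int, float)):
            problems.append(f"{name}: expected a number, got {sample[name]!r}")
    for name in BINARY_FEATURES:
        if name not in sample:
            problems.append(f"missing feature: {name}")
        elif sample[name] not in (0, 1):
            problems.append(f"{name}: expected 0 or 1, got {sample[name]!r}")
    return problems
```

Running a check like this before `LabelEncoder.transform` gives a readable message for a misspelled category, instead of the `ValueError` that scikit-learn raises for unseen labels.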
+## Output
+A single continuous float representing **predicted life expectancy in years**.
+## Training Details
+### Dataset
+- **Size:** ~10,002 records
+- **Split:** 68 % train / 10 % validation / 22 % test
+- **Target variable:** `Age` (life expectancy in years)
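The README does not say how the split was produced. As an illustration only, the sketch below cuts shuffled row indices into the stated 68 / 10 / 22 proportions. `split_indices` is a hypothetical helper, and the seed of 42 is an assumption borrowed from the model's `random_state`, not a documented setting.

```python
import random

N_RECORDS = 10_002                  # dataset size stated above
TRAIN_FRAC, VAL_FRAC = 0.68, 0.10   # the remainder (about 22 %) is the test split


def split_indices(n: int, seed: int = 42):
    """Shuffle row indices and cut them into train / validation / test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility
    n_train = round(n * TRAIN_FRAC)
    n_val = round(n * VAL_FRAC)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # everything left over
    return train, val, test


train_idx, val_idx, test_idx = split_indices(N_RECORDS)
print(len(train_idx), len(val_idx), len(test_idx))  # 6801 1000 2201
```

With scikit-learn, the same partition is commonly built with two successive calls to `train_test_split`.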
+### Preprocessing
+1. Fill missing categorical values with `"None"`.
+2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`).
+3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`).
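The three steps above can be sketched on a toy table as follows. This is not the author's training script: the rows are made-up values, only three of the 14 features are shown, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows (hypothetical values): two categorical columns, one numeric.
rows = [
    {"Smoking_Status": "Never", "Diet": "Good", "Cholesterol": 180.0},
    {"Smoking_Status": None, "Diet": "Poor", "Cholesterol": 240.0},
    {"Smoking_Status": "Current", "Diet": None, "Cholesterol": 210.0},
]
categorical_cols = ["Smoking_Status", "Diet"]
feature_order = ["Smoking_Status", "Diet", "Cholesterol"]

# Step 1: fill missing categorical values with the string "None".
for row in rows:
    for col in categorical_cols:
        if row[col] is None:
            row[col] = "None"

# Step 2: fit one LabelEncoder per categorical column, then encode in place.
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder().fit([row[col] for row in rows])
    for row in rows:
        row[col] = int(encoders[col].transform([row[col]])[0])

# Step 3: standardize every feature (encoded categoricals included).
X = np.array([[row[f] for f in feature_order] for row in rows], dtype=float)
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.round(2))
```

In a real pipeline the encoders and scaler would normally be fitted on the training split only, then saved (for example with `joblib.dump`) so that inference reuses exactly the same mappings.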
+### Primary Model — GradientBoostingRegressor
+```python
+from sklearn.ensemble import GradientBoostingRegressor
+model = GradientBoostingRegressor(
+    n_estimators=100,
+    learning_rate=0.1,
+    max_depth=5,
+    min_samples_split=5,
+    min_samples_leaf=2,
+    random_state=42,
+)
+```
+### Baseline Model — LinearRegression
+A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking.
+## Performance
+| Metric | Value |
+|---|---|
+| R² (test set) | 0.85 – 0.92 |
+| RMSE | 3 – 5 years |
+| Confidence score | 0.87 |
+*Metrics are on the held-out test split (~22 % of 10 k records).*
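For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of their standard definitions. The `y_true` and `y_pred` values are made-up toy numbers, not predictions from this model, and the local `rmse` and `r2_score` functions are illustrative stand-ins for the equivalents in `sklearn.metrics`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (years)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [70.0, 80.0, 90.0]
y_pred = [72.0, 78.0, 90.0]

print(f"R2   = {r2_score(y_true, y_pred):.2f}")   # R2   = 0.96
print(f"RMSE = {rmse(y_true, y_pred):.2f} years")
```

An RMSE of about 4 years means a typical prediction is off by roughly that much, with large errors weighted more heavily than small ones.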
+## Limitations
+- The model is trained on a **synthetic / illustrative dataset**; real-world generalization is not guaranteed.
+- It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
+- The integer codes that `LabelEncoder` assigns depend on the set of labels seen at fit time, so always use the supplied `preprocessor.pkl` rather than re-encoding independently.
+- Predictions for feature combinations far outside the training distribution may be unreliable.
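The encoding caveat in the list above can be seen in a short experiment. The snippet uses scikit-learn's `LabelEncoder` on the `Physical_Activity` levels from the feature table. The second encoder, fitted on an incomplete label set, is a hypothetical scenario for illustration.

```python
from sklearn.preprocessing import LabelEncoder

levels = ["Low", "Medium", "High"]

# LabelEncoder keeps its classes in sorted order, so the integer codes
# follow alphabetical order rather than the Low < Medium < High ranking.
encoder = LabelEncoder().fit(levels)
codes = {str(label): int(code)
         for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(codes)  # {'High': 0, 'Low': 1, 'Medium': 2}

# An encoder refitted on data that lacks a category assigns different codes.
refit = LabelEncoder().fit(["Low", "Medium"])
low_original = int(encoder.transform(["Low"])[0])
low_refit = int(refit.transform(["Low"])[0])
print(low_original, low_refit)  # 1 0
```

Because the model was trained on the codes produced by the saved encoders, a refitted encoder can silently feed it a different category than the one intended.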
+## Ethical Considerations
+- **Not a medical device.** Do not use predictions to make clinical, insurance, or policy decisions.
+- **Fairness:** The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
+- **Privacy:** No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).
+## Citation
+If you use this model in your research or application, please cite:
+```bibtex
+@misc{lebiraja2024lifeexpectancy,
+  author       = {lebiraja},
+  title        = {Life Expectancy Predictor},
+  year         = {2024},
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
+}
+```
+## License
+MIT

gradient_boosting_model.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:08121de88eb105e28708379fd78d2d22b2e1053b10147f439261c2ccd9d2304a
+size 480904

linear_model.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fcbe475d91b16513ee1362ed17f85ca4b407dee32dd4f7806133756c0107e610
+size 793

preprocessor.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8390e184303c5f9fbdb2b7dbf461b5f8a971bd4825a53420b7972f5a6512617f
+size 1881

scaler.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a1c3451a1910bcac7f9da1ef0fc91408af99131eb489951632aaff22f936a2b8
+size 1351