| language: | |
| - en | |
| license: mit | |
| tags: | |
| - sklearn | |
| - tabular-regression | |
| - health | |
| - life-expectancy | |
| - gradient-boosting | |
| - scikit-learn | |
| pipeline_tag: tabular-regression | |
| library_name: sklearn | |
| metrics: | |
| - r2 | |
| - rmse | |
| model-index: | |
| - name: Life Expectancy Predictor | |
|   results: | |
|   - task: | |
|       type: tabular-regression | |
|       name: Tabular Regression | |
|     metrics: | |
|     - type: r2 | |
|       value: 0.87 | |
|       name: R² Score | |
|     - type: rmse | |
|       value: 4.0 | |
|       name: RMSE (years) | |
| # Life Expectancy Predictor | |
| A **Gradient Boosting Regressor** trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service. | |
| ## Model Description | |
| This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison. | |
| | Artifact | File | Description | | |
| |---|---|---| | |
| | Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (472 KB) | | |
| | Baseline model | `linear_model.pkl` | LinearRegression (4 KB) | | |
| | Feature scaler | `scaler.pkl` | StandardScaler for all features | | |
| | Categorical encoder | `preprocessor.pkl` | LabelEncoder mapping for categorical inputs | | |
| ## Intended Use | |
| - **Research & education:** understanding which health factors most affect life expectancy. | |
| - **Health-tech prototypes:** powering wellness apps or patient-facing dashboards. | |
| - **Academic exploration:** studying gradient boosting on tabular health data. | |
| **Not intended for:** clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice. | |
| ## How to Use | |
| ### Install dependencies | |
| ```bash | |
| pip install "scikit-learn>=1.5.0" joblib numpy huggingface_hub | |
| ``` | |
| ### Load and run inference | |
| ```python | |
| import joblib | |
| import numpy as np | |
| # Load artifacts | |
| model = joblib.load("gradient_boosting_model.pkl") | |
| scaler = joblib.load("scaler.pkl") | |
| preprocessor = joblib.load("preprocessor.pkl") # dict of LabelEncoders | |
| # --- Prepare a sample input --- | |
| # Categorical columns and their LabelEncoders are stored in preprocessor.pkl | |
| # Categorical features: Gender, Physical_Activity, Smoking_Status, | |
| # Alcohol_Consumption, Diet, Blood_Pressure | |
| def encode_and_predict(sample: dict) -> float: | |
|     """ | |
|     sample keys (all required): | |
|     Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status, | |
|     Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol, | |
|     Diabetes, Hypertension, Heart_Disease, Asthma | |
|     """ | |
|     categorical_cols = [ | |
|         "Gender", "Physical_Activity", "Smoking_Status", | |
|         "Alcohol_Consumption", "Diet", "Blood_Pressure", | |
|     ] | |
|     # Work on a copy so the caller's dict is not overwritten with encoded values | |
|     encoded = dict(sample) | |
|     for col in categorical_cols: | |
|         le = preprocessor[col]  # LabelEncoder for this column | |
|         encoded[col] = le.transform([encoded[col]])[0] | |
|     feature_order = [ | |
|         "Gender", "Height", "Weight", "BMI", | |
|         "Physical_Activity", "Smoking_Status", "Alcohol_Consumption", | |
|         "Diet", "Blood_Pressure", "Cholesterol", | |
|         "Diabetes", "Hypertension", "Heart_Disease", "Asthma", | |
|     ] | |
|     # Build a single-row array with the columns in feature_order | |
|     X = np.array([[encoded[f] for f in feature_order]]) | |
|     X_scaled = scaler.transform(X) | |
|     return float(model.predict(X_scaled)[0]) | |
| sample = { | |
|     "Gender": "Male", | |
|     "Height": 175, | |
|     "Weight": 75, | |
|     "BMI": 24.5, | |
|     "Physical_Activity": "Medium", | |
|     "Smoking_Status": "Never", | |
|     "Alcohol_Consumption": "Moderate", | |
|     "Diet": "Good", | |
|     "Blood_Pressure": "Normal", | |
|     "Cholesterol": 190, | |
|     "Diabetes": 0, | |
|     "Hypertension": 0, | |
|     "Heart_Disease": 0, | |
|     "Asthma": 0, | |
| } | |
| prediction = encode_and_predict(sample) | |
| print(f"Predicted life expectancy: {prediction:.1f} years") | |
| ``` | |
| ### Download from the Hub | |
| ```python | |
| from huggingface_hub import hf_hub_download | |
| import joblib | |
| model = joblib.load( | |
|     hf_hub_download(repo_id="lebiraja/life-expectancy-predictor", | |
|                     filename="gradient_boosting_model.pkl") | |
| ) | |
| scaler = joblib.load( | |
|     hf_hub_download(repo_id="lebiraja/life-expectancy-predictor", | |
|                     filename="scaler.pkl") | |
| ) | |
| preprocessor = joblib.load( | |
|     hf_hub_download(repo_id="lebiraja/life-expectancy-predictor", | |
|                     filename="preprocessor.pkl") | |
| ) | |
| ``` | |
| ## Input Features | |
| | Feature | Type | Values / Range | Description | | |
| |---|---|---|---| | |
| | `Gender` | categorical | Male / Female | Biological sex | | |
| | `Height` | numerical | cm | Body height | | |
| | `Weight` | numerical | kg | Body weight | | |
| | `BMI` | numerical | continuous | Body Mass Index | | |
| | `Physical_Activity` | categorical | Low / Medium / High | Exercise level | | |
| | `Smoking_Status` | categorical | Never / Former / Current | Smoking history | | |
| | `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake | | |
| | `Diet` | categorical | Poor / Average / Good | Overall diet quality | | |
| | `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category | | |
| | `Cholesterol` | numerical | mg/dL | Total cholesterol level | | |
| | `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag | | |
| | `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag | | |
| | `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag | | |
| | `Asthma` | binary | 0 / 1 | Asthma diagnosis flag | | |
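The table above can be turned into a small input check that runs before a sample reaches the encoders. The sketch below is illustrative only: `validate_sample`, `ALLOWED_VALUES`, `NUMERICAL`, and `BINARY` are hypothetical names that are not part of the repository, and the "positive number" rule for numerical features is an assumption, since the table lists units but no ranges.

```python
# Allowed categorical values, copied from the Input Features table.
ALLOWED_VALUES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERICAL = ("Height", "Weight", "BMI", "Cholesterol")
BINARY = ("Diabetes", "Hypertension", "Heart_Disease", "Asthma")


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in `sample` (empty if none)."""
    problems = []
    expected = set(ALLOWED_VALUES) | set(NUMERICAL) | set(BINARY)
    for key in sorted(expected - set(sample)):
        problems.append(f"missing feature: {key}")
    for key in sorted(set(sample) - expected):
        problems.append(f"unexpected feature: {key}")
    for key, allowed in ALLOWED_VALUES.items():
        if key in sample and sample[key] not in allowed:
            problems.append(f"{key}: {sample[key]!r} not in {sorted(allowed)}")
    for key in NUMERICAL:
        if key not in sample:
            continue
        value = sample[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # Assumption: numerical features must be strictly positive.
        if not is_number or value <= 0:
            problems.append(f"{key}: expected a positive number, got {value!r}")
    for key in BINARY:
        if key in sample and sample[key] not in (0, 1):
            problems.append(f"{key}: expected 0 or 1, got {sample[key]!r}")
    return problems
```

Calling `validate_sample(sample)` before encoding reports an unknown category such as `"Diet": "Excellent"` as a readable message, rather than leaving it to fail later inside `LabelEncoder.transform`.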
| ## Output | |
| A single continuous float representing **predicted life expectancy in years**. | |
| ## Training Details | |
| ### Dataset | |
| - **Size:** ~10,002 records | |
| - **Split:** 68 % train / 10 % validation / 22 % test | |
| - **Target variable:** `Age` (life expectancy in years) | |
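The card states only the split proportions, not how the split was produced. As a rough illustration, the sketch below shuffles row indices and slices them into 68 % / 10 % / 22 % partitions. The helper name `split_indices`, the seed, and the shuffling approach are all assumptions.

```python
import random


def split_indices(n_records, val_frac=0.10, test_frac=0.22, seed=42):
    """Shuffle row indices and slice them into train / validation / test."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_frac)
    n_val = round(n_records * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]  # everything left over
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10_002)
print(len(train_idx), len(val_idx), len(test_idx))  # 6802 1000 2200
```

With 10,002 records, these fractions give 6,802 training rows, 1,000 validation rows, and 2,200 test rows.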
| ### Preprocessing | |
| 1. Fill missing categorical values with `"None"`. | |
| 2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`). | |
| 3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`). | |
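The three steps above can be sketched on a toy table. This is not the original training script: the rows are made up, only four of the 14 columns are shown, and the card does not say which split the encoders and scaler were fitted on.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows with two categorical and two numerical columns (illustrative only).
rows = [
    {"Gender": "Male", "Diet": "Good", "Height": 175.0, "Cholesterol": 190.0},
    {"Gender": "Female", "Diet": None, "Height": 162.0, "Cholesterol": 210.0},
    {"Gender": "Female", "Diet": "Poor", "Height": 168.0, "Cholesterol": 230.0},
]
categorical_cols = ["Gender", "Diet"]
feature_order = ["Gender", "Diet", "Height", "Cholesterol"]

# Step 1: fill missing categorical values with the string "None".
for row in rows:
    for col in categorical_cols:
        if row[col] is None:
            row[col] = "None"

# Step 2: fit one LabelEncoder per categorical column and keep them in a dict,
# mirroring the "dict of LabelEncoders" described for preprocessor.pkl.
encoders = {}
for col in categorical_cols:
    values = [row[col] for row in rows]
    encoders[col] = LabelEncoder().fit(values)
    for row, code in zip(rows, encoders[col].transform(values)):
        row[col] = int(code)

# Step 3: scale every feature (encoded categoricals included).
X = np.array([[row[f] for f in feature_order] for row in rows], dtype=float)
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
```

After scaling, each column of `X_scaled` has zero mean and unit variance on the data the scaler was fitted on. At inference time the same fitted `encoders` and `scaler` must be reused.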
| ### Primary Model — GradientBoostingRegressor | |
| ```python | |
| from sklearn.ensemble import GradientBoostingRegressor | |
| model = GradientBoostingRegressor( | |
|     n_estimators=100, | |
|     learning_rate=0.1, | |
|     max_depth=5, | |
|     min_samples_split=5, | |
|     min_samples_leaf=2, | |
|     random_state=42, | |
| ) | |
| ``` | |
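To show how this configuration is fitted and inspected, the sketch below trains it on randomly generated stand-in data. The data, the target formula, and the variable names are assumptions made for illustration; they say nothing about the real dataset or the published model's behavior.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for 14 scaled features; NOT the real training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 14))
y_train = 70 + 3 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(size=200)

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
model.fit(X_train, y_train)

# Impurity-based importances, one value per input column.
importances = model.feature_importances_
top3 = np.argsort(importances)[::-1][:3]  # column indices, most important first
predictions = model.predict(X_train[:5])
```

On the real artifacts, the same `feature_importances_` attribute, read in the documented feature order, gives a first look at which inputs the model relies on most.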
| ### Baseline Model — LinearRegression | |
| A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking. | |
| ## Performance | |
| | Metric | Value | | |
| |---|---| | |
| | R² (test set) | 0.85 – 0.92 | | |
| | RMSE | 3 – 5 years | | |
| | Confidence score | 0.87 | | |
| *Metrics are on the held-out test split (~22 % of 10 k records).* | |
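For reference, the two metrics in the table are defined as R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean((y − ŷ)²)). The sketch below computes both in plain Python on made-up numbers; the values are illustrative and are not the model's results.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target (years)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up targets and predictions, in years.
y_true = [70.0, 80.0, 90.0]
y_pred = [72.0, 78.0, 91.0]
print(f"R2 = {r2(y_true, y_pred):.3f}, RMSE = {rmse(y_true, y_pred):.2f} years")
# R2 = 0.955, RMSE = 1.73 years
```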
| ## Limitations | |
| - The model is trained on a **synthetic / illustrative dataset**; real-world generalization is not guaranteed. | |
| - It does not account for socioeconomic factors, genetics, geography, or environmental exposures. | |
| - Categorical label encodings are order-sensitive — always use the supplied `preprocessor.pkl` rather than re-encoding independently. | |
| - Predictions for feature combinations far outside the training distribution may be unreliable. | |
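The encoding caveat can be seen with a freshly fitted `LabelEncoder`, which numbers classes in sorted order rather than in any semantic order. The sketch below is a stand-in and does not load `preprocessor.pkl`; it assumes the supplied encoders behave like standard scikit-learn `LabelEncoder` objects.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the documented Physical_Activity categories.
le = LabelEncoder().fit(["Low", "Medium", "High"])

# Classes are stored sorted, so codes do not follow Low < Medium < High.
print(list(le.classes_))  # ['High', 'Low', 'Medium']
codes = le.transform(["Low", "Medium", "High"])
print(codes.tolist())  # [1, 2, 0]

# Labels the encoder has not seen are rejected rather than guessed.
try:
    le.transform(["Very High"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)  # True
```

A hand-written mapping such as `{"Low": 0, "Medium": 1, "High": 2}` would therefore feed the model different codes from the ones it was trained on.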
| ## Ethical Considerations | |
| - **Not a medical device.** Do not use predictions to make clinical, insurance, or policy decisions. | |
| - **Fairness:** The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited. | |
| - **Privacy:** No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.). | |
| ## Citation | |
| If you use this model in your research or application, please cite: | |
| ```bibtex | |
| @misc{lebiraja2024lifeexpectancy, | |
| author = {lebiraja}, | |
| title = {Life Expectancy Predictor}, | |
| year = {2024}, | |
| publisher = {Hugging Face}, | |
| howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}}, | |
| } | |
| ``` | |
| ## License | |
| MIT | |