---
language:
- en
license: mit
tags:
- sklearn
- tabular-regression
- health
- life-expectancy
- gradient-boosting
- scikit-learn
pipeline_tag: tabular-regression
library_name: sklearn
metrics:
- r2
- rmse
model-index:
- name: Life Expectancy Predictor
  results:
  - task:
      type: tabular-regression
      name: Tabular Regression
    metrics:
    - type: r2
      value: 0.87
      name: R² Score
    - type: rmse
      value: 4.0
      name: RMSE (years)
---

# Life Expectancy Predictor

A **Gradient Boosting Regressor** trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.

## Model Description

This model takes a snapshot of an individual's health profile (physical attributes, lifestyle habits, and medical history) and returns a predicted life expectancy in years.

The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison.

| Artifact | File | Description |
|---|---|---|
| Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (472 KB) |
| Baseline model | `linear_model.pkl` | LinearRegression (4 KB) |
| Feature scaler | `scaler.pkl` | StandardScaler for all features |
| Categorical encoder | `preprocessor.pkl` | LabelEncoder mapping for categorical inputs |

## Intended Use

- **Research & education:** understanding which health factors most affect life expectancy.
- **Health-tech prototypes:** powering wellness apps or patient-facing dashboards.
- **Academic exploration:** studying gradient boosting on tabular health data.

**Not intended for:** clinical diagnosis, medical decision-making, or any high-stakes healthcare decisions. Predictions are statistical estimates, not medical advice.

## How to Use

### Install dependencies

```bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "scikit-learn>=1.5.0" joblib numpy
```

### Load and run inference

```python
import joblib
import numpy as np

# Load artifacts
model = joblib.load("gradient_boosting_model.pkl")
scaler = joblib.load("scaler.pkl")
preprocessor = joblib.load("preprocessor.pkl")  # dict of LabelEncoders

# --- Prepare a sample input ---
# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
# Categorical features: Gender, Physical_Activity, Smoking_Status,
# Alcohol_Consumption, Diet, Blood_Pressure

def encode_and_predict(sample: dict) -> float:
    """
    sample keys (all required):
        Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
        Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
        Diabetes, Hypertension, Heart_Disease, Asthma
    """
    # Work on a copy so the caller's dict is not overwritten with encoded values
    sample = dict(sample)

    categorical_cols = [
        "Gender", "Physical_Activity", "Smoking_Status",
        "Alcohol_Consumption", "Diet", "Blood_Pressure",
    ]
    for col in categorical_cols:
        le = preprocessor[col]  # LabelEncoder for this column
        sample[col] = le.transform([sample[col]])[0]

    # Column order must match the order used when the scaler and model were fit
    feature_order = [
        "Gender", "Height", "Weight", "BMI", "Physical_Activity",
        "Smoking_Status", "Alcohol_Consumption", "Diet", "Blood_Pressure",
        "Cholesterol", "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
    ]
    X = np.array([[sample[f] for f in feature_order]])
    X_scaled = scaler.transform(X)
    return float(model.predict(X_scaled)[0])


sample = {
    "Gender": "Male",
    "Height": 175,
    "Weight": 75,
    "BMI": 24.5,
    "Physical_Activity": "Medium",
    "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate",
    "Diet": "Good",
    "Blood_Pressure": "Normal",
    "Cholesterol": 190,
    "Diabetes": 0,
    "Hypertension": 0,
    "Heart_Disease": 0,
    "Asthma": 0,
}

prediction = encode_and_predict(sample)
print(f"Predicted life expectancy: {prediction:.1f} years")
```

### Download from the Hub

```python
# Requires the huggingface_hub package (pip install huggingface_hub)
from huggingface_hub import hf_hub_download
import joblib

model = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="gradient_boosting_model.pkl")
)
scaler = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="scaler.pkl")
)
preprocessor = joblib.load(
    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
                    filename="preprocessor.pkl")
)
```

## Input Features

| Feature | Type | Values / Units | Description |
|---|---|---|---|
| `Gender` | categorical | Male / Female | Biological sex |
| `Height` | numerical | cm | Body height |
| `Weight` | numerical | kg | Body weight |
| `BMI` | numerical | continuous (kg/m²) | Body Mass Index |
| `Physical_Activity` | categorical | Low / Medium / High | Exercise level |
| `Smoking_Status` | categorical | Never / Former / Current | Smoking history |
| `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake |
| `Diet` | categorical | Poor / Average / Good | Overall diet quality |
| `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category |
| `Cholesterol` | numerical | mg/dL | Total cholesterol level |
| `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag |
| `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag |
| `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag |
| `Asthma` | binary | 0 / 1 | Asthma diagnosis flag |

## Output

A single continuous float representing **predicted life expectancy in years**.

## Training Details

### Dataset

- **Size:** ~10,002 records
- **Split:** 68 % train / 10 % validation / 22 % test
- **Target variable:** `Age` (life expectancy in years)

### Preprocessing

1. Fill missing categorical values with `"None"`.
2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`).
3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`).

### Primary Model: GradientBoostingRegressor

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
```

### Baseline Model: LinearRegression

A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking.

## Performance

| Metric | Value |
|---|---|
| R² (test set) | 0.85 – 0.92 |
| RMSE | 3 – 5 years |
| Confidence score | 0.87 |

*Metrics are on the held-out test split (~22 % of 10 k records).*

## Limitations

- The model is trained on a **synthetic / illustrative dataset**; real-world generalization is not guaranteed.
- It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
- Categorical label encodings are order-sensitive, so always use the supplied `preprocessor.pkl` rather than re-encoding independently.
- Predictions for feature combinations far outside the training distribution may be unreliable.

## Ethical Considerations

- **Not a medical device.** Do not use predictions to make clinical, insurance, or policy decisions.
- **Fairness:** The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
- **Privacy:** No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{lebiraja2024lifeexpectancy,
  author       = {lebiraja},
  title        = {Life Expectancy Predictor},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
}
```

## License

MIT