	---
	language:
	- en
	license: mit
	tags:
	- sklearn
	- tabular-regression
	- health
	- life-expectancy
	- gradient-boosting
	- scikit-learn
	pipeline_tag: tabular-regression
	library_name: sklearn
	metrics:
	- r2
	- rmse
	model-index:
	- name: Life Expectancy Predictor
	  results:
	  - task:
	      type: tabular-regression
	      name: Tabular Regression
	    metrics:
	    - type: r2
	      value: 0.87
	      name: R² Score
	    - type: rmse
	      value: 4.0
	      name: RMSE (years)
	---

	# Life Expectancy Predictor

	A Gradient Boosting Regressor trained on 14 health and lifestyle features to predict a person's life expectancy in years. Built with scikit-learn and wrapped in a production-ready FastAPI service.

	## Model Description

	This model takes a snapshot of an individual's health profile — including physical attributes, lifestyle habits, and medical history — and returns a predicted life expectancy in years. The primary model is a `GradientBoostingRegressor` achieving an R² of ~0.87 on the held-out test set. A baseline `LinearRegression` model is also included for comparison.

	| Artifact | File | Description |
	|---|---|---|
	| Primary model | `gradient_boosting_model.pkl` | GradientBoostingRegressor (472 KB) |
	| Baseline model | `linear_model.pkl` | LinearRegression (4 KB) |
	| Feature scaler | `scaler.pkl` | StandardScaler for all features |
	| Categorical encoder | `preprocessor.pkl` | LabelEncoder mapping for categorical inputs |

	## Intended Use

	- Research & education: understanding which health factors most affect life expectancy.
	- Health-tech prototypes: powering wellness apps or patient-facing dashboards.
	- Academic exploration: studying gradient boosting on tabular health data.

	Not intended for: clinical diagnosis, medical decision-making, or any other high-stakes healthcare decision. Predictions are statistical estimates, not medical advice.

	## How to Use

	### Install dependencies

	```bash
	# Quote the version specifier so the shell does not treat ">" as a redirect
	pip install "scikit-learn>=1.5.0" joblib numpy
	```

	### Load and run inference

	```python
	import joblib
	import numpy as np

	# Load artifacts
	model = joblib.load("gradient_boosting_model.pkl")
	scaler = joblib.load("scaler.pkl")
	preprocessor = joblib.load("preprocessor.pkl") # dict of LabelEncoders

	# --- Prepare a sample input ---
	# Categorical columns and their LabelEncoders are stored in preprocessor.pkl
	# Categorical features: Gender, Physical_Activity, Smoking_Status,
	# Alcohol_Consumption, Diet, Blood_Pressure

	def encode_and_predict(sample: dict) -> float:
	    """
	    Predict life expectancy in years for one individual.
	    sample keys (all required):
	    Gender, Height, Weight, BMI, Physical_Activity, Smoking_Status,
	    Alcohol_Consumption, Diet, Blood_Pressure, Cholesterol,
	    Diabetes, Hypertension, Heart_Disease, Asthma
	    """
	    categorical_cols = [
	        "Gender", "Physical_Activity", "Smoking_Status",
	        "Alcohol_Consumption", "Diet", "Blood_Pressure",
	    ]
	    # Work on a copy so the caller's dict is not mutated
	    encoded = dict(sample)
	    for col in categorical_cols:
	        le = preprocessor[col]  # LabelEncoder for this column
	        encoded[col] = le.transform([encoded[col]])[0]

	    # Column order must match the order used when the scaler was fitted
	    feature_order = [
	        "Gender", "Height", "Weight", "BMI",
	        "Physical_Activity", "Smoking_Status", "Alcohol_Consumption",
	        "Diet", "Blood_Pressure", "Cholesterol",
	        "Diabetes", "Hypertension", "Heart_Disease", "Asthma",
	    ]
	    X = np.array([[encoded[f] for f in feature_order]], dtype=float)
	    X_scaled = scaler.transform(X)
	    return float(model.predict(X_scaled)[0])


	sample = {
	    "Gender": "Male",
	    "Height": 175,
	    "Weight": 75,
	    "BMI": 24.5,
	    "Physical_Activity": "Medium",
	    "Smoking_Status": "Never",
	    "Alcohol_Consumption": "Moderate",
	    "Diet": "Good",
	    "Blood_Pressure": "Normal",
	    "Cholesterol": 190,
	    "Diabetes": 0,
	    "Hypertension": 0,
	    "Heart_Disease": 0,
	    "Asthma": 0,
	}

	prediction = encode_and_predict(sample)
	print(f"Predicted life expectancy: {prediction:.1f} years")
	```

	### Download from the Hub

	```python
	from huggingface_hub import hf_hub_download
	import joblib

	model = joblib.load(
	    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
	                    filename="gradient_boosting_model.pkl")
	)
	scaler = joblib.load(
	    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
	                    filename="scaler.pkl")
	)
	preprocessor = joblib.load(
	    hf_hub_download(repo_id="lebiraja/life-expectancy-predictor",
	                    filename="preprocessor.pkl")
	)
	```

	## Input Features

	| Feature | Type | Values / Range | Description |
	|---|---|---|---|
	| `Gender` | categorical | Male / Female | Biological sex |
	| `Height` | numerical | cm | Body height |
	| `Weight` | numerical | kg | Body weight |
	| `BMI` | numerical | continuous | Body Mass Index |
	| `Physical_Activity` | categorical | Low / Medium / High | Exercise level |
	| `Smoking_Status` | categorical | Never / Former / Current | Smoking history |
	| `Alcohol_Consumption` | categorical | None / Moderate / Heavy | Alcohol intake |
	| `Diet` | categorical | Poor / Average / Good | Overall diet quality |
	| `Blood_Pressure` | categorical | Low / Normal / High | Blood pressure category |
	| `Cholesterol` | numerical | mg/dL | Total cholesterol level |
	| `Diabetes` | binary | 0 / 1 | Diabetes diagnosis flag |
	| `Hypertension` | binary | 0 / 1 | Hypertension diagnosis flag |
	| `Heart_Disease` | binary | 0 / 1 | Heart disease diagnosis flag |
	| `Asthma` | binary | 0 / 1 | Asthma diagnosis flag |
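
Because an unseen category makes `LabelEncoder.transform` raise an error, it can help to check inputs before encoding. The following is a minimal sketch, not part of the released artifacts: `validate_sample`, `ALLOWED_CATEGORIES`, and the other constants are hypothetical names, and the allowed values are copied from the table above. The classes actually stored in `preprocessor.pkl` (available as `le.classes_` on each encoder) remain the authoritative source.

```python
# Allowed values copied from the feature table; names here are illustrative only.
ALLOWED_CATEGORIES = {
    "Gender": {"Male", "Female"},
    "Physical_Activity": {"Low", "Medium", "High"},
    "Smoking_Status": {"Never", "Former", "Current"},
    "Alcohol_Consumption": {"None", "Moderate", "Heavy"},
    "Diet": {"Poor", "Average", "Good"},
    "Blood_Pressure": {"Low", "Normal", "High"},
}
NUMERIC_FEATURES = ["Height", "Weight", "BMI", "Cholesterol"]
BINARY_FEATURES = ["Diabetes", "Hypertension", "Heart_Disease", "Asthma"]


def validate_sample(sample: dict) -> list:
    """Return a list of problems; an empty list means the sample looks valid."""
    problems = []
    expected = set(ALLOWED_CATEGORIES) | set(NUMERIC_FEATURES) | set(BINARY_FEATURES)
    missing = sorted(expected - set(sample))
    if missing:
        problems.append(f"missing features: {missing}")
    for name, allowed in ALLOWED_CATEGORIES.items():
        if name in sample and sample[name] not in allowed:
            problems.append(f"{name}: {sample[name]!r} not in {sorted(allowed)}")
    for name in NUMERIC_FEATURES:
        if name in sample:
            value = sample[name]
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{name}: expected a number, got {value!r}")
    for name in BINARY_FEATURES:
        if name in sample and sample[name] not in (0, 1):
            problems.append(f"{name}: expected 0 or 1, got {sample[name]!r}")
    return problems


good_sample = {
    "Gender": "Male", "Height": 175, "Weight": 75, "BMI": 24.5,
    "Physical_Activity": "Medium", "Smoking_Status": "Never",
    "Alcohol_Consumption": "Moderate", "Diet": "Good",
    "Blood_Pressure": "Normal", "Cholesterol": 190,
    "Diabetes": 0, "Hypertension": 0, "Heart_Disease": 0, "Asthma": 0,
}
bad_sample = dict(good_sample, Diet="Excellent", Asthma=2)

print(validate_sample(good_sample))  # no problems reported
print(validate_sample(bad_sample))   # one problem for Diet, one for Asthma
```

Returning a list of messages rather than raising on the first failure lets a calling service report every invalid field in one response.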

	## Output

	A single continuous float representing predicted life expectancy in years.

	## Training Details

	### Dataset
	- Size: ~10,002 records
	- Split: 68 % train / 10 % validation / 22 % test
	- Target variable: `Age` (life expectancy in years)
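
A three-way split like the one listed above can be produced with two calls to `train_test_split`. This is a sketch under assumptions: the random arrays stand in for the real data, and the card does not say how the original split was made (seed, shuffling, or stratification), so the details below are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 14 features (not the real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 14))
y = rng.normal(loc=70.0, scale=10.0, size=1000)

# Step 1: hold out 22 % of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.22, random_state=42
)

# Step 2: validation should be 10 % of the full data,
# which is 0.10 / 0.78 of the rows that remain after step 1
val_fraction = 0.10 / (1.0 - 0.22)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=val_fraction, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The second `test_size` has to be rescaled because it is interpreted relative to the rows passed in, not relative to the original dataset.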

	### Preprocessing
	1. Fill missing categorical values with `"None"`.
	2. `LabelEncoder` applied per categorical column (encoders saved in `preprocessor.pkl`).
	3. `StandardScaler` applied to all 14 features after encoding (saved in `scaler.pkl`).
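
The three steps above can be sketched as follows. The tiny DataFrame is made up for illustration and uses only four of the 14 columns; the variable names (`encoders`, `categorical_cols`) are assumptions, and pandas is an extra dependency not listed in the install command. The original training script is not part of this card, so treat this as one plausible reading of the steps rather than the exact code used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows for illustration only
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Alcohol_Consumption": ["Moderate", None, "Heavy", None],
    "BMI": [24.5, 31.2, 22.0, 27.8],
    "Diabetes": [0, 1, 0, 0],
})
categorical_cols = ["Gender", "Alcohol_Consumption"]

# Step 1: fill missing categorical values with the string "None"
df[categorical_cols] = df[categorical_cols].fillna("None")

# Step 2: fit one LabelEncoder per categorical column and keep them in a dict
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Step 3: standardize every feature after encoding
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.to_numpy(dtype=float))

# Persisting the fitted objects would then look like:
#   joblib.dump(encoders, "preprocessor.pkl")
#   joblib.dump(scaler, "scaler.pkl")
print(list(encoders["Alcohol_Consumption"].classes_))
print(X_scaled.shape)
```

In a real training run the encoders and scaler would normally be fitted on the training split only and then applied to the validation and test splits, to avoid leaking statistics from held-out data.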

	### Primary Model — GradientBoostingRegressor

	```python
	from sklearn.ensemble import GradientBoostingRegressor

	model = GradientBoostingRegressor(
	    n_estimators=100,
	    learning_rate=0.1,
	    max_depth=5,
	    min_samples_split=5,
	    min_samples_leaf=2,
	    random_state=42,
	)
	```
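
To show how a model with these hyperparameters is fitted and inspected, here is a small sketch on synthetic data from `make_regression`. The data, the 22 % hold-out, and the variable names are assumptions for illustration; the numbers it prints say nothing about the released model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem with 14 features (not the real dataset)
X, y = make_regression(
    n_samples=500, n_features=14, n_informative=6, noise=5.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.22, random_state=42
)

# Same hyperparameters as listed in this card
gbr = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
)
gbr.fit(X_train, y_train)

# Impurity-based importances; they are normalized to sum to 1
ranked = np.argsort(gbr.feature_importances_)[::-1]
for idx in ranked[:3]:
    print(f"feature {idx}: importance {gbr.feature_importances_[idx]:.3f}")

print(f"R^2 on the synthetic hold-out: {gbr.score(X_test, y_test):.3f}")
```

With the real artifacts, the same `feature_importances_` attribute on the loaded model can be paired with the 14 feature names to see which inputs the model relies on most.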

	### Baseline Model — LinearRegression

	A standard `LinearRegression` is also provided (`linear_model.pkl`) for interpretability and benchmarking.

	## Performance

	| Metric | Value |
	|---|---|
	| R² (test set) | 0.85 – 0.92 |
	| RMSE | 3 – 5 years |
	| Confidence score | 0.87 |

	Metrics are on the held-out test split (~22 % of 10 k records).
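
For readers who want to reproduce these two metrics on their own predictions, the sketch below computes R² and RMSE with scikit-learn and checks them against the textbook formulas. The five target and prediction values are made up for illustration and are unrelated to the figures in the table.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values in years, for illustration only
y_true = np.array([72.0, 65.0, 80.0, 58.0, 77.0])
y_pred = np.array([70.0, 68.0, 79.0, 61.0, 74.0])

# Library versions
r2 = r2_score(y_true, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Same quantities from their definitions
ss_res = float(np.sum((y_true - y_pred) ** 2))         # residual sum of squares
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
rmse_manual = (ss_res / len(y_true)) ** 0.5

print(f"R^2  = {r2:.4f}")         # 0.9004
print(f"RMSE = {rmse:.4f} years")  # 2.5298
```

RMSE is expressed in the same unit as the target (years), which makes it easier to interpret than R² when judging whether the error is acceptable for a given use.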

	## Limitations

	- The model is trained on a synthetic / illustrative dataset; real-world generalization is not guaranteed.
	- It does not account for socioeconomic factors, genetics, geography, or environmental exposures.
	- Categorical label encodings are order-sensitive — always use the supplied `preprocessor.pkl` rather than re-encoding independently.
	- Predictions for feature combinations far outside the training distribution may be unreliable.
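
The point about encodings is easy to see in a short example. `LabelEncoder` assigns integer codes in sorted (alphabetical) order of the class labels, not in any semantic order, so the codes for Low / Medium / High are not 0 / 1 / 2. The snippet fits a fresh encoder purely for demonstration; the classes stored in `preprocessor.pkl` may differ from this example.

```python
from sklearn.preprocessing import LabelEncoder

# Fit a throwaway encoder on the three activity levels
le = LabelEncoder().fit(["Low", "Medium", "High"])

print(list(le.classes_))  # ['High', 'Low', 'Medium']
codes = le.transform(["Low", "Medium", "High"])
print(codes.tolist())     # [1, 2, 0]

# A label the encoder has never seen raises ValueError
try:
    le.transform(["Very High"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)          # False
```

A hand-written mapping such as `{"Low": 0, "Medium": 1, "High": 2}` would therefore feed the model different integers than it saw during training, which is why the saved encoders should be reused.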

	## Ethical Considerations

	- Not a medical device. Do not use predictions to make clinical, insurance, or policy decisions.
	- Fairness: The model may reflect biases present in the training data. Subgroup performance (by gender, age bracket, etc.) has not been audited.
	- Privacy: No personal data is stored or transmitted by this model artifact; ensure your application handles user health data in compliance with applicable regulations (HIPAA, GDPR, etc.).

	## Citation

	If you use this model in your research or application, please cite:

	```bibtex
	@misc{lebiraja2024lifeexpectancy,
	  author = {lebiraja},
	  title = {Life Expectancy Predictor},
	  year = {2024},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/lebiraja/life-expectancy-predictor}},
	}
	```

	## License

	MIT