credit-card-fraud-detector

Ensemble model (Random Forest + XGBoost) for detecting fraud in credit card transactions, trained on the alenc123/credit-card-fraud dataset.

Winning model

xgboost: selected by fraud-class F1 at the calibrated threshold.

Metric	Value
F1 (fraud class, calibrated threshold)	0.9180
Precision	0.9634
Recall	0.8767
ROC-AUC	0.9990
PR-AUC	0.9596
Calibrated threshold	0.9516
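
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The snippet below is only arithmetic on the numbers in the table; it does not load the model or the data.

```python
# Precision and recall for the fraud class, copied from the metrics table.
precision = 0.9634
recall = 0.8767

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")
```

The result agrees with the tabulated F1 to four decimal places.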

Data

Source: alenc123/credit-card-fraud, file credit_card_transactions.parquet.
Train / Test: 1,037,340 / 259,335 rows (stratified 80/20 split).
Fraud rate: ~0.579% (the positive class is a small minority).
Final features (post-FE): 30, including amt_log1p, distance_km, hour, dayofweek, month, age, frequency encoding of merchant/city/job/state, and one-hot encoding of category/gender.
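
A minimal sketch of how a few of the listed features could be derived is shown below. It is not the repository's feature-engineering script. The raw column names (trans_date_trans_time, amt, lat, long, merch_lat, merch_long) are assumptions based on the usual Sparkov-style schema, distance_km is assumed to be the haversine distance between cardholder and merchant, and the toy row is invented for illustration.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def add_basic_features(df):
    """Add a subset of the engineered features (raw column names are assumed)."""
    out = df.copy()
    ts = pd.to_datetime(out["trans_date_trans_time"])
    out["amt_log1p"] = np.log1p(out["amt"])  # compress the heavy-tailed amount
    out["distance_km"] = haversine_km(
        out["lat"], out["long"], out["merch_lat"], out["merch_long"]
    )
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month
    return out

# Invented single-row example, for illustration only.
toy = pd.DataFrame({
    "trans_date_trans_time": ["2020-06-21 12:14:25"],
    "amt": [2.86],
    "lat": [33.9659], "long": [-80.9355],
    "merch_lat": [33.9864], "merch_long": [-81.2007],
})
features = add_basic_features(toy)
print(features[["amt_log1p", "distance_km", "hour", "dayofweek", "month"]])
```

The age feature and the categorical encodings are left out here; the authoritative definitions are in scripts/03_feature_engineering.py.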

Hyperparameters

Random Forest (Bagging)

{
  "n_estimators": 300,
  "min_samples_leaf": 1,
  "max_features": 0.5,
  "max_depth": null,
  "criterion": "entropy"
}
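
The JSON above can be passed directly to scikit-learn once parsed, with null becoming None (no depth limit). The sketch below assumes the model was built with sklearn.ensemble.RandomForestClassifier and adds class_weight="balanced", which is the imbalance setting this card reports for Random Forest. The random_state and n_jobs values are illustrative and are not taken from the card.

```python
import json
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters as listed in this card; json.loads turns null into None.
rf_params = json.loads("""
{
  "n_estimators": 300,
  "min_samples_leaf": 1,
  "max_features": 0.5,
  "max_depth": null,
  "criterion": "entropy"
}
""")

# random_state and n_jobs are assumptions added for reproducibility and speed.
rf = RandomForestClassifier(
    **rf_params,
    class_weight="balanced",
    random_state=0,
    n_jobs=-1,
)
print(rf.get_params()["max_depth"])
```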

XGBoost (Boosting)

{
  "subsample": 1.0,
  "reg_lambda": 0,
  "n_estimators": 600,
  "min_child_weight": 1,
  "max_depth": 6,
  "learning_rate": 0.1,
  "colsample_bytree": 0.8
}

Handling class imbalance

Random Forest: class_weight='balanced'.
XGBoost: scale_pos_weight = n_neg / n_pos ≈ 172 (consistent with the ~0.579% fraud rate).
Threshold calibration via the precision-recall curve (maximizing F1 on the training set).
No SMOTE (see notes/02_design_modeling.md).
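
The threshold calibration step can be sketched as follows, assuming it was done with sklearn.metrics.precision_recall_curve. The helper name best_f1_threshold and the synthetic scores are hypothetical; they illustrate the technique and are not the project's code or data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, proba):
    """Return (threshold, f1) maximizing F1 along the precision-recall curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Synthetic scores for illustration only: about 1% positives with higher scores.
rng = np.random.default_rng(0)
y_true = (rng.random(20_000) < 0.01).astype(int)
proba = np.clip(rng.normal(0.2 + 0.6 * y_true, 0.15), 0.0, 1.0)

threshold, f1 = best_f1_threshold(y_true, proba)
print(f"threshold={threshold:.4f}  f1={f1:.4f}")
```

Predictions are then obtained as proba >= threshold, which matches the convention precision_recall_curve uses when it computes each point.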

How to use

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

model_path = hf_hub_download("gusdelact/credit-card-fraud-bagging-boosting", "model.joblib")
pre_path   = hf_hub_download("gusdelact/credit-card-fraud-bagging-boosting", "preprocessor.joblib")
model = joblib.load(model_path)
preprocessor = joblib.load(pre_path)

# X_new_engineered must hold the raw columns of the original dataset after the same
# feature engineering used in training (see scripts/03_feature_engineering.py or app_inference/).
X_t = preprocessor.transform(X_new_engineered)
proba = model.predict_proba(X_t)[:, 1]
# Flag a transaction as fraud when its score reaches the calibrated threshold.
prediction = (proba >= 0.9516).astype(int)

Limitations

The original dataset is synthetic (Sparkov-style); the metrics may be optimistic relative to production.
Frequency encoding maps new categories to 0; an unseen merchant will weaken the signal.
No temporal split: for scenarios with concept drift, re-evaluation is recommended.
The probabilities are NOT calibrated in the strict sense (CalibratedClassifierCV was not applied).
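
To make the frequency-encoding limitation concrete, here is a minimal sketch of an encoder that sends unseen categories to 0. It assumes relative frequencies learned on the training data; the card does not say whether counts or proportions were used, and the function names and merchant labels are hypothetical.

```python
import pandas as pd

def fit_frequency_encoding(train_col):
    """Map each category to its relative frequency in the training column."""
    return train_col.value_counts(normalize=True)

def apply_frequency_encoding(col, freq):
    """Encode a column; categories absent from the training data become 0."""
    return col.map(freq).fillna(0.0)

# Invented merchant labels, for illustration only.
train_merchants = pd.Series(["shop_a", "shop_a", "shop_b", "shop_c"])
new_merchants = pd.Series(["shop_a", "shop_new"])

freq = fit_frequency_encoding(train_merchants)
encoded = apply_frequency_encoding(new_merchants, freq)
print(encoded.tolist())
```

Here shop_new receives 0, the same value a vanishingly rare merchant would approach, so the model cannot distinguish "never seen" from "almost never seen".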

Cite

@model{ credit-card-fraud-bagging-boosting_2026,
  author    = {gusdelact},
  title     = {credit-card-fraud-detector},
  year      = {2026},
  publisher = {Hugging Face}
}
