# π IntelliCredit-X β Project Summary
> *The complete technical story behind building a multi-agent RL environment for MSME credit underwriting and GRPO fine-tuning of Mistral-7B*
**Authors:** V S S K Sai Narayana & Sujeet Jaiswal
**Event:** Meta Γ Hugging Face OpenEnv Hackathon 2025
**Status:** β
All 8 Phases Complete β Production Deployed
**Published:** April 25, 2026 | **License:** MIT
---
## π Quick Links
| Resource | Link |
|----------|------|
| π€ Live Environment | [huggingface.co/spaces/vssksn/intellicredit-openenv](https://huggingface.co/spaces/vssksn/intellicredit-openenv) |
| π€ GRPO Model | [vssksn/intellicredit-mistral-7b-grpo](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) |
| π€ Training Dataset | [vssksn/intellicredit-grpo-dataset](https://huggingface.co/datasets/vssksn/intellicredit-grpo-dataset) |
| π Colab Training Notebook | [Open in Colab](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing) |
| π» GitHub Repository | [1919-14/intellicredit-openenv](https://github.com/1919-14/intellicredit-openenv) |
| π API Docs | [Swagger UI](https://vssksn-intellicredit-openenv.hf.space/docs) |
| π Mini Blog (Story & Results) | [docs/mini_blog.md](./docs/mini_blog.md) |
| π Full Blog Post | [docs/blog.md](./docs/blog.md) |
---
## π― The Problem
The MSME lending sector in India processes **100,000+ loan applications daily**. The traditional solution β hire more senior bankers β doesn't scale:
- A senior officer reviews **~16 applications/day** at 30 minutes each
- Officers cost βΉ40β80 lakhs annually
- Human fatigue causes **inconsistent rule application**
- **12β15% annual default rates** due to missed fraud signals buried in data
- No explainable audit trail β decisions driven by "gut feeling"
**Our approach:** Build a reinforcement learning training environment where an AI learns to reason like the best credit officers β gathering evidence, detecting hidden fraud, respecting hard regulatory constraints, and managing portfolio risk across time.
---
## ποΈ What IntelliCredit-X Is
IntelliCredit-X is a **Constrained Multi-Agent MDP** (Markov Decision Process) built as an OpenEnv-compliant environment. An LLM agent acts as a **Senior Credit Officer** across a 50-step episode representing a full credit committee lifecycle.
**What makes this fundamentally different from a classifier:**
| Challenge | How IntelliCredit-X Models It |
|-----------|-------------------------------|
| Multi-step investigation | Agent calls tools before deciding β not a one-shot classification |
| Delayed consequences | Loan approved at step 5 may default at step 30 β reward arrives 25 steps later |
| Adversarial borrowers | BorrowerAgent improves surface metrics after rejection; hidden PD unchanged |
| Regulatory enforcement | RegulatorAgent audits portfolio every ~10 steps; 3 failures = shutdown |
| Missing data | β1.0 sentinel = masked feature; data absence itself is a risk signal |
| Regulatory hard rules | 6 RBI mandates that auto-reject and penalize regardless of model choice |
---
## π System Architecture
```
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β INTELLICREDIT-X SYSTEM OVERVIEW β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β BORROWER AGENT (Adversarial) REGULATOR AGENT (Enforcer) β
β βββββββββββββββββββββββββββ βββββββββββββββββββββββββββ β
β β Presents applications β β Audits at β10/20/30/ β β
β β Hides real risk behind β β 40/50 steps (Β±1 jitter) β β
β β improved surface numbersβ β NPA/CRAR/concentration β β
β β Reapplies up to 3Γ β β Shutdown after 3 fails β β
β ββββββββββββββ¬βββββββββββββ ββββββββββββββ¬βββββββββββββ β
β ββββββββββββββββ¬βββββββββββββββββββ β
β βΌ β
β ββββββββββββββββββββββββββ β
β β WORLD STATE β β
β β Macro economy trends β β
β β Sector health scores β β
β β Loan maturity queue β β
β β Portfolio ledger β β
β βββββββββββββ¬βββββββββββββ β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββ β
β β CREDIT OFFICER AGENT (LLM) β β
β β β β
β β Sees 55D observation as a text prompt β β
β β Calls 3 investigation tools (max 4/step) β β
β β Writes reasoning + submits decision β β
β β β β
β β Fine-tuned via GRPO (TRL + Unsloth) β β
β β Self-improves via Reflection Module β β
β ββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
```
---
## ποΈ Observation Space β 55 Dimensions
| Group | Dims | Description |
|-------|------|-------------|
| Application Features | 0β24 | DSCR, Current Ratio, GST gap, DIN score, cheque bounce rate, circular trading flag, ITC mismatch, RONW, collateralβ¦ |
| Portfolio State | 25β34 | Capital deployed, NPA rate, CRAR, provisioning coverage, sector concentration flags |
| Macro State | 35β39 | Systemic stress, GDP growth, inflation, credit cycle phase, stressed sector flag |
| Alert State | 40β44 | Running RED/YELLOW alert tallies for episode |
| **Memory Features** *(v2)* | **45β54** | Rolling NPA, approval rate, sector max concentration, macro trend, borrower persistence, audit risk, capital buffer, reflection count, episode progress, world model confidence |
**Key dimension:** `Dim 49: borrower_persistence_score` β `0.0` = first attempt, `0.5` = second, `1.0` = third attempt. A value of 1.0 is the environment's strongest manipulation signal.
---
## πΉοΈ Action Space + Tool System
**Actions:** `0 = APPROVE` | `1 = CONDITIONAL` | `2 = REJECT`
Plus optional tool calls before each decision (max 4 per step):
| Tool | Returns | Best Used When |
|------|---------|----------------|
| `get_financial_report(company_id)` | 3yr revenue, EBITDA, auditor remarks, related-party transactions | Borderline financials, need trend data |
| `check_compliance_status(company_id)` | DIN status, NCLT cases, GST filings, CIBIL, prior defaults | RED alert present or low governance score |
| `get_market_intelligence(sector)` | Sector stress, RBI advisory, portfolio exposure, peer NPA rate | Approaching 30% concentration limit |
| `submit_decision(action, reasoning)` | Finalizes step (reasoning β₯ 50 chars mandatory) | After investigation complete |
---
## π Hard Rules (Non-Negotiable RBI Mandates)
| Rule | Trigger Condition | Outcome |
|------|-------------------|---------|
| HR-01 | DSCR < 1.0 | Auto-REJECT + β2.0 penalty |
| HR-02 | Director DIN < 0.1 (disqualified) | Auto-REJECT + β2.0 penalty |
| HR-03 | RED forensic alert present | Auto-REJECT + β2.0 penalty |
| HR-04 | Cheque bounce rate > 25% | Auto-REJECT + β2.0 penalty |
| HR-05 | GST compliance < 40% | Auto-REJECT + β2.0 penalty |
| HR-06 | Adverse media score > 0.80 | Auto-REJECT + β2.0 penalty |
---
## π Reward System
### Per-Step Components
| Component | Weight | Range | Purpose |
|-----------|--------|-------|---------|
| R1: Decision Correctness | 40% | [β2.0, +1.0] | PD-based: low PD+APPROVE=+1.0; high PD+APPROVE=β2.0 |
| R2: Hard Rule Compliance | 30% | [β2.0, +0.5] | HR+REJECT=+0.5; HR+APPROVE=β2.0 |
| R3: Format Compliance | 15% | [β0.3, +0.3] | `submit_decision()` used=+0.3; parse failure=β0.3 |
| R4: Portfolio Awareness | 15% | [β0.8, +0.3] | NPA>8%+risky approve=β0.5; healthy approve=+0.2 |
### Delayed + Event Rewards
| Event | Reward | When It Fires |
|-------|--------|---------------|
| Loan fully repaid | **+10.0** | T+10 to T+30 after approval |
| Partial default (recovery β₯50%) | **β5.0** | T+10 to T+30 after approval |
| Full default (recovery <50%) | **β15.0 Γ (1βrecovery)** | T+10 to T+30 after approval |
| Clean audit (all checks pass) | **+2.0** | Each audit step |
| Audit violation | **β8.0 per violation** | Each audit step |
| Capital breach (CRAR <12.5%) | **β15.0** | Each audit step |
| 3rd consecutive audit fail | **β50.0 + shutdown** | Terminal |
| Survival bonus (CRAR >12.5%) | **+0.05β+0.10** | Every 10 steps |
### Settlement Score (Step 50)
```
score = 0.30 Γ portfolio_yield
+ 0.30 Γ (1 β npa_rate / 0.05)
+ 0.20 Γ regulatory_compliance
+ 0.20 Γ capital_utilization
Range: [β1.0, +5.0] | Good episode β₯ 3.0
```
---
## π€ GRPO Training Pipeline
### Model & Config
| Parameter | Value |
|-----------|-------|
| Base Model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Quantization | 4-bit QLoRA via Unsloth |
| LoRA Rank | 16, targeting q/k/v/o projections |
| Sequence Length | 2048 tokens (strictly enforced) |
| Generations per prompt | 8 |
| Effective batch size | 16 (batch=2, grad_accum=8) |
| KL Beta | 0.001 |
| Hardware | A100 80GB β ~45 minutes total |
### 3-Stage Curriculum
| Stage | Training Data | LR | Temperature | Goal |
|-------|-------------|-----|-------------|------|
| Stage 0 (SFT Warmup) | Mixed tasks | 5e-5 | β | Bootstrap `submit_decision()` format compliance |
| Stage 1 | task1 (Easy) | 5e-6 | 0.9 | Hard rule recognition on clean profiles |
| Stage 2 | task1 + task2 | 5e-6 | 0.9 | Forensic alert detection + tool call initiation |
| Stage 3 | All 5 tasks | 2e-6 | 0.8 | Long-horizon portfolio management under macro shocks |
### Training Dataset
- **2,000 prompts** β 400 per task level (task1βtask5)
- Each prompt ~2,400 characters (role + tools + rules + application + portfolio + macro)
- Ground truth metadata: hidden PD, optimal action, hard rules, forensic alerts, CRAR, NPA
- Distribution: **47.2%** hard rules triggered | **28.1%** RED forensic alerts
- Dataset: [vssksn/intellicredit-grpo-dataset](https://huggingface.co/datasets/vssksn/intellicredit-grpo-dataset)
### Critical Bugs Fixed During Training
| Bug | Root Cause | Fix Applied |
|-----|-----------|-------------|
| CUDA Index Out of Bounds | Unsloth pads vocab 32768β32832; OOB token IDs caused assertions | Clamp all IDs to `vocab_sizeβ1`; add `valid_mask` |
| Shape mismatch on logits | `full_ids` exceeded 2048 before forward pass; logits truncated | Enforce `full_ids = full_ids[:, :MAX_SEQ_LEN]` before forward |
| Flat KL divergence | `clamp(min=0)` made KL=0 when new policy was more confident | Changed to `abs()` for symmetric KL |
| Zero-LP episodes | Long prompts consumed full context; zero completion tokens | `continue` when `sum(log_probs) == 0` |
---
## πͺ Self-Improvement Reflection System
The Reflection Module enables **cross-episode learning without weight updates** by injecting structured lessons into the next episode's system prompt.
### How It Works
```
Episode N β Analyze all steps where reward < 0
β Extract lessons by failure type (6 categories)
β Store in memory_bank.json (max 20, FIFO eviction)
Episode N+1 β Inject top 5 lessons into system prompt Layer 3
β Agent makes better decisions without retraining
```
### 6 Lesson Categories
| Trigger | Lesson Format | Severity |
|---------|---------------|----------|
| Hard Rule Violation | `RULE: When [condition], always REJECT` | Critical |
| Delayed Default | `CAUTION: Loans with [pattern] defaulted at step X` | High |
| Audit Failure | `COMPLIANCE: Audit failed due to [metric]` | High |
| Borrower Manipulation | `FRAUD: 3rd-attempt applicant with [pattern] defaulted` | Critical |
| Macro Shock Loss | `MACRO: During [state], be conservative with [sector]` | Medium |
| Portfolio Overexposure | `PORTFOLIO: NPA rate reached X%. Increase rejections.` | High |
**Measured result:** Base model (no fine-tuning) improved average episode score from **0.22 β 0.55** across 30 episodes using only reflection β a **+150% improvement without changing a single weight**.
---
## π Results β GRPO Fine-Tuned vs. Base Model
| Task | Metric | Base Mistral-7B | GRPO Model | Delta |
|------|--------|----------------|-----------|-------|
| Task 1 (Easy) | Score | 0.900 | **0.955** | **+0.055 β
** |
| | Accuracy | 80.0% | **86.7%** | **+6.7% β
** |
| | Total Reward | 2.904 | 3.272 | +0.368 β
|
| | Capital Util | 40.0% | **60.0%** | **+20.0% β
** |
| Task 2 (Medium) | Score | 1.000 | 1.000 | ceiling β
|
| | Total Reward | 10.305 | **10.584** | **+0.279 β
** |
| | Capital Util | 25.0% | **29.2%** | **+4.2% β
** |
| Task 3 (Hard) | Score | 0.767 | **0.833** | **+0.067 β
** |
| | Total Reward | 0.215 | **2.491** | **+2.276 β
(10Γ!)** |
| | Accuracy | 58.3% | **66.7%** | **+8.3% β
** |
| | **NPA Rate** | **16.7%** | **8.3%** | **β8.3% β
(halved!)** |
| | Capital Util | 16.7% | **25.0%** | **+8.3% β
** |
**Zero regressions across all 24 metric-task combinations.**
---
## π§ Training Curves
| Panel | What It Shows |
|-------|---------------|
| GRPO Loss (red) | Controlled upward drift 0β0.05 β policy meaningfully diverging from base |
| Mean Reward (blue) | Starts β2.0, crosses zero by step 10, stabilizes +0.5 to +1.0 |
| KL Divergence (purple) | Grows to 0.04β0.08 β new behaviors learned, base capabilities preserved |
| `submit_pct` (teal) | Format compliance 0%β40β65% β model acquires the task's vocabulary |
The biggest reward jump happens at the Stage 1β2 transition (when forensic alerts first appear), suggesting forensic pattern recognition was the primary learning bottleneck.
---
## πΊοΈ Project Phase Completion
| Phase | Name | Status |
|-------|------|--------|
| Phase 0 | Strategic Alignment & Decisions | β
Complete |
| Phase 1 | Environment Upgrade (v1 β v2: 50-step, 55D, multi-agent) | β
Complete |
| Phase 2 | Multi-Agent System Design (Borrower + Regulator agents) | β
Complete |
| Phase 3 | Tool Calling System (3 tools + parser + anti-hacking) | β
Complete |
| Phase 4 | Reward System Redesign (sparse/delayed + 4 components) | β
Complete |
| Phase 5 | Self-Improvement & Reflection System | β
Complete |
| Phase 6 | GRPO Training Pipeline (Unsloth + TRL, 3-stage curriculum) | β
Complete |
| Phase 7 | Evaluation & Proof Generation | β
Complete |
| Phase 8 | GRPO Stabilization + Real Model Evaluation | β
Complete |
---
## π File Inventory
| File | Role | Phase |
|------|------|-------|
| `server/app.py` | FastAPI server β /reset, /step, /info, /health | P0 |
| `server/intellicredit_env.py` | v2 core: WorldState, 50-step lifecycle, multi-agent | P1βP4 |
| `server/dataset.py` | Application generator (Anchor Γ Sector Γ Size Γ Tier) | P0 |
| `server/reward.py` | R1-R4 reward engine + settlement grader | P4 |
| `server/action_parser.py` | LLM text β tool call / decision parser (6 parse levels) | P3 |
| `server/tool_executor.py` | Read-only tool execution (financial, compliance, market) | P3 |
| `server/agent_loop.py` | Agent orchestrator + prompt injection | P3βP5 |
| `server/reflection.py` | Self-improvement + memory bank | P5 |
| `training/colab_grpo_3b_v2.py` | PRIMARY: Unsloth GRPO training script (A100, ~45 min) | P6/P8 |
| `training/generate_dataset.py` | 2,000-prompt GRPO dataset generator | P6 |
| `training/grpo_rewards.py` | 4 GRPO reward functions (R1-R4) | P6 |
| `training/train_grpo.py` | 3-stage curriculum pipeline | P6 |
| `evaluation/evaluate.py` | Multi-mode evaluation engine | P7 |
| `evaluation/compare.py` | Comparison tables + reward curves | P7 |
| `eval_llm.py` | LLM evaluation via HTTP (base vs trained) | P8 |
| `compare_results.py` | Bar chart comparison generator | P8 |
| `models.py` | Pydantic schemas (55D observation, action) | P1 |
| `inference.py` | LLM inference wrapper | P0 |
| `docs/mini_blog.md` | Mini blog β storytelling format, 2-min read, proof charts | P8 |
| `docs/blog.md` | Full technical blog post (~5,000 words) | P8 |
| `docs/assets/comparison.png` | Baseline vs GRPO results chart | P8 |
| `docs/assets/training_curves.png` | GRPO training curves (Mistral-7B, A100) | P8 |
**Total: ~10,000+ lines of implementation across 21 source files.**
---
## π Citation
```bibtex
@article{intellicredit2026,
title = {IntelliCredit-X: A Multi-Agent Constrained MDP for MSME Credit
Appraisal with GRPO Fine-Tuning},
author = {Narayana, V S S K Sai and Jaiswal, Sujeet},
year = {2025},
note = {Meta Γ Hugging Face OpenEnv Hackathon},
url = {https://huggingface.co/spaces/vssksn/intellicredit-openenv}
}
```
---
*IntelliCredit-X β Built by V S S K Sai Narayana & Sujeet Jaiswal | MIT License*