# 📋 IntelliCredit-X: Project Summary

> *The complete technical story behind building a multi-agent RL environment for MSME credit underwriting and GRPO fine-tuning of Mistral-7B*

**Authors:** V S S K Sai Narayana & Sujeet Jaiswal  
**Event:** Meta × Hugging Face OpenEnv Hackathon 2025  
**Status:** ✅ All 8 Phases Complete, Production Deployed  
**Published:** April 25, 2026 | **License:** MIT

---

## 🔗 Quick Links

| Resource | Link |
|----------|------|
| 🤗 Live Environment | [huggingface.co/spaces/vssksn/intellicredit-openenv](https://huggingface.co/spaces/vssksn/intellicredit-openenv) |
| 🤗 GRPO Model | [vssksn/intellicredit-mistral-7b-grpo](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) |
| 🤗 Training Dataset | [vssksn/intellicredit-grpo-dataset](https://huggingface.co/datasets/vssksn/intellicredit-grpo-dataset) |
| 📓 Colab Training Notebook | [Open in Colab](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing) |
| 💻 GitHub Repository | [1919-14/intellicredit-openenv](https://github.com/1919-14/intellicredit-openenv) |
| 📖 API Docs | [Swagger UI](https://vssksn-intellicredit-openenv.hf.space/docs) |
| 📝 Mini Blog (Story & Results) | [docs/mini_blog.md](./docs/mini_blog.md) |
| 📖 Full Blog Post | [docs/blog.md](./docs/blog.md) |

---

## 🎯 The Problem

The MSME lending sector in India processes **100,000+ loan applications daily**.
The traditional solution, hiring more senior bankers, doesn't scale:

- A senior officer reviews **~16 applications/day** at 30 minutes each
- Officers cost ₹40–80 lakhs annually
- Human fatigue causes **inconsistent rule application**
- **12–15% annual default rates** due to missed fraud signals buried in data
- No explainable audit trail: decisions are driven by "gut feeling"

**Our approach:** Build a reinforcement learning training environment where an AI learns to reason like the best credit officers: gathering evidence, detecting hidden fraud, respecting hard regulatory constraints, and managing portfolio risk across time.

---

## 🏗️ What IntelliCredit-X Is

IntelliCredit-X is a **Constrained Multi-Agent MDP** (Markov Decision Process) built as an OpenEnv-compliant environment. An LLM agent acts as a **Senior Credit Officer** across a 50-step episode representing a full credit committee lifecycle.

**What makes this fundamentally different from a classifier:**

| Challenge | How IntelliCredit-X Models It |
|-----------|-------------------------------|
| Multi-step investigation | Agent calls tools before deciding; this is not a one-shot classification |
| Delayed consequences | Loan approved at step 5 may default at step 30; reward arrives 25 steps later |
| Adversarial borrowers | BorrowerAgent improves surface metrics after rejection; hidden PD unchanged |
| Regulatory enforcement | RegulatorAgent audits portfolio every ~10 steps; 3 failures = shutdown |
| Missing data | −1.0 sentinel = masked feature; data absence itself is a risk signal |
| Regulatory hard rules | 6 RBI mandates that auto-reject and penalize regardless of model choice |

---

## 📐 System Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                  INTELLICREDIT-X SYSTEM OVERVIEW                   │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  BORROWER AGENT (Adversarial)     REGULATOR AGENT (Enforcer)       │
│  ┌─────────────────────────┐      ┌─────────────────────────┐      │
│  │ Presents applications   │      │ Audits at ≈10/20/30/    │      │
│  │ Hides real risk behind  │      │ 40/50 steps (±1 jitter) │      │
│  │ improved surface numbers│      │ NPA/CRAR/concentration  │      │
│  │ Reapplies up to 3×      │      │ Shutdown after 3 fails  │      │
│  └────────────┬────────────┘      └────────────┬────────────┘      │
│               └───────────────┬────────────────┘                   │
│                               ▼                                    │
│                  ┌────────────────────────┐                        │
│                  │      WORLD STATE       │                        │
│                  │ Macro economy trends   │                        │
│                  │ Sector health scores   │                        │
│                  │ Loan maturity queue    │                        │
│                  │ Portfolio ledger       │                        │
│                  └────────────┬───────────┘                        │
│                               ▼                                    │
│        ┌────────────────────────────────────────────┐              │
│        │         CREDIT OFFICER AGENT (LLM)         │              │
│        │                                            │              │
│        │ Sees 55D observation as a text prompt      │              │
│        │ Calls 3 investigation tools (max 4/step)   │              │
│        │ Writes reasoning + submits decision        │              │
│        │                                            │              │
│        │ Fine-tuned via GRPO (TRL + Unsloth)        │              │
│        │ Self-improves via Reflection Module        │              │
│        └────────────────────────────────────────────┘              │
└────────────────────────────────────────────────────────────────────┘
```

---

## 👁️ Observation Space: 55 Dimensions

| Group | Dims | Description |
|-------|------|-------------|
| Application Features | 0–24 | DSCR, Current Ratio, GST gap, DIN score, cheque bounce rate, circular trading flag, ITC mismatch, RONW, collateral… |
| Portfolio State | 25–34 | Capital deployed, NPA rate, CRAR, provisioning coverage, sector concentration flags |
| Macro State | 35–39 | Systemic stress, GDP growth, inflation, credit cycle phase, stressed sector flag |
| Alert State | 40–44 | Running RED/YELLOW alert tallies for episode |
| **Memory Features** *(v2)* | **45–54** | Rolling NPA, approval rate, sector max concentration, macro trend, borrower persistence, audit risk, capital buffer, reflection count, episode progress, world model confidence |

**Key dimension:** `Dim 49: borrower_persistence_score`, where `0.0` = first attempt, `0.5` = second, `1.0` = third attempt. A value of 1.0 is the environment's strongest manipulation signal.
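
A minimal sketch of how a client might slice the 55-dimensional observation into the groups listed above and read the persistence signal. The index ranges and the meaning of dim 49 come from the table. The helper names (`split_observation`, `masked_indices`, `attempt_number`), the plain-list representation, and the choice to scan only the application block for the −1.0 sentinel are assumptions for illustration, not the environment's actual API.

```python
from typing import Dict, List, Sequence, Tuple

# Inclusive index ranges copied from the observation-space table.
OBS_GROUPS: Dict[str, Tuple[int, int]] = {
    "application": (0, 24),
    "portfolio": (25, 34),
    "macro": (35, 39),
    "alerts": (40, 44),
    "memory": (45, 54),
}
OBS_DIM = 55
MASKED = -1.0          # sentinel for a masked (missing) feature
PERSISTENCE_DIM = 49   # borrower_persistence_score


def split_observation(obs: Sequence[float]) -> Dict[str, List[float]]:
    """Split a flat 55-D observation into its named groups."""
    if len(obs) != OBS_DIM:
        raise ValueError(f"expected {OBS_DIM} dims, got {len(obs)}")
    return {name: list(obs[lo:hi + 1]) for name, (lo, hi) in OBS_GROUPS.items()}


def masked_indices(obs: Sequence[float]) -> List[int]:
    """Indices of application features carrying the -1.0 sentinel.

    Assumption: only the application block (dims 0-24) is checked.
    """
    lo, hi = OBS_GROUPS["application"]
    return [i for i in range(lo, hi + 1) if obs[i] == MASKED]


def attempt_number(obs: Sequence[float]) -> int:
    """Map the persistence score 0.0 / 0.5 / 1.0 to attempt 1 / 2 / 3."""
    return int(round(obs[PERSISTENCE_DIM] * 2)) + 1
```

For example, an observation with `obs[49] == 1.0` maps to attempt 3, the case the document flags as the strongest manipulation signal.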

---

## 🕹️ Action Space + Tool System

**Actions:** `0 = APPROVE` | `1 = CONDITIONAL` | `2 = REJECT`

Plus optional tool calls before each decision (max 4 per step):

| Tool | Returns | Best Used When |
|------|---------|----------------|
| `get_financial_report(company_id)` | 3yr revenue, EBITDA, auditor remarks, related-party transactions | Borderline financials, need trend data |
| `check_compliance_status(company_id)` | DIN status, NCLT cases, GST filings, CIBIL, prior defaults | RED alert present or low governance score |
| `get_market_intelligence(sector)` | Sector stress, RBI advisory, portfolio exposure, peer NPA rate | Approaching 30% concentration limit |
| `submit_decision(action, reasoning)` | Finalizes step (reasoning ≥ 50 chars mandatory) | After investigation complete |

---

## 🛑 Hard Rules (Non-Negotiable RBI Mandates)

| Rule | Trigger Condition | Outcome |
|------|-------------------|---------|
| HR-01 | DSCR < 1.0 | Auto-REJECT + −2.0 penalty |
| HR-02 | Director DIN < 0.1 (disqualified) | Auto-REJECT + −2.0 penalty |
| HR-03 | RED forensic alert present | Auto-REJECT + −2.0 penalty |
| HR-04 | Cheque bounce rate > 25% | Auto-REJECT + −2.0 penalty |
| HR-05 | GST compliance < 40% | Auto-REJECT + −2.0 penalty |
| HR-06 | Adverse media score > 0.80 | Auto-REJECT + −2.0 penalty |

---

## 📈 Reward System

### Per-Step Components

| Component | Weight | Range | Purpose |
|-----------|--------|-------|---------|
| R1: Decision Correctness | 40% | [−2.0, +1.0] | PD-based: low PD+APPROVE=+1.0; high PD+APPROVE=−2.0 |
| R2: Hard Rule Compliance | 30% | [−2.0, +0.5] | HR+REJECT=+0.5; HR+APPROVE=−2.0 |
| R3: Format Compliance | 15% | [−0.3, +0.3] | `submit_decision()` used=+0.3; parse failure=−0.3 |
| R4: Portfolio Awareness | 15% | [−0.8, +0.3] | NPA>8%+risky approve=−0.5; healthy approve=+0.2 |

### Delayed + Event Rewards

| Event | Reward | When It Fires |
|-------|--------|---------------|
| Loan fully repaid | **+10.0** | T+10 to T+30 after approval |
| Partial default (recovery ≥50%) | **−5.0** | T+10 to T+30 after approval |
| Full default (recovery <50%) | **−15.0 × (1−recovery)** | T+10 to T+30 after approval |
| Clean audit (all checks pass) | **+2.0** | Each audit step |
| Audit violation | **−8.0 per violation** | Each audit step |
| Capital breach (CRAR <12.5%) | **−15.0** | Each audit step |
| 3rd consecutive audit fail | **−50.0 + shutdown** | Terminal |
| Survival bonus (CRAR >12.5%) | **+0.05–+0.10** | Every 10 steps |

### Settlement Score (Step 50)

```
score = 0.30 × portfolio_yield
      + 0.30 × (1 − npa_rate / 0.05)
      + 0.20 × regulatory_compliance
      + 0.20 × capital_utilization

Range: [−1.0, +5.0] | Good episode ≥ 3.0
```

---

## 🤖 GRPO Training Pipeline

### Model & Config

| Parameter | Value |
|-----------|-------|
| Base Model | `mistralai/Mistral-7B-Instruct-v0.3` |
| Quantization | 4-bit QLoRA via Unsloth |
| LoRA Rank | 16, targeting q/k/v/o projections |
| Sequence Length | 2048 tokens (strictly enforced) |
| Generations per prompt | 8 |
| Effective batch size | 16 (batch=2, grad_accum=8) |
| KL Beta | 0.001 |
| Hardware | A100 80GB, ~45 minutes total |

### 3-Stage Curriculum

| Stage | Training Data | LR | Temperature | Goal |
|-------|---------------|-----|-------------|------|
| Stage 0 (SFT Warmup) | Mixed tasks | 5e-5 | n/a | Bootstrap `submit_decision()` format compliance |
| Stage 1 | task1 (Easy) | 5e-6 | 0.9 | Hard rule recognition on clean profiles |
| Stage 2 | task1 + task2 | 5e-6 | 0.9 | Forensic alert detection + tool call initiation |
| Stage 3 | All 5 tasks | 2e-6 | 0.8 | Long-horizon portfolio management under macro shocks |

### Training Dataset

- **2,000 prompts**: 400 per task level (task1–task5)
- Each prompt ~2,400 characters (role + tools + rules + application + portfolio + macro)
- Ground truth metadata: hidden PD, optimal action, hard rules, forensic alerts, CRAR, NPA
- Distribution: **47.2%** hard rules triggered | **28.1%** RED forensic alerts
- Dataset: [vssksn/intellicredit-grpo-dataset](https://huggingface.co/datasets/vssksn/intellicredit-grpo-dataset)

### Critical Bugs Fixed During Training

| Bug | Root Cause | Fix Applied |
|-----|-----------|-------------|
| CUDA Index Out of Bounds | Unsloth pads vocab 32768→32832; OOB token IDs caused assertions | Clamp all IDs to `vocab_size−1`; add `valid_mask` |
| Shape mismatch on logits | `full_ids` exceeded 2048 before forward pass; logits truncated | Enforce `full_ids = full_ids[:, :MAX_SEQ_LEN]` before forward |
| Flat KL divergence | `clamp(min=0)` made KL=0 when new policy was more confident | Changed to `abs()` for symmetric KL |
| Zero-LP episodes | Long prompts consumed full context; zero completion tokens | `continue` when `sum(log_probs) == 0` |

---

## 🪞 Self-Improvement Reflection System

The Reflection Module enables **cross-episode learning without weight updates** by injecting structured lessons into the next episode's system prompt.

### How It Works

```
Episode N   → Analyze all steps where reward < 0
            → Extract lessons by failure type (6 categories)
            → Store in memory_bank.json (max 20, FIFO eviction)

Episode N+1 → Inject top 5 lessons into system prompt Layer 3
            → Agent makes better decisions without retraining
```

### 6 Lesson Categories

| Trigger | Lesson Format | Severity |
|---------|---------------|----------|
| Hard Rule Violation | `RULE: When [condition], always REJECT` | Critical |
| Delayed Default | `CAUTION: Loans with [pattern] defaulted at step X` | High |
| Audit Failure | `COMPLIANCE: Audit failed due to [metric]` | High |
| Borrower Manipulation | `FRAUD: 3rd-attempt applicant with [pattern] defaulted` | Critical |
| Macro Shock Loss | `MACRO: During [state], be conservative with [sector]` | Medium |
| Portfolio Overexposure | `PORTFOLIO: NPA rate reached X%. Increase rejections.` | High |

**Measured result:** Base model (no fine-tuning) improved average episode score from **0.22 → 0.55** across 30 episodes using only reflection, a **+150% improvement without changing a single weight**.

---

## 📊 Results: GRPO Fine-Tuned vs. Base Model

Deltas for Accuracy, NPA Rate, and Capital Util are in percentage points (pp).

| Task | Metric | Base Mistral-7B | GRPO Model | Delta |
|------|--------|----------------|-----------|-------|
| Task 1 (Easy) | Score | 0.900 | **0.955** | **+0.055 ✅** |
| | Accuracy | 80.0% | **86.7%** | **+6.7 pp ✅** |
| | Total Reward | 2.904 | 3.272 | +0.368 ✅ |
| | Capital Util | 40.0% | **60.0%** | **+20.0 pp ✅** |
| Task 2 (Medium) | Score | 1.000 | 1.000 | ceiling ✅ |
| | Total Reward | 10.305 | **10.584** | **+0.279 ✅** |
| | Capital Util | 25.0% | **29.2%** | **+4.2 pp ✅** |
| Task 3 (Hard) | Score | 0.767 | **0.833** | **+0.067 ✅** |
| | Total Reward | 0.215 | **2.491** | **+2.276 ✅ (>10×)** |
| | Accuracy | 58.3% | **66.7%** | **+8.3 pp ✅** |
| | **NPA Rate** | **16.7%** | **8.3%** | **−8.3 pp ✅ (halved)** |
| | Capital Util | 16.7% | **25.0%** | **+8.3 pp ✅** |

**Zero regressions across all 24 metric-task combinations.**

---

## 🧠 Training Curves

| Panel | What It Shows |
|-------|---------------|
| GRPO Loss (red) | Controlled upward drift 0→0.05; policy meaningfully diverging from base |
| Mean Reward (blue) | Starts −2.0, crosses zero by step 10, stabilizes +0.5 to +1.0 |
| KL Divergence (purple) | Grows to 0.04–0.08; new behaviors learned, base capabilities preserved |
| `submit_pct` (teal) | Format compliance 0%→40–65%; model acquires the task's vocabulary |

The biggest reward jump happens at the Stage 1→2 transition (when forensic alerts first appear), suggesting forensic pattern recognition was the primary learning bottleneck.
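
The reflection memory bank from the Self-Improvement section (max 20 lessons, FIFO eviction, top 5 injected into the next prompt) can be sketched as follows. `Lesson`, `MemoryBank`, and the ranking rule (severity first, then most recent episode) are hypothetical names and choices; the source does not say how the top 5 are selected, and persistence to `memory_bank.json` is left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

# Lower rank sorts first. Severity labels follow the lesson-category table.
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2}


@dataclass(frozen=True)
class Lesson:
    text: str       # e.g. "RULE: When DSCR < 1.0, always REJECT"
    severity: str   # "Critical", "High", or "Medium"
    episode: int    # episode in which the lesson was extracted


class MemoryBank:
    """Bounded lesson store with FIFO eviction (assumed capacity: 20)."""

    def __init__(self, capacity: int = 20) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._lessons: Deque[Lesson] = deque(maxlen=capacity)

    def add(self, lesson: Lesson) -> None:
        self._lessons.append(lesson)

    def top(self, k: int = 5) -> List[Lesson]:
        """Return up to k lessons, most severe first, then most recent.

        Assumption: this ordering is illustrative, not the project's rule.
        """
        ranked = sorted(
            self._lessons,
            key=lambda l: (SEVERITY_RANK[l.severity], -l.episode),
        )
        return ranked[:k]

    def prompt_block(self, k: int = 5) -> str:
        """Render the selected lessons as bullet lines for a system prompt."""
        return "\n".join(f"- {lesson.text}" for lesson in self.top(k))

    def __len__(self) -> int:
        return len(self._lessons)
```

Using `deque(maxlen=...)` keeps eviction in one place: adding a 21st lesson drops the oldest one without extra bookkeeping.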

---

## 🗺️ Project Phase Completion

| Phase | Name | Status |
|-------|------|--------|
| Phase 0 | Strategic Alignment & Decisions | ✅ Complete |
| Phase 1 | Environment Upgrade (v1 → v2: 50-step, 55D, multi-agent) | ✅ Complete |
| Phase 2 | Multi-Agent System Design (Borrower + Regulator agents) | ✅ Complete |
| Phase 3 | Tool Calling System (3 tools + parser + anti-hacking) | ✅ Complete |
| Phase 4 | Reward System Redesign (sparse/delayed + 4 components) | ✅ Complete |
| Phase 5 | Self-Improvement & Reflection System | ✅ Complete |
| Phase 6 | GRPO Training Pipeline (Unsloth + TRL, 3-stage curriculum) | ✅ Complete |
| Phase 7 | Evaluation & Proof Generation | ✅ Complete |
| Phase 8 | GRPO Stabilization + Real Model Evaluation | ✅ Complete |

---

## 📁 File Inventory

| File | Role | Phase |
|------|------|-------|
| `server/app.py` | FastAPI server: /reset, /step, /info, /health | P0 |
| `server/intellicredit_env.py` | v2 core: WorldState, 50-step lifecycle, multi-agent | P1–P4 |
| `server/dataset.py` | Application generator (Anchor × Sector × Size × Tier) | P0 |
| `server/reward.py` | R1-R4 reward engine + settlement grader | P4 |
| `server/action_parser.py` | LLM text → tool call / decision parser (6 parse levels) | P3 |
| `server/tool_executor.py` | Read-only tool execution (financial, compliance, market) | P3 |
| `server/agent_loop.py` | Agent orchestrator + prompt injection | P3–P5 |
| `server/reflection.py` | Self-improvement + memory bank | P5 |
| `training/colab_grpo_3b_v2.py` | PRIMARY: Unsloth GRPO training script (A100, ~45 min) | P6/P8 |
| `training/generate_dataset.py` | 2,000-prompt GRPO dataset generator | P6 |
| `training/grpo_rewards.py` | 4 GRPO reward functions (R1-R4) | P6 |
| `training/train_grpo.py` | 3-stage curriculum pipeline | P6 |
| `evaluation/evaluate.py` | Multi-mode evaluation engine | P7 |
| `evaluation/compare.py` | Comparison tables + reward curves | P7 |
| `eval_llm.py` | LLM evaluation via HTTP (base vs trained) | P8 |
| `compare_results.py` | Bar chart comparison generator | P8 |
| `models.py` | Pydantic schemas (55D observation, action) | P1 |
| `inference.py` | LLM inference wrapper | P0 |
| `docs/mini_blog.md` | Mini blog: storytelling format, 2-min read, proof charts | P8 |
| `docs/blog.md` | Full technical blog post (~5,000 words) | P8 |
| `docs/assets/comparison.png` | Baseline vs GRPO results chart | P8 |
| `docs/assets/training_curves.png` | GRPO training curves (Mistral-7B, A100) | P8 |

**Total: ~10,000+ lines of implementation across 21 source files.**

---

## 📚 Citation

```bibtex
@article{intellicredit2026,
  title  = {IntelliCredit-X: A Multi-Agent Constrained MDP for MSME Credit Appraisal with GRPO Fine-Tuning},
  author = {Narayana, V S S K Sai and Jaiswal, Sujeet},
  year   = {2025},
  note   = {Meta × Hugging Face OpenEnv Hackathon},
  url    = {https://huggingface.co/spaces/vssksn/intellicredit-openenv}
}
```

---

*IntelliCredit-X, built by V S S K Sai Narayana & Sujeet Jaiswal | MIT License*