---
title: IntelliCredit CreditAppraisal v2
emoji: 🏦
colorFrom: indigo
colorTo: red
sdk: docker
app_port: 7860
pinned: false
---

# 🏦 IntelliCredit-X: Teaching an LLM to Think Like a Credit Officer

[![HF Space](https://img.shields.io/badge/🤗_Space-Live_Demo-blue)](https://huggingface.co/spaces/vssksn/intellicredit-openenv)
[![Dataset](https://img.shields.io/badge/🤗_Dataset-GRPO_Training_Data-green)](https://huggingface.co/datasets/vssksn/intellicredit-grpo-v2)
[![Model](https://img.shields.io/badge/🤗_Model-Mistral--7B_GRPO-orange)](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo)
[![GitHub](https://img.shields.io/badge/GitHub-intellicredit--openenv-black)](https://github.com/1919-14/intellicredit-openenv)
[![API Docs](https://img.shields.io/badge/API-Swagger_UI-purple)](https://vssksn-intellicredit-openenv.hf.space/docs)
[![Mini Blog](https://img.shields.io/badge/📝_Mini_Blog-Story_&_Results-ff6600)](./docs/mini_blog.md)
[![Blog](https://img.shields.io/badge/📖_Blog-Full_Technical_Writeup-teal)](./docs/blog.md)
[![Colab](https://img.shields.io/badge/Colab-GRPO_Training_Notebook-yellow)](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing)
[![License](https://img.shields.io/badge/License-MIT-red)](LICENSE)
[![Version](https://img.shields.io/badge/version-2.0-orange)](./PROJECT_SUMMARY.md)
[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blueviolet)](https://github.com/meta-pytorch/openenv)

**By V S S K Sai Narayana & Sujeet Jaiswal**

*Meta × Hugging Face OpenEnv Hackathon 2026*

---

## 📍 Start Here: Mini Blog (2-Minute Read)

> ### ➡️ [The Story of Teaching an AI to Catch What Humans Miss](./docs/mini_blog.md)
>
> *Quick read · Storytelling format · Both proof charts included · The "why" behind IntelliCredit-X*
> The real-world MSME problem → what we built → how we trained it → the proof it works → a real before/after example.
>
> **[🚀 Read the Mini Blog →](./docs/mini_blog.md)**

## 📖 Want the Full Technical Deep Dive?

> ### ➡️ [IntelliCredit-X: Full Technical Writeup (~5,000 words)](./docs/blog.md)
>
> *Deep technical walkthrough · Fully illustrated*
> Covers: 3-agent architecture → 55D observation space → tool calling system → sparse reward design → GRPO 3-stage curriculum → training curves analysis → qualitative before/after examples → 4 critical bugs we had to fix → honest limitations.
>
> **[📖 Read the Full Blog →](./docs/blog.md)**

---

> **IntelliCredit-X** is an OpenEnv-compliant multi-agent reinforcement learning environment where an LLM learns to act as a regulatory-compliant Senior Credit Officer: investigating fraud signals via tool calls, managing a live loan portfolio across 50-step episodes, and respecting hard RBI mandates enforced by a RegulatorAgent. After GRPO fine-tuning of Mistral-7B, NPA rate halved on the hardest task and total reward improved 10×.

---

## 📊 Results at a Glance

*Baseline Mistral-7B-Instruct-v0.3 (blue) vs.
GRPO-trained IntelliCredit model (green): **zero regressions across all 24 metric-task combinations.***

| Task | Metric | Base Model | GRPO Model | Δ |
|------|--------|-----------|-----------|---|
| Task 1 (Easy) | Score | 0.900 | **0.955** | **+0.055 ✅** |
| | Accuracy | 80.0% | **86.7%** | **+6.7% ✅** |
| | Capital Util | 40.0% | **60.0%** | **+20.0% ✅** |
| Task 2 (Medium) | Score | 1.000 | 1.000 | ceiling ✅ |
| | Total Reward | 10.305 | **10.584** | **+0.279 ✅** |
| Task 3 (Hard) | Score | 0.767 | **0.833** | **+0.067 ✅** |
| | Total Reward | 0.215 | **2.491** | **+2.276 ✅ (>10×)** |
| | **NPA Rate** | **16.7%** | **8.3%** | **−8.3% ✅ (halved)** |

---

## 🎯 Core Motivation

The MSME lending sector in India processes over **100,000 loan applications daily**. Current bottlenecks:

- A senior loan officer reviews **~16 applications/day**, roughly 0.016% of daily volume per officer
- **12–15% annual default rates** due to poor risk assessment
- **Manual cross-referencing** of GST, MCA, CIBIL, and court records takes days per application
- **No explainable audit trail**: decisions rest on "gut feeling" under time pressure

**Our approach:** Create a training ground where an AI learns to *think* like the best credit officers by gathering evidence, detecting hidden fraud, respecting non-negotiable regulations, and managing portfolio risk across time.

---

## ⚙️ How the Environment Works (v2.0)

An agent plays a **50-step Credit Committee Episode**:

```
Step T = 1..50:
  1. Environment generates an MSME application (Anchor × Sector × Size × Tier)
  2. Agent sees 55D observation (application + portfolio + macro + alerts + memory)
  3. Agent may call up to 4 investigation tools
  4. Agent submits: APPROVE (0) | CONDITIONAL (1) | REJECT (2)
  5. Reward computed: R1 (correctness) + R2 (hard rules) + R3 (format) + R4 (portfolio)
  6. Approved loans join portfolio
  7. RegulatorAgent audits at jittered steps ≈ 10/20/30/40/50
  8. Loan maturity events fire T+10 to T+30 (delayed NPA consequences)
  9. At step 50: settlement reward + Reflection Module activates
```

### Multi-Agent System

| Agent | Simulated By | Responsibility |
|-------|-------------|----------------|
| **Credit Officer** | LLM (GRPO fine-tuned) | Reviews applications, calls tools, makes decisions |
| **BorrowerAgent** | Programmatic | Reapplies after rejection with improved *surface* metrics (hidden PD unchanged or worse) |
| **RegulatorAgent** | Programmatic | Audits at ≈steps 10/20/30/40/50 (±1 jitter), shuts down after 3 consecutive failures |

#### BorrowerAgent: Fraud Simulation Detail

When rejected, BorrowerAgent (1) waits 3–5 steps, then (2) reapplies with DSCR +8%, Collateral +15%, and a director guarantee added. **Hidden PD stays the same or worsens.** Max 3 attempts.

Detection signals:

- `Dim 49: borrower_persistence_score` (0.0 = 1st attempt, 0.5 = 2nd, **1.0 = 3rd attempt = maximum manipulation signal**)
- `alert_state[4]`, the REPEAT_APPLICANT flag
- `check_compliance_status()`, which reveals unchanged GST/MCA history

#### RegulatorAgent: Audit Mechanics

Audits fire at ≈steps 10/20/30/40/50 (±1 jitter to prevent timing exploitation).

| Constraint | Clean | Warning | Violation | Penalty |
|-----------|-------|---------|-----------|--------|
| NPA Rate | <3% | 3–5% | ≥5% | **−8.0** |
| CRAR | >15% | 12.5–15% | <12.5% | **−15.0 + possible termination** |
| Sector Concentration | <25% | 25–30% | >30% | **−8.0** |
| Single Borrower Limit | <12% | 12–15% | >15% | **−5.0** |

Escalation: 0 failures = normal → 1 = warning → 2 = capital penalty (−10% available capital) → 3 = **REGULATORY_SHUTDOWN** (−50.0 terminal penalty)

### 📅 50-Step Episode Lifecycle

```
Steps 1–10   │ EARLY PHASE   │ Clean profiles, build portfolio baseline
Step ~10     │ AUDIT #1      │ NPA / CRAR / sector concentration checked
Steps 11–20  │ MIDDLE PHASE  │ Forensic RED alerts appear; repeat applicants reapply
Step ~20     │ MACRO SHOCK   │ GDP contraction fires; 1–2 sectors enter stressed state
Step ~20     │ AUDIT #2      │
Steps 21–30  │ CRISIS PHASE  │ Maturity events fire: Repaid +10.0 / Defaulted −15.0
Step ~30     │ AUDIT #3      │
Steps 31–40  │ RECOVERY      │ Balance new approvals vs NPA cleanup
Step ~40     │ AUDIT #4      │
Steps 41–50  │ FINAL PHASE   │ Survival, capital preservation
Step 50      │ SETTLEMENT    │ score = 0.30×yield + 0.30×(1−npa) + 0.20×compliance + 0.20×capital_util
             │ REFLECTION    │ Lesson extraction activates for next episode
```

---

## 🧠 Training Curves

*IntelliCredit GRPO v2 training across 3 curriculum stages.
Note the key inflection points at stage transitions (dashed lines):*

- **GRPO Loss (red):** Controlled upward drift from ~0 → 0.05; the policy is meaningfully diverging from the base model
- **Mean Reward (blue):** Starts at −2.0 (random violations), crosses zero by step 10, stabilizes near +0.5–+1.0: **the environment is learnable**
- **KL Divergence (purple):** Grows to ~0.04–0.08; the model learned new behaviors while preserving language capability
- **`submit_pct` (teal):** Format compliance climbs from 0% → 40–65% as the model acquires the task's vocabulary

---

## 🛑 Regulatory Rules (6 Non-Negotiable Hard Rules)

| Rule | Condition | Action |
|------|-----------|--------|
| **HR-01** | DSCR < 1.0 | Mandatory REJECT + −2.0 penalty |
| **HR-02** | Director disqualified (DIN < 0.1) | Mandatory REJECT + −2.0 penalty |
| **HR-03** | RED forensic alert present | Mandatory REJECT + −2.0 penalty |
| **HR-04** | Cheque bounce rate > 25% | Mandatory REJECT + −2.0 penalty |
| **HR-05** | GST compliance < 40% | Mandatory REJECT + −2.0 penalty |
| **HR-06** | Severe adverse media (> 0.80) | Mandatory REJECT + −2.0 penalty |

### Portfolio Constraints

| Constraint | Threshold | Consequence |
|------------|-----------|-------------|
| CRAR | > 12.5% | Episode terminates if breached |
| NPA Rate | < 5% | Episode terminates if breached |
| Sector Concentration | < 30% | −8.0 penalty per audit |
| Single Borrower | < 15% | −5.0 penalty per audit |

---

## 👁️ Observation Space (55D)

The agent observes a **55-dimensional vector** bounded `[−1.0, +1.0]`.
*(−1.0 = sentinel for missing/masked data, teaching the agent that data absence itself is a risk signal.)*

| Group | Dims | Description |
|-------|------|-------------|
| Application Features | 0–24 | 25 financial/forensic/governance ratios |
| Portfolio State | 25–34 | Capital deployed, NPA rate, CRAR, provisioning coverage, sector flags |
| Macro State | 35–39 | Systemic stress, GDP growth, inflation, credit cycle phase |
| Alert State | 40–44 | Running RED/YELLOW alert tallies from episode |
| **Memory Features** *(v2 NEW)* | **45–54** | **Agent's own behavioral history encoded as state** |

### Application Features (Dims 0–24): Key Metrics

| Category | Metrics |
|----------|---------|
| Debt Serviceability | DSCR, Current Ratio, Debt-to-Equity, EBITDA Margin |
| Collateral | Collateral Coverage Ratio, RONW |
| Banking Behavior | OD Utilisation, CC Volatility, Cheque Bounce Rate, Working Capital Cycle |
| GST/Tax | GST CAGR, GST 2A-3B Gap, ITC Mismatch, GST Alignment Score |
| Fraud Signals | Related-Party Transactions, Circular Trading Score |
| Governance | Promoter Litigation Count, MCA Charges, Adverse Media Sentiment, DIN Score |

**Key memory dimensions (Dims 45–54):**

- `Dim 49: borrower_persistence_score`: 0.0 = 1st attempt, 0.5 = 2nd, **1.0 = 3rd attempt (maximum manipulation signal)**
- `Dim 50: audit_risk_score`: proximity to next regulator audit
- `Dim 51: capital_buffer_ratio`: headroom above minimum CRAR
- `Dim 53: episode_progress`: normalized step count (0.0–1.0)

---

## 🕹️ Action Space + Tool Calling

**Discrete(3):** APPROVE(0) | CONDITIONAL(1) | REJECT(2), plus optional tool calls before deciding.
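
The index ranges in the observation table and the three-way action encoding can be captured in a few lines. This is only a sketch: `OBS_GROUPS`, `Decision`, `split_observation` and `missing_fraction` are hypothetical names chosen for illustration, not the schemas defined in `models.py`.

```python
from enum import IntEnum

# Index ranges copied from the observation-space table (end index exclusive).
# The dictionary keys are illustrative labels, not identifiers from the repo.
OBS_GROUPS = {
    "application": slice(0, 25),
    "portfolio": slice(25, 35),
    "macro": slice(35, 40),
    "alerts": slice(40, 45),
    "memory": slice(45, 55),
}
MISSING = -1.0  # sentinel for missing/masked data; also the lower bound of the range


class Decision(IntEnum):
    """Discrete(3) action encoding."""
    APPROVE = 0
    CONDITIONAL = 1
    REJECT = 2


def split_observation(obs):
    """Split a flat 55-value observation into its five named groups."""
    if len(obs) != 55:
        raise ValueError(f"expected 55 values, got {len(obs)}")
    return {name: list(obs[s]) for name, s in OBS_GROUPS.items()}


def missing_fraction(values):
    """Share of values in a group that carry the -1.0 sentinel."""
    return sum(1 for v in values if v == MISSING) / len(values)
```

For example, `missing_fraction(split_observation(obs)["application"])` gives a quick measure of how much of an application's data is masked, which the environment treats as a risk signal in its own right.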

### Investigation Tools (up to 4 per step)

| Tool | Returns | Best Used When |
|------|---------|----------------|
| `get_financial_report(company_id)` | 3yr revenue trend, EBITDA, auditor remarks, related-party txns | Borderline financials, need trend confirmation |
| `check_compliance_status(company_id)` | DIN status, NCLT cases, GST filings, CIBIL, prior defaults | RED alert present, low governance score |
| `get_market_intelligence(sector)` | Sector stress, RBI advisory, portfolio exposure, peer NPA rate | Approaching 30% concentration limit |
| `submit_decision(action, reasoning)` | Finalizes step (reasoning ≥ 50 chars required) | After investigation complete |

### Action Parser: Priority Order (`server/action_parser.py`)

The LLM outputs free-form text, which is parsed in strict priority order:

1. **Tool call detected** → `get_financial_report(...)` → executes tool, does **not** advance step
2. **`submit_decision(action, reasoning)`** → validates format, advances step counter
3. **Standalone keyword** → `APPROVE` / `CONDITIONAL` / `REJECT` scanned in text
4. **Default fallback** → REJECT (safe default) + logs `parse_failure=True`

Anti-abuse: multiple decisions → last wins; reasoning < 50 chars → penalty; empty reasoning → decision blocked.
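
The priority order above can be sketched with a few regular expressions. This is a minimal illustration and not the code in `server/action_parser.py`: the `Parsed` dataclass, the regexes, and the `short_reasoning` flag are assumptions, and the "empty reasoning blocks the decision" rule is left out because the README does not say what blocking does.

```python
import re
from dataclasses import dataclass
from typing import Optional

TOOLS = ("get_financial_report", "check_compliance_status", "get_market_intelligence")
ACTIONS = {"APPROVE": 0, "CONDITIONAL": 1, "REJECT": 2}
MIN_REASONING_CHARS = 50  # threshold stated in the tools table

# Hypothetical patterns: a tool call with one optional quoted argument,
# and submit_decision(action, "reasoning") with a quoted reasoning string.
TOOL_RE = re.compile(r"\b(%s)\(\s*['\"]?([^'\")]*)['\"]?\s*\)" % "|".join(TOOLS))
SUBMIT_RE = re.compile(
    r"submit_decision\(\s*['\"]?(\w+)['\"]?\s*,\s*['\"](.*?)['\"]\s*\)", re.S
)
KEYWORD_RE = re.compile(r"\b(APPROVE|CONDITIONAL|REJECT)\b")


@dataclass
class Parsed:
    kind: str                      # "tool" or "decision"
    name: Optional[str] = None     # tool name when kind == "tool"
    arg: Optional[str] = None      # raw tool argument
    action: Optional[int] = None   # 0/1/2 when kind == "decision"
    reasoning: str = ""
    short_reasoning: bool = False  # reasoning under 50 chars (penalised by the env)
    parse_failure: bool = False


def _to_action(raw):
    """Accept either a digit (0/1/2) or an action name; return None if invalid."""
    if raw.isdigit():
        value = int(raw)
        return value if value in ACTIONS.values() else None
    return ACTIONS.get(raw.upper())


def parse_action(text):
    # 1. A tool call takes priority and does not advance the step.
    tool = TOOL_RE.search(text)
    if tool:
        return Parsed(kind="tool", name=tool.group(1), arg=tool.group(2).strip())
    # 2. submit_decision(...); with several decisions, the last one wins.
    submits = SUBMIT_RE.findall(text)
    if submits:
        raw, reasoning = submits[-1]
        action = _to_action(raw)
        if action is not None:
            return Parsed(kind="decision", action=action, reasoning=reasoning,
                          short_reasoning=len(reasoning) < MIN_REASONING_CHARS)
    # 3. Standalone keyword anywhere in the text; again the last one wins.
    keywords = KEYWORD_RE.findall(text)
    if keywords:
        return Parsed(kind="decision", action=ACTIONS[keywords[-1]], short_reasoning=True)
    # 4. Safe default: REJECT, flagged as a parse failure.
    return Parsed(kind="decision", action=ACTIONS["REJECT"], parse_failure=True)
```

Checking tool calls first means a completion that both investigates and decides in one message is treated as an investigation, which matches rule 1's "does not advance step" behaviour.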

---

## 📈 Reward System

| Component | Weight | Range | Description |
|-----------|--------|-------|-------------|
| R1: Decision Correctness | 40% | [−2.0, +1.0] | PD-based: low PD+APPROVE=+1.0; high PD+APPROVE=−2.0 |
| R2: Hard Rule Compliance | 30% | [−2.0, +0.5] | HR+REJECT=+0.5; HR+APPROVE=−2.0 |
| R3: Format Compliance | 15% | [−0.3, +0.3] | `submit_decision()` used=+0.3; parse failure=−0.3 |
| R4: Portfolio Awareness | 15% | [−0.8, +0.3] | NPA>8%+risky approve=−0.5; healthy approve=+0.2 |

**Delayed Events:** Loan maturity fires T+10 to T+30 after approval (Repaid: +10.0, Defaulted: −15.0×(1−recovery))

**Audit Bonus:** +2.0 clean audit / −8.0 violation / −15.0 capital breach / −50.0 shutdown (3rd failure)

**Settlement (step 50):** `0.30×yield + 0.30×(1−npa) + 0.20×compliance + 0.20×capital_util`

### Survival Bonus (Every 10 Steps)

| CRAR Level | Bonus | Meaning |
|-----------|-------|--------|
| ≥ 15% | +0.10 | Healthy capital buffer |
| 12.5–15% | +0.05 | Marginal: caution signal |
| < 12.5% | Episode terminates | Capital inadequacy = bank failure |

## 🛡️ Anti-Gaming Mechanisms (10 Independent Safeguards)

| # | Mechanism | What It Prevents |
|---|-----------|------------------|
| 1 | Hidden PD: agent cannot see true default probability | Cannot directly optimize against ground truth |
| 2 | Read-only tools: cannot mutate environment state | Tool calls cannot manipulate outcomes |
| 3 | Max 4 tool calls enforced at env level (not agent) | Cannot bypass limit via prompt tricks |
| 4 | Reasoning quality check: empty text blocks decision | Cannot submit empty reasoning for format reward |
| 5 | Redundant tool call penalty (−0.1 each) | Prevents information-flooding strategy |
| 6 | Delayed NPA: defaults arrive T+10 to T+30 | Cannot see future consequences to optimize backward |
| 7 | World state locked: agent has no write access | Cannot modify portfolio variables directly |
| 8 | Deterministic episode seeds | No lucky randomness: same episode every run |
| 9 | Multiple independent reward functions | Gaming one component doesn't win overall |
| 10 | Jittered audit timing (±1 step) | Cannot predict exact audit step to game timing |

---

## 🤖 GRPO Training Pipeline: 2-Stage Approach

The final model [`vssksn/intellicredit-mistral-7b-grpo`](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) was trained using a **2-stage pipeline**: offline GRPO for speed and domain knowledge, then **online GRPO directly against the live IntelliCredit environment** for true behavioral alignment.

```
┌─────────────────────────────┐     ┌───────────────────────────────┐
│  STAGE 1: Offline GRPO      │     │  STAGE 2: Online GRPO         │
│  (Speed-Optimised)          │     │  (Environment-Native)         │
│                             │     │                               │
│  Model : Mistral-7B-v0.3    │     │  Model : Mistral-7B (Stg 1)   │
│  Engine: Unsloth + TRL      │     │  Env   : Live HF Space        │
│  Data  : 2,000 prompts      │     │  Data  : Real episodes        │
│  Reward: 4 local functions  │     │  Reward: /step endpoint 100%  │
│  Speed : ~45 minutes        │     │  Type  : True Online RL       │
│  Goal  : Domain knowledge   │     │  Goal  : True env alignment   │
└──────────────┬──────────────┘     └───────────────┬───────────────┘
               │                                    │
               └─────────────────┬──────────────────┘
                                 │
                                 ▼
               vssksn/intellicredit-mistral-7b-grpo
           Post-trained on live environment interactions
```

### Stage 1: Offline GRPO (Speed-Optimised) 🚀

**[📓 Stage 1 Colab Notebook](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing)**: Mistral-7B + Unsloth, A100, ~45 minutes

Pre-trains on a curated 2,000-prompt dataset for maximum training speed and domain knowledge transfer.

**Training Dataset:**

- **2,000 prompts**: 400 per task level (task1–task5), ~2,400 chars each
- Ground truth metadata: hidden PD, optimal action, hard rules, alerts, sector, CRAR, NPA
- Distribution: **47.2%** hard rules triggered | **28.1%** RED forensic alerts
- Published: [vssksn/intellicredit-grpo-v2](https://huggingface.co/datasets/vssksn/intellicredit-grpo-v2)

**3-Stage Curriculum (after an SFT warmup):**

| Stage | Data | LR | Temperature | Goal |
|-------|------|----|-------------|------|
| Stage 0 (SFT Warmup) | Mixed | 5e-5 | n/a | Bootstrap `submit_decision()` format compliance |
| Stage 1 | task1 (Easy) | 5e-6 | 0.9 | Hard rule recognition on clean profiles |
| Stage 2 | task1 + task2 | 5e-6 | 0.9 | Forensic alert detection, tool call initiation |
| Stage 3 | All tasks | 2e-6 | 0.8 | Long-horizon portfolio management |

```
Config: rank=16 QLoRA (Unsloth), seq_len=2048, 8 generations/prompt
        batch=2 + grad_accum=8 (effective=16), KL β=0.001
```

### Stage 2: Online GRPO (Environment-Native) 🌍

**[🌍 Stage 2 Notebook: Online Training (Colab)](https://colab.research.google.com/github/1919-14/intellicredit-openenv/blob/main/training/colab_online_grpo.ipynb)**: live env, 50-step episodes, real rewards, Mistral-7B

Post-trains the Stage 1 model by **directly interacting with the live IntelliCredit environment**. Every reward signal comes from the actual `/step` endpoint, so this is true online RL rather than a proxy.
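
Both stages optimise with GRPO, which scores each completion relative to the other completions sampled for the same prompt (8 per prompt in the Stage 1 config). A minimal sketch of that group-relative advantage follows. The function name, the use of population standard deviation, and the `eps` value are assumptions for illustration; the training notebooks may normalise differently (some trainers use the sample standard deviation).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is only rewarded for beating its siblings, not for landing
    on an easy prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical rewards for one prompt (values invented for the example).
group = [1.0, -2.0, 0.5, 0.5, -2.0, 1.0, 0.3, -0.3]
advantages = group_relative_advantages(group)
```

When every completion in a group earns the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason format-only rewards stop being informative once the model reliably calls `submit_decision()`.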

| Feature | Detail |
|---------|--------|
| Environment | [vssksn-intellicredit-openenv.hf.space](https://vssksn-intellicredit-openenv.hf.space) (live HTTP) |
| Episode length | **50 steps**: full credit committee lifecycle |
| **Reward source** | **`/step` endpoint, 100% environment-native** |
| Tool calling | Multi-turn: tools → evidence → `submit_decision()` |
| Reflection | Cross-episode memory bank (6 lesson categories, FIFO 20) |
| Curriculum | 3 phases: task1 → task3 → all 5 tasks, temp 1.2→0.8 |
| Model published | [vssksn/intellicredit-mistral-7b-grpo](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) |

### 🔧 Critical Training Bug Fixes

| Bug | Root Cause | Fix Applied |
|-----|-----------|-------------|
| CUDA Index OOB | Unsloth pads vocab 32768→32832; padded token IDs indexed into smaller training logits | Clamp all IDs to `vocab_size−1` + `valid_mask` to skip OOB |
| Sequence Mismatch | `full_ids` exceeded 2048 before forward pass; logits truncated → shape crash | Enforce `full_ids = full_ids[:, :MAX_SEQ_LEN]` before forward |
| Loss Scale Instability | Raw log-prob sum scaled with sequence length → exploding gradients | Switch to per-token average: `loss = -sum(log_probs) / n_valid_tokens` |
| Flat KL Divergence | `clamp(min=0)` → KL=0 whenever the new policy was more confident than the reference | Changed to `abs()` so the KL term no longer collapses to zero in that case |
| Zero-LP Episodes | Prompt filled entire 2048-token context → 0 completion tokens | Skip with `continue` when `sum(log_probs) == 0` |

---

## 🪞 Self-Improvement Reflection System

GRPO updates weights. The Reflection Module improves the model **without retraining** by injecting structured lessons from episode failures into the next episode's system prompt.
```
Episode N   → Analyze all steps where reward < 0
            → Extract lessons by failure type (6 categories)
            → Store top 20 lessons in memory_bank.json (FIFO eviction, deduplicated)
Episode N+1 → Inject top 5 lessons into system prompt Layer 3 → better decisions
```

### 6 Lesson Trigger Types

| Trigger | Lesson Injected | Severity |
|---------|-----------------|----------|
| Hard Rule Violation | `RULE: When [condition], always REJECT` | Critical |
| Delayed Default | `CAUTION: Loans with [pattern] defaulted T+N steps later` | High |
| Audit Failure | `COMPLIANCE: Audit failed due to [metric breach]` | High |
| Borrower Manipulation | `FRAUD RISK: Repeat applicant with [pattern] defaulted` | Critical |
| Macro Shock Loss | `MACRO: During [state], be conservative with [sector]` | Medium |
| Portfolio Overexposure | `PORTFOLIO: NPA reached X%. Tighten approvals.` | High |

**Verified result (base model, no fine-tuning, 3 consecutive episodes):**

| Episode | Score | Improvement |
|---------|-------|-------------|
| 1 | 0.213 | Baseline |
| 2 | 0.265 | **+24.4% ✅** |
| 3 | 0.304 | **+43.2% ✅** |

43% improvement purely through in-context lesson injection, with zero weight changes.
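
The store-then-inject loop described in this section can be sketched with a bounded queue. This is an assumption-laden illustration, not the code in `server/reflection.py`: the `MemoryBank` class, `build_system_prompt`, and the choice to rank the "top 5" by severity are all hypothetical, since the README does not say how the top lessons are selected.

```python
from collections import deque

# Assumed ordering, using the severity labels from the trigger table.
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2}


class MemoryBank:
    """Deduplicated lesson store with FIFO eviction (capacity 20 by default)."""

    def __init__(self, capacity=20):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self._lessons = deque(maxlen=capacity)

    def __len__(self):
        return len(self._lessons)

    def add(self, text, severity):
        """Store a lesson; return False if an identical lesson already exists."""
        if any(text == stored for stored, _ in self._lessons):
            return False
        self._lessons.append((text, severity))
        return True

    def top(self, k=5):
        """Return up to k lessons, most severe first (older first on ties)."""
        ranked = sorted(self._lessons, key=lambda item: SEVERITY_RANK[item[1]])
        return [text for text, _ in ranked[:k]]


def build_system_prompt(base_prompt, bank, k=5):
    """Append the top-k lessons to the base system prompt, if there are any."""
    lessons = bank.top(k)
    if not lessons:
        return base_prompt
    bullets = "\n".join(f"- {lesson}" for lesson in lessons)
    return f"{base_prompt}\n\nLessons from previous episodes:\n{bullets}"
```

Keeping the bank small and the injected list smaller bounds the prompt overhead per step, which matters when the context window is already tight at 2048 tokens.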

---

## 🏆 Task Descriptions

| Task | Difficulty | Steps | Key Challenge |
|------|-----------|-------|---------------|
| `task1` | 🟢 Easy | 50 | Clean profiles, basic APPROVE/REJECT |
| `task2` | 🟡 Medium | 50 | Forensic alerts (YELLOW/RED), tool investigation |
| `task3` | 🔴 Hard | 50 | Macro shocks + missing data + repeat applicants |
| `task4` | 🔥 Expert | 50 | Hard-rule violations + all adversarial patterns |
| `task5` | ⚡ Master | 50 | Full: CRAR limits + cascading NPAs + 5 audits |

---

## 💻 Quick Start

### Try the Live API

```bash
# Start an episode
curl -X POST https://vssksn-intellicredit-openenv.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "demo-001", "seed": 42, "task_id": "task2"}'

# Submit a decision (0=APPROVE, 1=CONDITIONAL, 2=REJECT)
curl -X POST https://vssksn-intellicredit-openenv.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "demo-001", "action": {"decision": 2}}'
```

**→ [Full Swagger UI](https://vssksn-intellicredit-openenv.hf.space/docs)**

### Local Setup

```bash
git clone https://github.com/1919-14/intellicredit-openenv.git --branch v2
cd intellicredit-openenv
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
python -m server.app
# → http://localhost:7860/docs
```

### Evaluate the GRPO Model

```bash
# Run GRPO model against environment
python eval_llm.py \
  --model vssksn/intellicredit-mistral-7b-grpo \
  --env-url http://localhost:7860 \
  --out grpo_results.json

# Compare vs base model
python eval_llm.py \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --env-url http://localhost:7860 \
  --out base_results.json

# Generate comparison chart
python compare_results.py \
  --baseline base_results.json \
  --after grpo_results.json \
  --out comparison.png
```

### Docker

```bash
docker build -t intellicredit-v2 .
docker run -p 7860:7860 intellicredit-v2
```

---

## 📁 Project Structure

```
intellicredit-openenv/
├── server/
│   ├── app.py                  # FastAPI server: /reset, /step, /info, /health
│   ├── intellicredit_env.py    # v2 core: WorldState, 50-step lifecycle, multi-agent
│   ├── dataset.py              # Application generator (Anchor × Sector × Size × Tier)
│   ├── reward.py               # R1-R4 reward engine + settlement grader
│   ├── action_parser.py        # LLM text → tool call / decision parser (6-level)
│   ├── tool_executor.py        # Read-only tool execution (financial, compliance, market)
│   ├── agent_loop.py           # Agent orchestrator + prompt injection + step logger
│   └── reflection.py           # Self-improvement + memory bank system
│
├── training/
│   ├── colab_grpo_3b_v2.py     # ← PRIMARY: Unsloth GRPO training (A100, ~45 min)
│   ├── generate_dataset.py     # 2000-prompt GRPO dataset generator
│   ├── grpo_rewards.py         # 4 GRPO reward functions (R1-R4)
│   └── train_grpo.py           # 3-stage curriculum pipeline
│
├── evaluation/
│   ├── evaluate.py             # Multi-mode evaluation engine (baseline/reflection/GRPO)
│   └── compare.py              # Comparison tables + reward curves (4-panel PNG)
│
├── docs/
│   ├── blog.md                 # Full technical blog post (~5,000 words)
│   └── assets/
│       ├── comparison.png      # Baseline vs GRPO results chart
│       └── training_curves.png # GRPO training curves (Mistral-7B, A100)
│
├── eval_llm.py                 # LLM evaluation via HTTP (base vs trained)
├── compare_results.py          # Bar chart generator (8 metrics × 3 tasks)
├── baseline_results.json       # RuleBasedAgent reference scores
├── memory_bank.json            # Persistent cross-episode lesson storage (auto-generated)
├── inference.py                # LLM inference wrapper (HF API)
├── models.py                   # Pydantic schemas (55D observation, action)
├── client.py                   # HTTP client for environment interaction
├── openenv.yaml                # OpenEnv framework config
├── PROJECT_SUMMARY.md          # Complete project summary (all 8 phases)
├── Dockerfile                  # HF Spaces Docker deployment
└── requirements.txt            # Python dependencies
```

## 🧪 Evaluation Methodology

Two evaluation approaches:

**Approach 1: Direct Python (`evaluation/evaluate.py`).** Tests agents by calling `IntelliCreditEnvironment` directly. Agents: `RuleBasedAgent` (optimal), `RandomAgent` (lower bound), `GreedyApproveAgent`. Output: `baseline_results.json`.

**Approach 2: HTTP API (`eval_llm.py`).** Tests the actual LLM against a running server. Since `/step` returns only `{observation, reward, done}`, scores are computed locally:

| Metric | Formula | Weight |
|--------|---------|--------|
| Accuracy | steps with positive reward / total steps | 0.5 |
| HR Compliance | 1 − (steps with reward < −5) / total steps | 0.3 |
| Survival Rate | 1.0 if all 50 steps completed without shutdown | 0.2 |
| **Final Score** | accuracy×0.5 + hr_compliance×0.3 + survival×0.2 | n/a |

## 📋 Version History

| Feature | v1.0 | v2.0 (Current) |
|---------|------|----------------|
| Episode Length | 12 steps | **50 steps** (≈4×) |
| Observation Dims | 45D | **55D** (+10 memory features) |
| Agent Count | 1 | **3** (Credit Officer + Borrower + Regulator) |
| Reward Type | Dense per-step | **Delayed + sparse** (realistic credit risk) |
| Tool Calling | ❌ | **✅ 3 tools, max 4 calls/step** |
| Self-Improvement | ❌ | **✅ Cross-episode reflection module** |
| GRPO Fine-Tuning | ❌ | **✅ Mistral-7B, A100, ~45 min** |
| Deployment | Local only | **✅ Docker + HF Spaces** |

## 📊 Baseline Agent Results (RuleBasedAgent, 25 episodes)

| Task | Avg Score | Accuracy | NPA Rate |
|------|-----------|----------|----------|
| task1 (Easy) | 0.389 | 77.9% | 4.8% |
| task2 (Medium) | 0.325 | 66.6% | 8.9% |
| task3 (Hard) | 0.288 | 81.5% | 20.2% |
| task4 (Expert) | 0.265 | 85.9% | 26.7% |
| task5 (Master) | 0.251 | 77.8% | 6.7% |
| **Overall** | **0.304** | **77.9%** | **13.4%** |

---

## 🔗 All Links

| Resource | Link |
|----------|------|
| 🤗 **Live Environment** | [huggingface.co/spaces/vssksn/intellicredit-openenv](https://huggingface.co/spaces/vssksn/intellicredit-openenv) |
| 🤗 **GRPO Model** | [huggingface.co/vssksn/intellicredit-mistral-7b-grpo](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) |
| 🤗 **Training Dataset** | [huggingface.co/datasets/vssksn/intellicredit-grpo-v2](https://huggingface.co/datasets/vssksn/intellicredit-grpo-v2) |
| 💻 **GitHub (v2 branch)** | [github.com/1919-14/intellicredit-openenv/tree/v2](https://github.com/1919-14/intellicredit-openenv/tree/v2) |
| 📖 **API Swagger** | [vssksn-intellicredit-openenv.hf.space/docs](https://vssksn-intellicredit-openenv.hf.space/docs) |
| 📝 **Mini Blog (Story & Results)** | [docs/mini_blog.md](./docs/mini_blog.md) |
| 📖 **Full Technical Blog** | [docs/blog.md](./docs/blog.md) |
| 📓 **Colab Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing) |
| 📊 **Project Summary** | [PROJECT_SUMMARY.md](./PROJECT_SUMMARY.md) |
| 📋 **Env Info API** | [/info endpoint](https://vssksn-intellicredit-openenv.hf.space/info) |

---

## 🐳 Docker Deployment

```bash
docker build -t intellicredit-v2 .
docker run -p 7860:7860 intellicredit-v2

# With HF token for LLM inference:
docker run -p 7860:7860 -e HF_TOKEN="your-token" intellicredit-v2
```

## 💻 Hardware Requirements

| Component | Environment Server | GRPO Training |
|-----------|-------------------|---------------|
| CPU | 2 vCPUs minimum | 8+ cores |
| RAM | 2 GB minimum | 32 GB minimum |
| GPU | **Not required** | **A100 80GB mandatory** |
| Storage | 500 MB | ~30 GB (model checkpoints) |
| Training Time | n/a | ~45 minutes |

## 🔐 Environment Variables

| Variable | Description | Default |
|----------|-------------|--------|
| `HF_TOKEN` | Hugging Face API token | Required for `inference.py` |
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | LLM model for inference | `meta-llama/Llama-3.3-70B-Instruct` |
| `ENV_URL` | Environment server URL (for eval/training) | `http://localhost:7860` |

---

## 📚 Citation

```bibtex
@article{intellicredit2025,
  title  = {IntelliCredit-X: A Multi-Agent Constrained MDP for MSME Credit Appraisal with GRPO Fine-Tuning},
  author = {Narayana, V S S K Sai and Jaiswal, Sujeet},
  year   = {2026},
  note   = {OpenEnv Hackathon Submission, Meta × Hugging Face},
  url    = {https://huggingface.co/spaces/vssksn/intellicredit-openenv}
}
```

---

## 📜 License

MIT License. See [LICENSE](LICENSE) for details.

---

*Built by **V S S K Sai Narayana** & **Sujeet Jaiswal** for the Meta × Hugging Face OpenEnv Hackathon 2026.*