---
title: IntelliCredit CreditAppraisal v2
emoji: 🏦
colorFrom: indigo
colorTo: red
sdk: docker
app_port: 7860
pinned: false
---
# 🏦 IntelliCredit-X: Teaching an LLM to Think Like a Credit Officer
[Live Environment](https://huggingface.co/spaces/vssksn/intellicredit-openenv)
[Training Dataset](https://huggingface.co/datasets/vssksn/intellicredit-grpo-v2)
[GRPO Model](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo)
[GitHub](https://github.com/1919-14/intellicredit-openenv)
[API Swagger](https://vssksn-intellicredit-openenv.hf.space/docs)
[Mini Blog](./docs/mini_blog.md)
[Full Technical Blog](./docs/blog.md)
[Colab Training Notebook](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing)
[License](LICENSE)
[Project Summary](./PROJECT_SUMMARY.md)
[OpenEnv](https://github.com/meta-pytorch/openenv)
**By V S S K Sai Narayana & Sujeet Jaiswal**
*Meta × Hugging Face OpenEnv Hackathon 2026*
---
## Start Here: Mini Blog (2-Minute Read)
> ### ➡️ [The Story of Teaching an AI to Catch What Humans Miss](./docs/mini_blog.md)
>
> *Quick read · Storytelling format · Both proof charts included · The "why" behind IntelliCredit-X*
> The real-world MSME problem → what we built → how we trained it → the proof it works → a real before/after example.
>
> **[Read the Mini Blog →](./docs/mini_blog.md)**
## Want the Full Technical Deep Dive?
> ### ➡️ [IntelliCredit-X: Full Technical Writeup (~5,000 words)](./docs/blog.md)
>
> *Deep technical walkthrough · Fully illustrated*
> Covers: 3-agent architecture → 55D observation space → tool calling system → sparse reward design → GRPO 3-stage curriculum → training curves analysis → qualitative before/after examples → 4 critical bugs we had to fix → honest limitations.
>
> **[Read the Full Blog →](./docs/blog.md)**
---
> **IntelliCredit-X** is an OpenEnv-compliant multi-agent reinforcement learning environment in which an LLM learns to act as a regulatory-compliant Senior Credit Officer: it investigates fraud signals via tool calls, manages a live loan portfolio across 50-step episodes, and respects hard RBI mandates enforced by a RegulatorAgent. After GRPO fine-tuning of Mistral-7B, the NPA rate on Task 3 (Hard) halved (16.7% → 8.3%) and total reward rose roughly 10× (0.215 → 2.491).
---
## Results at a Glance
*Baseline Mistral-7B-Instruct-v0.3 (blue) vs. GRPO-trained IntelliCredit model (green): **zero regressions across all 24 metric-task combinations.***
| Task | Metric | Base Model | GRPO Model | Δ |
|------|--------|-----------|-----------|---|
| Task 1 (Easy) | Score | 0.900 | **0.955** | **+0.055 ✅** |
| | Accuracy | 80.0% | **86.7%** | **+6.7% ✅** |
| | Capital Util | 40.0% | **60.0%** | **+20.0% ✅** |
| Task 2 (Medium) | Score | 1.000 | 1.000 | ceiling ✅ |
| | Total Reward | 10.305 | **10.584** | **+0.279 ✅** |
| Task 3 (Hard) | Score | 0.767 | **0.833** | **+0.067 ✅** |
| | Total Reward | 0.215 | **2.491** | **+2.276 ✅ (10×!)** |
| | **NPA Rate** | **16.7%** | **8.3%** | **−8.3% ✅ (halved!)** |
---
## 🎯 Core Motivation
The MSME lending sector in India processes over **100,000 loan applications daily**. Current bottlenecks:
- A senior loan officer reviews **~16 applications/day**, so each human expert sees about 0.016% of daily volume
- **12–15% annual default rates** due to poor risk assessment
- **Manual cross-referencing** of GST, MCA, CIBIL, court records takes days per application
- **No explainable audit trail**: decisions rest on "gut feeling" under time pressure
**Our approach:** Create a training ground where an AI learns to *think* like the best credit officers: gathering evidence, detecting hidden fraud, respecting non-negotiable regulations, and managing portfolio risk across time.
---
## How the Environment Works (v2.0)
An agent plays a **50-step Credit Committee Episode**:
```
Step T = 1..50:
  1. Environment generates an MSME application (Anchor × Sector × Size × Tier)
  2. Agent sees 55D observation (application + portfolio + macro + memory)
  3. Agent may call up to 4 investigation tools
  4. Agent submits: APPROVE (0) | CONDITIONAL (1) | REJECT (2)
  5. Reward computed: R1 (correctness) + R2 (hard rules) + R3 (format) + R4 (portfolio)
  6. Approved loans join portfolio
  7. RegulatorAgent audits at jittered steps ≈ 10/20/30/40/50
  8. Loan maturity events fire T+10 to T+30 (delayed NPA consequences)
  9. At step 50: settlement reward + Reflection Module activates
```
### Multi-Agent System
| Agent | Simulated By | Responsibility |
|-------|-------------|----------------|
| **Credit Officer** | LLM (GRPO fine-tuned) | Reviews applications, calls tools, makes decisions |
| **BorrowerAgent** | Programmatic | Reapplies after rejection with improved *surface* metrics (hidden PD unchanged or worse) |
| **RegulatorAgent** | Programmatic | Audits at steps ≈10/20/30/40/50 (±1 jitter), shuts down after 3 consecutive failures |
#### BorrowerAgent: Fraud Simulation Detail
When rejected, BorrowerAgent (1) waits 3–5 steps, then (2) reapplies with DSCR +8%, Collateral +15%, and a Director guarantee added. **Hidden PD stays the same or worsens.** Max 3 attempts.
Detection signals: `Dim 49: borrower_persistence_score` (0.0 = 1st, 0.5 = 2nd, **1.0 = 3rd attempt = maximum manipulation signal**), the `alert_state[4]` REPEAT_APPLICANT flag, and `check_compliance_status()`, which reveals the unchanged GST/MCA history.
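The reapplication behaviour described above can be sketched in a few lines. This is a minimal illustration rather than the environment's code: the dictionary keys, the `reapply` and `persistence_score` names, and the reading of "+8%" and "+15%" as relative uplifts are all assumptions.

```python
from typing import Optional

# Illustrative sketch only: field and helper names are hypothetical.
MAX_ATTEMPTS = 3


def persistence_score(attempt: int) -> float:
    """Map the attempt number (1, 2, 3) to the Dim 49 value listed above."""
    return {1: 0.0, 2: 0.5, 3: 1.0}[attempt]


def reapply(app: dict) -> Optional[dict]:
    """Return a cosmetically improved application, or None after 3 attempts."""
    if app["attempt"] >= MAX_ATTEMPTS:
        return None
    improved = dict(app)
    improved["dscr"] = app["dscr"] * 1.08                  # assumed relative +8%
    improved["collateral_coverage"] = app["collateral_coverage"] * 1.15
    improved["director_guarantee"] = True
    improved["attempt"] = app["attempt"] + 1
    # hidden_pd is deliberately left untouched: only surface metrics improve.
    return improved


first = {
    "dscr": 1.10,
    "collateral_coverage": 1.00,
    "director_guarantee": False,
    "hidden_pd": 0.35,
    "attempt": 1,
}
second = reapply(first)
```

Here `second` looks healthier on paper while `second["hidden_pd"]` is still 0.35, which is the pattern the persistence score is meant to expose.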
#### RegulatorAgent: Audit Mechanics
Audits fire at steps ≈10/20/30/40/50 (±1 jitter to prevent timing exploitation).
| Constraint | Clean | Warning | Violation | Penalty |
|-----------|-------|---------|-----------|--------|
| NPA Rate | <3% | 3–5% | ≥5% | **−8.0** |
| CRAR | >15% | 12.5–15% | <12.5% | **−15.0 + possible termination** |
| Sector Concentration | <25% | 25–30% | >30% | **−8.0** |
| Single Borrower Limit | <12% | 12–15% | >15% | **−5.0** |
Escalation: 0 failures = normal → 1 = warning → 2 = capital penalty (−10% available capital) → 3 = **REGULATORY_SHUTDOWN** (−50.0 terminal penalty)
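The audit table and escalation ladder can be read as a small state machine. The sketch below is one possible reading, not the RegulatorAgent's actual implementation: the function names are hypothetical, and it assumes that penalties add up across violated constraints, that the warning band does not count as a failure, that a clean audit earns the +2.0 bonus and resets the failure counter, and that the capital penalty at the second failure is applied elsewhere.

```python
# Thresholds come from the table above; everything else is an assumption.
PENALTIES = {"npa": -8.0, "crar": -15.0, "sector": -8.0, "single_borrower": -5.0}
ESCALATION = {0: "NORMAL", 1: "WARNING", 2: "CAPITAL_PENALTY", 3: "REGULATORY_SHUTDOWN"}


def audit_violations(npa, crar, sector_conc, single_borrower):
    """Return the violated constraints; all inputs are fractions (0.05 = 5%)."""
    violations = []
    if npa >= 0.05:
        violations.append("npa")
    if crar < 0.125:
        violations.append("crar")
    if sector_conc > 0.30:
        violations.append("sector")
    if single_borrower > 0.15:
        violations.append("single_borrower")
    return violations


def run_audit(metrics, consecutive_failures):
    """Return (reward, updated_failure_count, escalation_state) for one audit."""
    violations = audit_violations(**metrics)
    if violations:
        consecutive_failures += 1
        reward = sum(PENALTIES[v] for v in violations)
    else:
        consecutive_failures = 0   # assumed: a clean audit resets the streak
        reward = 2.0               # clean-audit bonus
    state = ESCALATION[min(consecutive_failures, 3)]
    if state == "REGULATORY_SHUTDOWN":
        reward += -50.0            # terminal penalty on the third failure
    return reward, consecutive_failures, state


clean = {"npa": 0.02, "crar": 0.16, "sector_conc": 0.20, "single_borrower": 0.10}
breach = {"npa": 0.06, "crar": 0.16, "sector_conc": 0.20, "single_borrower": 0.10}
```

Calling `run_audit(breach, 2)` models a third consecutive failure, which combines the NPA penalty with the shutdown penalty.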
### 📅 50-Step Episode Lifecycle
```
Steps 1–10   │ EARLY PHASE   │ Clean profiles, build portfolio baseline
Step  ~10    │ AUDIT #1      │ NPA / CRAR / sector concentration checked
Steps 11–20  │ MIDDLE PHASE  │ Forensic RED alerts appear; repeat applicants reapply
Step  ~20    │ MACRO SHOCK   │ GDP contraction fires; 1–2 sectors enter stressed state
Step  ~20    │ AUDIT #2      │
Steps 21–30  │ CRISIS PHASE  │ Maturity events fire: Repaid +10.0 / Defaulted -15.0
Step  ~30    │ AUDIT #3      │
Steps 31–40  │ RECOVERY      │ Balance new approvals vs NPA cleanup
Step  ~40    │ AUDIT #4      │
Steps 41–50  │ FINAL PHASE   │ Survival, capital preservation
Step  50     │ SETTLEMENT    │ score = 0.30×yield + 0.30×(1-npa) + 0.20×compliance + 0.20×capital_util
             │ REFLECTION    │ Lesson extraction activates for next episode
```
---
## Training Curves
*IntelliCredit GRPO v2 training across 3 curriculum stages. Note the key inflection points at stage transitions (dashed lines):*
- **GRPO Loss (red):** Controlled upward drift from ~0 → 0.05, showing that the policy is meaningfully diverging from the base model
- **Mean Reward (blue):** Starts at −2.0 (random violations), crosses zero by step 10, and stabilizes near +0.5 to +1.0: **the environment is learnable**
- **KL Divergence (purple):** Grows to ~0.04–0.08; the model learned new behaviors while preserving language capability
- **`submit_pct` (teal):** Format compliance climbs from 0% → 40–65% as the model acquires the task's vocabulary
---
## Regulatory Rules (6 Non-Negotiable Hard Rules)
| Rule | Condition | Action |
|------|-----------|--------|
| **HR-01** | DSCR < 1.0 | Mandatory REJECT + −2.0 penalty |
| **HR-02** | Director disqualified (DIN < 0.1) | Mandatory REJECT + −2.0 penalty |
| **HR-03** | RED forensic alert present | Mandatory REJECT + −2.0 penalty |
| **HR-04** | Cheque bounce rate > 25% | Mandatory REJECT + −2.0 penalty |
| **HR-05** | GST compliance < 40% | Mandatory REJECT + −2.0 penalty |
| **HR-06** | Severe adverse media (> 0.80) | Mandatory REJECT + −2.0 penalty |
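The six hard rules are simple threshold checks, so they can be expressed as one function. This is a minimal sketch under assumed field names (`dscr`, `din_score`, `red_alert`, and so on are hypothetical keys, not the environment's schema); the thresholds are the ones in the table.

```python
REJECT = 2  # action code from the action space


def triggered_hard_rules(app: dict) -> list:
    """Return the IDs of every hard rule the application trips."""
    checks = [
        ("HR-01", app["dscr"] < 1.0),
        ("HR-02", app["din_score"] < 0.1),
        ("HR-03", app["red_alert"]),
        ("HR-04", app["cheque_bounce_rate"] > 0.25),
        ("HR-05", app["gst_compliance"] < 0.40),
        ("HR-06", app["adverse_media"] > 0.80),
    ]
    return [rule_id for rule_id, hit in checks if hit]


def is_compliant(action: int, app: dict) -> bool:
    """A decision is compliant unless a hard rule fired and it was not REJECT."""
    return not triggered_hard_rules(app) or action == REJECT


healthy = {
    "dscr": 1.4, "din_score": 0.9, "red_alert": False,
    "cheque_bounce_rate": 0.02, "gst_compliance": 0.95, "adverse_media": 0.10,
}
risky = dict(healthy, dscr=0.9, cheque_bounce_rate=0.30)
```

With these inputs, `triggered_hard_rules(risky)` flags HR-01 and HR-04, so any action other than REJECT would be non-compliant.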
### Portfolio Constraints
| Constraint | Threshold | Consequence |
|------------|-----------|-------------|
| CRAR | > 12.5% | Episode terminates if breached |
| NPA Rate | < 5% | Episode terminates if breached |
| Sector Concentration | < 30% | −8.0 penalty per audit |
| Single Borrower | < 15% | −5.0 penalty per audit |
---
## Observation Space (55D)
The agent observes a **55-dimensional vector** bounded `[-1.0, +1.0]`.
*(−1.0 = sentinel for missing/masked data, teaching the agent that data absence itself is a risk signal.)*
| Group | Dims | Description |
|-------|------|-------------|
| Application Features | 0–24 | 25 financial/forensic/governance ratios |
| Portfolio State | 25–34 | Capital deployed, NPA rate, CRAR, provisioning coverage, sector flags |
| Macro State | 35–39 | Systemic stress, GDP growth, inflation, credit cycle phase |
| Alert State | 40–44 | Running RED/YELLOW alert tallies from episode |
| **Memory Features** *(v2 NEW)* | **45–54** | **Agent's own behavioral history encoded as state** |
### Application Features (Dims 0–24): Key Metrics
| Category | Metrics |
|----------|---------|
| Debt Serviceability | DSCR, Current Ratio, Debt-to-Equity, EBITDA Margin |
| Collateral | Collateral Coverage Ratio, RONW |
| Banking Behavior | OD Utilisation, CC Volatility, Cheque Bounce Rate, Working Capital Cycle |
| GST/Tax | GST CAGR, GST 2A-3B Gap, ITC Mismatch, GST Alignment Score |
| Fraud Signals | Related-Party Transactions, Circular Trading Score |
| Governance | Promoter Litigation Count, MCA Charges, Adverse Media Sentiment, DIN Score |
**Key memory dimensions (Dims 45–54):**
- `Dim 49: borrower_persistence_score`: 0.0 = 1st attempt, 0.5 = 2nd, **1.0 = 3rd attempt (maximum manipulation signal)**
- `Dim 50: audit_risk_score`: proximity to next regulator audit
- `Dim 51: capital_buffer_ratio`: headroom above minimum CRAR
- `Dim 53: episode_progress`: normalized step count (0.0–1.0)
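To make the layout concrete, the sketch below slices a flat 55-element observation into the five groups from the table and finds entries carrying the −1.0 missing-data sentinel. The group names and helper functions are illustrative assumptions; only the index ranges and the sentinel value come from the text above.

```python
# (start, end) index pairs, end exclusive, taken from the observation table.
OBS_GROUPS = {
    "application": (0, 25),
    "portfolio": (25, 35),
    "macro": (35, 40),
    "alerts": (40, 45),
    "memory": (45, 55),
}
MISSING = -1.0  # sentinel for missing/masked data


def split_observation(obs):
    """Split a flat 55D observation into named groups."""
    if len(obs) != 55:
        raise ValueError(f"expected 55 dims, got {len(obs)}")
    return {name: list(obs[a:b]) for name, (a, b) in OBS_GROUPS.items()}


def missing_indices(obs):
    """Return the indices whose value is the missing-data sentinel."""
    return [i for i, value in enumerate(obs) if value == MISSING]


obs = [0.0] * 55
obs[3] = MISSING   # a masked application feature
obs[49] = 1.0      # borrower_persistence_score: third attempt
groups = split_observation(obs)
```

In this example `groups["memory"][4]` is the persistence score, because Dim 49 is the fifth entry of the memory block that starts at Dim 45.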
---
## 🕹️ Action Space + Tool Calling
**Discrete(3):** APPROVE(0) | CONDITIONAL(1) | REJECT(2), plus optional tool calls before deciding.
### Investigation Tools (up to 4 per step)
| Tool | Returns | Best Used When |
|------|---------|----------------|
| `get_financial_report(company_id)` | 3yr revenue trend, EBITDA, auditor remarks, related-party txns | Borderline financials, need trend confirmation |
| `check_compliance_status(company_id)` | DIN status, NCLT cases, GST filings, CIBIL, prior defaults | RED alert present, low governance score |
| `get_market_intelligence(sector)` | Sector stress, RBI advisory, portfolio exposure, peer NPA rate | Approaching 30% concentration limit |
| `submit_decision(action, reasoning)` | Finalizes step (reasoning ≥ 50 chars required) | After investigation complete |
### Action Parser: Priority Order (`server/action_parser.py`)
The LLM outputs free-form text, which is parsed in strict priority order:
1. **Tool call detected** (e.g. `get_financial_report(...)`) → executes the tool, does **not** advance the step
2. **`submit_decision(action, reasoning)`** → validates format, advances the step counter
3. **Standalone keyword** → `APPROVE` / `CONDITIONAL` / `REJECT` scanned in text
4. **Default fallback** → REJECT (safe default) + logs `parse_failure=True`
Anti-abuse: multiple decisions → last wins; reasoning < 50 chars → penalty; empty reasoning → decision blocked.
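The priority order above can be sketched with a few regular expressions. This is a simplified stand-in for `server/action_parser.py`, not a copy of it: the regexes, the returned dictionary shape, and the `reasoning_ok` flag are assumptions made for illustration.

```python
import re

TOOLS = ("get_financial_report", "check_compliance_status", "get_market_intelligence")
ACTIONS = {"APPROVE": 0, "CONDITIONAL": 1, "REJECT": 2}

TOOL_RE = re.compile(r"\b(%s)\s*\(([^)]*)\)" % "|".join(TOOLS))
SUBMIT_RE = re.compile(
    r"""submit_decision\s*\(\s*['"]?(APPROVE|CONDITIONAL|REJECT|[012])['"]?\s*,\s*['"](.*?)['"]\s*\)""",
    re.S,
)
KEYWORD_RE = re.compile(r"\b(APPROVE|CONDITIONAL|REJECT)\b")


def parse_action(text: str) -> dict:
    """Parse free-form LLM output using the four priority levels."""
    tool = TOOL_RE.search(text)
    if tool:  # 1. tool call: executed, step does not advance
        return {"type": "tool", "tool": tool.group(1),
                "arg": tool.group(2).strip("'\" "), "advance": False}
    submits = SUBMIT_RE.findall(text)
    if submits:  # 2. explicit submit_decision; the last one wins
        raw, reasoning = submits[-1]
        action = int(raw) if raw.isdigit() else ACTIONS[raw]
        return {"type": "decision", "action": action, "reasoning": reasoning,
                "reasoning_ok": len(reasoning) >= 50, "advance": True,
                "parse_failure": False}
    keywords = KEYWORD_RE.findall(text)
    if keywords:  # 3. bare keyword; the last one wins
        return {"type": "decision", "action": ACTIONS[keywords[-1]], "reasoning": "",
                "reasoning_ok": False, "advance": True, "parse_failure": False}
    # 4. safe default
    return {"type": "decision", "action": ACTIONS["REJECT"], "reasoning": "",
            "reasoning_ok": False, "advance": True, "parse_failure": True}


tool_call = parse_action('get_financial_report("C-17")')
decision = parse_action(
    'submit_decision(APPROVE, "looks fine") Wait. '
    'submit_decision(2, "DSCR is 0.92, below 1.0, so hard rule HR-01 forces a rejection.")'
)
fallback = parse_action("I am not sure what to do here.")
```

How the real parser treats reasoning supplied with a bare keyword is not specified above, so the sketch simply marks it as failing the length check.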
---
## Reward System
| Component | Weight | Range | Description |
|-----------|--------|-------|-------------|
| R1: Decision Correctness | 40% | [−2.0, +1.0] | PD-based: low PD + APPROVE = +1.0; high PD + APPROVE = −2.0 |
| R2: Hard Rule Compliance | 30% | [−2.0, +0.5] | HR + REJECT = +0.5; HR + APPROVE = −2.0 |
| R3: Format Compliance | 15% | [−0.3, +0.3] | `submit_decision()` used = +0.3; parse failure = −0.3 |
| R4: Portfolio Awareness | 15% | [−0.8, +0.3] | NPA > 8% + risky approve = −0.5; healthy approve = +0.2 |
**Delayed Events:** Loan maturity fires T+10 to T+30 after approval (Repaid: +10.0, Defaulted: −15.0×(1−recovery))
**Audit Bonus:** +2.0 clean audit / −8.0 violation / −15.0 capital breach / −50.0 shutdown (3rd failure)
**Settlement (step 50):** `0.30×yield + 0.30×(1-npa) + 0.20×compliance + 0.20×capital_util`
### Survival Bonus (Every 10 Steps)
| CRAR Level | Bonus | Meaning |
|-----------|-------|--------|
| ≥ 15% | +0.10 | Healthy capital buffer |
| 12.5–15% | +0.05 | Marginal: caution signal |
| < 12.5% | Episode terminates | Capital inadequacy = bank failure |
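The settlement formula and the survival-bonus table translate directly into code. The sketch below assumes all inputs are fractions in [0, 1]; the function names are hypothetical and the real grader in `server/reward.py` may structure this differently.

```python
def settlement_score(yield_, npa, compliance, capital_util):
    """Step-50 settlement: weighted blend of yield, NPA, compliance, utilisation."""
    return 0.30 * yield_ + 0.30 * (1 - npa) + 0.20 * compliance + 0.20 * capital_util


def survival_bonus(crar):
    """Return (bonus, terminated) for the check made every 10 steps."""
    if crar >= 0.15:
        return 0.10, False   # healthy capital buffer
    if crar >= 0.125:
        return 0.05, False   # marginal: caution signal
    return 0.0, True         # capital inadequacy ends the episode


# A portfolio with 60% yield score, 8% NPA, full compliance, 50% utilisation.
example = settlement_score(yield_=0.60, npa=0.08, compliance=1.0, capital_util=0.50)
```

For the example portfolio the four terms are 0.18, 0.276, 0.20 and 0.10, so the low NPA rate contributes more than the yield itself.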
## 🛡️ Anti-Gaming Mechanisms (10 Independent Safeguards)
| # | Mechanism | What It Prevents |
|---|-----------|------------------|
| 1 | Hidden PD: agent cannot see true default probability | Cannot directly optimize against ground truth |
| 2 | Read-only tools: cannot mutate environment state | Tool calls cannot manipulate outcomes |
| 3 | Max 4 tool calls enforced at env level (not agent) | Cannot bypass limit via prompt tricks |
| 4 | Reasoning quality check: empty text blocks decision | Cannot submit empty reasoning for format reward |
| 5 | Redundant tool call penalty (−0.1 each) | Prevents information-flooding strategy |
| 6 | Delayed NPA: defaults arrive T+10 to T+30 | Cannot see future consequences to optimize backward |
| 7 | World state locked: agent has no write access | Cannot modify portfolio variables directly |
| 8 | Deterministic episode seeds | No lucky randomness: same episode every run |
| 9 | Multiple independent reward functions | Gaming one component doesn't win overall |
| 10 | Jittered audit timing (±1 step) | Cannot predict exact audit step to game timing |
---
## GRPO Training Pipeline: 2-Stage Approach
The final model [`vssksn/intellicredit-mistral-7b-grpo`](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) was trained using a **2-stage pipeline**: offline GRPO for speed and domain knowledge, then **online GRPO directly against the live IntelliCredit environment** for true behavioral alignment.
```
┌─────────────────────────────┐     ┌─────────────────────────────┐
│ STAGE 1: Offline GRPO       │     │ STAGE 2: Online GRPO        │
│ (Speed-Optimised)           │     │ (Environment-Native)        │
│                             │     │                             │
│ Model : Mistral-7B-v0.3     │     │ Model : Mistral-7B (Stg 1)  │
│ Engine: Unsloth + TRL       │     │ Env   : Live HF Space       │
│ Data  : 2,000 prompts       │     │ Data  : Real episodes       │
│ Reward: 4 local functions   │     │ Reward: /step endpoint 100% │
│ Speed : ~45 minutes         │     │ Type  : True Online RL      │
│ Goal  : Domain knowledge    │     │ Goal  : True env alignment  │
└──────────────┬──────────────┘     └──────────────┬──────────────┘
               │                                   │
               └─────────────────┬─────────────────┘
                                 │
                                 ▼
               vssksn/intellicredit-mistral-7b-grpo
           Post-trained on live environment interactions
```
### Stage 1: Offline GRPO (Speed-Optimised)
**[Stage 1 Colab Notebook](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing)**: Mistral-7B + Unsloth, A100, ~45 minutes
Pre-trains on a curated 2,000-prompt dataset for maximum training speed and domain knowledge transfer.
**Training Dataset:**
- **2,000 prompts**: 400 per task level (task1–task5), ~2,400 chars each
- Ground truth metadata: hidden PD, optimal action, hard rules, alerts, sector, CRAR, NPA
- Distribution: **47.2%** hard rules triggered | **28.1%** RED forensic alerts
- Published: [vssksn/intellicredit-grpo-v2](https://huggingface.co/datasets/vssksn/intellicredit-grpo-v2)
**3-Stage Curriculum:**
| Stage | Data | LR | Temperature | Goal |
|-------|------|----|-------------|------|
| Stage 0 (SFT Warmup) | Mixed | 5e-5 | n/a | Bootstrap `submit_decision()` format compliance |
| Stage 1 | task1 (Easy) | 5e-6 | 0.9 | Hard rule recognition on clean profiles |
| Stage 2 | task1 + task2 | 5e-6 | 0.9 | Forensic alert detection, tool call initiation |
| Stage 3 | All tasks | 2e-6 | 0.8 | Long-horizon portfolio management |
```
Config: rank=16 QLoRA (Unsloth), seq_len=2048, 8 generations/prompt
        batch=2 + grad_accum=8 (effective=16), KL β=0.001
```
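With 8 generations per prompt, GRPO scores each completion relative to the other completions for the same prompt instead of using a learned value function. The sketch below shows the commonly used group-normalized advantage; whether the training scripts use exactly this normalization (population versus sample standard deviation, the `eps` value) is an assumption, and `group_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight sampled completions for one prompt and their scalar rewards.
rewards = [1.0, -2.0, 0.5, 0.5, 1.0, -2.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

Completions that beat their group's mean get a positive advantage and the rest a negative one; if all 8 rewards are identical, every advantage is zero and that prompt contributes no gradient signal.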
### Stage 2: Online GRPO (Environment-Native)
**[Stage 2 Notebook: Online Training (Colab)](https://colab.research.google.com/github/1919-14/intellicredit-openenv/blob/main/training/colab_online_grpo.ipynb)**: live env, 50-step episodes, real rewards, Mistral-7B
Post-trains the Stage 1 model by **directly interacting with the live IntelliCredit environment**. Every reward signal comes from the actual `/step` endpoint, so this is true online RL rather than a proxy.
| Feature | Detail |
|---------|--------|
| Environment | [vssksn-intellicredit-openenv.hf.space](https://vssksn-intellicredit-openenv.hf.space) (live HTTP) |
| Episode length | **50 steps** β full credit committee lifecycle |
| **Reward source** | **`/step` endpoint, 100% environment-native** |
| Tool calling | Multi-turn: tools → evidence → `submit_decision()` |
| Reflection | Cross-episode memory bank (6 lesson categories, FIFO 20) |
| Curriculum | 3 phases: task1 → task3 → all 5 tasks, temp 1.2 → 0.8 |
| Model published | [vssksn/intellicredit-mistral-7b-grpo](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) |
### Critical Training Bug Fixes
| Bug | Root Cause | Fix Applied |
|-----|-----------|-------------|
| CUDA Index OOB | Unsloth pads vocab 32768 → 32832; padded token IDs indexed into smaller training logits | Clamp all IDs to `vocab_size-1` + `valid_mask` to skip OOB |
| Sequence Mismatch | `full_ids` exceeded 2048 before forward pass; logits truncated → shape crash | Enforce `full_ids = full_ids[:, :MAX_SEQ_LEN]` before forward |
| Loss Scale Instability | Raw log-prob sum scaled with sequence length → exploding gradients | Switch to per-token average: `loss = -sum(log_probs) / n_valid_tokens` |
| Flat KL Divergence | `clamp(min=0)` → KL=0 when new policy more confident than reference | Changed to `abs()` so the KL term stays non-negative without being zeroed out |
| Zero-LP Episodes | Prompt filled entire 2048-token context → 0 completion tokens | Skip with `continue` when `sum(log_probs) == 0` |
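Three of these fixes (masking out-of-range token IDs, averaging per token, and skipping empty completions) fit in one small function. The version below uses plain Python lists so it runs without a GPU; the actual training code operates on tensors, and `masked_token_loss` is a hypothetical name used only for this sketch.

```python
MAX_SEQ_LEN = 2048
VOCAB_SIZE = 32768  # unpadded vocabulary size from the table above


def masked_token_loss(log_probs, token_ids, vocab_size=VOCAB_SIZE):
    """Per-token average negative log-likelihood over in-vocabulary tokens.

    Returns None when there is nothing to train on, so the caller can skip
    the sample (the "Zero-LP Episodes" case).
    """
    # Truncate first so ids and log-probs can never disagree in length.
    log_probs = log_probs[:MAX_SEQ_LEN]
    token_ids = token_ids[:MAX_SEQ_LEN]
    valid = [0 <= t < vocab_size for t in token_ids]
    n_valid = sum(valid)
    if n_valid == 0:
        return None
    total = sum(lp for lp, ok in zip(log_probs, valid) if ok)
    return -total / n_valid  # average, not sum, so length does not scale the loss


# Token 40000 lies in the padded range and is ignored.
loss = masked_token_loss([-1.0, -3.0, -5.0], [5, 40000, 7])
```

Dividing by `n_valid` rather than summing keeps the loss magnitude comparable between short and long completions, which is the instability the third row describes.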
---
## Self-Improvement Reflection System
GRPO updates weights. The Reflection Module improves the model's behavior **without retraining** by injecting structured lessons from episode failures into the next episode's system prompt.
```
Episode N   → Analyze all steps where reward < 0
            → Extract lessons by failure type (6 categories)
            → Store top 20 lessons in memory_bank.json (FIFO eviction, deduplicated)
Episode N+1 → Inject top 5 lessons into system prompt Layer 3 → better decisions
```
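The store-and-inject loop above maps naturally onto a bounded queue. The sketch below is an in-memory approximation: the `MemoryBank` class, its methods, and the prompt layout are hypothetical, persistence to `memory_bank.json` is omitted, and "top 5" is interpreted as the 5 most recent lessons because the ranking rule is not spelled out here.

```python
from collections import deque


class MemoryBank:
    """Bounded, deduplicated lesson store with FIFO eviction."""

    def __init__(self, capacity=20):
        # A deque with maxlen drops the oldest item when a new one is appended.
        self.lessons = deque(maxlen=capacity)

    def add(self, lesson: str) -> bool:
        """Store a lesson; return False if it is already present."""
        if lesson in self.lessons:
            return False
        self.lessons.append(lesson)
        return True

    def top(self, k=5):
        """Return the k most recent lessons (assumed ranking rule)."""
        return list(self.lessons)[-k:]


def build_system_prompt(base_prompt: str, bank: MemoryBank) -> str:
    """Append the selected lessons to the system prompt for the next episode."""
    lessons = "\n".join(f"- {lesson}" for lesson in bank.top())
    return f"{base_prompt}\n\nLessons from previous episodes:\n{lessons}"


bank = MemoryBank()
for i in range(25):
    bank.add(f"lesson {i}")
prompt = build_system_prompt("You are a Senior Credit Officer.", bank)
```

After 25 insertions the bank holds lessons 5 through 24, and only the last five reach the prompt.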
### 6 Lesson Trigger Types
| Trigger | Lesson Injected | Severity |
|---------|-----------------|----------|
| Hard Rule Violation | `RULE: When [condition], always REJECT` | Critical |
| Delayed Default | `CAUTION: Loans with [pattern] defaulted T+N steps later` | High |
| Audit Failure | `COMPLIANCE: Audit failed due to [metric breach]` | High |
| Borrower Manipulation | `FRAUD RISK: Repeat applicant with [pattern] defaulted` | Critical |
| Macro Shock Loss | `MACRO: During [state], be conservative with [sector]` | Medium |
| Portfolio Overexposure | `PORTFOLIO: NPA reached X%. Tighten approvals.` | High |
**Verified result (base model, no fine-tuning, 3 consecutive episodes):**
| Episode | Score | Improvement |
|---------|-------|-------------|
| 1 | 0.213 | Baseline |
| 2 | 0.265 | **+24.4% ✅** |
| 3 | 0.304 | **+43.2% ✅** |
43% improvement purely through in-context lesson injection, with zero weight changes.
---
## Task Descriptions
| Task | Difficulty | Steps | Key Challenge |
|------|-----------|-------|---------------|
| `task1` | 🟢 Easy | 50 | Clean profiles, basic APPROVE/REJECT |
| `task2` | 🟡 Medium | 50 | Forensic alerts (YELLOW/RED), tool investigation |
| `task3` | 🔴 Hard | 50 | Macro shocks + missing data + repeat applicants |
| `task4` | 🔥 Expert | 50 | Hard-rule violations + all adversarial patterns |
| `task5` | ⚡ Master | 50 | Full: CRAR limits + cascading NPAs + 5 audits |
---
## 💻 Quick Start
### Try the Live API
```bash
# Start an episode
curl -X POST https://vssksn-intellicredit-openenv.hf.space/reset \
-H "Content-Type: application/json" \
-d '{"episode_id": "demo-001", "seed": 42, "task_id": "task2"}'
# Submit a decision (0=APPROVE, 1=CONDITIONAL, 2=REJECT)
curl -X POST https://vssksn-intellicredit-openenv.hf.space/step \
-H "Content-Type: application/json" \
-d '{"episode_id": "demo-001", "action": {"decision": 2}}'
```
**→ [Full Swagger UI](https://vssksn-intellicredit-openenv.hf.space/docs)**
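The same two calls can be driven from Python using only the standard library. This is a hedged sketch, not the repository's `client.py`: `post_json` and `run_episode` are hypothetical helpers, the request payloads mirror the curl examples, and it assumes the `/reset` response carries an `observation` field like the `/step` response does.

```python
import json
from urllib import request

BASE_URL = "https://vssksn-intellicredit-openenv.hf.space"


def post_json(url, payload, timeout=30):
    """POST a JSON payload and return the decoded JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(policy, episode_id="demo-001", seed=42, task_id="task2",
                base_url=BASE_URL, post=post_json, max_steps=50):
    """Play one episode with `policy(observation) -> 0 | 1 | 2`; return total reward."""
    result = post(f"{base_url}/reset",
                  {"episode_id": episode_id, "seed": seed, "task_id": task_id})
    total = 0.0
    for _ in range(max_steps):
        decision = policy(result.get("observation"))
        result = post(f"{base_url}/step",
                      {"episode_id": episode_id, "action": {"decision": decision}})
        total += float(result.get("reward") or 0.0)
        if result.get("done"):
            break
    return total


def always_reject(observation):
    """Trivial baseline policy: REJECT (2) every application."""
    return 2
```

Calling `run_episode(always_reject)` plays a full episode against the live Space; passing a different `post` callable makes it possible to exercise the loop offline.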
### Local Setup
```bash
git clone https://github.com/1919-14/intellicredit-openenv.git --branch v2
cd intellicredit-openenv
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
python -m server.app # → http://localhost:7860/docs
```
### Evaluate the GRPO Model
```bash
# Run GRPO model against environment
python eval_llm.py \
--model vssksn/intellicredit-mistral-7b-grpo \
--env-url http://localhost:7860 \
--out grpo_results.json
# Compare vs base model
python eval_llm.py \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--env-url http://localhost:7860 \
--out base_results.json
# Generate comparison chart
python compare_results.py \
--baseline base_results.json \
--after grpo_results.json \
--out comparison.png
```
### Docker
```bash
docker build -t intellicredit-v2 .
docker run -p 7860:7860 intellicredit-v2
```
---
## Project Structure
```
intellicredit-openenv/
├── server/
│   ├── app.py                  # FastAPI server: /reset, /step, /info, /health
│   ├── intellicredit_env.py    # v2 core: WorldState, 50-step lifecycle, multi-agent
│   ├── dataset.py              # Application generator (Anchor × Sector × Size × Tier)
│   ├── reward.py               # R1-R4 reward engine + settlement grader
│   ├── action_parser.py        # LLM text → tool call / decision parser (6-level)
│   ├── tool_executor.py        # Read-only tool execution (financial, compliance, market)
│   ├── agent_loop.py           # Agent orchestrator + prompt injection + step logger
│   └── reflection.py           # Self-improvement + memory bank system
│
├── training/
│   ├── colab_grpo_3b_v2.py     # PRIMARY: Unsloth GRPO training (A100, ~45 min)
│   ├── generate_dataset.py     # 2000-prompt GRPO dataset generator
│   ├── grpo_rewards.py         # 4 GRPO reward functions (R1-R4)
│   └── train_grpo.py           # 3-stage curriculum pipeline
│
├── evaluation/
│   ├── evaluate.py             # Multi-mode evaluation engine (baseline/reflection/GRPO)
│   └── compare.py              # Comparison tables + reward curves (4-panel PNG)
│
├── docs/
│   ├── blog.md                 # Full technical blog post (~5,000 words)
│   └── assets/
│       ├── comparison.png      # Baseline vs GRPO results chart
│       └── training_curves.png # GRPO training curves (Mistral-7B, A100)
│
├── eval_llm.py                 # LLM evaluation via HTTP (base vs trained)
├── compare_results.py          # Bar chart generator (8 metrics × 3 tasks)
├── baseline_results.json       # RuleBasedAgent reference scores
├── memory_bank.json            # Persistent cross-episode lesson storage (auto-generated)
├── inference.py                # LLM inference wrapper (HF API)
├── models.py                   # Pydantic schemas (55D observation, action)
├── client.py                   # HTTP client for environment interaction
├── openenv.yaml                # OpenEnv framework config
├── PROJECT_SUMMARY.md          # Complete project summary (all 8 phases)
├── Dockerfile                  # HF Spaces Docker deployment
└── requirements.txt            # Python dependencies
```
## 🧪 Evaluation Methodology
Two evaluation approaches:
**Approach 1: Direct Python (`evaluation/evaluate.py`).** Tests agents by calling `IntelliCreditEnvironment` directly. Agents: `RuleBasedAgent` (optimal), `RandomAgent` (lower bound), `GreedyApproveAgent`. Output: `baseline_results.json`.
**Approach 2: HTTP API (`eval_llm.py`).** Tests the actual LLM against a running server. Because `/step` returns only `{observation, reward, done}`, scores are computed locally:
| Metric | Formula | Weight |
|--------|---------|--------|
| Accuracy | steps with positive reward / total steps | 0.5 |
| HR Compliance | 1 − (steps with reward < −5) / total steps | 0.3 |
| Survival Rate | 1.0 if all 50 steps completed without shutdown | 0.2 |
| **Final Score** | accuracy×0.5 + hr_compliance×0.3 + survival×0.2 | n/a |
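The locally computed score follows directly from the table. The sketch below assumes a list of per-step rewards collected from `/step` and a boolean recording whether the episode ran to completion; `final_score` is a hypothetical name and `eval_llm.py` may organise the computation differently.

```python
def final_score(step_rewards, survived):
    """Blend accuracy, hard-rule compliance, and survival into one score."""
    total = len(step_rewards)
    if total == 0:
        raise ValueError("need at least one step to score an episode")
    accuracy = sum(r > 0 for r in step_rewards) / total
    hr_compliance = 1 - sum(r < -5 for r in step_rewards) / total
    survival = 1.0 if survived else 0.0
    return 0.5 * accuracy + 0.3 * hr_compliance + 0.2 * survival


# 8 of 10 steps earned positive reward, no step fell below -5, episode completed.
score = final_score([1.0] * 8 + [-0.5] * 2, survived=True)
```

With 80% accuracy, full hard-rule compliance, and survival, the example works out to 0.5×0.8 + 0.3 + 0.2 = 0.9.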
## Version History
| Feature | v1.0 | v2.0 (Current) |
|---------|------|----------------|
| Episode Length | 12 steps | **50 steps** (~4×) |
| Observation Dims | 45D | **55D** (+10 memory features) |
| Agent Count | 1 | **3** (Credit Officer + Borrower + Regulator) |
| Reward Type | Dense per-step | **Delayed + sparse** (realistic credit risk) |
| Tool Calling | ❌ | **✅ 3 tools, max 4 calls/step** |
| Self-Improvement | ❌ | **✅ Cross-episode reflection module** |
| GRPO Fine-Tuning | ❌ | **✅ Mistral-7B, A100, ~45 min** |
| Deployment | Local only | **✅ Docker + HF Spaces** |
## Baseline Agent Results (RuleBasedAgent, 25 episodes)
| Task | Avg Score | Accuracy | NPA Rate |
|------|-----------|----------|----------|
| task1 (Easy) | 0.389 | 77.9% | 4.8% |
| task2 (Medium) | 0.325 | 66.6% | 8.9% |
| task3 (Hard) | 0.288 | 81.5% | 20.2% |
| task4 (Expert) | 0.265 | 85.9% | 26.7% |
| task5 (Master) | 0.251 | 77.8% | 6.7% |
| **Overall** | **0.304** | **77.9%** | **13.4%** |
---
## All Links
| Resource | Link |
|----------|------|
| **Live Environment** | [huggingface.co/spaces/vssksn/intellicredit-openenv](https://huggingface.co/spaces/vssksn/intellicredit-openenv) |
| **GRPO Model** | [huggingface.co/vssksn/intellicredit-mistral-7b-grpo](https://huggingface.co/vssksn/intellicredit-mistral-7b-grpo) |
| **Training Dataset** | [huggingface.co/datasets/vssksn/intellicredit-grpo-v2](https://huggingface.co/datasets/vssksn/intellicredit-grpo-v2) |
| **GitHub (v2 branch)** | [github.com/1919-14/intellicredit-openenv/tree/v2](https://github.com/1919-14/intellicredit-openenv/tree/v2) |
| **API Swagger** | [vssksn-intellicredit-openenv.hf.space/docs](https://vssksn-intellicredit-openenv.hf.space/docs) |
| **Mini Blog (Story & Results)** | [docs/mini_blog.md](./docs/mini_blog.md) |
| **Full Technical Blog** | [docs/blog.md](./docs/blog.md) |
| **Colab Training Notebook** | [Open in Colab](https://colab.research.google.com/drive/1HhVu1JezKoT32zfHIEfAFersxRrwZSYu?usp=sharing) |
| **Project Summary** | [PROJECT_SUMMARY.md](./PROJECT_SUMMARY.md) |
| **Env Info API** | [/info endpoint](https://vssksn-intellicredit-openenv.hf.space/info) |
---
## 🐳 Docker Deployment
```bash
docker build -t intellicredit-v2 .
docker run -p 7860:7860 intellicredit-v2
# With HF token for LLM inference:
docker run -p 7860:7860 -e HF_TOKEN="your-token" intellicredit-v2
```
## 💻 Hardware Requirements
| Component | Environment Server | GRPO Training |
|-----------|-------------------|---------------|
| CPU | 2 vCPUs minimum | 8+ cores |
| RAM | 2 GB minimum | 32 GB minimum |
| GPU | **Not required** | **A100 80GB mandatory** |
| Storage | 500 MB | ~30 GB (model checkpoints) |
| Training Time | n/a | ~45 minutes |
## Environment Variables
| Variable | Description | Default |
|----------|-------------|--------|
| `HF_TOKEN` | Hugging Face API token | Required for `inference.py` |
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | LLM model for inference | `meta-llama/Llama-3.3-70B-Instruct` |
| `ENV_URL` | Environment server URL (for eval/training) | `http://localhost:7860` |
---
## Citation
```bibtex
@article{intellicredit2025,
title = {IntelliCredit-X: A Multi-Agent Constrained MDP for MSME Credit
Appraisal with GRPO Fine-Tuning},
author = {Narayana, V S S K Sai and Jaiswal, Sujeet},
year = {2026},
note = {OpenEnv Hackathon Submission, Meta × Hugging Face},
url = {https://huggingface.co/spaces/vssksn/intellicredit-openenv}
}
```
---
## License
MIT License. See [LICENSE](LICENSE) for details.
---
*Built by **V S S K Sai Narayana** & **Sujeet Jaiswal** for the Meta × Hugging Face OpenEnv Hackathon 2026.*