# GIS-Coder 7B — Training Package Fine-tune **Qwen2.5-Coder-7B-Instruct** into a GIS code specialist using QLoRA SFT. ## 📁 What's Included | File | Description | |------|-------------| | `train_7b.py` | Production training script with CLI args | | `evaluate.py` | Evaluation on 12 GIS benchmarks with scoring | | `requirements.txt` | All dependencies | Dataset: [`RhodWeo/gis-code-instructions`](https://huggingface.co/datasets/RhodWeo/gis-code-instructions) — 70 expert-curated GIS code examples covering 13 Python libraries. ## 🚀 Quick Start ### 1. Install dependencies ```bash pip install -r requirements.txt ``` ### 2. Login to HuggingFace ```bash huggingface-cli login ``` ### 3. Train (single GPU) ```bash # Default settings (recommended for A100 80GB) python train_7b.py # A10G / RTX 4090 (24GB) — reduce batch size python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048 # H100 — can afford larger batch and sequence length python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192 # Full precision LoRA (no quantization, needs ~30GB) python train_7b.py --no_quantize --batch_size 1 # With Flash Attention (faster, needs flash-attn installed) python train_7b.py --use_flash_attn # With Trackio monitoring python train_7b.py --use_trackio --trackio_project my-gis-coder ``` ### 4. Multi-GPU ```bash accelerate launch --num_processes 2 train_7b.py --batch_size 2 --grad_accum 4 ``` ### 5. Evaluate ```bash # Evaluate fine-tuned model python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B # Compare with base model python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B --compare_base # Evaluate local checkpoint python evaluate.py --adapter_id ./gis-coder-7b-output/final ``` ## ⚙️ Hyperparameter Guide ### Recommended defaults (battle-tested recipe): | Parameter | Value | Source | |-----------|-------|--------| | `--lr` | `2e-4` | LoRA Without Regret (10× base SFT rate) | | `--lora_r` | `32` | MapCoder-Lite optimal for code tasks | | `--lora_alpha` | `16` | α/r = 0.5 | | `--target_modules` | `all-linear` | LoRA Without Regret | | `--epochs` | `3` | CFD paper: peak at epoch 2, decline after 4 | | `--scheduler` | `cosine` | Standard for LoRA | | `--warmup_ratio` | `0.1` | CFD paper: 10% warmup | | `--max_length` | `4096` | Covers longest GIS code examples | ### Hardware-specific settings: | GPU | VRAM | `--batch_size` | `--grad_accum` | `--max_length` | Notes | |-----|------|----------------|-----------------|----------------|-------| | RTX 3090 | 24GB | 1 | 16 | 2048 | QLoRA only | | RTX 4090 | 24GB | 1 | 16 | 2048 | QLoRA, slightly faster | | A10G | 24GB | 1 | 16 | 2048 | QLoRA only | | L40S | 48GB | 2 | 8 | 4096 | QLoRA or LoRA | | A100 40GB | 40GB | 2 | 8 | 4096 | Recommended minimum | | A100 80GB | 80GB | 2 | 8 | 4096 | Ideal | | H100 | 80GB | 4 | 4 | 8192 | Fastest | ### Ablation ideas: ```bash # Higher LoRA rank (more capacity, slower) python train_7b.py --lora_r 64 --lora_alpha 32 # Lower learning rate (more stable, slower convergence) python train_7b.py --lr 5e-5 # More epochs (risk overfitting on 70 examples) python train_7b.py --epochs 5 # Target only attention layers (fewer params, faster) python train_7b.py --target_modules q_proj,k_proj,v_proj,o_proj ``` ## 📊 Expected Results From our CPU training run with 0.5B base model (70 examples, 3 epochs): | Metric | Start → End | |--------|------------| | Loss | 1.52 → 0.88 (−42%) | | Token accuracy | 69% → 79% | | Eval quality score | 85% | **With the 7B model + QLoRA, expect significantly better results** — the CFD paper achieved 88.7% accuracy with this exact recipe on a similarly-sized domain-specific dataset. ## 📚 Dataset Details **70 examples** covering 13 GIS Python libraries: | Library | Examples | Why Important | |---------|----------|---------------| | OSMnx | 9 | **All models score 0%** — routing, POIs, isochrones | | Rasterio | 9 | Satellite imagery, DEM, NDVI, reprojection | | GeoPandas | 25 | Core: spatial joins, buffering, I/O | | Shapely | 14 | Geometry operations, validation | | MovingPandas | 3 | **All models score 0%** — GPS trajectories | | GDAL | 6 | Raster processing, format conversion | | PyProj | 2 | CRS handling (critical weakness) | | H3 | 2 | Hexagonal indexing | | Folium | 1 | Interactive maps | | Fiona | 2 | Low-level vector I/O | | xarray | 1 | Climate/raster datacubes | | PyQGIS | 1 | Desktop GIS scripting | | PySAL | 1 | Spatial statistics | Each example includes: - System prompt establishing GIS expertise - Natural language instruction - Step-by-step Chain-of-Thought reasoning - Complete, documented Python code - Key points explaining design decisions ## 🔬 Scaling to 20K+ Examples To maximize quality, use the **OSS-Instruct pattern** (from Magicoder): 1. Crawl GitHub for GIS Python code (`import geopandas`, `import rasterio`, etc.) 2. Use GPT-4o to generate (instruction, solution) pairs from real code snippets 3. Execute and test all generated solutions 4. Add CoT annotations to passing examples (+20.9% pass@1 per CFD paper) Target: 20K–75K examples for production-grade GIS-Coder. ## 📖 References | Paper | Key Insight | |-------|-------------| | [CFD Fine-tuning](https://arxiv.org/abs/2504.09602) | QLoRA SFT recipe: 7B model beats 72B on domain tasks | | [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Qwen2.5-Coder-7B best backbone for code LoRA | | [GIS Benchmark](https://arxiv.org/abs/2410.04617) | All models score 0% on OSMNX/MovingPandas | | [Magicoder](https://arxiv.org/abs/2312.02120) | OSS-Instruct for synthetic data from real code | | [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | target all-linear, r=64-256, lr=2e-4 |