# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

A Hugging Face Spaces app that translates between the 418 languages listed in Table 9 (Section A.1) of the [MADLAD-400](https://arxiv.org/pdf/2309.04662) paper, using Google's MADLAD-400 3B Seq2Seq model. Built with Gradio. Falls back to CPU with a warning when no CUDA GPU is available.

## Commands

```bash
# Setup
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install -r requirements-dev.txt

# Run (launches on http://localhost:7860)
python app.py

# Lint and format
ruff check .
ruff format .

# Type check
ty check

# Test
pytest                     # all 48 tests (slow tests require CUDA + model download)
pytest -m "not slow"       # 38 fast tests only
pytest -m slow             # 10 model tests only (CUDA only)

# Generate language mapping (dev only)
python scripts/generate_langmap.py <path-to-paper.pdf>
```

## Architecture

**`app.py`** — Single-file application with a Google Translate-style layout: top row has two symmetric, filterable, region-sorted language dropdowns (source defaults to "English (en)", target defaults to "French (fr)") with a swap button ("⇄") between them; below that, input textbox with inline clear button and output textbox with copy button side by side. The Translate button spans full width below both textboxes (shows "Translating..." during processing). Ctrl+Enter submits from the input. The model auto-detects source language; the source dropdown is for user reference and the swap button only. Uses `@lru_cache` for lazy loading of the `google/madlad400-3b-mt` tokenizer and model (no download on import). Uses `float16` on CUDA, `float32` on CPU. MPS is not supported (produces garbage output with T5 models). Translation prepends a target language token with a space to the input text (e.g., `<2fr> Hello`) before tokenization and generation. The `@spaces.GPU` decorator allocates GPU on HF Spaces infrastructure.
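The prompt format, dtype choice, and lazy loading described above can be sketched roughly as follows. This is an illustration only, not the code in `app.py`: the function names `build_prompt`, `pick_device_and_dtype`, and `load_model_stub` are hypothetical, and the loader returns a plain dict so the sketch runs without `torch` or `transformers` installed.

```python
from functools import lru_cache

MODEL_NAME = "google/madlad400-3b-mt"  # checkpoint named in this file


def build_prompt(text: str, target_code: str) -> str:
    """Prepend the target language token and a space, e.g. '<2fr> Hello'."""
    return f"<2{target_code}> {text}"


def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Return float16 on CUDA and float32 on CPU; MPS is never selected."""
    if cuda_available:
        return ("cuda", "float16")
    return ("cpu", "float32")


@lru_cache(maxsize=1)
def load_model_stub() -> dict:
    """Hypothetical stand-in for the real loader.

    The real app would load the tokenizer and model here. Because of
    lru_cache, the body runs on the first call only, so importing the
    module triggers no download.
    """
    return {"name": MODEL_NAME}


if __name__ == "__main__":
    print(build_prompt("Hello", "fr"))
    print(pick_device_and_dtype(cuda_available=False))
    print(load_model_stub() is load_model_stub())
```

Passing `cuda_available` as a parameter, rather than querying the GPU inside the function, keeps the device fallback testable on machines without CUDA.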

**`langmap/`** — Package with `langid_mapping.py`, mapping 418 language tokens to `{"name": ..., "region": ...}` dicts. Auto-generated by `scripts/generate_langmap.py` from Table 9 (Section A.1) of the MADLAD-400 paper. Available languages at runtime are the intersection of this mapping and the model's vocabulary.
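As a minimal sketch of the runtime intersection, assuming the mapping is keyed by bare codes such as `en` and that the vocabulary holds tokens of the form `<2en>`: the function names, the sample entries, the region labels, and the sort key (region, then name) are all illustrative assumptions, not taken from `langid_mapping.py`.

```python
LangMap = dict[str, dict[str, str]]


def available_languages(mapping: LangMap, vocab: set[str]) -> LangMap:
    """Keep only the entries whose '<2code>' token exists in the vocabulary."""
    return {code: info for code, info in mapping.items() if f"<2{code}>" in vocab}


def dropdown_choices(languages: LangMap) -> list[str]:
    """Build labels like 'English (en)', sorted by region and then by name."""
    ordered = sorted(
        languages.items(),
        key=lambda item: (item[1]["region"], item[1]["name"]),
    )
    return [f"{info['name']} ({code})" for code, info in ordered]


# Illustrative sample data only; "zz" is a made-up code with no vocab token.
SAMPLE_MAPPING: LangMap = {
    "en": {"name": "English", "region": "Europe"},
    "fr": {"name": "French", "region": "Europe"},
    "ja": {"name": "Japanese", "region": "Asia"},
    "zz": {"name": "Placeholder", "region": "Europe"},
}
SAMPLE_VOCAB = {"<2en>", "<2fr>", "<2ja>", "</s>"}

if __name__ == "__main__":
    usable = available_languages(SAMPLE_MAPPING, SAMPLE_VOCAB)
    print(dropdown_choices(usable))
```

Filtering at runtime means a mapping entry without a matching model token never reaches the dropdowns.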

**`scripts/`** — `generate_langmap.py` parses the MADLAD-400 paper PDF (Table 9, pages 16-22) using pdfplumber and generates the static language mapping with region assignments. Dev-only tool; requires `requirements-dev.txt` dependencies.

**`tests/`** — 48 tests (38 fast, 10 slow). `test_langmap.py` has 10 fast tests for mapping validation (dict shape, regions, spot-checks). `test_app.py` has 28 fast tests (signatures, device fallback, UI layout with symmetric dropdowns, swap button, textbox config, handler wiring, no HTML elements, locale codes, no title) and 10 slow tests (translation with various parameters, language mapping). Slow tests require CUDA and model download; auto-skipped without CUDA.
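One common way to auto-skip `slow` tests without CUDA is a collection hook in `conftest.py`. The sketch below is an assumption about how such a skip could be wired, not a copy of this repository's test setup. The helper `cuda_available` is hypothetical and treats a missing `torch` install as "no CUDA".

```python
import pytest


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def pytest_collection_modifyitems(config, items):
    """Mark every test carrying the 'slow' marker as skipped when CUDA is absent."""
    if cuda_available():
        return
    skip_slow = pytest.mark.skip(reason="slow tests require CUDA and a model download")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With a hook like this, `pytest` on a CPU-only machine reports the slow tests as skipped rather than failed, and `pytest -m "not slow"` still deselects them entirely.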

## Tooling

- **uv** — Python package manager. Used for venv creation and dependency installation from `requirements.txt`. No `pyproject.toml`; `requirements.txt` remains the single source of truth (required by HF Spaces).
- **Ruff** — linter and formatter (`ruff.toml`). Rules: `E`, `F`, `I`, `W`. Line length: 120.
- **ty** — type checker (`ty.toml`). Python 3.12 target.
- **pytest** — test runner (`pytest.ini`). Custom `slow` marker for CUDA-dependent tests.