Spaces:
Running
Running
File size: 13,065 Bytes
81e3ca2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 | # Build Small Hackathon β Ideation & Decisions
> Working notes from ideation. Project: **AI Prof** (primary submission).
> Hackathon: https://huggingface.co/build-small-hackathon Β· Field guide: https://build-small-hackathon-field-guide.hf.space/
## Hackathon constraints (the rules that shape everything)
- **Model size: β€32B parameters PER MODEL** (not aggregate). You can freely combine multiple
small models as long as each one individually stays under the cap. ("not just active params")
- **No sponsor exclusivity** β a sponsor's model just has to be *a core part of the experience*;
you may mix in other providers' models.
- **Platform:** must be built in **Gradio**, hosted as a **Hugging Face Space**.
- **Deliverables:** working Space + **demo video** + **social media post**.
- **Timeline:** hack window June 5β15, 2026.
- **Credits:** OpenAI $100 is for **Codex (their coding agent)**, *not* API/inference. Modal $250, HF $20.
## Tracks
- **Backyard AI** β solve a genuine problem for *someone you personally know*. Judged on specificity,
**actual user adoption**, and appropriate model fit. β our track.
- **Thousand Token Wood (TTW)** β delightful/unconventional, AI must be load-bearing (game, story, art).
## Prize surface (it stacks β one app can hit many)
Main tracks ($18k): 1stβ4th per track + Community Choice ($2k).
Sponsor awards:
- **OpenBMB $10k** β build with MiniCPM (incl. vision MiniCPM-V / omni MiniCPM-o).
- **OpenAI $10k** β won via **Codex-attributed commits** in the repo/Space (build-tool track, model-agnostic).
- **NVIDIA** β two RTX 5080 GPUs; build with **Nemotron**. One for "best space", one for community engagement (likes).
- **Modal $20k credits** β use Modal for dev/runtime, note in README.
Special awards ($8k): Bonus Quest Champion $2k Β· Off-Brand (best custom UI via `gr.Server`) $1.5k Β·
Tiny Titan (best app on β€4B model) $1.5k Β· Best Demo $1k Β· Best Agent $1k Β· Judges' Wildcard $1k.
Six merit badges: Off the Grid (no cloud APIs) Β· Well-Tuned (publish fine-tune) Β· Off-Brand (custom UI) Β·
Llama Champion (llama.cpp runtime) Β· Sharing is Caring (publish agent trace) Β· Field Notes (build blog).
## Decision: build the AI Prof (Backyard AI)
**Problem (real):** a specific classmate finds that having the slides isn't enough β they're static and
lack the in-class explanation. Test with **real, anonymized lecture slides** from our class.
**Core loop:** upload lecture PDF β model reads each slide *as an image* β explains it like a TA,
streamed in real time β classmate can **interject** with a question at any moment.
### Two-model architecture (stacks OpenBMB + NVIDIA, cleanly justified)
- **MiniCPM-V (~4.1B, GGUF)** = *the eyes.* Reads each slide as an image (diagrams, equations, layout β
not just scraped text). Core β satisfies **OpenBMB**. Runs via **llama.cpp** β Llama Champion badge.
- **Nemotron 3 Nano (9B, or 30B-A3B MoE = only ~3.6B active β fast decode)** = *the brain.* Turns the
slide reading into the spoken explanation and answers interjections. Reasoning/agentic β **NVIDIA** fit.
- Per-model cap means no param-budget tension between the two. Division of labor (see/explain) survives
the "core part of the experience" test β not sponsor-stuffing.
### Prize map for this one app
Backyard placement Β· OpenBMB $10k Β· NVIDIA RTX 5080 Β· OpenAI $10k (Codex commits) Β· Modal $20k (self-host) Β·
Off-Brand $1.5k+badge (custom whiteboard UI) Β· Llama Champion Β· Off the Grid Β· Sharing is Caring (publish a
teaching-session trace) Β· Field Notes (blog) Β· Best Demo. β realistically 6+ surfaces.
### Scope discipline (vertical slice order)
1. Slide upload β MiniCPM-V reads β Nemotron explains, **streamed** in a simple custom UI. *(Submittable alone.)*
2. Text interjection (pause, ask, resume).
3. Whiteboard β model emits **Mermaid / Excalidraw JSON** (structured; small models do this reliably).
Avoid freeform tldraw generation β that's the day-eating trap.
4. Voice / TTS β only with slack.
## Real-time architecture
Goal: feel like a *live* lecture β explanation streams as if the prof is talking through each slide,
smooth slide-to-slide, interruptible mid-sentence.
- **Token streaming:** stream Nemotron output to the UI (Gradio generators) so it appears as it's generated.
- **Complete index before teaching:** process the full deck before starting the lecture. The professor needs a
global map to choose slides intelligently and answer interjections. Show preparation progress in Gradio.
- **Two-stage per slide, cached:** (a) MiniCPM-V β a structured "slide reading" (text + diagram desc + equations),
computed once and cached; (b) Nemotron β the explanation. Interjections reuse cached (a), never re-run vision.
- **Interjection = interrupt + branch:** need a *cancellable* generation (async cancel token / threading.Event).
On input: stop current stream, answer using the cached slide reading + history as context, then resume.
- **Streaming TTS (optional):** chunk explanation into sentences, TTS each as generated, play sequentially.
Barge-in (interrupt-on-speech) is hard mode β defer.
- **Session state:** { slides[], current_index, cached_slide_readings{}, conversation_history }.
- **Hosting tradeoff for low latency:** free HF Space hardware likely too slow for 2 models in real time.
Lean: **Modal GPU as inference backend** (uses the Modal track) + Gradio Space as frontend. Self-hosting
(not an external inference API) still supports the *Off the Grid* badge.
## Voice (STT + TTS) β makes it feel like *class*, but it's the deepest rabbit hole
Two more tiny, on-HF models (free under the β€32B-*per-model* cap; FAQ uses "a 7B speech model" as its example).
Keep both **open + self-hosted** (not ElevenLabs/Deepgram/OpenAI *API*) β protects the **Off the Grid** badge.
**Pipeline:** TTS (out) speaks Nemotron's streamed explanation; STT (in) transcribes the spoken interjection
into text for Nemotron; **VAD** is the referee that detects *when* the classmate starts talking β triggers barge-in.
- **STT (interjections) β pick: Whisper, or Moonshine for speed.**
- `faster-whisper` / `distil-whisper` (large-v3 ~1.5B, or base/small for latency) β accurate, OpenAI open weights.
- **Moonshine** (~tiny) β built for real-time/on-device, faster time-to-text on short clips.
- Interjections are short β small model is fine; *time-to-text*, not accuracy, is the bottleneck. Start with
`faster-whisper-base`, switch to Moonshine only if laggy.
- **TTS (narration) β pick: Kokoro, fallback Piper.**
- **Kokoro-82M** (`hexgrad/Kokoro-82M`) β 82M, good quality, fast time-to-first-audio, streamable. Sweet spot.
- **Piper** β even lower latency, CPU-friendly, slightly more robotic. Use if speech layer isn't on GPU.
- Stream sentence-by-sentence: synthesize + play each sentence as Nemotron emits it (audio ~1 sentence behind gen).
- **Barge-in / turn-taking β use FastRTC** (`fastrtc`, the Gradio/HF WebRTC stack). Gives low-latency mic+playback
over WebRTC, built-in **VAD turn detection** (`ReplyOnPause`), and the hook to cancel playback + generation the
instant the user speaks. Avoids hand-rolling silence detection; reuses our cancellable-generation design.
- Loop: narrate (TTS) β Silero VAD hears speech β kill TTS + cancel Nemotron β buffer to end-of-speech β STT β
answer using the **cached slide reading** as context β TTS answer β resume narration.
- **Latency budget (what "feels live" needs):** minimize time-to-first-audio. STT small (~100β300ms) + Nemotron
MoE fast prefill + Kokoro first chunk (~tens of ms). Run STT/TTS/VAD on the **same Modal GPU** as Nemotron
(or CPU for Piper+Silero) to avoid network hops.
- **Scope ladder (don't let voice eat the week):**
1. **TTS narration only** β Prof talks, classmate *types* to interject. Low risk, already feels real-time.
2. **Push-to-talk interject** β hold key / tap to ask aloud β STT. No VAD/barge-in. ~90% of magic, ~30% of work.
3. **Full-duplex barge-in** via FastRTC + VAD β only once 1β2 are solid. The 2β3 jump is where time goes.
- **Model count after voice:** MiniCPM-V (vision) + Nemotron (brain) + Whisper/Moonshine (STT) + Kokoro/Piper (TTS)
+ Silero VAD. Multi-model is explicitly blessed; more "appropriate model fit" surface, but more integration.
## Agent loop, tools & slide grounding
The brain (Nemotron) runs as a **tool-using agent** driving the lecture β professor-esque: decides which slide
to be on, when to draw vs. just talk, when to jump back. (Strengthens the **Best Agent** award; the tool-call
sequence *is* the publishable agent trace β **Sharing is Caring** badge.)
**Tools (mutate UI / session state):**
- `goto_slide(i)` / `next_slide()` / `prev_slide()` β navigation; lets it jump back to a referenced slide.
- `look_closer(question)` β on-demand **real-time** MiniCPM-V call on the *current slide image* for detail.
- `draw(mermaid | excalidraw_json)` β render on the whiteboard surface.
- `clear_whiteboard()`
- `highlight_region(bbox)` β optional, later.
- narration = the free-text part of the response β streamed to TTS.
**Control flow:** orchestrator loop. Each turn the agent gets `{ current slide reading, deck outline, recent
history, trigger }` where trigger = "continue lecture" | "user asked: <q>". It returns narration + optional
tool calls; the orchestrator executes tools (swap displayed slide, render whiteboard) and streams narration.
**Reserve heavy reasoning for decision points (between slides), not during narration**, or latency balloons.
**Grounding β preprocess (breadth) + real-time (depth), both:**
- **On upload (once; complete before lecture begins):**
- render each slide β image (serves *both* display and vision),
- **MiniCPM-V** β cached structured slide reading (title, bullets, equations, diagram desc, key concepts),
- **PDF text-layer** extraction (exact text ground-truth; vision misreads text) to complement vision,
- build a **deck outline / index** (slide β title/concepts) β this is what lets the agent *plan* and *pick* a
slide; it can't navigate to a slide it has never seen.
- **During lecture (`look_closer`):** targeted MiniCPM-V look at the actual slide *with the specific question in
context* β handles the long tail (specific visual questions, detail the cached summary glossed over).
- Why both: preprocess = global map + fast/cheap narration + navigation; real-time = accuracy on specific asks.
**UI: show the REAL slide, never a summary.**
- Main surface = the actual rendered slide **image / PDF page**, synced to the agent's `current_slide`.
- **Whiteboard = a separate adjacent canvas** (Mermaid/Excalidraw render) so drawings read as the prof's
annotations, not edits to the original slide. (Region-highlight overlay on the slide can come later.)
- Plus: live caption / transcript + mic / interject control.
The detailed deployment, deck-cache, teaching-beat, interruption, speech, and whiteboard decisions are in
[`ARCHITECTURE.md`](ARCHITECTURE.md).
## Fine-tuning (after core pipeline β not before)
Chosen direction: **teaching-style SFT / guided learning** β tune the brain (Nemotron) to explain like a good
TA: analogies, checks for understanding, concise, no preamble. Bootstrap dataset = (slide reading β ideal
explanation) pairs generated with a strong model + a little hand-curation; **QLoRA via TRL / Unsloth on Modal**;
publish the adapter to HF β **Well-Tuned** badge (+ feeds Bonus Quest Champion). Put a before/after note in the
model card. Strictly a step-2 enhancement β only once the live pipeline works and there's a clear objective.
## Backburner ideas (TTW β only if a 2nd submission is feasible)
- **Emotion-driven TRPG / choose-your-own-adventure:** free text β LLM updates structured NPC emotional
state (trust/fear/affectionβ¦) β state steers the story. Best Agent + Sharing is Caring (emotion-delta traces).
- **AI Garlic Phone:** message/drawing passed down a chain of model personas, mutating; needs a reveal payoff.
If a drawing is passed, pulls in vision = OpenBMB again.
- **Sketch-to-story adventure:** you draw your action, MiniCPM-V interprets the doodle, the world reacts.
Fuses vision + emotion mechanic. Strong demo, only-AI-can-do-this.
- **Model-vs-model battle β traces for training** (browser-brawl-style, different domain): most technically
impressive (Best Agent + Well-Tuned + Sharing is Caring) but research-shaped; the training loop can eat the week.
## Key model references
- MiniCPM-V-4 (4.1B, multimodal): https://huggingface.co/openbmb/MiniCPM-V-4
- MiniCPM-V-4 GGUF (llama.cpp): https://huggingface.co/openbmb/MiniCPM-V-4-gguf
- Nemotron 3 Nano collection: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
- Nemotron 3 Nano 4B: https://huggingface.co/blog/nvidia/nemotron-3-nano-4b
- Nemotron 3 Nano 30B-A3B: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
|