AI Prof Architecture
This document captures the intended production and demo architecture for AI Prof. The core experience is a professor-like agent that understands an entire lecture, controls the visible slide and whiteboard, speaks naturally, and responds to student interruptions without losing its place.
Deployment
Hugging Face Space
  Gradio UI
  session state
  professor orchestrator
  slide and whiteboard rendering
        |
        v
Modal inference services
  MiniCPM-V slide vision
  Nemotron professor agent
  VoxCPM text-to-speech
  Whisper or Moonshine speech-to-text
        |
        v
Hugging Face storage
  published demo deck
  processed deck manifests
  slide images and readings
- Hugging Face Space: public Gradio application and lightweight orchestration.
- Modal: GPU-backed, self-hosted model inference that can scale down when idle.
- Local development: the same Gradio app in mock mode or pointed at local model servers.
- Hugging Face storage: persistent processed lectures, including one polished public demo deck that works immediately for judges and visitors.
Deck Preparation
The professor should not begin teaching after only slide 1 is processed. Navigation and question answering depend on knowing what exists across the whole lecture.
On upload:
- Hash the PDF to identify an existing processed deck.
- If cached, load its manifest and slide assets.
- Otherwise, render every page and extract the PDF text layer.
- Run MiniCPM-V over every slide.
- Build a compact deck index from all slide readings.
- Persist the processed result when appropriate.
- Enable the lecture only when the complete index is ready.
Gradio displays preparation progress while this runs. It does not need to run the heavy processing itself; it can call Modal jobs and stream their progress.
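The first two upload steps (hash the PDF, then look for an existing processed deck) can be sketched as follows. This is a minimal illustration, assuming the processed decks live under a local `decks_root` directory laid out as `decks/<pdf_sha256>/manifest.json`; the function names `pdf_sha256` and `find_cached_deck` are hypothetical, and a real deployment would check Hugging Face storage instead of the local filesystem.

```python
import hashlib
from pathlib import Path
from typing import Optional


def pdf_sha256(pdf_path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the PDF in chunks so a large deck is never fully loaded into memory."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cached_deck(pdf_path: Path, decks_root: Path) -> Optional[Path]:
    """Return decks/<pdf_sha256>/ if it already holds a manifest, else None.

    Assumption: decks_root is a local directory standing in for remote storage.
    """
    deck_dir = decks_root / pdf_sha256(pdf_path)
    return deck_dir if (deck_dir / "manifest.json").is_file() else None
```

Keying the cache on the content hash rather than the filename means a re-uploaded or renamed copy of the same PDF reuses the existing readings.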
Processed Deck Format
decks/<pdf_sha256>/
  manifest.json
  source.pdf
  slides/
    001.png
    002.png
  readings/
    001.json
    002.json
Each index entry should remain compact enough to keep the entire deck map in the agent context:
{
  "slide": 7,
  "title": "Convolution",
  "summary": "Applying a kernel across an image",
  "concepts": ["kernel", "stride", "weighted sum"],
  "equations": ["g(x,y) = sum_i sum_j h(i,j)f(x-i,y-j)"],
  "visuals": ["A 3 by 3 kernel moving across a pixel grid"]
}
The agent receives the complete compact index, but the full reading only for the current slide and any slides it specifically retrieves.
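One way to keep the index compact is to strip each slide reading down to the fields in the example entry and reject a deck map that outgrows a size budget. The sketch below assumes readings are plain dictionaries; `build_deck_index`, the `full_text` field in the usage note, and the `max_chars` budget (a rough stand-in for a token limit) are hypothetical.

```python
import json
from typing import Any, Dict, List

# Field names are taken from the example index entry.
COMPACT_FIELDS = ("slide", "title", "summary", "concepts", "equations", "visuals")


def compact_entry(reading: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the compact fields and drop anything heavier."""
    return {key: reading[key] for key in COMPACT_FIELDS if key in reading}


def build_deck_index(readings: List[Dict[str, Any]], max_chars: int = 20000) -> str:
    """Serialize the whole deck map, failing loudly if it exceeds the budget.

    Assumption: max_chars is a hypothetical character budget, not a real token count.
    """
    entries = sorted((compact_entry(r) for r in readings), key=lambda e: e["slide"])
    text = json.dumps(entries, separators=(",", ":"))
    if len(text) > max_chars:
        raise ValueError(f"deck index is {len(text)} chars, budget is {max_chars}")
    return text
```

A reading that also carries a large field such as `full_text` would contribute only its compact fields to the index; the full reading stays in `readings/NNN.json` until the agent needs it.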
Professor Agent
The model should make decisions at meaningful teaching boundaries, not once per sentence or drawing stroke. One agent turn produces a short teaching beat:
{
  "narration": "Imagine this grid is a small patch of the image.",
  "actions": [
    {"tool": "draw_grid", "args": {"rows": 3, "cols": 3}, "at": 0.4}
  ],
  "next": "continue"
}
The orchestrator executes the actions and speech. After the beat completes, it asks the agent what to do next.
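Because the beat comes from a language model, the orchestrator probably wants to validate it before executing anything. The following is a minimal sketch using only the standard library; the class names `Action` and `TeachingBeat`, the function `parse_beat`, and the choice to default a missing `at` to 0.0 and a missing `next` to "continue" are assumptions, not part of the documented schema.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Action:
    tool: str
    args: Dict[str, Any]
    at: float = 0.0  # approximate offset in seconds from the start of narration


@dataclass
class TeachingBeat:
    narration: str
    actions: List[Action] = field(default_factory=list)
    next: str = "continue"


def parse_beat(raw: str) -> TeachingBeat:
    """Parse one agent turn and reject malformed beats before anything runs."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("beat must be a JSON object")
    narration = data.get("narration")
    if not isinstance(narration, str) or not narration.strip():
        raise ValueError("beat needs a non-empty narration string")
    actions = []
    for item in data.get("actions", []):
        offset = float(item.get("at", 0.0))
        if offset < 0:
            raise ValueError("action offsets must be non-negative")
        actions.append(Action(str(item["tool"]), dict(item.get("args", {})), offset))
    actions.sort(key=lambda action: action.at)  # execute in time order
    return TeachingBeat(narration, actions, str(data.get("next", "continue")))
```

Sorting the actions by offset at parse time lets the executor walk them in order without re-checking timing.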
Agent Context
Each decision receives:
- Complete compact deck index
- Current slide number and full cached slide reading
- Current whiteboard state
- Recent conversation and teaching beats
- Saved lecture position
- Trigger: continue lecture or student question
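The six inputs above could be bundled into one object that is serialized for each decision. This is a sketch under stated assumptions: the `AgentContext` class, its field names, the two trigger strings, and the `max_recent` window that trims older conversation are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List

TRIGGERS = ("continue", "student_question")  # assumed trigger labels


@dataclass
class AgentContext:
    deck_index: List[Dict[str, Any]]       # complete compact index
    current_slide: int
    current_reading: Dict[str, Any]        # full cached reading for this slide only
    whiteboard: List[Dict[str, Any]]       # primitives currently on the board
    recent_turns: List[Dict[str, Any]]     # conversation and teaching beats
    saved_position: Dict[str, Any]
    trigger: str
    max_recent: int = 6                    # hypothetical history window

    def to_payload(self) -> str:
        """Serialize the context, keeping only the most recent turns."""
        if self.trigger not in TRIGGERS:
            raise ValueError(f"unknown trigger: {self.trigger!r}")
        recent = self.recent_turns[-self.max_recent:] if self.max_recent > 0 else []
        return json.dumps({
            "deck_index": self.deck_index,
            "current_slide": self.current_slide,
            "current_reading": self.current_reading,
            "whiteboard": self.whiteboard,
            "recent_turns": recent,
            "saved_position": self.saved_position,
            "trigger": self.trigger,
        })
```

Trimming only the recent turns, and never the deck index, keeps the whole-lecture map available on every decision.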
Tools
- goto_slide(index) - move to the best supporting slide
- next_slide() and prev_slide() - ordinary navigation
- look_closer(question) - ask MiniCPM-V to inspect the current slide for a specific visual detail; wire this after the core loop
- write_latex(expression, position) - place a typeset equation
- draw_diagram(spec) - render structured Excalidraw-style primitives
- clear_whiteboard() - reset the board when the visual context changes
- highlight_region(bbox) - optional later enhancement
Tool calls and their results should be logged as a publishable teaching-session trace.
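A small registry can both dispatch tool calls and record the trace in one place. The sketch below is illustrative only: `ToolExecutor`, the trace entry fields, and the toy `goto_slide` handler in the usage example are hypothetical, and failures are recorded rather than raised so that one bad call does not end the session.

```python
import time
from typing import Any, Callable, Dict, List


class ToolExecutor:
    """Dispatch tool calls by name and log every call with its outcome."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.trace: List[Dict[str, Any]] = []

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._tools[name] = handler

    def call(self, tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
        entry: Dict[str, Any] = {"tool": tool, "args": args, "ts": time.time()}
        try:
            if tool not in self._tools:
                raise KeyError(f"unknown tool {tool}")
            entry["result"] = self._tools[tool](**args)
            entry["ok"] = True
        except Exception as exc:  # record the failure instead of crashing the lecture
            entry["ok"] = False
            entry["error"] = f"{type(exc).__name__}: {exc}"
        self.trace.append(entry)
        return entry
```

For example:

```python
executor = ToolExecutor()
executor.register("goto_slide", lambda index: {"slide": index})
executor.call("goto_slide", {"index": 7})
executor.call("erase_everything", {})   # logged with ok=False
```

The `trace` list is plain JSON-serializable data, so it can be written out as the teaching-session trace.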
Student Interruption
The lecture is controlled by an explicit state machine:
NARRATING
  -> student begins speaking
INTERRUPTING
  -> stop TTS and cancel current generation
LISTENING
  -> capture speech until push-to-talk release or VAD pause
THINKING
  -> transcribe, search deck index, choose slide and visual support
ANSWERING
  -> navigate or draw when useful, then speak the answer
RESUMING
  -> continue from the saved teaching position
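The state list can be encoded as an enum with an explicit transition table, so that an out-of-order event fails immediately instead of leaving the lecture in an ambiguous state. This is a sketch, not the project's implementation: the class names are hypothetical, and the `RESUMING -> NARRATING` edge is an assumption inferred from "continue from the saved teaching position".

```python
from enum import Enum


class LectureState(Enum):
    NARRATING = "narrating"
    INTERRUPTING = "interrupting"
    LISTENING = "listening"
    THINKING = "thinking"
    ANSWERING = "answering"
    RESUMING = "resuming"


# Allowed transitions, following the order of the state list.
TRANSITIONS = {
    LectureState.NARRATING: {LectureState.INTERRUPTING},
    LectureState.INTERRUPTING: {LectureState.LISTENING},
    LectureState.LISTENING: {LectureState.THINKING},
    LectureState.THINKING: {LectureState.ANSWERING},
    LectureState.ANSWERING: {LectureState.RESUMING},
    LectureState.RESUMING: {LectureState.NARRATING},  # assumption: resume re-enters narration
}


class LectureStateMachine:
    def __init__(self) -> None:
        self.state = LectureState.NARRATING

    def advance(self, target: LectureState) -> LectureState:
        """Move to target, or raise if the transition is not in the table."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
        return self.state
```

Supporting a second interruption while the professor is answering would mean adding an `ANSWERING -> INTERRUPTING` edge; the table makes that a one-line, reviewable change.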
When a student asks a question, the agent first decides whether the current slide is sufficient. It may:
- Answer on the current slide
- Navigate to a more relevant earlier or later slide
- Inspect a slide more closely
- Draw an explanation on the whiteboard
- Combine navigation and drawing
The agent decides whether to return to the previous slide afterward. Automatically returning every time would make the lecture feel mechanical.
The orchestrator saves the interrupted teaching beat and sentence position. For the first implementation, resuming at the beginning of that beat is acceptable and much simpler than resuming at an exact audio sample.
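The saved position and the simpler first-pass resume rule might look like the sketch below. `LecturePosition` and `resume_point` are hypothetical names; the only behavior taken from the text is that the default resume point is the start of the interrupted beat, with the exact sentence kept for a later refinement.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LecturePosition:
    slide: int
    beat_index: int
    sentence_index: int = 0  # sentence within the beat that was being spoken


def resume_point(saved: LecturePosition, exact: bool = False) -> LecturePosition:
    """Return where narration restarts after an interruption.

    By default the whole interrupted beat is replayed; exact=True keeps the
    saved sentence for a later, finer-grained implementation.
    """
    return saved if exact else replace(saved, sentence_index=0)
```

Storing the sentence index even though the first implementation ignores it means the saved data does not need to change when finer resumption is added.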
Speech
- TTS: VoxCPM for professor narration.
- STT: faster-whisper or Moonshine for short student questions.
- Transport and turn detection: FastRTC, initially push-to-talk and later VAD barge-in.
Narration should be synthesized in short sentence or beat-sized chunks. This keeps latency low and gives the orchestrator clean cancellation boundaries.
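Sentence-sized chunking with a cancellation check between chunks can be sketched as follows. The `synthesize` callable is a hypothetical stand-in for the VoxCPM call (its real API is not shown here), and the regular expression is a naive splitter that will mis-split abbreviations such as "e.g.".

```python
import re
import threading
from typing import Callable, List

# Naive boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(narration: str) -> List[str]:
    """Split narration into sentence-sized chunks, dropping empty pieces."""
    return [part.strip() for part in _SENTENCE_END.split(narration) if part.strip()]


def speak_in_chunks(
    narration: str,
    synthesize: Callable[[str], None],
    cancel: threading.Event,
) -> int:
    """Synthesize one sentence at a time; stop at the next boundary once cancelled.

    Returns the number of sentences that were synthesized, which can be saved
    as the sentence position of the interrupted beat.
    """
    spoken = 0
    for sentence in split_sentences(narration):
        if cancel.is_set():
            break
        synthesize(sentence)
        spoken += 1
    return spoken
```

With this shape, an interruption costs at most one sentence of already-started synthesis rather than a whole beat.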
Whiteboard
Avoid unrestricted free-form drawing as the primary path. It requires too many model calls and is difficult to synchronize or reproduce.
Use structured operations:
- LaTeX for equations
- Excalidraw-style primitives for boxes, arrows, labels, grids, and highlights
- Optional Mermaid for diagrams where automatic layout is useful
- Manim only for prepared showcase animations, not the live agent loop
The model emits one structured drawing plan per teaching beat. The frontend animates the resulting primitives locally, so drawing does not require another inference call for every stroke.
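To illustrate why structured operations avoid per-stroke inference, a single `draw_grid` action with `rows` and `cols` can be expanded deterministically into line primitives. The sketch is written in Python for consistency with the other examples, although the expansion could equally run in the frontend; the primitive dictionary shape, the `cell` size, and the function name `grid_primitives` are hypothetical and not Excalidraw's actual element schema.

```python
from typing import Any, Dict, List, Tuple


def grid_primitives(
    rows: int,
    cols: int,
    cell: float = 40.0,
    origin: Tuple[float, float] = (0.0, 0.0),
) -> List[Dict[str, Any]]:
    """Expand one grid spec into rows + 1 horizontal and cols + 1 vertical lines."""
    x0, y0 = origin
    width, height = cols * cell, rows * cell
    lines: List[Dict[str, Any]] = []
    for r in range(rows + 1):
        y = y0 + r * cell
        lines.append({"type": "line", "from": (x0, y), "to": (x0 + width, y)})
    for c in range(cols + 1):
        x = x0 + c * cell
        lines.append({"type": "line", "from": (x, y0), "to": (x, y0 + height)})
    return lines
```

The model emits one small spec, and the renderer can then animate each resulting line in sequence without any further model call.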
Speech and Drawing Synchronization
Each action may include an approximate offset relative to the narration:
{
  "narration": "The center pixel is replaced using all nine neighbors.",
  "actions": [
    {"tool": "highlight_cell", "args": {"row": 1, "column": 1}, "at": 0.2},
    {"tool": "write_latex", "args": {"expression": "1/9 sum pixels"}, "at": 2.1}
  ]
}
The orchestrator:
- Starts TTS for the teaching beat.
- Executes visual actions at their approximate offsets.
- Waits for speech and drawing to finish.
- Requests the next teaching beat.
This produces the feeling of talking while drawing without making an agent call per stroke.
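The four orchestrator steps map naturally onto `asyncio`: speech runs as one task, each visual action sleeps until its offset and then fires, and the beat ends when everything has finished. This is a minimal sketch; `run_beat`, the `speak` coroutine (a stand-in for streaming TTS playback), and the `execute` callback are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def run_beat(
    beat: Dict[str, Any],
    speak: Callable[[str], Awaitable[None]],
    execute: Callable[[str, Dict[str, Any]], None],
) -> None:
    """Play one teaching beat: narration plus visual actions at their offsets."""

    async def fire(action: Dict[str, Any]) -> None:
        # Offsets are approximate seconds from the start of narration.
        await asyncio.sleep(max(0.0, float(action.get("at", 0.0))))
        execute(action["tool"], action.get("args", {}))

    speech = asyncio.create_task(speak(beat["narration"]))
    visuals = [asyncio.create_task(fire(a)) for a in beat.get("actions", [])]
    # Wait for both speech and drawing before the next beat is requested.
    await asyncio.gather(speech, *visuals)
```

Since every piece of the beat is a task, an interruption can cancel the speech task and any pending visual tasks together, which matches the cancellation boundary described in the Speech section.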
Demo Path
The public Space should offer two entry points:
- Try the prepared lecture: loads an already processed, polished deck from Hugging Face storage immediately.
- Upload your own lecture: preprocesses the complete deck, displays progress, then starts the professor.
The prepared lecture is the reliable judging and demo-video path. User uploads prove the system generalizes without making the first experience depend on a long vision preprocessing wait.
Implementation Order
- Complete-deck manifest and compact index
- Prepared demo deck loading from Hugging Face storage
- Professor teaching-beat schema and tool executor
- Slide navigation tools
- Structured whiteboard tools and local animation
- VoxCPM narration with cancellable chunks
- Push-to-talk interruption and STT
- Automatic slide retrieval during questions
- VAD barge-in
- Targeted look_closer vision calls