# AI Prof Architecture

This document captures the intended production and demo architecture for AI Prof. The core experience is a professor-like agent that understands an entire lecture, controls the visible slide and whiteboard, speaks naturally, and responds to student interruptions without losing its place.

## Deployment

```text
Hugging Face Space
  Gradio UI
  session state
  professor orchestrator
  slide and whiteboard rendering
        |
        v
Modal inference services
  MiniCPM-V slide vision
  Nemotron professor agent
  VoxCPM text-to-speech
  Whisper or Moonshine speech-to-text
        |
        v
Hugging Face storage
  published demo deck
  processed deck manifests
  slide images and readings
```

- **Hugging Face Space:** public Gradio application and lightweight orchestration.
- **Modal:** GPU-backed, self-hosted model inference that can scale down when idle.
- **Local development:** the same Gradio app in mock mode or pointed at local model servers.
- **Hugging Face storage:** persistent processed lectures, including one polished public demo deck that works immediately for judges and visitors.

## Deck Preparation

The professor should not begin teaching after only slide 1 is processed. Navigation and question answering depend on knowing what exists across the whole lecture.

On upload:

1. Hash the PDF to identify an existing processed deck.
2. If cached, load its manifest and slide assets.
3. Otherwise, render every page and extract the PDF text layer.
4. Run MiniCPM-V over every slide.
5. Build a compact deck index from all slide readings.
6. Persist the processed result when appropriate.
7. Enable the lecture only when the complete index is ready.

Gradio displays preparation progress while this runs. It does not need to hold the heavy processing itself; it can call Modal jobs and stream their progress.
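
The upload steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `prepare_deck`, `deck_id`, and the injected `render_pages` and `read_slide` callables are hypothetical names standing in for PDF rendering and the MiniCPM-V call, and text-layer extraction is omitted for brevity. The manifest is written last so that its presence signals a fully processed deck.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def deck_id(pdf_bytes: bytes) -> str:
    """Content hash used as the cache key for a processed deck."""
    return hashlib.sha256(pdf_bytes).hexdigest()[:16]


def prepare_deck(
    pdf_bytes: bytes,
    cache_root: Path,
    render_pages: Callable[[bytes], list],    # hypothetical: PDF bytes -> list of PNG bytes
    read_slide: Callable[[int, bytes], dict],  # hypothetical: vision-model reading of one slide
    on_progress: Callable[[int, int], None] = lambda done, total: None,
) -> dict:
    deck_dir = cache_root / deck_id(pdf_bytes)
    manifest_path = deck_dir / "manifest.json"

    # Steps 1-2: reuse an existing processed deck when the hash matches.
    if manifest_path.exists():
        return json.loads(manifest_path.read_text())

    # Step 3: render every page before any teaching starts.
    pages = render_pages(pdf_bytes)
    (deck_dir / "slides").mkdir(parents=True, exist_ok=True)
    (deck_dir / "readings").mkdir(parents=True, exist_ok=True)
    (deck_dir / "source.pdf").write_bytes(pdf_bytes)

    index = []
    for number, png in enumerate(pages, start=1):
        (deck_dir / "slides" / f"{number:03d}.png").write_bytes(png)
        reading = read_slide(number, png)  # Step 4: one vision call per slide
        (deck_dir / "readings" / f"{number:03d}.json").write_text(json.dumps(reading))
        # Step 5: keep only compact fields in the deck-wide index.
        index.append({
            "slide": number,
            "title": reading.get("title", ""),
            "summary": reading.get("summary", ""),
            "concepts": reading.get("concepts", []),
        })
        on_progress(number, len(pages))  # lets the UI show preparation progress

    # Steps 6-7: persist the manifest only once the complete index exists.
    manifest = {"deck_id": deck_dir.name, "slide_count": len(pages), "index": index}
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

In a deployment like the one described, `read_slide` would presumably wrap a remote inference call and `on_progress` would feed the Gradio progress display; both are left as injected callables here so the sketch runs without any model.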

### Processed Deck Format

```text
decks//
  manifest.json
  source.pdf
  slides/
    001.png
    002.png
  readings/
    001.json
    002.json
```

Each index entry should remain compact enough to keep the entire deck map in the agent context:

```json
{
  "slide": 7,
  "title": "Convolution",
  "summary": "Applying a kernel across an image",
  "concepts": ["kernel", "stride", "weighted sum"],
  "equations": ["g(x,y) = sum_i sum_j h(i,j)f(x-i,y-j)"],
  "visuals": ["A 3 by 3 kernel moving across a pixel grid"]
}
```

The agent receives the complete compact index, but only the full reading for the current or specifically retrieved slides.

## Professor Agent

The model should make decisions at meaningful teaching boundaries, not once per sentence or drawing stroke.

One agent turn produces a short **teaching beat**:

```json
{
  "narration": "Imagine this grid is a small patch of the image.",
  "actions": [
    {"tool": "draw_grid", "args": {"rows": 3, "cols": 3}, "at": 0.4}
  ],
  "next": "continue"
}
```

The orchestrator executes the actions and speech. After the beat completes, it asks the agent what to do next.

### Agent Context

Each decision receives:

- Complete compact deck index
- Current slide number and full cached slide reading
- Current whiteboard state
- Recent conversation and teaching beats
- Saved lecture position
- Trigger: continue lecture or student question

### Tools

- `goto_slide(index)` - move to the best supporting slide
- `next_slide()` and `prev_slide()` - ordinary navigation
- `look_closer(question)` - ask MiniCPM-V to inspect the current slide for a specific visual detail; wire this after the core loop
- `write_latex(expression, position)` - place a typeset equation
- `draw_diagram(spec)` - render structured Excalidraw-style primitives
- `clear_whiteboard()` - reset the board when the visual context changes
- `highlight_region(bbox)` - optional later enhancement

Tool calls and their results should be logged as a publishable teaching-session trace.
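
One way the tool executor and its trace might fit together is sketched below. The `ToolExecutor` class, its method names, and the trace entry fields are assumptions made for illustration; only the tool name `goto_slide` and the beat shape come from this document. Unknown tools and handler errors are recorded in the trace instead of raising, so a malformed beat does not halt the lecture.

```python
import json
import time
from typing import Any, Callable


class ToolExecutor:
    """Dispatches teaching-beat actions to handlers and records a session trace."""

    def __init__(self) -> None:
        self._handlers: dict = {}
        self.trace: list = []

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._handlers[name] = handler

    def run_actions(self, beat: dict) -> None:
        # Process actions in order of their offset within the narration.
        ordered = sorted(beat.get("actions", []), key=lambda a: a.get("at", 0.0))
        for action in ordered:
            entry = {
                "tool": action["tool"],
                "args": action.get("args", {}),
                "at": action.get("at", 0.0),
                "timestamp": time.time(),
            }
            handler = self._handlers.get(action["tool"])
            if handler is None:
                entry["error"] = "unknown tool"
            else:
                try:
                    entry["result"] = handler(**entry["args"])
                except Exception as exc:  # record the failure, keep the lecture running
                    entry["error"] = repr(exc)
            self.trace.append(entry)

    def export_trace(self) -> str:
        """Serialize the trace as JSON Lines for publishing."""
        return "\n".join(json.dumps(e, default=str) for e in self.trace)


# Usage with a single navigation tool backed by in-memory state.
state = {"slide": 1}


def goto_slide(index: int) -> int:
    state["slide"] = index
    return state["slide"]


executor = ToolExecutor()
executor.register("goto_slide", goto_slide)
executor.run_actions({
    "narration": "Let's jump to the convolution slide.",
    "actions": [{"tool": "goto_slide", "args": {"index": 7}, "at": 0.0}],
    "next": "continue",
})
```

This sketch ignores timing and simply sorts by `at`; scheduling actions against live narration is a separate concern.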

## Student Interruption

The lecture is controlled by an explicit state machine:

```text
NARRATING
  -> student begins speaking
INTERRUPTING
  -> stop TTS and cancel current generation
LISTENING
  -> capture speech until push-to-talk release or VAD pause
THINKING
  -> transcribe, search deck index, choose slide and visual support
ANSWERING
  -> navigate or draw when useful, then speak the answer
RESUMING
  -> continue from the saved teaching position
```

When a student asks a question, the agent first decides whether the current slide is sufficient. It may:

- Answer on the current slide
- Navigate to a more relevant earlier or later slide
- Inspect a slide more closely
- Draw an explanation on the whiteboard
- Combine navigation and drawing

The agent decides whether to return to the previous slide afterward. Automatically returning every time would make the lecture feel mechanical.

The orchestrator saves the interrupted teaching beat and sentence position. For the first implementation, resuming at the beginning of that beat is acceptable and much simpler than resuming at an exact audio sample.

## Speech

- **TTS:** VoxCPM for professor narration.
- **STT:** faster-whisper or Moonshine for short student questions.
- **Transport and turn detection:** FastRTC, initially push-to-talk and later VAD barge-in.

Narration should be synthesized in short sentence or beat-sized chunks. This keeps latency low and gives the orchestrator clean cancellation boundaries.

## Whiteboard

Avoid unrestricted free-form drawing as the primary path. It requires too many model calls and is difficult to synchronize or reproduce.

Use structured operations:

- LaTeX for equations
- Excalidraw-style primitives for boxes, arrows, labels, grids, and highlights
- Optional Mermaid for diagrams where automatic layout is useful
- Manim only for prepared showcase animations, not the live agent loop

The model emits one structured drawing plan per teaching beat.
The frontend animates the resulting primitives locally, so drawing does not require another inference call for every stroke.

### Speech and Drawing Synchronization

Each action may include an approximate offset relative to the narration:

```json
{
  "narration": "The center pixel is replaced using all nine neighbors.",
  "actions": [
    {"tool": "highlight_cell", "args": {"row": 1, "column": 1}, "at": 0.2},
    {"tool": "write_latex", "args": {"expression": "1/9 sum pixels"}, "at": 2.1}
  ]
}
```

The orchestrator:

1. Starts TTS for the teaching beat.
2. Executes visual actions at their approximate offsets.
3. Waits for speech and drawing to finish.
4. Requests the next teaching beat.

This produces the feeling of talking while drawing without making an agent call per stroke.

## Demo Path

The public Space should offer two entry points:

1. **Try the prepared lecture:** loads an already processed, polished deck from Hugging Face storage immediately.
2. **Upload your own lecture:** preprocesses the complete deck, displays progress, then starts the professor.

The prepared lecture is the reliable judging and demo-video path. User uploads prove the system generalizes without making the first experience depend on a long vision preprocessing wait.

## Implementation Order

1. Complete-deck manifest and compact index
2. Prepared demo deck loading from Hugging Face storage
3. Professor teaching-beat schema and tool executor
4. Slide navigation tools
5. Structured whiteboard tools and local animation
6. VoxCPM narration with cancellable chunks
7. Push-to-talk interruption and STT
8. Automatic slide retrieval during questions
9. VAD barge-in
10. Targeted `look_closer` vision calls
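
For steps 3, 5, and 6, the beat execution described under Speech and Drawing Synchronization might start from a sketch like the one below. It assumes the `at` offsets are seconds from the start of narration, which this document does not state explicitly. `run_beat`, `speak`, and `apply_action` are hypothetical names: `speak` stands in for TTS playback and `apply_action` for the whiteboard or slide dispatcher.

```python
import asyncio
from typing import Awaitable, Callable


async def run_beat(
    beat: dict,
    speak: Callable[[str], Awaitable[None]],  # hypothetical: resolves when audio finishes
    apply_action: Callable[[dict], None],     # hypothetical: performs one visual action
) -> None:
    """Play one teaching beat: narration plus visual actions fired at their offsets."""

    async def fire(action: dict) -> None:
        # Wait until the action's offset (assumed to be seconds), then apply it.
        await asyncio.sleep(max(0.0, action.get("at", 0.0)))
        apply_action(action)

    speech = asyncio.create_task(speak(beat["narration"]))
    visuals = [asyncio.create_task(fire(a)) for a in beat.get("actions", [])]

    # Return only when both speech and drawing are done. If the task running
    # run_beat is cancelled (for example on a student interruption), gather
    # cancels the speech and any visual actions that have not fired yet.
    await asyncio.gather(speech, *visuals)
```

Because the whole beat is a single awaitable, the orchestrator can cancel it on interruption and later replay the saved beat from its beginning, which matches the simpler resume strategy suggested for the first implementation.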