| # AI Prof Architecture | |
| This document captures the intended production and demo architecture for AI Prof. | |
| The core experience is a professor-like agent that understands an entire lecture, | |
| controls the visible slide and whiteboard, speaks naturally, and responds to student | |
| interruptions without losing its place. | |
| ## Deployment | |
| ```text | |
| Hugging Face Space | |
| Gradio UI | |
| session state | |
| professor orchestrator | |
| slide and whiteboard rendering | |
| | | |
| v | |
| Modal inference services | |
| MiniCPM-V slide vision | |
| Nemotron professor agent | |
| VoxCPM text-to-speech | |
| Whisper or Moonshine speech-to-text | |
| | | |
| v | |
| Hugging Face storage | |
| published demo deck | |
| processed deck manifests | |
| slide images and readings | |
| ``` | |
| - **Hugging Face Space:** public Gradio application and lightweight orchestration. | |
| - **Modal:** GPU-backed, self-hosted model inference that can scale down when idle. | |
| - **Local development:** the same Gradio app in mock mode or pointed at local model | |
| servers. | |
| - **Hugging Face storage:** persistent processed lectures, including one polished | |
| public demo deck that works immediately for judges and visitors. | |
| ## Deck Preparation | |
| The professor should not begin teaching until every slide is processed, not just | |
| slide 1. Navigation and question answering depend on knowing what exists across | |
| the whole lecture. | |
| On upload: | |
| 1. Hash the PDF to identify an existing processed deck. | |
| 2. If cached, load its manifest and slide assets. | |
| 3. Otherwise, render every page and extract the PDF text layer. | |
| 4. Run MiniCPM-V over every slide. | |
| 5. Build a compact deck index from all slide readings. | |
| 6. Persist the processed result when appropriate. | |
| 7. Enable the lecture only when the complete index is ready. | |
| Gradio displays preparation progress while this runs. It does not need to hold the | |
| heavy processing itself; it can call Modal jobs and stream their progress. | |
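Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, assuming the deck lives on a local or mounted filesystem under the `decks/<pdf_sha256>/` layout. The function names and the `storage_root` parameter are hypothetical and not part of the project. A real deployment would go through whichever Hugging Face storage client the app uses.

```python
import hashlib
from pathlib import Path


def pdf_sha256(pdf_bytes: bytes) -> str:
    """Return the hex SHA-256 digest used as the deck identifier."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def deck_dir(storage_root: Path, pdf_bytes: bytes) -> Path:
    """Map a PDF to its decks/<pdf_sha256>/ directory (storage_root is an assumption)."""
    return storage_root / "decks" / pdf_sha256(pdf_bytes)


def is_cached(storage_root: Path, pdf_bytes: bytes) -> bool:
    """Treat a deck as cached only once its manifest.json has been written."""
    return (deck_dir(storage_root, pdf_bytes) / "manifest.json").is_file()
```

Checking for `manifest.json` rather than the directory itself means a partially written deck is not mistaken for a finished one, provided the manifest is written last.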
| ### Processed Deck Format | |
| ```text | |
| decks/<pdf_sha256>/ | |
| manifest.json | |
| source.pdf | |
| slides/ | |
| 001.png | |
| 002.png | |
| readings/ | |
| 001.json | |
| 002.json | |
| ``` | |
| Each index entry should remain compact enough to keep the entire deck map in the | |
| agent context: | |
| ```json | |
| { | |
| "slide": 7, | |
| "title": "Convolution", | |
| "summary": "Applying a kernel across an image", | |
| "concepts": ["kernel", "stride", "weighted sum"], | |
| "equations": ["g(x,y) = sum_i sum_j h(i,j)f(x-i,y-j)"], | |
| "visuals": ["A 3 by 3 kernel moving across a pixel grid"] | |
| } | |
| ``` | |
| The agent receives the complete compact index, but the full reading only for the | |
| current slide and any specifically retrieved slides. | |
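One way to derive that view is sketched below. It assumes each full slide reading is a dict that contains the compact fields shown above plus heavier fields. `full_text` in the usage example is a made-up field name, and `compact_entry` and `build_agent_view` are hypothetical helpers.

```python
from typing import Any, Optional

# Fields taken from the compact index entry shown above.
COMPACT_KEYS = ("slide", "title", "summary", "concepts", "equations", "visuals")


def compact_entry(reading: dict[str, Any]) -> dict[str, Any]:
    """Keep only the compact fields of one full slide reading."""
    return {key: reading[key] for key in COMPACT_KEYS if key in reading}


def build_agent_view(readings: list[dict[str, Any]], current_slide: int) -> dict[str, Any]:
    """Return the whole compact index plus the full reading of one slide."""
    current: Optional[dict[str, Any]] = next(
        (r for r in readings if r.get("slide") == current_slide), None
    )
    return {
        "deck_index": [compact_entry(r) for r in readings],
        "current_reading": current,  # None if the slide number is unknown
    }
```

For example, `build_agent_view(readings, 7)` would carry every slide's title and summary but only slide 7's full reading.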
| ## Professor Agent | |
| The model should make decisions at meaningful teaching boundaries, not once per | |
| sentence or drawing stroke. One agent turn produces a short **teaching beat**: | |
| ```json | |
| { | |
| "narration": "Imagine this grid is a small patch of the image.", | |
| "actions": [ | |
| {"tool": "draw_grid", "args": {"rows": 3, "cols": 3}, "at": 0.4} | |
| ], | |
| "next": "continue" | |
| } | |
| ``` | |
| The orchestrator executes the actions and speech. After the beat completes, it asks | |
| the agent what to do next. | |
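Before executing a beat, the orchestrator will likely want to validate the model output. The sketch below parses the JSON shape shown above into dataclasses. The class names, the defaults for missing fields, and the choice to sort actions by `at` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    tool: str
    args: dict[str, Any] = field(default_factory=dict)
    at: float = 0.0  # offset relative to the narration


@dataclass
class TeachingBeat:
    narration: str
    actions: list[Action] = field(default_factory=list)
    next: str = "continue"


def parse_beat(raw: str) -> TeachingBeat:
    """Parse one agent turn; raise ValueError if the narration is missing."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("narration"), str):
        raise ValueError("a teaching beat needs a string 'narration'")
    actions = [
        Action(tool=a["tool"], args=a.get("args", {}), at=float(a.get("at", 0.0)))
        for a in data.get("actions", [])
    ]
    # Sorting by offset lets the executor walk the actions in playback order.
    actions.sort(key=lambda a: a.at)
    return TeachingBeat(data["narration"], actions, data.get("next", "continue"))
```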
| ### Agent Context | |
| Each decision receives: | |
| - Complete compact deck index | |
| - Current slide number and full cached slide reading | |
| - Current whiteboard state | |
| - Recent conversation and teaching beats | |
| - Saved lecture position | |
| - Trigger: continue lecture or student question | |
| ### Tools | |
| - `goto_slide(index)` - move to the best supporting slide | |
| - `next_slide()` and `prev_slide()` - ordinary navigation | |
| - `look_closer(question)` - ask MiniCPM-V to inspect the current slide for a | |
| specific visual detail; wire this in once the core loop works | |
| - `write_latex(expression, position)` - place a typeset equation | |
| - `draw_diagram(spec)` - render structured Excalidraw-style primitives | |
| - `clear_whiteboard()` - reset the board when the visual context changes | |
| - `highlight_region(bbox)` - optional later enhancement | |
| Tool calls and their results should be logged as a publishable teaching-session | |
| trace. | |
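A small dispatcher can provide both tool execution and the trace. The sketch below is one possible shape, not the project's executor: `ToolExecutor`, its methods, and the trace entry fields are hypothetical. Tool results should be JSON-serializable if the trace is to be published.

```python
import time
from typing import Any, Callable


class ToolExecutor:
    """Dispatch tool calls by name and record every call in a trace."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.trace: list[dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, /, **args: Any) -> Any:
        entry: dict[str, Any] = {"tool": name, "args": args, "time": time.time()}
        try:
            if name not in self._tools:
                raise KeyError(f"unknown tool: {name}")
            entry["result"] = self._tools[name](**args)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)  # failed calls are part of the trace too
            raise
        finally:
            self.trace.append(entry)
```

Usage might look like `executor.register("goto_slide", goto_slide)` followed by `executor.call("goto_slide", index=7)`, after which `executor.trace` holds one entry for that call.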
| ## Student Interruption | |
| The lecture is controlled by an explicit state machine: | |
| ```text | |
| NARRATING | |
| -> student begins speaking | |
| INTERRUPTING | |
| -> stop TTS and cancel current generation | |
| LISTENING | |
| -> capture speech until push-to-talk release or VAD pause | |
| THINKING | |
| -> transcribe, search deck index, choose slide and visual support | |
| ANSWERING | |
| -> navigate or draw when useful, then speak the answer | |
| RESUMING | |
| -> continue from the saved teaching position | |
| ``` | |
| When a student asks a question, the agent first decides whether the current slide | |
| is sufficient. It may: | |
| - Answer on the current slide | |
| - Navigate to a more relevant earlier or later slide | |
| - Inspect a slide more closely | |
| - Draw an explanation on the whiteboard | |
| - Combine navigation and drawing | |
| The agent decides whether to return to the previous slide afterward. Automatically | |
| returning every time would make the lecture feel mechanical. | |
| The orchestrator saves the interrupted teaching beat and sentence position. For the | |
| first implementation, resuming at the beginning of that beat is acceptable and much | |
| simpler than resuming at an exact audio sample. | |
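The state machine and the beat-level resume can be sketched as below. The transition table follows the diagram above. The `RESUMING -> NARRATING` edge is an assumption implied by "continue from the saved teaching position", and a second interruption during `ANSWERING` is deliberately not modeled. `LectureMachine` and its methods are hypothetical names.

```python
from enum import Enum, auto
from typing import Optional


class LectureState(Enum):
    NARRATING = auto()
    INTERRUPTING = auto()
    LISTENING = auto()
    THINKING = auto()
    ANSWERING = auto()
    RESUMING = auto()


S = LectureState
ALLOWED = {
    S.NARRATING: {S.INTERRUPTING},
    S.INTERRUPTING: {S.LISTENING},
    S.LISTENING: {S.THINKING},
    S.THINKING: {S.ANSWERING},
    S.ANSWERING: {S.RESUMING},
    S.RESUMING: {S.NARRATING},  # assumed closing edge
}


class LectureMachine:
    def __init__(self) -> None:
        self.state = S.NARRATING
        self.saved_beat: Optional[int] = None  # index of the interrupted beat

    def transition(self, new_state: LectureState, current_beat: Optional[int] = None) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        if new_state is S.INTERRUPTING:
            # First implementation: remember the beat, not the audio sample.
            self.saved_beat = current_beat
        self.state = new_state

    def resume_beat(self) -> Optional[int]:
        """Beat to replay from its beginning after the answer."""
        return self.saved_beat
```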
| ## Speech | |
| - **TTS:** VoxCPM for professor narration. | |
| - **STT:** faster-whisper or Moonshine for short student questions. | |
| - **Transport and turn detection:** FastRTC, initially push-to-talk and later VAD | |
| barge-in. | |
| Narration should be synthesized in short sentence or beat-sized chunks. This keeps | |
| latency low and gives the orchestrator clean cancellation boundaries. | |
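A rough sketch of that chunking follows. It uses a naive regular-expression split, which will also break after abbreviations such as "e.g.". `max_chars`, `split_narration`, and `speak` are hypothetical, and `synthesize` stands in for whatever call actually reaches VoxCPM.

```python
import re
from typing import Callable

# Naive boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_narration(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def speak(chunks: list[str], synthesize: Callable[[str], None],
          cancelled: Callable[[], bool]) -> int:
    """Synthesize chunk by chunk, checking for cancellation at each boundary."""
    spoken = 0
    for chunk in chunks:
        if cancelled():
            break
        synthesize(chunk)
        spoken += 1
    return spoken
```

The return value of `speak` is the number of chunks that were started, which gives the orchestrator a sentence-level position to save on interruption.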
| ## Whiteboard | |
| Avoid unrestricted free-form drawing as the primary path. It requires too many model | |
| calls and is difficult to synchronize or reproduce. | |
| Use structured operations: | |
| - LaTeX for equations | |
| - Excalidraw-style primitives for boxes, arrows, labels, grids, and highlights | |
| - Optional Mermaid for diagrams where automatic layout is useful | |
| - Manim only for prepared showcase animations, not the live agent loop | |
| The model emits one structured drawing plan per teaching beat. The frontend animates | |
| the resulting primitives locally, so drawing does not require another inference call | |
| for every stroke. | |
| ### Speech and Drawing Synchronization | |
| Each action may include an approximate offset (`at`, in seconds from the start of the narration): | |
| ```json | |
| { | |
| "narration": "The center pixel is replaced using all nine neighbors.", | |
| "actions": [ | |
| {"tool": "highlight_cell", "args": {"row": 1, "column": 1}, "at": 0.2}, | |
| {"tool": "write_latex", "args": {"expression": "\\frac{1}{9} \\sum_{i,j} p_{i,j}"}, "at": 2.1} | |
| ] | |
| } | |
| ``` | |
| The orchestrator: | |
| 1. Starts TTS for the teaching beat. | |
| 2. Executes visual actions at their approximate offsets. | |
| 3. Waits for speech and drawing to finish. | |
| 4. Requests the next teaching beat. | |
| This produces the feeling of talking while drawing without making an agent call per | |
| stroke. | |
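The four steps can be sketched with `asyncio`. This is an illustration under assumptions: `speak` and `execute` are hypothetical coroutines supplied by the caller, standing in for the TTS and whiteboard layers, and `at` is treated as seconds from the start of the narration.

```python
import asyncio
from typing import Any, Awaitable, Callable

Speak = Callable[[str], Awaitable[None]]
Execute = Callable[[str, dict[str, Any]], Awaitable[None]]


async def run_beat(beat: dict[str, Any], speak: Speak, execute: Execute) -> None:
    """Play one teaching beat: narration plus visual actions at their offsets."""

    async def fire(action: dict[str, Any]) -> None:
        # Wait until the action's offset, then hand it to the frontend.
        await asyncio.sleep(max(0.0, float(action.get("at", 0.0))))
        await execute(action["tool"], action.get("args", {}))

    # Step 1: start TTS. Step 2: schedule every action alongside it.
    tasks = [asyncio.create_task(speak(beat["narration"]))]
    tasks += [asyncio.create_task(fire(a)) for a in beat.get("actions", [])]
    # Step 3: wait for both speech and drawing to finish.
    await asyncio.gather(*tasks)
    # Step 4 (requesting the next beat) is left to the caller.
```

Because the offsets are only approximate, the sketch does not try to correct for TTS start-up latency. A fuller version would probably measure from the moment audio actually begins.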
| ## Demo Path | |
| The public Space should offer two entry points: | |
| 1. **Try the prepared lecture:** loads an already processed, polished deck from | |
| Hugging Face storage immediately. | |
| 2. **Upload your own lecture:** preprocesses the complete deck, displays progress, | |
| then starts the professor. | |
| The prepared lecture is the reliable judging and demo-video path. User uploads prove | |
| the system generalizes without making the first experience depend on a long vision | |
| preprocessing wait. | |
| ## Implementation Order | |
| 1. Complete-deck manifest and compact index | |
| 2. Prepared demo deck loading from Hugging Face storage | |
| 3. Professor teaching-beat schema and tool executor | |
| 4. Slide navigation tools | |
| 5. Structured whiteboard tools and local animation | |
| 6. VoxCPM narration with cancellable chunks | |
| 7. Push-to-talk interruption and STT | |
| 8. Automatic slide retrieval during questions | |
| 9. VAD barge-in | |
| 10. Targeted `look_closer` vision calls | |