
pranavkarthik10

Deploy AI Prof hackathon submission

81e3ca2 verified 17 days ago


AI Prof Architecture

This document captures the intended production and demo architecture for AI Prof. The core experience is a professor-like agent that understands an entire lecture, controls the visible slide and whiteboard, speaks naturally, and responds to student interruptions without losing its place.

Deployment

Hugging Face Space
  Gradio UI
  session state
  professor orchestrator
  slide and whiteboard rendering
            |
            v
Modal inference services
  MiniCPM-V slide vision
  Nemotron professor agent
  VoxCPM text-to-speech
  Whisper or Moonshine speech-to-text
            |
            v
Hugging Face storage
  published demo deck
  processed deck manifests
  slide images and readings

Hugging Face Space: public Gradio application and lightweight orchestration.
Modal: GPU-backed, self-hosted model inference that can scale down when idle.
Local development: the same Gradio app in mock mode or pointed at local model servers.
Hugging Face storage: persistent processed lectures, including one polished public demo deck that works immediately for judges and visitors.

Deck Preparation

The professor should not begin teaching after only slide 1 is processed. Navigation and question answering depend on knowing what exists across the whole lecture.

On upload:

1. Hash the PDF to identify an existing processed deck.
2. If cached, load its manifest and slide assets.
3. Otherwise, render every page and extract the PDF text layer.
4. Run MiniCPM-V over every slide.
5. Build a compact deck index from all slide readings.
6. Persist the processed result when appropriate.
7. Enable the lecture only when the complete index is ready.

The Gradio app displays preparation progress while this runs. It does not need to run the heavy processing itself; it can call Modal jobs and stream their progress.
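
The hash-then-cache steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `load_or_process` and its `process` callback are hypothetical names, the callback stands in for the whole render, vision, and indexing pipeline, and a local directory stands in for Hugging Face storage.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def pdf_sha256(pdf_bytes: bytes) -> str:
    """Content hash that identifies a deck regardless of its filename."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def load_or_process(root: Path, pdf_bytes: bytes,
                    process: Callable[[bytes], dict]) -> tuple[dict, bool]:
    """Return (manifest, was_cached).

    `process` is a hypothetical stand-in for the full pipeline
    (render pages, extract text, run vision, build the index).
    """
    deck = root / "decks" / pdf_sha256(pdf_bytes)
    manifest_path = deck / "manifest.json"
    if manifest_path.is_file():
        return json.loads(manifest_path.read_text(encoding="utf-8")), True

    manifest = process(pdf_bytes)
    deck.mkdir(parents=True, exist_ok=True)
    (deck / "source.pdf").write_bytes(pdf_bytes)
    # Write the manifest last so a partial run is never mistaken for a finished deck.
    manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
    return manifest, False
```

Treating the presence of `manifest.json` as the "ready" marker is one way to honor the rule that the lecture is enabled only once the complete index exists.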

Processed Deck Format

decks/<pdf_sha256>/
  manifest.json
  source.pdf
  slides/
    001.png
    002.png
  readings/
    001.json
    002.json

Each index entry should remain compact enough to keep the entire deck map in the agent context:

{
  "slide": 7,
  "title": "Convolution",
  "summary": "Applying a kernel across an image",
  "concepts": ["kernel", "stride", "weighted sum"],
  "equations": ["g(x,y) = sum_i sum_j h(i,j)f(x-i,y-j)"],
  "visuals": ["A 3 by 3 kernel moving across a pixel grid"]
}

The agent always receives the complete compact index, but it receives full readings only for the current slide and any slides it has specifically retrieved.
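
One possible way to assemble that context is sketched below. The function name, the payload keys `deck_index` and `full_readings`, and the shape of a reading are all assumptions made for illustration; only the split between "whole index" and "a few full readings" comes from the design above.

```python
import json


def build_deck_context(index: list[dict], readings: dict[int, dict],
                       current_slide: int,
                       retrieved: tuple[int, ...] = ()) -> str:
    """Serialize the whole compact index plus full readings for a few slides."""
    wanted = sorted({current_slide, *retrieved})
    payload = {
        # Every slide appears here, in compact form.
        "deck_index": index,
        # Only the current and explicitly retrieved slides get a full reading.
        "full_readings": {n: readings[n] for n in wanted if n in readings},
    }
    return json.dumps(payload, ensure_ascii=False)
```

Keeping the full readings out of the default payload is what lets the deck map stay small enough to send on every decision.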

Professor Agent

The model should make decisions at meaningful teaching boundaries, not once per sentence or drawing stroke. One agent turn produces a short teaching beat:

{
  "narration": "Imagine this grid is a small patch of the image.",
  "actions": [
    {"tool": "draw_grid", "args": {"rows": 3, "cols": 3}, "at": 0.4}
  ],
  "next": "continue"
}

The orchestrator executes the actions and speech. After the beat completes, it asks the agent what to do next.
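
Because the orchestrator acts on whatever the model returns, it helps to validate each beat before executing it. The sketch below parses the JSON shape shown above into small dataclasses. The class names are hypothetical, and two details are assumptions rather than part of the schema: `at` is treated as seconds after narration starts, and a missing `next` defaults to `"continue"`.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Action:
    tool: str
    args: dict[str, Any] = field(default_factory=dict)
    at: float = 0.0  # assumed unit: seconds after narration starts


@dataclass
class TeachingBeat:
    narration: str
    actions: list[Action] = field(default_factory=list)
    next: str = "continue"  # assumed default when the field is omitted


def parse_beat(raw: str) -> TeachingBeat:
    """Parse one agent turn; raise ValueError on malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"beat is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("narration"), str):
        raise ValueError("beat needs a string 'narration' field")
    actions = []
    for item in data.get("actions", []):
        if not isinstance(item, dict) or not isinstance(item.get("tool"), str):
            raise ValueError("each action needs a string 'tool' field")
        actions.append(Action(item["tool"],
                              dict(item.get("args", {})),
                              float(item.get("at", 0.0))))
    return TeachingBeat(data["narration"], actions,
                        str(data.get("next", "continue")))
```

A rejected beat can then be retried or skipped without leaving the slide or whiteboard in a half-updated state.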

Agent Context

Each decision receives:

Complete compact deck index
Current slide number and full cached slide reading
Current whiteboard state
Recent conversation and teaching beats
Saved lecture position
Trigger: continue lecture or student question

Tools

goto_slide(index) - move to the best supporting slide
next_slide() and prev_slide() - ordinary navigation
look_closer(question) - ask MiniCPM-V to inspect the current slide for a specific visual detail; wire this in after the core loop works
write_latex(expression, position) - place a typeset equation
draw_diagram(spec) - render structured Excalidraw-style primitives
clear_whiteboard() - reset the board when the visual context changes
highlight_region(bbox) - optional later enhancement

Tool calls and their results should be logged as a publishable teaching-session trace.
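
A small dispatcher is one way to get both execution and the trace from a single code path. The sketch below is illustrative only: `ToolExecutor`, the trace entry fields, and the choice to record failures instead of raising are assumptions, not the project's API.

```python
from typing import Any, Callable


class ToolExecutor:
    """Dispatch tool calls by name and keep a replayable trace."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.trace: list[dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, args: dict[str, Any]) -> Any:
        entry: dict[str, Any] = {"tool": name, "args": args}
        try:
            if name not in self._tools:
                raise KeyError(f"unknown tool: {name}")
            entry["result"] = self._tools[name](**args)
            entry["ok"] = True
        except Exception as exc:
            # Record the failure so one bad call does not end the lecture.
            entry["result"] = repr(exc)
            entry["ok"] = False
        self.trace.append(entry)
        return entry["result"]
```

With a hypothetical `goto_slide` handler registered, `executor.call("goto_slide", {"index": 7})` runs the handler and appends one entry to `executor.trace`; a call to an unregistered tool is logged with `ok` set to `False`.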

Student Interruption

The lecture is controlled by an explicit state machine:

NARRATING
  -> student begins speaking
INTERRUPTING
  -> stop TTS and cancel current generation
LISTENING
  -> capture speech until push-to-talk release or a voice activity detection (VAD) pause
THINKING
  -> transcribe, search deck index, choose slide and visual support
ANSWERING
  -> navigate or draw when useful, then speak the answer
RESUMING
  -> continue from the saved teaching position

When a student asks a question, the agent first decides whether the current slide is sufficient. It may:

Answer on the current slide
Navigate to a more relevant earlier or later slide
Inspect a slide more closely
Draw an explanation on the whiteboard
Combine navigation and drawing

The agent decides whether to return to the previous slide afterward. Automatically returning every time would make the lecture feel mechanical.

The orchestrator saves the interrupted teaching beat and sentence position. For the first implementation, resuming at the beginning of that beat is acceptable and much simpler than resuming at an exact audio sample.
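
The state machine and the beat-level resume rule can be sketched as below. The state names come from the diagram; the class, the `NEXT_STATE` table, and the use of an integer beat index as the saved position are assumptions. The diagram stops at RESUMING, so the RESUMING to NARRATING edge is also an assumption.

```python
from enum import Enum, auto
from typing import Optional


class LectureState(Enum):
    NARRATING = auto()
    INTERRUPTING = auto()
    LISTENING = auto()
    THINKING = auto()
    ANSWERING = auto()
    RESUMING = auto()


# Each state may only move to the next one in the diagram.
# RESUMING -> NARRATING is an assumption; the diagram stops at RESUMING.
NEXT_STATE = {
    LectureState.NARRATING: LectureState.INTERRUPTING,
    LectureState.INTERRUPTING: LectureState.LISTENING,
    LectureState.LISTENING: LectureState.THINKING,
    LectureState.THINKING: LectureState.ANSWERING,
    LectureState.ANSWERING: LectureState.RESUMING,
    LectureState.RESUMING: LectureState.NARRATING,
}


class LectureStateMachine:
    def __init__(self) -> None:
        self.state = LectureState.NARRATING
        self.saved_beat: Optional[int] = None

    def advance(self, target: LectureState) -> None:
        """Move to `target`, rejecting any jump the diagram does not allow."""
        if NEXT_STATE[self.state] is not target:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}")
        self.state = target

    def interrupt(self, current_beat: int) -> None:
        """Student begins speaking: leave NARRATING and save the beat."""
        self.advance(LectureState.INTERRUPTING)
        self.saved_beat = current_beat

    def resume(self) -> int:
        """Return to NARRATING and report which beat to restart from."""
        if self.saved_beat is None:
            raise ValueError("no saved beat to resume from")
        self.advance(LectureState.NARRATING)
        beat, self.saved_beat = self.saved_beat, None
        return beat
```

Making illegal transitions raise keeps bugs such as answering while TTS is still playing visible during development instead of surfacing as odd lecture behavior.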

Speech

TTS: VoxCPM for professor narration.
STT: faster-whisper or Moonshine for short student questions.
Transport and turn detection: FastRTC, initially push-to-talk and later VAD barge-in.

Narration should be synthesized in short sentence or beat-sized chunks. This keeps latency low and gives the orchestrator clean cancellation boundaries.
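
A simple splitter is enough to produce such chunks before they are handed to TTS. The sketch below uses a regular expression on sentence-ending punctuation; the function name and the `max_chars` budget are assumptions, and it does not handle abbreviations or decimals, which a production splitter would need to.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_narration(text: str, max_chars: int = 200) -> list[str]:
    """Split narration at sentence ends, packing short sentences together.

    A single sentence longer than `max_chars` is kept whole.
    """
    chunks: list[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk is a natural cancellation boundary: on interruption the orchestrator can drop the chunks that have not started playing.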

Whiteboard

Avoid unrestricted free-form drawing as the primary path. It requires too many model calls and is difficult to synchronize or reproduce.

Use structured operations:

LaTeX for equations
Excalidraw-style primitives for boxes, arrows, labels, grids, and highlights
Optional Mermaid for diagrams where automatic layout is useful
Manim only for prepared showcase animations, not the live agent loop

The model emits one structured drawing plan per teaching beat. The frontend animates the resulting primitives locally, so drawing does not require another inference call for every stroke.
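
As an illustration of how one structured action can expand into many primitives without further inference, the sketch below turns a grid request into rectangle primitives. The primitive fields (`type`, `id`, `x`, `y`, `width`, `height`) are a hypothetical schema loosely modeled on Excalidraw-style shapes, and the default cell size is arbitrary.

```python
def grid_primitives(rows: int, cols: int, cell: float = 40.0,
                    origin: tuple[float, float] = (0.0, 0.0)) -> list[dict]:
    """Expand one grid action into rectangle primitives, row by row."""
    x0, y0 = origin
    return [
        {
            "type": "rectangle",
            "id": f"cell-{r}-{c}",  # stable ids let later actions target a cell
            "x": x0 + c * cell,
            "y": y0 + r * cell,
            "width": cell,
            "height": cell,
        }
        for r in range(rows)
        for c in range(cols)
    ]
```

The frontend can then animate the returned list in order, which keeps the drawing reproducible from the logged action alone.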

Speech and Drawing Synchronization

Each action may include an approximate offset relative to the narration:

{
  "narration": "The center pixel is replaced using all nine neighbors.",
  "actions": [
    {"tool": "highlight_cell", "args": {"row": 1, "column": 1}, "at": 0.2},
    {"tool": "write_latex", "args": {"expression": "1/9 sum pixels"}, "at": 2.1}
  ]
}

The orchestrator:

1. Starts TTS for the teaching beat.
2. Executes visual actions at their approximate offsets.
3. Waits for speech and drawing to finish.
4. Requests the next teaching beat.

This produces the feeling of talking while drawing without making an agent call per stroke.
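
The orchestrator loop for a single beat can be sketched with `asyncio`. In this illustration `speak` and `run_action` are hypothetical coroutines supplied by the caller, `at` is assumed to be seconds from the start of narration, and `time_scale` exists only so tests can run without real delays. Offsets are approximate because time spent inside `run_action` is not subtracted.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def play_beat(
    beat: dict[str, Any],
    speak: Callable[[str], Awaitable[None]],
    run_action: Callable[[str, dict[str, Any]], Awaitable[None]],
    time_scale: float = 1.0,
) -> None:
    """Speak the narration while firing visual actions at their offsets."""
    # Start narration in the background so drawing can overlap with it.
    speech = asyncio.create_task(speak(beat["narration"]))
    elapsed = 0.0
    actions = sorted(beat.get("actions", []), key=lambda a: a.get("at", 0.0))
    for action in actions:
        offset = float(action.get("at", 0.0))
        # Sleep only for the gap since the previous action's offset.
        await asyncio.sleep(max(0.0, offset - elapsed) * time_scale)
        elapsed = max(elapsed, offset)
        await run_action(action["tool"], action.get("args", {}))
    # Wait for narration to finish before the next beat is requested.
    await speech
```

Only one agent call is involved per beat; everything inside `play_beat` is local scheduling.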

Demo Path

The public Space should offer two entry points:

Try the prepared lecture: loads an already processed, polished deck from Hugging Face storage immediately.
Upload your own lecture: preprocesses the complete deck, displays progress, then starts the professor.

The prepared lecture is the reliable judging and demo-video path. User uploads prove the system generalizes without making the first experience depend on a long vision preprocessing wait.

Implementation Order

1. Complete-deck manifest and compact index
2. Prepared demo deck loading from Hugging Face storage
3. Professor teaching-beat schema and tool executor
4. Slide navigation tools
5. Structured whiteboard tools and local animation
6. VoxCPM narration with cancellable chunks
7. Push-to-talk interruption and STT
8. Automatic slide retrieval during questions
9. VAD barge-in
10. Targeted look_closer vision calls