pranavkarthik10

Deploy AI Prof hackathon submission

81e3ca2 verified 17 days ago

	# AI Prof Architecture

	This document captures the intended production and demo architecture for AI Prof.
	The core experience is a professor-like agent that understands an entire lecture,
	controls the visible slide and whiteboard, speaks naturally, and responds to student
	interruptions without losing its place.

	## Deployment

	```text
	Hugging Face Space
	  Gradio UI
	  session state
	  professor orchestrator
	  slide and whiteboard rendering
	        |
	        v
	Modal inference services
	  MiniCPM-V slide vision
	  Nemotron professor agent
	  VoxCPM text-to-speech
	  Whisper or Moonshine speech-to-text
	        |
	        v
	Hugging Face storage
	  published demo deck
	  processed deck manifests
	  slide images and readings
	```

	- Hugging Face Space: public Gradio application and lightweight orchestration.
	- Modal: GPU-backed, self-hosted model inference that can scale down when idle.
	- Local development: the same Gradio app in mock mode or pointed at local model
	servers.
	- Hugging Face storage: persistent processed lectures, including one polished
	public demo deck that works immediately for judges and visitors.

	## Deck Preparation

	The professor should not begin teaching after only slide 1 is processed. Navigation
	and question answering depend on knowing what exists across the whole lecture.

	On upload:

	1. Hash the PDF to identify an existing processed deck.
	2. If cached, load its manifest and slide assets.
	3. Otherwise, render every page and extract the PDF text layer.
	4. Run MiniCPM-V over every slide.
	5. Build a compact deck index from all slide readings.
	6. Persist the processed result when appropriate.
	7. Enable the lecture only when the complete index is ready.

	Gradio displays preparation progress while this runs. It does not need to hold the
	heavy processing itself; it can call Modal jobs and stream their progress.
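
The cache check in steps 1, 2, and 6 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes a local directory stands in for Hugging Face storage, and `process` is a hypothetical callable covering rendering, MiniCPM-V reading, and indexing. The `decks/<hash>/` layout follows the processed deck format described in this document.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Optional


def deck_id(pdf_bytes: bytes) -> str:
    """Content hash of the PDF, used as the deck's directory name."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def load_cached_manifest(root: Path, pdf_bytes: bytes) -> Optional[dict]:
    """Return the stored manifest for this PDF, or None if it was never processed."""
    manifest_path = root / "decks" / deck_id(pdf_bytes) / "manifest.json"
    if not manifest_path.exists():
        return None
    return json.loads(manifest_path.read_text(encoding="utf-8"))


def prepare_deck(root: Path, pdf_bytes: bytes,
                 process: Callable[[bytes], dict]) -> dict:
    """Reuse a cached deck when possible; otherwise process and persist it.

    `process` is a hypothetical stand-in for the heavy pipeline
    (page rendering, slide vision, index building).
    """
    cached = load_cached_manifest(root, pdf_bytes)
    if cached is not None:
        return cached
    manifest = process(pdf_bytes)
    deck_dir = root / "decks" / deck_id(pdf_bytes)
    deck_dir.mkdir(parents=True, exist_ok=True)
    (deck_dir / "source.pdf").write_bytes(pdf_bytes)
    (deck_dir / "manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    return manifest
```

Hashing the file content rather than its name means a re-uploaded copy of the same lecture skips the vision pass entirely.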

	### Processed Deck Format

	```text
	decks/<pdf_sha256>/
	  manifest.json
	  source.pdf
	  slides/
	    001.png
	    002.png
	  readings/
	    001.json
	    002.json
	```

	Each index entry should remain compact enough to keep the entire deck map in the
	agent context:

	```json
	{
	  "slide": 7,
	  "title": "Convolution",
	  "summary": "Applying a kernel across an image",
	  "concepts": ["kernel", "stride", "weighted sum"],
	  "equations": ["g(x,y) = sum_i sum_j h(i,j)f(x-i,y-j)"],
	  "visuals": ["A 3 by 3 kernel moving across a pixel grid"]
	}
	```

	The agent receives the complete compact index, but only the full reading for the
	current or specifically retrieved slides.
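
One way to assemble that context is sketched below. The helper names and the output keys (`deck_index`, `current_reading`) are assumptions made for illustration; only the six index fields come from the example entry above.

```python
import json

# Fields kept in a compact index entry, taken from the example entry.
INDEX_FIELDS = ("slide", "title", "summary", "concepts", "equations", "visuals")


def compact_entry(reading: dict) -> dict:
    """Drop everything except the compact index fields."""
    return {key: reading[key] for key in INDEX_FIELDS if key in reading}


def build_deck_context(readings: list[dict], current_slide: int) -> str:
    """Serialize the whole compact index plus the full reading of one slide."""
    index = [compact_entry(r) for r in readings]
    current = next((r for r in readings if r.get("slide") == current_slide), None)
    return json.dumps(
        {"deck_index": index, "current_reading": current},
        ensure_ascii=False,
    )
```

Any large per-slide field, such as a long transcription of the slide text, then appears only once in the prompt, for the slide currently on screen.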

	## Professor Agent

	The model should make decisions at meaningful teaching boundaries, not once per
	sentence or drawing stroke. One agent turn produces a short teaching beat:

	```json
	{
	  "narration": "Imagine this grid is a small patch of the image.",
	  "actions": [
	    {"tool": "draw_grid", "args": {"rows": 3, "cols": 3}, "at": 0.4}
	  ],
	  "next": "continue"
	}
	```

	The orchestrator executes the actions and speech. After the beat completes, it asks
	the agent what to do next.
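
Because the beat arrives as model output, the orchestrator will likely want to validate it before acting. The sketch below parses the JSON shape shown above into dataclasses. The class names are hypothetical, and treating `at` as seconds from the start of narration is an assumption; the document only calls it an approximate offset.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Action:
    tool: str
    args: dict = field(default_factory=dict)
    at: float = 0.0  # assumed unit: seconds after narration starts


@dataclass
class TeachingBeat:
    narration: str
    actions: list
    next: str = "continue"


def parse_beat(raw: str) -> TeachingBeat:
    """Parse one agent turn, rejecting output that lacks narration text."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("narration"), str):
        raise ValueError("a teaching beat needs a 'narration' string")
    actions = [
        Action(tool=a["tool"], args=a.get("args", {}), at=float(a.get("at", 0.0)))
        for a in data.get("actions", [])
    ]
    actions.sort(key=lambda a: a.at)  # execute in timeline order
    return TeachingBeat(data["narration"], actions, data.get("next", "continue"))
```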

	### Agent Context

	Each decision receives:

	- Complete compact deck index
	- Current slide number and full cached slide reading
	- Current whiteboard state
	- Recent conversation and teaching beats
	- Saved lecture position
	- Trigger: continue lecture or student question

	### Tools

	- `goto_slide(index)` - move to the best supporting slide
	- `next_slide()` and `prev_slide()` - ordinary navigation
	- `look_closer(question)` - ask MiniCPM-V to inspect the current slide for a
	specific visual detail; implement this after the core loop works
	- `write_latex(expression, position)` - place a typeset equation
	- `draw_diagram(spec)` - render structured Excalidraw-style primitives
	- `clear_whiteboard()` - reset the board when the visual context changes
	- `highlight_region(bbox)` - optional later enhancement

	Tool calls and their results should be logged as a publishable teaching-session
	trace.
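
A small dispatcher can cover both execution and logging. The following is a sketch under assumptions: `ToolExecutor` and its methods are hypothetical names, and the trace is kept in memory as JSON-serializable dictionaries rather than in any particular storage format.

```python
import json
import time
from typing import Any, Callable


class ToolExecutor:
    """Dispatch tool calls by name and record each call and its result."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self.trace: list[dict] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, args: dict) -> dict:
        """Run one tool; failures are logged instead of crashing the lecture."""
        entry: dict = {"tool": name, "args": args, "time": time.time()}
        try:
            if name not in self._tools:
                raise KeyError(f"unknown tool: {name}")
            entry["result"] = self._tools[name](**args)
            entry["ok"] = True
        except Exception as exc:
            entry["ok"] = False
            entry["error"] = repr(exc)
        self.trace.append(entry)
        return entry

    def dump_trace(self) -> str:
        """Serialize the session trace as JSON Lines."""
        return "\n".join(json.dumps(e, default=str) for e in self.trace)
```

Recording failed and unknown calls in the same trace keeps the published session honest about what the agent attempted.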

	## Student Interruption

	The lecture is controlled by an explicit state machine:

	```text
	NARRATING
	  -> student begins speaking
	INTERRUPTING
	  -> stop TTS and cancel current generation
	LISTENING
	  -> capture speech until push-to-talk release or VAD pause
	THINKING
	  -> transcribe, search deck index, choose slide and visual support
	ANSWERING
	  -> navigate or draw when useful, then speak the answer
	RESUMING
	  -> continue from the saved teaching position
	```

	When a student asks a question, the agent first decides whether the current slide
	is sufficient. It may:

	- Answer on the current slide
	- Navigate to a more relevant earlier or later slide
	- Inspect a slide more closely
	- Draw an explanation on the whiteboard
	- Combine navigation and drawing

	The agent decides whether to return to the previous slide afterward. Automatically
	returning every time would make the lecture feel mechanical.

	The orchestrator saves the interrupted teaching beat and sentence position. For the
	first implementation, resuming at the beginning of that beat is acceptable and much
	simpler than resuming at an exact audio sample.
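
The states above and the beat-level resume could be encoded along these lines. This is a sketch: the linear transition table, including the `RESUMING -> NARRATING` edge, is an assumption read off the listed sequence, and `Lecture` is a hypothetical class name.

```python
from enum import Enum, auto


class LectureState(Enum):
    NARRATING = auto()
    INTERRUPTING = auto()
    LISTENING = auto()
    THINKING = auto()
    ANSWERING = auto()
    RESUMING = auto()


# Allowed transitions, assumed to follow the listed order and loop back.
TRANSITIONS = {
    LectureState.NARRATING: {LectureState.INTERRUPTING},
    LectureState.INTERRUPTING: {LectureState.LISTENING},
    LectureState.LISTENING: {LectureState.THINKING},
    LectureState.THINKING: {LectureState.ANSWERING},
    LectureState.ANSWERING: {LectureState.RESUMING},
    LectureState.RESUMING: {LectureState.NARRATING},
}


class Lecture:
    def __init__(self) -> None:
        self.state = LectureState.NARRATING
        self.beat_index = 0      # beat currently being taught
        self.saved_beat = None   # beat that was interrupted, if any

    def advance(self, new_state: LectureState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        if new_state is LectureState.INTERRUPTING:
            self.saved_beat = self.beat_index  # remember where teaching stopped
        if new_state is LectureState.NARRATING and self.saved_beat is not None:
            self.beat_index = self.saved_beat  # restart the interrupted beat
            self.saved_beat = None
        self.state = new_state
```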

	## Speech

	- TTS: VoxCPM for professor narration.
	- STT: faster-whisper or Moonshine for short student questions.
	- Transport and turn detection: FastRTC, initially push-to-talk and later VAD
	barge-in.

	Narration should be synthesized in short sentence or beat-sized chunks. This keeps
	latency low and gives the orchestrator clean cancellation boundaries.
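
Chunked synthesis with a cancellation check between chunks might look like the sketch below. The `synthesize` callable is a hypothetical placeholder for the TTS call (no VoxCPM API is assumed here), and the regular-expression splitter is a deliberately naive stand-in for real sentence segmentation.

```python
import re
import threading
from typing import Callable, Iterator


def split_sentences(text: str) -> list[str]:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def speak_chunks(text: str,
                 synthesize: Callable[[str], bytes],
                 cancel: threading.Event) -> Iterator[bytes]:
    """Yield audio one sentence at a time, stopping once `cancel` is set."""
    for sentence in split_sentences(text):
        if cancel.is_set():
            return  # a student interrupted; drop the remaining sentences
        yield synthesize(sentence)
```

With this shape, an interruption costs at most the sentence already being synthesized, and the first audio is available after one sentence rather than a whole beat.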

	## Whiteboard

	Avoid unrestricted free-form drawing as the primary path. It requires too many model
	calls and is difficult to synchronize or reproduce.

	Use structured operations:

	- LaTeX for equations
	- Excalidraw-style primitives for boxes, arrows, labels, grids, and highlights
	- Optional Mermaid for diagrams where automatic layout is useful
	- Manim only for prepared showcase animations, not the live agent loop

	The model emits one structured drawing plan per teaching beat. The frontend animates
	the resulting primitives locally, so drawing does not require another inference call
	for every stroke.
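
To illustrate how one structured action can become many strokes without further model calls, the sketch below expands a grid request into line primitives. The primitive dictionary shape, the `cell` size, and the function name are assumptions for illustration; the real frontend would use its own Excalidraw-style format.

```python
def grid_primitives(rows: int, cols: int, cell: float = 40.0,
                    origin: tuple = (0.0, 0.0)) -> list:
    """Expand one grid request into the line segments that draw it."""
    x0, y0 = origin
    width, height = cols * cell, rows * cell
    lines = []
    # Horizontal rules, top to bottom.
    for r in range(rows + 1):
        y = y0 + r * cell
        lines.append({"type": "line", "from": (x0, y), "to": (x0 + width, y)})
    # Vertical rules, left to right.
    for c in range(cols + 1):
        x = x0 + c * cell
        lines.append({"type": "line", "from": (x, y0), "to": (x, y0 + height)})
    return lines
```

A 3 by 3 grid expands to eight segments, which the frontend can animate one after another from a single agent action.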

	### Speech and Drawing Synchronization

	Each action may include an approximate offset relative to the narration:

	```json
	{
	  "narration": "The center pixel is replaced using all nine pixels in the window.",
	  "actions": [
	    {"tool": "highlight_cell", "args": {"row": 1, "column": 1}, "at": 0.2},
	    {"tool": "write_latex", "args": {"expression": "\\frac{1}{9} \\sum_{i,j} p_{i,j}"}, "at": 2.1}
	  ]
	}
	```

	The orchestrator:

	1. Starts TTS for the teaching beat.
	2. Executes visual actions at their approximate offsets.
	3. Waits for speech and drawing to finish.
	4. Requests the next teaching beat.

	This produces the feeling of talking while drawing without making an agent call per
	stroke.
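
The four orchestrator steps map naturally onto concurrent tasks. The sketch below uses `asyncio` and assumes offsets are in seconds. `speak` and `execute` are hypothetical callables standing in for the TTS playback and the tool executor.

```python
import asyncio
from typing import Awaitable, Callable


async def run_beat(narration: str,
                   actions: list,
                   speak: Callable[[str], Awaitable[None]],
                   execute: Callable[[str, dict], None]) -> None:
    """Play one teaching beat: speech and timed visual actions in parallel."""

    async def fire(action: dict) -> None:
        # Wait until the action's offset, then run it (assumed unit: seconds).
        await asyncio.sleep(max(0.0, float(action.get("at", 0.0))))
        execute(action["tool"], action.get("args", {}))

    speech = asyncio.create_task(speak(narration))              # step 1
    visuals = [asyncio.create_task(fire(a)) for a in actions]   # step 2
    await asyncio.gather(speech, *visuals)                      # step 3
    # Step 4, requesting the next beat, is left to the caller.
```

Because the offsets are only approximate, a fixed delay from the start of speech is used here instead of alignment to word timestamps.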

	## Demo Path

	The public Space should offer two entry points:

	1. Try the prepared lecture: loads an already processed, polished deck from
	Hugging Face storage immediately.
	2. Upload your own lecture: preprocesses the complete deck, displays progress,
	then starts the professor.

	The prepared lecture is the reliable judging and demo-video path. User uploads prove
	the system generalizes without making the first experience depend on a long vision
	preprocessing wait.

	## Implementation Order

	1. Complete-deck manifest and compact index
	2. Prepared demo deck loading from Hugging Face storage
	3. Professor teaching-beat schema and tool executor
	4. Slide navigation tools
	5. Structured whiteboard tools and local animation
	6. VoxCPM narration with cancellable chunks
	7. Push-to-talk interruption and STT
	8. Automatic slide retrieval during questions
	9. VAD barge-in
	10. Targeted `look_closer` vision calls