# Methodology

## System Architecture

The Deep-Dive Video Note Taker follows a multi-stage AI pipeline:

```
Video Input → Audio Extraction → ASR Transcription → Text Chunking → LLM Summarization → RAG Indexing → Timestamp Mapping → Action Item Extraction → Note Generation → Web UI
```

## Stage Details

### 1. Audio Extraction

- **Tool**: FFmpeg (primary), MoviePy (fallback)
- **Output**: 16kHz mono WAV optimised for Whisper ASR
- **Handles**: MP4, AVI, MOV, MKV, WebM, MP3, WAV

### 2. ASR Transcription (Whisper)

- **Model**: OpenAI Whisper (tiny/base/small/medium/large)
- **Output**: Word-level and segment-level timestamps
- **Language**: Auto-detected, 99+ languages supported

### 3. Text Chunking

- **Strategy**: Sliding window with configurable overlap
- **Chunk Size**: 1000 words (default), 200-word overlap
- **Preserves**: Start/end timestamps per chunk

### 4. LLM Summarization

- **Primary**: OpenAI GPT-3.5-Turbo / GPT-4
- **Fallback**: HuggingFace BART (facebook/bart-large-cnn)
- **Prompts**: Structured for bullet-point and topic-based output

### 5. RAG Pipeline (FAISS)

- **Embeddings**: SentenceTransformers (all-MiniLM-L6-v2)
- **Index**: FAISS IndexFlatIP (cosine similarity on normalised vectors)
- **Purpose**: Context retrieval + semantic search

### 6. Timestamp Mapping

- **Method**: Aligns each chunk summary with its source timestamps
- **Output**: Chapter markers, key highlights, navigable segments

### 7. Action Item Extraction

- **Primary**: LLM-based (structured JSON output)
- **Fallback**: Regex heuristic patterns
- **Categories**: Actions, Decisions, Follow-ups, Reminders

### 8. Note Generation

- **Output Formats**: Markdown (.md) + JSON (.json)
- **Structure**: Summary → Highlights → Action Items → Chapters → Transcript

## Performance Characteristics

| Metric                 | Value               |
|------------------------|---------------------|
| Summarization Accuracy | ~85–90%             |
| ASR Word Error Rate    | ~3–8% (clean audio) |
| Time Reduction         | ~60–70%             |
| Max Video Length       | Unlimited (chunked) |
| Supported Languages    | 99+                 |
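
The sliding-window chunking described under Text Chunking (1000-word windows, 200-word overlap, start/end timestamps kept per chunk) could look roughly like the sketch below. The `Word` and `Chunk` types and the `chunk_words` function are hypothetical names chosen for illustration; the project's actual data structures are not shown in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One transcribed word with its timestamps (hypothetical shape)."""
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Chunk:
    """A window of words plus the time span it covers."""
    text: str
    start: float
    end: float


def chunk_words(words: List[Word], chunk_size: int = 1000, overlap: int = 200) -> List[Chunk]:
    """Split words into overlapping windows, keeping each window's time span."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[Chunk] = []
    for i in range(0, len(words), step):
        window = words[i:i + chunk_size]
        chunks.append(
            Chunk(
                text=" ".join(w.text for w in window),
                start=window[0].start,
                end=window[-1].end,
            )
        )
        # Stop once a window has reached the end of the transcript,
        # so no trailing chunk consists only of already-covered words.
        if i + chunk_size >= len(words):
            break
    return chunks
```

With the defaults the window advances 800 words at a time, so consecutive chunks share 200 words of context and every chunk can be mapped back to a time range in the video.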
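
The RAG stage relies on the fact that the inner product of two unit-length vectors equals their cosine similarity, which is why an inner-product index over normalised embeddings behaves as a cosine-similarity search. The sketch below illustrates that idea in plain Python with toy 3-dimensional vectors. It is a stand-in for the exhaustive comparison, not FAISS code, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(index_vectors: List[List[float]], query: Sequence[float], k: int = 3) -> List[Tuple[float, int]]:
    """Return the top-k (score, position) pairs by inner product.

    index_vectors are assumed to be normalised already; the query is
    normalised here, so each score is a cosine similarity.
    """
    q = l2_normalise(query)
    scored = [(inner_product(v, q), i) for i, v in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings" standing in for real sentence vectors.
index = [l2_normalise(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0])]
top_hits = search(index, [2.0, 2.0, 0.0], k=2)
```

Because the query points in the same direction as the third stored vector, that entry scores highest regardless of the query's magnitude.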
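
For the regex fallback in Action Item Extraction, a heuristic pass over timestamped transcript segments might be shaped like the following sketch. The patterns, the `extract_items` function, and the output dictionary keys are all illustrative assumptions; the document does not list the project's real patterns or output schema.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical trigger phrases, one pattern per category named in the stage description.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "action": re.compile(r"\b(?:need to|needs to|should|must|action item)\b", re.IGNORECASE),
    "decision": re.compile(r"\b(?:we decided|decided to|we agreed|agreed to)\b", re.IGNORECASE),
    "follow_up": re.compile(r"\b(?:follow up|follow-up|circle back|get back to)\b", re.IGNORECASE),
    "reminder": re.compile(r"\b(?:remember to|don't forget|reminder)\b", re.IGNORECASE),
}

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_items(segments: Iterable[Tuple[float, str]]) -> List[dict]:
    """Tag sentences that match a trigger phrase.

    segments: (start_seconds, text) pairs. Each sentence receives at most
    one category: the first pattern, in dictionary order, that matches.
    """
    items: List[dict] = []
    for start, text in segments:
        for sentence in SENTENCE_SPLIT.split(text.strip()):
            for category, pattern in PATTERNS.items():
                if pattern.search(sentence):
                    items.append({"category": category, "text": sentence, "timestamp": start})
                    break
    return items
```

A heuristic like this trades recall and precision for independence from an LLM, which fits its role as a fallback rather than the primary extractor.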