# Video Features Roadmap — Technical Consultation v1

**Date:** 2026-03-22
**Specialists consulted:** ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer

---

## Feature Overview

| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---------|-----------|-----|------|-----------------|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 5-7 days | 8-12 days | LLM API key only |
| 3 | Auto-Cut & Head Tracking | Very Hard | 12-15 days | 30-45 days | Phase 1: nothing; Phase 2: GPU worker |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| **Total** | | | **26-34 days** | **44-65 days** | |

Realistic for one dev: **6-8 weeks** (all MVPs) or **3-4 months** (full versions).

---

## Feature 1: Advanced Remotion Templates

**Status:** Spec + implementation plan already written.

- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`

**Scope:** Extend `CaptionStyleSchema` with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), and 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".

**Changes:** Schema extensions in Remotion + backend, rendering logic in `Captions.tsx`, Alembic migration for presets, frontend StyleEditor form controls.

**No specialist input needed** — fully designed, no new infrastructure.

---

## Feature 2: Viral Moments Detection

### Architecture

**LLM API:** Gemini 2.5 Flash (best Russian language support, $0.15/$0.60 per 1M tokens) or GPT-4o-mini (same pricing, slightly weaker Russian). Cost per 30-min video analysis: ~$0.005.

**Audio augmentation:** `librosa` for RMS energy curves — refines clip boundaries to natural pauses, boosts scoring for high-energy segments. Adds a ~20MB dependency, processes 30-min audio in <10 seconds.

**Pipeline:**

1. Fetch transcription Document from DB
2. librosa computes energy envelope over full audio (100ms resolution)
3. LLM analyzes transcription text with structured JSON output prompt
4. Post-process: snap clip boundaries to low-energy points, compute energy scores
5. Save clips to new `clips` table

### Backend Design

**New module:** `clips` (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.

**Clip model:**

```
Clip {
  project_id: UUID (FK projects)
  source_file_id: UUID (FK files)
  job_id: UUID? (FK jobs)
  title: str
  start_ms: int
  end_ms: int
  score: float
  source_type: "viral_detected" | "user_created" | "auto_generated"
  status: "pending" | "approved" | "rejected" | "exported"
  meta: JSON? (LLM reasoning, tags, hashtags)
}
```

**New job type:** `VIRAL_DETECT` added to `JobTypeEnum`. The actor calls the LLM API directly via `httpx` from the Dramatiq worker (no separate service needed).
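A minimal sketch of that actor is below. It assumes an OpenAI-compatible chat-completions payload and reads the settings straight from environment variables; the payload shape, prompt-file naming, and response parsing are illustrative assumptions, not the final implementation. The setting names and the retry-on-429 behaviour follow the LLM integration notes that come next.

```python
# Illustrative sketch only. Payload shape assumes an OpenAI-compatible
# chat-completions endpoint; a real service would use its settings object
# instead of os.environ.
import json
import os
import time
from pathlib import Path

import dramatiq
import httpx


def call_llm(prompt: str, transcript_text: str, max_attempts: int = 5) -> dict:
    """POST the viral-detection prompt plus transcript to the LLM API,
    retrying with exponential backoff when the API answers 429."""
    payload = {
        "model": os.environ["LLM_MODEL_NAME"],
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": transcript_text},
        ],
        "response_format": {"type": "json_object"},  # ask for structured JSON output
    }
    headers = {"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"}
    with httpx.Client(timeout=60.0) as client:
        for attempt in range(max_attempts):
            response = client.post(os.environ["LLM_API_URL"], json=payload, headers=headers)
            if response.status_code == 429:      # rate limited: back off and retry
                time.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            content = response.json()["choices"][0]["message"]["content"]
            return json.loads(content)           # expected: clip titles, timestamps, scores
    raise RuntimeError("LLM API kept returning 429 after all retries")


@dramatiq.actor(queue_name="viral_detect", max_retries=3)
def viral_detect(project_id: str, transcript_text: str) -> None:
    """VIRAL_DETECT job body. Fetching the transcription Document and saving
    Clip rows are omitted; only the LLM round-trip is shown."""
    prompt_version = os.environ.get("LLM_VIRAL_PROMPT_VERSION", "v1")
    prompt_path = Path(f"cpv3/infrastructure/prompts/viral_detection_{prompt_version}.txt")
    result = call_llm(prompt_path.read_text(), transcript_text)
    # Next step (not shown): validate timestamps against segment boundaries,
    # then insert rows into the clips table.
    print(f"{len(result.get('clips', []))} candidate clips for project {project_id}")
```

Keeping the call inside the Dramatiq actor matches the "no separate service" decision; Dramatiq's own `max_retries` then covers transient failures other than rate limiting.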
**LLM integration:**

- Direct HTTP call from the actor with retry + exponential backoff on 429
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by the `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL_NAME`

### Frontend Design

- New `ViralClipsStep` in the project wizard (features/project/)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for the start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)

### Key Numbers

| Metric | Value |
|---|---|
| Accuracy (precision) | 50-70% |
| Accuracy (recall) | 60-80% |
| Processing time | 10-20 seconds |
| Cost per video | ~$0.005 |
| Cost at 1,000 videos/month | ~$5 |
| New dependencies | `google-generativeai` or `openai` (~10MB) + `librosa` (~20MB) |

### Risks

- **Prompt engineering quality** determines feature value — iterate based on user feedback
- **Visual-only moments** (facial expressions, physical comedy) cannot be detected from text — roughly 20-30% of viral moments are missed
- **Transcription quality matters** — Whisper `tiny` has ~25% WER on Russian; use at least `small` for viral detection input
- **LLM-hallucinated timestamps** — validate returned timestamps against actual segment boundaries

### MVP vs Full

- **MVP:** Text-only LLM analysis, no audio energy. Returns clips with scores. User reviews and accepts/rejects.
- **Full:** Add librosa energy analysis, few-shot prompt examples from user-accepted clips, batch processing, direct clip export to 9:16.

---

## Feature 3: Auto-Cut & Head Tracking

### Architecture

**Face detection:** MediaPipe BlazeFace (Apache 2.0, ~2MB model, 30-60 FPS on CPU). Sample at 3 FPS — face positions don't change significantly within ~330ms. Dependency: `mediapipe` (~30MB).

**Speaker diarization:** pyannote.audio 3.1 (MIT, ~10% DER, self-hosted). On CPU, processing time is roughly 0.17-0.33× the audio duration (5-10 min for 30-min audio); a GPU cuts this to 1-2 min. Dependencies: `pyannote-audio` (~200MB) + `torchaudio` (~50-80MB). PyTorch is already installed via Whisper.

**Face-speaker mapping:**

- Phase 1: Temporal correlation heuristic — match face tracks to speaker segments by maximum temporal overlap (see the sketch at the end of this section). 70-85% accuracy for 2-speaker videos. Zero additional dependencies, ~100 lines of Python.
- Phase 2: TalkNet-ASD (Active Speaker Detection) — jointly analyzes lip movement + audio to detect who is speaking. 92.3% accuracy. Requires `torchvision` + model weights (~50MB). Needs a GPU (2-5 FPS on CPU vs 15-25 FPS on GPU).

**Video compositing (Remotion approach):** Dynamic crop via CSS `transform: scale() translate()` on `
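The Phase 1 face-speaker mapping above is simple enough to sketch. Below is a minimal, illustrative version; the data shapes (face tracks as lists of time intervals, diarization segments as speaker/start/end tuples) are assumptions, not the project's actual types.

```python
# Illustrative sketch of the Phase 1 heuristic only; data shapes are assumptions.
# Face tracks come from 3 FPS detections merged into intervals; speaker segments
# come from pyannote.audio diarization.
from collections import defaultdict


def overlap(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Length in seconds of the intersection of two [start, end] intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_faces_to_speakers(
    face_tracks: dict[str, list[tuple[float, float]]],
    speaker_segments: list[tuple[str, float, float]],
) -> dict[str, str]:
    """Assign each speaker to the face track it overlaps with the most in time."""
    totals: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for speaker, seg_start, seg_end in speaker_segments:
        for track_id, intervals in face_tracks.items():
            for interval in intervals:
                totals[speaker][track_id] += overlap((seg_start, seg_end), interval)
    # Pick the best-overlapping track per speaker; zero-overlap speakers stay unmapped.
    mapping: dict[str, str] = {}
    for speaker, per_track in totals.items():
        best_track, best_seconds = max(per_track.items(), key=lambda kv: kv[1])
        if best_seconds > 0:
            mapping[speaker] = best_track
    return mapping


if __name__ == "__main__":
    # Two speakers, two face tracks built from sparse detections.
    tracks = {"face_left": [(0.0, 40.0)], "face_right": [(35.0, 90.0)]}
    segments = [("SPEAKER_00", 0.0, 30.0), ("SPEAKER_01", 45.0, 80.0)]
    print(map_faces_to_speakers(tracks, segments))
    # -> {'SPEAKER_00': 'face_left', 'SPEAKER_01': 'face_right'}
```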