Video Features Roadmap — Technical Consultation v1
Date: 2026-03-22
Specialists consulted: ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer
Feature Overview
| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---|---|---|---|---|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 5-7 days | 8-12 days | LLM API key only |
| 3 | Auto-Cut & Head Tracking | Very Hard | 12-15 days | 30-45 days | Phase 1: nothing; Phase 2: GPU worker |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| | Total | | 26-34 days | 44-65 days | |
Realistic for one dev: 6-8 weeks (all MVPs) or 3-4 months (full versions).
Feature 1: Advanced Remotion Templates
Status: Spec + implementation plan already written.
- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`
Scope: Extend CaptionStyleSchema with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".
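As an illustration only, the new schema fields might look roughly like the Pydantic sketch below; the enum groupings and defaults are guesses, since the actual shape is defined in the spec linked above.

```python
# Illustration only: a rough Pydantic shape for the new CaptionStyleSchema fields.
# The real schema lives in the spec above; enum groupings and defaults here are guesses.
from enum import Enum
from pydantic import BaseModel, Field

class HighlightStyle(str, Enum):
    pop_in = "pop_in"
    karaoke = "karaoke"
    bounce = "bounce"
    glow_pulse = "glow_pulse"

class WordEntrance(str, Enum):
    zoom_in = "zoom_in"
    drop_in = "drop_in"

class CaptionStyleExtensions(BaseModel):
    highlight_style: HighlightStyle | None = None
    word_entrance: WordEntrance | None = None
    highlight_rotation_deg: float = Field(default=0.0, ge=-45.0, le=45.0)
    text_transform: str | None = None  # e.g. "uppercase"
```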
Changes: Schema extensions in Remotion + backend, rendering logic in Captions.tsx, Alembic migration for presets, frontend StyleEditor form controls.
No specialist input needed — fully designed, no new infrastructure.
Feature 2: Viral Moments Detection
Architecture
LLM API: Gemini 2.5 Flash (best Russian language support, $0.15/$0.60 per 1M tokens) or GPT-4o-mini (same pricing, slightly weaker Russian). Cost per 30-min video analysis: ~$0.005.
Audio augmentation: librosa for RMS energy curves — refines clip boundaries to natural pauses, boosts scoring for high-energy segments. Adds ~20MB dependency, processes 30-min audio in <10 seconds.
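A minimal sketch of the energy-envelope and boundary-snapping step, assuming librosa; the 16 kHz mono load, search window, and helper names are assumptions, not an agreed interface.

```python
# Sketch only: compute an RMS energy envelope and snap a clip boundary to a quiet point.
# Assumes librosa is installed; sample rate, window sizes, and function names are assumptions.
import numpy as np
import librosa

def compute_energy_envelope(audio_path: str, hop_s: float = 0.1) -> tuple[np.ndarray, float]:
    """RMS energy at ~100 ms resolution; returns (envelope, hop size in seconds)."""
    y, sr = librosa.load(audio_path, sr=16_000, mono=True)
    hop = int(sr * hop_s)
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    return rms, hop_s

def snap_to_low_energy(boundary_ms: int, rms: np.ndarray, hop_s: float,
                       window_s: float = 2.0) -> int:
    """Move a boundary to the quietest frame within +/- window_s of the LLM's timestamp."""
    idx = int(boundary_ms / 1000 / hop_s)
    half = int(window_s / hop_s)
    lo, hi = max(0, idx - half), min(len(rms), idx + half + 1)
    best = lo + int(np.argmin(rms[lo:hi]))
    return int(best * hop_s * 1000)
```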
Pipeline:
- Fetch transcription Document from DB
- librosa computes energy envelope over full audio (100ms resolution)
- LLM analyzes transcription text with structured JSON output prompt
- Post-process: snap clip boundaries to low-energy points, compute energy scores
- Save clips to new `clips` table
Backend Design
New module: clips (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.
Clip model:
Clip {
project_id: UUID (FK projects)
source_file_id: UUID (FK files)
job_id: UUID? (FK jobs)
title: str
start_ms: int
end_ms: int
score: float
source_type: "viral_detected" | "user_created" | "auto_generated"
status: "pending" | "approved" | "rejected" | "exported"
meta: JSON? (LLM reasoning, tags, hashtags)
}
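A possible SQLAlchemy mapping of this model, sketched under the assumption of a SQLAlchemy 2.0 declarative base; the base class, column types, and string-enum handling are assumptions rather than the project's actual conventions.

```python
# Sketch only: one way the Clip model could map to SQLAlchemy 2.0.
# Base class, table/column names, and enum-as-string handling are assumptions.
import uuid
from sqlalchemy import JSON, Float, ForeignKey, Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Clip(Base):
    __tablename__ = "clips"

    id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    project_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("projects.id"))
    source_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("files.id"))
    job_id: Mapped[uuid.UUID | None] = mapped_column(ForeignKey("jobs.id"), nullable=True)
    title: Mapped[str] = mapped_column(String(255))
    start_ms: Mapped[int] = mapped_column(Integer)
    end_ms: Mapped[int] = mapped_column(Integer)
    score: Mapped[float] = mapped_column(Float)
    # "viral_detected" | "user_created" | "auto_generated"
    source_type: Mapped[str] = mapped_column(String(32))
    # "pending" | "approved" | "rejected" | "exported"
    status: Mapped[str] = mapped_column(String(16), default="pending")
    meta: Mapped[dict | None] = mapped_column(JSON, nullable=True)
```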
New job type: VIRAL_DETECT added to JobTypeEnum. Actor calls LLM API directly via httpx from Dramatiq worker (no separate service needed).
LLM integration:
- Direct HTTP call from the actor with retry + exponential backoff on 429 (sketched after this list)
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL_NAME`
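A sketch of the retry wrapper, assuming httpx; the endpoint path, payload shape, and bearer-token auth are illustrative rather than any specific provider's API.

```python
# Sketch: call the LLM endpoint with retry + exponential backoff on 429.
# Endpoint path, payload keys, and auth header are illustrative, not a specific provider's API.
import asyncio
import httpx

async def call_llm(prompt: str, *, api_url: str, api_key: str, model: str,
                   max_retries: int = 4) -> dict:
    async with httpx.AsyncClient(timeout=120) as client:
        for attempt in range(max_retries + 1):
            resp = await client.post(
                api_url,
                headers={"Authorization": f"Bearer {api_key}"},
                json={"model": model, "prompt": prompt},
            )
            if resp.status_code == 429 and attempt < max_retries:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s
                continue
            resp.raise_for_status()
            return resp.json()
    raise RuntimeError("unreachable")  # loop always returns or raises
```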
Frontend Design
- New `ViralClipsStep` in project wizard (`features/project/`)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)
Key Numbers
| Metric | Value |
|---|---|
| Accuracy (precision) | 50-70% |
| Accuracy (recall) | 60-80% |
| Processing time | 10-20 seconds |
| Cost per video | ~$0.005 |
| Cost at 1,000 videos/month | ~$5 |
| New dependencies | google-generativeai or openai (~10MB) + librosa (~20MB) |
Risks
- Prompt engineering quality determines feature value — iterate based on user feedback
- Visual-only moments (facial expressions, physical comedy) cannot be detected from text — ~20-30% of viral moments are missed
- Transcription quality matters — Whisper `tiny` has ~25% WER on Russian; use at least `small` for viral detection input
- LLM-hallucinated timestamps — validate returned timestamps against actual segment boundaries
MVP vs Full
- MVP: Text-only LLM analysis, no audio energy. Returns clips with scores. User reviews and accepts/rejects.
- Full: Add librosa energy analysis, few-shot prompt examples from user-accepted clips, batch processing, direct clip export to 9:16.
Feature 3: Auto-Cut & Head Tracking
Architecture
Face detection: MediaPipe BlazeFace (Apache 2.0, ~2MB model, 30-60 FPS on CPU). Sample at 3 FPS — face positions don't change significantly within 330ms. Dependency: mediapipe (~30MB).
Speaker diarization: pyannote.audio 3.1 (MIT, ~10% DER, self-hosted). Runs on CPU at 0.17-0.33x real-time (5-10 min for 30-min audio). GPU accelerates to 1-2 min. Dependencies: pyannote-audio (~200MB) + torchaudio (~50-80MB). PyTorch already installed via Whisper.
Face-speaker mapping:
- Phase 1: Temporal correlation heuristic — match face tracks to speaker segments by maximum temporal overlap (see the sketch after this list). 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python.
- Phase 2: TalkNet-ASD (Active Speaker Detection) — jointly analyzes lip movement + audio to detect who is speaking. 92.3% accuracy. Requires `torchvision` + model weights (~50MB). Needs GPU (2-5 FPS on CPU vs 15-25 FPS on GPU).
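A sketch of the Phase 1 heuristic; the input shapes (face tracks and speaker segments as lists of (start, end) intervals in seconds) are assumptions about how the MediaPipe and pyannote outputs would be normalized.

```python
# Sketch: assign each face track to the speaker whose segments overlap it the most in time.
# Input data shapes are assumptions; real tracks come from MediaPipe sampling and pyannote.
from collections import defaultdict

def overlap(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def map_faces_to_speakers(
    face_tracks: dict[str, list[tuple[float, float]]],       # face_id -> visible intervals
    speaker_segments: dict[str, list[tuple[float, float]]],  # speaker_id -> speaking intervals
) -> dict[str, str]:
    mapping: dict[str, str] = {}
    for face_id, face_ivs in face_tracks.items():
        totals: dict[str, float] = defaultdict(float)
        for speaker_id, spk_ivs in speaker_segments.items():
            for f in face_ivs:
                for s in spk_ivs:
                    totals[speaker_id] += overlap(f, s)
        if totals:
            mapping[face_id] = max(totals, key=totals.get)
    return mapping
```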
Video compositing (Remotion approach):
Dynamic crop via CSS transform: scale() translate() on <Video> element inside overflow: hidden container. This is a GPU-composited browser operation — essentially free performance-wise. No FFmpeg re-encoding needed for the crop itself.
New Remotion compositions:
| Composition | Purpose | Phase |
|---|---|---|
| `CaptionedVideo` (existing) | Caption overlay on native video | Current |
| `ShortsVideo` (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| `AutoEditVideo` (new) | Face-tracking crop + cuts + captions | Feature 3 full |
All compositions share the <Captions> component and useCaptions hook.
Crop data format (keyframes):
type FaceKeyframe = {
time: number; // seconds
x: number; // center of face, 0.0-1.0 normalized
y: number; // center of face, 0.0-1.0 normalized
width: number; // bounding box width, 0.0-1.0
height: number; // bounding box height, 0.0-1.0
speakerId?: string;
};
type CropTrack = {
keyframes: FaceKeyframe[];
interpolation: "linear" | "ease" | "smooth";
zoom: number; // base zoom multiplier
safeMargin: number; // margin around face (0.1 = 10%)
};
Use Remotion's `interpolate()` between keyframes for smooth pan/zoom; use `spring()` only for hard cuts between speakers.
Backend Design
New job types: FACE_DETECT, SPEAKER_DIARIZE added to JobTypeEnum. Results stored in Job.output_data (JSON) — no new table needed for face/diarization data.
ML service separation:
- Phase 1: Keep in Dramatiq workers (same image). MediaPipe + pyannote add only ~280MB to image.
- Phase 2: Separate `ml-worker` Docker container on dedicated Dramatiq queues (`ml_head_tracking`, `ml_diarization`); see the sketch after this list. Same codebase, different image, different resource limits.
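A sketch of how ML actors could be pinned to the dedicated queues, assuming a Redis broker and the module path referenced elsewhere in this document; the broker URL, actor bodies, and limits are placeholders.

```python
# Sketch: pin ML actors to dedicated queues so the default worker never picks them up.
# Queue names follow the text above; broker URL, limits, and actor bodies are placeholders.
import dramatiq
from dramatiq.brokers.redis import RedisBroker

dramatiq.set_broker(RedisBroker(url="redis://redis:6379/0"))  # assumed broker URL

@dramatiq.actor(queue_name="ml_diarization", max_retries=1, time_limit=3_600_000)
def speaker_diarize(file_id: str) -> None:
    ...  # load pyannote pipeline, run diarization, store result in Job.output_data

@dramatiq.actor(queue_name="ml_head_tracking", max_retries=1, time_limit=3_600_000)
def face_detect(file_id: str) -> None:
    ...  # sample frames at 3 FPS, run MediaPipe, store face keyframes

# The ML worker container would then consume only these queues, e.g.:
#   dramatiq cpv3.modules.tasks.service --queues ml_diarization ml_head_tracking \
#       --processes 1 --threads 1
```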
Remotion service changes: POST /api/render needs a compositionId request parameter to select which composition to render. Props extend with crop, outputWidth, outputHeight.
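A sketch of the worker-side render request with the new parameter; payload keys other than `compositionId`, `crop`, `outputWidth`, and `outputHeight` are assumptions about the existing request shape.

```python
# Sketch: worker-side request to the Remotion service with the new compositionId field.
# Only compositionId, crop, outputWidth, outputHeight come from the text above;
# the rest of the payload shape is an assumption.
import httpx

def request_render(base_url: str, *, composition_id: str, props: dict) -> dict:
    payload = {
        "compositionId": composition_id,  # e.g. "ShortsVideo" or "AutoEditVideo"
        "props": {
            **props,                      # existing caption/style props, plus "crop"
            "outputWidth": 1080,
            "outputHeight": 1920,
        },
    }
    resp = httpx.post(f"{base_url}/api/render", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```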
Processing Time (30-min 1080p video)
| Step | CPU | GPU |
|---|---|---|
| Audio extraction (FFmpeg) | 10-20 sec | 10-20 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 10-15 sec |
| Speaker diarization (pyannote) | 15-30 min | 1-2 min |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min |
| Total (parallelized) | 35-80 min | 16-40 min |
Face detection + diarization can run in parallel (different input: video frames vs audio track).
Memory Requirements
| Config | Peak RAM |
|---|---|
| Whisper base + pyannote (parallel) | 8-12 GB |
| Whisper medium + pyannote (parallel) | 12-16 GB |
| Recommended ML worker limit | 16 GB, --threads 1 |
Frontend Design
- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel (extends existing 4-track system)
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle
- UX flow: upload podcast → trigger analysis (ProcessingStep) → review speaker assignments → adjust → export
Key Numbers
| Metric | Value |
|---|---|
| Face detection accuracy | ~90% (MediaPipe on talking-head content) |
| Diarization DER | ~10% (pyannote 3.1) |
| Face-speaker mapping (Phase 1) | 70-85% accuracy |
| Face-speaker mapping (Phase 2, TalkNet) | ~92% accuracy |
| New dependencies | ~280MB (mediapipe + pyannote + torchaudio) |
| GPU mandatory? | No for Phase 1; recommended for Phase 2 |
Risks
- Face-to-speaker mapping is the hardest unsolved subproblem — 70-85% accuracy means 1 in 5 assignments may be wrong. Must let users manually correct.
- Diarization on CPU is the bottleneck — 15-30 min for 30-min video. GPU reduces to 1-2 min.
- PyTorch version conflicts between Whisper and pyannote — test `uv sync` before committing.
- Video quality loss when cropping 16:9 to 9:16 — only ~31.6% of frame width is kept. Source must be at least 1080p.
- Model download on first run — pyannote models (~100MB) require Hugging Face license acceptance. Handle in Dockerfile, not at runtime.
MVP vs Full
- MVP (12-15 days): Face detection on sampled frames. User manually selects which face to follow. Static crop to selected face. No speaker switching, no diarization. Works for single-speaker content.
- Full (30-45 days): Speaker diarization + face-speaker mapping. Dynamic crop following active speaker. Smooth spring() transitions on speaker changes. Split-screen for reactions. Multi-speaker support.
Feature 4: 9:16 Shorts Conversion
Architecture
Pipeline: Crop-then-caption, always. Single Remotion render pass using new ShortsVideo composition. The composition renders at target 9:16 dimensions, applies CSS crop transform to <Video>, and renders captions on top.
Caption positioning: No new schema fields needed. Backend adjusts font_size, padding_px, max_width_pct in styleConfig for 9:16 aspect ratio. Remotion is a "dumb renderer" — intelligence about what looks good at 9:16 belongs in presets.
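Illustration only: one way the backend preset adjustment could look, using the field names above with made-up values.

```python
# Illustration: example styleConfig overrides for a 9:16 render.
# Field names follow the text above; the specific values are made up.
def adjust_style_for_vertical(style: dict) -> dict:
    vertical = dict(style)
    vertical.update({
        "font_size": int(style.get("font_size", 48) * 1.3),  # larger type for phone screens
        "padding_px": 48,
        "max_width_pct": 90,  # captions span most of the narrow frame
    })
    return vertical
```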
Crop specification:
type CropConfig = {
mode: "static" | "keyframe";
staticCrop?: { x: number; y: number; zoom: number }; // 0-1 normalized
keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
interpolation?: "linear" | "ease" | "smooth";
};
Static crop is a degenerate case of keyframe crop (single keyframe).
Backend Design
New job type: ASPECT_CONVERT in JobTypeEnum. New function crop_to_vertical() in media/service.py using FFmpeg crop+scale filter.
New artifact type: VERTICAL_VIDEO in ArtifactTypeEnum.
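A sketch of what `crop_to_vertical()` could look like, assuming a plain FFmpeg crop+scale invocation via subprocess; the output resolution, codec flags, and function signature are assumptions.

```python
# Sketch: center-crop a 16:9 source to 9:16 and scale to 1080x1920.
# Output resolution, codec flags, and the signature are assumptions; the real helper
# would live in media/service.py.
import subprocess

def crop_to_vertical(src: str, dst: str, center_x: float = 0.5) -> None:
    """center_x is the normalized horizontal crop center (0.0-1.0); 0.5 = center crop."""
    # For a 16:9 frame, the 9:16 crop width is ih*9/16 (~31.6% of the original width).
    vf = f"crop=ih*9/16:ih:(iw-ih*9/16)*{center_x}:0,scale=1080:1920"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf,
         "-c:v", "libx264", "-preset", "medium", "-c:a", "copy", dst],
        check=True,
    )
```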
Pipeline:
- Trim source video to clip time range (if from viral detection)
- Apply crop (static center crop or face-tracking crop from Feature 3)
- Upload to S3 at `{folder}/vertical/{filename}`
- Webhook + notification
Frontend Design
- Crop preview: draggable 9:16 rectangle overlay on video player (CSS `object-fit: cover` + `object-position`)
- Side-by-side preview toggle: original 16:9 vs cropped 9:16
- Integration with Feature 2: "Convert to Short" button on each approved viral clip
- Integration with Feature 3: auto-populate crop region from face detection data
Processing Time
| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |
MVP vs Full
- MVP (6-8 days): Manual crop region selection with preview. User drags a 9:16 rectangle over video. New `ShortsVideo` Remotion composition renders crop + captions.
- Full (+3-4 days after Feature 3): Auto-crop based on face detection data. One-click vertical conversion. Batch conversion of viral clips.
Recommended Build Order
Week 1-2: Feature 1 (Templates) ████████
Week 2-4: Feature 2 (Viral Detection) ████████████████
Week 4-6: Feature 4 MVP (9:16 crop) ████████████████
Week 6-14: Feature 3 (Head Tracking) ████████████████████████████████████████
Week 14-15: Feature 4 upgrade ████████
Rationale:
- Templates first — ready to implement, zero risk, immediate user value
- Viral detection second — highest value/effort ratio ($0.005/video, 5-7 days MVP), validates that users want automated editing
- 9:16 MVP third — builds the `ShortsVideo` composition that Feature 3 extends, useful standalone with manual crop
- Head tracking last — most complex, biggest investment, validates demand from Features 2+4 first
- 9:16 upgrade — trivial once head tracking provides face position data
Cost Analysis
Per-Video Processing Cost
| Tier | Components | Compute | LLM API | Total | Wait Time |
|---|---|---|---|---|---|
| CPU-only | All on CPU | $0.05 | $0.06 | $0.11 | 35-80 min |
| GPU (T4) | ML on GPU, FFmpeg on CPU | $0.11 | $0.06 | $0.17 | 16-40 min |
| GPU + NVENC | Everything on GPU | $0.13 | $0.06 | $0.19 | 10-15 min |
Monthly Infrastructure Cost (100 videos/month)
| Scenario | Cost |
|---|---|
| CPU-only (existing infra) | ~$11 + server |
| Modal serverless GPU | ~$21/month |
| Spot GPU (g4dn.xlarge) | ~$115/month |
| Standing GPU (g4dn.xlarge 24/7) | ~$380/month |
Recommendation: Start CPU-only. Move to Modal serverless GPU when queue wait times exceed 15 minutes. At 500+ videos/day, evaluate spot instances.
Suggested SaaS Pricing Tiers
| Tier | Price | Limits | Compute Cost | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, queue priority low | ~$0.04/video | Marketing |
| Pro | $15-30/mo | 30-min videos, GPU ML | ~$0.17/video at 50 videos | 60-80% |
| Business | $50-100/mo | 60-min videos, priority queue, NVENC | ~$0.38/video | 70-85% |
Infrastructure Decisions
ML Service Separation
Phase 1: Keep ML in existing Dramatiq workers. MediaPipe + pyannote add only ~280MB to image. PyTorch is already installed via Whisper.
Phase 2: Separate ml-worker Docker container on dedicated queues. Same codebase, different image (Dockerfile.ml), different resource limits. Use Docker Compose profiles:
docker-compose up # Default: no ML worker
docker-compose --profile ml up # With ML worker
Do NOT build a separate HTTP microservice. Dramatiq already handles job queuing, retries, progress, and cancellation. Adding HTTP service discovery, API contracts, and health checks is overhead with zero benefit for async workloads.
Immediate Optimizations (Before New Features)
| Action | Impact | Effort |
|---|---|---|
| Switch PyTorch to CPU-only index | -800MB image size | 1 hour |
| Fix worker `REMOTION_SERVICE_URL` default | Bug fix | 5 min |
| Add resource limits to docker-compose services | Prevent OOM cascades | 30 min |
| Split Dramatiq into queue pools (lightweight vs ML vs compute) | Prevent worker starvation | 2-3 hours |
Technology Stack Summary
New Dependencies
| Package | Size | Purpose | Feature |
|---|---|---|---|
| `google-generativeai` or `openai` | ~10 MB | LLM API client | 2 |
| `librosa` | ~20 MB | Audio energy analysis | 2 |
| `mediapipe` | ~30 MB | Face detection | 3 |
| `pyannote-audio` | ~200 MB | Speaker diarization | 3 |
| `torchaudio` | ~50-80 MB | Audio processing for pyannote | 3 |
| Total new deps | ~310-340 MB | | |
New Backend Modules
| Module | Purpose | Feature |
|---|---|---|
| `clips` | Clip CRUD, review workflow | 2 |
New Remotion Compositions
| Composition | Purpose | Feature |
|---|---|---|
| `ShortsVideo` | Static/keyframe crop + captions at 9:16 | 4 |
| `AutoEditVideo` | Face-tracking dynamic crop + captions | 3 |
New Job Types
| Job Type | Purpose | Feature |
|---|---|---|
| `VIRAL_DETECT` | LLM analysis of transcription | 2 |
| `ASPECT_CONVERT` | 9:16 crop + re-encode | 4 |
| `FACE_DETECT` | Face bounding box detection | 3 |
| `SPEAKER_DIARIZE` | Speaker diarization | 3 |
Cross-Cutting Issues
| Issue | Flagged By | Priority | Action |
|---|---|---|---|
| PyTorch installs CUDA libs on CPU-only infra (+800MB) | DevOps | High | Switch to CPU-only PyTorch index |
| Worker `--processes 1 --threads 2` will OOM with ML jobs | Performance | High | Split into queue pools, `--threads 1` for ML |
| `_get_job_status_sync()` leaks DB connections | Performance | High | Fix before adding more actors |
| No temp file cleanup on OOM crash | Performance | Medium | Add periodic /tmp cleanup or cron |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Backend | Medium | Extract actor boilerplate into decorator/context manager |
| Worker `REMOTION_SERVICE_URL` default wrong (`localhost:8001`) | DevOps | Medium | Fix to `http://remotion:3001` in docker-compose |
| No resource limits on any Docker service | DevOps | Medium | Add memory/CPU limits to all services |
| Whisper should move to ML service eventually | Backend | Low | Plan for Phase 2 when ML worker is split out |
| `isCurrent` word identity check in Captions.tsx is fragile | Remotion | Low | Compare by index, not text + start time |
Specialist Reports (Full Transcripts)
Full specialist outputs are available in the session transcript. Key files each specialist examined:
- ML Engineer: `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- Backend Architect: `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- Remotion Engineer: `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- Frontend Architect: `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- DevOps Engineer: `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- Performance Engineer: `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`