
Video Features Roadmap — Technical Consultation v1

Date: 2026-03-22
Specialists consulted: ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer


Feature Overview

| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---------|------------|-----|------|------------------|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 5-7 days | 8-12 days | LLM API key only |
| 3 | Auto-Cut & Head Tracking | Very Hard | 12-15 days | 30-45 days | Phase 1: nothing; Phase 2: GPU worker |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
|   | Total | | 26-34 days | 44-65 days | |

Realistic for one dev: 6-8 weeks (all MVPs) or 3-4 months (full versions).


Feature 1: Advanced Remotion Templates

Status: Spec + implementation plan already written.

  • Spec: docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md
  • Plan: docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md

Scope: Extend CaptionStyleSchema with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".
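
A hedged sketch of the schema additions, assuming CaptionStyleSchema is a Pydantic model whose highlight enum can be widened — the field names come from the spec, but the defaults and exact Literal spellings below are assumptions:

```python
from typing import Literal, Optional
from pydantic import BaseModel

class CaptionStyleSchema(BaseModel):
    # ...existing fields unchanged...
    # Widened enum: 4 new highlight styles from the spec
    highlight_style: Literal[
        "none", "pop_in", "karaoke", "bounce", "glow_pulse"
    ] = "none"
    # 3 new fields: word entrance (carrying the 2 new transitions),
    # highlight rotation, and text casing
    word_entrance: Literal["none", "zoom_in", "drop_in"] = "none"
    highlight_rotation_deg: float = 0.0
    text_transform: Optional[Literal["uppercase", "lowercase", "capitalize"]] = None
```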

Changes: Schema extensions in Remotion + backend, rendering logic in Captions.tsx, Alembic migration for presets, frontend StyleEditor form controls.

No specialist input needed — fully designed, no new infrastructure.


Feature 2: Viral Moments Detection

Architecture

LLM API: Gemini 2.5 Flash (best Russian language support, $0.15/$0.60 per 1M tokens) or GPT-4o-mini (same pricing, slightly weaker Russian). Cost per 30-min video analysis: ~$0.005.

Audio augmentation: librosa for RMS energy curves — refines clip boundaries to natural pauses, boosts scoring for high-energy segments. Adds ~20MB dependency, processes 30-min audio in <10 seconds.
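
A minimal sketch of the energy analysis, assuming the audio has already been extracted to a mono file; frame and hop sizes are assumptions tuned for ~100ms resolution:

```python
import librosa
import numpy as np

def energy_envelope(audio_path: str, sr: int = 16000):
    """RMS energy curve at ~100ms resolution."""
    y, sr = librosa.load(audio_path, sr=sr, mono=True)
    hop = int(sr * 0.1)  # 100ms hop -> 10 values per second
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)
    return times, rms

def snap_to_low_energy(boundary_s: float, times, rms, window_s: float = 2.0) -> float:
    """Move a clip boundary to the quietest point within +/- window_s."""
    mask = np.abs(times - boundary_s) <= window_s
    if not mask.any():
        return boundary_s
    return float(times[mask][np.argmin(rms[mask])])
```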

Pipeline:

  1. Fetch transcription Document from DB
  2. librosa computes energy envelope over full audio (100ms resolution)
  3. LLM analyzes transcription text with structured JSON output prompt
  4. Post-process: snap clip boundaries to low-energy points, compute energy scores
  5. Save clips to new clips table

Backend Design

New module: clips (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.

Clip model:

Clip {
  project_id: UUID (FK projects)
  source_file_id: UUID (FK files)
  job_id: UUID? (FK jobs)
  title: str
  start_ms: int
  end_ms: int
  score: float
  source_type: "viral_detected" | "user_created" | "auto_generated"
  status: "pending" | "approved" | "rejected" | "exported"
  meta: JSON? (LLM reasoning, tags, hashtags)
}

New job type: VIRAL_DETECT added to JobTypeEnum. Actor calls LLM API directly via httpx from Dramatiq worker (no separate service needed).

LLM integration:

  • Direct HTTP call from the actor with retry + exponential backoff on 429 (sketched after this list)
  • Prompts stored in cpv3/infrastructure/prompts/viral_detection_v1.txt
  • Active version controlled by LLM_VIRAL_PROMPT_VERSION env var
  • New settings: LLM_API_URL, LLM_API_KEY, LLM_MODEL_NAME
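
A minimal sketch of the actor-side call, assuming an OpenAI-compatible chat endpoint behind LLM_API_URL; the payload shape and response parsing are assumptions, not the project's actual code:

```python
import json
import time
import httpx

def detect_viral_moments(prompt: str, transcript: str, settings) -> dict:
    payload = {
        "model": settings.LLM_MODEL_NAME,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": transcript},
        ],
        "response_format": {"type": "json_object"},  # structured JSON output
    }
    headers = {"Authorization": f"Bearer {settings.LLM_API_KEY}"}
    for attempt in range(5):
        resp = httpx.post(settings.LLM_API_URL, json=payload,
                          headers=headers, timeout=120)
        if resp.status_code == 429:  # rate limited: exponential backoff
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return json.loads(resp.json()["choices"][0]["message"]["content"])
    raise RuntimeError("LLM API: retries exhausted after repeated 429s")
```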

Frontend Design

  • New ViralClipsStep in project wizard (features/project/)
  • Clip list with thumbnails, scores, titles, approve/reject buttons
  • Clip edit modal with video preview (scoped playback for start/end range)
  • New job type VIRAL_DETECT in notification handling (existing WebSocket infrastructure)

Key Numbers

| Metric | Value |
|--------|-------|
| Accuracy (precision) | 50-70% |
| Accuracy (recall) | 60-80% |
| Processing time | 10-20 seconds |
| Cost per video | ~$0.005 |
| Cost at 1,000 videos/month | ~$5 |
| New dependencies | google-generativeai or openai (~10MB) + librosa (~20MB) |

Risks

  • Prompt engineering quality determines feature value — iterate based on user feedback
  • Visual-only moments (facial expressions, physical comedy) cannot be detected from text — ~20-30% of viral moments are missed
  • Transcription quality matters — Whisper tiny has ~25% WER on Russian; use at least small for viral detection input
  • LLM-hallucinated timestamps — validate returned timestamps against actual transcription segment boundaries (see the sketch below)
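
A tiny validation sketch for the last risk above, assuming transcription segments carry millisecond timestamps; names and length limits are illustrative:

```python
def validate_clip(clip: dict, segments: list[dict],
                  min_len_ms: int = 5_000, max_len_ms: int = 90_000) -> bool:
    """Reject clips whose timestamps fall outside the real transcript."""
    audio_end = max(s["end_ms"] for s in segments)
    start, end = clip["start_ms"], clip["end_ms"]
    if not (0 <= start < end <= audio_end):
        return False  # hallucinated or out-of-range timestamps
    return min_len_ms <= end - start <= max_len_ms
```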

MVP vs Full

  • MVP: Text-only LLM analysis, no audio energy. Returns clips with scores. User reviews and accepts/rejects.
  • Full: Add librosa energy analysis, few-shot prompt examples from user-accepted clips, batch processing, direct clip export to 9:16.

Feature 3: Auto-Cut & Head Tracking

Architecture

Face detection: MediaPipe BlazeFace (Apache 2.0, ~2MB model, 30-60 FPS on CPU). Sample at 3 FPS — face positions don't change significantly within 330ms. Dependency: mediapipe (~30MB).
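
A minimal sketch of sampled detection, assuming OpenCV for frame decoding; the output matches the FaceKeyframe format defined later in this section:

```python
import cv2
import mediapipe as mp

def detect_faces(video_path: str, sample_fps: float = 3.0) -> list[dict]:
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))  # keep every Nth frame
    keyframes = []
    detector = mp.solutions.face_detection.FaceDetection(
        model_selection=0, min_detection_confidence=0.5)  # short-range BlazeFace
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            result = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            for det in result.detections or []:
                box = det.location_data.relative_bounding_box  # 0-1 normalized
                keyframes.append({
                    "time": frame_idx / native_fps,
                    "x": box.xmin + box.width / 2,   # face center
                    "y": box.ymin + box.height / 2,
                    "width": box.width,
                    "height": box.height,
                })
        frame_idx += 1
    cap.release()
    detector.close()
    return keyframes
```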

Speaker diarization: pyannote.audio 3.1 (MIT, ~10% DER, self-hosted). Runs on CPU at roughly 0.5-1x media duration (15-30 min for 30-min audio; see the processing-time table below). A GPU accelerates this to 1-2 min. Dependencies: pyannote-audio (~200MB) + torchaudio (~50-80MB). PyTorch is already installed via Whisper.
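
A minimal diarization sketch, assuming the pipeline weights are already cached locally (the first download requires a Hugging Face token and license acceptance, as noted in the risks below):

```python
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # assumption: injected from settings
)
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))  # CPU works; GPU is roughly 10x faster

diarization = pipeline("audio.wav")
speaker_segments = [
    {"speaker": speaker, "start_s": turn.start, "end_s": turn.end}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]
```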

Face-speaker mapping:

  • Phase 1: Temporal correlation heuristic — match face tracks to speaker segments by maximum temporal overlap (sketched after this list). 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python.
  • Phase 2: TalkNet-ASD (Active Speaker Detection) — jointly analyzes lip movement + audio to detect who is speaking. 92.3% accuracy. Requires torchvision + model weights (~50MB). Needs GPU (2-5 FPS on CPU vs 15-25 FPS on GPU).
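
A sketch of the Phase 1 heuristic under assumed data shapes — each face track is assigned the speaker whose talking time overlaps it most:

```python
def overlap_s(a0: float, a1: float, b0: float, b1: float) -> float:
    return max(0.0, min(a1, b1) - max(a0, b0))

def map_faces_to_speakers(face_tracks: list[dict],
                          speaker_segments: list[dict]) -> dict[str, str]:
    """face_tracks: [{"track_id", "start_s", "end_s"}, ...]
    speaker_segments: [{"speaker", "start_s", "end_s"}, ...]"""
    mapping: dict[str, str] = {}
    for track in face_tracks:
        totals: dict[str, float] = {}
        for seg in speaker_segments:
            totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + overlap_s(
                track["start_s"], track["end_s"], seg["start_s"], seg["end_s"])
        if totals:
            mapping[track["track_id"]] = max(totals, key=totals.get)
    return mapping
```

This is also why accuracy tops out at 70-85%: temporal overlap says nothing about who is actually on screen when speakers talk over each other, which is precisely what TalkNet-ASD fixes in Phase 2.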

Video compositing (Remotion approach):

Dynamic crop via CSS transform: scale() translate() on <Video> element inside overflow: hidden container. This is a GPU-composited browser operation — essentially free performance-wise. No FFmpeg re-encoding needed for the crop itself.

New Remotion compositions:

| Composition | Purpose | Phase |
|-------------|---------|-------|
| CaptionedVideo (existing) | Caption overlay on native video | Current |
| ShortsVideo (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| AutoEditVideo (new) | Face-tracking crop + cuts + captions | Feature 3 full |

All compositions share the <Captions> component and useCaptions hook.

Crop data format (keyframes):

type FaceKeyframe = {
  time: number;       // seconds
  x: number;          // center of face, 0.0-1.0 normalized
  y: number;          // center of face, 0.0-1.0 normalized
  width: number;      // bounding box width, 0.0-1.0
  height: number;     // bounding box height, 0.0-1.0
  speakerId?: string;
};

type CropTrack = {
  keyframes: FaceKeyframe[];
  interpolation: "linear" | "ease" | "smooth";
  zoom: number;       // base zoom multiplier
  safeMargin: number; // margin around face (0.1 = 10%)
};

Use Remotion's interpolate() between keyframes for smooth pan/zoom; reserve spring() for hard cuts between speakers.

Backend Design

New job types: FACE_DETECT, SPEAKER_DIARIZE added to JobTypeEnum. Results stored in Job.output_data (JSON) — no new table needed for face/diarization data.

ML service separation:

  • Phase 1: Keep ML in the existing Dramatiq workers (same image). MediaPipe + pyannote add only ~280MB to the image.
  • Phase 2: Separate ml-worker Docker container on dedicated Dramatiq queues (ml_head_tracking, ml_diarization). Same codebase, different image, different resource limits (queue wiring sketched below).
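
A minimal sketch of the queue split, assuming the actors live in the existing tasks module; actor names and limits are assumptions:

```python
import dramatiq

@dramatiq.actor(queue_name="ml_head_tracking", max_retries=2, time_limit=3_600_000)
def face_detect_actor(job_id: str) -> None:
    ...  # MediaPipe sampling; results written to Job.output_data

@dramatiq.actor(queue_name="ml_diarization", max_retries=2, time_limit=3_600_000)
def speaker_diarize_actor(job_id: str) -> None:
    ...  # pyannote pipeline; results written to Job.output_data
```

The Phase 2 ml-worker then consumes only these queues: `dramatiq cpv3.modules.tasks.service --queues ml_head_tracking ml_diarization --processes 1 --threads 1`.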

Remotion service changes: POST /api/render needs a compositionId request parameter to select which composition to render. Props extend with crop, outputWidth, outputHeight.
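
A hedged example of the extended request body — compositionId and the new props come from this design; all other field names and the URL are assumptions:

```python
import httpx

payload = {
    "compositionId": "ShortsVideo",  # selects which composition to render
    "inputProps": {
        "outputWidth": 1080,
        "outputHeight": 1920,  # 9:16
        "crop": {
            "mode": "keyframe",
            "keyframes": [{"time": 0.0, "x": 0.5, "y": 0.4, "zoom": 1.2}],
            "interpolation": "smooth",
        },
    },
}
httpx.post("http://remotion:3001/api/render", json=payload, timeout=30)
```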

Processing Time (30-min 1080p video)

| Step | CPU | GPU |
|------|-----|-----|
| Audio extraction (FFmpeg) | 10-20 sec | 10-20 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 10-15 sec |
| Speaker diarization (pyannote) | 15-30 min | 1-2 min |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min |
| Total (parallelized) | 35-80 min | 16-40 min |

Face detection and diarization can run in parallel (different inputs: video frames vs the audio track).

Memory Requirements

| Config | Peak RAM |
|--------|----------|
| Whisper base + pyannote (parallel) | 8-12 GB |
| Whisper medium + pyannote (parallel) | 12-16 GB |
| Recommended ML worker limit | 16 GB, --threads 1 |

Frontend Design

  • Head tracking preview: video player with face bounding box overlay (canvas)
  • Speaker timeline track in TimelinePanel (extends existing 4-track system)
  • Controls: zoom level slider, transition speed, speaker selection
  • Before/after comparison toggle
  • UX flow: upload podcast → trigger analysis (ProcessingStep) → review speaker assignments → adjust → export

Key Numbers

| Metric | Value |
|--------|-------|
| Face detection accuracy | ~90% (MediaPipe on talking-head content) |
| Diarization DER | ~10% (pyannote 3.1) |
| Face-speaker mapping (Phase 1) | 70-85% accuracy |
| Face-speaker mapping (Phase 2, TalkNet) | ~92% accuracy |
| New dependencies | ~280MB (mediapipe + pyannote + torchaudio) |
| GPU mandatory? | No for Phase 1; recommended for Phase 2 |

Risks

  • Face-to-speaker mapping is the hardest unsolved subproblem — at 70-85% accuracy, 15-30% of assignments will be wrong. Users must be able to correct assignments manually.
  • Diarization on CPU is the bottleneck — 15-30 min for 30-min video. GPU reduces to 1-2 min.
  • PyTorch version conflicts between Whisper and pyannote — test uv sync before committing.
  • Video quality loss when cropping 16:9 to 9:16 — only ~31.6% of the frame width is kept (a full-height 9:16 window on a 16:9 frame covers 81/256 of its width). Source must be at least 1080p.
  • Model download on first run — pyannote models (~100MB) require Hugging Face license acceptance. Handle in Dockerfile, not at runtime.

MVP vs Full

  • MVP (12-15 days): Face detection on sampled frames. User manually selects which face to follow. Static crop to selected face. No speaker switching, no diarization. Works for single-speaker content.
  • Full (30-45 days): Speaker diarization + face-speaker mapping. Dynamic crop following active speaker. Smooth spring() transitions on speaker changes. Split-screen for reactions. Multi-speaker support.

Feature 4: 9:16 Shorts Conversion

Architecture

Pipeline: Crop-then-caption, always. Single Remotion render pass using new ShortsVideo composition. The composition renders at target 9:16 dimensions, applies CSS crop transform to <Video>, and renders captions on top.

Caption positioning: No new schema fields needed. Backend adjusts font_size, padding_px, max_width_pct in styleConfig for 9:16 aspect ratio. Remotion is a "dumb renderer" — intelligence about what looks good at 9:16 belongs in presets.

Crop specification:

type CropConfig = {
  mode: "static" | "keyframe";
  staticCrop?: { x: number; y: number; zoom: number };  // 0-1 normalized
  keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
  interpolation?: "linear" | "ease" | "smooth";
};

Static crop is a degenerate case of keyframe crop (single keyframe).

Backend Design

New job type: ASPECT_CONVERT in JobTypeEnum. New function crop_to_vertical() in media/service.py using FFmpeg's crop+scale filter chain (sketched below).

New artifact type: VERTICAL_VIDEO in ArtifactTypeEnum.
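
A minimal sketch of crop_to_vertical(), assuming ffmpeg on PATH and a normalized center-x for the crop window; the encoder flags are assumptions:

```python
import subprocess

def crop_to_vertical(src: str, dst: str, center_x: float = 0.5,
                     out_w: int = 1080, out_h: int = 1920) -> None:
    # Keep full height; take a 9:16 window horizontally positioned by center_x.
    vf = f"crop=ih*9/16:ih:(iw-ih*9/16)*{center_x}:0,scale={out_w}:{out_h}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf,
         "-c:v", "libx264", "-preset", "medium", "-c:a", "copy", dst],
        check=True,
    )
```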

Pipeline:

  1. Trim source video to clip time range (if from viral detection)
  2. Apply crop (static center crop or face-tracking crop from Feature 3)
  3. Upload to S3 at {folder}/vertical/{filename}
  4. Webhook + notification

Frontend Design

  • Crop preview: draggable 9:16 rectangle overlay on video player (CSS object-fit: cover + object-position)
  • Side-by-side preview toggle: original 16:9 vs cropped 9:16
  • Integration with Feature 2: "Convert to Short" button on each approved viral clip
  • Integration with Feature 3: auto-populate crop region from face detection data

Processing Time

| Approach | Time (30-min video) |
|----------|---------------------|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |

MVP vs Full

  • MVP (6-8 days): Manual crop region selection with preview. User drags a 9:16 rectangle over video. New ShortsVideo Remotion composition renders crop + captions.
  • Full (+3-4 days after Feature 3): Auto-crop based on face detection data. One-click vertical conversion. Batch conversion of viral clips.

Recommended Implementation Order

Week 1-2:    Feature 1 (Templates)        ████████
Week 2-4:    Feature 2 (Viral Detection)  ████████████████
Week 4-6:    Feature 4 MVP (9:16 crop)    ████████████████
Week 6-14:   Feature 3 (Head Tracking)    ████████████████████████████████████████
Week 14-15:  Feature 4 upgrade            ████████

Rationale:

  1. Templates first — ready to implement, zero risk, immediate user value
  2. Viral detection second — highest value/effort ratio ($0.005/video, 5-7 days MVP), validates that users want automated editing
  3. 9:16 MVP third — builds the ShortsVideo composition that Feature 3 extends, useful standalone with manual crop
  4. Head tracking last — most complex, biggest investment, validates demand from Features 2+4 first
  5. 9:16 upgrade — trivial once head tracking provides face position data

Cost Analysis

Per-Video Processing Cost

| Tier | Components | Compute | LLM API | Total | Wait Time |
|------|------------|---------|---------|-------|-----------|
| CPU-only | All on CPU | $0.05 | $0.06 | $0.11 | 35-80 min |
| GPU (T4) | ML on GPU, FFmpeg on CPU | $0.11 | $0.06 | $0.17 | 16-40 min |
| GPU + NVENC | Everything on GPU | $0.13 | $0.06 | $0.19 | 10-15 min |

Monthly Infrastructure Cost (100 videos/month)

| Scenario | Cost |
|----------|------|
| CPU-only (existing infra) | ~$11 + server |
| Modal serverless GPU | ~$21/month |
| Spot GPU (g4dn.xlarge) | ~$115/month |
| Standing GPU (g4dn.xlarge 24/7) | ~$380/month |

Recommendation: Start CPU-only. Move to Modal serverless GPU when queue wait times exceed 15 minutes. At 500+ videos/day, evaluate spot instances.

Suggested SaaS Pricing Tiers

| Tier | Price | Limits | Compute Cost | Margin |
|------|-------|--------|--------------|--------|
| Free | $0 | 10-min videos, low queue priority | ~$0.04/video | Marketing |
| Pro | $15-30/mo | 30-min videos, GPU ML | ~$0.17/video at 50 videos | 60-80% |
| Business | $50-100/mo | 60-min videos, priority queue, NVENC | ~$0.38/video | 70-85% |

Infrastructure Decisions

ML Service Separation

Phase 1: Keep ML in the existing Dramatiq workers. MediaPipe + pyannote add only ~280MB to the image. PyTorch is already installed via Whisper.

Phase 2: Separate ml-worker Docker container on dedicated queues. Same codebase, different image (Dockerfile.ml), different resource limits. Use Docker Compose profiles:

docker-compose up                    # Default: no ML worker
docker-compose --profile ml up       # With ML worker

Do NOT build a separate HTTP microservice. Dramatiq already handles job queuing, retries, progress, and cancellation. Adding HTTP service discovery, API contracts, and health checks is overhead with zero benefit for async workloads.

Immediate Optimizations (Before New Features)

| Action | Impact | Effort |
|--------|--------|--------|
| Switch PyTorch to CPU-only index | -800MB image size | 1 hour |
| Fix worker REMOTION_SERVICE_URL default | Bug fix | 5 min |
| Add resource limits to docker-compose services | Prevent OOM cascades | 30 min |
| Split Dramatiq into queue pools (lightweight vs ML vs compute) | Prevent worker starvation | 2-3 hours |

Technology Stack Summary

New Dependencies

| Package | Size | Purpose | Feature |
|---------|------|---------|---------|
| google-generativeai or openai | ~10 MB | LLM API client | 2 |
| librosa | ~20 MB | Audio energy analysis | 2 |
| mediapipe | ~30 MB | Face detection | 3 |
| pyannote-audio | ~200 MB | Speaker diarization | 3 |
| torchaudio | ~50-80 MB | Audio processing for pyannote | 3 |
| Total new deps | ~310-340 MB | | |

New Backend Modules

| Module | Purpose | Feature |
|--------|---------|---------|
| clips | Clip CRUD, review workflow | 2 |

New Remotion Compositions

| Composition | Purpose | Feature |
|-------------|---------|---------|
| ShortsVideo | Static/keyframe crop + captions at 9:16 | 4 |
| AutoEditVideo | Face-tracking dynamic crop + captions | 3 |

New Job Types

| Job Type | Purpose | Feature |
|----------|---------|---------|
| VIRAL_DETECT | LLM analysis of transcription | 2 |
| ASPECT_CONVERT | 9:16 crop + re-encode | 4 |
| FACE_DETECT | Face bounding box detection | 3 |
| SPEAKER_DIARIZE | Speaker diarization | 3 |

Cross-Cutting Issues

| Issue | Flagged By | Priority | Action |
|-------|------------|----------|--------|
| PyTorch installs CUDA libs on CPU-only infra (+800MB) | DevOps | High | Switch to CPU-only PyTorch index |
| Worker --processes 1 --threads 2 will OOM with ML jobs | Performance | High | Split into queue pools, --threads 1 for ML |
| _get_job_status_sync() leaks DB connections | Performance | High | Fix before adding more actors |
| No temp file cleanup on OOM crash | Performance | Medium | Add periodic /tmp cleanup or cron |
| tasks/service.py at 1,674 lines, will exceed 2K | Backend | Medium | Extract actor boilerplate into decorator/context manager |
| Worker REMOTION_SERVICE_URL default wrong (localhost:8001) | DevOps | Medium | Fix to http://remotion:3001 in docker-compose |
| No resource limits on any Docker service | DevOps | Medium | Add memory/CPU limits to all services |
| Whisper should move to ML service eventually | Backend | Low | Plan for Phase 2 when ML worker is split out |
| isCurrent word identity check in Captions.tsx is fragile | Remotion | Low | Compare by index, not text + start time |

Specialist Reports (Full Transcripts)

Full specialist outputs are available in the session transcript. Key files each specialist examined:

  • ML Engineer: cpv3/modules/transcription/service.py, cpv3/modules/tasks/service.py, pyproject.toml
  • Backend Architect: cpv3/modules/tasks/service.py, cpv3/modules/jobs/schemas.py, cpv3/modules/media/service.py, cpv3/modules/captions/service.py, docker-compose.yml
  • Remotion Engineer: remotion_service/src/components/Composition.tsx, Captions.tsx, Root.tsx, useCaptions.ts, useVideoMeta.ts, all type definitions
  • Frontend Architect: src/widgets/TimelinePanel/, src/features/project/FragmentsStep/, src/shared/context/WizardContext.tsx, src/shared/store/notifications/
  • DevOps Engineer: docker-compose.yml, Dockerfile, pyproject.toml, uv.lock
  • Performance Engineer: cpv3/modules/tasks/service.py, cpv3/modules/media/service.py, cpv3/modules/transcription/service.py, docker-compose.yml