
# Video Features Roadmap — Technical Consultation v2 (API-First)

- **Date:** 2026-03-22
- **Specialists consulted:** ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer
- **Revision:** v2 — switched to API-first architecture using Deepgram, GigaChat, and DeepInfra


## What Changed from v1

v2 replaces local ML models with managed API services. This is the single biggest architectural change — it eliminates PyTorch, GPU infrastructure, ML worker separation, and most memory/processing bottlenecks.

### API Substitutions

| v1 (Local ML) | v2 (API-First) | Impact |
|---|---|---|
| Local Whisper (PyTorch, 20-60 min CPU) | Deepgram Nova-3 API (~30 sec) | Eliminates PyTorch dependency entirely |
| Local pyannote.audio (PyTorch, 15-30 min CPU) | Deepgram `diarize=true` (included in transcription call) | Eliminates pyannote + torchaudio deps |
| Gemini 2.5 Flash / GPT-4o-mini for viral detection | GigaChat Pro (native Russian LLM by Sber) | Better Russian cultural context, humor, slang |
| librosa for audio energy analysis | Deepgram `sentiment=true` per utterance | Sentiment replaces energy analysis for most cases |
| N/A | DeepInfra (Llama, Mistral, Qwen via API) | Fallback/A/B testing for LLM analysis |

### Key Metrics Changed

| Metric | v1 | v2 | Change |
|---|---|---|---|
| Docker image size | 1.72 GB | ~400-500 MB | -75% (no PyTorch) |
| Peak worker RAM | 8-16 GB | ~400 MB (MediaPipe only) | -95% |
| Processing time (30-min video, full pipeline) | 35-80 min (CPU) | 5-10 min | -85% |
| Per-video cost | $0.11 | $0.20 | +80% (API costs) |
| Monthly cost (100 videos) | $11 compute + server + $0-380 GPU | $20 APIs + server | Simpler, cheaper at low volume |
| GPU needed? | Phase 2 for diarization | Never | Eliminated |
| New Python dependencies | ~310-340 MB | ~40 MB (mediapipe + HTTP clients) | -88% |
| MVP total timeline | 26-34 dev-days | 20-27 dev-days | -20-25% |

### Issues Eliminated

These v1 cross-cutting issues no longer apply:

| v1 Issue | Why It's Gone |
|---|---|
| Switch PyTorch to CPU-only index | PyTorch removed entirely (Whisper replaced by Deepgram) |
| Worker OOM with concurrent ML jobs | No heavy ML — standard 4 GB worker |
| Separate ML worker Docker image | Single lightweight image |
| GPU infrastructure planning | All ML is API-based |
| PyTorch version conflicts | No PyTorch |
| Model download on first run | No local models (except MediaPipe, ~2 MB) |
| ML worker separation via Docker Compose profiles | Not needed |

### New Issues Introduced

| Issue | Priority | Mitigation |
|---|---|---|
| API key management (Deepgram, GigaChat, DeepInfra) | High | Store in settings via env vars, never in code |
| API rate limits | High | Retry with exponential backoff in actors (see sketch below) |
| API vendor lock-in | Medium | Abstract behind engine interfaces (like current `engine: "whisper" \| "google"`) |
| Network dependency (API downtime = no processing) | Medium | Keep Whisper as optional fallback engine |
| Higher per-video cost ($0.20 vs $0.11) | Low | Offset by zero infrastructure cost; profitable at any SaaS tier |
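
A minimal sketch of the backoff mitigation, wrapping httpx calls made from the Dramatiq actors; the helper name and retry policy are illustrative, not existing project code:

```python
import random
import time

import httpx

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def request_with_backoff(method: str, url: str, *, max_attempts: int = 5, **kwargs) -> httpx.Response:
    """Issue an HTTP request, retrying transient API errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            response = httpx.request(method, url, timeout=60.0, **kwargs)
            if response.status_code not in RETRYABLE_STATUS:
                response.raise_for_status()  # surfaces non-retryable 4xx
                return response
        except httpx.TransportError:
            pass  # network hiccup; retry below
        # 1s, 2s, 4s, 8s... plus jitter to avoid synchronized retries
        time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"{method} {url} failed after {max_attempts} attempts")
```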

## Feature Overview

| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---|---|---|---|---|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 3-5 days | 6-10 days | API keys (GigaChat, Deepgram) |
| 3 | Auto-Cut & Head Tracking | Hard | 8-10 days | 20-30 days | MediaPipe only (CPU, ~30 MB) |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| | **Total** | | **20-27 days** | **35-47 days** | |

Realistic for one dev: 5-7 weeks (all MVPs) or 2-3 months (full versions).


## Feature 1: Advanced Remotion Templates

No changes from v1. This feature has no ML dependencies.

Status: Spec + implementation plan already written.

- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`

Scope: Extend CaptionStyleSchema with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".

Changes: Schema extensions in Remotion + backend, rendering logic in Captions.tsx, Alembic migration for presets, frontend StyleEditor form controls.


## Feature 2: Viral Moments Detection

### Architecture (v2 — API-First)

Transcription: Deepgram Nova-3 API with diarize=true + sentiment=true. Single API call returns word-level timestamps, speaker labels, and per-utterance sentiment scores. Cost: $0.0053/min ($0.16 for 30-min video). Processing: ~30 seconds.

LLM analysis: GigaChat Pro (by Sber) — native Russian LLM trained on Russian internet content. Better detection of Russian humor, cultural references, slang, and viral patterns than English-first models. Fallback: DeepInfra (Llama 3.1 70B or Qwen) for A/B testing.

Audio augmentation: Deepgram's per-utterance sentiment scores replace librosa energy analysis for most use cases. High-sentiment utterances correlate with viral moments. Optional: keep librosa for audio loudness analysis (laughter, raised voice) as an enhancement.

Pipeline:

  1. Deepgram transcription with diarize=true + sentiment=true → timestamps + speakers + sentiment
  2. Convert Deepgram response to existing Document schema (segments, lines, words)
  3. GigaChat analyzes transcription text + sentiment data → viral clip candidates
  4. Post-process: snap boundaries to segment edges, compute composite scores
  5. Save clips to clips table
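
A sketch of step 1, assuming Deepgram's prerecorded-audio endpoint (`POST /v1/listen`) with diarization, sentiment, and utterances enabled. The response traversal, and whether sentiment analysis is supported for Russian audio, should be verified against Deepgram's docs early:

```python
import httpx

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(audio_path: str, api_key: str) -> dict:
    """One Deepgram call returns word timestamps, speakers, and sentiment."""
    params = {
        "model": "nova-3",
        "language": "ru",
        "diarize": "true",     # speaker label on every word
        "sentiment": "true",   # per-segment sentiment scores
        "utterances": "true",  # utterance-level grouping
        "punctuate": "true",
    }
    with open(audio_path, "rb") as audio:
        response = httpx.post(
            DEEPGRAM_URL,
            params=params,
            headers={"Authorization": f"Token {api_key}"},
            content=audio.read(),
            timeout=300.0,
        )
    response.raise_for_status()
    # Words live under results.channels[0].alternatives[0].words, each with
    # start/end (seconds), word, punctuated_word, and speaker when diarized.
    return response.json()
```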

### Backend Design

New module: clips (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.

Clip model:

```
Clip {
  project_id: UUID (FK projects)
  source_file_id: UUID (FK files)
  job_id: UUID? (FK jobs)
  title: str
  start_ms: int
  end_ms: int
  score: float
  source_type: "viral_detected" | "user_created" | "auto_generated"
  status: "pending" | "approved" | "rejected" | "exported"
  meta: JSON? (LLM reasoning, tags, hashtags, sentiment data)
}
```

New job type: `VIRAL_DETECT` added to `JobTypeEnum`. The actor calls the GigaChat API via httpx from a Dramatiq worker (client sketch below).

Transcription engine extension: Add `"deepgram"` to the existing engine selection (`engine: "whisper" | "google" | "deepgram"`). Deepgram becomes the default for new transcriptions. Whisper remains as a fallback.

LLM integration:

- GigaChat API via httpx (OAuth2 token auth via Sber ID)
- DeepInfra as fallback (OpenAI-compatible API)
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `GIGACHAT_CLIENT_ID`, `GIGACHAT_CLIENT_SECRET`, `DEEPINFRA_API_KEY`, `DEEPGRAM_API_KEY`
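
A self-contained sketch of that client under the settings above. The endpoint URLs and OAuth exchange follow GigaChat's published API but should be verified; the scope value, model name, and module-level token cache are assumptions, not existing project code:

```python
import base64
import time
import uuid

import httpx

OAUTH_URL = "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"
CHAT_URL = "https://gigachat.devices.sberbank.ru/api/v1/chat/completions"

_token_cache: dict = {"token": None, "expires_at": 0.0}

def get_token(client_id: str, client_secret: str) -> str:
    """OAuth2 client-credentials exchange; GigaChat tokens live ~30 minutes."""
    if _token_cache["token"] and time.time() < _token_cache["expires_at"] - 60:
        return _token_cache["token"]
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    # Note: GigaChat's TLS chain is signed by the Russian trusted root CA;
    # the CA bundle may need to be installed for verification to pass.
    response = httpx.post(
        OAUTH_URL,
        headers={
            "Authorization": f"Basic {basic}",
            "RqUID": str(uuid.uuid4()),
        },
        data={"scope": "GIGACHAT_API_PERS"},
    )
    response.raise_for_status()
    payload = response.json()
    _token_cache["token"] = payload["access_token"]
    _token_cache["expires_at"] = payload["expires_at"] / 1000  # ms -> s
    return _token_cache["token"]

def analyze_transcript(system_prompt: str, transcript: str, token: str) -> str:
    """Send the transcription text to GigaChat Pro; return the raw reply."""
    response = httpx.post(
        CHAT_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={
            "model": "GigaChat-Pro",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": transcript},
            ],
        },
        timeout=120.0,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```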

### Frontend Design

- New ViralClipsStep in project wizard (`features/project/`)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type VIRAL_DETECT in notification handling (existing WebSocket infrastructure)

### Key Numbers

| Metric | v1 | v2 |
|---|---|---|
| Transcription time | Depends on Whisper (already done) | ~30 sec (Deepgram, if not already transcribed) |
| LLM analysis time | 10-20 sec | 10-20 sec (same) |
| Total processing | 10-20 sec (after transcription) | 40-50 sec (including Deepgram transcription) |
| Cost per video | ~$0.005 (LLM only) | ~$0.17 ($0.16 Deepgram + $0.01 GigaChat) |
| Accuracy (precision) | 50-70% | 60-80% (GigaChat better at Russian + sentiment data) |
| New dependencies | google-generativeai + librosa (~30 MB) | HTTP client only (~0 MB new) |
| MVP time | 5-7 days | 3-5 days |

### Risks

- GigaChat API availability — Sber's API may have lower uptime than Google/OpenAI. Mitigation: DeepInfra fallback.
- GigaChat structured output — verify JSON mode / function calling works reliably for clip extraction. Test early.
- Deepgram Russian WER — ~10-12% WER on Russian (Nova-3). Comparable to Whisper medium. Sufficient for viral detection.
- Visual-only moments still missed (~20-30%) — same limitation as v1.

### MVP vs Full

- MVP (3-5 days): Deepgram transcription + GigaChat analysis. Returns clips with scores. User reviews and accepts/rejects. No audio energy analysis.
- Full (6-10 days): Add sentiment-weighted scoring, few-shot prompt tuning from user feedback, batch processing, direct clip export to 9:16, DeepInfra A/B testing.

## Feature 3: Auto-Cut & Head Tracking

### Architecture (v2 — API-First)

Face detection: MediaPipe BlazeFace (unchanged from v1). Apache 2.0, ~2 MB model, 30-60 FPS on CPU. Sample at 3 FPS (sampling sketch below). This is the only local ML component remaining. Dependency: `mediapipe` (~30 MB).
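
A minimal sampling loop for that step, assuming MediaPipe's Python face-detection solution and OpenCV (installed as a MediaPipe dependency) for frame decoding:

```python
import cv2
import mediapipe as mp

def detect_faces(video_path: str, sample_fps: float = 3.0) -> list[dict]:
    """Run BlazeFace on frames sampled at ~3 FPS; boxes are normalized 0-1."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))
    detections: list[dict] = []
    with mp.solutions.face_detection.FaceDetection(
        model_selection=1,  # full-range model suits wide podcast shots
        min_detection_confidence=0.5,
    ) as detector:
        frame_idx = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if frame_idx % step == 0:
                result = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                for d in result.detections or []:
                    box = d.location_data.relative_bounding_box
                    detections.append({
                        "t": frame_idx / native_fps,
                        "x": box.xmin, "y": box.ymin,
                        "w": box.width, "h": box.height,
                        "score": d.score[0],
                    })
            frame_idx += 1
    capture.release()
    return detections
```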

Speaker diarization: Deepgram API with diarize=true (~30 seconds for 30-min video). Replaces pyannote.audio entirely. Diarization is included in the transcription call — no additional API cost.

Face-speaker mapping:

- Phase 1: Temporal correlation heuristic — match face tracks to Deepgram speaker segments by maximum temporal overlap. 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python (condensed sketch below).
- Phase 2: TalkNet-ASD — if needed for accuracy. This is the only scenario where GPU would be reconsidered, but can be deferred indefinitely if temporal correlation + user correction is sufficient.
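
A condensed sketch of the Phase 1 heuristic: assign each face track to the diarized speaker whose talking time overlaps it most. The input shapes (face tracks from the detection step, speaker turns from Deepgram) are assumptions:

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_faces_to_speakers(
    face_tracks: list[dict],    # [{"track_id": 0, "start": 1.2, "end": 40.5}, ...]
    speaker_turns: list[dict],  # [{"speaker": 0, "start": 0.0, "end": 12.3}, ...]
) -> dict[int, int]:
    """Map face track id -> speaker id by maximum total temporal overlap."""
    mapping: dict[int, int] = {}
    for track in face_tracks:
        totals: dict[int, float] = {}
        for turn in speaker_turns:
            totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + overlap(
                track["start"], track["end"], turn["start"], turn["end"]
            )
        if totals:
            mapping[track["track_id"]] = max(totals, key=totals.get)
    return mapping
```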

Video compositing: Same as v1 — Remotion compositions with CSS transform crop. No changes.

New Remotion compositions: Same as v1.

| Composition | Purpose | Phase |
|---|---|---|
| CaptionedVideo (existing) | Caption overlay on native video | Current |
| ShortsVideo (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| AutoEditVideo (new) | Face-tracking crop + cuts + captions | Feature 3 full |

Crop data format: Same as v1 (keyframes with normalized 0-1 coordinates).

### Backend Design

New job types: FACE_DETECT added to JobTypeEnum. SPEAKER_DIARIZE is no longer needed as a separate job — diarization comes from Deepgram as part of transcription.

ML service separation: Not needed. MediaPipe is lightweight (~30MB, ~400MB RAM). Runs in standard Dramatiq worker.

Remotion service changes: Same as v1 — compositionId parameter, crop/outputWidth/outputHeight props.

### Processing Time (30-min 1080p video)

| Step | v1 (CPU) | v2 (API-First) |
|---|---|---|
| Deepgram transcription + diarization | N/A | ~30 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 1-2 min (unchanged) |
| Speaker diarization (pyannote) | 15-30 min | Included in Deepgram |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min (unchanged) |
| Total (parallelized) | 35-80 min | 12-33 min |

The 15-30 min diarization bottleneck is completely eliminated.

### Memory Requirements

| | v1 | v2 |
|---|---|---|
| Peak RAM | 8-16 GB | ~400 MB (MediaPipe only) |
| Worker config needed | `--threads 1`, 16 GB limit | Standard worker, 4 GB limit |

### Frontend Design

Same as v1:

- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle

### Key Numbers

| Metric | v1 | v2 |
|---|---|---|
| Diarization time | 15-30 min (CPU) / 1-2 min (GPU) | ~30 sec (API) |
| Face detection time | 1-2 min | 1-2 min (unchanged) |
| Total analysis time | 17-33 min (CPU) | ~2 min |
| Full pipeline (with render) | 35-80 min (CPU) | 12-33 min |
| Peak RAM | 8-16 GB | ~400 MB |
| New dependencies | ~280 MB (mediapipe + pyannote + torchaudio) | ~30 MB (mediapipe only) |
| GPU needed? | Phase 2 recommended | Never |
| MVP time | 12-15 days | 8-10 days |

### Risks

- Face-to-speaker mapping accuracy unchanged (70-85% with heuristic) — still the hardest subproblem
- Deepgram diarization accuracy — DER may be slightly worse than pyannote 3.1 (~12-15% vs ~10%). Acceptable for this use case.
- Video quality loss when cropping — unchanged from v1
- TalkNet-ASD deferred — if temporal correlation isn't accurate enough, TalkNet requires GPU. Cross that bridge if needed.

### MVP vs Full

- MVP (8-10 days): Face detection on sampled frames. Deepgram provides speaker labels. Temporal correlation maps faces to speakers. User can manually correct. Static crop to selected face.
- Full (20-30 days): Dynamic crop following active speaker. Smooth transitions. Split-screen. Multi-speaker. Optional TalkNet-ASD for accuracy.

## Feature 4: 9:16 Shorts Conversion

No changes from v1. This feature has no ML dependencies.

### Architecture

Pipeline: Crop-then-caption, always. Single Remotion render pass using new ShortsVideo composition.

Caption positioning: No new schema fields needed. The backend adjusts `font_size`, `padding_px`, and `max_width_pct` in `styleConfig` for 9:16 (sketch below).
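
A sketch of that adjustment, assuming `styleConfig` arrives as a plain dict; the multipliers are illustrative placeholders, not tuned values:

```python
def adapt_style_for_vertical(style_config: dict) -> dict:
    """Scale caption layout fields when rendering a 16:9 source at 9:16."""
    adapted = dict(style_config)
    adapted["font_size"] = round(adapted.get("font_size", 48) * 1.25)
    adapted["padding_px"] = round(adapted.get("padding_px", 24) * 0.75)
    adapted["max_width_pct"] = 90  # narrower frame, captions span most of it
    return adapted
```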

Crop specification:

```ts
type CropConfig = {
  mode: "static" | "keyframe";
  staticCrop?: { x: number; y: number; zoom: number };
  keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
  interpolation?: "linear" | "ease" | "smooth";
};
```
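
How a consumer of `CropConfig` might resolve the crop at a given time in keyframe mode; a linear-interpolation sketch, with `ease`/`smooth` left as a different easing function plugged into `f`:

```python
from bisect import bisect_right

def crop_at(keyframes: list[dict], t: float) -> dict:
    """Interpolate x, y, zoom between the two keyframes surrounding time t."""
    times = [k["time"] for k in keyframes]  # assumed sorted ascending
    i = bisect_right(times, t)
    if i == 0:
        a = b = keyframes[0]
        f = 0.0
    elif i == len(keyframes):
        a = b = keyframes[-1]
        f = 0.0
    else:
        a, b = keyframes[i - 1], keyframes[i]
        f = (t - a["time"]) / (b["time"] - a["time"])
    return {key: a[key] + (b[key] - a[key]) * f for key in ("x", "y", "zoom")}
```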

### Backend Design

New job type: `ASPECT_CONVERT` in `JobTypeEnum`. New function `crop_to_vertical()` in `media/service.py` (sketch below).

New artifact type: `VERTICAL_VIDEO` in `ArtifactTypeEnum`.
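
A possible shape for `crop_to_vertical()`, assuming the ffmpeg CLI is available (it already is for existing media jobs); the signature and the normalized pan parameter are guesses at what `media/service.py` would add:

```python
import subprocess

def crop_to_vertical(src: str, dst: str, pan: float = 0.5) -> None:
    """Crop 16:9 video to 9:16. `pan` slides the window: 0 = left, 1 = right."""
    # Output width = input_height * 9/16; offset the window across the slack.
    crop = f"crop=ih*9/16:ih:(iw-ih*9/16)*{pan}:0"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", crop, "-c:a", "copy", dst],
        check=True,
    )
```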

### Frontend Design

- Crop preview: draggable 9:16 rectangle overlay on video player
- Side-by-side preview toggle
- "Convert to Short" button on approved viral clips
- Auto-populate crop from face detection data (when available)

### Processing Time

| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |

### MVP vs Full

- MVP (6-8 days): Manual crop region selection with preview. ShortsVideo Remotion composition.
- Full (+3-4 days after Feature 3): Auto-crop from face detection. One-click conversion. Batch export.

## Recommended Implementation Order

```
Week 1-2:    Feature 1 (Templates)        ████████
Week 2-3:    Feature 2 (Viral Detection)  ██████████
Week 3-5:    Feature 4 MVP (9:16 crop)    ████████████████
Week 5-10:   Feature 3 (Head Tracking)    ██████████████████████████████
Week 10-11:  Feature 4 upgrade            ████████
```

Rationale:

  1. Templates first — ready to implement, zero risk, immediate user value
  2. Viral detection second — fastest ROI with API-first (3-5 days MVP), validates user demand
  3. 9:16 MVP third — builds ShortsVideo composition, useful standalone
  4. Head tracking last — still the most complex, but now much simpler without pyannote/GPU
  5. 9:16 upgrade — trivial once head tracking provides face data

## Cost Analysis

### Per-Video Processing Cost (30-min video, all features)

| Component | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| Transcription + diarization | $0.07 compute | $0.16 (Deepgram) |
| LLM viral detection | $0.005 (Gemini) | $0.01 (GigaChat) |
| Face detection | $0.002 compute | $0.002 compute (unchanged) |
| FFmpeg/Remotion render | $0.02 compute | $0.02 compute |
| **Total per video** | **$0.11** | **$0.20** |

### Monthly Cost Comparison

| Scale | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| 100 videos/month | $11 compute + server + $0-380 GPU | $20 APIs + server |
| 500 videos/month | $55 + $200-380 GPU = $255-435 | $100 APIs + server |
| 1,000 videos/month | $110 + $380 GPU = $490 | $200 APIs + server |
| 5,000 videos/month | $550 + $380 GPU = $930 | $1,000 APIs + server |

Breakeven: roughly 2,000-4,000 videos/month depending on GPU utilization (taking the table's numbers at face value, the $0.09 per-video delta crosses a flat $380/mo GPU at ~4,200 videos). Below that, APIs are cheaper.

### Suggested SaaS Pricing Tiers

| Tier | Price | Limits | Cost/Video | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, 5/month | ~$0.07 | Marketing |
| Pro | $15-30/mo | 30-min videos, 50/month | ~$0.20 | 50-70% |
| Business | $50-100/mo | 60-min videos, 200/month | ~$0.35 | 65-80% |

## Infrastructure (v2 — Simplified)

### Architecture

```
Frontend → Backend API → Dramatiq Worker (lightweight: MediaPipe only)
                              ↕              ↕           ↕
                         PostgreSQL     Deepgram API   GigaChat API
                         Redis          (transcription  (viral detection)
                         S3/MinIO        + diarization)
                         Remotion        DeepInfra
                                         (fallback LLM)
```

### Docker Image

| | v1 | v2 |
|---|---|---|
| Base | python:3.11-slim + PyTorch + Whisper + CUDA libs | python:3.11-slim + mediapipe |
| Size | 1.72 GB | ~400-500 MB |
| RAM | 16 GB recommended | 4 GB sufficient |

Once Deepgram fully replaces Whisper, `openai-whisper` (and transitively PyTorch) can be dropped from the default dependencies in `pyproject.toml`. Keep Whisper as an optional dependency group (`uv sync --group whisper`) for the fallback engine.

### No ML Service Separation Needed

With only MediaPipe (~30 MB, ~400 MB RAM) running locally, there is no need for:

- Separate ML worker container
- Docker Compose profiles for ML
- GPU infrastructure
- Dedicated Dramatiq queues for ML

A standard worker with `--processes 1 --threads 2` handles everything.

### New Settings

```python
# Deepgram
deepgram_api_key: str = Field(default="", alias="DEEPGRAM_API_KEY")

# GigaChat (Sber)
gigachat_client_id: str = Field(default="", alias="GIGACHAT_CLIENT_ID")
gigachat_client_secret: str = Field(default="", alias="GIGACHAT_CLIENT_SECRET")

# DeepInfra (fallback LLM)
deepinfra_api_key: str = Field(default="", alias="DEEPINFRA_API_KEY")

# LLM config
llm_provider: str = Field(default="gigachat", alias="LLM_PROVIDER")  # gigachat | deepinfra
llm_viral_prompt_version: str = Field(default="v1", alias="LLM_VIRAL_PROMPT_VERSION")
```

## Technology Stack Summary

### New Dependencies (v2)

| Package | Size | Purpose | Feature |
|---|---|---|---|
| mediapipe | ~30 MB | Face detection (CPU) | 3 |
| httpx | Already installed | API calls to Deepgram, GigaChat, DeepInfra | 2, 3 |
| **Total new deps** | **~30 MB** | | |

### Removed Dependencies (vs v1)

| Package | Size Saved | Was For |
|---|---|---|
| openai-whisper | ~50 MB + PyTorch ~2 GB | Transcription (replaced by Deepgram) |
| pyannote-audio | ~200 MB | Diarization (replaced by Deepgram) |
| torchaudio | ~50-80 MB | pyannote dependency |
| librosa | ~20 MB | Audio energy (replaced by Deepgram sentiment) |
| **Total removed** | **~2.3 GB** | |

### New Backend Modules

| Module | Purpose | Feature |
|---|---|---|
| clips | Clip CRUD, review workflow | 2 |

### New Remotion Compositions

| Composition | Purpose | Feature |
|---|---|---|
| ShortsVideo | Static/keyframe crop + captions at 9:16 | 4 |
| AutoEditVideo | Face-tracking dynamic crop + captions | 3 |

### New Job Types

| Job Type | Purpose | Feature |
|---|---|---|
| VIRAL_DETECT | GigaChat analysis of transcription | 2 |
| ASPECT_CONVERT | 9:16 crop + re-encode | 4 |
| FACE_DETECT | Face bounding box detection (MediaPipe) | 3 |

Note: `SPEAKER_DIARIZE` is no longer a separate job type — diarization is included in Deepgram transcription.

### Transcription Engine Extension

```python
# Extend existing engine selection:
engine: Literal["whisper", "google", "deepgram"] = "deepgram"
```

Deepgram becomes the default. Whisper remains as an optional fallback (requires `uv sync --group whisper`).
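
To keep the vendor boundary thin (the lock-in mitigation noted earlier), each engine can sit behind a small interface; a Protocol sketch with an illustrative registry, not existing project code:

```python
from typing import Literal, Protocol

class TranscriptionEngine(Protocol):
    def transcribe(self, audio_path: str) -> dict:
        """Return a Document-shaped dict regardless of vendor."""
        ...

ENGINES: dict[str, TranscriptionEngine] = {}  # populated at import time

def get_engine(name: Literal["whisper", "google", "deepgram"]) -> TranscriptionEngine:
    # Hypothetical registry lookup; real wiring would live in the
    # transcription module alongside the existing engine selection.
    return ENGINES[name]
```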


## Cross-Cutting Issues (v2)

### Remaining from v1

| Issue | Priority | Action |
|---|---|---|
| `_get_job_status_sync()` leaks DB connections | High | Fix before adding more actors |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Medium | Extract actor boilerplate |
| Worker `REMOTION_SERVICE_URL` default wrong | Medium | Fix to `http://remotion:3001` |
| No resource limits on Docker services | Medium | Add memory/CPU limits |
| No temp file cleanup on OOM crash | Medium | Add periodic cleanup |
| `isCurrent` word identity check in Captions.tsx fragile | Low | Compare by index |

### New in v2

| Issue | Priority | Action |
|---|---|---|
| API key management (3 services) | High | All via env vars in settings, never in code |
| API rate limit handling | High | Retry with exponential backoff in all actors |
| API vendor lock-in | Medium | Abstract behind engine interface (existing pattern) |
| Network dependency (API downtime) | Medium | Keep Whisper as optional fallback engine |
| Deepgram → Document schema conversion | Medium | Build converter to match existing Document structure (sketch below) |
| GigaChat OAuth2 token refresh | Medium | Token caching with auto-refresh in infrastructure/ |
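
A first pass at the Deepgram → Document converter, under loud assumptions: the payload traversal follows Deepgram's documented response shape, but the Document-side field names (segments with speaker, text, and millisecond bounds) are placeholders for the project's actual schema:

```python
def deepgram_to_document(payload: dict) -> dict:
    """Map a Deepgram /v1/listen response onto a Document-like dict."""
    words = payload["results"]["channels"][0]["alternatives"][0]["words"]
    segments: list[dict] = []
    for word in words:
        speaker = word.get("speaker")
        # Start a new segment whenever the speaker changes.
        if not segments or segments[-1]["speaker"] != speaker:
            segments.append({"speaker": speaker, "words": []})
        segments[-1]["words"].append({
            "text": word.get("punctuated_word", word["word"]),
            "start_ms": int(word["start"] * 1000),
            "end_ms": int(word["end"] * 1000),
        })
    for segment in segments:
        segment["start_ms"] = segment["words"][0]["start_ms"]
        segment["end_ms"] = segment["words"][-1]["end_ms"]
        segment["text"] = " ".join(w["text"] for w in segment["words"])
    return {"segments": segments}
```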

### Eliminated from v1

| Issue | Why Gone |
|---|---|
| PyTorch CPU-only index | PyTorch removed entirely |
| Worker OOM with ML jobs | No heavy ML locally |
| ML worker Docker image | Single lightweight image |
| GPU infrastructure | All ML is API-based |
| PyTorch version conflicts | No PyTorch |
| Model downloads on first run | No local models |

## Specialist Reports (Full Transcripts)

Full specialist outputs are available in the session transcript. Key files each specialist examined:

- ML Engineer: `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- Backend Architect: `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- Remotion Engineer: `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- Frontend Architect: `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- DevOps Engineer: `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- Performance Engineer: `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`

Note: Specialist reports were produced for v1 architecture (local ML). Their recommendations for Remotion compositions, backend module design, frontend components, and crop data formats remain valid in v2. The infrastructure and ML model recommendations are superseded by the API-first approach.