Video Features Roadmap — Technical Consultation v2 (API-First)
Date: 2026-03-22
Specialists consulted: ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer
Revision: v2 — switched to API-first architecture using Deepgram, GigaChat, and DeepInfra
What Changed from v1
v2 replaces local ML models with managed API services. This is the single biggest architectural change — it eliminates PyTorch, GPU infrastructure, ML worker separation, and most memory/processing bottlenecks.
API Substitutions
| v1 (Local ML) | v2 (API-First) | Impact |
|---|---|---|
| Local Whisper (PyTorch, 20-60 min CPU) | Deepgram Nova-3 API (~30 sec) | Eliminates PyTorch dependency entirely |
| Local pyannote.audio (PyTorch, 15-30 min CPU) | Deepgram `diarize=true` (included in transcription call) | Eliminates pyannote + torchaudio deps |
| Gemini 2.5 Flash / GPT-4o-mini for viral detection | GigaChat Pro (native Russian LLM by Sber) | Better Russian cultural context, humor, slang |
| librosa for audio energy analysis | Deepgram `sentiment=true` per utterance | Sentiment replaces energy analysis for most cases |
| N/A | DeepInfra (Llama, Mistral, Qwen via API) | Fallback/A/B testing for LLM analysis |
Key Metrics Changed
| Metric | v1 | v2 | Change |
|---|---|---|---|
| Docker image size | 1.72 GB | ~400-500 MB | -75% (no PyTorch) |
| Peak worker RAM | 8-16 GB | ~400 MB (MediaPipe only) | -95% |
| Processing time (30-min video, full pipeline) | 35-80 min (CPU) | 5-10 min | -85% |
| Per-video cost | $0.11 | $0.20 | +80% (API costs) |
| Monthly cost (100 videos) | $11 compute + server + $0-380 GPU | $20 APIs + server | Simpler, cheaper at low volume |
| GPU needed? | Phase 2 for diarization | Never | Eliminated |
| New Python dependencies | ~310-340 MB | ~40 MB (mediapipe + HTTP clients) | -88% |
| MVP total timeline | 26-34 dev-days | 20-27 dev-days | -20-25% |
Issues Eliminated
These v1 cross-cutting issues no longer apply:
- PyTorch removed entirely (Whisper replaced by Deepgram)
- No heavy ML — standard 4 GB worker
- Single lightweight image
- All ML is API-based
- No local models (except MediaPipe, ~2 MB)
New Issues Introduced
| Issue | Priority | Mitigation |
|---|---|---|
| API key management (Deepgram, GigaChat, DeepInfra) | High | Store in settings via env vars, never in code |
| API rate limits | High | Retry with exponential backoff in actors |
| API vendor lock-in | Medium | Abstract behind engine interfaces (like current engine: "whisper" | "google") |
| Network dependency (API downtime = no processing) | Medium | Keep Whisper as optional fallback engine |
| Higher per-video cost ($0.20 vs $0.11) | Low | Offset by zero infrastructure cost; profitable at any SaaS tier |
Feature Overview
| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---|---|---|---|---|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 3-5 days | 6-10 days | API keys (GigaChat, Deepgram) |
| 3 | Auto-Cut & Head Tracking | Hard | 8-10 days | 20-30 days | MediaPipe only (CPU, ~30 MB) |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| | Total | | 20-27 days | 35-47 days | |
Realistic for one dev: 5-7 weeks (all MVPs) or 2-3 months (full versions).
Feature 1: Advanced Remotion Templates
No changes from v1. This feature has no ML dependencies.
Status: Spec + implementation plan already written.
- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`
Scope: Extend CaptionStyleSchema with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".
Changes: Schema extensions in Remotion + backend, rendering logic in Captions.tsx, Alembic migration for presets, frontend StyleEditor form controls.
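For orientation, the backend side of these schema extensions could look roughly like the sketch below. It assumes a Pydantic model mirroring CaptionStyleSchema; the field and enum names come from the scope above, while the defaults and validation bounds are illustrative only, with the linked spec remaining authoritative.

```python
# Illustrative sketch only — the linked spec defines the authoritative fields and defaults.
from typing import Literal

from pydantic import BaseModel, Field


class CaptionStyleExtensions(BaseModel):
    """New fields proposed for CaptionStyleSchema (Feature 1)."""

    highlight_style: Literal["pop_in", "karaoke", "bounce", "glow_pulse"] | None = None
    transition: Literal["zoom_in", "drop_in"] | None = None
    word_entrance: str | None = None  # entrance animation name; allowed values defined in the spec
    highlight_rotation_deg: float = Field(default=0.0, ge=-45.0, le=45.0)  # bounds are a guess
    text_transform: Literal["none", "uppercase", "lowercase"] = "none"
```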
Feature 2: Viral Moments Detection
Architecture (v2 — API-First)
Transcription: Deepgram Nova-3 API with diarize=true + sentiment=true. Single API call returns word-level timestamps, speaker labels, and per-utterance sentiment scores. Cost: $0.0053/min ($0.16 for 30-min video). Processing: ~30 seconds.
LLM analysis: GigaChat Pro (by Sber) — native Russian LLM trained on Russian internet content. Better detection of Russian humor, cultural references, slang, and viral patterns than English-first models. Fallback: DeepInfra (Llama 3.1 70B or Qwen) for A/B testing.
Audio augmentation: Deepgram's per-utterance sentiment scores replace librosa energy analysis for most use cases. High-sentiment utterances correlate with viral moments. Optional: keep librosa for audio loudness analysis (laughter, raised voice) as an enhancement.
Pipeline:
- Deepgram transcription with `diarize=true` + `sentiment=true` → timestamps + speakers + sentiment
- Convert Deepgram response to existing `Document` schema (segments, lines, words) — see the sketch below
- GigaChat analyzes transcription text + sentiment data → viral clip candidates
- Post-process: snap boundaries to segment edges, compute composite scores
- Save clips to `clips` table
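A minimal sketch of the first two pipeline steps, calling Deepgram's pre-recorded `/v1/listen` endpoint with `httpx`. Query parameters and response field names follow Deepgram's published API and should still be verified against current docs; the `Document` conversion is a simplified stand-in that omits lines and words.

```python
# Sketch only: verify parameter and response field names against Deepgram's documentation.
import httpx

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


async def transcribe_with_deepgram(audio_url: str, api_key: str) -> dict:
    """Transcription + diarization + sentiment in a single call."""
    params = {
        "model": "nova-3",
        "language": "ru",
        "diarize": "true",
        "sentiment": "true",
        "punctuate": "true",
        "utterances": "true",
    }
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(
            DEEPGRAM_URL,
            params=params,
            headers={"Authorization": f"Token {api_key}"},
            json={"url": audio_url},  # or stream raw audio bytes instead of a hosted URL
        )
        resp.raise_for_status()
        return resp.json()


def to_document_segments(dg_response: dict) -> list[dict]:
    """Map Deepgram utterances onto Document-style segments.

    The real converter must also populate lines and words from the word-level timestamps.
    """
    segments = []
    for utt in dg_response.get("results", {}).get("utterances", []):
        segments.append(
            {
                "start_ms": int(utt["start"] * 1000),
                "end_ms": int(utt["end"] * 1000),
                "speaker": utt.get("speaker"),
                "text": utt["transcript"],
            }
        )
    return segments
```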
Backend Design
New module: clips (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.
Clip model:
Clip {
project_id: UUID (FK projects)
source_file_id: UUID (FK files)
job_id: UUID? (FK jobs)
title: str
start_ms: int
end_ms: int
score: float
source_type: "viral_detected" | "user_created" | "auto_generated"
status: "pending" | "approved" | "rejected" | "exported"
meta: JSON? (LLM reasoning, tags, hashtags, sentiment data)
}
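As a sketch, the model above might map onto SQLAlchemy roughly as follows; the table names, declarative base, and column sizes are assumptions to be replaced by the project's actual conventions.

```python
# Sketch assuming SQLAlchemy 2.0 declarative style; table names and Base are assumptions.
import uuid
from datetime import datetime

from sqlalchemy import JSON, DateTime, Float, ForeignKey, Integer, String, func
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Clip(Base):
    __tablename__ = "clips"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    project_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("projects.id"), index=True)
    source_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("files.id"))
    job_id: Mapped[uuid.UUID | None] = mapped_column(ForeignKey("jobs.id"), nullable=True)
    title: Mapped[str] = mapped_column(String(255))
    start_ms: Mapped[int] = mapped_column(Integer)
    end_ms: Mapped[int] = mapped_column(Integer)
    score: Mapped[float] = mapped_column(Float)
    source_type: Mapped[str] = mapped_column(String(32))  # viral_detected | user_created | auto_generated
    status: Mapped[str] = mapped_column(String(32), default="pending")
    meta: Mapped[dict | None] = mapped_column(JSON, nullable=True)  # LLM reasoning, tags, hashtags, sentiment
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now())
```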
New job type: VIRAL_DETECT added to JobTypeEnum. Actor calls GigaChat API via httpx from Dramatiq worker.
Transcription engine extension: Add "deepgram" to the existing engine selection (engine: "whisper" | "google" | "deepgram"). Deepgram becomes the default for new transcriptions. Whisper remains as a fallback.
LLM integration:
- GigaChat API via `httpx` (OAuth2 token auth via Sber ID) — see the token-handling sketch below
- DeepInfra as fallback (OpenAI-compatible API)
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `GIGACHAT_CLIENT_ID`, `GIGACHAT_CLIENT_SECRET`, `DEEPINFRA_API_KEY`, `DEEPGRAM_API_KEY`
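Because GigaChat issues short-lived OAuth2 tokens, the integration needs a small token cache. The sketch below uses the endpoint URLs, `GIGACHAT_API_PERS` scope, and `RqUID` header from Sber's public documentation (re-verify before use); the model name and certificate handling are placeholders, and production code should install Sber's CA rather than disabling TLS verification.

```python
# Sketch only: endpoints, scope, and header names per Sber's public GigaChat docs at the
# time of writing — verify before relying on them. Tokens are short-lived (~30 min).
import base64
import time
import uuid

import httpx

OAUTH_URL = "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"
CHAT_URL = "https://gigachat.devices.sberbank.ru/api/v1/chat/completions"

_token_cache: dict = {"token": None, "expires_at": 0.0}


async def get_gigachat_token(client_id: str, client_secret: str) -> str:
    """Return a cached token, refreshing it shortly before expiry."""
    if _token_cache["token"] and time.time() < _token_cache["expires_at"] - 60:
        return _token_cache["token"]
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    # verify=False is a placeholder: install Sber's root CA instead in production.
    async with httpx.AsyncClient(verify=False) as client:
        resp = await client.post(
            OAUTH_URL,
            headers={"Authorization": f"Basic {basic}", "RqUID": str(uuid.uuid4())},
            data={"scope": "GIGACHAT_API_PERS"},
        )
        resp.raise_for_status()
        payload = resp.json()
    _token_cache["token"] = payload["access_token"]
    _token_cache["expires_at"] = payload["expires_at"] / 1000  # milliseconds per the docs
    return _token_cache["token"]


async def analyze_viral_moments(system_prompt: str, transcript: str, token: str) -> str:
    """Send the viral-detection prompt plus transcript text to GigaChat."""
    async with httpx.AsyncClient(verify=False, timeout=120) as client:
        resp = await client.post(
            CHAT_URL,
            headers={"Authorization": f"Bearer {token}"},
            json={
                "model": "GigaChat-Pro",  # model name illustrative; pick per account/tier
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": transcript},
                ],
            },
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```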
Frontend Design
- New `ViralClipsStep` in project wizard (`features/project/`)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)
Key Numbers
| Metric | v1 | v2 |
|---|---|---|
| Transcription time | Depends on Whisper (already done) | ~30 sec (Deepgram, if not already transcribed) |
| LLM analysis time | 10-20 sec | 10-20 sec (same) |
| Total processing | 10-20 sec (after transcription) | 40-50 sec (including Deepgram transcription) |
| Cost per video | ~$0.005 (LLM only) | ~$0.17 ($0.16 Deepgram + $0.01 GigaChat) |
| Accuracy (precision) | 50-70% | 60-80% (GigaChat better at Russian + sentiment data) |
| New dependencies | google-generativeai + librosa (~30 MB) | HTTP client only (~0 MB new) |
| MVP time | 5-7 days | 3-5 days |
Risks
- GigaChat API availability — Sber's API may have lower uptime than Google/OpenAI. Mitigation: DeepInfra fallback.
- GigaChat structured output — verify JSON mode / function calling works reliably for clip extraction. Test early.
- Deepgram Russian WER — ~10-12% WER on Russian (Nova-3), comparable to Whisper `medium`. Sufficient for viral detection.
- Visual-only moments still missed (~20-30%) — same limitation as v1.
MVP vs Full
- MVP (3-5 days): Deepgram transcription + GigaChat analysis. Returns clips with scores. User reviews and accepts/rejects. No audio energy analysis.
- Full (6-10 days): Add sentiment-weighted scoring, few-shot prompt tuning from user feedback, batch processing, direct clip export to 9:16, DeepInfra A/B testing.
Feature 3: Auto-Cut & Head Tracking
Architecture (v2 — API-First)
Face detection: MediaPipe BlazeFace (unchanged from v1). Apache 2.0, ~2 MB model, 30-60 FPS on CPU. Sample at 3 FPS. This is the only local ML component remaining. Dependency: `mediapipe` (~30 MB).
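A sketch of the 3 FPS sampling loop, assuming OpenCV (pulled in by mediapipe) for frame extraction and MediaPipe's legacy `solutions.face_detection` API; if the pinned mediapipe version only exposes the newer tasks API, the detector setup changes but the sampling logic stays the same.

```python
# Sketch: OpenCV frame sampling + MediaPipe face detection (legacy "solutions" API).
import cv2
import mediapipe as mp


def detect_faces(video_path: str, sample_fps: float = 3.0) -> list[dict]:
    """Return normalized face boxes sampled at roughly sample_fps."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))

    detections: list[dict] = []
    with mp.solutions.face_detection.FaceDetection(
        model_selection=1,  # full-range model, better for wide podcast shots
        min_detection_confidence=0.5,
    ) as detector:
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % step == 0:
                result = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                for det in result.detections or []:
                    box = det.location_data.relative_bounding_box  # normalized 0-1 coordinates
                    detections.append(
                        {
                            "t": frame_idx / native_fps,
                            "x": box.xmin,
                            "y": box.ymin,
                            "w": box.width,
                            "h": box.height,
                            "score": det.score[0],
                        }
                    )
            frame_idx += 1
    cap.release()
    return detections
```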
Speaker diarization: Deepgram API with diarize=true (~30 seconds for 30-min video). Replaces pyannote.audio entirely. Diarization is included in the transcription call — no additional API cost.
Face-speaker mapping:
- Phase 1: Temporal correlation heuristic — match face tracks to Deepgram speaker segments by maximum temporal overlap (a sketch follows this list). 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python.
- Phase 2: TalkNet-ASD — if needed for accuracy. This is the only scenario where GPU would be reconsidered, but can be deferred indefinitely if temporal correlation + user correction is sufficient.
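A sketch of the Phase 1 heuristic, assuming face detections have already been grouped into per-face tracks with visibility spans and that Deepgram speaker segments carry start/end times plus a speaker label; production code would add smoothing and a confidence threshold before trusting the assignment.

```python
# Sketch of the temporal-overlap heuristic (Phase 1). Assumes face detections are
# already grouped into tracks; real code also needs hysteresis/smoothing.
from collections import defaultdict


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def map_faces_to_speakers(
    face_tracks: dict[int, list[tuple[float, float]]],  # track_id -> [(start, end), ...] visibility spans
    speaker_segments: list[dict],                        # [{"speaker": 0, "start": ..., "end": ...}, ...]
) -> dict[int, int]:
    """Return speaker_id -> face_track_id by maximum total temporal overlap."""
    scores: dict[tuple[int, int], float] = defaultdict(float)
    for seg in speaker_segments:
        for track_id, spans in face_tracks.items():
            for span_start, span_end in spans:
                scores[(seg["speaker"], track_id)] += overlap(
                    seg["start"], seg["end"], span_start, span_end
                )

    mapping: dict[int, int] = {}
    used_tracks: set[int] = set()
    # Greedy assignment: strongest overlaps first, one face track per speaker.
    for (speaker, track_id), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if speaker not in mapping and track_id not in used_tracks:
            mapping[speaker] = track_id
            used_tracks.add(track_id)
    return mapping
```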
Video compositing: Same as v1 — Remotion compositions with CSS transform crop. No changes.
New Remotion compositions: Same as v1.
| Composition | Purpose | Phase |
|---|---|---|
| `CaptionedVideo` (existing) | Caption overlay on native video | Current |
| `ShortsVideo` (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| `AutoEditVideo` (new) | Face-tracking crop + cuts + captions | Feature 3 full |
Crop data format: Same as v1 (keyframes with normalized 0-1 coordinates).
Backend Design
New job type: FACE_DETECT added to JobTypeEnum. SPEAKER_DIARIZE is no longer needed as a separate job — diarization comes from Deepgram as part of transcription.
ML service separation: Not needed. MediaPipe is lightweight (~30 MB on disk, ~400 MB RAM). Runs in the standard Dramatiq worker.
Remotion service changes: Same as v1 — compositionId parameter, crop/outputWidth/outputHeight props.
Processing Time (30-min 1080p video)
| Step | v1 (CPU) | v2 (API-First) |
|---|---|---|
| Deepgram transcription + diarization | N/A | ~30 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 1-2 min (unchanged) |
| Speaker diarization | 15-30 min | Included in the Deepgram call |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min (unchanged) |
| Total (parallelized) | 35-80 min | 12-33 min |
| Total (parallelized) | 35-80 min | 12-33 min |
The 15-30 min diarization bottleneck is completely eliminated.
Memory Requirements
| Config | v1 | v2 |
|---|---|---|
| Peak RAM | 8-16 GB | ~400 MB (MediaPipe only) |
| Worker config needed | `--threads 1`, 16 GB limit | Standard worker, 4 GB limit |
Frontend Design
Same as v1:
- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle
Key Numbers
| Metric | v1 | v2 |
|---|---|---|
| Diarization time | 15-30 min (CPU) / 1-2 min (GPU) | ~30 sec (API) |
| Face detection time | 1-2 min | 1-2 min (unchanged) |
| Total analysis time | 17-33 min (CPU) | ~2 min |
| Full pipeline (with render) | 35-80 min (CPU) | 12-33 min |
| Peak RAM | 8-16 GB | ~400 MB |
| New dependencies | ~280 MB (mediapipe + pyannote + torchaudio) | ~30 MB (mediapipe only) |
| GPU needed? | Phase 2 recommended | Never |
| MVP time | 12-15 days | 8-10 days |
Risks
- Face-to-speaker mapping accuracy unchanged (70-85% with heuristic) — still the hardest subproblem
- Deepgram diarization accuracy — DER may be slightly worse than pyannote 3.1 (~12-15% vs ~10%). Acceptable for this use case.
- Video quality loss when cropping — unchanged from v1
- TalkNet-ASD deferred — if temporal correlation isn't accurate enough, TalkNet requires GPU. Cross that bridge if needed.
MVP vs Full
- MVP (8-10 days): Face detection on sampled frames. Deepgram provides speaker labels. Temporal correlation maps faces to speakers. User can manually correct. Static crop to selected face.
- Full (20-30 days): Dynamic crop following active speaker. Smooth transitions. Split-screen. Multi-speaker. Optional TalkNet-ASD for accuracy.
Feature 4: 9:16 Shorts Conversion
No changes from v1. This feature has no ML dependencies.
Architecture
Pipeline: Crop-then-caption, always. Single Remotion render pass using new ShortsVideo composition.
Caption positioning: No new schema fields needed. Backend adjusts font_size, padding_px, max_width_pct in styleConfig for 9:16.
Crop specification:
type CropConfig = {
mode: "static" | "keyframe";
staticCrop?: { x: number; y: number; zoom: number };
keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
interpolation?: "linear" | "ease" | "smooth";
};
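On the backend, Feature 3's face data could be reduced to this format with something like the following sketch; the keyframe interval, default zoom, fallback behaviour, and the assumption that x/y denote the crop center are illustrative choices, not part of the spec.

```python
# Sketch: turn face-center samples (normalized 0-1, from Feature 3) into CropConfig
# keyframes for the ShortsVideo composition. Interval and zoom values are illustrative,
# and x/y are assumed to be the normalized crop center.
def build_crop_keyframes(
    face_centers: list[dict],      # hypothetical input: [{"t": seconds, "cx": 0-1, "cy": 0-1}, ...]
    keyframe_interval_s: float = 1.0,
    zoom: float = 1.0,
) -> dict:
    if not face_centers:
        # No face data: fall back to a centered static crop.
        return {"mode": "static", "staticCrop": {"x": 0.5, "y": 0.5, "zoom": zoom}}

    keyframes = []
    next_t = 0.0
    for sample in face_centers:
        if sample["t"] >= next_t:
            keyframes.append(
                {"time": sample["t"], "x": sample["cx"], "y": sample["cy"], "zoom": zoom}
            )
            next_t = sample["t"] + keyframe_interval_s
    return {"mode": "keyframe", "keyframes": keyframes, "interpolation": "smooth"}
```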
Backend Design
New job type: ASPECT_CONVERT in JobTypeEnum. New function crop_to_vertical() in media/service.py.
New artifact type: VERTICAL_VIDEO in ArtifactTypeEnum.
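For the FFmpeg crop-only path, crop_to_vertical() could look roughly like this; the 1080×1920 output size, audio stream copy, and the way the crop window is centered are assumptions rather than decisions from the spec.

```python
# Sketch of an FFmpeg-based crop_to_vertical(); encoder flags and 1080x1920 output are assumptions.
import subprocess


def crop_to_vertical(input_path: str, output_path: str, center_x: float = 0.5) -> None:
    """Crop a 16:9 source to 9:16 around a normalized horizontal center (0-1)."""
    # crop=w:h:x:y works in pixels: ih*9/16 is the widest 9:16 window that fits the source
    # height, and x is clamped so the window stays inside the frame. Commas inside the
    # expression are escaped so they are not treated as filter separators.
    crop_expr = (
        "crop=ih*9/16:ih:"
        f"max(0\\,min(iw-ih*9/16\\,iw*{center_x}-ih*9/32)):0,"
        "scale=1080:1920"
    )
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", input_path,
            "-vf", crop_expr,
            "-c:a", "copy",
            output_path,
        ],
        check=True,
    )
```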
Frontend Design
- Crop preview: draggable 9:16 rectangle overlay on video player
- Side-by-side preview toggle
- "Convert to Short" button on approved viral clips
- Auto-populate crop from face detection data (when available)
Processing Time
| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |
MVP vs Full
- MVP (6-8 days): Manual crop region selection with preview. `ShortsVideo` Remotion composition.
- Full (+3-4 days after Feature 3): Auto-crop from face detection. One-click conversion. Batch export.
Recommended Build Order
Week 1-2: Feature 1 (Templates) ████████
Week 2-3: Feature 2 (Viral Detection) ██████████
Week 3-5: Feature 4 MVP (9:16 crop) ████████████████
Week 5-10: Feature 3 (Head Tracking) ██████████████████████████████
Week 10-11: Feature 4 upgrade ████████
Rationale:
- Templates first — ready to implement, zero risk, immediate user value
- Viral detection second — fastest ROI with API-first (3-5 days MVP), validates user demand
- 9:16 MVP third — builds `ShortsVideo` composition, useful standalone
- Head tracking last — still the most complex, but now much simpler without pyannote/GPU
- 9:16 upgrade — trivial once head tracking provides face data
Cost Analysis
Per-Video Processing Cost (30-min video, all features)
| Component | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| Transcription + diarization | $0.07 compute | $0.16 (Deepgram) |
| LLM viral detection | $0.005 (Gemini) | $0.01 (GigaChat) |
| Face detection | $0.002 compute | $0.002 compute (unchanged) |
| FFmpeg/Remotion render | $0.02 compute | $0.02 compute |
| Total per video | $0.11 | $0.20 |
Monthly Cost Comparison
| Scale | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| 100 videos/month | $11 compute + server + $0-380 GPU | $20 APIs + server |
| 500 videos/month | $55 + $200-380 GPU = $255-435 | $100 APIs + server |
| 1,000 videos/month | $110 + $380 GPU = $490 | $200 APIs + server |
| 5,000 videos/month | $550 + $380 GPU = $930 | $1,000 APIs + server |
Breakeven: ~2,000-3,000 videos/month. Below that, APIs are cheaper.
Suggested SaaS Pricing Tiers
| Tier | Price | Limits | Cost/Video | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, 5/month | ~$0.07 | Marketing |
| Pro | $15-30/mo | 30-min videos, 50/month | ~$0.20 | 50-70% |
| Business | $50-100/mo | 60-min videos, 200/month | ~$0.35 | 65-80% |
Infrastructure (v2 — Simplified)
Architecture
Frontend → Backend API → Dramatiq Worker (lightweight: MediaPipe only)
- Local infrastructure: PostgreSQL, Redis, S3/MinIO, Remotion
- External APIs: Deepgram (transcription + diarization), GigaChat (viral detection), DeepInfra (fallback LLM)
Docker Image
| | v1 | v2 |
|---|---|---|
| Base | python:3.11-slim + PyTorch + Whisper + CUDA libs | python:3.11-slim + mediapipe |
| Size | 1.72 GB | ~400-500 MB |
| RAM | 16 GB recommended | 4 GB sufficient |
Can remove from `pyproject.toml`: `openai-whisper` (and transitively PyTorch) — if Deepgram fully replaces Whisper. Keep Whisper as an optional dependency (`uv sync --group whisper`) for fallback.
No ML Service Separation Needed
With only MediaPipe (~30 MB, ~400 MB RAM) running locally, there is no need for:
- Separate ML worker container
- Docker Compose profiles for ML
- GPU infrastructure
- Dedicated Dramatiq queues for ML
A standard worker with `--processes 1 --threads 2` handles everything.
New Settings
# Deepgram
deepgram_api_key: str = Field(default="", alias="DEEPGRAM_API_KEY")
# GigaChat (Sber)
gigachat_client_id: str = Field(default="", alias="GIGACHAT_CLIENT_ID")
gigachat_client_secret: str = Field(default="", alias="GIGACHAT_CLIENT_SECRET")
# DeepInfra (fallback LLM)
deepinfra_api_key: str = Field(default="", alias="DEEPINFRA_API_KEY")
# LLM config
llm_provider: str = Field(default="gigachat", alias="LLM_PROVIDER") # gigachat | deepinfra
llm_viral_prompt_version: str = Field(default="v1", alias="LLM_VIRAL_PROMPT_VERSION")
Technology Stack Summary
New Dependencies (v2)
| Package | Size | Purpose | Feature |
|---|---|---|---|
| `mediapipe` | ~30 MB | Face detection (CPU) | 3 |
| `httpx` | Already installed | API calls to Deepgram, GigaChat, DeepInfra | 2, 3 |
| Total new deps | ~30 MB | | |
Removed Dependencies (vs v1)
| Package | Size Saved | Was For |
|---|---|---|
| `openai-whisper` | ~50 MB + PyTorch ~2 GB | Transcription (replaced by Deepgram) |
| `pyannote-audio` | ~200 MB | Diarization (replaced by Deepgram) |
| `torchaudio` | ~50-80 MB | pyannote dependency |
| `librosa` | ~20 MB | Audio energy (replaced by Deepgram sentiment) |
| Total removed | ~2.3 GB | |
New Backend Modules
| Module | Purpose | Feature |
|---|---|---|
| `clips` | Clip CRUD, review workflow | 2 |
New Remotion Compositions
| Composition | Purpose | Feature |
|---|---|---|
| `ShortsVideo` | Static/keyframe crop + captions at 9:16 | 4 |
| `AutoEditVideo` | Face-tracking dynamic crop + captions | 3 |
New Job Types
| Job Type | Purpose | Feature |
|---|---|---|
| `VIRAL_DETECT` | GigaChat analysis of transcription | 2 |
| `ASPECT_CONVERT` | 9:16 crop + re-encode | 4 |
| `FACE_DETECT` | Face bounding box detection (MediaPipe) | 3 |
Note: SPEAKER_DIARIZE is no longer a separate job type — diarization is included in Deepgram transcription.
Transcription Engine Extension
# Extend existing engine selection:
engine: Literal["whisper", "google", "deepgram"] = "deepgram"
Deepgram becomes the default. Whisper remains as an optional fallback (requires `uv sync --group whisper`).
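Behind that flag, the vendor-lock-in mitigation listed earlier amounts to keeping every provider behind one interface. A minimal sketch, with names that are illustrative rather than the project's actual classes:

```python
# Illustrative engine abstraction — the actual interface lives in the transcription module.
from typing import Literal, Protocol

EngineName = Literal["whisper", "google", "deepgram"]


class TranscriptionEngine(Protocol):
    async def transcribe(self, audio_url: str, language: str) -> dict: ...


class DeepgramEngine:
    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    async def transcribe(self, audio_url: str, language: str) -> dict:
        ...  # call /v1/listen as sketched in Feature 2


_ENGINES: dict[EngineName, TranscriptionEngine] = {}


def register_engine(name: EngineName, engine: TranscriptionEngine) -> None:
    _ENGINES[name] = engine


def get_engine(name: EngineName = "deepgram") -> TranscriptionEngine:
    try:
        return _ENGINES[name]
    except KeyError:
        raise ValueError(f"Transcription engine '{name}' is not configured") from None
```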
Cross-Cutting Issues (v2)
Remaining from v1
| Issue | Priority | Action |
|---|---|---|
| `_get_job_status_sync()` leaks DB connections | High | Fix before adding more actors |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Medium | Extract actor boilerplate |
| Worker `REMOTION_SERVICE_URL` default wrong | Medium | Fix to `http://remotion:3001` |
| No resource limits on Docker services | Medium | Add memory/CPU limits |
| No temp file cleanup on OOM crash | Medium | Add periodic cleanup |
| `isCurrent` word identity check in `Captions.tsx` fragile | Low | Compare by index |
New in v2
| Issue | Priority | Action |
|---|---|---|
| API key management (3 services) | High | All via env vars in settings, never in code |
| API rate limit handling | High | Retry with exponential backoff in all actors (see the sketch below) |
| API vendor lock-in | Medium | Abstract behind engine interface (existing pattern) |
| Network dependency (API downtime) | Medium | Keep Whisper as optional fallback engine |
| Deepgram → Document schema conversion | Medium | Build converter to match existing Document structure |
| GigaChat OAuth2 token refresh | Medium | Token caching with auto-refresh in infrastructure/ |
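A sketch of the retry-with-backoff wrapper referenced in the rate-limit row above, assuming plain `httpx` and no dedicated retry library; the attempt count, delays, and retryable status set are placeholders.

```python
# Sketch: exponential backoff for API calls from Dramatiq actors. Delays and attempt
# counts are placeholders; a library such as tenacity would also work.
import asyncio
import random

import httpx

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


async def request_with_backoff(
    method: str,
    url: str,
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    **kwargs,
) -> httpx.Response:
    async with httpx.AsyncClient(timeout=120) as client:
        for attempt in range(1, max_attempts + 1):
            try:
                resp = await client.request(method, url, **kwargs)
                if resp.status_code not in RETRYABLE_STATUS:
                    resp.raise_for_status()  # non-retryable 4xx raises immediately
                    return resp
                if attempt == max_attempts:
                    resp.raise_for_status()  # give up: surface the final 429/5xx
            except httpx.TransportError:
                if attempt == max_attempts:
                    raise  # network error on the last attempt
            # Exponential backoff with jitter before retrying: ~1s, 2s, 4s, 8s, ...
            await asyncio.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5))
    raise RuntimeError("unreachable: loop always returns or raises")
```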
Eliminated from v1
- PyTorch removed entirely
- No heavy ML locally
- Single lightweight image
- All ML is API-based
- No local models
Specialist Reports (Full Transcripts)
Full specialist outputs are available in the session transcript. Key files each specialist examined:
- ML Engineer: `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- Backend Architect: `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- Remotion Engineer: `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- Frontend Architect: `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- DevOps Engineer: `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- Performance Engineer: `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`
Note: Specialist reports were produced for v1 architecture (local ML). Their recommendations for Remotion compositions, backend module design, frontend components, and crop data formats remain valid in v2. The infrastructure and ML model recommendations are superseded by the API-first approach.