Video Features Roadmap — Technical Consultation v1
Date: 2026-03-22
Specialists consulted: ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer
Feature Overview
| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---|---|---|---|---|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 5-7 days | 8-12 days | LLM API key only |
| 3 | Auto-Cut & Head Tracking | Very Hard | 12-15 days | 30-45 days | Phase 1: nothing; Phase 2: GPU worker |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| | Total | | 26-34 days | 44-65 days | |
Realistic for one dev: 6-8 weeks (all MVPs) or 3-4 months (full versions).
Feature 1: Advanced Remotion Templates
Status: Spec + implementation plan already written.
- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`
Scope: Extend CaptionStyleSchema with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".
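As an illustration only, the new schema fields might look roughly like the Pydantic sketch below; the enum groupings and defaults are guesses, since the actual shape is defined in the spec linked above.

```python
# Illustration only: a rough Pydantic shape for the new CaptionStyleSchema fields.
# The real schema lives in the spec above; enum groupings and defaults here are guesses.
from enum import Enum
from pydantic import BaseModel, Field

class HighlightStyle(str, Enum):
    pop_in = "pop_in"
    karaoke = "karaoke"
    bounce = "bounce"
    glow_pulse = "glow_pulse"

class WordEntrance(str, Enum):
    zoom_in = "zoom_in"
    drop_in = "drop_in"

class CaptionStyleExtensions(BaseModel):
    highlight_style: HighlightStyle | None = None
    word_entrance: WordEntrance | None = None
    highlight_rotation_deg: float = Field(default=0.0, ge=-45.0, le=45.0)
    text_transform: str | None = None  # e.g. "uppercase"
```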
Changes: Schema extensions in Remotion + backend, rendering logic in Captions.tsx, Alembic migration for presets, frontend StyleEditor form controls.
No specialist input needed — fully designed, no new infrastructure.
Feature 2: Viral Moments Detection
Architecture
LLM API: Gemini 2.5 Flash (best Russian language support, $0.15/$0.60 per 1M tokens) or GPT-4o-mini (same pricing, slightly weaker Russian). Cost per 30-min video analysis: ~$0.005.
Audio augmentation: librosa for RMS energy curves — refines clip boundaries to natural pauses, boosts scoring for high-energy segments. Adds ~20MB dependency, processes 30-min audio in <10 seconds.
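A minimal sketch of the energy-envelope and boundary-snapping step, assuming librosa; the 16 kHz mono load, search window, and helper names are assumptions, not an agreed interface.

```python
# Sketch only: compute an RMS energy envelope and snap a clip boundary to a quiet point.
# Assumes librosa is installed; sample rate, window sizes, and function names are assumptions.
import numpy as np
import librosa

def compute_energy_envelope(audio_path: str, hop_s: float = 0.1) -> tuple[np.ndarray, float]:
    """RMS energy at ~100 ms resolution; returns (envelope, hop size in seconds)."""
    y, sr = librosa.load(audio_path, sr=16_000, mono=True)
    hop = int(sr * hop_s)
    rms = librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]
    return rms, hop_s

def snap_to_low_energy(boundary_ms: int, rms: np.ndarray, hop_s: float,
                       window_s: float = 2.0) -> int:
    """Move a boundary to the quietest frame within +/- window_s of the LLM's timestamp."""
    idx = int(boundary_ms / 1000 / hop_s)
    half = int(window_s / hop_s)
    lo, hi = max(0, idx - half), min(len(rms), idx + half + 1)
    best = lo + int(np.argmin(rms[lo:hi]))
    return int(best * hop_s * 1000)
```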
Pipeline:
- Fetch transcription Document from DB
- librosa computes energy envelope over full audio (100ms resolution)
- LLM analyzes transcription text with structured JSON output prompt
- Post-process: snap clip boundaries to low-energy points, compute energy scores
- Save clips to new `clips` table
Backend Design
New module: clips (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.
Clip model:
Clip {
project_id: UUID (FK projects)
source_file_id: UUID (FK files)
job_id: UUID? (FK jobs)
title: str
start_ms: int
end_ms: int
score: float
source_type: "viral_detected" | "user_created" | "auto_generated"
status: "pending" | "approved" | "rejected" | "exported"
meta: JSON? (LLM reasoning, tags, hashtags)
}
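A possible SQLAlchemy mapping of this model, sketched under the assumption of a SQLAlchemy 2.0 declarative base; the base class, column types, and string-enum handling are assumptions rather than the project's actual conventions.

```python
# Sketch only: one way the Clip model could map to SQLAlchemy 2.0.
# Base class, table/column names, and enum-as-string handling are assumptions.
import uuid
from sqlalchemy import JSON, Float, ForeignKey, Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Clip(Base):
    __tablename__ = "clips"

    id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    project_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("projects.id"))
    source_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("files.id"))
    job_id: Mapped[uuid.UUID | None] = mapped_column(ForeignKey("jobs.id"), nullable=True)
    title: Mapped[str] = mapped_column(String(255))
    start_ms: Mapped[int] = mapped_column(Integer)
    end_ms: Mapped[int] = mapped_column(Integer)
    score: Mapped[float] = mapped_column(Float)
    # "viral_detected" | "user_created" | "auto_generated"
    source_type: Mapped[str] = mapped_column(String(32))
    # "pending" | "approved" | "rejected" | "exported"
    status: Mapped[str] = mapped_column(String(16), default="pending")
    meta: Mapped[dict | None] = mapped_column(JSON, nullable=True)
```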
New job type: VIRAL_DETECT added to JobTypeEnum. Actor calls LLM API directly via httpx from Dramatiq worker (no separate service needed).
LLM integration:
- Direct HTTP call from the actor with retry + exponential backoff on 429 (sketched after this list)
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL_NAME`
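A sketch of the retry wrapper, assuming httpx; the endpoint path, payload shape, and bearer-token auth are illustrative rather than any specific provider's API.

```python
# Sketch: call the LLM endpoint with retry + exponential backoff on 429.
# Endpoint path, payload keys, and auth header are illustrative, not a specific provider's API.
import asyncio
import httpx

async def call_llm(prompt: str, *, api_url: str, api_key: str, model: str,
                   max_retries: int = 4) -> dict:
    async with httpx.AsyncClient(timeout=120) as client:
        for attempt in range(max_retries + 1):
            resp = await client.post(
                api_url,
                headers={"Authorization": f"Bearer {api_key}"},
                json={"model": model, "prompt": prompt},
            )
            if resp.status_code == 429 and attempt < max_retries:
                await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s
                continue
            resp.raise_for_status()
            return resp.json()
    raise RuntimeError("unreachable")  # loop always returns or raises
```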
Frontend Design
- New `ViralClipsStep` in project wizard (`features/project/`)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)
Key Numbers
| Metric | Value |
|---|---|
| Accuracy (precision) | 50-70% |
| Accuracy (recall) | 60-80% |
| Processing time | 10-20 seconds |
| Cost per video | ~$0.005 |
| Cost at 1,000 videos/month | ~$5 |
| New dependencies | google-generativeai or openai (~10MB) + librosa (~20MB) |
Risks
- Prompt engineering quality determines feature value — iterate based on user feedback
- Visual-only moments (facial expressions, physical comedy) cannot be detected from text — ~20-30% of viral moments are missed
- Transcription quality matters — Whisper `tiny` has ~25% WER on Russian; use at least `small` for viral detection input
- LLM-hallucinated timestamps — validate returned timestamps against actual segment boundaries
MVP vs Full
- MVP: Text-only LLM analysis, no audio energy. Returns clips with scores. User reviews and accepts/rejects.
- Full: Add librosa energy analysis, few-shot prompt examples from user-accepted clips, batch processing, direct clip export to 9:16.
Feature 3: Auto-Cut & Head Tracking
Architecture
Face detection: MediaPipe BlazeFace (Apache 2.0, ~2MB model, 30-60 FPS on CPU). Sample at 3 FPS — face positions don't change significantly within 330ms. Dependency: mediapipe (~30MB).
Speaker diarization: pyannote.audio 3.1 (MIT, ~10% DER, self-hosted). Runs on CPU at 0.17-0.33x real-time (5-10 min for 30-min audio). GPU accelerates to 1-2 min. Dependencies: pyannote-audio (~200MB) + torchaudio (~50-80MB). PyTorch already installed via Whisper.
Face-speaker mapping:
- Phase 1: Temporal correlation heuristic — match face tracks to speaker segments by maximum temporal overlap (see the sketch after this list). 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python.
- Phase 2: TalkNet-ASD (Active Speaker Detection) — jointly analyzes lip movement + audio to detect who is speaking. 92.3% accuracy. Requires `torchvision` + model weights (~50MB). Needs GPU (2-5 FPS on CPU vs 15-25 FPS on GPU).
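A sketch of the Phase 1 heuristic; the input shapes (face tracks and speaker segments as lists of (start, end) intervals in seconds) are assumptions about how the MediaPipe and pyannote outputs would be normalized.

```python
# Sketch: assign each face track to the speaker whose segments overlap it the most in time.
# Input data shapes are assumptions; real tracks come from MediaPipe sampling and pyannote.
from collections import defaultdict

def overlap(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def map_faces_to_speakers(
    face_tracks: dict[str, list[tuple[float, float]]],       # face_id -> visible intervals
    speaker_segments: dict[str, list[tuple[float, float]]],  # speaker_id -> speaking intervals
) -> dict[str, str]:
    mapping: dict[str, str] = {}
    for face_id, face_ivs in face_tracks.items():
        totals: dict[str, float] = defaultdict(float)
        for speaker_id, spk_ivs in speaker_segments.items():
            for f in face_ivs:
                for s in spk_ivs:
                    totals[speaker_id] += overlap(f, s)
        if totals:
            mapping[face_id] = max(totals, key=totals.get)
    return mapping
```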
Video compositing (Remotion approach):
Dynamic crop via CSS transform: scale() translate() on <Video> element inside overflow: hidden container. This is a GPU-composited browser operation — essentially free performance-wise. No FFmpeg re-encoding needed for the crop itself.
New Remotion compositions:
| Composition | Purpose | Phase |
|---|---|---|
| `CaptionedVideo` (existing) | Caption overlay on native video | Current |
| `ShortsVideo` (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| `AutoEditVideo` (new) | Face-tracking crop + cuts + captions | Feature 3 full |
All compositions share the <Captions> component and useCaptions hook.
Crop data format (keyframes):
type FaceKeyframe = {
time: number; // seconds
x: number; // center of face, 0.0-1.0 normalized
y: number; // center of face, 0.0-1.0 normalized
width: number; // bounding box width, 0.0-1.0
height: number; // bounding box height, 0.0-1.0
speakerId?: string;
};
type CropTrack = {
keyframes: FaceKeyframe[];
interpolation: "linear" | "ease" | "smooth";
zoom: number; // base zoom multiplier
safeMargin: number; // margin around face (0.1 = 10%)
};
Use Remotion's `interpolate()` between keyframes for smooth pan/zoom; use `spring()` only for hard cuts between speakers.
Backend Design
New job types: FACE_DETECT, SPEAKER_DIARIZE added to JobTypeEnum. Results stored in Job.output_data (JSON) — no new table needed for face/diarization data.
ML service separation:
- Phase 1: Keep in Dramatiq workers (same image). MediaPipe + pyannote add only ~280MB to image.
- Phase 2: Separate `ml-worker` Docker container on dedicated Dramatiq queues (`ml_head_tracking`, `ml_diarization`); see the sketch after this list. Same codebase, different image, different resource limits.
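A sketch of how ML actors could be pinned to the dedicated queues, assuming a Redis broker and the module path referenced elsewhere in this document; the broker URL, actor bodies, and limits are placeholders.

```python
# Sketch: pin ML actors to dedicated queues so the default worker never picks them up.
# Queue names follow the text above; broker URL, limits, and actor bodies are placeholders.
import dramatiq
from dramatiq.brokers.redis import RedisBroker

dramatiq.set_broker(RedisBroker(url="redis://redis:6379/0"))  # assumed broker URL

@dramatiq.actor(queue_name="ml_diarization", max_retries=1, time_limit=3_600_000)
def speaker_diarize(file_id: str) -> None:
    ...  # load pyannote pipeline, run diarization, store result in Job.output_data

@dramatiq.actor(queue_name="ml_head_tracking", max_retries=1, time_limit=3_600_000)
def face_detect(file_id: str) -> None:
    ...  # sample frames at 3 FPS, run MediaPipe, store face keyframes

# The ML worker container would then consume only these queues, e.g.:
#   dramatiq cpv3.modules.tasks.service --queues ml_diarization ml_head_tracking \
#       --processes 1 --threads 1
```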
Remotion service changes: POST /api/render needs a compositionId request parameter to select which composition to render. Props extend with crop, outputWidth, outputHeight.
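A sketch of the worker-side render request with the new parameter; payload keys other than `compositionId`, `crop`, `outputWidth`, and `outputHeight` are assumptions about the existing request shape.

```python
# Sketch: worker-side request to the Remotion service with the new compositionId field.
# Only compositionId, crop, outputWidth, outputHeight come from the text above;
# the rest of the payload shape is an assumption.
import httpx

def request_render(base_url: str, *, composition_id: str, props: dict) -> dict:
    payload = {
        "compositionId": composition_id,  # e.g. "ShortsVideo" or "AutoEditVideo"
        "props": {
            **props,                      # existing caption/style props, plus "crop"
            "outputWidth": 1080,
            "outputHeight": 1920,
        },
    }
    resp = httpx.post(f"{base_url}/api/render", json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```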
Processing Time (30-min 1080p video)
| Step | CPU | GPU |
|---|---|---|
| Audio extraction (FFmpeg) | 10-20 sec | 10-20 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 10-15 sec |
| Speaker diarization (pyannote) | 15-30 min | 1-2 min |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min |
| Total (parallelized) | 35-80 min | 16-40 min |
Face detection + diarization can run in parallel (different input: video frames vs audio track).
Memory Requirements
| Config | Peak RAM |
|---|---|
| Whisper base + pyannote (parallel) | 8-12 GB |
| Whisper medium + pyannote (parallel) | 12-16 GB |
| Recommended ML worker limit | 16 GB, --threads 1 |
Frontend Design
- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel (extends existing 4-track system)
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle
- UX flow: upload podcast → trigger analysis (ProcessingStep) → review speaker assignments → adjust → export
Key Numbers
| Metric | Value |
|---|---|
| Face detection accuracy | ~90% (MediaPipe on talking-head content) |
| Diarization DER | ~10% (pyannote 3.1) |
| Face-speaker mapping (Phase 1) | 70-85% accuracy |
| Face-speaker mapping (Phase 2, TalkNet) | ~92% accuracy |
| New dependencies | ~280MB (mediapipe + pyannote + torchaudio) |
| GPU mandatory? | No for Phase 1; recommended for Phase 2 |
Risks
- Face-to-speaker mapping is the hardest unsolved subproblem — 70-85% accuracy means 1 in 5 assignments may be wrong. Must let users manually correct.
- Diarization on CPU is the bottleneck — 15-30 min for 30-min video. GPU reduces to 1-2 min.
- PyTorch version conflicts between Whisper and pyannote — test `uv sync` before committing.
- Video quality loss when cropping 16:9 to 9:16 — only ~31.6% of frame width is kept. Source must be at least 1080p.
- Model download on first run — pyannote models (~100MB) require Hugging Face license acceptance. Handle in Dockerfile, not at runtime.
MVP vs Full
- MVP (12-15 days): Face detection on sampled frames. User manually selects which face to follow. Static crop to selected face. No speaker switching, no diarization. Works for single-speaker content.
- Full (30-45 days): Speaker diarization + face-speaker mapping. Dynamic crop following active speaker. Smooth spring() transitions on speaker changes. Split-screen for reactions. Multi-speaker support.
Feature 4: 9:16 Shorts Conversion
Architecture
Pipeline: Crop-then-caption, always. Single Remotion render pass using new ShortsVideo composition. The composition renders at target 9:16 dimensions, applies CSS crop transform to <Video>, and renders captions on top.
Caption positioning: No new schema fields needed. Backend adjusts font_size, padding_px, max_width_pct in styleConfig for 9:16 aspect ratio. Remotion is a "dumb renderer" — intelligence about what looks good at 9:16 belongs in presets.
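Illustration only: one way the backend preset adjustment could look, using the field names above with made-up values.

```python
# Illustration: example styleConfig overrides for a 9:16 render.
# Field names follow the text above; the specific values are made up.
def adjust_style_for_vertical(style: dict) -> dict:
    vertical = dict(style)
    vertical.update({
        "font_size": int(style.get("font_size", 48) * 1.3),  # larger type for phone screens
        "padding_px": 48,
        "max_width_pct": 90,  # captions span most of the narrow frame
    })
    return vertical
```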
Crop specification:
type CropConfig = {
mode: "static" | "keyframe";
staticCrop?: { x: number; y: number; zoom: number }; // 0-1 normalized
keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
interpolation?: "linear" | "ease" | "smooth";
};
Static crop is a degenerate case of keyframe crop (single keyframe).
Backend Design
New job type: ASPECT_CONVERT in JobTypeEnum. New function crop_to_vertical() in media/service.py using FFmpeg crop+scale filter.
New artifact type: VERTICAL_VIDEO in ArtifactTypeEnum.
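A sketch of what `crop_to_vertical()` could look like, assuming a plain FFmpeg crop+scale invocation via subprocess; the output resolution, codec flags, and function signature are assumptions.

```python
# Sketch: center-crop a 16:9 source to 9:16 and scale to 1080x1920.
# Output resolution, codec flags, and the signature are assumptions; the real helper
# would live in media/service.py.
import subprocess

def crop_to_vertical(src: str, dst: str, center_x: float = 0.5) -> None:
    """center_x is the normalized horizontal crop center (0.0-1.0); 0.5 = center crop."""
    # For a 16:9 frame, the 9:16 crop width is ih*9/16 (~31.6% of the original width).
    vf = f"crop=ih*9/16:ih:(iw-ih*9/16)*{center_x}:0,scale=1080:1920"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf,
         "-c:v", "libx264", "-preset", "medium", "-c:a", "copy", dst],
        check=True,
    )
```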
Pipeline:
- Trim source video to clip time range (if from viral detection)
- Apply crop (static center crop or face-tracking crop from Feature 3)
- Upload to S3 at `{folder}/vertical/{filename}`
- Webhook + notification
Frontend Design
- Crop preview: draggable 9:16 rectangle overlay on video player (CSS `object-fit: cover` + `object-position`)
- Side-by-side preview toggle: original 16:9 vs cropped 9:16
- Integration with Feature 2: "Convert to Short" button on each approved viral clip
- Integration with Feature 3: auto-populate crop region from face detection data
Processing Time
| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |
MVP vs Full
- MVP (6-8 days): Manual crop region selection with preview. User drags a 9:16 rectangle over video. New `ShortsVideo` Remotion composition renders crop + captions.
- Full (+3-4 days after Feature 3): Auto-crop based on face detection data. One-click vertical conversion. Batch conversion of viral clips.
Recommended Build Order
Week 1-2: Feature 1 (Templates) ████████
Week 2-4: Feature 2 (Viral Detection) ████████████████
Week 4-6: Feature 4 MVP (9:16 crop) ████████████████
Week 6-14: Feature 3 (Head Tracking) ████████████████████████████████████████
Week 14-15: Feature 4 upgrade ████████
Rationale:
- Templates first — ready to implement, zero risk, immediate user value
- Viral detection second — highest value/effort ratio ($0.005/video, 5-7 days MVP), validates that users want automated editing
- 9:16 MVP third — builds the `ShortsVideo` composition that Feature 3 extends, useful standalone with manual crop
- Head tracking last — most complex, biggest investment, validates demand from Features 2+4 first
- 9:16 upgrade — trivial once head tracking provides face position data
Cost Analysis
Per-Video Processing Cost
| Tier | Components | Compute | LLM API | Total | Wait Time |
|---|---|---|---|---|---|
| CPU-only | All on CPU | $0.05 | $0.06 | $0.11 | 35-80 min |
| GPU (T4) | ML on GPU, FFmpeg on CPU | $0.11 | $0.06 | $0.17 | 16-40 min |
| GPU + NVENC | Everything on GPU | $0.13 | $0.06 | $0.19 | 10-15 min |
Monthly Infrastructure Cost (100 videos/month)
| Scenario | Cost |
|---|---|
| CPU-only (existing infra) | ~$11 + server |
| Modal serverless GPU | ~$21/month |
| Spot GPU (g4dn.xlarge) | ~$115/month |
| Standing GPU (g4dn.xlarge 24/7) | ~$380/month |
Recommendation: Start CPU-only. Move to Modal serverless GPU when queue wait times exceed 15 minutes. At 500+ videos/day, evaluate spot instances.
Suggested SaaS Pricing Tiers
| Tier | Price | Limits | Compute Cost | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, queue priority low | ~$0.04/video | Marketing |
| Pro | $15-30/mo | 30-min videos, GPU ML | ~$0.17/video at 50 videos | 60-80% |
| Business | $50-100/mo | 60-min videos, priority queue, NVENC | ~$0.38/video | 70-85% |
Infrastructure Decisions
ML Service Separation
Phase 1: Keep ML in existing Dramatiq workers. MediaPipe + pyannote add only ~280MB to image. PyTorch is already installed via Whisper.
Phase 2: Separate ml-worker Docker container on dedicated queues. Same codebase, different image (Dockerfile.ml), different resource limits. Use Docker Compose profiles:
docker-compose up # Default: no ML worker
docker-compose --profile ml up # With ML worker
Do NOT build a separate HTTP microservice. Dramatiq already handles job queuing, retries, progress, and cancellation. Adding HTTP service discovery, API contracts, and health checks is overhead with zero benefit for async workloads.
Immediate Optimizations (Before New Features)
| Action | Impact | Effort |
|---|---|---|
| Switch PyTorch to CPU-only index | -800MB image size | 1 hour |
| Fix worker `REMOTION_SERVICE_URL` default | Bug fix | 5 min |
| Add resource limits to docker-compose services | Prevent OOM cascades | 30 min |
| Split Dramatiq into queue pools (lightweight vs ML vs compute) | Prevent worker starvation | 2-3 hours |
Technology Stack Summary
New Dependencies
| Package | Size | Purpose | Feature |
|---|---|---|---|
| `google-generativeai` or `openai` | ~10 MB | LLM API client | 2 |
| `librosa` | ~20 MB | Audio energy analysis | 2 |
| `mediapipe` | ~30 MB | Face detection | 3 |
| `pyannote-audio` | ~200 MB | Speaker diarization | 3 |
| `torchaudio` | ~50-80 MB | Audio processing for pyannote | 3 |
| Total new deps | ~310-340 MB | | |
New Backend Modules
| Module | Purpose | Feature |
|---|---|---|
| `clips` | Clip CRUD, review workflow | 2 |
New Remotion Compositions
| Composition | Purpose | Feature |
|---|---|---|
| `ShortsVideo` | Static/keyframe crop + captions at 9:16 | 4 |
| `AutoEditVideo` | Face-tracking dynamic crop + captions | 3 |
New Job Types
| Job Type | Purpose | Feature |
|---|---|---|
| `VIRAL_DETECT` | LLM analysis of transcription | 2 |
| `ASPECT_CONVERT` | 9:16 crop + re-encode | 4 |
| `FACE_DETECT` | Face bounding box detection | 3 |
| `SPEAKER_DIARIZE` | Speaker diarization | 3 |
Cross-Cutting Issues
| Issue | Flagged By | Priority | Action |
|---|---|---|---|
| PyTorch installs CUDA libs on CPU-only infra (+800MB) | DevOps | High | Switch to CPU-only PyTorch index |
| Worker `--processes 1 --threads 2` will OOM with ML jobs | Performance | High | Split into queue pools, `--threads 1` for ML |
| `_get_job_status_sync()` leaks DB connections | Performance | High | Fix before adding more actors |
| No temp file cleanup on OOM crash | Performance | Medium | Add periodic /tmp cleanup or cron |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Backend | Medium | Extract actor boilerplate into decorator/context manager |
| Worker `REMOTION_SERVICE_URL` default wrong (`localhost:8001`) | DevOps | Medium | Fix to `http://remotion:3001` in docker-compose |
| No resource limits on any Docker service | DevOps | Medium | Add memory/CPU limits to all services |
| Whisper should move to ML service eventually | Backend | Low | Plan for Phase 2 when ML worker is split out |
| `isCurrent` word identity check in Captions.tsx is fragile | Remotion | Low | Compare by index, not text + start time |
Specialist Reports (Full Transcripts)
Full specialist outputs are available in the session transcript. Key files each specialist examined:
- ML Engineer: `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- Backend Architect: `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- Remotion Engineer: `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- Frontend Architect: `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- DevOps Engineer: `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- Performance Engineer: `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`