
# Video Features Roadmap — Technical Consultation v2 (API-First)

- **Date:** 2026-03-22
- **Specialists consulted:** ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer
- **Revision:** v2 — switched to API-first architecture using Deepgram, GigaChat, and DeepInfra


## What Changed from v1

v2 replaces local ML models with managed API services. This is the single biggest architectural change — it eliminates PyTorch, GPU infrastructure, ML worker separation, and most memory/processing bottlenecks.

### API Substitutions

| v1 (Local ML) | v2 (API-First) | Impact |
|---|---|---|
| Local Whisper (PyTorch, 20-60 min CPU) | Deepgram Nova-3 API (~30 sec) | Eliminates PyTorch dependency entirely |
| Local pyannote.audio (PyTorch, 15-30 min CPU) | Deepgram `diarize=true` (included in transcription call) | Eliminates pyannote + torchaudio deps |
| Gemini 2.5 Flash / GPT-4o-mini for viral detection | GigaChat Pro (native Russian LLM by Sber) | Better Russian cultural context, humor, slang |
| librosa for audio energy analysis | Deepgram `sentiment=true` per utterance | Sentiment replaces energy analysis for most cases |
| N/A | DeepInfra (Llama, Mistral, Qwen via API) | Fallback/A/B testing for LLM analysis |

### Key Metrics Changed

| Metric | v1 | v2 | Change |
|---|---|---|---|
| Docker image size | 1.72 GB | ~400-500 MB | -75% (no PyTorch) |
| Peak worker RAM | 8-16 GB | ~400 MB (MediaPipe only) | -95% |
| Processing time (30-min video, full pipeline) | 35-80 min (CPU) | 5-10 min | -85% |
| Per-video cost | $0.11 | $0.20 | +80% (API costs) |
| Monthly cost (100 videos) | $11 compute + server + $0-380 GPU | $20 APIs + server | Simpler, cheaper at low volume |
| GPU needed? | Phase 2 for diarization | Never | Eliminated |
| New Python dependencies | ~310-340 MB | ~40 MB (mediapipe + HTTP clients) | -88% |
| MVP total timeline | 26-34 dev-days | 20-27 dev-days | -20-25% |

### Issues Eliminated

These v1 cross-cutting issues no longer apply:

| v1 Issue | Why It's Gone |
|---|---|
| Switch PyTorch to CPU-only index | PyTorch removed entirely (Whisper replaced by Deepgram) |
| Worker OOM with concurrent ML jobs | No heavy ML — standard 4 GB worker |
| Separate ML worker Docker image | Single lightweight image |
| GPU infrastructure planning | All ML is API-based |
| PyTorch version conflicts | No PyTorch |
| Model download on first run | No local models (except MediaPipe, ~2 MB) |
| ML worker separation via Docker Compose profiles | Not needed |

### New Issues Introduced

| Issue | Priority | Mitigation |
|---|---|---|
| API key management (Deepgram, GigaChat, DeepInfra) | High | Store in settings via env vars, never in code |
| API rate limits | High | Retry with exponential backoff in actors (see sketch below) |
| API vendor lock-in | Medium | Abstract behind engine interfaces (like current `engine: "whisper" \| "google"`) |
| Network dependency (API downtime = no processing) | Medium | Keep Whisper as optional fallback engine |
| Higher per-video cost ($0.20 vs $0.11) | Low | Offset by zero infrastructure cost; profitable at any SaaS tier |
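
A minimal sketch of the backoff mitigation, wrapping httpx calls made from the Dramatiq actors; the helper name and retry policy are illustrative, not existing project code:

```python
import random
import time

import httpx

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def request_with_backoff(method: str, url: str, *, max_attempts: int = 5, **kwargs) -> httpx.Response:
    """Issue an HTTP request, retrying transient API errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            response = httpx.request(method, url, timeout=60.0, **kwargs)
            if response.status_code not in RETRYABLE_STATUS:
                response.raise_for_status()  # surfaces non-retryable 4xx
                return response
        except httpx.TransportError:
            pass  # network hiccup; retry below
        # 1s, 2s, 4s, 8s... plus jitter to avoid synchronized retries
        time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"{method} {url} failed after {max_attempts} attempts")
```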

## Feature Overview

| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---|---|---|---|---|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 3-5 days | 6-10 days | API keys (GigaChat, Deepgram) |
| 3 | Auto-Cut & Head Tracking | Hard | 8-10 days | 20-30 days | MediaPipe only (CPU, ~30 MB) |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| | **Total** | | **20-27 days** | **35-47 days** | |

Realistic for one dev: 5-7 weeks (all MVPs) or 2-3 months (full versions).


## Feature 1: Advanced Remotion Templates

No changes from v1. This feature has no ML dependencies.

Status: Spec + implementation plan already written.

- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`

Scope: Extend CaptionStyleSchema with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".

Changes: Schema extensions in Remotion + backend, rendering logic in Captions.tsx, Alembic migration for presets, frontend StyleEditor form controls.


## Feature 2: Viral Moments Detection

### Architecture (v2 — API-First)

Transcription: Deepgram Nova-3 API with diarize=true + sentiment=true. Single API call returns word-level timestamps, speaker labels, and per-utterance sentiment scores. Cost: $0.0053/min ($0.16 for 30-min video). Processing: ~30 seconds.

LLM analysis: GigaChat Pro (by Sber) — native Russian LLM trained on Russian internet content. Better detection of Russian humor, cultural references, slang, and viral patterns than English-first models. Fallback: DeepInfra (Llama 3.1 70B or Qwen) for A/B testing.

Audio augmentation: Deepgram's per-utterance sentiment scores replace librosa energy analysis for most use cases. High-sentiment utterances correlate with viral moments. Optional: keep librosa for audio loudness analysis (laughter, raised voice) as an enhancement.

Pipeline:

  1. Deepgram transcription with diarize=true + sentiment=true → timestamps + speakers + sentiment
  2. Convert Deepgram response to existing Document schema (segments, lines, words)
  3. GigaChat analyzes transcription text + sentiment data → viral clip candidates
  4. Post-process: snap boundaries to segment edges, compute composite scores
  5. Save clips to clips table
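
A sketch of step 1, assuming Deepgram's prerecorded-audio endpoint (`POST /v1/listen`) with diarization, sentiment, and utterances enabled. The response traversal, and whether sentiment analysis is supported for Russian audio, should be verified against Deepgram's docs early:

```python
import httpx

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(audio_path: str, api_key: str) -> dict:
    """One Deepgram call returns word timestamps, speakers, and sentiment."""
    params = {
        "model": "nova-3",
        "language": "ru",
        "diarize": "true",     # speaker label on every word
        "sentiment": "true",   # per-segment sentiment scores
        "utterances": "true",  # utterance-level grouping
        "punctuate": "true",
    }
    with open(audio_path, "rb") as audio:
        response = httpx.post(
            DEEPGRAM_URL,
            params=params,
            headers={"Authorization": f"Token {api_key}"},
            content=audio.read(),
            timeout=300.0,
        )
    response.raise_for_status()
    # Words live under results.channels[0].alternatives[0].words, each with
    # start/end (seconds), word, punctuated_word, and speaker when diarized.
    return response.json()
```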

### Backend Design

New module: clips (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.

Clip model:

```
Clip {
  project_id: UUID (FK projects)
  source_file_id: UUID (FK files)
  job_id: UUID? (FK jobs)
  title: str
  start_ms: int
  end_ms: int
  score: float
  source_type: "viral_detected" | "user_created" | "auto_generated"
  status: "pending" | "approved" | "rejected" | "exported"
  meta: JSON? (LLM reasoning, tags, hashtags, sentiment data)
}
```

New job type: `VIRAL_DETECT` added to `JobTypeEnum`. The actor calls the GigaChat API via httpx from a Dramatiq worker (client sketch below).

Transcription engine extension: Add `"deepgram"` to the existing engine selection (`engine: "whisper" | "google" | "deepgram"`). Deepgram becomes the default for new transcriptions. Whisper remains as a fallback.

LLM integration:

- GigaChat API via httpx (OAuth2 token auth via Sber ID)
- DeepInfra as fallback (OpenAI-compatible API)
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `GIGACHAT_CLIENT_ID`, `GIGACHAT_CLIENT_SECRET`, `DEEPINFRA_API_KEY`, `DEEPGRAM_API_KEY`
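
A self-contained sketch of that client under the settings above. The endpoint URLs and OAuth exchange follow GigaChat's published API but should be verified; the scope value, model name, and module-level token cache are assumptions, not existing project code:

```python
import base64
import time
import uuid

import httpx

OAUTH_URL = "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"
CHAT_URL = "https://gigachat.devices.sberbank.ru/api/v1/chat/completions"

_token_cache: dict = {"token": None, "expires_at": 0.0}

def get_token(client_id: str, client_secret: str) -> str:
    """OAuth2 client-credentials exchange; GigaChat tokens live ~30 minutes."""
    if _token_cache["token"] and time.time() < _token_cache["expires_at"] - 60:
        return _token_cache["token"]
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    # Note: GigaChat's TLS chain is signed by the Russian trusted root CA;
    # the CA bundle may need to be installed for verification to pass.
    response = httpx.post(
        OAUTH_URL,
        headers={
            "Authorization": f"Basic {basic}",
            "RqUID": str(uuid.uuid4()),
        },
        data={"scope": "GIGACHAT_API_PERS"},
    )
    response.raise_for_status()
    payload = response.json()
    _token_cache["token"] = payload["access_token"]
    _token_cache["expires_at"] = payload["expires_at"] / 1000  # ms -> s
    return _token_cache["token"]

def analyze_transcript(system_prompt: str, transcript: str, token: str) -> str:
    """Send the transcription text to GigaChat Pro; return the raw reply."""
    response = httpx.post(
        CHAT_URL,
        headers={"Authorization": f"Bearer {token}"},
        json={
            "model": "GigaChat-Pro",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": transcript},
            ],
        },
        timeout=120.0,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```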

### Frontend Design

- New ViralClipsStep in project wizard (`features/project/`)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type VIRAL_DETECT in notification handling (existing WebSocket infrastructure)

### Key Numbers

| Metric | v1 | v2 |
|---|---|---|
| Transcription time | Depends on Whisper (already done) | ~30 sec (Deepgram, if not already transcribed) |
| LLM analysis time | 10-20 sec | 10-20 sec (same) |
| Total processing | 10-20 sec (after transcription) | 40-50 sec (including Deepgram transcription) |
| Cost per video | ~$0.005 (LLM only) | ~$0.17 ($0.16 Deepgram + $0.01 GigaChat) |
| Accuracy (precision) | 50-70% | 60-80% (GigaChat better at Russian + sentiment data) |
| New dependencies | google-generativeai + librosa (~30 MB) | HTTP client only (~0 MB new) |
| MVP time | 5-7 days | 3-5 days |

### Risks

- GigaChat API availability — Sber's API may have lower uptime than Google/OpenAI. Mitigation: DeepInfra fallback.
- GigaChat structured output — verify JSON mode / function calling works reliably for clip extraction. Test early.
- Deepgram Russian WER — ~10-12% WER on Russian (Nova-3). Comparable to Whisper medium. Sufficient for viral detection.
- Visual-only moments still missed (~20-30%) — same limitation as v1.

### MVP vs Full

- MVP (3-5 days): Deepgram transcription + GigaChat analysis. Returns clips with scores. User reviews and accepts/rejects. No audio energy analysis.
- Full (6-10 days): Add sentiment-weighted scoring, few-shot prompt tuning from user feedback, batch processing, direct clip export to 9:16, DeepInfra A/B testing.

## Feature 3: Auto-Cut & Head Tracking

### Architecture (v2 — API-First)

Face detection: MediaPipe BlazeFace (unchanged from v1). Apache 2.0, ~2 MB model, 30-60 FPS on CPU. Sample at 3 FPS (sampling sketch below). This is the only local ML component remaining. Dependency: `mediapipe` (~30 MB).
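
A minimal sampling loop for that step, assuming MediaPipe's Python face-detection solution and OpenCV (installed as a MediaPipe dependency) for frame decoding:

```python
import cv2
import mediapipe as mp

def detect_faces(video_path: str, sample_fps: float = 3.0) -> list[dict]:
    """Run BlazeFace on frames sampled at ~3 FPS; boxes are normalized 0-1."""
    capture = cv2.VideoCapture(video_path)
    native_fps = capture.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))
    detections: list[dict] = []
    with mp.solutions.face_detection.FaceDetection(
        model_selection=1,  # full-range model suits wide podcast shots
        min_detection_confidence=0.5,
    ) as detector:
        frame_idx = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if frame_idx % step == 0:
                result = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                for d in result.detections or []:
                    box = d.location_data.relative_bounding_box
                    detections.append({
                        "t": frame_idx / native_fps,
                        "x": box.xmin, "y": box.ymin,
                        "w": box.width, "h": box.height,
                        "score": d.score[0],
                    })
            frame_idx += 1
    capture.release()
    return detections
```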

Speaker diarization: Deepgram API with diarize=true (~30 seconds for 30-min video). Replaces pyannote.audio entirely. Diarization is included in the transcription call — no additional API cost.

Face-speaker mapping:

- Phase 1: Temporal correlation heuristic — match face tracks to Deepgram speaker segments by maximum temporal overlap. 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python (condensed sketch below).
- Phase 2: TalkNet-ASD — if needed for accuracy. This is the only scenario where GPU would be reconsidered, but can be deferred indefinitely if temporal correlation + user correction is sufficient.
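
A condensed sketch of the Phase 1 heuristic: assign each face track to the diarized speaker whose talking time overlaps it most. The input shapes (face tracks from the detection step, speaker turns from Deepgram) are assumptions:

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_faces_to_speakers(
    face_tracks: list[dict],    # [{"track_id": 0, "start": 1.2, "end": 40.5}, ...]
    speaker_turns: list[dict],  # [{"speaker": 0, "start": 0.0, "end": 12.3}, ...]
) -> dict[int, int]:
    """Map face track id -> speaker id by maximum total temporal overlap."""
    mapping: dict[int, int] = {}
    for track in face_tracks:
        totals: dict[int, float] = {}
        for turn in speaker_turns:
            totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + overlap(
                track["start"], track["end"], turn["start"], turn["end"]
            )
        if totals:
            mapping[track["track_id"]] = max(totals, key=totals.get)
    return mapping
```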

Video compositing: Same as v1 — Remotion compositions with CSS transform crop. No changes.

New Remotion compositions: Same as v1.

| Composition | Purpose | Phase |
|---|---|---|
| CaptionedVideo (existing) | Caption overlay on native video | Current |
| ShortsVideo (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| AutoEditVideo (new) | Face-tracking crop + cuts + captions | Feature 3 full |

Crop data format: Same as v1 (keyframes with normalized 0-1 coordinates).

### Backend Design

New job types: FACE_DETECT added to JobTypeEnum. SPEAKER_DIARIZE is no longer needed as a separate job — diarization comes from Deepgram as part of transcription.

ML service separation: Not needed. MediaPipe is lightweight (~30MB, ~400MB RAM). Runs in standard Dramatiq worker.

Remotion service changes: Same as v1 — compositionId parameter, crop/outputWidth/outputHeight props.

### Processing Time (30-min 1080p video)

| Step | v1 (CPU) | v2 (API-First) |
|---|---|---|
| Deepgram transcription + diarization | N/A | ~30 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 1-2 min (unchanged) |
| Speaker diarization (pyannote) | 15-30 min | Included in Deepgram |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min (unchanged) |
| Total (parallelized) | 35-80 min | 12-33 min |

The 15-30 min diarization bottleneck is completely eliminated.

### Memory Requirements

| | v1 | v2 |
|---|---|---|
| Peak RAM | 8-16 GB | ~400 MB (MediaPipe only) |
| Worker config needed | `--threads 1`, 16 GB limit | Standard worker, 4 GB limit |

### Frontend Design

Same as v1:

- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle

### Key Numbers

| Metric | v1 | v2 |
|---|---|---|
| Diarization time | 15-30 min (CPU) / 1-2 min (GPU) | ~30 sec (API) |
| Face detection time | 1-2 min | 1-2 min (unchanged) |
| Total analysis time | 17-33 min (CPU) | ~2 min |
| Full pipeline (with render) | 35-80 min (CPU) | 12-33 min |
| Peak RAM | 8-16 GB | ~400 MB |
| New dependencies | ~280 MB (mediapipe + pyannote + torchaudio) | ~30 MB (mediapipe only) |
| GPU needed? | Phase 2 recommended | Never |
| MVP time | 12-15 days | 8-10 days |

### Risks

- Face-to-speaker mapping accuracy unchanged (70-85% with heuristic) — still the hardest subproblem
- Deepgram diarization accuracy — DER may be slightly worse than pyannote 3.1 (~12-15% vs ~10%). Acceptable for this use case.
- Video quality loss when cropping — unchanged from v1
- TalkNet-ASD deferred — if temporal correlation isn't accurate enough, TalkNet requires GPU. Cross that bridge if needed.

### MVP vs Full

- MVP (8-10 days): Face detection on sampled frames. Deepgram provides speaker labels. Temporal correlation maps faces to speakers. User can manually correct. Static crop to selected face.
- Full (20-30 days): Dynamic crop following active speaker. Smooth transitions. Split-screen. Multi-speaker. Optional TalkNet-ASD for accuracy.

## Feature 4: 9:16 Shorts Conversion

No changes from v1. This feature has no ML dependencies.

### Architecture

Pipeline: Crop-then-caption, always. Single Remotion render pass using new ShortsVideo composition.

Caption positioning: No new schema fields needed. The backend adjusts `font_size`, `padding_px`, and `max_width_pct` in `styleConfig` for 9:16 (sketch below).
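
A sketch of that adjustment, assuming `styleConfig` arrives as a plain dict; the multipliers are illustrative placeholders, not tuned values:

```python
def adapt_style_for_vertical(style_config: dict) -> dict:
    """Scale caption layout fields when rendering a 16:9 source at 9:16."""
    adapted = dict(style_config)
    adapted["font_size"] = round(adapted.get("font_size", 48) * 1.25)
    adapted["padding_px"] = round(adapted.get("padding_px", 24) * 0.75)
    adapted["max_width_pct"] = 90  # narrower frame, captions span most of it
    return adapted
```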

Crop specification:

```ts
type CropConfig = {
  mode: "static" | "keyframe";
  staticCrop?: { x: number; y: number; zoom: number };
  keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
  interpolation?: "linear" | "ease" | "smooth";
};
```
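
How a consumer of `CropConfig` might resolve the crop at a given time in keyframe mode; a linear-interpolation sketch, with `ease`/`smooth` left as a different easing function plugged into `f`:

```python
from bisect import bisect_right

def crop_at(keyframes: list[dict], t: float) -> dict:
    """Interpolate x, y, zoom between the two keyframes surrounding time t."""
    times = [k["time"] for k in keyframes]  # assumed sorted ascending
    i = bisect_right(times, t)
    if i == 0:
        a = b = keyframes[0]
        f = 0.0
    elif i == len(keyframes):
        a = b = keyframes[-1]
        f = 0.0
    else:
        a, b = keyframes[i - 1], keyframes[i]
        f = (t - a["time"]) / (b["time"] - a["time"])
    return {key: a[key] + (b[key] - a[key]) * f for key in ("x", "y", "zoom")}
```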

### Backend Design

New job type: `ASPECT_CONVERT` in `JobTypeEnum`. New function `crop_to_vertical()` in `media/service.py` (sketch below).

New artifact type: `VERTICAL_VIDEO` in `ArtifactTypeEnum`.
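
A possible shape for `crop_to_vertical()`, assuming the ffmpeg CLI is available (it already is for existing media jobs); the signature and the normalized pan parameter are guesses at what `media/service.py` would add:

```python
import subprocess

def crop_to_vertical(src: str, dst: str, pan: float = 0.5) -> None:
    """Crop 16:9 video to 9:16. `pan` slides the window: 0 = left, 1 = right."""
    # Output width = input_height * 9/16; offset the window across the slack.
    crop = f"crop=ih*9/16:ih:(iw-ih*9/16)*{pan}:0"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", crop, "-c:a", "copy", dst],
        check=True,
    )
```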

### Frontend Design

- Crop preview: draggable 9:16 rectangle overlay on video player
- Side-by-side preview toggle
- "Convert to Short" button on approved viral clips
- Auto-populate crop from face detection data (when available)

### Processing Time

| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |

### MVP vs Full

- MVP (6-8 days): Manual crop region selection with preview. ShortsVideo Remotion composition.
- Full (+3-4 days after Feature 3): Auto-crop from face detection. One-click conversion. Batch export.

## Recommended Implementation Order

```
Week 1-2:    Feature 1 (Templates)        ████████
Week 2-3:    Feature 2 (Viral Detection)  ██████████
Week 3-5:    Feature 4 MVP (9:16 crop)    ████████████████
Week 5-10:   Feature 3 (Head Tracking)    ██████████████████████████████
Week 10-11:  Feature 4 upgrade            ████████
```

Rationale:

  1. Templates first — ready to implement, zero risk, immediate user value
  2. Viral detection second — fastest ROI with API-first (3-5 days MVP), validates user demand
  3. 9:16 MVP third — builds ShortsVideo composition, useful standalone
  4. Head tracking last — still the most complex, but now much simpler without pyannote/GPU
  5. 9:16 upgrade — trivial once head tracking provides face data

## Cost Analysis

### Per-Video Processing Cost (30-min video, all features)

| Component | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| Transcription + diarization | $0.07 compute | $0.16 (Deepgram) |
| LLM viral detection | $0.005 (Gemini) | $0.01 (GigaChat) |
| Face detection | $0.002 compute | $0.002 compute (unchanged) |
| FFmpeg/Remotion render | $0.02 compute | $0.02 compute |
| **Total per video** | **$0.11** | **$0.20** |

### Monthly Cost Comparison

| Scale | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| 100 videos/month | $11 compute + server + $0-380 GPU | $20 APIs + server |
| 500 videos/month | $55 + $200-380 GPU = $255-435 | $100 APIs + server |
| 1,000 videos/month | $110 + $380 GPU = $490 | $200 APIs + server |
| 5,000 videos/month | $550 + $380 GPU = $930 | $1,000 APIs + server |

Breakeven: roughly 2,000-4,000 videos/month depending on GPU utilization (taking the table's numbers at face value, the $0.09 per-video delta crosses a flat $380/mo GPU at ~4,200 videos). Below that, APIs are cheaper.

### Suggested SaaS Pricing Tiers

| Tier | Price | Limits | Cost/Video | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, 5/month | ~$0.07 | Marketing |
| Pro | $15-30/mo | 30-min videos, 50/month | ~$0.20 | 50-70% |
| Business | $50-100/mo | 60-min videos, 200/month | ~$0.35 | 65-80% |

## Infrastructure (v2 — Simplified)

### Architecture

```
Frontend → Backend API → Dramatiq Worker (lightweight: MediaPipe only)
                              ↕              ↕           ↕
                         PostgreSQL     Deepgram API   GigaChat API
                         Redis          (transcription  (viral detection)
                         S3/MinIO        + diarization)
                         Remotion        DeepInfra
                                         (fallback LLM)
```

### Docker Image

| | v1 | v2 |
|---|---|---|
| Base | python:3.11-slim + PyTorch + Whisper + CUDA libs | python:3.11-slim + mediapipe |
| Size | 1.72 GB | ~400-500 MB |
| RAM | 16 GB recommended | 4 GB sufficient |

Once Deepgram fully replaces Whisper, `openai-whisper` (and transitively PyTorch) can be dropped from the default dependencies in `pyproject.toml`. Keep Whisper as an optional dependency group (`uv sync --group whisper`) for the fallback engine.

### No ML Service Separation Needed

With only MediaPipe (~30 MB, ~400 MB RAM) running locally, there is no need for:

- Separate ML worker container
- Docker Compose profiles for ML
- GPU infrastructure
- Dedicated Dramatiq queues for ML

A standard worker with `--processes 1 --threads 2` handles everything.

### New Settings

```python
# Deepgram
deepgram_api_key: str = Field(default="", alias="DEEPGRAM_API_KEY")

# GigaChat (Sber)
gigachat_client_id: str = Field(default="", alias="GIGACHAT_CLIENT_ID")
gigachat_client_secret: str = Field(default="", alias="GIGACHAT_CLIENT_SECRET")

# DeepInfra (fallback LLM)
deepinfra_api_key: str = Field(default="", alias="DEEPINFRA_API_KEY")

# LLM config
llm_provider: str = Field(default="gigachat", alias="LLM_PROVIDER")  # gigachat | deepinfra
llm_viral_prompt_version: str = Field(default="v1", alias="LLM_VIRAL_PROMPT_VERSION")
```

## Technology Stack Summary

### New Dependencies (v2)

| Package | Size | Purpose | Feature |
|---|---|---|---|
| mediapipe | ~30 MB | Face detection (CPU) | 3 |
| httpx | Already installed | API calls to Deepgram, GigaChat, DeepInfra | 2, 3 |
| **Total new deps** | **~30 MB** | | |

### Removed Dependencies (vs v1)

| Package | Size Saved | Was For |
|---|---|---|
| openai-whisper | ~50 MB + PyTorch ~2 GB | Transcription (replaced by Deepgram) |
| pyannote-audio | ~200 MB | Diarization (replaced by Deepgram) |
| torchaudio | ~50-80 MB | pyannote dependency |
| librosa | ~20 MB | Audio energy (replaced by Deepgram sentiment) |
| **Total removed** | **~2.3 GB** | |

### New Backend Modules

| Module | Purpose | Feature |
|---|---|---|
| clips | Clip CRUD, review workflow | 2 |

### New Remotion Compositions

| Composition | Purpose | Feature |
|---|---|---|
| ShortsVideo | Static/keyframe crop + captions at 9:16 | 4 |
| AutoEditVideo | Face-tracking dynamic crop + captions | 3 |

### New Job Types

| Job Type | Purpose | Feature |
|---|---|---|
| VIRAL_DETECT | GigaChat analysis of transcription | 2 |
| ASPECT_CONVERT | 9:16 crop + re-encode | 4 |
| FACE_DETECT | Face bounding box detection (MediaPipe) | 3 |

Note: `SPEAKER_DIARIZE` is no longer a separate job type — diarization is included in Deepgram transcription.

### Transcription Engine Extension

```python
# Extend existing engine selection:
engine: Literal["whisper", "google", "deepgram"] = "deepgram"
```

Deepgram becomes the default. Whisper remains as an optional fallback (requires `uv sync --group whisper`).
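
To keep the vendor boundary thin (the lock-in mitigation noted earlier), each engine can sit behind a small interface; a Protocol sketch with an illustrative registry, not existing project code:

```python
from typing import Literal, Protocol

class TranscriptionEngine(Protocol):
    def transcribe(self, audio_path: str) -> dict:
        """Return a Document-shaped dict regardless of vendor."""
        ...

ENGINES: dict[str, TranscriptionEngine] = {}  # populated at import time

def get_engine(name: Literal["whisper", "google", "deepgram"]) -> TranscriptionEngine:
    # Hypothetical registry lookup; real wiring would live in the
    # transcription module alongside the existing engine selection.
    return ENGINES[name]
```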


## Cross-Cutting Issues (v2)

### Remaining from v1

| Issue | Priority | Action |
|---|---|---|
| `_get_job_status_sync()` leaks DB connections | High | Fix before adding more actors |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Medium | Extract actor boilerplate |
| Worker `REMOTION_SERVICE_URL` default wrong | Medium | Fix to `http://remotion:3001` |
| No resource limits on Docker services | Medium | Add memory/CPU limits |
| No temp file cleanup on OOM crash | Medium | Add periodic cleanup |
| `isCurrent` word identity check in Captions.tsx fragile | Low | Compare by index |

### New in v2

| Issue | Priority | Action |
|---|---|---|
| API key management (3 services) | High | All via env vars in settings, never in code |
| API rate limit handling | High | Retry with exponential backoff in all actors |
| API vendor lock-in | Medium | Abstract behind engine interface (existing pattern) |
| Network dependency (API downtime) | Medium | Keep Whisper as optional fallback engine |
| Deepgram → Document schema conversion | Medium | Build converter to match existing Document structure (sketch below) |
| GigaChat OAuth2 token refresh | Medium | Token caching with auto-refresh in infrastructure/ |
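
A first pass at the Deepgram → Document converter, under loud assumptions: the payload traversal follows Deepgram's documented response shape, but the Document-side field names (segments with speaker, text, and millisecond bounds) are placeholders for the project's actual schema:

```python
def deepgram_to_document(payload: dict) -> dict:
    """Map a Deepgram /v1/listen response onto a Document-like dict."""
    words = payload["results"]["channels"][0]["alternatives"][0]["words"]
    segments: list[dict] = []
    for word in words:
        speaker = word.get("speaker")
        # Start a new segment whenever the speaker changes.
        if not segments or segments[-1]["speaker"] != speaker:
            segments.append({"speaker": speaker, "words": []})
        segments[-1]["words"].append({
            "text": word.get("punctuated_word", word["word"]),
            "start_ms": int(word["start"] * 1000),
            "end_ms": int(word["end"] * 1000),
        })
    for segment in segments:
        segment["start_ms"] = segment["words"][0]["start_ms"]
        segment["end_ms"] = segment["words"][-1]["end_ms"]
        segment["text"] = " ".join(w["text"] for w in segment["words"])
    return {"segments": segments}
```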

### Eliminated from v1

| Issue | Why Gone |
|---|---|
| PyTorch CPU-only index | PyTorch removed entirely |
| Worker OOM with ML jobs | No heavy ML locally |
| ML worker Docker image | Single lightweight image |
| GPU infrastructure | All ML is API-based |
| PyTorch version conflicts | No PyTorch |
| Model downloads on first run | No local models |

## Specialist Reports (Full Transcripts)

Full specialist outputs are available in the session transcript. Key files each specialist examined:

- ML Engineer: `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- Backend Architect: `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- Remotion Engineer: `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- Frontend Architect: `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- DevOps Engineer: `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- Performance Engineer: `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`

Note: Specialist reports were produced for v1 architecture (local ML). Their recommendations for Remotion compositions, backend module design, frontend components, and crop data formats remain valid in v2. The infrastructure and ML model recommendations are superseded by the API-first approach.