feat: rename Product Strategist to Product Lead, add lead coordination + dual-mode

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Video Features Roadmap — Technical Consultation v1
**Date:** 2026-03-22
**Specialists consulted:** ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer
---
## Feature Overview
| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---------|-----------|-----|------|-----------------|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 5-7 days | 8-12 days | LLM API key only |
| 3 | Auto-Cut & Head Tracking | Very Hard | 12-15 days | 30-45 days | Phase 1: nothing; Phase 2: GPU worker |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| **Total** | | | **26-34 days** | **44-65 days** | |
Realistic for one dev: **6-8 weeks** (all MVPs) or **3-4 months** (full versions).
---
## Feature 1: Advanced Remotion Templates
**Status:** Spec + implementation plan already written.
- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`
**Scope:** Extend `CaptionStyleSchema` with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".
**Changes:** Schema extensions in Remotion + backend, rendering logic in `Captions.tsx`, Alembic migration for presets, frontend StyleEditor form controls.
**No specialist input needed** — fully designed, no new infrastructure.
---
## Feature 2: Viral Moments Detection
### Architecture
**LLM API:** Gemini 2.5 Flash (best Russian language support, $0.15/$0.60 per 1M tokens) or GPT-4o-mini (same pricing, slightly weaker Russian). Cost per 30-min video analysis: ~$0.005.
**Audio augmentation:** `librosa` for RMS energy curves — refines clip boundaries to natural pauses, boosts scoring for high-energy segments. Adds ~20MB dependency, processes 30-min audio in <10 seconds.
**Pipeline:**
1. Fetch transcription Document from DB
2. librosa computes energy envelope over full audio (100ms resolution)
3. LLM analyzes transcription text with structured JSON output prompt
4. Post-process: snap clip boundaries to low-energy points, compute energy scores
5. Save clips to new `clips` table
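Steps 2 and 4 are the only non-LLM logic in this pipeline. A minimal sketch, assuming the audio has already been extracted to a local file; the function names are illustrative, not the actual cpv3 API:
```python
import librosa
import numpy as np

def energy_envelope(audio_path: str, resolution_s: float = 0.1) -> np.ndarray:
    """RMS energy curve at ~100ms resolution (step 2)."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    hop = int(sr * resolution_s)
    return librosa.feature.rms(y=y, frame_length=hop * 2, hop_length=hop)[0]

def snap_to_low_energy(boundary_ms: int, rms: np.ndarray,
                       resolution_s: float = 0.1,
                       window_s: float = 2.0) -> int:
    """Move a proposed clip boundary to the quietest nearby frame (step 4)."""
    idx = int(boundary_ms / 1000 / resolution_s)
    half = int(window_s / resolution_s)
    lo, hi = max(0, idx - half), min(len(rms), idx + half + 1)
    best = lo + int(np.argmin(rms[lo:hi]))  # nearest natural pause
    return int(best * resolution_s * 1000)
```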
### Backend Design
**New module:** `clips` (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.
**Clip model:**
```
Clip {
project_id: UUID (FK projects)
source_file_id: UUID (FK files)
job_id: UUID? (FK jobs)
title: str
start_ms: int
end_ms: int
score: float
source_type: "viral_detected" | "user_created" | "auto_generated"
status: "pending" | "approved" | "rejected" | "exported"
meta: JSON? (LLM reasoning, tags, hashtags)
}
```
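A hedged SQLAlchemy 2.0 sketch of that model; the declarative `Base`, the UUID primary-key convention, and the column options are assumptions about the existing cpv3 codebase, not confirmed details:
```python
import uuid

from sqlalchemy import JSON, ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):  # stand-in for the project's real Base
    pass

class Clip(Base):
    __tablename__ = "clips"

    id: Mapped[uuid.UUID] = mapped_column(primary_key=True, default=uuid.uuid4)
    project_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("projects.id"))
    source_file_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("files.id"))
    job_id: Mapped[uuid.UUID | None] = mapped_column(ForeignKey("jobs.id"))
    title: Mapped[str] = mapped_column(String(255))
    start_ms: Mapped[int]
    end_ms: Mapped[int]
    score: Mapped[float]
    source_type: Mapped[str] = mapped_column(String(32))  # viral_detected | user_created | auto_generated
    status: Mapped[str] = mapped_column(String(16), default="pending")
    meta: Mapped[dict | None] = mapped_column(JSON)  # LLM reasoning, tags, hashtags
```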
**New job type:** `VIRAL_DETECT` added to `JobTypeEnum`. Actor calls LLM API directly via `httpx` from Dramatiq worker (no separate service needed).
**LLM integration:**
- Direct HTTP call from actor with retry + exponential backoff on 429
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL_NAME`
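A sketch of the actor-side call, assuming an OpenAI-compatible chat endpoint (both Gemini and GPT-4o-mini expose one); the response shape and the `clips` key are assumptions about the prompt's JSON contract:
```python
import json
import time

import httpx

def call_viral_llm(prompt: str, settings) -> list[dict]:
    """One LLM round-trip with retry + exponential backoff on 429."""
    headers = {"Authorization": f"Bearer {settings.LLM_API_KEY}"}
    payload = {
        "model": settings.LLM_MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},  # structured JSON output
    }
    for attempt in range(5):
        resp = httpx.post(settings.LLM_API_URL, json=payload,
                          headers=headers, timeout=120.0)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
            continue
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]
        return json.loads(content)["clips"]
    raise RuntimeError("LLM API still rate-limited after 5 attempts")

def is_valid_clip(clip: dict, media_duration_ms: int) -> bool:
    """Guard against hallucinated timestamps (see Risks below)."""
    return 0 <= clip["start_ms"] < clip["end_ms"] <= media_duration_ms
```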
### Frontend Design
- New `ViralClipsStep` in project wizard (features/project/)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)
### Key Numbers
| Metric | Value |
|---|---|
| Accuracy (precision) | 50-70% |
| Accuracy (recall) | 60-80% |
| Processing time | 10-20 seconds |
| Cost per video | ~$0.005 |
| Cost at 1,000 videos/month | ~$5 |
| New dependencies | `google-generativeai` or `openai` (~10MB) + `librosa` (~20MB) |
### Risks
- **Prompt engineering quality** determines feature value — iterate based on user feedback
- **Visual-only moments** (facial expressions, physical comedy) cannot be detected from text — ~20-30% of viral moments are missed
- **Transcription quality matters** — Whisper `tiny` has ~25% WER on Russian; use at least `small` for viral detection input
- **LLM hallucinated timestamps** — validate returned timestamps against actual segment boundaries
### MVP vs Full
- **MVP:** Text-only LLM analysis, no audio energy. Returns clips with scores. User reviews and accepts/rejects.
- **Full:** Add librosa energy analysis, few-shot prompt examples from user-accepted clips, batch processing, direct clip export to 9:16.
---
## Feature 3: Auto-Cut & Head Tracking
### Architecture
**Face detection:** MediaPipe BlazeFace (Apache 2.0, ~2MB model, 30-60 FPS on CPU). Sample at 3 FPS — face positions don't change significantly within 330ms. Dependency: `mediapipe` (~30MB).
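A minimal sketch of the sampling loop, assuming OpenCV for frame decoding; MediaPipe returns normalized (0.0-1.0) bounding boxes, which feed directly into the keyframe format defined below:
```python
import cv2
import mediapipe as mp

def detect_faces(video_path: str, sample_fps: float = 3.0) -> list[dict]:
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / sample_fps))  # every 10th frame at 30 FPS
    keyframes = []
    with mp.solutions.face_detection.FaceDetection(model_selection=0) as detector:
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % step == 0:
                results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                for det in results.detections or []:
                    box = det.location_data.relative_bounding_box
                    keyframes.append({
                        "time": frame_idx / native_fps,
                        "x": box.xmin + box.width / 2,   # face center, normalized
                        "y": box.ymin + box.height / 2,
                        "width": box.width,
                        "height": box.height,
                    })
            frame_idx += 1
    cap.release()
    return keyframes
```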
**Speaker diarization:** pyannote.audio 3.1 (MIT, ~10% DER, self-hosted). Runs on CPU at 0.17-0.33x real-time (5-10 min for 30-min audio). GPU accelerates to 1-2 min. Dependencies: `pyannote-audio` (~200MB) + `torchaudio` (~50-80MB). PyTorch already installed via Whisper.
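Diarization itself is a few lines with the documented pyannote pipeline; token handling is simplified here (it would come from settings, and the model requires Hugging Face license acceptance, see Risks):
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # read from settings in practice
)
diarization = pipeline("podcast_audio.wav")
segments = [
    (turn.start, turn.end, speaker)  # seconds, e.g. (12.4, 31.7, "SPEAKER_00")
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]
```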
**Face-speaker mapping:**
- Phase 1: Temporal correlation heuristic — match face tracks to speaker segments by maximum temporal overlap. 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python (see the sketch after this list).
- Phase 2: TalkNet-ASD (Active Speaker Detection) — jointly analyzes lip movement + audio to detect who is speaking. 92.3% accuracy. Requires `torchvision` + model weights (~50MB). Needs GPU (2-5 FPS on CPU vs 15-25 FPS on GPU).
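A sketch of the Phase 1 heuristic; the input shapes are illustrative and would come from the two steps above:
```python
from collections import defaultdict

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time spans, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def map_faces_to_speakers(face_tracks: dict, speaker_segments: list) -> dict:
    """face_tracks: {track_id: [(start_s, end_s), ...]} — visibility spans.
    speaker_segments: [(start_s, end_s, speaker_id), ...] from diarization.
    Returns {track_id: speaker_id} by maximum total temporal overlap."""
    mapping = {}
    for track_id, spans in face_tracks.items():
        totals = defaultdict(float)
        for f_start, f_end in spans:
            for s_start, s_end, speaker in speaker_segments:
                totals[speaker] += overlap(f_start, f_end, s_start, s_end)
        mapping[track_id] = max(totals, key=totals.get) if totals else None
    return mapping
```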
**Video compositing (Remotion approach):**
Dynamic crop via CSS `transform: scale() translate()` on `<Video>` element inside `overflow: hidden` container. This is a GPU-composited browser operation — essentially free performance-wise. No FFmpeg re-encoding needed for the crop itself.
**New Remotion compositions:**
| Composition | Purpose | Phase |
|---|---|---|
| `CaptionedVideo` (existing) | Caption overlay on native video | Current |
| `ShortsVideo` (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| `AutoEditVideo` (new) | Face-tracking crop + cuts + captions | Feature 3 full |
All compositions share the `<Captions>` component and `useCaptions` hook.
**Crop data format (keyframes):**
```typescript
type FaceKeyframe = {
time: number; // seconds
x: number; // center of face, 0.0-1.0 normalized
y: number; // center of face, 0.0-1.0 normalized
width: number; // bounding box width, 0.0-1.0
height: number; // bounding box height, 0.0-1.0
speakerId?: string;
};
type CropTrack = {
keyframes: FaceKeyframe[];
interpolation: "linear" | "ease" | "smooth";
zoom: number; // base zoom multiplier
safeMargin: number; // margin around face (0.1 = 10%)
};
```
Remotion's `interpolate()` runs between keyframes for smooth pan/zoom; reserve `spring()` for hard cuts between speakers.
### Backend Design
**New job types:** `FACE_DETECT`, `SPEAKER_DIARIZE` added to `JobTypeEnum`. Results stored in `Job.output_data` (JSON) — no new table needed for face/diarization data.
**ML service separation:**
- Phase 1: Keep in Dramatiq workers (same image). MediaPipe + pyannote add only ~280MB to image.
- Phase 2: Separate `ml-worker` Docker container on dedicated Dramatiq queues (`ml_head_tracking`, `ml_diarization`). Same codebase, different image, different resource limits.
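A sketch of the Phase 2 queue split, assuming the broker is configured as in the existing workers; the actor names are illustrative:
```python
import dramatiq

@dramatiq.actor(queue_name="ml_diarization", max_retries=2, time_limit=3_600_000)
def speaker_diarize_actor(job_id: str) -> None:
    ...  # run pyannote, persist segments to Job.output_data

@dramatiq.actor(queue_name="ml_head_tracking", max_retries=2, time_limit=3_600_000)
def face_detect_actor(job_id: str) -> None:
    ...  # run MediaPipe sampling, persist keyframes to Job.output_data

# The ml-worker container then consumes only these queues:
#   dramatiq cpv3.modules.tasks.service --queues ml_diarization ml_head_tracking \
#       --processes 1 --threads 1
```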
**Remotion service changes:** `POST /api/render` needs a `compositionId` request parameter to select which composition to render. Props extend with `crop`, `outputWidth`, `outputHeight`.
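A hedged sketch of the extended render call from a worker; whether the new props nest under an `inputProps` key, and the exact field layout, are assumptions about the existing request shape:
```python
import httpx

resp = httpx.post(
    f"{REMOTION_SERVICE_URL}/api/render",
    json={
        "compositionId": "ShortsVideo",  # or "CaptionedVideo" / "AutoEditVideo"
        "inputProps": {
            "crop": {"mode": "static", "staticCrop": {"x": 0.5, "y": 0.4, "zoom": 1.0}},
            "outputWidth": 1080,
            "outputHeight": 1920,
        },
    },
    timeout=None,  # renders can run for tens of minutes
)
resp.raise_for_status()
```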
### Processing Time (30-min 1080p video)
| Step | CPU | GPU |
|---|---|---|
| Audio extraction (FFmpeg) | 10-20 sec | 10-20 sec |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 10-15 sec |
| Speaker diarization (pyannote) | **15-30 min** | 1-2 min |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min |
| **Total (parallelized)** | **35-80 min** | **16-40 min** |
Face detection + diarization can run in parallel (different input: video frames vs audio track).
### Memory Requirements
| Config | Peak RAM |
|---|---|
| Whisper base + pyannote (parallel) | 8-12 GB |
| Whisper medium + pyannote (parallel) | 12-16 GB |
| Recommended ML worker limit | 16 GB, `--threads 1` |
### Frontend Design
- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel (extends existing 4-track system)
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle
- UX flow: upload podcast → trigger analysis (ProcessingStep) → review speaker assignments → adjust → export
### Key Numbers
| Metric | Value |
|---|---|
| Face detection accuracy | ~90% (MediaPipe on talking-head content) |
| Diarization DER | ~10% (pyannote 3.1) |
| Face-speaker mapping (Phase 1) | 70-85% accuracy |
| Face-speaker mapping (Phase 2, TalkNet) | ~92% accuracy |
| New dependencies | ~280MB (mediapipe + pyannote + torchaudio) |
| GPU mandatory? | No for Phase 1; recommended for Phase 2 |
### Risks
- **Face-to-speaker mapping** is the hardest unsolved subproblem — 70-85% accuracy means 1 in 5 assignments may be wrong. Must let users manually correct.
- **Diarization on CPU** is the bottleneck — 15-30 min for 30-min video. GPU reduces to 1-2 min.
- **PyTorch version conflicts** between Whisper and pyannote — test `uv sync` before committing.
- **Video quality loss** when cropping 16:9 to 9:16 — only ~31.6% of frame width is kept. Source must be at least 1080p.
- **Model download on first run** — pyannote models (~100MB) require Hugging Face license acceptance. Handle in Dockerfile, not at runtime.
### MVP vs Full
- **MVP (12-15 days):** Face detection on sampled frames. User manually selects which face to follow. Static crop to selected face. No speaker switching, no diarization. Works for single-speaker content.
- **Full (30-45 days):** Speaker diarization + face-speaker mapping. Dynamic crop following active speaker. Smooth spring() transitions on speaker changes. Split-screen for reactions. Multi-speaker support.
---
## Feature 4: 9:16 Shorts Conversion
### Architecture
**Pipeline:** Crop-then-caption, always. Single Remotion render pass using new `ShortsVideo` composition. The composition renders at target 9:16 dimensions, applies CSS crop transform to `<Video>`, and renders captions on top.
**Caption positioning:** No new schema fields needed. Backend adjusts `font_size`, `padding_px`, `max_width_pct` in `styleConfig` for 9:16 aspect ratio. Remotion is a "dumb renderer" — intelligence about what looks good at 9:16 belongs in presets.
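A small sketch of that adjustment, applied server-side before the render call; the scaling factors are illustrative assumptions, not tuned values:
```python
def adapt_style_for_vertical(style: dict) -> dict:
    """Rewrite a 16:9 styleConfig for a 9:16 render."""
    vertical = dict(style)
    vertical["font_size"] = round(style["font_size"] * 1.3)      # larger type for phone screens
    vertical["padding_px"] = round(style["padding_px"] * 1.5)
    vertical["max_width_pct"] = min(style["max_width_pct"], 85)  # narrower column at 9:16
    return vertical
```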
**Crop specification:**
```typescript
type CropConfig = {
mode: "static" | "keyframe";
staticCrop?: { x: number; y: number; zoom: number }; // 0-1 normalized
keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
interpolation?: "linear" | "ease" | "smooth";
};
```
Static crop is a degenerate case of keyframe crop (single keyframe).
### Backend Design
**New job type:** `ASPECT_CONVERT` in `JobTypeEnum`. New function `crop_to_vertical()` in `media/service.py` using FFmpeg crop+scale filters (sketched after the pipeline below).
**New artifact type:** `VERTICAL_VIDEO` in `ArtifactTypeEnum`.
**Pipeline:**
1. Trim source video to clip time range (if from viral detection)
2. Apply crop (static center crop or face-tracking crop from Feature 3)
3. Upload to S3 at `{folder}/vertical/{filename}`
4. Webhook + notification
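A minimal sketch of `crop_to_vertical()` for the static-crop path (step 2); the filter expressions are standard FFmpeg, while the encoder flags are illustrative:
```python
import subprocess

def crop_to_vertical(src: str, dst: str, x_center: float = 0.5) -> None:
    """Crop 16:9 to 9:16 and scale to 1080x1920.
    x_center is the normalized horizontal pan position (0.5 = center crop);
    a 16:9 -> 9:16 crop keeps ih*9/16 of the width (~31.6% of a 1920px frame)."""
    vf = f"crop=ih*9/16:ih:(iw-ih*9/16)*{x_center}:0,scale=1080:1920"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", vf,
         "-c:v", "libx264", "-preset", "medium", "-c:a", "copy", dst],
        check=True,
    )
```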
### Frontend Design
- Crop preview: draggable 9:16 rectangle overlay on video player (CSS `object-fit: cover` + `object-position`)
- Side-by-side preview toggle: original 16:9 vs cropped 9:16
- Integration with Feature 2: "Convert to Short" button on each approved viral clip
- Integration with Feature 3: auto-populate crop region from face detection data
### Processing Time
| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |
### MVP vs Full
- **MVP (6-8 days):** Manual crop region selection with preview. User drags a 9:16 rectangle over video. New `ShortsVideo` Remotion composition renders crop + captions.
- **Full (+3-4 days after Feature 3):** Auto-crop based on face detection data. One-click vertical conversion. Batch conversion of viral clips.
---
## Recommended Build Order
```
Week 1-2: Feature 1 (Templates) ████████
Week 2-4: Feature 2 (Viral Detection) ████████████████
Week 4-6: Feature 4 MVP (9:16 crop) ████████████████
Week 6-14: Feature 3 (Head Tracking) ████████████████████████████████████████
Week 14-15: Feature 4 upgrade ████████
```
**Rationale:**
1. **Templates first** — ready to implement, zero risk, immediate user value
2. **Viral detection second** — highest value/effort ratio ($0.005/video, 5-7 days MVP), validates that users want automated editing
3. **9:16 MVP third** — builds the `ShortsVideo` composition that Feature 3 extends, useful standalone with manual crop
4. **Head tracking last** — most complex, biggest investment, validates demand from Features 2+4 first
5. **9:16 upgrade** — trivial once head tracking provides face position data
---
## Cost Analysis
### Per-Video Processing Cost
| Tier | Components | Compute | LLM API | Total | Wait Time |
|---|---|---|---|---|---|
| CPU-only | All on CPU | $0.05 | $0.06 | **$0.11** | 35-80 min |
| GPU (T4) | ML on GPU, FFmpeg on CPU | $0.11 | $0.06 | **$0.17** | 16-40 min |
| GPU + NVENC | Everything on GPU | $0.13 | $0.06 | **$0.19** | 10-15 min |
### Monthly Infrastructure Cost (100 videos/month)
| Scenario | Cost |
|---|---|
| CPU-only (existing infra) | ~$11 + server |
| Modal serverless GPU | ~$21/month |
| Spot GPU (g4dn.xlarge) | ~$115/month |
| Standing GPU (g4dn.xlarge 24/7) | ~$380/month |
**Recommendation:** Start CPU-only. Move to Modal serverless GPU when queue wait times exceed 15 minutes. At 500+ videos/day, evaluate spot instances.
### Suggested SaaS Pricing Tiers
| Tier | Price | Limits | Compute Cost | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, low queue priority | ~$0.04/video | Marketing |
| Pro | $15-30/mo | 30-min videos, GPU ML | ~$0.17/video at 50 videos | 60-80% |
| Business | $50-100/mo | 60-min videos, priority queue, NVENC | ~$0.38/video | 70-85% |
---
## Infrastructure Decisions
### ML Service Separation
**Phase 1:** Keep ML in existing Dramatiq workers. MediaPipe + pyannote add only ~280MB to image. PyTorch is already installed via Whisper.
**Phase 2:** Separate `ml-worker` Docker container on dedicated queues. Same codebase, different image (`Dockerfile.ml`), different resource limits. Use Docker Compose profiles:
```bash
docker-compose up # Default: no ML worker
docker-compose --profile ml up # With ML worker
```
**Do NOT build a separate HTTP microservice.** Dramatiq already handles job queuing, retries, progress, and cancellation. Adding HTTP service discovery, API contracts, and health checks is overhead with zero benefit for async workloads.
### Immediate Optimizations (Before New Features)
| Action | Impact | Effort |
|---|---|---|
| Switch PyTorch to CPU-only index | -800MB image size | 1 hour |
| Fix worker `REMOTION_SERVICE_URL` default | Bug fix | 5 min |
| Add resource limits to docker-compose services | Prevent OOM cascades | 30 min |
| Split Dramatiq into queue pools (lightweight vs ML vs compute) | Prevent worker starvation | 2-3 hours |
---
## Technology Stack Summary
### New Dependencies
| Package | Size | Purpose | Feature |
|---|---|---|---|
| `google-generativeai` or `openai` | ~10 MB | LLM API client | 2 |
| `librosa` | ~20 MB | Audio energy analysis | 2 |
| `mediapipe` | ~30 MB | Face detection | 3 |
| `pyannote-audio` | ~200 MB | Speaker diarization | 3 |
| `torchaudio` | ~50-80 MB | Audio processing for pyannote | 3 |
| **Total new deps** | **~310-340 MB** | | |
### New Backend Modules
| Module | Purpose | Feature |
|---|---|---|
| `clips` | Clip CRUD, review workflow | 2 |
### New Remotion Compositions
| Composition | Purpose | Feature |
|---|---|---|
| `ShortsVideo` | Static/keyframe crop + captions at 9:16 | 4 |
| `AutoEditVideo` | Face-tracking dynamic crop + captions | 3 |
### New Job Types
| Job Type | Purpose | Feature |
|---|---|---|
| `VIRAL_DETECT` | LLM analysis of transcription | 2 |
| `ASPECT_CONVERT` | 9:16 crop + re-encode | 4 |
| `FACE_DETECT` | Face bounding box detection | 3 |
| `SPEAKER_DIARIZE` | Speaker diarization | 3 |
---
## Cross-Cutting Issues
| Issue | Flagged By | Priority | Action |
|---|---|---|---|
| PyTorch installs CUDA libs on CPU-only infra (+800MB) | DevOps | High | Switch to CPU-only PyTorch index |
| Worker `--processes 1 --threads 2` will OOM with ML jobs | Performance | High | Split into queue pools, `--threads 1` for ML |
| `_get_job_status_sync()` leaks DB connections | Performance | High | Fix before adding more actors |
| No temp file cleanup on OOM crash | Performance | Medium | Add periodic `/tmp` cleanup or cron |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Backend | Medium | Extract actor boilerplate into decorator/context manager |
| Worker `REMOTION_SERVICE_URL` default wrong (`localhost:8001`) | DevOps | Medium | Fix to `http://remotion:3001` in docker-compose |
| No resource limits on any Docker service | DevOps | Medium | Add memory/CPU limits to all services |
| Whisper should move to ML service eventually | Backend | Low | Plan for Phase 2 when ML worker is split out |
| `isCurrent` word identity check in Captions.tsx is fragile | Remotion | Low | Compare by index, not text + start time |
---
## Specialist Reports (Full Transcripts)
Full specialist outputs are available in the session transcript. Key files each specialist examined:
- **ML Engineer:** `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- **Backend Architect:** `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- **Remotion Engineer:** `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- **Frontend Architect:** `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- **DevOps Engineer:** `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- **Performance Engineer:** `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`