# Video Features Roadmap — Technical Consultation v1

**Date:** 2026-03-22
**Specialists consulted:** ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer

---

## Feature Overview

| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---------|-----------|-----|------|-----------------|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | 5-7 days | 8-12 days | LLM API key only |
| 3 | Auto-Cut & Head Tracking | Very Hard | 12-15 days | 30-45 days | Phase 1: nothing; Phase 2: GPU worker |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| **Total** | | | **26-34 days** | **44-65 days** | |

Realistic for one dev: **6-8 weeks** (all MVPs) or **3-4 months** (full versions).

---

## Feature 1: Advanced Remotion Templates

**Status:** Spec + implementation plan already written.

- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`

**Scope:** Extend `CaptionStyleSchema` with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), and 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".

**Changes:** Schema extensions in Remotion + backend, rendering logic in `Captions.tsx`, Alembic migration for presets, frontend StyleEditor form controls.

**No specialist input needed** — fully designed, no new infrastructure.

---

## Feature 2: Viral Moments Detection

### Architecture

**LLM API:** Gemini 2.5 Flash (best Russian language support, $0.15/$0.60 per 1M tokens) or GPT-4o-mini (same pricing, slightly weaker Russian). Cost per 30-min video analysis: ~$0.005.

**Audio augmentation:** `librosa` for RMS energy curves — refines clip boundaries to natural pauses, boosts scoring for high-energy segments. Adds a ~20MB dependency, processes 30-min audio in <10 seconds.

**Pipeline:**

1. Fetch transcription Document from DB
2. librosa computes energy envelope over full audio (100ms resolution)
3. LLM analyzes transcription text with structured JSON output prompt
4. Post-process: snap clip boundaries to low-energy points, compute energy scores
5. Save clips to new `clips` table

### Backend Design

**New module:** `clips` (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.

**Clip model:**

```
Clip {
  project_id: UUID (FK projects)
  source_file_id: UUID (FK files)
  job_id: UUID? (FK jobs)
  title: str
  start_ms: int
  end_ms: int
  score: float
  source_type: "viral_detected" | "user_created" | "auto_generated"
  status: "pending" | "approved" | "rejected" | "exported"
  meta: JSON? (LLM reasoning, tags, hashtags)
}
```

**New job type:** `VIRAL_DETECT` added to `JobTypeEnum`. The actor calls the LLM API directly via `httpx` from the Dramatiq worker (no separate service needed).
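A minimal sketch of that actor is below. It assumes an OpenAI-compatible chat-completions payload and reads the settings straight from environment variables; the payload shape, prompt-file naming, and response parsing are illustrative assumptions, not the final implementation. The setting names and the retry-on-429 behaviour follow the LLM integration notes that come next.

```python
# Illustrative sketch only. Payload shape assumes an OpenAI-compatible
# chat-completions endpoint; a real service would use its settings object
# instead of os.environ.
import json
import os
import time
from pathlib import Path

import dramatiq
import httpx


def call_llm(prompt: str, transcript_text: str, max_attempts: int = 5) -> dict:
    """POST the viral-detection prompt plus transcript to the LLM API,
    retrying with exponential backoff when the API answers 429."""
    payload = {
        "model": os.environ["LLM_MODEL_NAME"],
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": transcript_text},
        ],
        "response_format": {"type": "json_object"},  # ask for structured JSON output
    }
    headers = {"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"}
    with httpx.Client(timeout=60.0) as client:
        for attempt in range(max_attempts):
            response = client.post(os.environ["LLM_API_URL"], json=payload, headers=headers)
            if response.status_code == 429:      # rate limited: back off and retry
                time.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            content = response.json()["choices"][0]["message"]["content"]
            return json.loads(content)           # expected: clip titles, timestamps, scores
    raise RuntimeError("LLM API kept returning 429 after all retries")


@dramatiq.actor(queue_name="viral_detect", max_retries=3)
def viral_detect(project_id: str, transcript_text: str) -> None:
    """VIRAL_DETECT job body. Fetching the transcription Document and saving
    Clip rows are omitted; only the LLM round-trip is shown."""
    prompt_version = os.environ.get("LLM_VIRAL_PROMPT_VERSION", "v1")
    prompt_path = Path(f"cpv3/infrastructure/prompts/viral_detection_{prompt_version}.txt")
    result = call_llm(prompt_path.read_text(), transcript_text)
    # Next step (not shown): validate timestamps against segment boundaries,
    # then insert rows into the clips table.
    print(f"{len(result.get('clips', []))} candidate clips for project {project_id}")
```

Keeping the call inside the Dramatiq actor matches the "no separate service" decision; Dramatiq's own `max_retries` then covers transient failures other than rate limiting.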
**LLM integration:**

- Direct HTTP call from the actor with retry + exponential backoff on 429
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by the `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL_NAME`

### Frontend Design

- New `ViralClipsStep` in the project wizard (features/project/)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for the start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)

### Key Numbers

| Metric | Value |
|---|---|
| Accuracy (precision) | 50-70% |
| Accuracy (recall) | 60-80% |
| Processing time | 10-20 seconds |
| Cost per video | ~$0.005 |
| Cost at 1,000 videos/month | ~$5 |
| New dependencies | `google-generativeai` or `openai` (~10MB) + `librosa` (~20MB) |

### Risks

- **Prompt engineering quality** determines feature value — iterate based on user feedback
- **Visual-only moments** (facial expressions, physical comedy) cannot be detected from text — roughly 20-30% of viral moments are missed
- **Transcription quality matters** — Whisper `tiny` has ~25% WER on Russian; use at least `small` for viral detection input
- **LLM-hallucinated timestamps** — validate returned timestamps against actual segment boundaries

### MVP vs Full

- **MVP:** Text-only LLM analysis, no audio energy. Returns clips with scores. User reviews and accepts/rejects.
- **Full:** Add librosa energy analysis, few-shot prompt examples from user-accepted clips, batch processing, direct clip export to 9:16.

---

## Feature 3: Auto-Cut & Head Tracking

### Architecture

**Face detection:** MediaPipe BlazeFace (Apache 2.0, ~2MB model, 30-60 FPS on CPU). Sample at 3 FPS — face positions don't change significantly within ~330ms. Dependency: `mediapipe` (~30MB).

**Speaker diarization:** pyannote.audio 3.1 (MIT, ~10% DER, self-hosted). On CPU, processing time is roughly 0.17-0.33× the audio duration (5-10 min for 30-min audio); a GPU cuts this to 1-2 min. Dependencies: `pyannote-audio` (~200MB) + `torchaudio` (~50-80MB). PyTorch is already installed via Whisper.

**Face-speaker mapping:**

- Phase 1: Temporal correlation heuristic — match face tracks to speaker segments by maximum temporal overlap (see the sketch at the end of this section). 70-85% accuracy for 2-speaker videos. Zero additional dependencies, ~100 lines of Python.
- Phase 2: TalkNet-ASD (Active Speaker Detection) — jointly analyzes lip movement + audio to detect who is speaking. 92.3% accuracy. Requires `torchvision` + model weights (~50MB). Needs a GPU (2-5 FPS on CPU vs 15-25 FPS on GPU).

**Video compositing (Remotion approach):** Dynamic crop via CSS `transform: scale() translate()` on `
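The Phase 1 face-speaker mapping above is simple enough to sketch. Below is a minimal, illustrative version; the data shapes (face tracks as lists of time intervals, diarization segments as speaker/start/end tuples) are assumptions, not the project's actual types.

```python
# Illustrative sketch of the Phase 1 heuristic only; data shapes are assumptions.
# Face tracks come from 3 FPS detections merged into intervals; speaker segments
# come from pyannote.audio diarization.
from collections import defaultdict


def overlap(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Length in seconds of the intersection of two [start, end] intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def map_faces_to_speakers(
    face_tracks: dict[str, list[tuple[float, float]]],
    speaker_segments: list[tuple[str, float, float]],
) -> dict[str, str]:
    """Assign each speaker to the face track it overlaps with the most in time."""
    totals: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for speaker, seg_start, seg_end in speaker_segments:
        for track_id, intervals in face_tracks.items():
            for interval in intervals:
                totals[speaker][track_id] += overlap((seg_start, seg_end), interval)
    # Pick the best-overlapping track per speaker; zero-overlap speakers stay unmapped.
    mapping: dict[str, str] = {}
    for speaker, per_track in totals.items():
        best_track, best_seconds = max(per_track.items(), key=lambda kv: kv[1])
        if best_seconds > 0:
            mapping[speaker] = best_track
    return mapping


if __name__ == "__main__":
    # Two speakers, two face tracks built from sparse detections.
    tracks = {"face_left": [(0.0, 40.0)], "face_right": [(35.0, 90.0)]}
    segments = [("SPEAKER_00", 0.0, 30.0), ("SPEAKER_01", 45.0, 80.0)]
    print(map_faces_to_speakers(tracks, segments))
    # -> {'SPEAKER_00': 'face_left', 'SPEAKER_01': 'face_right'}
```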