# Video Features Roadmap — Technical Consultation v2 (API-First)

**Date:** 2026-03-22

**Specialists consulted:** ML/AI Engineer, Backend Architect, Remotion Engineer, Frontend Architect, DevOps Engineer, Performance Engineer

**Revision:** v2 — switched to API-first architecture using Deepgram, GigaChat, and DeepInfra

---

## What Changed from v1

v2 replaces local ML models with managed API services. This is the single biggest architectural change — it eliminates PyTorch, GPU infrastructure, ML worker separation, and most memory/processing bottlenecks.

### API Substitutions

| v1 (Local ML) | v2 (API-First) | Impact |
|---|---|---|
| Local Whisper (PyTorch, 20-60 min CPU) | **Deepgram Nova-3** API (~30 sec) | Eliminates PyTorch dependency entirely |
| Local pyannote.audio (PyTorch, 15-30 min CPU) | **Deepgram** `diarize=true` (included in transcription call) | Eliminates pyannote + torchaudio deps |
| Gemini 2.5 Flash / GPT-4o-mini for viral detection | **GigaChat Pro** (native Russian LLM by Sber) | Better Russian cultural context, humor, slang |
| librosa for audio energy analysis | **Deepgram** `sentiment=true` per utterance | Sentiment replaces energy analysis for most cases |
| N/A | **DeepInfra** (Llama, Mistral, Qwen via API) | Fallback / A/B testing for LLM analysis |

### Key Metrics Changed

| Metric | v1 | v2 | Change |
|---|---|---|---|
| Docker image size | 1.72 GB | **~400-500 MB** | -75% (no PyTorch) |
| Peak worker RAM | 8-16 GB | **~400 MB** (MediaPipe only) | -95% |
| Processing time (30-min video, full pipeline) | 35-80 min (CPU) | **5-10 min** | -85% |
| Per-video cost | $0.11 | **$0.20** | +80% (API costs) |
| Monthly cost (100 videos) | $11 compute + server + $0-380 GPU | **$20 APIs + server** | Simpler, cheaper at low volume |
| GPU needed? | Phase 2 for diarization | **Never** | Eliminated |
| New Python dependencies | ~310-340 MB | **~40 MB** (mediapipe + HTTP clients) | -88% |
| MVP total timeline | 26-34 dev-days | **20-27 dev-days** | -20 to -25% |

### Issues Eliminated

These v1 cross-cutting issues no longer apply:

| v1 Issue | Why It's Gone |
|---|---|
| ~~Switch PyTorch to CPU-only index~~ | PyTorch removed entirely (Whisper replaced by Deepgram) |
| ~~Worker OOM with concurrent ML jobs~~ | No heavy ML — standard 4 GB worker |
| ~~Separate ML worker Docker image~~ | Single lightweight image |
| ~~GPU infrastructure planning~~ | All ML is API-based |
| ~~PyTorch version conflicts~~ | No PyTorch |
| ~~Model download on first run~~ | No local models (except MediaPipe, ~2 MB) |
| ~~ML worker separation via Docker Compose profiles~~ | Not needed |

### New Issues Introduced

| Issue | Priority | Mitigation |
|---|---|---|
| API key management (Deepgram, GigaChat, DeepInfra) | High | Store in settings via env vars, never in code |
| API rate limits | High | Retry with exponential backoff in actors |
| API vendor lock-in | Medium | Abstract behind engine interfaces (like current `engine: "whisper" \| "google"`) |
| Network dependency (API downtime = no processing) | Medium | Keep Whisper as optional fallback engine |
| Higher per-video cost ($0.20 vs $0.11) | Low | Offset by zero ML infrastructure cost; profitable at any SaaS tier |
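
The backoff mitigation is small enough to live in one helper shared by every API-calling actor. A minimal sketch, assuming the request is passed in as a callable returning `(status_code, result)`; the helper name and retry parameters are illustrative, not from the codebase:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # rate limits + transient server errors

def call_with_backoff(call, *, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call()` (any API request returning (status_code, result)) with
    exponential backoff on rate limits and transient 5xx errors."""
    for attempt in range(max_attempts):
        status, result = call()
        if status not in RETRYABLE:
            return result
        if attempt < max_attempts - 1:
            # 1s, 2s, 4s, ... plus jitter so concurrent actors don't retry in lockstep
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"still failing after {max_attempts} attempts")
```

Actors would wrap their Deepgram/GigaChat/DeepInfra calls in this helper so rate-limit handling stays in one place; Dramatiq's own `max_retries`/`min_backoff` actor options remain a complementary layer for whole-job retries.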

---

## Feature Overview

| # | Feature | Complexity | MVP | Full | Additional Infra |
|---|---------|-----------|-----|------|-----------------|
| 1 | Advanced Remotion Templates | Easy-Medium | 3-4 days | 3-4 days | None — ready to implement |
| 2 | Viral Moments Detection | Medium | **3-5 days** | 6-10 days | API keys (GigaChat, Deepgram) |
| 3 | Auto-Cut & Head Tracking | Hard | **8-10 days** | 20-30 days | MediaPipe only (CPU, ~30 MB) |
| 4 | 9:16 Shorts Conversion | Medium | 6-8 days | +3-4 days after #3 | None |
| **Total** | | | **20-27 days** | **35-47 days** | |

Realistic for one dev: **5-7 weeks** (all MVPs) or **2-3 months** (full versions).

---

## Feature 1: Advanced Remotion Templates

**No changes from v1.** This feature has no ML dependencies.

**Status:** Spec + implementation plan already written.

- Spec: `docs/superpowers/specs/2026-03-21-advanced-remotion-templates-design.md`
- Plan: `docs/superpowers/plans/2026-03-21-advanced-remotion-templates.md`

**Scope:** Extend `CaptionStyleSchema` with 4 new highlight styles (pop_in, karaoke, bounce, glow_pulse), 2 transitions (zoom_in, drop_in), 3 fields (word_entrance, highlight_rotation_deg, text_transform). Seed 2 system presets: "Shorts" and "Podcast".

**Changes:** Schema extensions in Remotion + backend, rendering logic in `Captions.tsx`, Alembic migration for presets, frontend StyleEditor form controls.

---

## Feature 2: Viral Moments Detection

### Architecture (v2 — API-First)

**Transcription:** Deepgram Nova-3 API with `diarize=true` + `sentiment=true`. A single API call returns word-level timestamps, speaker labels, and per-utterance sentiment scores. Cost: $0.0053/min ($0.16 for a 30-min video). Processing: ~30 seconds.
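
Assembled as plain request data, that single call might look like the sketch below. The query parameter names follow Deepgram's pre-recorded `listen` API; the `language` value and the helper itself are assumptions for this pipeline, not code from the repo:

```python
def deepgram_request(audio_url: str, api_key: str):
    """Build the one pre-recorded transcription request; send it with any
    HTTP client, e.g. httpx.post(url, params=params, headers=headers, json=body)."""
    params = {
        "model": "nova-3",
        "language": "ru",      # assumed: Russian-first content
        "diarize": "true",     # speaker labels, no separate diarization call
        "sentiment": "true",   # per-utterance sentiment scores
        "utterances": "true",  # grouped utterances alongside word timings
        "punctuate": "true",
    }
    headers = {"Authorization": f"Token {api_key}"}
    body = {"url": audio_url}  # hosted-file mode; raw bytes upload also works
    return "https://api.deepgram.com/v1/listen", params, headers, body
```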

**LLM analysis:** GigaChat Pro (by Sber) — native Russian LLM trained on Russian internet content. Better detection of Russian humor, cultural references, slang, and viral patterns than English-first models. Fallback: DeepInfra (Llama 3.1 70B or Qwen) for A/B testing.

**Audio augmentation:** Deepgram's per-utterance sentiment scores replace `librosa` energy analysis for most use cases. High-sentiment utterances correlate with viral moments. Optional: keep `librosa` for audio loudness analysis (laughter, raised voice) as an enhancement.

**Pipeline:**

1. Deepgram transcription with `diarize=true` + `sentiment=true` → timestamps + speakers + sentiment
2. Convert Deepgram response to existing `Document` schema (segments, lines, words)
3. GigaChat analyzes transcription text + sentiment data → viral clip candidates
4. Post-process: snap boundaries to segment edges, compute composite scores
5. Save clips to `clips` table
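
Step 2 above is a pure mapping from the Deepgram payload to the internal schema. A sketch under an assumed, simplified shape (the real `Document` has segments, lines, and words; the dataclasses and field names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int

@dataclass
class Segment:
    speaker: int
    sentiment: str
    text: str
    start_ms: int
    end_ms: int
    words: list[Word] = field(default_factory=list)

def to_document_segments(dg: dict) -> list[Segment]:
    """Map Deepgram utterances onto the internal segment/word shape."""
    segments = []
    for u in dg["results"]["utterances"]:
        segments.append(Segment(
            speaker=u.get("speaker", 0),
            sentiment=u.get("sentiment", "neutral"),
            text=u["transcript"],
            # Deepgram reports float seconds; the internal schema uses milliseconds
            start_ms=round(u["start"] * 1000),
            end_ms=round(u["end"] * 1000),
            words=[Word(w["word"], round(w["start"] * 1000), round(w["end"] * 1000))
                   for w in u.get("words", [])],
        ))
    return segments
```

The converter is also the natural place for the seconds-to-milliseconds normalization, so nothing downstream has to care which engine produced the transcript.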

### Backend Design

**New module:** `clips` (models, schemas, repository, service, router) — stores detected clips with project/file/job relationships.

**Clip model:**

```
Clip {
  project_id: UUID (FK projects)
  source_file_id: UUID (FK files)
  job_id: UUID? (FK jobs)
  title: str
  start_ms: int
  end_ms: int
  score: float
  source_type: "viral_detected" | "user_created" | "auto_generated"
  status: "pending" | "approved" | "rejected" | "exported"
  meta: JSON? (LLM reasoning, tags, hashtags, sentiment data)
}
```

**New job type:** `VIRAL_DETECT` added to `JobTypeEnum`. Actor calls GigaChat API via `httpx` from Dramatiq worker.

**Transcription engine extension:** Add `"deepgram"` to the existing engine selection (`engine: "whisper" | "google" | "deepgram"`). Deepgram becomes the default for new transcriptions. Whisper remains as a fallback.

**LLM integration:**

- GigaChat API via `httpx` (OAuth2 token auth via Sber ID)
- DeepInfra as fallback (OpenAI-compatible API)
- Prompts stored in `cpv3/infrastructure/prompts/viral_detection_v1.txt`
- Active version controlled by `LLM_VIRAL_PROMPT_VERSION` env var
- New settings: `GIGACHAT_CLIENT_ID`, `GIGACHAT_CLIENT_SECRET`, `DEEPINFRA_API_KEY`, `DEEPGRAM_API_KEY`
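
GigaChat's OAuth2 access tokens expire, so the token logic is worth isolating in one cached helper. A sketch of the caching policy only, with the actual OAuth POST injected as `fetch` (class and parameter names are illustrative):

```python
import time

class TokenCache:
    """Cache an OAuth2 access token and refresh it shortly before expiry.
    `fetch` performs the real OAuth call (POST with Basic client credentials)
    and returns (token, expires_at_epoch_seconds)."""

    def __init__(self, fetch, *, refresh_margin: float = 60.0, clock=time.time):
        self._fetch = fetch
        self._margin = refresh_margin
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def token(self) -> str:
        # refresh when missing or within `refresh_margin` seconds of expiry
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token
```

In production, `fetch` would POST the client credentials to Sber's OAuth endpoint and return the token plus its expiry; injecting it keeps the refresh logic testable without network access.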

### Frontend Design

- New `ViralClipsStep` in project wizard (features/project/)
- Clip list with thumbnails, scores, titles, approve/reject buttons
- Clip edit modal with video preview (scoped playback for start/end range)
- New job type `VIRAL_DETECT` in notification handling (existing WebSocket infrastructure)

### Key Numbers

| Metric | v1 | v2 |
|---|---|---|
| Transcription time | Depends on Whisper (already done) | ~30 sec (Deepgram, if not already transcribed) |
| LLM analysis time | 10-20 sec | 10-20 sec (same) |
| Total processing | 10-20 sec (after transcription) | **40-50 sec** (including Deepgram transcription) |
| Cost per video | ~$0.005 (LLM only) | **~$0.17** ($0.16 Deepgram + $0.01 GigaChat) |
| Accuracy (precision) | 50-70% | **60-80%** (GigaChat better at Russian + sentiment data) |
| New dependencies | `google-generativeai` + `librosa` (~30 MB) | **HTTP client only** (~0 MB new) |
| MVP time | 5-7 days | **3-5 days** |

### Risks

- **GigaChat API availability** — Sber's API may have lower uptime than Google/OpenAI. Mitigation: DeepInfra fallback.
- **GigaChat structured output** — verify JSON mode / function calling works reliably for clip extraction. Test early.
- **Deepgram Russian WER** — ~10-12% WER on Russian (Nova-3). Comparable to Whisper `medium`. Sufficient for viral detection.
- **Visual-only moments** still missed (~20-30%) — same limitation as v1.

### MVP vs Full

- **MVP (3-5 days):** Deepgram transcription + GigaChat analysis. Returns clips with scores. User reviews and accepts/rejects. No audio energy analysis.
- **Full (6-10 days):** Add sentiment-weighted scoring, few-shot prompt tuning from user feedback, batch processing, direct clip export to 9:16, DeepInfra A/B testing.
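
The sentiment-weighted scoring and boundary snapping from the full version reduce to a few lines. A sketch, assuming the millisecond-based segment shape produced by transcription; the 0.3 weight and field names are placeholders to tune:

```python
def composite_score(llm_score: float, segments: list[dict],
                    start_ms: int, end_ms: int, sentiment_weight: float = 0.3) -> float:
    """Blend the LLM's 0-1 virality score with the share of emotionally
    charged (non-neutral) utterances inside the clip window."""
    inside = [s for s in segments if s["start_ms"] < end_ms and s["end_ms"] > start_ms]
    if not inside:
        return llm_score
    charged = sum(1 for s in inside if s["sentiment"] != "neutral") / len(inside)
    return (1 - sentiment_weight) * llm_score + sentiment_weight * charged

def snap_to_segments(segments: list[dict], start_ms: int, end_ms: int):
    """Snap rough LLM clip boundaries to the nearest utterance edges."""
    starts = [s["start_ms"] for s in segments]
    ends = [s["end_ms"] for s in segments]
    return (min(starts, key=lambda t: abs(t - start_ms)),
            min(ends, key=lambda t: abs(t - end_ms)))
```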

---

## Feature 3: Auto-Cut & Head Tracking

### Architecture (v2 — API-First)

**Face detection:** MediaPipe BlazeFace (unchanged from v1). Apache 2.0, ~2 MB model, 30-60 FPS on CPU. Sample at 3 FPS. **This is the only local ML component remaining.** Dependency: `mediapipe` (~30 MB).
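
A sketch of the sampled-frame detection loop. It uses MediaPipe's bundled `face_detection` solution with OpenCV for decoding; the frame-step helper is pulled out so the sampling arithmetic stands on its own, and the heavy imports stay inside the function:

```python
def sample_step(video_fps: float, sample_fps: float = 3.0) -> int:
    """How many source frames to skip between detections (3 FPS sampling)."""
    return max(1, round(video_fps / sample_fps))

def detect_faces(video_path: str, sample_fps: float = 3.0) -> list[tuple]:
    import cv2              # heavy deps imported lazily so sample_step
    import mediapipe as mp  # stays importable without them
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = sample_step(fps, sample_fps)
    boxes = []
    with mp.solutions.face_detection.FaceDetection(
            model_selection=1, min_detection_confidence=0.5) as detector:
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                result = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                for det in result.detections or []:
                    bb = det.location_data.relative_bounding_box  # normalized 0-1
                    boxes.append((idx / fps, bb.xmin, bb.ymin, bb.width, bb.height))
            idx += 1
    cap.release()
    return boxes
```

The normalized 0-1 boxes feed directly into the crop keyframe format below, so no resolution-specific conversion is needed.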

**Speaker diarization:** **Deepgram API** with `diarize=true` (~30 seconds for a 30-min video). Replaces pyannote.audio entirely. Diarization is included in the transcription call — no additional API cost.

**Face-speaker mapping:**

- Phase 1: Temporal correlation heuristic — match face tracks to Deepgram speaker segments by maximum temporal overlap. 70-85% accuracy for 2-speaker videos. Zero additional dependencies. ~100 lines of Python.
- Phase 2: TalkNet-ASD — only if needed for accuracy. This is the only scenario where a GPU would be reconsidered, and it can be deferred indefinitely if temporal correlation + user correction is sufficient.
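
The Phase 1 heuristic fits in a few lines. A sketch with assumed input shapes: face tracks as lists of on-screen spans, and Deepgram turns as `(speaker, start_ms, end_ms)` tuples:

```python
from collections import defaultdict

def overlap_ms(a_start, a_end, b_start, b_end):
    """Length of the intersection of two [start, end) spans, in ms."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def map_faces_to_speakers(face_tracks: dict, speaker_segments: list) -> dict:
    """face_tracks: {track_id: [(start_ms, end_ms), ...]} — when each face is on screen.
    speaker_segments: [(speaker_id, start_ms, end_ms), ...] from diarization.
    Assign each face track to the speaker it overlaps with the most."""
    mapping = {}
    for track_id, spans in face_tracks.items():
        totals = defaultdict(int)
        for s_start, s_end in spans:
            for spk, t_start, t_end in speaker_segments:
                totals[spk] += overlap_ms(s_start, s_end, t_start, t_end)
        if totals:
            mapping[track_id] = max(totals, key=totals.get)
    return mapping
```

Ties and low-overlap tracks are where the 15-30% error rate lives; surfacing the per-speaker overlap totals in the UI makes the manual-correction step cheap.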

**Video compositing:** Same as v1 — Remotion compositions with CSS transform crop. No changes.

**New Remotion compositions:** Same as v1.

| Composition | Purpose | Phase |
|---|---|---|
| `CaptionedVideo` (existing) | Caption overlay on native video | Current |
| `ShortsVideo` (new) | Static/keyframe crop + captions at 9:16 | Feature 4 |
| `AutoEditVideo` (new) | Face-tracking crop + cuts + captions | Feature 3 full |

**Crop data format:** Same as v1 (keyframes with normalized 0-1 coordinates).

### Backend Design

**New job type:** `FACE_DETECT` added to `JobTypeEnum`. `SPEAKER_DIARIZE` is **no longer needed as a separate job** — diarization comes from Deepgram as part of transcription.

**ML service separation:** **Not needed.** MediaPipe is lightweight (~30 MB, ~400 MB RAM). It runs in the standard Dramatiq worker.

**Remotion service changes:** Same as v1 — `compositionId` parameter, `crop`/`outputWidth`/`outputHeight` props.

### Processing Time (30-min 1080p video)

| Step | v1 (CPU) | v2 (API-First) |
|---|---|---|
| Deepgram transcription + diarization | N/A | **~30 sec** |
| Face detection (MediaPipe, 3 FPS) | 1-2 min | 1-2 min (unchanged) |
| ~~Speaker diarization (pyannote)~~ | ~~15-30 min~~ | **Included in Deepgram** |
| Face-speaker mapping | < 1 sec | < 1 sec |
| Remotion render (crop + captions) | 10-30 min | 10-30 min (unchanged) |
| **Total (parallelized)** | **35-80 min** | **12-33 min** |

**The 15-30 min diarization bottleneck is completely eliminated.**

### Memory Requirements

| Config | v1 | v2 |
|---|---|---|
| Peak RAM | 8-16 GB | **~400 MB** (MediaPipe only) |
| Worker config needed | `--threads 1`, 16 GB limit | Standard worker, 4 GB limit |

### Frontend Design

Same as v1:

- Head tracking preview: video player with face bounding box overlay (canvas)
- Speaker timeline track in TimelinePanel
- Controls: zoom level slider, transition speed, speaker selection
- Before/after comparison toggle

### Key Numbers

| Metric | v1 | v2 |
|---|---|---|
| Diarization time | 15-30 min (CPU) / 1-2 min (GPU) | **~30 sec** (API) |
| Face detection time | 1-2 min | 1-2 min (unchanged) |
| Total analysis time | 17-33 min (CPU) | **~2 min** |
| Full pipeline (with render) | 35-80 min (CPU) | **12-33 min** |
| Peak RAM | 8-16 GB | **~400 MB** |
| New dependencies | ~280 MB (mediapipe + pyannote + torchaudio) | **~30 MB** (mediapipe only) |
| GPU needed? | Phase 2 recommended | **Never** |
| MVP time | 12-15 days | **8-10 days** |

### Risks

- **Face-to-speaker mapping** accuracy unchanged (70-85% with heuristic) — still the hardest subproblem
- **Deepgram diarization accuracy** — DER may be slightly worse than pyannote 3.1 (~12-15% vs ~10%). Acceptable for this use case.
- **Video quality loss** when cropping — unchanged from v1
- **TalkNet-ASD deferred** — if temporal correlation isn't accurate enough, TalkNet requires a GPU. Cross that bridge if needed.

### MVP vs Full

- **MVP (8-10 days):** Face detection on sampled frames. Deepgram provides speaker labels. Temporal correlation maps faces to speakers. User can manually correct. Static crop to selected face.
- **Full (20-30 days):** Dynamic crop following the active speaker. Smooth transitions. Split-screen. Multi-speaker. Optional TalkNet-ASD for accuracy.

---

## Feature 4: 9:16 Shorts Conversion

**No changes from v1.** This feature has no ML dependencies.

### Architecture

**Pipeline:** Crop-then-caption, always. Single Remotion render pass using the new `ShortsVideo` composition.

**Caption positioning:** No new schema fields needed. Backend adjusts `font_size`, `padding_px`, `max_width_pct` in `styleConfig` for 9:16.

**Crop specification:**

```typescript
type CropConfig = {
  mode: "static" | "keyframe";
  staticCrop?: { x: number; y: number; zoom: number };
  keyframes?: Array<{ time: number; x: number; y: number; zoom: number }>;
  interpolation?: "linear" | "ease" | "smooth";
};
```
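
In Remotion the keyframe mode would be evaluated per frame with `interpolate()` in TypeScript; the same logic, sketched in Python for illustration (linear mode only, clamped at the ends):

```python
def crop_at(keyframes: list[dict], t: float) -> tuple:
    """Evaluate the crop at time t by linear interpolation between keyframes.
    keyframes: time-sorted dicts {time, x, y, zoom}, normalized as in CropConfig."""
    if t <= keyframes[0]["time"]:
        k = keyframes[0]
        return (k["x"], k["y"], k["zoom"])
    if t >= keyframes[-1]["time"]:
        k = keyframes[-1]
        return (k["x"], k["y"], k["zoom"])
    for a, b in zip(keyframes, keyframes[1:]):
        if a["time"] <= t <= b["time"]:
            f = (t - a["time"]) / (b["time"] - a["time"])  # 0..1 between keyframes
            lerp = lambda p, q: p + (q - p) * f
            return (lerp(a["x"], b["x"]), lerp(a["y"], b["y"]),
                    lerp(a["zoom"], b["zoom"]))
```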

### Backend Design

**New job type:** `ASPECT_CONVERT` in `JobTypeEnum`. New function `crop_to_vertical()` in `media/service.py`.
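
If `crop_to_vertical()` goes the FFmpeg route, it mostly reduces to building one command line. A sketch; the function signature, encoder settings, and centering logic are assumptions, not the actual `media/service.py` design:

```python
def vertical_crop_args(src: str, dst: str, in_w: int, in_h: int,
                       center_x: float = 0.5, out_h: int = 1920) -> list[str]:
    """Build an ffmpeg command that crops a horizontal source to 9:16.
    center_x is the normalized horizontal crop center (e.g. from face detection)."""
    crop_w = round(in_h * 9 / 16 / 2) * 2  # 9:16 width at source height, kept even
    # clamp the crop window so it stays inside the frame
    x = int(min(max(in_w * center_x - crop_w / 2, 0), in_w - crop_w))
    vf = f"crop={crop_w}:{in_h}:{x}:0,scale={out_h * 9 // 16}:{out_h}"
    return ["ffmpeg", "-y", "-i", src, "-vf", vf,
            "-c:v", "libx264", "-preset", "fast", "-c:a", "copy", dst]
```

The `center_x` input is where Feature 3's face detection plugs in: the detected face center drives the crop window.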

**New artifact type:** `VERTICAL_VIDEO` in `ArtifactTypeEnum`.

### Frontend Design

- Crop preview: draggable 9:16 rectangle overlay on video player
- Side-by-side preview toggle
- "Convert to Short" button on approved viral clips
- Auto-populate crop from face detection data (when available)

### Processing Time

| Approach | Time (30-min video) |
|---|---|
| FFmpeg crop-only (no captions) | 12-36 min |
| Remotion crop + captions (single pass) | 11-45 min |
| FFmpeg with NVENC hardware encoding | 3-5 min |

### MVP vs Full

- **MVP (6-8 days):** Manual crop region selection with preview. `ShortsVideo` Remotion composition.
- **Full (+3-4 days after Feature 3):** Auto-crop from face detection. One-click conversion. Batch export.

---

## Recommended Build Order

```
Week 1-2:   Feature 1 (Templates)        ████████
Week 2-3:   Feature 2 (Viral Detection)  ██████████
Week 3-5:   Feature 4 MVP (9:16 crop)    ████████████████
Week 5-10:  Feature 3 (Head Tracking)    ██████████████████████████████
Week 10-11: Feature 4 upgrade            ████████
```

**Rationale:**

1. **Templates first** — ready to implement, zero risk, immediate user value
2. **Viral detection second** — fastest ROI with API-first (3-5 day MVP), validates user demand
3. **9:16 MVP third** — builds the `ShortsVideo` composition, useful standalone
4. **Head tracking last** — still the most complex, but now much simpler without pyannote/GPU
5. **9:16 upgrade** — trivial once head tracking provides face data

---

## Cost Analysis

### Per-Video Processing Cost (30-min video, all features)

| Component | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| Transcription + diarization | $0.07 compute | **$0.16** (Deepgram) |
| LLM viral detection | $0.005 (Gemini) | **$0.01** (GigaChat) |
| Face detection | $0.002 compute | $0.002 compute (unchanged) |
| FFmpeg/Remotion render | $0.02 compute | $0.02 compute |
| **Total per video** | **$0.11** | **$0.20** |

### Monthly Cost Comparison

| Scale | v1 (Local ML) | v2 (API-First) |
|---|---|---|
| 100 videos/month | $11 compute + server + $0-380 GPU | **$20 APIs + server** |
| 500 videos/month | $55 + $200-380 GPU = $255-435 | **$100 APIs + server** |
| 1,000 videos/month | $110 + $380 GPU = $490 | **$200 APIs + server** |
| 5,000 videos/month | $550 + $380 GPU = $930 | **$1,000 APIs + server** |

**Breakeven:** roughly 2,200-4,200 videos/month — the point where the $0.09/video API premium equals the $200-380/month GPU cost. Below that, APIs are cheaper.

### Suggested SaaS Pricing Tiers

| Tier | Price | Limits | Cost/Video | Margin |
|---|---|---|---|---|
| Free | $0 | 10-min videos, 5/month | ~$0.07 | Marketing |
| Pro | $15-30/mo | 30-min videos, 50/month | ~$0.20 | 50-70% |
| Business | $50-100/mo | 60-min videos, 200/month | ~$0.35 | 65-80% |

---

## Infrastructure (v2 — Simplified)

### Architecture

```
Frontend → Backend API → Dramatiq Worker (lightweight: MediaPipe only)
               │                  │
               ├─ PostgreSQL      ├─ Deepgram API  (transcription + diarization)
               ├─ Redis           ├─ GigaChat API  (viral detection)
               ├─ S3/MinIO        └─ DeepInfra     (fallback LLM)
               └─ Remotion
```

### Docker Image

| | v1 | v2 |
|---|---|---|
| Base | python:3.11-slim + PyTorch + Whisper + CUDA libs | python:3.11-slim + mediapipe |
| Size | 1.72 GB | **~400-500 MB** |
| RAM | 16 GB recommended | **4 GB sufficient** |

**Dependency cleanup:** once Deepgram fully replaces Whisper, move `openai-whisper` (and transitively PyTorch) out of the default dependency set in `pyproject.toml`, keeping it as an optional group (`uv sync --group whisper`) for the fallback engine.

### No ML Service Separation Needed

With only MediaPipe (~30 MB, ~400 MB RAM) running locally, there is no need for:

- Separate ML worker container
- Docker Compose profiles for ML
- GPU infrastructure
- Dedicated Dramatiq queues for ML

Standard worker with `--processes 1 --threads 2` handles everything.

### New Settings

```python
# Deepgram
deepgram_api_key: str = Field(default="", alias="DEEPGRAM_API_KEY")

# GigaChat (Sber)
gigachat_client_id: str = Field(default="", alias="GIGACHAT_CLIENT_ID")
gigachat_client_secret: str = Field(default="", alias="GIGACHAT_CLIENT_SECRET")

# DeepInfra (fallback LLM)
deepinfra_api_key: str = Field(default="", alias="DEEPINFRA_API_KEY")

# LLM config
llm_provider: str = Field(default="gigachat", alias="LLM_PROVIDER")  # gigachat | deepinfra
llm_viral_prompt_version: str = Field(default="v1", alias="LLM_VIRAL_PROMPT_VERSION")
```

---

## Technology Stack Summary

### New Dependencies (v2)

| Package | Size | Purpose | Feature |
|---|---|---|---|
| `mediapipe` | ~30 MB | Face detection (CPU) | 3 |
| `httpx` | Already installed | API calls to Deepgram, GigaChat, DeepInfra | 2, 3 |
| **Total new deps** | **~30 MB** | | |

### Removed Dependencies (vs v1)

| Package | Size Saved | Was For |
|---|---|---|
| ~~`openai-whisper`~~ | ~50 MB + PyTorch ~2 GB | Transcription (replaced by Deepgram) |
| ~~`pyannote-audio`~~ | ~200 MB | Diarization (replaced by Deepgram) |
| ~~`torchaudio`~~ | ~50-80 MB | pyannote dependency |
| ~~`librosa`~~ | ~20 MB | Audio energy (replaced by Deepgram sentiment) |
| **Total removed** | **~2.3 GB** | |

### New Backend Modules

| Module | Purpose | Feature |
|---|---|---|
| `clips` | Clip CRUD, review workflow | 2 |

### New Remotion Compositions

| Composition | Purpose | Feature |
|---|---|---|
| `ShortsVideo` | Static/keyframe crop + captions at 9:16 | 4 |
| `AutoEditVideo` | Face-tracking dynamic crop + captions | 3 |

### New Job Types

| Job Type | Purpose | Feature |
|---|---|---|
| `VIRAL_DETECT` | GigaChat analysis of transcription | 2 |
| `ASPECT_CONVERT` | 9:16 crop + re-encode | 4 |
| `FACE_DETECT` | Face bounding box detection (MediaPipe) | 3 |

Note: `SPEAKER_DIARIZE` is **no longer a separate job type** — diarization is included in Deepgram transcription.

### Transcription Engine Extension

```python
# Extend existing engine selection:
engine: Literal["whisper", "google", "deepgram"] = "deepgram"
```

Deepgram becomes the default. Whisper remains as an optional fallback (requires `uv sync --group whisper`).

---

## Cross-Cutting Issues (v2)

### Remaining from v1

| Issue | Priority | Action |
|---|---|---|
| `_get_job_status_sync()` leaks DB connections | High | Fix before adding more actors |
| `tasks/service.py` at 1,674 lines, will exceed 2K | Medium | Extract actor boilerplate |
| Worker `REMOTION_SERVICE_URL` default wrong | Medium | Fix to `http://remotion:3001` |
| No resource limits on Docker services | Medium | Add memory/CPU limits |
| No temp file cleanup on OOM crash | Medium | Add periodic cleanup |
| `isCurrent` word identity check in Captions.tsx fragile | Low | Compare by index |

### New in v2

| Issue | Priority | Action |
|---|---|---|
| API key management (3 services) | High | All via env vars in settings, never in code |
| API rate limit handling | High | Retry with exponential backoff in all actors |
| API vendor lock-in | Medium | Abstract behind engine interface (existing pattern) |
| Network dependency (API downtime) | Medium | Keep Whisper as optional fallback engine |
| Deepgram → Document schema conversion | Medium | Build converter to match existing `Document` structure |
| GigaChat OAuth2 token refresh | Medium | Token caching with auto-refresh in `infrastructure/` |

### Eliminated from v1

| ~~Issue~~ | Why Gone |
|---|---|
| ~~PyTorch CPU-only index~~ | PyTorch removed entirely |
| ~~Worker OOM with ML jobs~~ | No heavy ML locally |
| ~~ML worker Docker image~~ | Single lightweight image |
| ~~GPU infrastructure~~ | All ML is API-based |
| ~~PyTorch version conflicts~~ | No PyTorch |
| ~~Model downloads on first run~~ | No local models |

---

## Specialist Reports (Full Transcripts)

Full specialist outputs are available in the session transcript. Key files each specialist examined:

- **ML Engineer:** `cpv3/modules/transcription/service.py`, `cpv3/modules/tasks/service.py`, `pyproject.toml`
- **Backend Architect:** `cpv3/modules/tasks/service.py`, `cpv3/modules/jobs/schemas.py`, `cpv3/modules/media/service.py`, `cpv3/modules/captions/service.py`, `docker-compose.yml`
- **Remotion Engineer:** `remotion_service/src/components/Composition.tsx`, `Captions.tsx`, `Root.tsx`, `useCaptions.ts`, `useVideoMeta.ts`, all type definitions
- **Frontend Architect:** `src/widgets/TimelinePanel/`, `src/features/project/FragmentsStep/`, `src/shared/context/WizardContext.tsx`, `src/shared/store/notifications/`
- **DevOps Engineer:** `docker-compose.yml`, `Dockerfile`, `pyproject.toml`, `uv.lock`
- **Performance Engineer:** `cpv3/modules/tasks/service.py`, `cpv3/modules/media/service.py`, `cpv3/modules/transcription/service.py`, `docker-compose.yml`

Note: Specialist reports were produced for the v1 architecture (local ML). Their recommendations for Remotion compositions, backend module design, frontend components, and crop data formats remain valid in v2. The infrastructure and ML model recommendations are superseded by the API-first approach.