docs: add backend comprehensive refactor design spec

Covers decomposition of tasks/service.py, shared CRUD/auth helpers
in common/, permission fixes, error handling, naming, and layering
consistency across all backend modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Daniil
2026-03-15 23:07:22 +03:00
parent dc04efe0fb
commit 2944040ac7
@@ -0,0 +1,360 @@
# Backend Comprehensive Refactor — Design Spec
**Date:** 2026-03-15
**Scope:** `cofee_backend/` — all modules, common layer, infrastructure layer
**Approach:** Top-down (decompose tasks/ first, then extract shared helpers, then apply across modules)
## Problem Summary
The backend has grown feature-by-feature and accumulated structural debt:
1. **`tasks/service.py`** is 1673 lines mixing four concerns: broker infrastructure, actor lifecycle, artifact persistence, and task submission orchestration. Eight actors repeat an identical ~40-line lifecycle skeleton.
2. **CRUD modules** repeat the same patterns: repository update loops (`model_dump` + commit + refresh), router permission checks (fetch-or-404, owner-or-staff, raise 403), inline error strings.
3. **Auth/ownership** is inconsistent — some modules enforce ownership, others don't.
4. **Layering** is inconsistent — some routers bypass the service layer and call repositories directly.
5. **Error messages** are inline strings instead of `ERROR_` constants.
6. **Dead code**, naming issues, and minor bugs exist across several modules.
## Guiding Principles
- **DRY:** Extract repeated patterns into small free functions, not base classes
- **KISS:** Helpers are plain functions — no magic, no decorators, no hidden behavior
- **Developer-readable:** Every call site shows what it does; no hidden abstractions
- **Preserve structure:** Keep the existing module pattern; allow extra files only when a single `service.py` would exceed ~300 lines
- **Shared helpers in `common/`:** Application-level helpers (DB update, HTTP permission checks)
- **Infrastructure for infrastructure:** Only broker/transport concerns go in `infrastructure/`
## Phase 1: Decompose `tasks/service.py`
### 1.1 New file layout
```
cpv3/modules/tasks/
├── __init__.py (unchanged)
├── schemas.py (unchanged)
├── router.py (unchanged)
├── service.py TaskService class only: submit_* methods + record_webhook_event + cancel_job
├── actors.py NEW: all @dramatiq.actor functions, each using _run_actor()
├── artifacts.py NEW: _save_transcription_artifacts, _save_convert_artifacts, _save_captions_artifacts
├── webhooks.py NEW: webhook transport (_send_webhook_event, _build_webhook_url, event helpers)
└── constants.py NEW: ERROR_, MESSAGE_, STATUS_ constants, JobCancelledError, all in Russian
```
New infrastructure file:
```
cpv3/infrastructure/dramatiq.py
- RedisBroker setup and initialization
- _serialize_broker_reference() / _parse_broker_reference()
- _get_job_status_sync() (raw psycopg2 cancellation check)
- _raise_if_job_cancelled()
```
### 1.2 The `_run_actor()` pattern
One helper function replaces the repeated lifecycle skeleton across all 8 actors:
```python
def _run_actor(
job_id: str,
webhook_url: str,
work_fn: Callable[..., dict[str, Any]],
*,
starting_message: str = MESSAGE_STARTING,
**work_kwargs: Any,
) -> None:
"""Shared actor lifecycle: cancel-check → RUNNING → work → DONE/FAILED."""
job_uuid = uuid.UUID(job_id)
try:
_raise_if_job_cancelled(job_uuid)
except JobCancelledError:
logger.info("Actor cancelled before start: %s", job_uuid)
return
_send_webhook_event(webhook_url, TaskWebhookEvent(
status=JOB_STATUS_RUNNING,
current_message=starting_message,
started_at=_utc_now(),
))
try:
result = work_fn(job_uuid=job_uuid, webhook_url=webhook_url, **work_kwargs)
_send_webhook_event(webhook_url, TaskWebhookEvent(
status=JOB_STATUS_DONE,
output=result,
finished_at=_utc_now(),
))
except JobCancelledError:
logger.info("Actor cancelled mid-run: %s", job_uuid)
except Exception as exc:
logger.exception("Actor failed: %s", job_uuid)
_send_webhook_event(webhook_url, TaskWebhookEvent(
status=JOB_STATUS_FAILED,
error_message=str(exc),
finished_at=_utc_now(),
))
```
Each actor becomes ~5-15 lines defining only its unique work function and parameters.
### 1.3 Artifact persistence → owning modules
Move artifact-saving logic to the modules that own those domain concepts:
- `_save_transcription_artifacts()``transcription/service.py` as `TranscriptionService.save_artifacts()`
- `_save_convert_artifacts()``media/service.py` as `MediaService.save_artifacts()`
- `_save_captions_artifacts()``captions/service.py` as `CaptionService.save_artifacts()`
`TaskService.record_webhook_event()` dispatches to these module-owned methods.
### 1.4 What stays in `tasks/service.py`
Only the `TaskService` class (~300-400 lines):
- `__init__()` with session + job repo
- `submit_media_probe()`, `submit_silence_detect()`, `submit_silence_remove()`, `submit_convert()`, `submit_frame_extract()`, `submit_transcription_generate()`, `submit_captions_generate()`
- `record_webhook_event()` — orchestration only, delegates artifact saving
- `cancel_job()`
### 1.5 Constants language
All user-facing constants in `tasks/constants.py` are in Russian. The current mix of English and Russian (`tasks/service.py:93-103`) is unified to Russian.
## Phase 2: Shared Helpers in `common/`
### 2.1 `common/db.py` — Repository helpers
Three free functions:
```python
async def apply_partial_update(instance: object, data: BaseModel) -> None:
"""Apply non-None fields from a Pydantic schema to a SQLAlchemy model."""
async def commit_and_refresh(session: AsyncSession, instance: object) -> None:
"""Commit the session and refresh the instance."""
async def soft_delete(
session: AsyncSession,
instance: object,
*,
field: str = "is_active",
value: bool = False,
) -> None:
"""Mark an instance as soft-deleted and commit.
Supports both is_active=False and is_deleted=True patterns."""
```
Soft-delete documentation comment explains which modules use which pattern.
No unification of `is_active` vs `is_deleted` in this refactor — that requires a data migration.
### 2.2 `common/http.py` — Router helpers
Two free functions + two error constants:
```python
ERROR_NOT_FOUND = "Объект не найден"
ERROR_FORBIDDEN = "Доступ запрещён"
def ensure_found(instance: object | None, detail: str = ERROR_NOT_FOUND) -> None:
"""Raise 404 if instance is None."""
def ensure_owner_or_staff(
instance: object,
user: User,
*,
owner_field: str = "owner_id",
) -> None:
"""Raise 403 if user is not staff and doesn't own the resource."""
```
## Phase 3: Apply Shared Helpers Across All CRUD Modules
### 3.1 Repository updates (7 modules)
Replace `model_dump` loop + commit + refresh with `apply_partial_update()` + `commit_and_refresh()`:
- `files/repository.py`
- `media/repository.py` (×2: MediaFile + Artifact)
- `projects/repository.py`
- `jobs/repository.py` (×2: Job + JobEvent)
- `webhooks/repository.py`
- `transcription/repository.py`
- `captions/repository.py` (special case: `style_config` nested serialization stays explicit)
### 3.2 Router permission checks (7 modules)
Replace inline 404/403 checks with `ensure_found()` + `ensure_owner_or_staff()`:
- `users/router.py` (3 sites)
- `projects/router.py` (3 sites)
- `files/router.py` (3 sites + storage path check)
- `media/router.py` (6 sites: mediafiles + artifacts)
- `jobs/router.py` (6 sites: jobs + events)
- `webhooks/router.py` (3 sites)
- `captions/router.py` (2 sites)
### 3.3 Add missing permission checks
While applying helpers, add ownership checks where currently absent:
| Module | Current state | Fix |
|--------|--------------|-----|
| `transcription/router.py` | No ownership checks | Add `ensure_owner_or_staff()` via parent project |
| `media/router.py` artifacts | `_ = current_user` (ignores user) | Add ownership check via parent media file |
| `captions/router.py` get-by-id | Returns any preset by ID | Only return if system preset or owned by requester |
| `jobs/router.py` events | No user scoping on events | Add ownership check via parent job |
## Phase 4: Error Handling, Naming, Dead Code, Security
### 4.1 Error message constants
Extract inline strings into `ERROR_` constants at the top of each file:
- `infrastructure/auth.py``ERROR_TOKEN_EXPIRED`, `ERROR_INVALID_TOKEN`, `ERROR_USER_NOT_FOUND`
- `users/service.py``ERROR_EMAIL_ALREADY_EXISTS`, `ERROR_INVALID_CREDENTIALS`
- `users/router.py``ERROR_PASSWORD_MISMATCH`, `ERROR_INCORRECT_PASSWORD`
- `files/router.py``ERROR_FILE_TOO_LARGE`, `ERROR_EMPTY_FILE`
- `media/service.py``ERROR_PROBE_FAILED`, `ERROR_NO_VIDEO_STREAM`, `ERROR_FFMPEG_FAILED`
### 4.2 Naming fix
Rename `project_pct``progress_pct` in:
- `jobs/models.py`
- `jobs/schemas.py` (3 occurrences)
- Alembic migration for the DB column rename
- Check frontend for field dependency before applying
### 4.3 Dead code removal
| Dead code | Location |
|-----------|----------|
| Legacy service function exports | `users/service.py:60-105`, `projects/service.py:46-69` |
| `WordOptions.max_line_count` | `transcription/schemas.py:84-88` |
| Duplicate `import json` | `media/service.py:7` (top-level unused) |
| `filename_without_ext` variable | `media/service.py:396, 434` |
### 4.4 Notification completeness
Add missing entries to `notifications/service.py`:
- `JOB_TYPE_LABELS`: add `FRAME_EXTRACT` (Russian label)
- `STATUS_TITLES`: add `CANCELLED` (Russian label)
- All notification messages must be in Russian
### 4.5 Duplicate WebSocket auth
Extract `decode_user_id_from_token(token: str) -> uuid.UUID` in `infrastructure/auth.py`.
Both `get_current_user()` and `notifications/router.py` WebSocket handler reuse it.
WebSocket handler keeps its own session management.
### 4.6 Settings fix
Change `remotion_service_url` default from `http://localhost:8001` to `http://localhost:3001` in `infrastructure/settings.py`.
### 4.7 Path traversal fix
Add root boundary check in `files/router.py` local file serving:
```python
full_path = (settings.local_storage_dir / file_path).resolve()
root = Path(settings.local_storage_dir).resolve()
if not str(full_path).startswith(str(root)):
raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=ERROR_FORBIDDEN)
```
### 4.8 Mutable default fix
Change `media/schemas.py:138` from `streams: list[StreamSchema] = []` to `streams: list[StreamSchema] = Field(default_factory=list)`.
## Phase 5: Consistent Layering and Tests
### 5.1 Fix Router → Service → Repository layering
| Module | Current state | Fix |
|--------|--------------|-----|
| `media/router.py` | CRUD calls `MediaFileRepository` directly | Route through `MediaService` — add thin CRUD wrappers |
| `transcription/router.py` | All routes call `TranscriptionRepository` directly | Route through `TranscriptionService` — add thin CRUD wrappers |
| `notifications/router.py` | Uses `NotificationRepository` directly | Route through `NotificationService` — add list/get/mark_read |
Service methods are pass-through calls to the repository. No new business logic invented. Value is consistent architecture.
### 5.2 Test additions
| Test | Reason |
|------|--------|
| Unit test `_run_actor()` lifecycle (cancel/success/failure) | Core new abstraction |
| Unit test `ensure_found()` / `ensure_owner_or_staff()` | Shared helpers |
| Unit test `apply_partial_update()` / `commit_and_refresh()` | Shared helpers |
| Integration tests for new ownership checks (transcription, media artifacts, caption presets, job events) | New permission enforcement |
| Test path traversal prevention on local file route | Security fix |
No full test rewrites. No new integration test infrastructure.
### 5.3 Update `CLAUDE.md`
Reflect the new architecture:
- Module file rule updated: extra files allowed when `service.py` would exceed ~300 lines
- Document `common/db.py` and `common/http.py`
- Document `infrastructure/dramatiq.py`
- Document `ensure_found` / `ensure_owner_or_staff` as the standard router permission pattern
- Document `apply_partial_update` / `commit_and_refresh` as the standard repository update pattern
## Files Changed Summary
### New files
| File | Purpose |
|------|---------|
| `cpv3/common/db.py` | Repository helpers: apply_partial_update, commit_and_refresh, soft_delete |
| `cpv3/common/http.py` | Router helpers: ensure_found, ensure_owner_or_staff, ERROR_ constants |
| `cpv3/infrastructure/dramatiq.py` | RedisBroker setup, broker-id serialization, cancellation check |
| `cpv3/modules/tasks/actors.py` | All @dramatiq.actor functions + _run_actor() lifecycle helper |
| `cpv3/modules/tasks/artifacts.py` | Artifact persistence helpers (moved from tasks/service.py) |
| `cpv3/modules/tasks/webhooks.py` | Webhook transport helpers |
| `cpv3/modules/tasks/constants.py` | All ERROR_, MESSAGE_, STATUS_ constants (Russian) |
### Modified files (by phase)
**Phase 1 — tasks decomposition:**
- `cpv3/modules/tasks/service.py` — stripped to TaskService class only
- `cpv3/modules/transcription/service.py` — receives `save_artifacts()`
- `cpv3/modules/media/service.py` — receives `save_artifacts()`
- `cpv3/modules/captions/service.py` — receives `save_artifacts()`
**Phase 3 — apply shared helpers:**
- All 7 module `repository.py` files — use `apply_partial_update()` + `commit_and_refresh()`
- All 7 module `router.py` files — use `ensure_found()` + `ensure_owner_or_staff()`
**Phase 4 — error handling, cleanup:**
- `cpv3/infrastructure/auth.py` — error constants + shared token decoder
- `cpv3/infrastructure/settings.py` — fix remotion port default
- `cpv3/modules/users/service.py` — error constants + remove dead exports
- `cpv3/modules/users/router.py` — error constants
- `cpv3/modules/projects/service.py` — remove dead exports
- `cpv3/modules/files/router.py` — path traversal fix + error constants
- `cpv3/modules/media/service.py` — error constants + dead code removal
- `cpv3/modules/media/schemas.py` — mutable default fix
- `cpv3/modules/transcription/schemas.py` — remove dead `max_line_count`
- `cpv3/modules/notifications/service.py` — complete label maps (Russian)
- `cpv3/modules/notifications/router.py` — use shared token decoder
- `cpv3/modules/jobs/models.py` — rename `project_pct``progress_pct`
- `cpv3/modules/jobs/schemas.py` — rename `project_pct``progress_pct`
**Phase 5 — layering + tests:**
- `cpv3/modules/media/router.py` — route through service
- `cpv3/modules/transcription/router.py` — route through service
- `cpv3/modules/notifications/router.py` — route through service
- `cofee_backend/CLAUDE.md` — updated architecture docs
- New test files for helpers and permission checks
### Alembic migration
One migration for `project_pct``progress_pct` column rename in jobs table.
## Risks and Mitigations
| Risk | Mitigation |
|------|-----------|
| Actor refactor breaks background processing | Test `_run_actor()` lifecycle thoroughly before deploying. Keep old actors available on a branch for rollback. |
| `progress_pct` rename breaks frontend | Check frontend API types for `project_pct` usage before applying migration. Coordinate with frontend update. |
| Missing auth checks break existing workflows | New permission checks only restrict access that was unintentionally open. Test with real user flows. |
| Shared helpers become a dumping ground | Keep `common/db.py` and `common/http.py` minimal. New helpers need clear justification. |