Files
main_backend/docs/superpowers/specs/2026-03-15-backend-comprehensive-refactor-design.md
T
Daniil 2944040ac7 docs: add backend comprehensive refactor design spec
Covers decomposition of tasks/service.py, shared CRUD/auth helpers
in common/, permission fixes, error handling, naming, and layering
consistency across all backend modules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 23:07:22 +03:00

15 KiB
Raw Blame History

Backend Comprehensive Refactor — Design Spec

Date: 2026-03-15 Scope: cofee_backend/ — all modules, common layer, infrastructure layer Approach: Top-down (decompose tasks/ first, then extract shared helpers, then apply across modules)

Problem Summary

The backend has grown feature-by-feature and accumulated structural debt:

  1. tasks/service.py is 1673 lines mixing four concerns: broker infrastructure, actor lifecycle, artifact persistence, and task submission orchestration. Eight actors repeat an identical ~40-line lifecycle skeleton.
  2. CRUD modules repeat the same patterns: repository update loops (model_dump + commit + refresh), router permission checks (fetch-or-404, owner-or-staff, raise 403), inline error strings.
  3. Auth/ownership is inconsistent — some modules enforce ownership, others don't.
  4. Layering is inconsistent — some routers bypass the service layer and call repositories directly.
  5. Error messages are inline strings instead of ERROR_ constants.
  6. Dead code, naming issues, and minor bugs exist across several modules.

Guiding Principles

  • DRY: Extract repeated patterns into small free functions, not base classes
  • KISS: Helpers are plain functions — no magic, no decorators, no hidden behavior
  • Developer-readable: Every call site shows what it does; no hidden abstractions
  • Preserve structure: Keep the existing module pattern; allow extra files only when a single service.py would exceed ~300 lines
  • Shared helpers in common/: Application-level helpers (DB update, HTTP permission checks)
  • Infrastructure for infrastructure: Only broker/transport concerns go in infrastructure/

Phase 1: Decompose tasks/service.py

1.1 New file layout

cpv3/modules/tasks/
├── __init__.py        (unchanged)
├── schemas.py         (unchanged)
├── router.py          (unchanged)
├── service.py         TaskService class only: submit_* methods + record_webhook_event + cancel_job
├── actors.py          NEW: all @dramatiq.actor functions, each using _run_actor()
├── artifacts.py       NEW: _save_transcription_artifacts, _save_convert_artifacts, _save_captions_artifacts
├── webhooks.py        NEW: webhook transport (_send_webhook_event, _build_webhook_url, event helpers)
└── constants.py       NEW: ERROR_, MESSAGE_, STATUS_ constants, JobCancelledError, all in Russian

New infrastructure file:

cpv3/infrastructure/dramatiq.py
    - RedisBroker setup and initialization
    - _serialize_broker_reference() / _parse_broker_reference()
    - _get_job_status_sync() (raw psycopg2 cancellation check)
    - _raise_if_job_cancelled()

1.2 The _run_actor() pattern

One helper function replaces the repeated lifecycle skeleton across all 8 actors:

def _run_actor(
    job_id: str,
    webhook_url: str,
    work_fn: Callable[..., dict[str, Any]],
    *,
    starting_message: str = MESSAGE_STARTING,
    **work_kwargs: Any,
) -> None:
    """Shared actor lifecycle: cancel-check → RUNNING → work → DONE/FAILED."""
    job_uuid = uuid.UUID(job_id)
    try:
        _raise_if_job_cancelled(job_uuid)
    except JobCancelledError:
        logger.info("Actor cancelled before start: %s", job_uuid)
        return

    _send_webhook_event(webhook_url, TaskWebhookEvent(
        status=JOB_STATUS_RUNNING,
        current_message=starting_message,
        started_at=_utc_now(),
    ))

    try:
        result = work_fn(job_uuid=job_uuid, webhook_url=webhook_url, **work_kwargs)
        _send_webhook_event(webhook_url, TaskWebhookEvent(
            status=JOB_STATUS_DONE,
            output=result,
            finished_at=_utc_now(),
        ))
    except JobCancelledError:
        logger.info("Actor cancelled mid-run: %s", job_uuid)
    except Exception as exc:
        logger.exception("Actor failed: %s", job_uuid)
        _send_webhook_event(webhook_url, TaskWebhookEvent(
            status=JOB_STATUS_FAILED,
            error_message=str(exc),
            finished_at=_utc_now(),
        ))

Each actor becomes ~5-15 lines defining only its unique work function and parameters.

1.3 Artifact persistence → owning modules

Move artifact-saving logic to the modules that own those domain concepts:

  • _save_transcription_artifacts()transcription/service.py as TranscriptionService.save_artifacts()
  • _save_convert_artifacts()media/service.py as MediaService.save_artifacts()
  • _save_captions_artifacts()captions/service.py as CaptionService.save_artifacts()

TaskService.record_webhook_event() dispatches to these module-owned methods.

1.4 What stays in tasks/service.py

Only the TaskService class (~300-400 lines):

  • __init__() with session + job repo
  • submit_media_probe(), submit_silence_detect(), submit_silence_remove(), submit_convert(), submit_frame_extract(), submit_transcription_generate(), submit_captions_generate()
  • record_webhook_event() — orchestration only, delegates artifact saving
  • cancel_job()

1.5 Constants language

All user-facing constants in tasks/constants.py are in Russian. The current mix of English and Russian (tasks/service.py:93-103) is unified to Russian.

Phase 2: Shared Helpers in common/

2.1 common/db.py — Repository helpers

Three free functions:

async def apply_partial_update(instance: object, data: BaseModel) -> None:
    """Apply non-None fields from a Pydantic schema to a SQLAlchemy model."""

async def commit_and_refresh(session: AsyncSession, instance: object) -> None:
    """Commit the session and refresh the instance."""

async def soft_delete(
    session: AsyncSession,
    instance: object,
    *,
    field: str = "is_active",
    value: bool = False,
) -> None:
    """Mark an instance as soft-deleted and commit.
    Supports both is_active=False and is_deleted=True patterns."""

Soft-delete documentation comment explains which modules use which pattern. No unification of is_active vs is_deleted in this refactor — that requires a data migration.

2.2 common/http.py — Router helpers

Two free functions + two error constants:

ERROR_NOT_FOUND = "Объект не найден"
ERROR_FORBIDDEN = "Доступ запрещён"

def ensure_found(instance: object | None, detail: str = ERROR_NOT_FOUND) -> None:
    """Raise 404 if instance is None."""

def ensure_owner_or_staff(
    instance: object,
    user: User,
    *,
    owner_field: str = "owner_id",
) -> None:
    """Raise 403 if user is not staff and doesn't own the resource."""

Phase 3: Apply Shared Helpers Across All CRUD Modules

3.1 Repository updates (7 modules)

Replace model_dump loop + commit + refresh with apply_partial_update() + commit_and_refresh():

  • files/repository.py
  • media/repository.py (×2: MediaFile + Artifact)
  • projects/repository.py
  • jobs/repository.py (×2: Job + JobEvent)
  • webhooks/repository.py
  • transcription/repository.py
  • captions/repository.py (special case: style_config nested serialization stays explicit)

3.2 Router permission checks (7 modules)

Replace inline 404/403 checks with ensure_found() + ensure_owner_or_staff():

  • users/router.py (3 sites)
  • projects/router.py (3 sites)
  • files/router.py (3 sites + storage path check)
  • media/router.py (6 sites: mediafiles + artifacts)
  • jobs/router.py (6 sites: jobs + events)
  • webhooks/router.py (3 sites)
  • captions/router.py (2 sites)

3.3 Add missing permission checks

While applying helpers, add ownership checks where currently absent:

Module Current state Fix
transcription/router.py No ownership checks Add ensure_owner_or_staff() via parent project
media/router.py artifacts _ = current_user (ignores user) Add ownership check via parent media file
captions/router.py get-by-id Returns any preset by ID Only return if system preset or owned by requester
jobs/router.py events No user scoping on events Add ownership check via parent job

Phase 4: Error Handling, Naming, Dead Code, Security

4.1 Error message constants

Extract inline strings into ERROR_ constants at the top of each file:

  • infrastructure/auth.pyERROR_TOKEN_EXPIRED, ERROR_INVALID_TOKEN, ERROR_USER_NOT_FOUND
  • users/service.pyERROR_EMAIL_ALREADY_EXISTS, ERROR_INVALID_CREDENTIALS
  • users/router.pyERROR_PASSWORD_MISMATCH, ERROR_INCORRECT_PASSWORD
  • files/router.pyERROR_FILE_TOO_LARGE, ERROR_EMPTY_FILE
  • media/service.pyERROR_PROBE_FAILED, ERROR_NO_VIDEO_STREAM, ERROR_FFMPEG_FAILED

4.2 Naming fix

Rename project_pctprogress_pct in:

  • jobs/models.py
  • jobs/schemas.py (3 occurrences)
  • Alembic migration for the DB column rename
  • Check frontend for field dependency before applying

4.3 Dead code removal

Dead code Location
Legacy service function exports users/service.py:60-105, projects/service.py:46-69
WordOptions.max_line_count transcription/schemas.py:84-88
Duplicate import json media/service.py:7 (top-level unused)
filename_without_ext variable media/service.py:396, 434

4.4 Notification completeness

Add missing entries to notifications/service.py:

  • JOB_TYPE_LABELS: add FRAME_EXTRACT (Russian label)
  • STATUS_TITLES: add CANCELLED (Russian label)
  • All notification messages must be in Russian

4.5 Duplicate WebSocket auth

Extract decode_user_id_from_token(token: str) -> uuid.UUID in infrastructure/auth.py. Both get_current_user() and notifications/router.py WebSocket handler reuse it. WebSocket handler keeps its own session management.

4.6 Settings fix

Change remotion_service_url default from http://localhost:8001 to http://localhost:3001 in infrastructure/settings.py.

4.7 Path traversal fix

Add root boundary check in files/router.py local file serving:

full_path = (settings.local_storage_dir / file_path).resolve()
root = Path(settings.local_storage_dir).resolve()
if not str(full_path).startswith(str(root)):
    raise HTTPException(status_code=status.HTTP_403_FORBIDDEN, detail=ERROR_FORBIDDEN)

4.8 Mutable default fix

Change media/schemas.py:138 from streams: list[StreamSchema] = [] to streams: list[StreamSchema] = Field(default_factory=list).

Phase 5: Consistent Layering and Tests

5.1 Fix Router → Service → Repository layering

Module Current state Fix
media/router.py CRUD calls MediaFileRepository directly Route through MediaService — add thin CRUD wrappers
transcription/router.py All routes call TranscriptionRepository directly Route through TranscriptionService — add thin CRUD wrappers
notifications/router.py Uses NotificationRepository directly Route through NotificationService — add list/get/mark_read

Service methods are pass-through calls to the repository. No new business logic invented. Value is consistent architecture.

5.2 Test additions

Test Reason
Unit test _run_actor() lifecycle (cancel/success/failure) Core new abstraction
Unit test ensure_found() / ensure_owner_or_staff() Shared helpers
Unit test apply_partial_update() / commit_and_refresh() Shared helpers
Integration tests for new ownership checks (transcription, media artifacts, caption presets, job events) New permission enforcement
Test path traversal prevention on local file route Security fix

No full test rewrites. No new integration test infrastructure.

5.3 Update CLAUDE.md

Reflect the new architecture:

  • Module file rule updated: extra files allowed when service.py would exceed ~300 lines
  • Document common/db.py and common/http.py
  • Document infrastructure/dramatiq.py
  • Document ensure_found / ensure_owner_or_staff as the standard router permission pattern
  • Document apply_partial_update / commit_and_refresh as the standard repository update pattern

Files Changed Summary

New files

File Purpose
cpv3/common/db.py Repository helpers: apply_partial_update, commit_and_refresh, soft_delete
cpv3/common/http.py Router helpers: ensure_found, ensure_owner_or_staff, ERROR_ constants
cpv3/infrastructure/dramatiq.py RedisBroker setup, broker-id serialization, cancellation check
cpv3/modules/tasks/actors.py All @dramatiq.actor functions + _run_actor() lifecycle helper
cpv3/modules/tasks/artifacts.py Artifact persistence helpers (moved from tasks/service.py)
cpv3/modules/tasks/webhooks.py Webhook transport helpers
cpv3/modules/tasks/constants.py All ERROR_, MESSAGE_, STATUS_ constants (Russian)

Modified files (by phase)

Phase 1 — tasks decomposition:

  • cpv3/modules/tasks/service.py — stripped to TaskService class only
  • cpv3/modules/transcription/service.py — receives save_artifacts()
  • cpv3/modules/media/service.py — receives save_artifacts()
  • cpv3/modules/captions/service.py — receives save_artifacts()

Phase 3 — apply shared helpers:

  • All 7 module repository.py files — use apply_partial_update() + commit_and_refresh()
  • All 7 module router.py files — use ensure_found() + ensure_owner_or_staff()

Phase 4 — error handling, cleanup:

  • cpv3/infrastructure/auth.py — error constants + shared token decoder
  • cpv3/infrastructure/settings.py — fix remotion port default
  • cpv3/modules/users/service.py — error constants + remove dead exports
  • cpv3/modules/users/router.py — error constants
  • cpv3/modules/projects/service.py — remove dead exports
  • cpv3/modules/files/router.py — path traversal fix + error constants
  • cpv3/modules/media/service.py — error constants + dead code removal
  • cpv3/modules/media/schemas.py — mutable default fix
  • cpv3/modules/transcription/schemas.py — remove dead max_line_count
  • cpv3/modules/notifications/service.py — complete label maps (Russian)
  • cpv3/modules/notifications/router.py — use shared token decoder
  • cpv3/modules/jobs/models.py — rename project_pctprogress_pct
  • cpv3/modules/jobs/schemas.py — rename project_pctprogress_pct

Phase 5 — layering + tests:

  • cpv3/modules/media/router.py — route through service
  • cpv3/modules/transcription/router.py — route through service
  • cpv3/modules/notifications/router.py — route through service
  • cofee_backend/CLAUDE.md — updated architecture docs
  • New test files for helpers and permission checks

Alembic migration

One migration for project_pctprogress_pct column rename in jobs table.

Risks and Mitigations

Risk Mitigation
Actor refactor breaks background processing Test _run_actor() lifecycle thoroughly before deploying. Keep old actors available on a branch for rollback.
progress_pct rename breaks frontend Check frontend API types for project_pct usage before applying migration. Coordinate with frontend update.
Missing auth checks break existing workflows New permission checks only restrict access that was unintentionally open. Test with real user flows.
Shared helpers become a dumping ground Keep common/db.py and common/http.py minimal. New helpers need clear justification.