remotion_service/docs/superpowers/plans/2026-04-03-salutespeech-transcription.md
2026-04-03 23:47:58 +03:00

SaluteSpeech Transcription Engine — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add SaluteSpeech (Sber) as a third transcription engine with async REST API, domain-specific models, and word-level timestamps.

Architecture: Direct integration following the existing engine pattern — plain functions in transcription/service.py, if/elif dispatch in the Dramatiq actor, no new abstractions. SaluteSpeech uses a five-step REST flow (auth → upload → create task → poll → download) with a thread-safe OAuth token cache.

Tech Stack: Python, httpx (sync), FastAPI, Dramatiq, React/TypeScript

Spec: docs/superpowers/specs/2026-04-03-salutespeech-transcription-design.md
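The token cache mentioned above hinges on one subtle conversion that Task 6 implements: the OAuth endpoint returns an absolute wall-clock expiry (`expires_at`, Unix milliseconds), which must be turned into a monotonic-clock deadline so system clock adjustments cannot invalidate the cache check. A minimal standalone sketch (function names are hypothetical; the margin constant matches Task 6):

```python
import time

TOKEN_REFRESH_MARGIN_SECONDS = 60


def deadline_from_expires_at(expires_at_ms: float) -> float:
    """Convert an absolute wall-clock expiry (Unix ms) into a monotonic
    deadline: compute seconds-until-expiry on the wall clock, then anchor
    that duration to the monotonic clock."""
    expires_in = (expires_at_ms / 1000) - time.time()
    return time.monotonic() + expires_in


def token_is_fresh(deadline: float) -> bool:
    """Treat a cached token as fresh only until 60 s before its deadline,
    so a token is never used moments before it expires."""
    return time.monotonic() < deadline - TOKEN_REFRESH_MARGIN_SECONDS
```

This mirrors the `_get_salute_access_token` logic in Task 6, Step 3.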


File Map

| File | Action | Responsibility |
|------|--------|----------------|
| cofee_backend/.certs/russian_trusted_root_ca.pem | Create | Russian CA certificate for TLS |
| cofee_backend/cpv3/infrastructure/settings.py | Modify | 3 new SaluteSpeech settings fields |
| cofee_backend/cpv3/modules/transcription/schemas.py | Modify | New schema types, extend engine enum + type unions |
| cofee_backend/cpv3/modules/transcription/service.py | Modify | ~8 new functions for SaluteSpeech flow |
| cofee_backend/cpv3/modules/transcription/router.py | Modify | Direct /salute-speech/ endpoint |
| cofee_backend/cpv3/modules/tasks/schemas.py | Modify | Extend engine Literal |
| cofee_backend/cpv3/modules/tasks/service.py | Modify | ENGINE_MAP + elif dispatch branch |
| cofee_frontend/src/features/project/TranscriptionModal/TranscriptionModal.tsx | Modify | Engine option, split model options, engine change effect |
| cofee_frontend/src/features/project/TranscriptionSettingsStep/TranscriptionSettingsStep.tsx | Modify | Same as TranscriptionModal |
| cofee_backend/tests/integration/test_salutespeech_parsing.py | Create | Unit tests for timestamp parsing + result conversion |

Task 1: Bundle TLS Certificate

Files:

  • Create: cofee_backend/.certs/russian_trusted_root_ca.pem

  • Step 1: Create .certs directory

mkdir -p cofee_backend/.certs
  • Step 2: Download the Russian root CA certificate
curl -k "https://gu-st.ru/content/Other/doc/russian_trusted_root_ca.cer" \
  -o cofee_backend/.certs/russian_trusted_root_ca.pem
  • Step 3: Verify the cert is valid PEM format
openssl x509 -in cofee_backend/.certs/russian_trusted_root_ca.pem -noout -subject -dates

Expected: prints subject (Russian CA) and validity dates without errors. If the file is DER format instead of PEM, convert:

# Write to a temp file first: passing the same path to -in and -out risks truncating the input.
openssl x509 -inform DER -in cofee_backend/.certs/russian_trusted_root_ca.pem \
  -out cofee_backend/.certs/russian_trusted_root_ca.pem.tmp -outform PEM \
  && mv cofee_backend/.certs/russian_trusted_root_ca.pem.tmp \
        cofee_backend/.certs/russian_trusted_root_ca.pem
  • Step 4: Add to .gitignore exclusion

The .certs/ directory should NOT be gitignored — this is a public root CA, safe to commit. Verify it's not caught by any existing gitignore pattern:

cd cofee_backend && git check-ignore .certs/russian_trusted_root_ca.pem

Expected: no output and exit status 1 (git check-ignore exits non-zero when the path is not ignored).

  • Step 5: Commit
git add cofee_backend/.certs/russian_trusted_root_ca.pem
git commit -m "chore(backend): bundle Russian root CA cert for SaluteSpeech TLS"

Task 2: Add SaluteSpeech Settings

Files:

  • Modify: cofee_backend/cpv3/infrastructure/settings.py:97 (after webhook_base_url field)

  • Step 1: Add 3 new fields to Settings class

In cofee_backend/cpv3/infrastructure/settings.py, after the webhook_base_url field (line 97) and before def get_database_url(self) (line 99), add:

    # SaluteSpeech
    salute_auth_key: str = Field(default="", alias="SALUTE_AUTH_KEY")
    salute_ca_cert_path: Path | None = Field(
        default=None, alias="SALUTE_CA_CERT_PATH"
    )
    salute_scope: str = Field(
        default="SALUTE_SPEECH_PERS", alias="SALUTE_SCOPE"
    )
  • Step 2: Verify settings load without errors
cd cofee_backend && uv run python -c "from cpv3.infrastructure.settings import get_settings; s = get_settings(); print(f'salute_auth_key={s.salute_auth_key!r}, salute_ca_cert_path={s.salute_ca_cert_path!r}, salute_scope={s.salute_scope!r}')"

Expected: salute_auth_key='', salute_ca_cert_path=None, salute_scope='SALUTE_SPEECH_PERS'
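For local runs these fields are supplied via environment variables (e.g. in a .env file). The values below are placeholders; SALUTE_AUTH_KEY is the base64 authorization key issued by the SaluteSpeech developer portal (an assumption — verify against your credentials):

```
SALUTE_AUTH_KEY=<base64 authorization key from the SaluteSpeech developer portal>
SALUTE_CA_CERT_PATH=.certs/russian_trusted_root_ca.pem
SALUTE_SCOPE=SALUTE_SPEECH_PERS
```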

  • Step 3: Commit
git add cofee_backend/cpv3/infrastructure/settings.py
git commit -m "feat(backend): add SaluteSpeech settings (auth key, cert path, scope)"

Task 3: Add SaluteSpeech Schemas

Files:

  • Modify: cofee_backend/cpv3/modules/transcription/schemas.py:10 (engine enum) and after line 147 (EOF, new classes)

  • Step 1: Extend TranscriptionEngineEnum

In cofee_backend/cpv3/modules/transcription/schemas.py, line 10, change:

TranscriptionEngineEnum = Literal["LOCAL_WHISPER", "GOOGLE_SPEECH_CLOUD"]

to:

TranscriptionEngineEnum = Literal["LOCAL_WHISPER", "GOOGLE_SPEECH_CLOUD", "SALUTE_SPEECH"]
  • Step 2: Add SaluteSpeech schema classes

After the GoogleSpeechParams class (line 147, end of file), add:



# ---------------------------------- SaluteSpeech Models ----------------------------------


class SaluteSpeechWord(Schema):
    word: str
    start: float
    end: float


class SaluteSpeechSegment(Schema):
    text: str
    start: float
    end: float
    words: list[SaluteSpeechWord] = []


class SaluteSpeechResult(Schema):
    text: str
    segments: list[SaluteSpeechSegment]
    language: str


class SaluteSpeechParams(Schema):
    file_path: str
    language: str | None = None
    model: str = "general"
  • Step 3: Verify schemas import correctly
cd cofee_backend && uv run python -c "from cpv3.modules.transcription.schemas import SaluteSpeechWord, SaluteSpeechSegment, SaluteSpeechResult, SaluteSpeechParams, TranscriptionEngineEnum; print('OK')"

Expected: OK

  • Step 4: Commit
git add cofee_backend/cpv3/modules/transcription/schemas.py
git commit -m "feat(backend): add SaluteSpeech schema types and extend engine enum"

Task 4: Extend Type Unions in Service

Files:

  • Modify: cofee_backend/cpv3/modules/transcription/service.py:44 and service.py:222 (type unions)

  • Modify: cofee_backend/cpv3/modules/transcription/service.py imports (top of file)

  • Step 1: Add SaluteSpeech imports

In cofee_backend/cpv3/modules/transcription/service.py, add SaluteSpeechSegment to the existing from cpv3.modules.transcription.schemas import (...) block near the top of the file:

from cpv3.modules.transcription.schemas import (
    Document,
    GoogleSpeechResult,
    GoogleSpeechSegment,
    GoogleSpeechWord,
    LineNode,
    SaluteSpeechSegment,
    SegmentNode,
    Tag,
    TimeRange,
    WhisperResult,
    WhisperSegment,
    WhisperWord,
    WordNode,
    WordOptions,
)
  • Step 2: Extend compute_segment_lines type hint

At line 44, change:

    def compute_segment_lines(
        self, segment: WhisperSegment | GoogleSpeechSegment, max_chars_per_line: int
    ) -> list[LineNode]:

to:

    def compute_segment_lines(
        self,
        segment: WhisperSegment | GoogleSpeechSegment | SaluteSpeechSegment,
        max_chars_per_line: int,
    ) -> list[LineNode]:
  • Step 3: Extend _make_document_from_segments type hint

At line 222, change:

def _make_document_from_segments(
    builder: DocumentBuilder,
    segments: list[WhisperSegment] | list[GoogleSpeechSegment],
    *,
    max_line_width: int,
) -> Document:

to:

def _make_document_from_segments(
    builder: DocumentBuilder,
    segments: list[WhisperSegment] | list[GoogleSpeechSegment] | list[SaluteSpeechSegment],
    *,
    max_line_width: int,
) -> Document:
  • Step 4: Run lint to verify
cd cofee_backend && uv run ruff check cpv3/modules/transcription/service.py

Expected: no errors.

  • Step 5: Commit
git add cofee_backend/cpv3/modules/transcription/service.py
git commit -m "feat(backend): extend type unions to accept SaluteSpeechSegment"

Task 5: Write Tests for SaluteSpeech Parsing

Files:

  • Create: cofee_backend/tests/integration/test_salutespeech_parsing.py

  • Step 1: Write the test file

Create cofee_backend/tests/integration/test_salutespeech_parsing.py:

"""Tests for SaluteSpeech result parsing and document building."""

from cpv3.modules.transcription.service import (
    _build_document_from_salute_result,
    _parse_salute_time,
)


class TestParseSaluteTime:
    def test_simple_timestamp(self):
        assert _parse_salute_time("0.480s") == 0.48

    def test_zero(self):
        assert _parse_salute_time("0.000s") == 0.0

    def test_large_timestamp(self):
        assert _parse_salute_time("123.456s") == 123.456

    def test_integer_timestamp(self):
        assert _parse_salute_time("5s") == 5.0


class TestBuildDocumentFromSaluteResult:
    def _make_raw_result(self):
        """Minimal SaluteSpeech API response for testing."""
        return [
            {
                "results": [
                    {
                        "text": "привет мир",
                        "normalized_text": "Привет мир.",
                        "start": "0.480s",
                        "end": "1.200s",
                        "word_alignments": [
                            {"word": "привет", "start": "0.480s", "end": "0.840s"},
                            {"word": "мир", "start": "0.960s", "end": "1.200s"},
                        ],
                    },
                    {
                        "text": "это тест",
                        "normalized_text": "Это тест.",
                        "start": "1.500s",
                        "end": "2.100s",
                        "word_alignments": [
                            {"word": "это", "start": "1.500s", "end": "1.700s"},
                            {"word": "тест", "start": "1.800s", "end": "2.100s"},
                        ],
                    },
                ],
                "channel": 0,
            }
        ]

    def test_returns_document_with_segments(self):
        raw = self._make_raw_result()
        doc = _build_document_from_salute_result(raw, language="ru-RU")
        assert len(doc.segments) == 2

    def test_segment_text(self):
        raw = self._make_raw_result()
        doc = _build_document_from_salute_result(raw, language="ru-RU")
        assert doc.segments[0].lines[0].text == "привет мир"

    def test_word_timestamps(self):
        raw = self._make_raw_result()
        doc = _build_document_from_salute_result(raw, language="ru-RU")
        first_word = doc.segments[0].lines[0].words[0]
        assert first_word.text == "привет"
        assert first_word.time.start == 0.48
        assert first_word.time.end == 0.84

    def test_segment_time_range(self):
        raw = self._make_raw_result()
        doc = _build_document_from_salute_result(raw, language="ru-RU")
        assert doc.segments[0].time.start == 0.48
        assert doc.segments[0].time.end == 1.2

    def test_empty_results(self):
        raw = [{"results": [], "channel": 0}]
        doc = _build_document_from_salute_result(raw, language="ru-RU")
        assert len(doc.segments) == 0

    def test_missing_word_alignments(self):
        raw = [
            {
                "results": [
                    {
                        "text": "привет",
                        "normalized_text": "Привет.",
                        "start": "0.000s",
                        "end": "0.500s",
                    }
                ],
                "channel": 0,
            }
        ]
        doc = _build_document_from_salute_result(raw, language="ru-RU")
        assert len(doc.segments) == 1
        # No words but segment still created
        assert doc.segments[0].time.start == 0.0
  • Step 2: Run tests to verify they fail
cd cofee_backend && uv run pytest tests/integration/test_salutespeech_parsing.py -v 2>&1 | head -20

Expected: ImportError, because _build_document_from_salute_result and _parse_salute_time don't exist yet.

  • Step 3: Commit test file
git add cofee_backend/tests/integration/test_salutespeech_parsing.py
git commit -m "test(backend): add SaluteSpeech parsing and document building tests"

Task 6: Implement SaluteSpeech Service Functions

Files:

  • Modify: cofee_backend/cpv3/modules/transcription/service.py (append after line 430)

This is the core task — all nine SaluteSpeech functions.

  • Step 1: Add new imports at top of file

In cofee_backend/cpv3/modules/transcription/service.py, add these imports at the top (after the existing imports, around line 10):

import threading
import time
import uuid
from pathlib import Path

import httpx

Note: time may already be imported — check and avoid duplicates. asyncio and anyio are already imported.

Also add to the schema imports block:

from cpv3.modules.transcription.schemas import (
    ...  # existing imports
    SaluteSpeechResult,
    SaluteSpeechSegment,
    SaluteSpeechWord,
    SaluteSpeechParams,
)
  • Step 2: Add constants and token cache

After the existing imports (before the DocumentBuilder class), add:

# ---------------------------------- SaluteSpeech Constants ----------------------------------

SALUTE_AUTH_URL = "https://ngw.devices.sberbank.ru:9443/api/v2/oauth"
SALUTE_API_BASE = "https://smartspeech.sber.ru/rest/v1"
SALUTE_POLL_INTERVAL_SECONDS = 5.0
SALUTE_POLL_TIMEOUT_SECONDS = 600
SALUTE_TOKEN_REFRESH_MARGIN_SECONDS = 60

SALUTE_ENCODING_MAP: dict[str, str] = {
    ".mp3": "MP3",
    ".wav": "PCM_S16LE",
    ".ogg": "opus",
    ".flac": "FLAC",
}

SALUTE_CONTENT_TYPE_MAP: dict[str, str] = {
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

SALUTE_LANGUAGE_MAP: dict[str, str] = {
    "ru": "ru-RU",
    "en": "en-US",
}

ERROR_SALUTE_AUTH_FAILED = "Ошибка авторизации SaluteSpeech: {detail}"
ERROR_SALUTE_UPLOAD_FAILED = "Ошибка загрузки файла в SaluteSpeech: {detail}"
ERROR_SALUTE_TASK_FAILED = "Ошибка распознавания SaluteSpeech: {detail}"
ERROR_SALUTE_TIMEOUT = "Превышено время ожидания распознавания SaluteSpeech"
ERROR_SALUTE_UNSUPPORTED_FORMAT = "Неподдерживаемый формат аудио для SaluteSpeech: {ext}"

_salute_token_lock = threading.Lock()
_salute_token: str | None = None
_salute_token_expires_at: float = 0.0
  • Step 3: Add helper functions

At the end of the file (after transcribe_with_google_speech), append all SaluteSpeech functions:

# ---------------------------------- SaluteSpeech Engine ----------------------------------


def _parse_salute_time(s: str) -> float:
    """Parse SaluteSpeech timestamp string '0.480s' → 0.48."""
    return float(s.rstrip("s"))


def _get_salute_access_token(client: httpx.Client) -> str:
    """Get or refresh SaluteSpeech OAuth token. Thread-safe."""
    global _salute_token, _salute_token_expires_at
    with _salute_token_lock:
        if _salute_token and time.monotonic() < (
            _salute_token_expires_at - SALUTE_TOKEN_REFRESH_MARGIN_SECONDS
        ):
            return _salute_token

        settings = get_settings()
        response = client.post(
            SALUTE_AUTH_URL,
            headers={
                "Authorization": f"Basic {settings.salute_auth_key}",
                "RqUID": str(uuid.uuid4()),
                "Content-Type": "application/x-www-form-urlencoded",
            },
            content=f"scope={settings.salute_scope}",
        )
        if response.status_code != 200:
            raise RuntimeError(
                ERROR_SALUTE_AUTH_FAILED.format(detail=response.text[:200])
            )
        data = response.json()
        _salute_token = data["access_token"]
        expires_in_seconds = (data["expires_at"] / 1000) - time.time()
        _salute_token_expires_at = time.monotonic() + expires_in_seconds
        return _salute_token


def _upload_salute_audio(
    client: httpx.Client, token: str, audio_data: bytes, content_type: str
) -> str:
    """Upload audio to SaluteSpeech, return request_file_id."""
    response = client.post(
        f"{SALUTE_API_BASE}/data:upload",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": content_type,
        },
        content=audio_data,
        timeout=120.0,
    )
    if response.status_code != 200:
        raise RuntimeError(
            ERROR_SALUTE_UPLOAD_FAILED.format(detail=response.text[:200])
        )
    return response.json()["result"]["request_file_id"]


def _create_salute_task(
    client: httpx.Client,
    token: str,
    file_id: str,
    *,
    language: str,
    model: str,
    audio_encoding: str,
    sample_rate: int,
) -> str:
    """Create async recognition task, return task_id."""
    body = {
        "options": {
            "audio_encoding": audio_encoding,
            "sample_rate": sample_rate,
            "language": language,
            "model": model,
            "channels_count": 1,
            "hypotheses_count": 1,
        },
        "request_file_id": file_id,
    }
    response = client.post(
        f"{SALUTE_API_BASE}/speech:async_recognize",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json=body,
    )
    if response.status_code != 200:
        raise RuntimeError(
            ERROR_SALUTE_TASK_FAILED.format(detail=response.text[:200])
        )
    return response.json()["result"]["id"]


def _poll_salute_task(
    client: httpx.Client,
    token: str,
    task_id: str,
    job_uuid: uuid.UUID | None,
    on_progress: ProgressCallback | None,
) -> str:
    """Poll task until DONE, return response_file_id. Checks job cancellation each iteration."""
    from cpv3.modules.tasks.service import _raise_if_job_cancelled

    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        if elapsed > SALUTE_POLL_TIMEOUT_SECONDS:
            raise TimeoutError(ERROR_SALUTE_TIMEOUT)

        if job_uuid is not None:
            _raise_if_job_cancelled(job_uuid)

        response = client.get(
            f"{SALUTE_API_BASE}/task:get",
            params={"id": task_id},
            headers={"Authorization": f"Bearer {token}"},
        )
        response.raise_for_status()
        result = response.json()["result"]
        status = result["status"]

        if status == "DONE":
            return result["response_file_id"]
        if status == "ERROR":
            error_msg = result.get("error", "unknown error")
            raise RuntimeError(
                ERROR_SALUTE_TASK_FAILED.format(detail=error_msg)
            )

        if on_progress is not None:
            pct = min(elapsed / SALUTE_POLL_TIMEOUT_SECONDS * 100, 95.0)
            on_progress(pct)

        time.sleep(SALUTE_POLL_INTERVAL_SECONDS)


def _download_salute_result(
    client: httpx.Client, token: str, response_file_id: str
) -> list[dict]:
    """Download recognition result JSON."""
    response = client.get(
        f"{SALUTE_API_BASE}/data:download",
        params={"response_file_id": response_file_id},
        headers={"Authorization": f"Bearer {token}"},
        timeout=60.0,
    )
    response.raise_for_status()
    return response.json()


def _build_document_from_salute_result(
    raw_channels: list[dict], *, language: str
) -> Document:
    """Convert SaluteSpeech result JSON to Document."""
    builder = DocumentBuilder()
    words_options = WordOptions()

    all_segments: list[SaluteSpeechSegment] = []

    for channel_data in raw_channels:
        for result_item in channel_data.get("results", []):
            word_alignments = result_item.get("word_alignments", [])
            words = [
                SaluteSpeechWord(
                    word=w["word"],
                    start=_parse_salute_time(w["start"]),
                    end=_parse_salute_time(w["end"]),
                )
                for w in word_alignments
            ]

            text = result_item.get("text", "")
            seg_start = _parse_salute_time(result_item["start"]) if words else 0.0
            seg_end = _parse_salute_time(result_item["end"]) if words else 0.0

            all_segments.append(
                SaluteSpeechSegment(
                    text=text,
                    start=seg_start,
                    end=seg_end,
                    words=words,
                )
            )

    document = _make_document_from_segments(
        builder, all_segments, max_line_width=words_options.max_line_width
    )
    return builder.process_document(document)


def _salute_transcribe_sync(
    *,
    local_file_path: str,
    language: str | None,
    model: str,
    sample_rate: int,
    job_id: uuid.UUID | None = None,
    on_progress: ProgressCallback | None = None,
) -> Document:
    """Synchronous SaluteSpeech transcription (runs in Dramatiq worker thread)."""
    settings = get_settings()

    ext = Path(local_file_path).suffix.lower()
    audio_encoding = SALUTE_ENCODING_MAP.get(ext)
    content_type = SALUTE_CONTENT_TYPE_MAP.get(ext)
    if not audio_encoding or not content_type:
        raise ValueError(ERROR_SALUTE_UNSUPPORTED_FORMAT.format(ext=ext))

    salute_language = SALUTE_LANGUAGE_MAP.get(language or "", "ru-RU")

    verify = str(settings.salute_ca_cert_path) if settings.salute_ca_cert_path else True
    with httpx.Client(verify=verify, timeout=30.0) as client:
        token = _get_salute_access_token(client)

        with open(local_file_path, "rb") as f:
            audio_data = f.read()

        file_id = _upload_salute_audio(client, token, audio_data, content_type)
        task_id = _create_salute_task(
            client,
            token,
            file_id,
            language=salute_language,
            model=model,
            audio_encoding=audio_encoding,
            sample_rate=sample_rate,
        )
        response_file_id = _poll_salute_task(
            client, token, task_id, job_id, on_progress
        )
        raw_result = _download_salute_result(client, token, response_file_id)

    return _build_document_from_salute_result(raw_result, language=salute_language)


async def transcribe_with_salute_speech(
    storage: StorageService,
    *,
    file_key: str,
    language: str | None = None,
    model: str = "general",
    sample_rate: int = 16000,
    job_id: uuid.UUID | None = None,
    on_progress: ProgressCallback | None = None,
) -> Document:
    """Async wrapper for SaluteSpeech transcription."""
    tmp = await storage.download_to_temp(file_key)
    try:
        return await anyio.to_thread.run_sync(
            lambda: _salute_transcribe_sync(
                local_file_path=tmp.path,
                language=language,
                model=model,
                sample_rate=sample_rate,
                job_id=job_id,
                on_progress=on_progress,
            )
        )
    finally:
        tmp.cleanup()
  • Step 4: Run the parsing tests
cd cofee_backend && uv run pytest tests/integration/test_salutespeech_parsing.py -v

Expected: all tests pass.
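One note on _parse_salute_time above: rstrip("s") strips a trailing character set rather than a literal suffix, but for the well-formed values SaluteSpeech returns ("<number>s") it behaves identically to removesuffix("s"). A quick check:

```python
# rstrip("s") removes all trailing "s" characters (a set), while
# removesuffix("s") removes at most one literal "s"; for "<number>s"
# inputs the results coincide.
s = "0.480s"
assert float(s.rstrip("s")) == 0.48
assert "5s".rstrip("s") == "5"
assert "5s".removesuffix("s") == "5"
assert float("123.456s".rstrip("s")) == 123.456
```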

  • Step 5: Run lint
cd cofee_backend && uv run ruff check cpv3/modules/transcription/service.py

Expected: no errors.

  • Step 6: Commit
git add cofee_backend/cpv3/modules/transcription/service.py
git commit -m "feat(backend): implement SaluteSpeech transcription engine"

Task 7: Add Task Dispatch

Files:

  • Modify: cofee_backend/cpv3/modules/tasks/schemas.py:86 (engine Literal)

  • Modify: cofee_backend/cpv3/modules/tasks/service.py:88-91 (ENGINE_MAP)

  • Modify: cofee_backend/cpv3/modules/tasks/service.py:613-616 (actor import)

  • Modify: cofee_backend/cpv3/modules/tasks/service.py:700 (elif branch)

  • Step 1: Extend engine Literal in task schema

In cofee_backend/cpv3/modules/tasks/schemas.py, line 86, change:

    engine: Literal["whisper", "google"] = Field(

to:

    engine: Literal["whisper", "google", "salutespeech"] = Field(
  • Step 2: Add to ENGINE_MAP

In cofee_backend/cpv3/modules/tasks/service.py, lines 88-91, change:

ENGINE_MAP: dict[str, str] = {
    "whisper": "LOCAL_WHISPER",
    "google": "GOOGLE_SPEECH_CLOUD",
}

to:

ENGINE_MAP: dict[str, str] = {
    "whisper": "LOCAL_WHISPER",
    "google": "GOOGLE_SPEECH_CLOUD",
    "salutespeech": "SALUTE_SPEECH",
}
  • Step 3: Add import in actor

In cofee_backend/cpv3/modules/tasks/service.py, inside transcription_generate_actor (lines 613-616), change:

    from cpv3.modules.transcription.service import (
        transcribe_with_google_speech,
        transcribe_with_whisper,
    )

to:

    from cpv3.modules.transcription.service import (
        transcribe_with_google_speech,
        transcribe_with_salute_speech,
        transcribe_with_whisper,
    )
  • Step 4: Add elif dispatch branch

In cofee_backend/cpv3/modules/tasks/service.py, after the Google branch (after line 700, before the else:), add:

        elif engine == "salutespeech":
            # Extract sample rate from probe if available
            audio_stream = next(
                (s for s in probe.streams if s.codec_type == "audio"), None
            )
            sr = int(audio_stream.sample_rate) if audio_stream and audio_stream.sample_rate else 16000
            document = _run_async(
                transcribe_with_salute_speech(
                    storage,
                    file_key=file_key,
                    language=language,
                    model=model,
                    sample_rate=sr,
                    job_id=job_uuid,
                    on_progress=_on_whisper_progress,
                )
            )
  • Step 5: Run lint
cd cofee_backend && uv run ruff check cpv3/modules/tasks/service.py cpv3/modules/tasks/schemas.py

Expected: no errors.

  • Step 6: Commit
git add cofee_backend/cpv3/modules/tasks/schemas.py cofee_backend/cpv3/modules/tasks/service.py
git commit -m "feat(backend): add SaluteSpeech to task dispatch (ENGINE_MAP + elif branch)"

Task 8: Add Direct Endpoint (Optional)

Files:

  • Modify: cofee_backend/cpv3/modules/transcription/router.py (after line 145)

  • Step 1: Add route

In cofee_backend/cpv3/modules/transcription/router.py, add the import at the top alongside existing imports:

from cpv3.modules.transcription.schemas import (
    ...  # existing
    SaluteSpeechParams,
)
from cpv3.modules.transcription.service import (
    ...  # existing
    transcribe_with_salute_speech,
)

Then append after the last endpoint (after line 145):



@router.post("/salute-speech/", response_model=Document)
async def salute_speech_transcribe(
    body: SaluteSpeechParams,
    current_user: User = Depends(get_current_user),
    storage: StorageService = Depends(get_storage),
) -> Document:
    _ = current_user
    return await transcribe_with_salute_speech(
        storage,
        file_key=body.file_path,
        language=body.language,
        model=body.model,
    )
  • Step 2: Run lint
cd cofee_backend && uv run ruff check cpv3/modules/transcription/router.py

Expected: no errors.

  • Step 3: Commit
git add cofee_backend/cpv3/modules/transcription/router.py
git commit -m "feat(backend): add direct /salute-speech/ transcription endpoint"

Task 9: Frontend — TranscriptionModal

Files:

  • Modify: cofee_frontend/src/features/project/TranscriptionModal/TranscriptionModal.tsx

  • Step 1: Extend type

At line 17, change:

    engine: "whisper" | "google"

to:

    engine: "whisper" | "google" | "salutespeech"
  • Step 2: Add engine option

At lines 22-25, change:

const ENGINE_OPTIONS = [
    { value: "whisper", label: "Whisper (локальный)" },
    { value: "google", label: "Google Speech" },
]

to:

const ENGINE_OPTIONS = [
    { value: "whisper", label: "Whisper (локальный)" },
    { value: "google", label: "Google Speech" },
    { value: "salutespeech", label: "SaluteSpeech" },
]
  • Step 3: Split model options

Rename the existing MODEL_OPTIONS (lines 33-38) and add SaluteSpeech models:

const WHISPER_MODEL_OPTIONS = [
    { value: "base", label: "Базовая" },
    { value: "small", label: "Малая" },
    { value: "medium", label: "Средняя" },
    { value: "large", label: "Большая" },
]

const SALUTE_MODEL_OPTIONS = [
    { value: "general", label: "Общая" },
    { value: "finance", label: "Финансы" },
    { value: "medicine", label: "Медицина" },
]
  • Step 4: Update model dropdown guard

At line 162, change the model dropdown conditional from:

{engine === "whisper" && (

to:

{(engine === "whisper" || engine === "salutespeech") && (

And inside, change the options reference from MODEL_OPTIONS to:

{(engine === "whisper" ? WHISPER_MODEL_OPTIONS : SALUTE_MODEL_OPTIONS).map((opt) => (
  • Step 5: Add model reset on engine change

Find the component function body (after the useForm call). Add a useEffect that resets the model when engine changes:

const engine = watch("engine")

useEffect(() => {
    if (engine === "salutespeech") {
        setValue("model", "general")
    } else if (engine === "whisper") {
        setValue("model", "base")
    }
}, [engine, setValue])

Note: watch and setValue come from useForm — check that they're destructured, and make sure useEffect is imported from react. If watch("engine") is already used elsewhere, reuse that variable.

  • Step 6: Type check
cd cofee_frontend && bunx tsc --noEmit 2>&1 | grep -v "app/template.tsx" | grep -v "CreateProjectModal" | head -20

Expected: no new errors.

  • Step 7: Commit
git add cofee_frontend/src/features/project/TranscriptionModal/TranscriptionModal.tsx
git commit -m "feat(frontend): add SaluteSpeech engine option to TranscriptionModal"

Task 10: Frontend — TranscriptionSettingsStep

Files:

  • Modify: cofee_frontend/src/features/project/TranscriptionSettingsStep/TranscriptionSettingsStep.tsx

Apply the same changes as Task 9 to this file (constants are duplicated).

  • Step 1: Extend type

At line 22, change:

    engine: "whisper" | "google"

to:

    engine: "whisper" | "google" | "salutespeech"
  • Step 2: Add engine option

At lines 27-30, change:

const ENGINE_OPTIONS = [
    { value: "whisper", label: "Whisper (локальный)" },
    { value: "google", label: "Google Speech" },
]

to:

const ENGINE_OPTIONS = [
    { value: "whisper", label: "Whisper (локальный)" },
    { value: "google", label: "Google Speech" },
    { value: "salutespeech", label: "SaluteSpeech" },
]
  • Step 3: Split model options

Rename MODEL_OPTIONS (lines 38-43) and add SaluteSpeech models:

const WHISPER_MODEL_OPTIONS = [
    { value: "base", label: "Базовая" },
    { value: "small", label: "Малая" },
    { value: "medium", label: "Средняя" },
    { value: "large", label: "Большая" },
]

const SALUTE_MODEL_OPTIONS = [
    { value: "general", label: "Общая" },
    { value: "finance", label: "Финансы" },
    { value: "medicine", label: "Медицина" },
]
  • Step 4: Update model dropdown guard

At line 263, change:

{engine === "whisper" && (

to:

{(engine === "whisper" || engine === "salutespeech") && (

And change the options reference from MODEL_OPTIONS to:

{(engine === "whisper" ? WHISPER_MODEL_OPTIONS : SALUTE_MODEL_OPTIONS).map((opt) => (
  • Step 5: Add model reset on engine change

Same useEffect as Task 9:

const engine = watch("engine")

useEffect(() => {
    if (engine === "salutespeech") {
        setValue("model", "general")
    } else if (engine === "whisper") {
        setValue("model", "base")
    }
}, [engine, setValue])
  • Step 6: Type check
cd cofee_frontend && bunx tsc --noEmit 2>&1 | grep -v "app/template.tsx" | grep -v "CreateProjectModal" | head -20

Expected: no new errors.

  • Step 7: Commit
git add cofee_frontend/src/features/project/TranscriptionSettingsStep/TranscriptionSettingsStep.tsx
git commit -m "feat(frontend): add SaluteSpeech engine option to TranscriptionSettingsStep"

Task 11: Final Verification

Files: None (verification only)

  • Step 1: Backend lint
cd cofee_backend && uv run ruff check cpv3/ 2>&1 | head -20

Expected: no errors.

  • Step 2: Backend tests
cd cofee_backend && uv run pytest 2>&1 | tail -30

Expected: all tests pass (including new SaluteSpeech parsing tests).

  • Step 3: Frontend type check
cd cofee_frontend && bunx tsc --noEmit 2>&1 | grep -v "app/template.tsx" | grep -v "CreateProjectModal" | head -20

Expected: no new errors.

  • Step 4: Write verification report
VERIFICATION REPORT
===================
Subproject:  backend + frontend
Level:       base
Type check:  [PASS/FAIL]
Lint:        [PASS/FAIL]
Tests:       [PASS/FAIL] (X passed, Y failed)
Build:       SKIPPED
E2E:         SKIPPED

Files changed: ~10
Status: [READY/NOT READY]