remotion_service/docs/superpowers/plans/2026-03-24-docker-hardening.md
2026-04-06 01:44:58 +03:00

Docker Infrastructure Hardening — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Harden all Docker infrastructure across the monorepo — security, build optimization, service organization, health checks, and networking.

Architecture: 4-phase approach: quick config fixes first (no code changes), then Dockerfile improvements, then health endpoints + networking, then resource limits. Each phase produces a working stack.

Tech Stack: Docker, Docker Compose, FastAPI (Python), ElysiaJS (Bun/TypeScript), PostgreSQL, Redis, MinIO


Task 1: Add .env to .gitignore files

Files:

  • Modify: cofee_backend/.gitignore

  • Modify: cofee_frontend/.gitignore

  • Step 1: Add .env exclusion to backend .gitignore

Append to cofee_backend/.gitignore:

# Environment
.env
.env.*
  • Step 2: Add .env exclusion to frontend .gitignore

The frontend .gitignore has .env*.local but not .env itself. Add before the # local env files section in cofee_frontend/.gitignore:

# Environment
.env

Note: Keep the existing .env*.local line too.
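Real gitignore matching is richer than shell globbing (directory rules, negation, ordering), but for these flat filename patterns Python's `fnmatch` is a fair approximation. A quick sketch of which files each new pattern catches (`matched_by` is a hypothetical helper for illustration only):

```python
from fnmatch import fnmatch

# The three ignore rules touched by this task (backend + frontend).
PATTERNS = [".env", ".env.*", ".env*.local"]

def matched_by(name: str) -> list[str]:
    """Return the patterns that catch a filename. fnmatch only
    approximates gitignore (no negation, no directory semantics)."""
    return [p for p in PATTERNS if fnmatch(name, p)]

print(matched_by(".env"))             # ['.env']
print(matched_by(".env.production"))  # ['.env.*']
print(matched_by(".env.local"))       # ['.env.*', '.env*.local']
```

Note that `.env.local` was already covered by the frontend's `.env*.local` rule; the new `.env` and `.env.*` entries close the remaining gaps.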

  • Step 3: Verify .env files are not tracked

Run: git ls-files | grep '\.env' Expected: no output. If any .env files are tracked, run git rm --cached <file> for each.

  • Step 4: Commit
git add cofee_backend/.gitignore cofee_frontend/.gitignore
git commit -m "fix(infra): add .env to backend and frontend .gitignore"

Task 2: Add .env to backend .dockerignore

Files:

  • Modify: cofee_backend/.dockerignore

  • Step 1: Add .env exclusion

Add to cofee_backend/.dockerignore:

.env
.env.*
  • Step 2: Commit
git add cofee_backend/.dockerignore
git commit -m "fix(infra): exclude .env from backend Docker build context"

Task 3: DRY up docker-compose env vars with YAML anchor

Files:

  • Modify: cofee_backend/docker-compose.yml

The api and worker services share 14 identical env vars. Extract them into an x-backend-env anchor. This refactor also adds the missing JWT_SECRET_KEY to the worker.

  • Step 1: Add x-backend-env anchor and refactor services

Replace the entire cofee_backend/docker-compose.yml with:

x-backend-image: &backend-image
  image: cpv3-backend:dev
  build:
    context: .
    dockerfile: Dockerfile
    target: dev

x-backend-env: &backend-env
  DEBUG: ${DEBUG:-1}
  JWT_SECRET_KEY: ${JWT_SECRET_KEY:-dev-secret}

  POSTGRES_USER: ${POSTGRES_USER:-postgres}
  POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
  POSTGRES_HOST: db
  POSTGRES_PORT: 5432
  POSTGRES_DATABASE: ${POSTGRES_DATABASE:-coffee_project_db}

  STORAGE_BACKEND: ${STORAGE_BACKEND:-S3}

  S3_ACCESS_KEY: ${MINIO_ROOT_USER:-minioadmin}
  S3_SECRET_KEY: ${MINIO_ROOT_PASSWORD:-minioadmin}
  S3_BUCKET_NAME: ${S3_BUCKET_NAME:-coffee-bucket}
  S3_ENDPOINT_URL_INTERNAL: http://minio:9000
  S3_ENDPOINT_URL_PUBLIC: http://localhost:9000

  REDIS_URL: redis://redis:6379/0
  WEBHOOK_BASE_URL: http://api:8000

  REMOTION_SERVICE_URL: ${REMOTION_SERVICE_URL:-http://remotion:3001}

services:
  db:
    container_name: cpv3_postgres
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-postgres}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      POSTGRES_DB: ${POSTGRES_DATABASE:-coffee_project_db}
    ports:
      - "127.0.0.1:5332:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-postgres} -d ${POSTGRES_DATABASE:-coffee_project_db}"]
      interval: 5s
      timeout: 3s
      retries: 20
    volumes:
      - cpv3_db:/var/lib/postgresql/data

  minio:
    container_name: cpv3_minio
    image: minio/minio:RELEASE.2024-11-07T00-52-20Z
    restart: unless-stopped
    ports:
      - "127.0.0.1:9000:9000"
      - "127.0.0.1:9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minioadmin}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minioadmin}
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - cpv3_minio:/data

  redis:
    container_name: cpv3_redis
    image: redis:7-alpine
    restart: unless-stopped
    ports:
      - "127.0.0.1:6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    volumes:
      - cpv3_redis:/data

  api:
    container_name: cpv3_api
    <<: *backend-image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: *backend-env
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./cpv3:/app/cpv3
      - ./alembic:/app/alembic
      - ./alembic.ini:/app/alembic.ini

  worker:
    container_name: cpv3_worker
    <<: *backend-image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: *backend-env
    command: >
      watchfiles --filter python 'dramatiq cpv3.modules.tasks.service --processes 1 --threads 2' /app/cpv3
    volumes:
      - ./cpv3:/app/cpv3

volumes:
  cpv3_db:
  cpv3_minio:
  cpv3_redis:

Key changes in this file:

  • x-backend-env anchor with all shared env vars (DRY)

  • JWT_SECRET_KEY added to worker (was missing)

  • restart: unless-stopped on all services

  • All ports bound to 127.0.0.1 (not 0.0.0.0)

  • MinIO pinned to RELEASE.2024-11-07T00-52-20Z

  • MinIO health check added (curl on /minio/health/live)

  • Removed inline comments for cleanliness
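
The `${VAR:-default}` values in the anchor are interpolated by Compose from the host environment at config time, with `:-` meaning "use the default when the variable is unset OR empty". A minimal Python sketch of that substitution rule (simplified — real Compose also supports `${VAR-default}`, `${VAR:?err}`, and `$$` escaping):

```python
import re

# Matches ${VAR} and ${VAR:-default} (a simplification of Compose syntax).
_VAR = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def interpolate(value: str, env: dict) -> str:
    """Compose-style substitution: ${VAR:-default} falls back to the
    default when VAR is unset or set to an empty string."""
    def repl(m):
        got = env.get(m.group(1))
        return got if got else (m.group(2) or "")
    return _VAR.sub(repl, value)

print(interpolate("${POSTGRES_USER:-postgres}", {}))                            # postgres
print(interpolate("${POSTGRES_USER:-postgres}", {"POSTGRES_USER": "admin"}))    # admin
print(interpolate("redis://redis:6379/0", {}))                                  # unchanged
```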

  • Step 2: Validate compose syntax

Run: cd cofee_backend && docker compose config > /dev/null Expected: no errors.

  • Step 3: Test stack starts

Run: cd cofee_backend && docker compose up -d Wait 30s, then: docker compose ps Expected: all services Up or Up (healthy).

  • Step 4: Commit
git add cofee_backend/docker-compose.yml
git commit -m "refactor(infra): DRY env vars, pin images, bind localhost, add restart policies"

Task 4: Move build-essential out of base stage in backend Dockerfile

Files:

  • Modify: cofee_backend/Dockerfile

build-essential is only needed while uv sync compiles C extensions. Moving it from base into a dedicated deps stage saves ~200MB in the prod image: prod inherits from base and copies only the finished .venv from deps, so the compiled artifacts come along without the compiler toolchain.

  • Step 1: Restructure Dockerfile stages

Replace the entire cofee_backend/Dockerfile with:

# syntax=docker/dockerfile:1.7

# ---------------------------------------------------------------------------
# Stage 1: base — minimal runtime dependencies (shared by dev and prod)
# ---------------------------------------------------------------------------
FROM python:3.11-slim AS base

COPY --from=ghcr.io/astral-sh/uv:0.8.15 /uv /uvx /bin/

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/app/.venv/bin:${PATH}"

WORKDIR /app

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# ---------------------------------------------------------------------------
# Stage 2: deps — install Python dependencies (build-essential here only)
# ---------------------------------------------------------------------------
FROM base AS deps

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-install-project

# ---------------------------------------------------------------------------
# Stage 3: dev — development target (used by docker-compose)
# ---------------------------------------------------------------------------
FROM deps AS dev

ENV PYTHONPATH=/app

EXPOSE 8000

CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir /app/cpv3"]

# ---------------------------------------------------------------------------
# Stage 4: prod — production target (no build-essential, non-root user)
# ---------------------------------------------------------------------------
FROM base AS prod

RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid app --create-home app

COPY --from=deps /app/.venv /app/.venv
COPY pyproject.toml uv.lock ./

ENV UV_LINK_MODE=copy

COPY cpv3 ./cpv3
COPY alembic ./alembic
COPY alembic.ini ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev

RUN chown -R app:app /app
USER app

EXPOSE 8000

CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000"]

Key changes:

  • build-essential moved from base to deps — prod image is ~200MB smaller

  • prod stage inherits from base (not deps) — no compiler in production

  • prod copies only .venv from deps stage — gets compiled packages without build tools

  • Non-root app user (uid 1000) added to prod stage

  • dev stage still inherits from deps (has build-essential for potential ad-hoc installs)

  • Step 2: Build and verify prod stage

Run: cd cofee_backend && docker build --target prod -t cpv3-backend:prod-test . Expected: builds successfully.

  • Step 3: Build and verify dev stage

Run: cd cofee_backend && docker build --target dev -t cpv3-backend:dev-test . Expected: builds successfully.

  • Step 4: Verify dev stack still works

Run: cd cofee_backend && docker compose up -d --build Wait 30s, then: docker compose ps Expected: all services running.

  • Step 5: Commit
git add cofee_backend/Dockerfile
git commit -m "perf(infra): move build-essential to deps stage, add non-root user to prod"

Task 5: Add BuildKit cache mounts and non-root user to Remotion Dockerfile

Files:

  • Modify: remotion_service/Dockerfile

  • Step 1: Update Remotion Dockerfile

Replace the entire remotion_service/Dockerfile with:

# syntax=docker/dockerfile:1.7-labs
FROM oven/bun:1.3.10 AS base

ENV APP_HOME=/app \
  PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 \
  REMOTION_PUPPETEER_NO_SANDBOX=1 \
  NODE_ENV=production

WORKDIR ${APP_HOME}

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    ffmpeg \
    chromium \
    libglib2.0-0 \
    libnss3 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libdrm2 \
    libxkbcommon0 \
    libgbm1 \
    fonts-noto-color-emoji \
    curl \
    && rm -rf /var/lib/apt/lists/*

FROM base AS deps
WORKDIR ${APP_HOME}
COPY package.json bun.lock ./
RUN NODE_ENV=development bun install --frozen-lockfile

FROM base AS runner
WORKDIR ${APP_HOME}

RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid app --create-home app

COPY --from=deps ${APP_HOME}/node_modules ./node_modules
COPY package.json bun.lock ./
COPY tsconfig.json remotion.config.ts ./
COPY public ./public
COPY src ./src
COPY server ./server

RUN mkdir -p out && chown -R app:app /app

USER app

EXPOSE 3001

CMD ["bun", "run", "server"]

Key changes:

  • BuildKit apt cache mounts added (matches backend pattern)

  • Non-root app user (uid 1000) in runner stage

  • chown before USER app so the app owns all files including out/

  • Step 2: Build and verify

Run: cd remotion_service && docker build --target runner -t remotion:test . Expected: builds successfully.

  • Step 3: Commit
git add remotion_service/Dockerfile
git commit -m "perf(infra): add BuildKit cache mounts and non-root user to Remotion Dockerfile"

Task 6: Add resource limits and cap_drop to Remotion docker-compose

Files:

  • Modify: remotion_service/docker-compose.yml

  • Step 1: Update Remotion docker-compose.yml

Replace the entire remotion_service/docker-compose.yml with:

services:
  remotion:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    command: >
      sh -lc "NODE_ENV=development bun install --frozen-lockfile && bun run server"
    restart: unless-stopped
    env_file: .env
    environment:
      S3_ENDPOINT_URL: http://minio:9000
      REDIS_URL: redis://redis:6379/0
    ports:
      - "127.0.0.1:3001:3001"
    deploy:
      resources:
        limits:
          memory: 4g
          cpus: "2"
        reservations:
          memory: 1g
          cpus: "0.5"
    cap_drop:
      - ALL
    cap_add:
      - SYS_ADMIN
    volumes:
      - .:/app:cached
      - remotion_node_modules:/app/node_modules
    networks:
      - backend
    stdin_open: true
    tty: true

volumes:
  remotion_node_modules:

networks:
  backend:
    external: true
    name: cofee_backend_default

Key changes:

  • restart: unless-stopped

  • Port bound to 127.0.0.1

  • Resource limits: 4GB memory / 2 CPUs (Chromium + FFmpeg need this)

  • Resource reservations: 1GB / 0.5 CPU (scheduling guarantees)

  • cap_drop: ALL + cap_add: SYS_ADMIN (SYS_ADMIN is what Chromium's own sandbox requires; since the Dockerfile sets REMOTION_PUPPETEER_NO_SANDBOX=1 it may be droppable as well, but it is kept as a fallback)

  • Step 2: Validate compose syntax

Run: cd remotion_service && docker compose config > /dev/null Expected: no errors.

  • Step 3: Commit
git add remotion_service/docker-compose.yml
git commit -m "fix(infra): add resource limits, cap_drop, restart policy to Remotion compose"

Task 7: Add resource limits and cap_drop to backend docker-compose

Files:

  • Modify: cofee_backend/docker-compose.yml

  • Step 1: Add deploy and cap_drop sections to each service

Add to the db service after volumes:

    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID

Add to the minio service after volumes:

    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID

Add to the redis service after volumes:

    cap_drop:
      - ALL

Add to the api service after volumes:

    deploy:
      resources:
        limits:
          memory: 512m
          cpus: "1"
    cap_drop:
      - ALL

Add to the worker service after volumes:

    deploy:
      resources:
        limits:
          memory: 1g
          cpus: "1"
    cap_drop:
      - ALL
  • Step 2: Validate compose syntax

Run: cd cofee_backend && docker compose config > /dev/null Expected: no errors.

  • Step 3: Commit
git add cofee_backend/docker-compose.yml
git commit -m "fix(infra): add resource limits and capability dropping to backend compose"

Task 8: Add health check endpoint to backend API

Files:

  • Modify: cofee_backend/cpv3/modules/system/router.py

The existing /api/ping/ only returns a static response. We need a /api/health/ endpoint that checks DB and Redis connectivity for Docker health checks.

  • Step 1: Add health endpoint to system router

Replace the contents of cofee_backend/cpv3/modules/system/router.py with:

from __future__ import annotations

from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from cpv3.db.session import get_db

router = APIRouter(prefix="/api", tags=["System"])


@router.get("/ping/")
async def ping() -> dict[str, str]:
    return {"status": "ok"}


@router.get("/health/")
async def health(db: AsyncSession = Depends(get_db)) -> dict[str, str]:
    """Health check for Docker/K8s probes. Verifies DB connectivity."""
    try:
        await db.execute(text("SELECT 1"))
        db_status = "connected"
    except Exception:
        db_status = "disconnected"

    status = "ok" if db_status == "connected" else "degraded"
    return {"status": status, "database": db_status}
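
The status derivation in health() can be unit-tested without a database by factoring it into a pure function — a sketch (`derive_status` is a hypothetical helper, not part of the plan's router code):

```python
def derive_status(db_status: str) -> dict:
    """Mirror the /api/health/ response shape: overall status is "ok"
    only when the DB probe reported "connected"."""
    status = "ok" if db_status == "connected" else "degraded"
    return {"status": status, "database": db_status}

print(derive_status("connected"))     # {'status': 'ok', 'database': 'connected'}
print(derive_status("disconnected"))  # {'status': 'degraded', 'database': 'disconnected'}
```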
  • Step 2: Run linter

Run: cd cofee_backend && uv run ruff check cpv3/modules/system/router.py Expected: no errors.

  • Step 3: Run existing tests

Run: cd cofee_backend && uv run pytest -x -q 2>&1 | tail -10 Expected: all tests pass (health endpoint is additive, no breaking changes).

  • Step 4: Commit
git add cofee_backend/cpv3/modules/system/router.py
git commit -m "feat(backend): add /api/health/ endpoint for Docker health checks"

Task 9: Add health check endpoint to Remotion service

Files:

  • Modify: remotion_service/server/index.ts

  • Step 1: Add /health endpoint before app.listen

Add before the app.listen(...) line (around line 138) in remotion_service/server/index.ts:

app.get("/health", async () => {
  return { status: "ok" };
});

Note: routes registered on the Elysia instance inherit its prefix: "/api" option, so although the path is written here as /health, the endpoint is served at GET /api/health.

  • Step 2: Type check

Run: cd remotion_service && bunx tsc --noEmit Expected: no new errors.

  • Step 3: Commit
git add remotion_service/server/index.ts
git commit -m "feat(remotion): add /api/health endpoint for Docker health checks"

Task 10: Add health checks for api, worker, and remotion in compose files

Files:

  • Modify: cofee_backend/docker-compose.yml

  • Modify: remotion_service/docker-compose.yml

  • Step 1: Add healthcheck to api service

Add to api service in cofee_backend/docker-compose.yml (after depends_on):

    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/health/')"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
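
One caveat: urllib.request.urlopen only raises on non-2xx responses, and the Task 8 endpoint returns HTTP 200 even in the degraded state, so this probe stays green with a broken DB. A stricter probe could also inspect the JSON body — a sketch (`is_healthy` is a hypothetical helper, not part of the plan):

```python
import json

def is_healthy(body: bytes) -> bool:
    """True only when /api/health/ reports status "ok"; an HTTP 200
    carrying "degraded" (DB down) counts as unhealthy."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return payload.get("status") == "ok"

print(is_healthy(b'{"status": "ok", "database": "connected"}'))           # True
print(is_healthy(b'{"status": "degraded", "database": "disconnected"}'))  # False
```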
  • Step 2: Add healthcheck to worker service

The worker has no HTTP port, so use a process check. Note that pgrep comes from the procps package, which python:3.11-slim does not ship — add procps to the apt-get install list in the Dockerfile's base stage so this check can run. Add to worker service:

    healthcheck:
      test: ["CMD-SHELL", "pgrep -f dramatiq || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
  • Step 3: Add healthcheck to remotion service

Add to remotion service in remotion_service/docker-compose.yml (after environment):

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s
  • Step 4: Validate both compose files

Run: cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null Expected: no errors.

  • Step 5: Commit
git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
git commit -m "feat(infra): add health checks to api, worker, and remotion services"

Task 11: Add network segmentation to backend compose

Files:

  • Modify: cofee_backend/docker-compose.yml

Currently all services share one flat network. Separate into db-net (data stores) and app-net (application services). This prevents Remotion from reaching DB/Redis directly.

  • Step 1: Add networks to compose

Add at the bottom of cofee_backend/docker-compose.yml, replacing the existing volumes: section:

volumes:
  cpv3_db:
  cpv3_minio:
  cpv3_redis:

networks:
  db-net:
    driver: bridge
  app-net:
    driver: bridge
  • Step 2: Add network assignments to each service

Add to db:

    networks:
      - db-net

Add to redis:

    networks:
      - db-net

Add to minio:

    networks:
      - db-net
      - app-net

Add to api:

    networks:
      - db-net
      - app-net

Add to worker:

    networks:
      - db-net
      - app-net
  • Step 3: Update Remotion compose to use app-net

In remotion_service/docker-compose.yml, change the networks section:

networks:
  backend:
    external: true
    name: cofee_backend_app-net

This ensures Remotion can reach MinIO and the API (on app-net) but NOT PostgreSQL or Redis (on db-net). Caveat: the Remotion compose from Task 6 still sets REDIS_URL pointing at the redis host; after segmentation that hostname will not resolve from the Remotion container, so drop the variable if it is unused, or attach redis to app-net if Remotion genuinely needs it.

  • Step 4: Validate both compose files

Run: cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null Expected: no errors.

  • Step 5: Test full stack connectivity

Run:

cd cofee_backend && docker compose down && docker compose up -d
# Wait for healthy
cd ../remotion_service && docker compose down && docker compose up -d

Verify API can reach DB, Redis, MinIO. Verify Remotion can reach MinIO but NOT DB.
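These reachability checks can be scripted from inside a container (e.g. via docker compose exec) with a plain TCP probe — a sketch, where the host names are the compose service names and the docker-dependent calls are left commented since they only work inside the running stack:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the api container, both should succeed; from the remotion
# container, ("minio", 9000) should succeed and ("db", 5432) should fail:
# print(can_connect("db", 5432), can_connect("minio", 9000))
```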

  • Step 6: Commit
git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
git commit -m "feat(infra): add network segmentation — db-net and app-net isolation"

Task 12: Final verification

  • Step 1: Bring down everything
cd cofee_backend && docker compose down
cd ../remotion_service && docker compose down
  • Step 2: Clean build
cd cofee_backend && docker compose build --no-cache
cd ../remotion_service && docker compose build --no-cache
  • Step 3: Start backend stack
cd cofee_backend && docker compose up -d

Wait for: docker compose ps shows all services healthy.

  • Step 4: Start Remotion stack
cd remotion_service && docker compose up -d

Wait for: docker compose ps shows remotion healthy.

  • Step 5: Test API health

Run: curl http://127.0.0.1:8000/api/health/ Expected: {"status":"ok","database":"connected"}

  • Step 6: Test Remotion health

Run: curl http://127.0.0.1:3001/api/health Expected: {"status":"ok"}

  • Step 7: Verify port binding

Run: docker compose -f cofee_backend/docker-compose.yml ps --format '{{.Name}} {{.Ports}}' Expected: all ports show 127.0.0.1:XXXX->YYYY/tcp (not 0.0.0.0).

  • Step 8: Verify resource limits

Run: docker inspect cpv3_api --format '{{.HostConfig.Memory}}' Expected: 536870912 (512MB).

Run: docker inspect $(cd remotion_service && docker compose ps -q remotion) --format '{{.HostConfig.Memory}}' Expected: 4294967296 (4GB). (The Remotion compose file sets no container_name, so resolve the container via docker compose ps -q.)
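
The expected numbers follow Docker's byte-unit convention, where k/m/g are binary (1024-based) multiples. A small sketch of the conversion (`parse_bytes` is illustrative, not a Docker API):

```python
# Docker treats k/m/g memory suffixes as binary (1024-based) multiples.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_bytes(value: str) -> int:
    """Convert a Docker-style memory string ("512m", "4g") to bytes."""
    value = value.strip().lower()
    if value and value[-1] in _UNITS:
        return int(float(value[:-1]) * _UNITS[value[-1]])
    return int(value)

print(parse_bytes("512m"))  # 536870912
print(parse_bytes("4g"))    # 4294967296
```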