# Docker Infrastructure Hardening — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Harden all Docker infrastructure across the monorepo — security, build optimization, service organization, health checks, and networking.

**Architecture:** 4-phase approach: quick config fixes first (no code changes), then Dockerfile improvements, then health endpoints + networking, then resource limits. Each phase produces a working stack.

**Tech Stack:** Docker, Docker Compose, FastAPI (Python), ElysiaJS (Bun/TypeScript), PostgreSQL, Redis, MinIO

---

### Task 1: Add .env to .gitignore files

**Files:**
- Modify: `cofee_backend/.gitignore`
- Modify: `cofee_frontend/.gitignore`

- [ ] **Step 1: Add .env exclusion to backend .gitignore**

Append to `cofee_backend/.gitignore`:

```
# Environment
.env
.env.*
```

- [ ] **Step 2: Add .env exclusion to frontend .gitignore**

The frontend `.gitignore` has `.env*.local` but not `.env` itself. Add before the `# local env files` section in `cofee_frontend/.gitignore`:

```
# Environment
.env
```

Note: Keep the existing `.env*.local` line too.

- [ ] **Step 3: Verify .env files are not tracked**

Run: `git ls-files | grep '\.env'`
Expected: no output. If any .env files are tracked, run `git rm --cached <file>` for each.
- [ ] **Step 4: Commit**

```bash
git add cofee_backend/.gitignore cofee_frontend/.gitignore
git commit -m "fix(infra): add .env to backend and frontend .gitignore"
```

---

### Task 2: Add .env to backend .dockerignore

**Files:**
- Modify: `cofee_backend/.dockerignore`

- [ ] **Step 1: Add .env exclusion**

Add to `cofee_backend/.dockerignore`:

```
.env
.env.*
```

- [ ] **Step 2: Commit**

```bash
git add cofee_backend/.dockerignore
git commit -m "fix(infra): exclude .env from backend Docker build context"
```

---

### Task 3: DRY up docker-compose env vars with YAML anchor

**Files:**
- Modify: `cofee_backend/docker-compose.yml`

The `api` and `worker` services share 14 identical env vars. Extract them into an `x-backend-env` anchor. This also adds the missing `JWT_SECRET_KEY` to the worker.

- [ ] **Step 1: Add x-backend-env anchor and refactor services**

Replace the entire `cofee_backend/docker-compose.yml` with:

```yaml
x-backend-image: &backend-image
  image: cpv3-backend:dev
  build:
    context: .
    dockerfile: Dockerfile
    target: dev

x-backend-env: &backend-env
  DEBUG: ${DEBUG:-1}
  JWT_SECRET_KEY: ${JWT_SECRET_KEY:-dev-secret}
  POSTGRES_USER: ${POSTGRES_USER:-postgres}
  POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
  POSTGRES_HOST: db
  POSTGRES_PORT: 5432
  POSTGRES_DATABASE: ${POSTGRES_DATABASE:-coffee_project_db}
  STORAGE_BACKEND: ${STORAGE_BACKEND:-S3}
  S3_ACCESS_KEY: ${MINIO_ROOT_USER:-minioadmin}
  S3_SECRET_KEY: ${MINIO_ROOT_PASSWORD:-minioadmin}
  S3_BUCKET_NAME: ${S3_BUCKET_NAME:-coffee-bucket}
  S3_ENDPOINT_URL_INTERNAL: http://minio:9000
  S3_ENDPOINT_URL_PUBLIC: http://localhost:9000
  REDIS_URL: redis://redis:6379/0
  WEBHOOK_BASE_URL: http://api:8000
  REMOTION_SERVICE_URL: ${REMOTION_SERVICE_URL:-http://remotion:3001}

services:
  db:
    container_name: cpv3_postgres
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-postgres}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      POSTGRES_DB: ${POSTGRES_DATABASE:-coffee_project_db}
    ports:
      - "127.0.0.1:5332:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-postgres} -d ${POSTGRES_DATABASE:-coffee_project_db}"]
      interval: 5s
      timeout: 3s
      retries: 20
    volumes:
      - cpv3_db:/var/lib/postgresql/data

  minio:
    container_name: cpv3_minio
    image: minio/minio:RELEASE.2024-11-07T00-52-20Z
    restart: unless-stopped
    ports:
      - "127.0.0.1:9000:9000"
      - "127.0.0.1:9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minioadmin}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minioadmin}
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - cpv3_minio:/data

  redis:
    container_name: cpv3_redis
    image: redis:7-alpine
    restart: unless-stopped
    ports:
      - "127.0.0.1:6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    volumes:
      - cpv3_redis:/data

  api:
    container_name: cpv3_api
    <<: *backend-image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: *backend-env
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./cpv3:/app/cpv3
      - ./alembic:/app/alembic
      - ./alembic.ini:/app/alembic.ini

  worker:
    container_name: cpv3_worker
    <<: *backend-image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: *backend-env
    command: >
      watchfiles --filter python
      'dramatiq cpv3.modules.tasks.service --processes 1 --threads 2'
      /app/cpv3
    volumes:
      - ./cpv3:/app/cpv3

volumes:
  cpv3_db:
  cpv3_minio:
  cpv3_redis:
```

Key changes in this file:
- `x-backend-env` anchor with all shared env vars (DRY)
- `JWT_SECRET_KEY` added to worker (was missing)
- `restart: unless-stopped` on all services
- All ports bound to `127.0.0.1` (not `0.0.0.0`)
- MinIO pinned to `RELEASE.2024-11-07T00-52-20Z`
- MinIO health check added (`curl` on `/minio/health/live`)
- db health check reads `${POSTGRES_DATABASE:-...}` — the same variable used everywhere else in the file
- Removed inline comments for cleanliness

- [ ] **Step 2: Validate compose syntax**

Run: `cd cofee_backend && docker compose config > /dev/null`
Expected: no errors.

- [ ] **Step 3: Test stack starts**

Run: `cd cofee_backend && docker compose up -d`
Wait 30s, then: `docker compose ps`
Expected: all services `Up` or `Up (healthy)`.

- [ ] **Step 4: Commit**

```bash
git add cofee_backend/docker-compose.yml
git commit -m "refactor(infra): DRY env vars, pin images, bind localhost, add restart policies"
```

---

### Task 4: Move build-essential out of base stage in backend Dockerfile

**Files:**
- Modify: `cofee_backend/Dockerfile`

`build-essential` is only needed during `uv sync` (compiling C extensions). Moving it from `base` to `deps` saves ~200MB in the prod image: `prod` now inherits from `base` and copies `.venv` from `deps`, so it gets the compiled artifacts without the system build packages.

- [ ] **Step 1: Restructure Dockerfile stages**

Replace the entire `cofee_backend/Dockerfile` with:

```dockerfile
# syntax=docker/dockerfile:1.7

# ---------------------------------------------------------------------------
# Stage 1: base — minimal runtime dependencies (shared by dev and prod)
# ---------------------------------------------------------------------------
FROM python:3.11-slim AS base

COPY --from=ghcr.io/astral-sh/uv:0.8.15 /uv /uvx /bin/

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/app/.venv/bin:${PATH}"

WORKDIR /app

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# ---------------------------------------------------------------------------
# Stage 2: deps — install Python dependencies (build-essential here only)
# ---------------------------------------------------------------------------
FROM base AS deps

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-install-project

# ---------------------------------------------------------------------------
# Stage 3: dev — development target (used by docker-compose)
# ---------------------------------------------------------------------------
FROM deps AS dev

ENV PYTHONPATH=/app
EXPOSE 8000
CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir /app/cpv3"]

# ---------------------------------------------------------------------------
# Stage 4: prod — production target (no build-essential, non-root user)
# ---------------------------------------------------------------------------
FROM base AS prod

RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid app --create-home app

COPY --from=deps /app/.venv /app/.venv
COPY pyproject.toml uv.lock ./

ENV UV_LINK_MODE=copy

COPY cpv3 ./cpv3
COPY alembic ./alembic
COPY alembic.ini ./

RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev

RUN chown -R app:app /app

USER app
EXPOSE 8000
CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000"]
```

Key changes:
- `build-essential` moved from `base` to `deps` — prod image is ~200MB smaller
- `prod` stage inherits from `base` (not `deps`) — no compiler in production
- `prod` copies only `.venv` from the `deps` stage — gets compiled packages without build tools
- Non-root `app` user (uid 1000) added to `prod` stage
- `dev` stage still inherits from `deps` (has build-essential for potential ad-hoc installs)

- [ ] **Step 2: Build and verify prod stage**

Run: `cd cofee_backend && docker build --target prod -t cpv3-backend:prod-test .`
Expected: builds successfully.

- [ ] **Step 3: Build and verify dev stage**

Run: `cd cofee_backend && docker build --target dev -t cpv3-backend:dev-test .`
Expected: builds successfully.
- [ ] **Step 4: Verify dev stack still works**

Run: `cd cofee_backend && docker compose up -d --build`
Wait 30s, then: `docker compose ps`
Expected: all services running.

- [ ] **Step 5: Commit**

```bash
git add cofee_backend/Dockerfile
git commit -m "perf(infra): move build-essential to deps stage, add non-root user to prod"
```

---

### Task 5: Add BuildKit cache mounts and non-root user to Remotion Dockerfile

**Files:**
- Modify: `remotion_service/Dockerfile`

- [ ] **Step 1: Update Remotion Dockerfile**

Replace the entire `remotion_service/Dockerfile` with:

```dockerfile
# syntax=docker/dockerfile:1.7-labs
FROM oven/bun:1.3.10 AS base

ENV APP_HOME=/app \
    PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 \
    REMOTION_PUPPETEER_NO_SANDBOX=1 \
    NODE_ENV=production

WORKDIR ${APP_HOME}

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    ca-certificates \
    ffmpeg \
    chromium \
    libglib2.0-0 \
    libnss3 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libdrm2 \
    libxkbcommon0 \
    libgbm1 \
    fonts-noto-color-emoji \
    curl \
    && rm -rf /var/lib/apt/lists/*

FROM base AS deps
WORKDIR ${APP_HOME}
COPY package.json bun.lock ./
RUN NODE_ENV=development bun install --frozen-lockfile

FROM base AS runner
WORKDIR ${APP_HOME}

RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid app --create-home app

COPY --from=deps ${APP_HOME}/node_modules ./node_modules
COPY package.json bun.lock ./
COPY tsconfig.json remotion.config.ts ./
COPY public ./public
COPY src ./src
COPY server ./server

RUN mkdir -p out && chown -R app:app /app

USER app
EXPOSE 3001
CMD ["bun", "run", "server"]
```

Key changes:
- BuildKit apt cache mounts added (matches backend pattern)
- Non-root `app` user (uid 1000) in runner stage
- `chown` before `USER app` so the app user owns all files, including `out/`

- [ ] **Step 2: Build and verify**

Run: `cd remotion_service && docker build --target runner -t remotion:test .`
Expected: builds successfully.

- [ ] **Step 3: Commit**

```bash
git add remotion_service/Dockerfile
git commit -m "perf(infra): add BuildKit cache mounts and non-root user to Remotion Dockerfile"
```

---

### Task 6: Add resource limits and cap_drop to Remotion docker-compose

**Files:**
- Modify: `remotion_service/docker-compose.yml`

- [ ] **Step 1: Update Remotion docker-compose.yml**

Replace the entire `remotion_service/docker-compose.yml` with:

```yaml
services:
  remotion:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    command: >
      sh -lc "NODE_ENV=development bun install --frozen-lockfile && bun run server"
    restart: unless-stopped
    env_file: .env
    environment:
      S3_ENDPOINT_URL: http://minio:9000
      REDIS_URL: redis://redis:6379/0
    ports:
      - "127.0.0.1:3001:3001"
    deploy:
      resources:
        limits:
          memory: 4g
          cpus: "2"
        reservations:
          memory: 1g
          cpus: "0.5"
    cap_drop:
      - ALL
    cap_add:
      - SYS_ADMIN
    volumes:
      - .:/app:cached
      - remotion_node_modules:/app/node_modules
    networks:
      - backend
    stdin_open: true
    tty: true

volumes:
  remotion_node_modules:

networks:
  backend:
    external: true
    name: cofee_backend_default
```

Key changes:
- `restart: unless-stopped`
- Port bound to `127.0.0.1`
- Resource limits: 4GB memory / 2 CPUs (Chromium + FFmpeg need this)
- Resource reservations: 1GB / 0.5 CPU (scheduling guarantees)
- `cap_drop: ALL` + `cap_add: SYS_ADMIN` (SYS_ADMIN needed for Chromium's sandbox)

- [ ] **Step 2: Validate compose syntax**

Run: `cd remotion_service && docker compose config > /dev/null`
Expected: no errors.
- [ ] **Step 3: Commit**

```bash
git add remotion_service/docker-compose.yml
git commit -m "fix(infra): add resource limits, cap_drop, restart policy to Remotion compose"
```

---

### Task 7: Add resource limits and cap_drop to backend docker-compose

**Files:**
- Modify: `cofee_backend/docker-compose.yml`

- [ ] **Step 1: Add deploy and cap_drop sections to each service**

Add to the `db` service after `volumes`:

```yaml
cap_drop:
  - ALL
cap_add:
  - CHOWN
  - DAC_OVERRIDE
  - FOWNER
  - SETGID
  - SETUID
```

Add to the `minio` service after `volumes`:

```yaml
cap_drop:
  - ALL
cap_add:
  - CHOWN
  - DAC_OVERRIDE
  - FOWNER
  - SETGID
  - SETUID
```

Add to the `redis` service after `volumes`:

```yaml
cap_drop:
  - ALL
```

Add to the `api` service after `volumes`:

```yaml
deploy:
  resources:
    limits:
      memory: 512m
      cpus: "1"
cap_drop:
  - ALL
```

Add to the `worker` service after `volumes`:

```yaml
deploy:
  resources:
    limits:
      memory: 1g
      cpus: "1"
cap_drop:
  - ALL
```

- [ ] **Step 2: Validate compose syntax**

Run: `cd cofee_backend && docker compose config > /dev/null`
Expected: no errors.

- [ ] **Step 3: Commit**

```bash
git add cofee_backend/docker-compose.yml
git commit -m "fix(infra): add resource limits and capability dropping to backend compose"
```

---

### Task 8: Add health check endpoint to backend API

**Files:**
- Modify: `cofee_backend/cpv3/modules/system/router.py`

The existing `/api/ping/` only returns a static response. We need a `/api/health/` endpoint that checks DB connectivity for Docker health checks.
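The probe contract that the later compose health checks rely on can be sketched with the standard library alone: `urllib.request.urlopen` raises on any non-2xx response, so a bare call doubles as a pass/fail probe. A minimal sketch, using a throwaway `http.server` handler as a stand-in for the real FastAPI route (the handler, port, and JSON shape here are illustrative, not the actual app):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the real /api/health/ route (illustrative only)."""

    def do_GET(self):
        # Mimic the response shape the endpoint in this task returns.
        body = json.dumps({"status": "ok", "database": "connected"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# Port 0 asks the OS for any free port; serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/api/health/"
# urlopen raises HTTPError for non-2xx, URLError if unreachable — so in a
# Docker healthcheck the process exit code alone signals healthy/unhealthy.
payload = json.loads(urllib.request.urlopen(url, timeout=5).read())
server.shutdown()
print(payload)
```

This is the same call shape Task 10 puts into the api service's `healthcheck.test`.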
- [ ] **Step 1: Add health endpoint to system router**

Replace the contents of `cofee_backend/cpv3/modules/system/router.py` with:

```python
from __future__ import annotations

from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from cpv3.db.session import get_db

router = APIRouter(prefix="/api", tags=["System"])


@router.get("/ping/")
async def ping() -> dict[str, str]:
    return {"status": "ok"}


@router.get("/health/")
async def health(db: AsyncSession = Depends(get_db)) -> dict[str, str]:
    """Health check for Docker/K8s probes. Verifies DB connectivity."""
    try:
        await db.execute(text("SELECT 1"))
        db_status = "connected"
    except Exception:
        db_status = "disconnected"
    status = "ok" if db_status == "connected" else "degraded"
    return {"status": status, "database": db_status}
```

(The settings import from an earlier draft is gone — nothing here uses it, and dead imports would trip the linter in the next step.)

- [ ] **Step 2: Run linter**

Run: `cd cofee_backend && uv run ruff check cpv3/modules/system/router.py`
Expected: no errors.

- [ ] **Step 3: Run existing tests**

Run: `cd cofee_backend && uv run pytest -x -q 2>&1 | tail -10`
Expected: all tests pass (the health endpoint is additive, no breaking changes).

- [ ] **Step 4: Commit**

```bash
git add cofee_backend/cpv3/modules/system/router.py
git commit -m "feat(backend): add /api/health/ endpoint for Docker health checks"
```

---

### Task 9: Add health check endpoint to Remotion service

**Files:**
- Modify: `remotion_service/server/index.ts`

- [ ] **Step 1: Add /health endpoint before app.listen**

Add before the `app.listen(...)` line (around line 138) in `remotion_service/server/index.ts`:

```typescript
app.get("/health", async () => {
  return { status: "ok" };
});
```

Note: the route is registered as `/health`, but because the Elysia instance is created with `prefix: "/api"`, it is served at `GET /api/health`.
- [ ] **Step 2: Type check**

Run: `cd remotion_service && bunx tsc --noEmit`
Expected: no new errors.

- [ ] **Step 3: Commit**

```bash
git add remotion_service/server/index.ts
git commit -m "feat(remotion): add /api/health endpoint for Docker health checks"
```

---

### Task 10: Add health checks for api, worker, and remotion in compose files

**Files:**
- Modify: `cofee_backend/docker-compose.yml`
- Modify: `remotion_service/docker-compose.yml`

- [ ] **Step 1: Add healthcheck to api service**

Add to the `api` service in `cofee_backend/docker-compose.yml` (after `depends_on`):

```yaml
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/health/')"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```

- [ ] **Step 2: Add healthcheck to worker service**

The worker has no HTTP port, so use a process check. Add to the `worker` service:

```yaml
healthcheck:
  test: ["CMD-SHELL", "pgrep -f dramatiq || exit 1"]
  interval: 15s
  timeout: 5s
  retries: 3
```

Note: `pgrep` comes from the `procps` package, which `python:3.11-slim` does not include by default. If the worker reports unhealthy despite running, add `procps` to the apt install list in the Dockerfile's `base` stage.

- [ ] **Step 3: Add healthcheck to remotion service**

Add to the `remotion` service in `remotion_service/docker-compose.yml` (after `environment`):

```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
```

- [ ] **Step 4: Validate both compose files**

Run: `cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null`
Expected: no errors.

- [ ] **Step 5: Commit**

```bash
git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
git commit -m "feat(infra): add health checks to api, worker, and remotion services"
```

---

### Task 11: Add network segmentation to backend compose

**Files:**
- Modify: `cofee_backend/docker-compose.yml`

Currently all services share one flat network. Separate them into `db-net` (data stores) and `app-net` (application services). This prevents Remotion from reaching DB/Redis directly.
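The rule behind the split is simply that two containers can talk iff they share at least one network. A throwaway sketch modeling the intended topology as set logic — service and network names are taken from this plan; this is a sanity check of the design, not a Docker API call:

```python
# Network membership mirroring the compose assignments in this task.
networks: dict[str, set[str]] = {
    "db-net": {"db", "redis", "minio", "api", "worker"},
    "app-net": {"minio", "api", "worker", "remotion"},
}


def reachable(a: str, b: str) -> bool:
    """True iff services a and b share at least one Docker network."""
    return any(a in members and b in members for members in networks.values())


assert reachable("api", "db")              # api joins db-net
assert reachable("remotion", "minio")      # both sit on app-net
assert not reachable("remotion", "db")     # remotion never joins db-net
assert not reachable("remotion", "redis")  # likewise for redis
print("topology ok")
```

The assertions match the connectivity checks in Step 5 below: API reaches the data stores, Remotion reaches only MinIO and the API.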
- [ ] **Step 1: Add networks to compose**

At the bottom of `cofee_backend/docker-compose.yml`, replace the existing `volumes:` section with:

```yaml
volumes:
  cpv3_db:
  cpv3_minio:
  cpv3_redis:

networks:
  db-net:
    driver: bridge
  app-net:
    driver: bridge
```

- [ ] **Step 2: Add network assignments to each service**

Add to `db`:

```yaml
networks:
  - db-net
```

Add to `redis`:

```yaml
networks:
  - db-net
```

Add to `minio`:

```yaml
networks:
  - db-net
  - app-net
```

Add to `api`:

```yaml
networks:
  - db-net
  - app-net
```

Add to `worker`:

```yaml
networks:
  - db-net
  - app-net
```

- [ ] **Step 3: Update Remotion compose to use app-net**

In `remotion_service/docker-compose.yml`, change the networks section:

```yaml
networks:
  backend:
    external: true
    name: cofee_backend_app-net
```

This ensures Remotion can reach MinIO and the API (on `app-net`) but NOT PostgreSQL or Redis (on `db-net`).

- [ ] **Step 4: Validate both compose files**

Run: `cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null`
Expected: no errors.

- [ ] **Step 5: Test full stack connectivity**

Run:

```bash
cd cofee_backend && docker compose down && docker compose up -d
# Wait for healthy
cd ../remotion_service && docker compose down && docker compose up -d
```

Verify the API can reach DB, Redis, and MinIO. Verify Remotion can reach MinIO but NOT the DB.
- [ ] **Step 6: Commit**

```bash
git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
git commit -m "feat(infra): add network segmentation — db-net and app-net isolation"
```

---

### Task 12: Final verification

- [ ] **Step 1: Bring down everything**

```bash
cd cofee_backend && docker compose down
cd ../remotion_service && docker compose down
```

- [ ] **Step 2: Clean build**

```bash
cd cofee_backend && docker compose build --no-cache
cd ../remotion_service && docker compose build --no-cache
```

- [ ] **Step 3: Start backend stack**

```bash
cd cofee_backend && docker compose up -d
```

Wait until `docker compose ps` shows all services healthy.

- [ ] **Step 4: Start Remotion stack**

```bash
cd remotion_service && docker compose up -d
```

Wait until `docker compose ps` shows remotion healthy.

- [ ] **Step 5: Test API health**

Run: `curl http://127.0.0.1:8000/api/health/`
Expected: `{"status":"ok","database":"connected"}`

- [ ] **Step 6: Test Remotion health**

Run: `curl http://127.0.0.1:3001/api/health`
Expected: `{"status":"ok"}`

- [ ] **Step 7: Verify port binding**

Run: `docker compose -f cofee_backend/docker-compose.yml ps --format '{{.Name}} {{.Ports}}'`
Expected: all ports show `127.0.0.1:XXXX->YYYY/tcp` (not `0.0.0.0`).

- [ ] **Step 8: Verify resource limits**

Run: `docker inspect cpv3_api --format '{{.HostConfig.Memory}}'`
Expected: `536870912` (512MB).

The remotion service sets no `container_name`, so resolve its container ID via compose:

Run: `docker inspect $(docker compose -f remotion_service/docker-compose.yml ps -q remotion) --format '{{.HostConfig.Memory}}'`
Expected: `4294967296` (4GB).
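The expected values in Step 8 are just the compose limits expressed in binary units, which is worth double-checking rather than eyeballing nine-digit numbers — `docker inspect` reports `HostConfig.Memory` in bytes:

```python
# Binary unit multipliers, as Docker interprets "m" and "g" suffixes.
MIB = 1024 ** 2
GIB = 1024 ** 3

api_limit = 512 * MIB       # compose "512m" on the api service
remotion_limit = 4 * GIB    # compose "4g" on the remotion service

print(api_limit, remotion_limit)  # 536870912 4294967296
```

A reading of `0` from `docker inspect` means no limit was applied — i.e. the `deploy.resources.limits` section did not take effect.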