# Docker Infrastructure Hardening — Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Harden all Docker infrastructure across the monorepo — security, build optimization, service organization, health checks, and networking.

**Architecture:** 4-phase approach: quick config fixes first (no code changes), then Dockerfile improvements, then health endpoints + networking, then resource limits. Each phase produces a working stack.

**Tech Stack:** Docker, Docker Compose, FastAPI (Python), ElysiaJS (Bun/TypeScript), PostgreSQL, Redis, MinIO
## Task 1: Add .env to .gitignore files

**Files:**
- Modify: `cofee_backend/.gitignore`
- Modify: `cofee_frontend/.gitignore`

- [ ] Step 1: Add `.env` exclusion to backend `.gitignore`

  Append to `cofee_backend/.gitignore`:

  ```
  # Environment
  .env
  .env.*
  ```

- [ ] Step 2: Add `.env` exclusion to frontend `.gitignore`

  The frontend `.gitignore` has `.env*.local` but not `.env` itself. Add before the `# local env files` section in `cofee_frontend/.gitignore`:

  ```
  # Environment
  .env
  ```

  Note: keep the existing `.env*.local` line too.

- [ ] Step 3: Verify `.env` files are not tracked

  Run: `git ls-files | grep '\.env'`

  Expected: no output. If any `.env` files are tracked, run `git rm --cached <file>` for each.

- [ ] Step 4: Commit

  ```sh
  git add cofee_backend/.gitignore cofee_frontend/.gitignore
  git commit -m "fix(infra): add .env to backend and frontend .gitignore"
  ```
## Task 2: Add .env to backend .dockerignore

**Files:**
- Modify: `cofee_backend/.dockerignore`

- [ ] Step 1: Add `.env` exclusion

  Add to `cofee_backend/.dockerignore`:

  ```
  .env
  .env.*
  ```

- [ ] Step 2: Commit

  ```sh
  git add cofee_backend/.dockerignore
  git commit -m "fix(infra): exclude .env from backend Docker build context"
  ```
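The ignore patterns added in Tasks 1 and 2 can be spot-checked offline. A minimal sketch using Python's `fnmatch` (which only approximates git's wildmatch, but agrees with it for these flat, slash-free patterns):

```python
from fnmatch import fnmatch

# Patterns from Tasks 1-2 plus the pre-existing frontend pattern.
PATTERNS = [".env", ".env.*", ".env*.local"]

def is_ignored(filename: str) -> bool:
    """Return True if any ignore pattern matches the bare file name."""
    return any(fnmatch(filename, p) for p in PATTERNS)

assert is_ignored(".env")
assert is_ignored(".env.production")
assert is_ignored(".env.development.local")
assert not is_ignored("env.example")
```

This is why Step 3 of Task 1 expects `git ls-files | grep '\.env'` to print nothing once the patterns are in place and any cached copies are removed.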
## Task 3: DRY up docker-compose env vars with YAML anchor

**Files:**
- Modify: `cofee_backend/docker-compose.yml`

The `api` and `worker` services share 14 identical env vars. Extract them into an `x-backend-env` anchor. This also adds the missing `JWT_SECRET_KEY` to `worker`.

- [ ] Step 1: Add `x-backend-env` anchor and refactor services

  Replace the entire `cofee_backend/docker-compose.yml` with:
  ```yaml
  x-backend-image: &backend-image
    image: cpv3-backend:dev
    build:
      context: .
      dockerfile: Dockerfile
      target: dev

  x-backend-env: &backend-env
    DEBUG: ${DEBUG:-1}
    JWT_SECRET_KEY: ${JWT_SECRET_KEY:-dev-secret}
    POSTGRES_USER: ${POSTGRES_USER:-postgres}
    POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
    POSTGRES_HOST: db
    POSTGRES_PORT: 5432
    POSTGRES_DATABASE: ${POSTGRES_DATABASE:-coffee_project_db}
    STORAGE_BACKEND: ${STORAGE_BACKEND:-S3}
    S3_ACCESS_KEY: ${MINIO_ROOT_USER:-minioadmin}
    S3_SECRET_KEY: ${MINIO_ROOT_PASSWORD:-minioadmin}
    S3_BUCKET_NAME: ${S3_BUCKET_NAME:-coffee-bucket}
    S3_ENDPOINT_URL_INTERNAL: http://minio:9000
    S3_ENDPOINT_URL_PUBLIC: http://localhost:9000
    REDIS_URL: redis://redis:6379/0
    WEBHOOK_BASE_URL: http://api:8000
    REMOTION_SERVICE_URL: ${REMOTION_SERVICE_URL:-http://remotion:3001}

  services:
    db:
      container_name: cpv3_postgres
      image: postgres:16
      restart: unless-stopped
      environment:
        POSTGRES_USER: ${POSTGRES_USER:-postgres}
        POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
        POSTGRES_DB: ${POSTGRES_DATABASE:-coffee_project_db}
      ports:
        - "127.0.0.1:5332:5432"
      healthcheck:
        # $$ defers expansion to the container shell, where POSTGRES_USER
        # and POSTGRES_DB are always set to the effective values
        test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
        interval: 5s
        timeout: 3s
        retries: 20
      volumes:
        - cpv3_db:/var/lib/postgresql/data

    minio:
      container_name: cpv3_minio
      image: minio/minio:RELEASE.2024-11-07T00-52-20Z
      restart: unless-stopped
      ports:
        - "127.0.0.1:9000:9000"
        - "127.0.0.1:9001:9001"
      environment:
        MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minioadmin}
        MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minioadmin}
      command: server /data --console-address ":9001"
      healthcheck:
        test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
        interval: 10s
        timeout: 5s
        retries: 5
      volumes:
        - cpv3_minio:/data

    redis:
      container_name: cpv3_redis
      image: redis:7-alpine
      restart: unless-stopped
      ports:
        - "127.0.0.1:6379:6379"
      healthcheck:
        test: ["CMD", "redis-cli", "ping"]
        interval: 5s
        timeout: 3s
        retries: 10
      volumes:
        - cpv3_redis:/data

    api:
      container_name: cpv3_api
      <<: *backend-image
      restart: unless-stopped
      depends_on:
        db:
          condition: service_healthy
        redis:
          condition: service_healthy
      environment:
        <<: *backend-env
      ports:
        - "127.0.0.1:8000:8000"
      volumes:
        - ./cpv3:/app/cpv3
        - ./alembic:/app/alembic
        - ./alembic.ini:/app/alembic.ini

    worker:
      container_name: cpv3_worker
      <<: *backend-image
      restart: unless-stopped
      depends_on:
        db:
          condition: service_healthy
        redis:
          condition: service_healthy
      environment:
        <<: *backend-env
      command: >
        watchfiles --filter python 'dramatiq cpv3.modules.tasks.service --processes 1 --threads 2' /app/cpv3
      volumes:
        - ./cpv3:/app/cpv3

  volumes:
    cpv3_db:
    cpv3_minio:
    cpv3_redis:
  ```
  Key changes in this file:

  - `x-backend-env` anchor with all shared env vars (DRY)
  - `JWT_SECRET_KEY` added to worker (was missing)
  - `restart: unless-stopped` on all services
  - All ports bound to `127.0.0.1` (not `0.0.0.0`)
  - MinIO pinned to `RELEASE.2024-11-07T00-52-20Z`
  - MinIO health check added (`curl` on `/minio/health/live`; if the pinned image turns out to ship without `curl`, use `mc ready local` instead)
  - Removed inline comments for cleanliness
- [ ] Step 2: Validate compose syntax

  Run: `cd cofee_backend && docker compose config > /dev/null`

  Expected: no errors.

- [ ] Step 3: Test stack starts

  Run: `cd cofee_backend && docker compose up -d`

  Wait 30s, then: `docker compose ps`

  Expected: all services `Up` or `Up (healthy)`.

- [ ] Step 4: Commit

  ```sh
  git add cofee_backend/docker-compose.yml
  git commit -m "refactor(infra): DRY env vars, pin images, bind localhost, add restart policies"
  ```
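The `<<: *backend-env` merge key behaves like Python dict unpacking: the shared mapping is expanded into the service's `environment`, and any service-level key laid on top wins. A plain-Python sketch of that semantics (env values abbreviated):

```python
# Shared mapping, standing in for the x-backend-env anchor.
backend_env = {
    "DEBUG": "1",
    "JWT_SECRET_KEY": "dev-secret",
    "REDIS_URL": "redis://redis:6379/0",
}

# `environment: {<<: *backend-env}` with no extra keys:
api_env = {**backend_env}

# A service could still override a single var locally, exactly like
# adding a key next to the merge key in YAML:
worker_env = {**backend_env, "DEBUG": "0"}

assert api_env == backend_env
assert worker_env["DEBUG"] == "0"
assert worker_env["JWT_SECRET_KEY"] == "dev-secret"
```

This is why adding `JWT_SECRET_KEY` to the anchor fixes the worker: both services now expand the same mapping.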
## Task 4: Move build-essential out of base stage in backend Dockerfile

**Files:**
- Modify: `cofee_backend/Dockerfile`

`build-essential` is only needed during `uv sync` (compiling C extensions). Moving it from `base` to `deps` saves ~200MB in the prod image: the compiled artifacts live in `.venv`, not in the system packages, so `prod` can inherit from `base` and copy `.venv` out of `deps`.

- [ ] Step 1: Restructure Dockerfile stages

  Replace the entire `cofee_backend/Dockerfile` with:
  ```dockerfile
  # syntax=docker/dockerfile:1.7

  # ---------------------------------------------------------------------------
  # Stage 1: base — minimal runtime dependencies (shared by dev and prod)
  # ---------------------------------------------------------------------------
  FROM python:3.11-slim AS base

  COPY --from=ghcr.io/astral-sh/uv:0.8.15 /uv /uvx /bin/

  ENV PYTHONDONTWRITEBYTECODE=1 \
      PYTHONUNBUFFERED=1 \
      PATH="/app/.venv/bin:${PATH}"

  WORKDIR /app

  RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
      --mount=type=cache,target=/var/lib/apt,sharing=locked \
      apt-get update && apt-get install -y --no-install-recommends \
      ffmpeg \
      && rm -rf /var/lib/apt/lists/*

  # ---------------------------------------------------------------------------
  # Stage 2: deps — install Python dependencies (build-essential here only)
  # ---------------------------------------------------------------------------
  FROM base AS deps

  RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
      --mount=type=cache,target=/var/lib/apt,sharing=locked \
      apt-get update && apt-get install -y --no-install-recommends \
      build-essential \
      && rm -rf /var/lib/apt/lists/*

  COPY pyproject.toml uv.lock ./
  RUN --mount=type=cache,target=/root/.cache/uv \
      uv sync --frozen --no-dev --no-install-project

  # ---------------------------------------------------------------------------
  # Stage 3: dev — development target (used by docker-compose)
  # ---------------------------------------------------------------------------
  FROM deps AS dev
  ENV PYTHONPATH=/app
  EXPOSE 8000
  CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir /app/cpv3"]

  # ---------------------------------------------------------------------------
  # Stage 4: prod — production target (no build-essential, non-root user)
  # ---------------------------------------------------------------------------
  FROM base AS prod

  RUN groupadd --gid 1000 app && \
      useradd --uid 1000 --gid app --create-home app

  COPY --from=deps /app/.venv /app/.venv
  COPY pyproject.toml uv.lock ./
  ENV UV_LINK_MODE=copy

  COPY cpv3 ./cpv3
  COPY alembic ./alembic
  COPY alembic.ini ./

  RUN --mount=type=cache,target=/root/.cache/uv \
      uv sync --frozen --no-dev

  RUN chown -R app:app /app
  USER app
  EXPOSE 8000
  CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000"]
  ```
  Key changes:

  - `build-essential` moved from `base` to `deps` — prod image is ~200MB smaller
  - `prod` stage inherits from `base` (not `deps`) — no compiler in production
  - `prod` copies only `.venv` from the `deps` stage — gets compiled packages without build tools
  - Non-root `app` user (uid 1000) added to the `prod` stage
  - `dev` stage still inherits from `deps` (has build-essential for potential ad-hoc installs)
- [ ] Step 2: Build and verify prod stage

  Run: `cd cofee_backend && docker build --target prod -t cpv3-backend:prod-test .`

  Expected: builds successfully.

- [ ] Step 3: Build and verify dev stage

  Run: `cd cofee_backend && docker build --target dev -t cpv3-backend:dev-test .`

  Expected: builds successfully.

- [ ] Step 4: Verify dev stack still works

  Run: `cd cofee_backend && docker compose up -d --build`

  Wait 30s, then: `docker compose ps`

  Expected: all services running.

- [ ] Step 5: Commit

  ```sh
  git add cofee_backend/Dockerfile
  git commit -m "perf(infra): move build-essential to deps stage, add non-root user to prod"
  ```
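The stage layout above can be sanity-checked as a parent map: `prod` ships no compiler precisely because its ancestry never passes through `deps`. A small sketch:

```python
# FROM edges of the Dockerfile above: stage -> parent stage.
STAGE_PARENT = {"base": None, "deps": "base", "dev": "deps", "prod": "base"}

def ancestry(stage: str) -> list[str]:
    """Walk the FROM chain from a stage back to its root."""
    chain = []
    cur = stage
    while cur is not None:
        chain.append(cur)
        cur = STAGE_PARENT[cur]
    return chain

# build-essential is installed only in deps, so:
assert "deps" not in ancestry("prod")  # prod never inherits the compiler
assert "deps" in ancestry("dev")       # dev keeps build tools for ad-hoc installs
```

The `COPY --from=deps /app/.venv /app/.venv` line is what carries the compiled packages across, outside the inheritance chain.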
## Task 5: Add BuildKit cache mounts and non-root user to Remotion Dockerfile

**Files:**
- Modify: `remotion_service/Dockerfile`

- [ ] Step 1: Update Remotion Dockerfile

  Replace the entire `remotion_service/Dockerfile` with:
  ```dockerfile
  # syntax=docker/dockerfile:1.7-labs
  FROM oven/bun:1.3.10 AS base

  ENV APP_HOME=/app \
      PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 \
      REMOTION_PUPPETEER_NO_SANDBOX=1 \
      NODE_ENV=production

  WORKDIR ${APP_HOME}

  RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
      --mount=type=cache,target=/var/lib/apt,sharing=locked \
      apt-get update && \
      DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
      ca-certificates \
      ffmpeg \
      chromium \
      libglib2.0-0 \
      libnss3 \
      libatk1.0-0 \
      libatk-bridge2.0-0 \
      libdrm2 \
      libxkbcommon0 \
      libgbm1 \
      fonts-noto-color-emoji \
      curl \
      && rm -rf /var/lib/apt/lists/*

  FROM base AS deps
  WORKDIR ${APP_HOME}
  COPY package.json bun.lock ./
  RUN NODE_ENV=development bun install --frozen-lockfile

  FROM base AS runner
  WORKDIR ${APP_HOME}
  RUN groupadd --gid 1000 app && \
      useradd --uid 1000 --gid app --create-home app
  COPY --from=deps ${APP_HOME}/node_modules ./node_modules
  COPY package.json bun.lock ./
  COPY tsconfig.json remotion.config.ts ./
  COPY public ./public
  COPY src ./src
  COPY server ./server
  RUN mkdir -p out && chown -R app:app /app
  USER app
  EXPOSE 3001
  CMD ["bun", "run", "server"]
  ```
  Key changes:

  - BuildKit apt cache mounts added (matches backend pattern)
  - Non-root `app` user (uid 1000) in runner stage
  - `chown` before `USER app` so the app user owns all files, including `out/`

- [ ] Step 2: Build and verify

  Run: `cd remotion_service && docker build --target runner -t remotion:test .`

  Expected: builds successfully.

- [ ] Step 3: Commit

  ```sh
  git add remotion_service/Dockerfile
  git commit -m "perf(infra): add BuildKit cache mounts and non-root user to Remotion Dockerfile"
  ```
## Task 6: Add resource limits and cap_drop to Remotion docker-compose

**Files:**
- Modify: `remotion_service/docker-compose.yml`

- [ ] Step 1: Update Remotion docker-compose.yml

  Replace the entire `remotion_service/docker-compose.yml` with:
  ```yaml
  services:
    remotion:
      build:
        context: .
        dockerfile: Dockerfile
        target: runner
      command: >
        sh -lc "NODE_ENV=development bun install --frozen-lockfile && bun run server"
      restart: unless-stopped
      env_file: .env
      environment:
        S3_ENDPOINT_URL: http://minio:9000
        REDIS_URL: redis://redis:6379/0
      ports:
        - "127.0.0.1:3001:3001"
      deploy:
        resources:
          limits:
            memory: 4g
            cpus: "2"
          reservations:
            memory: 1g
            cpus: "0.5"
      cap_drop:
        - ALL
      cap_add:
        - SYS_ADMIN
      volumes:
        - .:/app:cached
        - remotion_node_modules:/app/node_modules
      networks:
        - backend
      stdin_open: true
      tty: true

  volumes:
    remotion_node_modules:

  networks:
    backend:
      external: true
      name: cofee_backend_default
  ```
  Key changes:

  - `restart: unless-stopped`
  - Port bound to `127.0.0.1`
  - Resource limits: 4GB memory / 2 CPUs (Chromium + FFmpeg need this)
  - Resource reservations: 1GB / 0.5 CPU (scheduling guarantees)
  - `cap_drop: ALL` + `cap_add: SYS_ADMIN` (Chromium's sandbox needs SYS_ADMIN; since the Dockerfile sets `REMOTION_PUPPETEER_NO_SANDBOX=1`, it may be droppable too — verify rendering before removing it)

- [ ] Step 2: Validate compose syntax

  Run: `cd remotion_service && docker compose config > /dev/null`

  Expected: no errors.

- [ ] Step 3: Commit

  ```sh
  git add remotion_service/docker-compose.yml
  git commit -m "fix(infra): add resource limits, cap_drop, restart policy to Remotion compose"
  ```
## Task 7: Add resource limits and cap_drop to backend docker-compose

**Files:**
- Modify: `cofee_backend/docker-compose.yml`

- [ ] Step 1: Add deploy and cap_drop sections to each service

  Add to the `db` service after `volumes`:

  ```yaml
  cap_drop:
    - ALL
  cap_add:
    - CHOWN
    - DAC_OVERRIDE
    - FOWNER
    - SETGID
    - SETUID
  ```

  Add to the `minio` service after `volumes`:

  ```yaml
  cap_drop:
    - ALL
  cap_add:
    - CHOWN
    - DAC_OVERRIDE
    - FOWNER
    - SETGID
    - SETUID
  ```

  Add to the `redis` service after `volumes`:

  ```yaml
  cap_drop:
    - ALL
  ```

  Add to the `api` service after `volumes`:

  ```yaml
  deploy:
    resources:
      limits:
        memory: 512m
        cpus: "1"
  cap_drop:
    - ALL
  ```

  Add to the `worker` service after `volumes`:

  ```yaml
  deploy:
    resources:
      limits:
        memory: 1g
        cpus: "1"
  cap_drop:
    - ALL
  ```

- [ ] Step 2: Validate compose syntax

  Run: `cd cofee_backend && docker compose config > /dev/null`

  Expected: no errors.

- [ ] Step 3: Commit

  ```sh
  git add cofee_backend/docker-compose.yml
  git commit -m "fix(infra): add resource limits and capability dropping to backend compose"
  ```
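How `cap_drop`/`cap_add` combine: Docker removes the dropped capabilities from the default set, then adds `cap_add` back. A sketch of the resulting sets (the default list here is a subset of Docker's real defaults, for illustration):

```python
# Subset of Docker's default capability set, for illustration only.
DEFAULT_CAPS = {"CHOWN", "DAC_OVERRIDE", "FOWNER", "SETGID", "SETUID",
                "NET_BIND_SERVICE", "KILL", "SETPCAP"}

def effective_caps(cap_drop, cap_add):
    """Drop first (ALL empties the set), then add cap_add back."""
    base = set() if "ALL" in cap_drop else DEFAULT_CAPS - set(cap_drop)
    return base | set(cap_add)

# db/minio: drop ALL, re-add only what the entrypoint needs to chown data dirs
db_caps = effective_caps(["ALL"], ["CHOWN", "DAC_OVERRIDE", "FOWNER", "SETGID", "SETUID"])
assert "CHOWN" in db_caps and "NET_ADMIN" not in db_caps

# redis/api/worker: drop ALL, add nothing back
assert effective_caps(["ALL"], []) == set()
```

The api/worker services can run with an empty capability set because they already run (or will run) as a non-root user and bind high ports.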
## Task 8: Add health check endpoint to backend API

**Files:**
- Modify: `cofee_backend/cpv3/modules/system/router.py`

The existing `/api/ping/` only returns a static response. We need a `/api/health/` endpoint that checks DB connectivity for Docker health checks.

- [ ] Step 1: Add health endpoint to system router

  Replace the contents of `cofee_backend/cpv3/modules/system/router.py` with:
  ```python
  from __future__ import annotations

  from fastapi import APIRouter, Depends
  from sqlalchemy import text
  from sqlalchemy.ext.asyncio import AsyncSession

  from cpv3.db.session import get_db
  from cpv3.infrastructure.settings import get_settings

  router = APIRouter(prefix="/api", tags=["System"])
  _settings = get_settings()


  @router.get("/ping/")
  async def ping() -> dict[str, str]:
      return {"status": "ok"}


  @router.get("/health/")
  async def health(db: AsyncSession = Depends(get_db)) -> dict[str, str]:
      """Health check for Docker/K8s probes. Verifies DB connectivity."""
      try:
          await db.execute(text("SELECT 1"))
          db_status = "connected"
      except Exception:
          db_status = "disconnected"
      status = "ok" if db_status == "connected" else "degraded"
      return {"status": status, "database": db_status}
  ```
- [ ] Step 2: Run linter

  Run: `cd cofee_backend && uv run ruff check cpv3/modules/system/router.py`

  Expected: no errors.

- [ ] Step 3: Run existing tests

  Run: `cd cofee_backend && uv run pytest -x -q 2>&1 | tail -10`

  Expected: all tests pass (the health endpoint is additive, no breaking changes).

- [ ] Step 4: Commit

  ```sh
  git add cofee_backend/cpv3/modules/system/router.py
  git commit -m "feat(backend): add /api/health/ endpoint for Docker health checks"
  ```
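The degraded-status decision in the router above reduces to a tiny pure function, handy for unit-testing without a live database (a sketch, not part of the router itself):

```python
def health_payload(db_ok: bool) -> dict[str, str]:
    """Mirror of the /api/health/ response logic: ok iff the DB probe succeeded."""
    db_status = "connected" if db_ok else "disconnected"
    return {"status": "ok" if db_ok else "degraded", "database": db_status}

assert health_payload(True) == {"status": "ok", "database": "connected"}
assert health_payload(False) == {"status": "degraded", "database": "disconnected"}
```

Note that the endpoint returns HTTP 200 even when degraded; the Docker healthcheck in Task 10 only verifies the request succeeds, so a stricter probe would have to inspect the body or return a non-2xx status when degraded.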
## Task 9: Add health check endpoint to Remotion service

**Files:**
- Modify: `remotion_service/server/index.ts`

- [ ] Step 1: Add /health endpoint before app.listen

  Add before the `app.listen(...)` line (around line 138) in `remotion_service/server/index.ts`:

  ```ts
  app.get("/health", async () => {
    return { status: "ok" };
  });
  ```
  Note: the route path is written as `/health`, but because the Elysia instance is created with `prefix: "/api"`, the endpoint is served at `GET /api/health`.
- [ ] Step 2: Type check

  Run: `cd remotion_service && bunx tsc --noEmit`

  Expected: no new errors.

- [ ] Step 3: Commit

  ```sh
  git add remotion_service/server/index.ts
  git commit -m "feat(remotion): add /api/health endpoint for Docker health checks"
  ```
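Why the route registered as `/health` is served at `/api/health`: the instance prefix is prepended to every route path. A one-function sketch of that resolution:

```python
def resolve(prefix: str, path: str) -> str:
    """Join an instance prefix and a route path, normalizing the slash."""
    return prefix.rstrip("/") + "/" + path.lstrip("/")

assert resolve("/api", "/health") == "/api/health"
assert resolve("/api", "health") == "/api/health"
```

This is the path the compose healthcheck in Task 10 and the final `curl` in Task 12 must use.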
## Task 10: Add health checks for api, worker, and remotion in compose files

**Files:**
- Modify: `cofee_backend/docker-compose.yml`
- Modify: `remotion_service/docker-compose.yml`

- [ ] Step 1: Add healthcheck to api service

  Add to the `api` service in `cofee_backend/docker-compose.yml` (after `depends_on`):

  ```yaml
  healthcheck:
    test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/health/')"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s
  ```
- [ ] Step 2: Add healthcheck to worker service

  The worker has no HTTP port, so use a process check. Add to the `worker` service:

  ```yaml
  healthcheck:
    test: ["CMD-SHELL", "pgrep -f dramatiq || exit 1"]
    interval: 15s
    timeout: 5s
    retries: 3
  ```

  Note: `python:3.11-slim` does not ship `pgrep`; if the check fails with `pgrep: not found`, add `procps` to the apt install list in the Dockerfile's base stage.
- [ ] Step 3: Add healthcheck to remotion service

  Add to the `remotion` service in `remotion_service/docker-compose.yml` (after `environment`):

  ```yaml
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 15s
  ```

- [ ] Step 4: Validate both compose files

  Run: `cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null`

  Expected: no errors.

- [ ] Step 5: Commit

  ```sh
  git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
  git commit -m "feat(infra): add health checks to api, worker, and remotion services"
  ```
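A rough upper bound on how long Docker waits before flipping a container to `unhealthy`: probe failures only start counting after `start_period`, and each failed probe can cost up to `interval + timeout`. A back-of-envelope helper (an approximation of Docker's semantics, not an exact model):

```python
def worst_case_unhealthy_s(interval: int, timeout: int, retries: int,
                           start_period: int = 0) -> int:
    """Approximate worst-case seconds until a container is marked unhealthy."""
    return start_period + retries * (interval + timeout)

# api healthcheck from Step 1: 30s grace + 5 probes of up to (10s + 5s)
assert worst_case_unhealthy_s(10, 5, 5, start_period=30) == 105

# worker healthcheck from Step 2: 3 probes of up to (15s + 5s)
assert worst_case_unhealthy_s(15, 5, 3) == 60
```

These windows matter for Task 12: "wait for healthy" after `docker compose up -d` can legitimately take a couple of minutes on a cold start.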
## Task 11: Add network segmentation to backend compose

**Files:**
- Modify: `cofee_backend/docker-compose.yml`

Currently all services share one flat network. Separate them into `db-net` (data stores) and `app-net` (application services). This prevents Remotion from reaching DB/Redis directly.

- [ ] Step 1: Add networks to compose

  Add at the bottom of `cofee_backend/docker-compose.yml`, replacing the existing `volumes:` section:

  ```yaml
  volumes:
    cpv3_db:
    cpv3_minio:
    cpv3_redis:

  networks:
    db-net:
      driver: bridge
    app-net:
      driver: bridge
  ```
- [ ] Step 2: Add network assignments to each service

  Add to `db`:

  ```yaml
  networks:
    - db-net
  ```

  Add to `redis`:

  ```yaml
  networks:
    - db-net
  ```

  Add to `minio`:

  ```yaml
  networks:
    - db-net
    - app-net
  ```

  Add to `api`:

  ```yaml
  networks:
    - db-net
    - app-net
  ```

  Add to `worker`:

  ```yaml
  networks:
    - db-net
    - app-net
  ```
- [ ] Step 3: Update Remotion compose to use app-net

  In `remotion_service/docker-compose.yml`, change the networks section:

  ```yaml
  networks:
    backend:
      external: true
      name: cofee_backend_app-net
  ```

  This ensures Remotion can reach MinIO and the API (on `app-net`) but NOT PostgreSQL or Redis (on `db-net`).

- [ ] Step 4: Validate both compose files

  Run: `cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null`

  Expected: no errors.

- [ ] Step 5: Test full stack connectivity

  ```sh
  cd cofee_backend && docker compose down && docker compose up -d
  # Wait for healthy
  cd ../remotion_service && docker compose down && docker compose up -d
  ```

  Verify the API can reach DB, Redis, and MinIO. Verify Remotion can reach MinIO but NOT the DB.

- [ ] Step 6: Commit

  ```sh
  git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
  git commit -m "feat(infra): add network segmentation — db-net and app-net isolation"
  ```
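The segmentation in this task reduces to a simple rule: two services can talk iff they share at least one network. A sketch encoding the assignments above as a reachability check:

```python
# Service -> networks, as assigned in Steps 2-3.
NETWORKS = {
    "db": {"db-net"},
    "redis": {"db-net"},
    "minio": {"db-net", "app-net"},
    "api": {"db-net", "app-net"},
    "worker": {"db-net", "app-net"},
    "remotion": {"app-net"},
}

def can_reach(a: str, b: str) -> bool:
    """Two services can communicate iff they share a network."""
    return bool(NETWORKS[a] & NETWORKS[b])

assert can_reach("remotion", "minio")      # uploads of rendered files work
assert can_reach("remotion", "api")        # webhooks back to the API work
assert not can_reach("remotion", "db")     # no direct Postgres access
assert not can_reach("remotion", "redis")  # no direct Redis access
```

This mirrors the manual verification in Step 5 and can serve as a checklist when adding future services to either network.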
## Task 12: Final verification

- [ ] Step 1: Bring down everything

  ```sh
  cd cofee_backend && docker compose down
  cd ../remotion_service && docker compose down
  ```

- [ ] Step 2: Clean build

  ```sh
  cd cofee_backend && docker compose build --no-cache
  cd ../remotion_service && docker compose build --no-cache
  ```

- [ ] Step 3: Start backend stack

  Run: `cd cofee_backend && docker compose up -d`

  Wait for: `docker compose ps` shows all services healthy.

- [ ] Step 4: Start Remotion stack

  Run: `cd remotion_service && docker compose up -d`

  Wait for: `docker compose ps` shows remotion healthy.

- [ ] Step 5: Test API health

  Run: `curl http://127.0.0.1:8000/api/health/`

  Expected: `{"status":"ok","database":"connected"}`

- [ ] Step 6: Test Remotion health

  Run: `curl http://127.0.0.1:3001/api/health`

  Expected: `{"status":"ok"}`

- [ ] Step 7: Verify port binding

  Run: `docker compose -f cofee_backend/docker-compose.yml ps --format '{{.Name}} {{.Ports}}'`

  Expected: all ports show `127.0.0.1:XXXX->YYYY/tcp` (not `0.0.0.0`).
- [ ] Step 8: Verify resource limits

  Run: `docker inspect cpv3_api --format '{{.HostConfig.Memory}}'`

  Expected: `536870912` (512MB).

  Run: `docker inspect $(cd remotion_service && docker compose ps -q remotion) --format '{{.HostConfig.Memory}}'`

  (The remotion service sets no `container_name`, so resolve the generated name via `docker compose ps -q`.)

  Expected: `4294967296` (4GB).
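The byte values expected from `docker inspect` are just the compose memory strings in binary units. A small converter for spot-checking any future limit values:

```python
# Binary units as Docker interprets compose memory strings.
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def mem_to_bytes(spec: str) -> int:
    """Convert a compose memory string like '512m' or '4g' to bytes."""
    spec = spec.lower()
    if spec[-1] in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1]]
    return int(spec)  # bare integer means bytes

assert mem_to_bytes("512m") == 536870912   # cpv3_api limit
assert mem_to_bytes("4g") == 4294967296    # remotion limit
assert mem_to_bytes("1g") == 1073741824    # cpv3_worker limit
```

If `docker inspect` prints `0` instead, the limit was not applied; check that the compose file is being read by a Compose version that honors `deploy.resources.limits` outside swarm mode.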