# Docker Infrastructure Hardening — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Harden all Docker infrastructure across the monorepo — security, build optimization, service organization, health checks, and networking.
**Architecture:** 4-phase approach: quick config fixes first (no code changes), then Dockerfile improvements, then health endpoints + networking, then resource limits. Each phase produces a working stack.
**Tech Stack:** Docker, Docker Compose, FastAPI (Python), ElysiaJS (Bun/TypeScript), PostgreSQL, Redis, MinIO
---
### Task 1: Add .env to .gitignore files
**Files:**
- Modify: `cofee_backend/.gitignore`
- Modify: `cofee_frontend/.gitignore`
- [ ] **Step 1: Add .env exclusion to backend .gitignore**
Append to `cofee_backend/.gitignore`:
```
# Environment
.env
.env.*
```
- [ ] **Step 2: Add .env exclusion to frontend .gitignore**
The frontend `.gitignore` has `.env*.local` but not `.env` itself. Add before the `# local env files` section in `cofee_frontend/.gitignore`:
```
# Environment
.env
```
Note: Keep the existing `.env*.local` line too.
- [ ] **Step 3: Verify .env files are not tracked**
Run: `git ls-files | grep '\.env'`
Expected: no output. If any .env files are tracked, run `git rm --cached <file>` for each.
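The same audit can be scripted. A minimal sketch (the helper name is illustrative, not part of the repo) that flags any tracked path whose basename is `.env` or starts with `.env.`:

```python
def find_tracked_env_files(tracked_paths: list[str]) -> list[str]:
    """Return paths that the new .gitignore rules are meant to exclude."""
    flagged = []
    for path in tracked_paths:
        name = path.rsplit("/", 1)[-1]
        if name == ".env" or name.startswith(".env."):
            flagged.append(path)
    return flagged

# Feed it the output of `git ls-files`:
paths = ["cofee_backend/.env", "cofee_backend/app.py", "cofee_frontend/.env.local"]
print(find_tracked_env_files(paths))
# ['cofee_backend/.env', 'cofee_frontend/.env.local']
```

Note that `.env.local` is flagged too, which matches the frontend's existing `.env*.local` rule.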
- [ ] **Step 4: Commit**
```bash
git add cofee_backend/.gitignore cofee_frontend/.gitignore
git commit -m "fix(infra): add .env to backend and frontend .gitignore"
```
---
### Task 2: Add .env to backend .dockerignore
**Files:**
- Modify: `cofee_backend/.dockerignore`
- [ ] **Step 1: Add .env exclusion**
Add to `cofee_backend/.dockerignore`:
```
.env
.env.*
```
- [ ] **Step 2: Commit**
```bash
git add cofee_backend/.dockerignore
git commit -m "fix(infra): exclude .env from backend Docker build context"
```
---
### Task 3: DRY up docker-compose env vars with YAML anchor
**Files:**
- Modify: `cofee_backend/docker-compose.yml`
The `api` and `worker` services share 14 identical env vars. Extract into an `x-backend-env` anchor. Also adds the missing `JWT_SECRET_KEY` to worker.
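If YAML merge keys are unfamiliar: `<<: *backend-env` behaves much like a Python dict unpack, where keys written explicitly in the service override the merged ones. A rough sketch (values abbreviated, not the full env set):

```python
# The anchor: shared settings defined once.
backend_env = {
    "DEBUG": "1",
    "JWT_SECRET_KEY": "dev-secret",
    "REDIS_URL": "redis://redis:6379/0",
}

# `<<: *backend-env` in a service is roughly this merge;
# service-local keys win over the anchor's keys.
api_env = {**backend_env, "DEBUG": "0"}

print(api_env["DEBUG"])      # "0", the local value wins
print(api_env["REDIS_URL"])  # inherited from the anchor
```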
- [ ] **Step 1: Add x-backend-env anchor and refactor services**
Replace the entire `cofee_backend/docker-compose.yml` with:
```yaml
x-backend-image: &backend-image
  image: cpv3-backend:dev
  build:
    context: .
    dockerfile: Dockerfile
    target: dev

x-backend-env: &backend-env
  DEBUG: ${DEBUG:-1}
  JWT_SECRET_KEY: ${JWT_SECRET_KEY:-dev-secret}
  POSTGRES_USER: ${POSTGRES_USER:-postgres}
  POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
  POSTGRES_HOST: db
  POSTGRES_PORT: 5432
  POSTGRES_DATABASE: ${POSTGRES_DATABASE:-coffee_project_db}
  STORAGE_BACKEND: ${STORAGE_BACKEND:-S3}
  S3_ACCESS_KEY: ${MINIO_ROOT_USER:-minioadmin}
  S3_SECRET_KEY: ${MINIO_ROOT_PASSWORD:-minioadmin}
  S3_BUCKET_NAME: ${S3_BUCKET_NAME:-coffee-bucket}
  S3_ENDPOINT_URL_INTERNAL: http://minio:9000
  S3_ENDPOINT_URL_PUBLIC: http://localhost:9000
  REDIS_URL: redis://redis:6379/0
  WEBHOOK_BASE_URL: http://api:8000
  REMOTION_SERVICE_URL: ${REMOTION_SERVICE_URL:-http://remotion:3001}

services:
  db:
    container_name: cpv3_postgres
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-postgres}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-postgres}
      POSTGRES_DB: ${POSTGRES_DATABASE:-coffee_project_db}
    ports:
      - "127.0.0.1:5332:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-postgres} -d ${POSTGRES_DATABASE:-coffee_project_db}"]
      interval: 5s
      timeout: 3s
      retries: 20
    volumes:
      - cpv3_db:/var/lib/postgresql/data

  minio:
    container_name: cpv3_minio
    image: minio/minio:RELEASE.2024-11-07T00-52-20Z
    restart: unless-stopped
    ports:
      - "127.0.0.1:9000:9000"
      - "127.0.0.1:9001:9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER:-minioadmin}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD:-minioadmin}
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - cpv3_minio:/data

  redis:
    container_name: cpv3_redis
    image: redis:7-alpine
    restart: unless-stopped
    ports:
      - "127.0.0.1:6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 10
    volumes:
      - cpv3_redis:/data

  api:
    container_name: cpv3_api
    <<: *backend-image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: *backend-env
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - ./cpv3:/app/cpv3
      - ./alembic:/app/alembic
      - ./alembic.ini:/app/alembic.ini

  worker:
    container_name: cpv3_worker
    <<: *backend-image
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    environment:
      <<: *backend-env
    command: >
      watchfiles --filter python 'dramatiq cpv3.modules.tasks.service --processes 1 --threads 2' /app/cpv3
    volumes:
      - ./cpv3:/app/cpv3

volumes:
  cpv3_db:
  cpv3_minio:
  cpv3_redis:
```
Key changes in this file:
- `x-backend-env` anchor with all shared env vars (DRY)
- `JWT_SECRET_KEY` added to worker (was missing)
- `restart: unless-stopped` on all services
- All ports bound to `127.0.0.1` (not `0.0.0.0`)
- MinIO pinned to `RELEASE.2024-11-07T00-52-20Z`
- MinIO health check added (`curl` on `/minio/health/live`)
- Removed inline comments for cleanliness
- [ ] **Step 2: Validate compose syntax**
Run: `cd cofee_backend && docker compose config > /dev/null`
Expected: no errors.
- [ ] **Step 3: Test stack starts**
Run: `cd cofee_backend && docker compose up -d`
Wait 30s, then: `docker compose ps`
Expected: all services `Up` or `Up (healthy)`.
- [ ] **Step 4: Commit**
```bash
git add cofee_backend/docker-compose.yml
git commit -m "refactor(infra): DRY env vars, pin images, bind localhost, add restart policies"
```
---
### Task 4: Move build-essential out of base stage in backend Dockerfile
**Files:**
- Modify: `cofee_backend/Dockerfile`
`build-essential` is only needed during `uv sync` (compiling C extensions). Moving it from `base` to `deps` saves ~200MB in the prod image: the `prod` stage inherits from `base` and copies only the finished `.venv` from `deps`, so the compiled artifacts come along without the system build packages.
- [ ] **Step 1: Restructure Dockerfile stages**
Replace the entire `cofee_backend/Dockerfile` with:
```dockerfile
# syntax=docker/dockerfile:1.7

# ---------------------------------------------------------------------------
# Stage 1: base — minimal runtime dependencies (shared by dev and prod)
# ---------------------------------------------------------------------------
FROM python:3.11-slim AS base

COPY --from=ghcr.io/astral-sh/uv:0.8.15 /uv /uvx /bin/

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PATH="/app/.venv/bin:${PATH}"

WORKDIR /app

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# ---------------------------------------------------------------------------
# Stage 2: deps — install Python dependencies (build-essential here only)
# ---------------------------------------------------------------------------
FROM base AS deps

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY pyproject.toml uv.lock ./
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --no-install-project

# ---------------------------------------------------------------------------
# Stage 3: dev — development target (used by docker-compose)
# ---------------------------------------------------------------------------
FROM deps AS dev

ENV PYTHONPATH=/app
EXPOSE 8000
CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000 --reload --reload-dir /app/cpv3"]

# ---------------------------------------------------------------------------
# Stage 4: prod — production target (no build-essential, non-root user)
# ---------------------------------------------------------------------------
FROM base AS prod

RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid app --create-home app

COPY --from=deps /app/.venv /app/.venv
COPY pyproject.toml uv.lock ./

ENV UV_LINK_MODE=copy

COPY cpv3 ./cpv3
COPY alembic ./alembic
COPY alembic.ini ./

RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev

RUN chown -R app:app /app
USER app
EXPOSE 8000
CMD ["sh", "-c", "alembic upgrade head && uvicorn cpv3.main:app --host 0.0.0.0 --port 8000"]
```
Key changes:
- `build-essential` moved from `base` to `deps` — prod image is ~200MB smaller
- `prod` stage inherits from `base` (not `deps`) — no compiler in production
- `prod` copies only `.venv` from `deps` stage — gets compiled packages without build tools
- Non-root `app` user (uid 1000) added to `prod` stage
- `dev` stage still inherits from `deps` (has build-essential for potential ad-hoc installs)
- [ ] **Step 2: Build and verify prod stage**
Run: `cd cofee_backend && docker build --target prod -t cpv3-backend:prod-test .`
Expected: builds successfully.
- [ ] **Step 3: Build and verify dev stage**
Run: `cd cofee_backend && docker build --target dev -t cpv3-backend:dev-test .`
Expected: builds successfully.
- [ ] **Step 4: Verify dev stack still works**
Run: `cd cofee_backend && docker compose up -d --build`
Wait 30s, then: `docker compose ps`
Expected: all services running.
- [ ] **Step 5: Commit**
```bash
git add cofee_backend/Dockerfile
git commit -m "perf(infra): move build-essential to deps stage, add non-root user to prod"
```
---
### Task 5: Add BuildKit cache mounts and non-root user to Remotion Dockerfile
**Files:**
- Modify: `remotion_service/Dockerfile`
- [ ] **Step 1: Update Remotion Dockerfile**
Replace the entire `remotion_service/Dockerfile` with:
```dockerfile
# syntax=docker/dockerfile:1.7-labs

FROM oven/bun:1.3.10 AS base

ENV APP_HOME=/app \
    PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 \
    REMOTION_PUPPETEER_NO_SANDBOX=1 \
    NODE_ENV=production

WORKDIR ${APP_HOME}

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        ca-certificates \
        ffmpeg \
        chromium \
        libglib2.0-0 \
        libnss3 \
        libatk1.0-0 \
        libatk-bridge2.0-0 \
        libdrm2 \
        libxkbcommon0 \
        libgbm1 \
        fonts-noto-color-emoji \
        curl \
    && rm -rf /var/lib/apt/lists/*

FROM base AS deps
WORKDIR ${APP_HOME}
COPY package.json bun.lock ./
RUN NODE_ENV=development bun install --frozen-lockfile

FROM base AS runner
WORKDIR ${APP_HOME}

RUN groupadd --gid 1000 app && \
    useradd --uid 1000 --gid app --create-home app

COPY --from=deps ${APP_HOME}/node_modules ./node_modules
COPY package.json bun.lock ./
COPY tsconfig.json remotion.config.ts ./
COPY public ./public
COPY src ./src
COPY server ./server

RUN mkdir -p out && chown -R app:app /app

USER app
EXPOSE 3001
CMD ["bun", "run", "server"]
```
Key changes:
- BuildKit apt cache mounts added (matches backend pattern)
- Non-root `app` user (uid 1000) in runner stage
- `chown` before `USER app` so the app owns all files including `out/`
- [ ] **Step 2: Build and verify**
Run: `cd remotion_service && docker build --target runner -t remotion:test .`
Expected: builds successfully.
- [ ] **Step 3: Commit**
```bash
git add remotion_service/Dockerfile
git commit -m "perf(infra): add BuildKit cache mounts and non-root user to Remotion Dockerfile"
```
---
### Task 6: Add resource limits and cap_drop to Remotion docker-compose
**Files:**
- Modify: `remotion_service/docker-compose.yml`
- [ ] **Step 1: Update Remotion docker-compose.yml**
Replace the entire `remotion_service/docker-compose.yml` with:
```yaml
services:
  remotion:
    build:
      context: .
      dockerfile: Dockerfile
      target: runner
    command: >
      sh -lc "NODE_ENV=development bun install --frozen-lockfile && bun run server"
    restart: unless-stopped
    env_file: .env
    environment:
      S3_ENDPOINT_URL: http://minio:9000
      REDIS_URL: redis://redis:6379/0
    ports:
      - "127.0.0.1:3001:3001"
    deploy:
      resources:
        limits:
          memory: 4g
          cpus: "2"
        reservations:
          memory: 1g
          cpus: "0.5"
    cap_drop:
      - ALL
    cap_add:
      - SYS_ADMIN
    volumes:
      - .:/app:cached
      - remotion_node_modules:/app/node_modules
    networks:
      - backend
    stdin_open: true
    tty: true

volumes:
  remotion_node_modules:

networks:
  backend:
    external: true
    name: cofee_backend_default
```
Key changes:
- `restart: unless-stopped`
- Port bound to `127.0.0.1`
- Resource limits: 4GB memory / 2 CPUs (Chromium + FFmpeg need this)
- Resource reservations: 1GB / 0.5 CPU (scheduling guarantees)
- `cap_drop: ALL` + `cap_add: SYS_ADMIN` (SYS_ADMIN needed for Chromium sandbox)
- [ ] **Step 2: Validate compose syntax**
Run: `cd remotion_service && docker compose config > /dev/null`
Expected: no errors.
- [ ] **Step 3: Commit**
```bash
git add remotion_service/docker-compose.yml
git commit -m "fix(infra): add resource limits, cap_drop, restart policy to Remotion compose"
```
---
### Task 7: Add resource limits and cap_drop to backend docker-compose
**Files:**
- Modify: `cofee_backend/docker-compose.yml`
- [ ] **Step 1: Add deploy and cap_drop sections to each service**
Add to the `db` service after `volumes`:
```yaml
cap_drop:
  - ALL
cap_add:
  - CHOWN
  - DAC_OVERRIDE
  - FOWNER
  - SETGID
  - SETUID
```
Add to the `minio` service after `volumes`:
```yaml
cap_drop:
  - ALL
cap_add:
  - CHOWN
  - DAC_OVERRIDE
  - FOWNER
  - SETGID
  - SETUID
```
Add to the `redis` service after `volumes`:
```yaml
cap_drop:
  - ALL
```
Add to the `api` service after `volumes`:
```yaml
deploy:
  resources:
    limits:
      memory: 512m
      cpus: "1"
cap_drop:
  - ALL
```
Add to the `worker` service after `volumes`:
```yaml
deploy:
  resources:
    limits:
      memory: 1g
      cpus: "1"
cap_drop:
  - ALL
```
- [ ] **Step 2: Validate compose syntax**
Run: `cd cofee_backend && docker compose config > /dev/null`
Expected: no errors.
- [ ] **Step 3: Commit**
```bash
git add cofee_backend/docker-compose.yml
git commit -m "fix(infra): add resource limits and capability dropping to backend compose"
```
---
### Task 8: Add health check endpoint to backend API
**Files:**
- Modify: `cofee_backend/cpv3/modules/system/router.py`
The existing `/api/ping/` only returns a static response. We need a `/api/health/` endpoint that verifies database connectivity, so Docker health checks reflect whether the API can actually serve requests.
- [ ] **Step 1: Add health endpoint to system router**
Replace the contents of `cofee_backend/cpv3/modules/system/router.py` with:
```python
from __future__ import annotations

from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from cpv3.db.session import get_db
from cpv3.infrastructure.settings import get_settings

router = APIRouter(prefix="/api", tags=["System"])
_settings = get_settings()


@router.get("/ping/")
async def ping() -> dict[str, str]:
    return {"status": "ok"}


@router.get("/health/")
async def health(db: AsyncSession = Depends(get_db)) -> dict[str, str]:
    """Health check for Docker/K8s probes. Verifies DB connectivity."""
    try:
        await db.execute(text("SELECT 1"))
        db_status = "connected"
    except Exception:
        db_status = "disconnected"
    status = "ok" if db_status == "connected" else "degraded"
    return {"status": status, "database": db_status}
```
- [ ] **Step 2: Run linter**
Run: `cd cofee_backend && uv run ruff check cpv3/modules/system/router.py`
Expected: no errors.
- [ ] **Step 3: Run existing tests**
Run: `cd cofee_backend && uv run pytest -x -q 2>&1 | tail -10`
Expected: all tests pass (health endpoint is additive, no breaking changes).
- [ ] **Step 4: Commit**
```bash
git add cofee_backend/cpv3/modules/system/router.py
git commit -m "feat(backend): add /api/health/ endpoint for Docker health checks"
```
---
### Task 9: Add health check endpoint to Remotion service
**Files:**
- Modify: `remotion_service/server/index.ts`
- [ ] **Step 1: Add /health endpoint before app.listen**
Add before the `app.listen(...)` line (around line 138) in `remotion_service/server/index.ts`:
```typescript
app.get("/health", async () => {
  return { status: "ok" };
});
```
Note: the route is registered as `/health` at the Elysia instance level, and because the instance is created with `prefix: "/api"`, it is served at `GET /api/health`.
- [ ] **Step 2: Type check**
Run: `cd remotion_service && bunx tsc --noEmit`
Expected: no new errors.
- [ ] **Step 3: Commit**
```bash
git add remotion_service/server/index.ts
git commit -m "feat(remotion): add /api/health endpoint for Docker health checks"
```
---
### Task 10: Add health checks for api, worker, and remotion in compose files
**Files:**
- Modify: `cofee_backend/docker-compose.yml`
- Modify: `remotion_service/docker-compose.yml`
- [ ] **Step 1: Add healthcheck to api service**
Add to `api` service in `cofee_backend/docker-compose.yml` (after `depends_on`):
```yaml
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/api/health/')"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 30s
```
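The `python -c` probe above avoids needing `curl` inside the slim image. Its logic, pulled out into a self-contained sketch that exercises it against a throwaway local server (the handler below is a stand-in for the real API, not part of the repo):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeHealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the API: only /api/health/ answers 200."""
    def do_GET(self):
        if self.path == "/api/health/":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # silence per-request logging
        pass

def probe(url: str, timeout: float = 5.0) -> bool:
    """Same shape as the compose healthcheck: success means HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

server = HTTPServer(("127.0.0.1", 0), FakeHealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

healthy = probe(f"http://127.0.0.1:{port}/api/health/")
missing = probe(f"http://127.0.0.1:{port}/nope")
print(healthy, missing)  # True False
server.shutdown()
```

A non-200 response (or a connection error) makes `urlopen` raise, which the healthcheck turns into a non-zero exit code.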
- [ ] **Step 2: Add healthcheck to worker service**
The worker has no HTTP port, so use a process check. Note that `pgrep` is provided by `procps`, which `python:3.11-slim` does not include by default; install it in the Dockerfile if the check reports the container unhealthy. Add to `worker` service:
```yaml
healthcheck:
  test: ["CMD-SHELL", "pgrep -f dramatiq || exit 1"]
  interval: 15s
  timeout: 5s
  retries: 3
```
- [ ] **Step 3: Add healthcheck to remotion service**
Add to `remotion` service in `remotion_service/docker-compose.yml` (after `environment`):
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3001/api/health"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
```
- [ ] **Step 4: Validate both compose files**
Run: `cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null`
Expected: no errors.
- [ ] **Step 5: Commit**
```bash
git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
git commit -m "feat(infra): add health checks to api, worker, and remotion services"
```
---
### Task 11: Add network segmentation to backend compose
**Files:**
- Modify: `cofee_backend/docker-compose.yml`
Currently all services share one flat network. Separate into `db-net` (data stores) and `app-net` (application services). This prevents Remotion from reaching DB/Redis directly.
- [ ] **Step 1: Add networks to compose**
Add at the bottom of `cofee_backend/docker-compose.yml`, replacing the existing `volumes:` section:
```yaml
volumes:
  cpv3_db:
  cpv3_minio:
  cpv3_redis:

networks:
  db-net:
    driver: bridge
  app-net:
    driver: bridge
```
- [ ] **Step 2: Add network assignments to each service**
Add to `db`:
```yaml
networks:
  - db-net
```
Add to `redis`:
```yaml
networks:
  - db-net
```
Add to `minio`:
```yaml
networks:
  - db-net
  - app-net
```
Add to `api`:
```yaml
networks:
  - db-net
  - app-net
```
Add to `worker`:
```yaml
networks:
  - db-net
  - app-net
```
- [ ] **Step 3: Update Remotion compose to use app-net**
In `remotion_service/docker-compose.yml`, change the networks section:
```yaml
networks:
  backend:
    external: true
    name: cofee_backend_app-net
```
This ensures Remotion can reach MinIO and API (on `app-net`) but NOT PostgreSQL or Redis (on `db-net`).
- [ ] **Step 4: Validate both compose files**
Run: `cd cofee_backend && docker compose config > /dev/null && cd ../remotion_service && docker compose config > /dev/null`
Expected: no errors.
- [ ] **Step 5: Test full stack connectivity**
Run:
```bash
cd cofee_backend && docker compose down && docker compose up -d
# Wait for healthy
cd ../remotion_service && docker compose down && docker compose up -d
```
Verify API can reach DB, Redis, MinIO. Verify Remotion can reach MinIO but NOT DB.
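The reachability assertions can be scripted with a small helper (hypothetical, not part of the repo); run it with the service hostnames from inside a container, e.g. via `docker compose exec`:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From inside the Remotion container, the expectation after segmentation:
#   can_connect("minio", 9000) should be True  (both services on app-net)
#   can_connect("db", 5432)    should be False (db-net only)
```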
- [ ] **Step 6: Commit**
```bash
git add cofee_backend/docker-compose.yml remotion_service/docker-compose.yml
git commit -m "feat(infra): add network segmentation — db-net and app-net isolation"
```
---
### Task 12: Final verification
- [ ] **Step 1: Bring down everything**
```bash
cd cofee_backend && docker compose down
cd ../remotion_service && docker compose down
```
- [ ] **Step 2: Clean build**
```bash
cd cofee_backend && docker compose build --no-cache
cd ../remotion_service && docker compose build --no-cache
```
- [ ] **Step 3: Start backend stack**
```bash
cd cofee_backend && docker compose up -d
```
Wait for: `docker compose ps` shows all services healthy.
- [ ] **Step 4: Start Remotion stack**
```bash
cd remotion_service && docker compose up -d
```
Wait for: `docker compose ps` shows remotion healthy.
- [ ] **Step 5: Test API health**
Run: `curl http://127.0.0.1:8000/api/health/`
Expected: `{"status":"ok","database":"connected"}`
- [ ] **Step 6: Test Remotion health**
Run: `curl http://127.0.0.1:3001/api/health`
Expected: `{"status":"ok"}`
- [ ] **Step 7: Verify port binding**
Run: `docker compose -f cofee_backend/docker-compose.yml ps --format '{{.Name}} {{.Ports}}'`
Expected: all ports show `127.0.0.1:XXXX->YYYY/tcp` (not `0.0.0.0`).
- [ ] **Step 8: Verify resource limits**
Run: `docker inspect cpv3_api --format '{{.HostConfig.Memory}}'`
Expected: `536870912` (512MB).
Run: `docker compose -f remotion_service/docker-compose.yml ps -q remotion | xargs docker inspect --format '{{.HostConfig.Memory}}'` (the remotion service sets no `container_name`, so look it up through compose).
Expected: `4294967296` (4GB).
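The expected byte counts follow from binary-prefix arithmetic (Docker's `512m` and `4g` mean MiB and GiB):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

print(512 * MIB)  # 536870912, the api limit of 512m
print(4 * GIB)    # 4294967296, the remotion limit of 4g
```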