Monitoring Checklist
What to monitor in a Docling Studio deployment.
Health Endpoint
The primary monitoring signal is the health endpoint, GET /api/health.
The expected response reports a status of "ok".
Alert if: status != "ok", endpoint unreachable, or response time > 5s.
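The three alert conditions above can be sketched as a small probe. This is a minimal illustration, not part of Docling Studio: it assumes the endpoint returns a JSON object with a `status` field and that the service is reachable at the address used by the Docker health check. The names `evaluate_health` and `probe` are hypothetical.

```python
import http.client
import json
import time
import urllib.request

# Assumption: same address as the Docker health check; adjust for your deployment.
HEALTH_URL = "http://localhost:3000/api/health"
MAX_RESPONSE_SECONDS = 5.0  # alert threshold from the checklist


def evaluate_health(body, elapsed_seconds):
    """Return a list of alert reasons; an empty list means healthy.

    `body` is the decoded JSON payload, or None if the endpoint
    was unreachable or returned something that is not valid JSON.
    """
    if not isinstance(body, dict):
        return ["endpoint unreachable"]
    alerts = []
    if body.get("status") != "ok":
        alerts.append(f"status is {body.get('status')!r}, expected 'ok'")
    if elapsed_seconds > MAX_RESPONSE_SECONDS:
        alerts.append(f"response took {elapsed_seconds:.2f}s")
    return alerts


def probe(url=HEALTH_URL, timeout=10.0):
    """Call the health endpoint once and evaluate the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection errors, timeouts, HTTP error statuses, and bad JSON.
        body = None
    return evaluate_health(body, time.monotonic() - start)
```

Calling `probe()` from a cron job or scheduler and alerting on a non-empty result would cover all three conditions in one place.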
Four Golden Signals
1. Latency
| Endpoint | Expected | Alert threshold |
|---|---|---|
| GET /api/health | < 100ms | > 1s |
| POST /api/documents (upload) | < 2s | > 10s |
| POST /api/analyses (create) | < 500ms (queuing only) | > 5s |
| GET /api/analyses/:id (results) | < 500ms | > 3s |
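One way to apply the latency table to measured response times is a lookup of the two thresholds per endpoint. The sketch below is illustrative only: the "slow" label for values between the expected and alert thresholds is an assumption, and `classify_latency` is a hypothetical helper.

```python
# Thresholds in seconds, copied from the latency table: (expected, alert).
LATENCY_THRESHOLDS = {
    "GET /api/health": (0.1, 1.0),
    "POST /api/documents": (2.0, 10.0),
    "POST /api/analyses": (0.5, 5.0),
    "GET /api/analyses/:id": (0.5, 3.0),
}


def classify_latency(endpoint, seconds):
    """Classify one measured latency as "ok", "slow", or "alert"."""
    expected, alert = LATENCY_THRESHOLDS[endpoint]
    if seconds > alert:
        return "alert"
    if seconds >= expected:
        # Assumption: values between the two thresholds are worth a warning.
        return "slow"
    return "ok"
```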
2. Traffic
| Metric | What to watch |
|---|---|
| Requests per minute | Baseline for normal usage |
| Uploads per hour | Capacity planning |
| Concurrent analyses | Should stay <= MAX_CONCURRENT_ANALYSES |
3. Errors
| Signal | Alert threshold |
|---|---|
| HTTP 5xx rate | > 1% of requests |
| Analysis failure rate | > 10% of analyses |
| Rate limit hits (429) | Spike = possible abuse |
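The two percentage thresholds in the table can be checked from raw counters. This is a minimal sketch that assumes you already collect the four counts over some window. The function name and arguments are hypothetical, and the 429 spike check is left out because it depends on a baseline the checklist does not define.

```python
def error_alerts(total_requests, server_errors, total_analyses, failed_analyses):
    """Return alert messages for the error thresholds in the table."""
    alerts = []
    # Integer comparisons avoid float rounding at the exact threshold.
    if server_errors * 100 > total_requests:  # more than 1% of requests
        alerts.append("HTTP 5xx rate above 1%")
    if failed_analyses * 10 > total_analyses:  # more than 10% of analyses
        alerts.append("analysis failure rate above 10%")
    return alerts
```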
4. Saturation
| Resource | Check command | Alert threshold |
|---|---|---|
| CPU | docker stats | > 90% sustained |
| Memory | docker stats | > 85% (especially in local mode with PyTorch) |
| Disk (SQLite + uploads) | du -sh data/ | > 80% of volume |
| Docker container restarts | docker inspect --format='{{.RestartCount}}' | > 0 |
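The disk threshold is expressed as a share of the volume, while du -sh reports an absolute size. As a possible complement, the sketch below computes the percentage directly with Python's standard library. The helper names are hypothetical, and the result may differ slightly from df because of filesystem-reserved blocks.

```python
import shutil

DISK_ALERT_PERCENT = 80.0  # threshold from the saturation table


def disk_usage_percent(path):
    """Percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def disk_alert(path, threshold=DISK_ALERT_PERCENT):
    """True when the volume holding `path` is fuller than the threshold."""
    return disk_usage_percent(path) > threshold
```

For example, `disk_alert("data/")` would check the volume that holds the data directory named in the table.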
Docker Health Check
The docker-compose.yml includes a built-in health check:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
Docker will mark the container as unhealthy after 3 consecutive failures.
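The status Docker derives from this health check can be read back with docker inspect. The following is a rough sketch that shells out to the Docker CLI; the function names are hypothetical, and treating "starting" as not yet alert-worthy is an assumption.

```python
import subprocess


def container_health(container):
    """Return Docker's health status for a container, or "unknown"."""
    command = [
        "docker", "inspect",
        "--format", "{{.State.Health.Status}}",
        container,
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=10, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Docker missing, container not found, no health check, or timeout.
        return "unknown"
    return result.stdout.strip() or "unknown"


def needs_attention(status):
    """Alert on "unhealthy" or "unknown"; "starting" is given time to settle."""
    return status in ("unhealthy", "unknown")
```

Pass the container name or ID from your own deployment, for example `needs_attention(container_health("my-container"))`, where the name is a placeholder.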
Log Monitoring
Backend logs (uvicorn)
Watch for:
- ERROR or CRITICAL log levels
- TimeoutError from Docling processing
- sqlite3.OperationalError (DB issues)
- 429 Too Many Requests spikes
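The backend signals listed above can be counted with a few regular expressions. This sketch assumes plain-text log lines in which these tokens appear literally; the pattern names and `count_backend_signals` are hypothetical.

```python
import re
from collections import Counter

# One pattern per item in the backend watch list.
BACKEND_PATTERNS = {
    "error_level": re.compile(r"\b(?:ERROR|CRITICAL)\b"),
    "timeout": re.compile(r"\bTimeoutError\b"),
    "sqlite": re.compile(r"sqlite3\.OperationalError"),
    "rate_limited": re.compile(r"\b429 Too Many Requests\b"),
}


def count_backend_signals(lines):
    """Count how many log lines match each watched pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in BACKEND_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The lines could come from the output of docker logs for the backend container, read line by line.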
Frontend logs (nginx)
Watch for:
- 502 Bad Gateway (backend down)
- 413 Request Entity Too Large (file size limit)
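For the nginx side, the two status codes can be pulled from the access log. The sketch below assumes nginx's default combined log format, where the status code follows the quoted request line; a custom log_format would need a different pattern. The names used are hypothetical.

```python
import re
from collections import Counter

# In the combined format, the status code follows the quoted request line.
STATUS_AFTER_REQUEST = re.compile(r'"[^"]*" (\d{3}) ')

# Status codes from the frontend watch list.
WATCHED_STATUSES = {502: "backend down", 413: "file size limit"}


def count_watched_statuses(lines):
    """Count access-log lines whose status is 502 or 413."""
    counts = Counter()
    for line in lines:
        match = STATUS_AFTER_REQUEST.search(line)
        if match:
            status = int(match.group(1))
            if status in WATCHED_STATUSES:
                counts[status] += 1
    return counts
```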
Recommended Setup
For production deployments, consider:
- Uptime monitor: ping /api/health every 60s (UptimeRobot, Healthchecks.io)
- Log aggregation: ship Docker logs to a central service
- Alerting: notify on container restart, health check failure, or error spike