# Monitoring and Observability Guide
## Overview
This guide covers setting up comprehensive monitoring for Readur, including metrics collection, log aggregation, alerting, and dashboard creation.
## Monitoring Stack Components
### Core Components
1. **Metrics Collection**: Prometheus + Node Exporter
2. **Visualization**: Grafana
3. **Log Aggregation**: Loki or ELK Stack
4. **Alerting**: AlertManager
5. **Application Monitoring**: Custom metrics and health checks
6. **Uptime Monitoring**: Uptime Kuma or Pingdom
## Health Monitoring
### Built-in Health Endpoints
```bash
# Basic health check
curl http://localhost:8000/health

# Detailed health status
curl http://localhost:8000/health/detailed

# Response format (from /health/detailed):
# {
#   "status": "healthy",
#   "database": "connected",
#   "redis": "connected",
#   "storage": "accessible",
#   "ocr_queue": 45,
#   "version": "2.5.4",
#   "uptime": 345600
# }
```
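
These endpoints can also drive container-level health checks. Below is a minimal sketch for a Compose service definition, assuming the Readur image includes `curl`; adjust intervals to your environment:

```yaml
# Excerpt: add to the readur service in docker-compose.yml
services:
  readur:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
```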
### Custom Health Checks
```python
# health_checks.py
# Assumes application-level handles for `db` (SQLAlchemy session), `storage`,
# `celery`, and `redis` are importable from your application package.
import time
from typing import Dict, Any


class HealthMonitor:
    @staticmethod
    def check_database() -> Dict[str, Any]:
        try:
            start = time.time()
            db.session.execute("SELECT 1")
            return {"status": "healthy", "response_time": time.time() - start}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_storage() -> Dict[str, Any]:
        try:
            # Check that the storage backend is accessible
            storage.list_files(limit=1)
            return {"status": "healthy", "available_space": storage.get_free_space()}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_ocr_workers() -> Dict[str, Any]:
        # Celery reports active tasks per worker; an empty result means
        # no workers are responding.
        active = celery.control.inspect().active()
        return {
            "status": "healthy" if active else "degraded",
            "active_workers": len(active or {}),
            "queue_length": redis.llen("ocr_queue"),
        }
```
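
These checks can then be exposed on an HTTP endpoint so external monitors can read them. A minimal sketch assuming a Flask `app` object; the `/health/custom` route name is hypothetical and chosen to avoid colliding with Readur's built-in endpoints:

```python
# Hypothetical wiring of HealthMonitor into a custom endpoint (Flask assumed)
from flask import jsonify

@app.route("/health/custom")
def health_custom():
    checks = {
        "database": HealthMonitor.check_database(),
        "storage": HealthMonitor.check_storage(),
        "ocr_workers": HealthMonitor.check_ocr_workers(),
    }
    status = "healthy" if all(c["status"] == "healthy" for c in checks.values()) else "degraded"
    return jsonify({"status": status, **checks})
```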
## Prometheus Setup
### Installation and Configuration
```yaml
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts   # rule files referenced by rule_files
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://readur:password@postgres:5432/readur?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring

  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    environment:
      REDIS_ADDR: "redis://redis:6379"
    ports:
      - "9121:9121"
    networks:
      - monitoring

networks:
  monitoring:
    external: true

volumes:
  prometheus_data:
```
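
The `monitoring` network is declared as `external`, so create it once before starting the stack:

```bash
# Create the shared network, then start the monitoring stack
docker network create monitoring
docker-compose -f docker-compose.monitoring.yml up -d
```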
### Prometheus Configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'readur-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - '/etc/prometheus/alerts/*.yml'

scrape_configs:
  - job_name: 'readur'
    static_configs:
      - targets: ['readur:8000']
    metrics_path: '/metrics'

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```
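
Before (re)loading Prometheus, the configuration and the rule files it references can be validated with `promtool`, which ships in the Prometheus image:

```bash
# Validate prometheus.yml and all referenced rule files
docker-compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
```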
## Grafana Dashboards
### Setup Grafana
```yaml
# Add to docker-compose.monitoring.yml under `services:`
# (also declare `grafana_data:` in the top-level `volumes:` section)
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_SERVER_ROOT_URL=https://grafana.readur.company.com
      - GF_INSTALL_PLUGINS=redis-datasource
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    networks:
      - monitoring
```
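
The provisioning mount should contain at least one datasource definition so the dashboards have something to query. A minimal sketch pointing Grafana at the Prometheus container (the file name is illustrative):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```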
### Dashboard Configuration
Save as `grafana/provisioning/dashboards/readur.json`:

```json
{
  "dashboard": {
    "title": "Readur Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{ "expr": "rate(readur_requests_total[5m])" }]
      },
      {
        "title": "Response Time",
        "targets": [{ "expr": "histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m]))" }]
      },
      {
        "title": "OCR Queue",
        "targets": [{ "expr": "readur_ocr_queue_length" }]
      },
      {
        "title": "Database Connections",
        "targets": [{ "expr": "pg_stat_database_numbackends{datname='readur'}" }]
      }
    ]
  }
}
```
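
For Grafana to load JSON dashboards from that directory, a dashboard provider file is also needed alongside it. A minimal sketch (file name and provider name are illustrative):

```yaml
# grafana/provisioning/dashboards/provider.yml
apiVersion: 1
providers:
  - name: 'readur-dashboards'
    orgId: 1
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/provisioning/dashboards
```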
## Application Metrics
### Custom Metrics Implementation
```python
# metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Define metrics
request_count = Counter('readur_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('readur_request_duration_seconds', 'Request duration')
ocr_queue_length = Gauge('readur_ocr_queue_length', 'OCR queue length')
active_users = Gauge('readur_active_users', 'Active users in last 5 minutes')
document_count = Gauge('readur_documents_total', 'Total documents', ['status'])


# WSGI middleware to track request counts and durations
class MetricsMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '/')
        method = environ.get('REQUEST_METHOD', 'GET')
        request_count.labels(method=method, endpoint=path).inc()
        # Time the downstream application call
        with request_duration.time():
            return self.app(environ, start_response)


# Metrics endpoint (assumes a Flask `app`, a `redis` client, and the helper
# functions used below exist in your application)
@app.route('/metrics')
def metrics():
    # Refresh gauges on each scrape
    ocr_queue_length.set(redis.llen('ocr_queue'))
    active_users.set(get_active_user_count())
    document_count.labels(status='processed').set(get_document_count('processed'))
    return generate_latest(), 200, {'Content-Type': 'text/plain'}
```
## Log Aggregation
### Loki Setup
```yaml
# Add to docker-compose.monitoring.yml under `services:`
# (also declare `loki_data:` in the top-level `volumes:` section)
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro   # required for docker_sd_configs
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring
```
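
The Compose snippet mounts `./loki/loki-config.yml`, which is not shown above. A minimal single-node sketch modelled on Loki's upstream filesystem `local-config` example; treat the schema dates and store types as assumptions and adjust them to your Loki version:

```yaml
# loki/loki-config.yml (minimal single-node sketch)
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2022-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
```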
### Log Configuration
```yaml
# promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: readur
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["com.docker.compose.project=readur"]
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
```
## Alerting
### AlertManager Configuration
```yaml
# alertmanager/config.yml
global:
  smtp_from: 'alertmanager@readur.company.com'
  smtp_smarthost: 'smtp.company.com:587'
  smtp_auth_username: 'alertmanager@readur.company.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-admins'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'team-admins'

receivers:
  - name: 'team-admins'
    email_configs:
      - to: 'admin-team@company.com'
        headers:
          Subject: 'Readur Alert: {{ .GroupLabels.alertname }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
```
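
The Prometheus configuration above points at `alertmanager:9093`, so the Alertmanager container itself also needs to join the monitoring stack. A minimal Compose sketch:

```yaml
# Add to docker-compose.monitoring.yml under `services:`
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitoring
```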
### Alert Rules
```yaml
# prometheus/alerts/readur.yml
groups:
  - name: readur
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.instance }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL database is not responding"

      - alert: HighOCRQueue
        expr: readur_ocr_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog"
          description: "OCR queue has {{ $value }} pending items"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
```
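
Both the rule file and the Alertmanager configuration can be validated before deployment (paths assume the mounts shown earlier):

```bash
# Validate alert rules and the Alertmanager configuration
docker-compose exec prometheus promtool check rules /etc/prometheus/alerts/readur.yml
docker-compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
```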
## Performance Monitoring
### APM Integration
```python
# apm_config.py
from elasticapm import Client
from elasticapm.contrib.flask import ElasticAPM

# Configure the APM client
apm_client = Client({
    'SERVICE_NAME': 'readur',
    'SERVER_URL': 'http://apm-server:8200',
    'ENVIRONMENT': 'production',
    'SECRET_TOKEN': 'your-secret-token',
})

# Instrument the Flask app
apm = ElasticAPM(app, client=apm_client)
```
### Custom Performance Metrics
```python
# performance_metrics.py
# Assumes application-level `metrics` and `logger` objects are available.
import time
from contextlib import contextmanager


@contextmanager
def track_performance(operation_name):
    start_time = time.time()
    try:
        yield
    finally:
        duration = time.time() - start_time
        metrics.record_operation_time(operation_name, duration)
        if duration > 1.0:  # Log slow operations
            logger.warning(f"Slow operation: {operation_name} took {duration:.2f}s")


# Usage
with track_performance("document_processing"):
    process_document(doc_id)
```
## Uptime Monitoring
### External Monitoring
```yaml
# uptime-kuma/docker-compose.yml
version: '3.8'

services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime-kuma_data:/app/data
    ports:
      - "3001:3001"
    restart: unless-stopped

volumes:
  uptime-kuma_data:
```
### Status Page Configuration
```nginx
# Public status page
server {
    listen 443 ssl;
    server_name status.readur.company.com;

    # ssl_certificate / ssl_certificate_key directives go here

    location / {
        proxy_pass http://localhost:3001;
        proxy_set_header Host $host;
    }
}
```
## Dashboard Examples
### Key Metrics Dashboard
```sql
-- Query for document processing stats (last 30 days)
SELECT
    DATE(created_at) AS date,
    COUNT(*) AS documents_processed,
    AVG(processing_time) AS avg_processing_time,
    MAX(processing_time) AS max_processing_time
FROM documents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```
### Real-time Monitoring
```javascript
// WebSocket monitoring dashboard
const ws = new WebSocket('wss://readur.company.com/ws/metrics');

ws.onmessage = (event) => {
  const metrics = JSON.parse(event.data);
  updateDashboard({
    activeUsers: metrics.active_users,
    queueLength: metrics.queue_length,
    responseTime: metrics.response_time,
    errorRate: metrics.error_rate
  });
};
```
## Troubleshooting Monitoring Issues
### Prometheus Not Scraping
```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Verify metrics endpoint
curl http://localhost:8000/metrics
# Check network connectivity
docker network inspect monitoring
```
### Missing Metrics
```bash
# Debug metric collection
docker-compose exec readur python -c "
from prometheus_client import REGISTRY
for collector in REGISTRY._collector_to_names:
print(collector)
"
```
### High Memory Usage
```bash
# Check Prometheus storage usage (data lives in the prometheus_data volume)
docker-compose exec prometheus du -sh /prometheus

# Analyze the TSDB to find high-cardinality metrics
docker-compose exec prometheus promtool tsdb analyze /prometheus

# Remove deleted series (requires --web.enable-admin-api on Prometheus)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# If usage is still high, lower --storage.tsdb.retention.time in the compose file
```
## Best Practices
### Monitoring Strategy
1. **Start Simple**: Begin with basic health checks and expand
2. **Alert Fatigue**: Only alert on actionable issues
3. **SLI/SLO Definition**: Define and track service level indicators (see the recording-rule sketch after this list)
4. **Dashboard Organization**: Create role-specific dashboards
5. **Log Retention**: Balance storage costs with debugging needs
6. **Security**: Protect monitoring endpoints and dashboards
7. **Documentation**: Document alert runbooks and response procedures
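
For the SLI/SLO point above, Prometheus recording rules keep the indicators cheap to query and easy to alert on. A sketch using the latency histogram and request counter defined earlier; the rule names and file path are illustrative:

```yaml
# prometheus/alerts/readur-slo.yml
groups:
  - name: readur-slo
    rules:
      - record: readur:request_latency_seconds:p95_5m
        expr: histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m]))
      - record: readur:request_rate:5m
        expr: sum(rate(readur_requests_total[5m]))
```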
### Maintenance
```bash
#!/bin/bash
# Weekly maintenance tasks

# Rotate application logs
docker-compose exec readur logrotate -f /etc/logrotate.conf

# Clean up deleted series in Prometheus (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Back up the Grafana dashboard index via the HTTP API
# (assumes a service account token in $GRAFANA_TOKEN; export individual
# dashboards from /api/dashboards/uid/<uid> as needed)
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:3000/api/search?type=dash-db" > grafana-dashboards-index.json

# Update the monitoring stack
docker-compose -f docker-compose.monitoring.yml pull
docker-compose -f docker-compose.monitoring.yml up -d
```
## Related Documentation
- [Performance Tuning](./performance.md)
- [Health Monitoring Guide](../health-monitoring-guide.md)
- [Backup Strategies](./backup.md)
- [Troubleshooting Guide](../troubleshooting.md)