# Monitoring and Observability Guide

## Overview

This guide covers setting up comprehensive monitoring for Readur, including metrics collection, log aggregation, alerting, and dashboard creation.

## Monitoring Stack Components

### Core Components

1. **Metrics Collection**: Prometheus + Node Exporter
2. **Visualization**: Grafana
3. **Log Aggregation**: Loki or ELK Stack
4. **Alerting**: AlertManager
5. **Application Monitoring**: Custom metrics and health checks
6. **Uptime Monitoring**: Uptime Kuma or Pingdom

## Health Monitoring

### Built-in Health Endpoints

```bash
# Basic health check
curl http://localhost:8000/health

# Detailed health status
curl http://localhost:8000/health/detailed

# Response format
{
  "status": "healthy",
  "database": "connected",
  "redis": "connected",
  "storage": "accessible",
  "ocr_queue": 45,
  "version": "2.5.4",
  "uptime": 345600
}
```
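
The `/health` endpoint also works as a container health check. A minimal sketch for the Readur service in Docker Compose, assuming `curl` is available inside the image:

```yaml
# Fragment for the readur service in the application's docker-compose.yml
services:
  readur:
    # ...existing configuration...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s
```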

### Custom Health Checks

```python
# health_checks.py
import time
from typing import Dict, Any

# `db`, `storage`, `celery`, and `redis` are assumed to be the application's
# existing database session, storage client, Celery app, and Redis client.

class HealthMonitor:
    @staticmethod
    def check_database() -> Dict[str, Any]:
        start = time.monotonic()
        try:
            db.session.execute("SELECT 1")
            return {"status": "healthy", "response_time": round(time.monotonic() - start, 4)}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_storage() -> Dict[str, Any]:
        try:
            # Check if storage is accessible
            storage.list_files(limit=1)
            return {"status": "healthy", "available_space": storage.get_free_space()}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}

    @staticmethod
    def check_ocr_workers() -> Dict[str, Any]:
        active = celery.control.inspect().active()
        return {
            "status": "healthy" if active else "degraded",
            "active_workers": len(active or {}),
            "queue_length": redis.llen("ocr_queue")
        }
```
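
These checks can back the `/health/detailed` endpoint shown above. A minimal Flask wiring sketch; the route handler name is hypothetical and `app` is assumed to be the application's Flask instance:

```python
# health_endpoint.py - hypothetical wiring for the detailed health route
from flask import jsonify

@app.route('/health/detailed')
def health_detailed():
    checks = {
        "database": HealthMonitor.check_database(),
        "storage": HealthMonitor.check_storage(),
        "ocr_workers": HealthMonitor.check_ocr_workers(),
    }
    healthy = all(c["status"] == "healthy" for c in checks.values())
    status_code = 200 if healthy else 503
    return jsonify({"status": "healthy" if healthy else "degraded", **checks}), status_code
```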

## Prometheus Setup

### Installation and Configuration

```yaml
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:latest
    container_name: postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://readur:password@postgres:5432/readur?sslmode=disable"
    ports:
      - "9187:9187"
    networks:
      - monitoring

  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    environment:
      REDIS_ADDR: "redis://redis:6379"
    ports:
      - "9121:9121"
    networks:
      - monitoring

networks:
  monitoring:
    external: true

volumes:
  prometheus_data:
```
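
The compose file declares the `monitoring` network as external, so create it once before starting the stack:

```bash
# One-time setup for the shared monitoring network
docker network create monitoring
```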

### Prometheus Configuration

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'readur-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - '/etc/prometheus/alerts/*.yml'

scrape_configs:
  - job_name: 'readur'
    static_configs:
      - targets: ['readur:8000']
    metrics_path: '/metrics'

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```
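
It is worth validating the configuration before (re)starting Prometheus. One way is to run `promtool` from the official image; the mount path assumes the directory layout above:

```bash
# Validate prometheus.yml without starting the server
docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus:/etc/prometheus" \
  prom/prometheus:latest check config /etc/prometheus/prometheus.yml
```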

## Grafana Dashboards

### Setup Grafana

```yaml
# Add to docker-compose.monitoring.yml (under services:);
# also declare grafana_data under the top-level volumes: key
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_SERVER_ROOT_URL=https://grafana.readur.company.com
      - GF_INSTALL_PLUGINS=redis-datasource
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    networks:
      - monitoring
```
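
Data sources can be provisioned from the mounted `./grafana/provisioning` directory instead of being added by hand. A sketch, assuming the Prometheus and Loki services defined elsewhere in this guide; the file path is illustrative:

```yaml
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```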

### Dashboard Configuration

Dashboards can be provisioned from the same mounted directory, for example `grafana/provisioning/dashboards/readur.json`:

```json
{
  "dashboard": {
    "title": "Readur Performance Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(readur_requests_total[5m])"
        }]
      },
      {
        "title": "Response Time",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "OCR Queue",
        "targets": [{
          "expr": "readur_ocr_queue_length"
        }]
      },
      {
        "title": "Database Connections",
        "targets": [{
          "expr": "pg_stat_database_numbackends{datname='readur'}"
        }]
      }
    ]
  }
}
```

## Application Metrics

### Custom Metrics Implementation

```python
# metrics.py
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST

# `app`, `redis`, `get_active_user_count`, and `get_document_count` are assumed
# to be the application's existing Flask app, Redis client, and helper functions.

# Define metrics
request_count = Counter('readur_requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('readur_request_duration_seconds', 'Request duration')
ocr_queue_length = Gauge('readur_ocr_queue_length', 'OCR queue length')
active_users = Gauge('readur_active_users', 'Active users in last 5 minutes')
document_count = Gauge('readur_documents_total', 'Total documents', ['status'])

# WSGI middleware to count and time requests
class MetricsMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '/')
        method = environ.get('REQUEST_METHOD', 'GET')

        with request_duration.time():
            request_count.labels(method=method, endpoint=path).inc()
            return self.app(environ, start_response)

# Metrics endpoint
@app.route('/metrics')
def metrics():
    # Update gauges on each scrape
    ocr_queue_length.set(redis.llen('ocr_queue'))
    active_users.set(get_active_user_count())
    document_count.labels(status='processed').set(get_document_count('processed'))

    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
```
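
To activate the middleware, wrap the Flask WSGI app at startup; a one-line sketch, assuming `app` is the Flask instance:

```python
# Wrap the WSGI app so every request is counted and timed
app.wsgi_app = MetricsMiddleware(app.wsgi_app)
```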

## Log Aggregation

### Loki Setup

```yaml
# Add to docker-compose.monitoring.yml (declare loki_data under volumes: as well)
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml
      - loki_data:/loki
    command: -config.file=/etc/loki/loki-config.yml
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro  # required for the Docker service discovery used below
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml
    command: -config.file=/etc/promtail/promtail-config.yml
    networks:
      - monitoring
```

### Log Configuration

```yaml
# promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: readur
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        filters:
          - name: label
            values: ["com.docker.compose.project=readur"]
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'logstream'
```
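
With Loki added as a Grafana data source, container logs can be queried in Explore using LogQL and the `container` label set by the relabeling above. The label value here is an assumption; match it to the container names in your Compose project:

```logql
{container=~"readur.*"} |= "ERROR"
```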

## Alerting

### AlertManager Configuration

```yaml
# alertmanager/config.yml
global:
  smtp_from: 'alertmanager@readur.company.com'
  smtp_smarthost: 'smtp.company.com:587'
  smtp_auth_username: 'alertmanager@readur.company.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-admins'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'team-admins'

receivers:
  - name: 'team-admins'
    email_configs:
      - to: 'admin-team@company.com'
        headers:
          Subject: 'Readur Alert: {{ .GroupLabels.alertname }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
```
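
The Prometheus configuration above points at `alertmanager:9093`, but the monitoring compose file does not yet define that service. A sketch to add alongside the other services:

```yaml
# Add to docker-compose.monitoring.yml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml
    command:
      - '--config.file=/etc/alertmanager/config.yml'
    ports:
      - "9093:9093"
    networks:
      - monitoring
```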

### Alert Rules

```yaml
# prometheus/alerts/readur.yml
groups:
  - name: readur
    rules:
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(readur_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.instance }}"
          description: "95th percentile response time is {{ $value }}s"

      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL database (or its exporter) is not responding"

      - alert: HighOCRQueue
        expr: readur_ocr_queue_length > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OCR queue backlog"
          description: "OCR queue has {{ $value }} pending items"

      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"
```
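
Rule files can be validated the same way as the main configuration, again using `promtool` from the image:

```bash
# Validate the alert rules before reloading Prometheus
docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus:/etc/prometheus" \
  prom/prometheus:latest check rules /etc/prometheus/alerts/readur.yml
```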

## Performance Monitoring

### APM Integration

```python
# apm_config.py
from elasticapm import Client
from elasticapm.contrib.flask import ElasticAPM

# Configure the APM client
apm_client = Client({
    'SERVICE_NAME': 'readur',
    'SERVER_URL': 'http://apm-server:8200',
    'ENVIRONMENT': 'production',
    'SECRET_TOKEN': 'your-secret-token',
})

# Instrument the Flask app (`app` is the application's Flask instance)
apm = ElasticAPM(app, client=apm_client)
```

### Custom Performance Metrics

```python
# performance_metrics.py
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

# `metrics` is assumed to be an application-level helper that records operation timings.

@contextmanager
def track_performance(operation_name):
    start_time = time.time()
    try:
        yield
    finally:
        duration = time.time() - start_time
        metrics.record_operation_time(operation_name, duration)

        if duration > 1.0:  # Log slow operations
            logger.warning(f"Slow operation: {operation_name} took {duration:.2f}s")

# Usage
with track_performance("document_processing"):
    process_document(doc_id)
```

## Uptime Monitoring

### External Monitoring

```yaml
# uptime-kuma/docker-compose.yml
version: '3.8'

services:
  uptime-kuma:
    image: louislam/uptime-kuma:latest
    container_name: uptime-kuma
    volumes:
      - uptime-kuma_data:/app/data
    ports:
      - "3001:3001"
    restart: unless-stopped

volumes:
  uptime-kuma_data:
```

### Status Page Configuration

```nginx
# Public status page
server {
    listen 443 ssl;
    server_name status.readur.company.com;

    # ssl_certificate / ssl_certificate_key directives go here

    location / {
        proxy_pass http://localhost:3001;
        proxy_set_header Host $host;

        # Uptime Kuma uses WebSockets; the upgrade headers are required
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

## Dashboard Examples

### Key Metrics Dashboard

```sql
-- Query for document processing stats
SELECT
    DATE(created_at) as date,
    COUNT(*) as documents_processed,
    AVG(processing_time) as avg_processing_time,
    MAX(processing_time) as max_processing_time
FROM documents
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```

### Real-time Monitoring

```javascript
// WebSocket monitoring dashboard
const ws = new WebSocket('wss://readur.company.com/ws/metrics');

ws.onmessage = (event) => {
  const metrics = JSON.parse(event.data);
  updateDashboard({
    activeUsers: metrics.active_users,
    queueLength: metrics.queue_length,
    responseTime: metrics.response_time,
    errorRate: metrics.error_rate
  });
};
```

## Troubleshooting Monitoring Issues

### Prometheus Not Scraping

```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Verify metrics endpoint
curl http://localhost:8000/metrics

# Check network connectivity
docker network inspect monitoring
```
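
The targets API returns a large JSON document; if `jq` is installed, a quick way to see just job names and health:

```bash
# Summarize target health (requires jq)
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
```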

### Missing Metrics

```bash
# Debug metric collection: list the collectors registered with the client
docker-compose exec readur python -c "
from prometheus_client import REGISTRY
for collector in REGISTRY._collector_to_names:
    print(collector)
"
```

### High Memory Usage

```bash
# Check Prometheus storage usage (the TSDB path is /prometheus in the compose file)
docker-compose exec prometheus du -sh /prometheus

# Analyze which metrics use the most space
docker-compose exec prometheus promtool tsdb analyze /prometheus

# To reduce disk usage, lower --storage.tsdb.retention.time (set to 30d above)
# and recreate the container; Prometheus removes expired blocks automatically
```

## Best Practices

### Monitoring Strategy

1. **Start Simple**: Begin with basic health checks and expand
2. **Alert Fatigue**: Only alert on actionable issues
3. **SLI/SLO Definition**: Define and track service level indicators (see the example recording rule after this list)
4. **Dashboard Organization**: Create role-specific dashboards
5. **Log Retention**: Balance storage costs with debugging needs
6. **Security**: Protect monitoring endpoints and dashboards
7. **Documentation**: Document alert runbooks and response procedures
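
As a starting point for a latency SLI, a recording rule can track the fraction of requests served within one second, using the `readur_request_duration_seconds` histogram defined earlier; the file path is illustrative and assumes the default client bucket boundaries:

```yaml
# prometheus/alerts/slo.yml - share of requests completed within 1s over 5m
groups:
  - name: readur-slo
    rules:
      - record: readur:requests_under_1s:ratio_rate5m
        expr: |
          sum(rate(readur_request_duration_seconds_bucket{le="1.0"}[5m]))
          /
          sum(rate(readur_request_duration_seconds_count[5m]))
```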

### Maintenance

```bash
#!/bin/bash
# Weekly maintenance tasks

# Rotate logs
docker-compose exec readur logrotate -f /etc/logrotate.conf

# Clean up tombstoned metric data
# (requires Prometheus to be started with --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Back up Grafana dashboards and settings by archiving the grafana_data volume
docker run --rm --volumes-from grafana -v "$(pwd)/backups:/backup" alpine \
  tar czf "/backup/grafana-$(date +%F).tar.gz" /var/lib/grafana

# Update monitoring stack
docker-compose -f docker-compose.monitoring.yml pull
docker-compose -f docker-compose.monitoring.yml up -d
```

## Related Documentation

- [Performance Tuning](./performance.md)
- [Health Monitoring Guide](../health-monitoring-guide.md)
- [Backup Strategies](./backup.md)
- [Troubleshooting Guide](../troubleshooting.md)