Readur/docs/getting-started/configuration.md

517 lines
10 KiB
Markdown

# Configuration Guide
Configure Readur for your specific needs and optimize for your workload.
## Configuration Overview
Readur uses environment variables for configuration, making it easy to deploy in containerized environments. Configuration can be set through:
1. **Environment variables** - Direct system environment
2. **`.env` file** - Docker Compose automatically loads this
3. **`docker-compose.yml`** - Directly in the compose file
4. **Kubernetes ConfigMaps** - For K8s deployments
## Essential Configuration
### Security Settings
These MUST be changed from defaults in production:
```bash
# Generate secure secrets
JWT_SECRET=$(openssl rand -base64 32)
DB_PASSWORD=$(openssl rand -base64 32)
# CRITICAL: Always change JWT_SECRET from default!
# Default values are insecure and should never be used in production
# Set admin password
ADMIN_PASSWORD=your_secure_password_here
# Enable HTTPS (reverse proxy recommended)
FORCE_HTTPS=true
SECURE_COOKIES=true
# WARNING: Only disable SSL verification for development/testing
# S3_VERIFY_SSL=false # NEVER use in production
```
### Database Configuration
```bash
# PostgreSQL connection
DATABASE_URL=postgresql://readur:${DB_PASSWORD}@postgres:5432/readur
# WARNING: Never include passwords directly in DATABASE_URL in config files
# Connection pool settings
DB_POOL_SIZE=20
DB_MAX_OVERFLOW=40
DB_POOL_TIMEOUT=30
# PostgreSQL specific optimizations
POSTGRES_SHARED_BUFFERS=256MB
POSTGRES_EFFECTIVE_CACHE_SIZE=1GB
```
### Storage Configuration
#### Local Storage (Default)
```bash
# File storage paths
UPLOAD_PATH=/app/uploads
TEMP_PATH=/app/temp
# Size limits
MAX_FILE_SIZE_MB=50
TOTAL_STORAGE_LIMIT_GB=100
# File types
ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,rtf,doc,docx
```
#### S3 Storage (Scalable)
```bash
# Enable S3 backend
STORAGE_BACKEND=s3
S3_ENABLED=true
# AWS S3
S3_BUCKET_NAME=readur-documents
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=your_access_key
S3_SECRET_ACCESS_KEY=your_secret_key
# Or S3-compatible (MinIO, Wasabi, etc.)
S3_ENDPOINT=https://s3.example.com
S3_PATH_STYLE=true # For MinIO
```
## OCR Configuration
### Language Settings
```bash
# Single language (fastest)
OCR_LANGUAGE=eng
# Multiple languages
OCR_LANGUAGE=eng+deu+fra+spa
# Available languages (partial list):
# eng - English
# deu - German (Deutsch)
# fra - French (Français)
# spa - Spanish (Español)
# ita - Italian (Italiano)
# por - Portuguese
# rus - Russian
# chi_sim - Chinese Simplified
# jpn - Japanese
# ara - Arabic
```
### Performance Tuning
```bash
# Concurrent processing
CONCURRENT_OCR_JOBS=3 # OCR runtime uses 3 threads
OCR_WORKER_THREADS=2 # Background runtime uses 2 threads
# Note: Database runtime also uses 2 threads
# Timeouts and limits
OCR_TIMEOUT_SECONDS=300
OCR_MAX_PAGES=500
MAX_FILE_SIZE_MB=100
# Memory management
OCR_MEMORY_LIMIT_MB=512 # Per job
ENABLE_MEMORY_PROFILING=false
# Processing options
OCR_DPI=300 # Higher = better quality, slower
ENABLE_PREPROCESSING=true
ENABLE_AUTO_ROTATION=true
ENABLE_DESKEW=true
```
### Quality vs Speed
#### High Quality (Slow)
```bash
OCR_QUALITY_PRESET=high
OCR_DPI=300
ENABLE_PREPROCESSING=true
ENABLE_DESKEW=true
ENABLE_AUTO_ROTATION=true
OCR_ENGINE_MODE=3 # LSTM only
```
#### Balanced (Default)
```bash
OCR_QUALITY_PRESET=balanced
OCR_DPI=200
ENABLE_PREPROCESSING=true
ENABLE_DESKEW=false
ENABLE_AUTO_ROTATION=true
OCR_ENGINE_MODE=2 # LSTM + Legacy
```
#### Fast (Lower Quality)
```bash
OCR_QUALITY_PRESET=fast
OCR_DPI=150
ENABLE_PREPROCESSING=false
ENABLE_DESKEW=false
ENABLE_AUTO_ROTATION=false
OCR_ENGINE_MODE=0 # Legacy only
```
## Source Synchronization
### Watch Folders
```bash
# Global watch folder
WATCH_FOLDER=/app/watch
WATCH_INTERVAL_SECONDS=60
FILE_STABILITY_CHECK_MS=2000
# Per-user watch folders
ENABLE_PER_USER_WATCH=true
USER_WATCH_BASE_DIR=/app/user_watch
# Processing rules
WATCH_PROCESS_HIDDEN_FILES=false
WATCH_RECURSIVE=true
WATCH_MAX_DEPTH=5
DELETE_AFTER_IMPORT=false
```
### WebDAV Sources
```bash
# Default WebDAV settings
WEBDAV_TIMEOUT_SECONDS=30
WEBDAV_MAX_RETRIES=3
WEBDAV_CHUNK_SIZE_MB=10
WEBDAV_VERIFY_SSL=true
```
### S3 Sources
```bash
# S3 sync settings
S3_SYNC_INTERVAL_MINUTES=30
S3_BATCH_SIZE=100
S3_MULTIPART_THRESHOLD_MB=100
S3_CONCURRENT_DOWNLOADS=4
```
## Authentication & Security
### Local Authentication
```bash
# Password policy
PASSWORD_MIN_LENGTH=12
PASSWORD_REQUIRE_UPPERCASE=true
PASSWORD_REQUIRE_NUMBERS=true
PASSWORD_REQUIRE_SPECIAL=true
# Session management
SESSION_TIMEOUT_MINUTES=60
REMEMBER_ME_DURATION_DAYS=30
MAX_LOGIN_ATTEMPTS=5
LOCKOUT_DURATION_MINUTES=15
```
### OIDC/SSO Configuration
```bash
# Enable OIDC
OIDC_ENABLED=true
# Provider configuration
OIDC_ISSUER=https://login.microsoftonline.com/tenant-id/v2.0
OIDC_CLIENT_ID=your-client-id
OIDC_CLIENT_SECRET=your-client-secret
OIDC_REDIRECT_URI=https://readur.example.com/auth/callback
# Optional settings
OIDC_SCOPE=openid profile email
OIDC_USER_CLAIM=email
OIDC_GROUPS_CLAIM=groups
OIDC_ADMIN_GROUP=readur-admins
# Auto-provisioning
OIDC_AUTO_CREATE_USERS=true
OIDC_DEFAULT_ROLE=user
```
## Search Configuration
### Search Engine
```bash
# PostgreSQL Full-Text Search settings
SEARCH_LANGUAGE=english
SEARCH_RANKING_NORMALIZATION=32
ENABLE_PHRASE_SEARCH=true
ENABLE_FUZZY_SEARCH=true
FUZZY_SEARCH_DISTANCE=2
# Search results
SEARCH_RESULTS_PER_PAGE=20
SEARCH_SNIPPET_LENGTH=200
SEARCH_HIGHLIGHT_TAG=mark
```
### Search Performance
```bash
# Index management
AUTO_REINDEX=true
REINDEX_SCHEDULE=0 3 * * * # 3 AM daily
SEARCH_CACHE_TTL_SECONDS=300
SEARCH_CACHE_SIZE_MB=100
# Query optimization
MAX_SEARCH_TERMS=10
ENABLE_SEARCH_SUGGESTIONS=true
SUGGESTION_MIN_LENGTH=3
```
## Monitoring & Logging
### Logging Configuration
```bash
# Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_LEVEL=INFO
LOG_FORMAT=json # or text
# Log outputs
LOG_TO_FILE=true
LOG_FILE_PATH=/app/logs/readur.log
LOG_FILE_MAX_SIZE_MB=100
LOG_FILE_BACKUP_COUNT=10
# Detailed logging
LOG_SQL_QUERIES=false
LOG_HTTP_REQUESTS=true
LOG_OCR_DETAILS=false
```
### Health Monitoring
```bash
# Health check endpoints
HEALTH_CHECK_ENABLED=true
HEALTH_CHECK_PATH=/health
METRICS_ENABLED=true
METRICS_PATH=/metrics
# Alerting thresholds
ALERT_QUEUE_SIZE=100
ALERT_OCR_FAILURE_RATE=0.1
ALERT_DISK_USAGE_PERCENT=80
ALERT_MEMORY_USAGE_PERCENT=90
```
## Performance Optimization
### System Resources
```bash
# Memory limits
MEMORY_LIMIT_MB=2048
MEMORY_SOFT_LIMIT_MB=1536
# CPU settings
CPU_CORES=4
WORKER_PROCESSES=auto # or specific number
WORKER_THREADS=2
# Connection limits
MAX_CONNECTIONS=100
CONNECTION_TIMEOUT=30
```
### Caching
```bash
# Enable caching layers
ENABLE_CACHE=true
CACHE_TYPE=redis # or memory
# Redis cache (if used)
REDIS_URL=redis://redis:6379/0
REDIS_MAX_CONNECTIONS=50
# Cache TTLs
DOCUMENT_CACHE_TTL=3600
SEARCH_CACHE_TTL=300
USER_CACHE_TTL=1800
```
### Queue Management
```bash
# Background job processing
QUEUE_TYPE=database # or redis
MAX_QUEUE_SIZE=1000
QUEUE_POLL_INTERVAL=5
# Job priorities
OCR_JOB_PRIORITY=5
SYNC_JOB_PRIORITY=3
CLEANUP_JOB_PRIORITY=1
# Retry configuration
MAX_JOB_RETRIES=3
RETRY_DELAY_SECONDS=60
EXPONENTIAL_BACKOFF=true
```
## Environment-Specific Configurations
### Development
```bash
# .env.development
DEBUG=true
LOG_LEVEL=DEBUG
RELOAD_ON_CHANGE=true
CONCURRENT_OCR_JOBS=1
DISABLE_RATE_LIMITING=true
```
### Staging
```bash
# .env.staging
DEBUG=false
LOG_LEVEL=INFO
CONCURRENT_OCR_JOBS=2
ENABLE_PROFILING=true
MOCK_EXTERNAL_SERVICES=true
```
### Production
```bash
# .env.production
DEBUG=false
LOG_LEVEL=WARNING
CONCURRENT_OCR_JOBS=8
ENABLE_RATE_LIMITING=true
SECURE_COOKIES=true
FORCE_HTTPS=true
```
## Configuration Validation
### Check Configuration
```bash
# Validate current configuration
docker exec readur python validate_config.py
# Test specific settings
docker exec readur python -c "
from config import settings
print(f'OCR Languages: {settings.OCR_LANGUAGE}')
print(f'Storage Backend: {settings.STORAGE_BACKEND}')
print(f'Max File Size: {settings.MAX_FILE_SIZE_MB}MB')
"
```
### Common Validation Errors
```bash
# Missing required S3 credentials
ERROR: S3_ENABLED=true but S3_BUCKET_NAME not set
# Invalid language code
ERROR: OCR_LANGUAGE 'xyz' not supported
# Insufficient resources
WARNING: CONCURRENT_OCR_JOBS=8 but only 2 CPU cores available
```
## Configuration Best Practices
### Security
1. **Never commit secrets** - Use `.env` files and add to `.gitignore`
2. **Change JWT_SECRET immediately** - Never use default values
3. **Rotate secrets regularly** - Especially JWT_SECRET and API keys
4. **Use strong passwords** - Minimum 16 characters for admin
5. **Enable HTTPS** - Always in production
6. **Restrict file types** - Only allow necessary formats
7. **Never expose secrets in command lines** - They appear in process lists
8. **Always verify SSL certificates** - Only disable for local development
### Performance
1. **Match workers to cores** - CONCURRENT_OCR_JOBS ≤ CPU cores
2. **Monitor memory usage** - Adjust limits based on usage
3. **Use S3 for scale** - Local storage limited by disk
4. **Enable caching** - Reduces database load
5. **Tune PostgreSQL** - Adjust shared_buffers and work_mem
### Reliability
1. **Set reasonable timeouts** - Prevent hanging jobs
2. **Configure retries** - Handle transient failures
3. **Enable health checks** - For load balancer integration
4. **Set up logging** - Essential for troubleshooting
5. **Regular backups** - Automate database backups
## Configuration Examples
### Small Office (5-10 users)
```bash
# Minimal resources, local storage
CONCURRENT_OCR_JOBS=2
MEMORY_LIMIT_MB=1024
STORAGE_BACKEND=local
MAX_FILE_SIZE_MB=20
SEARCH_CACHE_TTL=600
```
### Medium Business (50-100 users)
```bash
# Balanced performance, S3 storage
CONCURRENT_OCR_JOBS=4
MEMORY_LIMIT_MB=4096
STORAGE_BACKEND=s3
MAX_FILE_SIZE_MB=50
ENABLE_CACHE=true
CACHE_TYPE=redis
```
### Enterprise (500+ users)
```bash
# High performance, full features
CONCURRENT_OCR_JOBS=16
MEMORY_LIMIT_MB=16384
STORAGE_BACKEND=s3
MAX_FILE_SIZE_MB=100
ENABLE_CACHE=true
CACHE_TYPE=redis
QUEUE_TYPE=redis
OIDC_ENABLED=true
```
## Next Steps
- [Installation Guide](installation.md) - Deploy Readur
- [User Guide](../user-guide.md) - Learn the interface
- [API Reference](../api-reference.md) - Integrate with Readur
- [Deployment Guide](../deployment.md) - Production setup