diff --git a/README.md b/README.md index 5259840..0d5d12e 100644 --- a/README.md +++ b/README.md @@ -16,10 +16,6 @@ A powerful, modern document management system built with Rust and React. Readur ## 🚀 Quick Start -### Using Docker Compose (Recommended) - -The fastest way to get Readur running: - ```bash # Clone the repository git clone https://github.com/perfectra1n/readur @@ -38,278 +34,26 @@ open http://localhost:8000 > ⚠️ **Important**: Change the default admin password immediately after first login! -### What You Get +## 📚 Documentation -After deployment, you'll have: -- **Web Interface**: Modern document management UI at `http://localhost:8000` -- **PostgreSQL Database**: Document metadata and full-text search indexes -- **File Storage**: Persistent document storage with OCR processing -- **Watch Folder**: Automatic file ingestion from mounted directories -- **REST API**: Full API access for integrations +### Getting Started +- [📦 Installation Guide](docs/installation.md) - Docker & manual installation instructions +- [🔧 Configuration](docs/configuration.md) - Environment variables and settings +- [📖 User Guide](docs/user-guide.md) - How to use Readur effectively -## 🐳 Docker Deployment Guide +### Deployment & Operations +- [🚀 Deployment Guide](docs/deployment.md) - Production deployment, SSL, monitoring +- [🔄 Reverse Proxy Setup](docs/REVERSE_PROXY.md) - Nginx, Traefik, and more +- [📁 Watch Folder Guide](docs/WATCH_FOLDER.md) - Automatic document ingestion -### Production Docker Compose +### Development +- [🏗️ Developer Documentation](docs/dev/) - Architecture, development setup, testing +- [🔌 API Reference](docs/api-reference.md) - REST API documentation -For production deployments, create a custom `docker-compose.prod.yml`: - -```yaml -services: - readur: - image: readur:latest - ports: - - "8000:8000" - environment: - # Core Configuration - - DATABASE_URL=postgresql://readur:${DB_PASSWORD}@postgres:5432/readur - 
JWT_SECRET=${JWT_SECRET} - - SERVER_ADDRESS=0.0.0.0:8000 - - # File Storage - - UPLOAD_PATH=/app/uploads - - WATCH_FOLDER=/app/watch - - ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,doc,docx - - # Watch Folder Settings - - WATCH_INTERVAL_SECONDS=30 - - FILE_STABILITY_CHECK_MS=500 - - MAX_FILE_AGE_HOURS=168 - - # OCR Configuration - - OCR_LANGUAGE=eng - - CONCURRENT_OCR_JOBS=4 - - OCR_TIMEOUT_SECONDS=300 - - MAX_FILE_SIZE_MB=100 - - # Performance Tuning - - MEMORY_LIMIT_MB=1024 - - CPU_PRIORITY=normal - - ENABLE_COMPRESSION=true - - volumes: - # Document storage - - ./data/uploads:/app/uploads - - # Watch folder - mount your network drives here - - /mnt/nfs/documents:/app/watch - # or SMB: - /mnt/smb/shared:/app/watch - # or S3: - /mnt/s3/bucket:/app/watch - - depends_on: - - postgres - restart: unless-stopped - - # Resource limits for production - deploy: - resources: - limits: - memory: 2G - cpus: '2.0' - reservations: - memory: 512M - cpus: '0.5' - - postgres: - image: postgres:15 - environment: - - POSTGRES_USER=readur - - POSTGRES_PASSWORD=${DB_PASSWORD} - - POSTGRES_DB=readur - - POSTGRES_INITDB_ARGS=--encoding=UTF-8 --lc-collate=en_US.UTF-8 --lc-ctype=en_US.UTF-8 - - volumes: - - postgres_data:/var/lib/postgresql/data - - ./postgres-config:/etc/postgresql/conf.d:ro - - # PostgreSQL optimization for document search - command: > - postgres - -c shared_buffers=256MB - -c effective_cache_size=1GB - -c max_connections=100 - -c default_text_search_config=pg_catalog.english - - restart: unless-stopped - - # Don't expose port in production - # ports: - # - "5433:5432" - -volumes: - postgres_data: - driver: local -``` - -### Environment Variables - -#### Port Configuration - -Readur supports flexible port configuration: - -```bash -# Method 1: Specify full server address -SERVER_ADDRESS=0.0.0.0:8000 - -# Method 2: Use separate host and port (recommended) -SERVER_HOST=0.0.0.0 -SERVER_PORT=8000 - -# For development: Configure frontend port -CLIENT_PORT=5173 
-BACKEND_PORT=8000 -``` - -#### Security Configuration - -Create a `.env` file for your secrets: - -```bash -# Generate secure secrets -JWT_SECRET=$(openssl rand -base64 64) -DB_PASSWORD=$(openssl rand -base64 32) - -# Save to .env file -cat > .env << EOF -JWT_SECRET=${JWT_SECRET} -DB_PASSWORD=${DB_PASSWORD} -EOF -``` - -Deploy with: -```bash -docker compose -f docker-compose.prod.yml --env-file .env up -d -``` - -### Network Filesystem Mounts - -#### NFS Mounts -```bash -# Mount NFS share -sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs/documents - -# Add to docker-compose.yml -volumes: - - /mnt/nfs/documents:/app/watch -environment: - - WATCH_INTERVAL_SECONDS=60 - - FILE_STABILITY_CHECK_MS=1000 - - FORCE_POLLING_WATCH=1 -``` - -#### SMB/CIFS Mounts -```bash -# Mount SMB share -sudo mount -t cifs //server/share /mnt/smb/shared -o username=user,password=pass - -# Docker volume configuration -volumes: - - /mnt/smb/shared:/app/watch -environment: - - WATCH_INTERVAL_SECONDS=30 - - FILE_STABILITY_CHECK_MS=2000 -``` - -#### S3 Mounts (using s3fs) -```bash -# Mount S3 bucket -s3fs mybucket /mnt/s3/bucket -o passwd_file=~/.passwd-s3fs - -# Docker configuration for S3 -volumes: - - /mnt/s3/bucket:/app/watch -environment: - - WATCH_INTERVAL_SECONDS=120 - - FILE_STABILITY_CHECK_MS=5000 - - FORCE_POLLING_WATCH=1 -``` - -### SSL/HTTPS Setup - -Use a reverse proxy like Nginx or Traefik: - -#### Nginx Configuration -```nginx -server { - listen 443 ssl http2; - server_name readur.yourdomain.com; - - ssl_certificate /path/to/cert.pem; - ssl_certificate_key /path/to/key.pem; - - location / { - proxy_pass http://localhost:8000; - proxy_set_header Host $host; - proxy_set_header X-Real-IP $remote_addr; - proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; - proxy_set_header X-Forwarded-Proto $scheme; - - # For file uploads - client_max_body_size 100M; - proxy_read_timeout 300s; - proxy_send_timeout 300s; - } -} -``` - -#### Traefik Configuration -```yaml -services: - 
readur: - labels: - - "traefik.enable=true" - - "traefik.http.routers.readur.rule=Host(`readur.yourdomain.com`)" - - "traefik.http.routers.readur.tls=true" - - "traefik.http.routers.readur.tls.certresolver=letsencrypt" -``` - -> 📘 **For detailed reverse proxy configurations** including Apache, Caddy, custom ports, load balancing, and advanced scenarios, see [REVERSE_PROXY.md](./REVERSE_PROXY.md). - -### Health Checks - -Add health checks to your Docker configuration: - -```yaml -services: - readur: - healthcheck: - test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"] - interval: 30s - timeout: 10s - retries: 3 - start_period: 40s -``` - -### Backup Strategy - -```bash -#!/bin/bash -# backup.sh - Automated backup script - -# Backup database -docker exec readur-postgres-1 pg_dump -U readur readur | gzip > backup_$(date +%Y%m%d_%H%M%S).sql.gz - -# Backup uploaded files -tar -czf uploads_backup_$(date +%Y%m%d_%H%M%S).tar.gz -C ./data uploads/ - -# Clean old backups (keep 30 days) -find . -name "backup_*.sql.gz" -mtime +30 -delete -find . 
-name "uploads_backup_*.tar.gz" -mtime +30 -delete -``` - -### Monitoring - -Monitor your deployment with Docker stats: - -```bash -# Real-time resource usage -docker stats - -# Container logs -docker compose logs -f readur - -# Watch folder activity -docker compose logs -f readur | grep watcher -``` +### Advanced Topics +- [🔍 OCR Optimization](docs/dev/OCR_OPTIMIZATION_GUIDE.md) - Improve OCR performance +- [🗄️ Database Best Practices](docs/dev/DATABASE_GUARDRAILS.md) - Concurrency and safety +- [📊 Queue Architecture](docs/dev/QUEUE_IMPROVEMENTS.md) - Background job processing ## 🏗️ Architecture @@ -327,495 +71,24 @@ docker compose logs -f readur | grep watcher ## 📋 System Requirements -### Minimum Requirements -- **CPU**: 2 cores -- **RAM**: 2GB -- **Storage**: 10GB free space -- **OS**: Linux, macOS, or Windows with Docker +### Minimum +- 2 CPU cores, 2GB RAM, 10GB storage +- Docker or manual installation prerequisites ### Recommended for Production -- **CPU**: 4+ cores -- **RAM**: 4GB+ -- **Storage**: 50GB+ SSD -- **Network**: Stable internet connection for OCR processing - -## 🛠️ Manual Installation - -For development or custom deployments without Docker: - -### Prerequisites - -Install these dependencies on your system: - -```bash -# Ubuntu/Debian -sudo apt-get update -sudo apt-get install -y \ - tesseract-ocr tesseract-ocr-eng \ - libtesseract-dev libleptonica-dev \ - postgresql postgresql-contrib \ - pkg-config libclang-dev - -# macOS (requires Homebrew) -brew install tesseract leptonica postgresql rust nodejs npm - -# Install Rust (if not already installed) -curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -``` - -### Backend Setup - -1. **Configure Database**: -```bash -# Create database and user -sudo -u postgres psql -CREATE DATABASE readur; -CREATE USER readur_user WITH ENCRYPTED PASSWORD 'your_password'; -GRANT ALL PRIVILEGES ON DATABASE readur TO readur_user; -\q -``` - -2. 
**Environment Configuration**: -```bash -# Copy environment template -cp .env.example .env - -# Edit configuration -nano .env -``` - -Required environment variables: -```env -DATABASE_URL=postgresql://readur_user:your_password@localhost/readur -JWT_SECRET=your-super-secret-jwt-key-change-this -SERVER_ADDRESS=0.0.0.0:8000 -UPLOAD_PATH=./uploads -WATCH_FOLDER=./watch -ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,gif,bmp,tiff,txt,rtf,doc,docx -``` - -3. **Build and Run Backend**: -```bash -# Install dependencies and run -cargo build --release -cargo run -``` - -### Frontend Setup - -1. **Install Dependencies**: -```bash -cd frontend -npm install -``` - -2. **Development Mode**: -```bash -npm run dev -# Frontend available at http://localhost:5173 -``` - -3. **Production Build**: -```bash -npm run build -# Built files in frontend/dist/ -``` - -## 📖 User Guide - -### Getting Started - -1. **First Login**: Use the default admin credentials to access the system -2. **Upload Documents**: Drag and drop files or use the upload button -3. **Wait for Processing**: OCR processing happens automatically in the background -4. 
**Search and Organize**: Use the powerful search features to find your documents - -### Supported File Types - -| Type | Extensions | OCR Support | Notes | -|------|-----------|-------------|-------| -| **PDF** | `.pdf` | ✅ | Text extraction + OCR for scanned pages | -| **Images** | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.gif` | ✅ | Full OCR text extraction | -| **Text** | `.txt`, `.rtf` | ❌ | Direct text indexing | -| **Office** | `.doc`, `.docx` | ⚠️ | Limited support | - -### Using the Interface - -#### Dashboard -- **Document Statistics**: Total documents, storage usage, OCR status -- **Recent Activity**: Latest uploads and processing status -- **Quick Actions**: Fast access to upload and search - -#### Document Management -- **List/Grid View**: Toggle between different viewing modes -- **Sorting**: Sort by date, name, size, or file type -- **Filtering**: Filter by tags, file types, and OCR status -- **Bulk Actions**: Select multiple documents for batch operations - -#### Advanced Search -- **Full-text Search**: Search within document content -- **Metadata Filters**: Filter by upload date, file size, type -- **Tag System**: Organize documents with custom tags -- **OCR Status**: Find processed vs. pending documents - -#### Folder Watching -- **Non-destructive**: Unlike paperless-ngx, source files remain untouched -- **Automatic Processing**: New files are detected and processed automatically -- **Configurable**: Set custom watch directories - -### Tips for Best Results - -1. **OCR Quality**: Higher resolution images (300+ DPI) produce better OCR results -2. **File Organization**: Use consistent naming conventions for easier searching -3. **Regular Backups**: Backup both database and file storage regularly -4. 
**Performance**: For large document collections, consider increasing server resources - -## 🔧 Configuration - -### Environment Variables - -All application settings can be configured via environment variables: - -#### Core Configuration -| Variable | Default | Description | -|----------|---------|-------------| -| `DATABASE_URL` | `postgresql://readur:readur@localhost/readur` | PostgreSQL connection string | -| `JWT_SECRET` | `your-secret-key` | Secret key for JWT tokens ⚠️ **Change in production!** | -| `SERVER_ADDRESS` | `0.0.0.0:8000` | Server bind address and port | - -#### File Storage & Upload -| Variable | Default | Description | -|----------|---------|-------------| -| `UPLOAD_PATH` | `./uploads` | Document storage directory | -| `ALLOWED_FILE_TYPES` | `pdf,txt,doc,docx,png,jpg,jpeg` | Comma-separated allowed file extensions | - -#### Watch Folder Configuration -| Variable | Default | Description | -|----------|---------|-------------| -| `WATCH_FOLDER` | `./watch` | Directory to monitor for new files | -| `WATCH_INTERVAL_SECONDS` | `30` | Polling interval for network filesystems (seconds) | -| `FILE_STABILITY_CHECK_MS` | `500` | Time to wait for file write completion (milliseconds) | -| `MAX_FILE_AGE_HOURS` | _(none)_ | Skip files older than this many hours | -| `FORCE_POLLING_WATCH` | _(none)_ | Force polling mode even for local filesystems | - -#### OCR & Processing Settings -*Note: These settings can also be configured per-user via the web interface* - -| Variable | Default | Description | -|----------|---------|-------------| -| `OCR_LANGUAGE` | `eng` | OCR language code (eng, fra, deu, spa, etc.) 
| -| `CONCURRENT_OCR_JOBS` | `4` | Maximum parallel OCR processes | -| `OCR_TIMEOUT_SECONDS` | `300` | OCR processing timeout per file | -| `MAX_FILE_SIZE_MB` | `50` | Maximum file size for processing | -| `AUTO_ROTATE_IMAGES` | `true` | Automatically rotate images for better OCR | -| `ENABLE_IMAGE_PREPROCESSING` | `true` | Apply image enhancement before OCR | - -#### Search & Performance -| Variable | Default | Description | -|----------|---------|-------------| -| `SEARCH_RESULTS_PER_PAGE` | `25` | Default number of search results per page | -| `SEARCH_SNIPPET_LENGTH` | `200` | Length of text snippets in search results | -| `FUZZY_SEARCH_THRESHOLD` | `0.8` | Similarity threshold for fuzzy search (0.0-1.0) | -| `MEMORY_LIMIT_MB` | `512` | Memory limit for OCR processes | -| `CPU_PRIORITY` | `normal` | CPU priority: `low`, `normal`, `high` | - -#### Data Management -| Variable | Default | Description | -|----------|---------|-------------| -| `RETENTION_DAYS` | _(none)_ | Auto-delete documents after N days | -| `ENABLE_AUTO_CLEANUP` | `false` | Enable automatic cleanup of old documents | -| `ENABLE_COMPRESSION` | `false` | Compress stored documents to save space | -| `ENABLE_BACKGROUND_OCR` | `true` | Process OCR in background queue | - -### Example Production Configuration - -```env -# Core settings -DATABASE_URL=postgresql://readur:secure_password@postgres:5432/readur -JWT_SECRET=your-very-long-random-secret-key-generated-with-openssl -SERVER_ADDRESS=0.0.0.0:8000 - -# File handling -UPLOAD_PATH=/app/uploads -ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,rtf,doc,docx - -# Watch folder for NFS mount -WATCH_FOLDER=/mnt/nfs/documents -WATCH_INTERVAL_SECONDS=60 -FILE_STABILITY_CHECK_MS=1000 -MAX_FILE_AGE_HOURS=168 -FORCE_POLLING_WATCH=1 - -# OCR optimization -OCR_LANGUAGE=eng -CONCURRENT_OCR_JOBS=8 -OCR_TIMEOUT_SECONDS=600 -MAX_FILE_SIZE_MB=200 -AUTO_ROTATE_IMAGES=true -ENABLE_IMAGE_PREPROCESSING=true - -# Performance tuning -MEMORY_LIMIT_MB=2048 
-CPU_PRIORITY=high -ENABLE_COMPRESSION=true -ENABLE_BACKGROUND_OCR=true - -# Search optimization -SEARCH_RESULTS_PER_PAGE=50 -SEARCH_SNIPPET_LENGTH=300 -FUZZY_SEARCH_THRESHOLD=0.7 - -# Data management -RETENTION_DAYS=2555 # 7 years -ENABLE_AUTO_CLEANUP=true -``` - -### Runtime Settings vs Environment Variables - -Some settings can be configured in two ways: - -1. **Environment Variables**: Set at container startup, affects the entire application -2. **User Settings**: Configured per-user via the web interface, stored in database - -**Environment variables take precedence** and provide system-wide defaults. User settings override these defaults for individual users where applicable. - -Settings configurable via web interface: -- OCR language preferences -- Search result limits -- File type restrictions -- OCR processing options -- Data retention policies - -### Configuration Priority - -Settings are applied in this order (later values override earlier ones): - -1. **Application defaults** (built into the code) -2. **Environment variables** (system-wide configuration) -3. **User settings** (per-user database settings via web interface) - -This allows for flexible deployment where system administrators can set defaults while users can customize their experience. 
- -### Quick Reference - Essential Variables - -For a minimal production deployment, configure these essential variables: - -```bash -# Security (REQUIRED) -JWT_SECRET=your-secure-random-key-here -DATABASE_URL=postgresql://user:password@host:port/database - -# File Storage -UPLOAD_PATH=/app/uploads -WATCH_FOLDER=/path/to/mounted/folder - -# Watch Folder (for network mounts) -WATCH_INTERVAL_SECONDS=60 -FORCE_POLLING_WATCH=1 - -# Performance -CONCURRENT_OCR_JOBS=4 -MAX_FILE_SIZE_MB=100 -``` - -### Database Tuning - -For better search performance with large document collections: - -```sql --- Increase shared_buffers for better caching -ALTER SYSTEM SET shared_buffers = '256MB'; - --- Optimize for full-text search -ALTER SYSTEM SET default_text_search_config = 'pg_catalog.english'; - --- Restart PostgreSQL after changes -``` - -## 🔌 API Reference - -### Authentication Endpoints - -```bash -# Register new user -POST /api/auth/register -Content-Type: application/json -{ - "username": "john_doe", - "email": "john@example.com", - "password": "secure_password" -} - -# Login -POST /api/auth/login -Content-Type: application/json -{ - "username": "john_doe", - "password": "secure_password" -} - -# Get current user -GET /api/auth/me -Authorization: Bearer -``` - -### Document Management - -```bash -# Upload document -POST /api/documents -Authorization: Bearer -Content-Type: multipart/form-data -file: - -# List documents -GET /api/documents?limit=50&offset=0 -Authorization: Bearer - -# Download document -GET /api/documents/{id}/download -Authorization: Bearer -``` - -### Search - -```bash -# Search documents -GET /api/search?query=contract&limit=20 -Authorization: Bearer - -# Advanced search with filters -GET /api/search?query=invoice&mime_types=application/pdf&tags=important -Authorization: Bearer -``` - -## 🧪 Testing - -### Run All Tests - -```bash -# Backend tests -cargo test - -# Frontend tests -cd frontend && npm test - -# Integration tests with Docker -docker compose 
-f docker-compose.test.yml up --build -``` - -### Test Coverage - -```bash -# Install cargo-tarpaulin for coverage -cargo install cargo-tarpaulin - -# Generate coverage report -cargo tarpaulin --out Html -``` - -## 🔒 Security Considerations - -### Production Deployment - -1. **Change Default Credentials**: Update admin password immediately -2. **Use Strong JWT Secret**: Generate a secure random key -3. **Enable HTTPS**: Use a reverse proxy with SSL/TLS -4. **Database Security**: Use strong passwords and restrict network access -5. **File Permissions**: Ensure proper file system permissions -6. **Regular Updates**: Keep dependencies and base images updated - -### Recommended Production Setup - -```bash -# Use environment-specific secrets -JWT_SECRET=$(openssl rand -base64 64) - -# Restrict database access -# Only allow connections from application container - -# Use read-only file system where possible -# Mount uploads and watch folders as separate volumes -``` - -## 🚀 Deployment Options - -### Docker Swarm - -```yaml -version: '3.8' -services: - readur: - image: readur:latest - deploy: - replicas: 2 - restart_policy: - condition: on-failure - networks: - - readur-network - secrets: - - jwt_secret - - db_password -``` - -### Kubernetes - -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: readur -spec: - replicas: 3 - selector: - matchLabels: - app: readur - template: - spec: - containers: - - name: readur - image: readur:latest - env: - - name: JWT_SECRET - valueFrom: - secretKeyRef: - name: readur-secrets - key: jwt-secret -``` - -### Cloud Platforms - -- **AWS**: Use ECS with RDS PostgreSQL -- **Google Cloud**: Deploy to Cloud Run with Cloud SQL -- **Azure**: Use Container Instances with Azure Database -- **DigitalOcean**: App Platform with Managed Database +- 4+ CPU cores, 4GB+ RAM, 50GB+ SSD +- See [deployment guide](docs/deployment.md) for details ## 🤝 Contributing -We welcome contributions! 
Please see our [Contributing Guide](CONTRIBUTING.md) for details. +We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) and [Development Setup](docs/dev/development.md) for details. -### Development Setup +## 🔒 Security -```bash -# Fork and clone the repository -git clone https://github.com/yourusername/readur.git -cd readur - -# Create a feature branch -git checkout -b feature/amazing-feature - -# Make your changes and test -cargo test -cd frontend && npm test - -# Submit a pull request -``` - -### Code Style - -- **Rust**: Follow `rustfmt` and `clippy` recommendations -- **Frontend**: Use Prettier and ESLint configurations -- **Commits**: Use conventional commit format +- Change default credentials immediately +- Use HTTPS in production +- Regular security updates +- See [deployment guide](docs/deployment.md#security-considerations) for security best practices ## 📝 License @@ -830,9 +103,9 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file ## 📞 Support -- **Documentation**: Check this README and inline code comments -- **Issues**: Report bugs and request features on GitHub Issues -- **Discussions**: Join community discussions on GitHub Discussions +- **Documentation**: Start with the [User Guide](docs/user-guide.md) +- **Issues**: Report bugs on [GitHub Issues](https://github.com/perfectra1n/readur/issues) +- **Discussions**: Join our [GitHub Discussions](https://github.com/perfectra1n/readur/discussions) --- diff --git a/docs/WATCH_FOLDER.md b/docs/WATCH_FOLDER.md index 4365387..f297536 100644 --- a/docs/WATCH_FOLDER.md +++ b/docs/WATCH_FOLDER.md @@ -1,31 +1,68 @@ -# Watch Folder Documentation +# Watch Folder Guide -The watch folder feature automatically monitors a directory for new OCR-able files and processes them without deleting the original files. This is perfect for scenarios where files are mounted from various filesystem types including NFS, SMB, S3, and local storage. 
+The watch folder feature automatically monitors a directory for new files and processes them with OCR, making them searchable in Readur. Your original files are never modified or deleted - Readur simply copies and processes them while leaving the originals untouched. -## Features +## What is Watch Folder? -### 🔄 Cross-Filesystem Compatibility -- **Automatic Detection**: Detects filesystem type and chooses optimal watching strategy -- **Local Filesystems**: Uses efficient inotify-based watching for ext4, NTFS, APFS, etc. -- **Network Filesystems**: Uses polling-based watching for NFS, SMB/CIFS, S3 mounts -- **Hybrid Fallback**: Gracefully falls back to polling if inotify fails +Watch folder allows you to: +- **Drop files anywhere** - Point Readur to any folder (local, network drive, cloud mount) +- **Automatic processing** - New files are automatically detected and processed +- **Non-destructive** - Original files remain exactly where you put them +- **Background operation** - Processing happens in the background while you continue working -### 📁 Smart File Processing -- **OCR-able File Detection**: Only processes supported file types (PDF, images, text, Word docs) -- **Duplicate Prevention**: Checks for existing files with same name and size -- **File Stability**: Waits for files to finish being written before processing -- **System File Exclusion**: Skips hidden files, temporary files, and system directories +Perfect for scenarios where you want to automatically process files from: +- Network drives (NFS, SMB shares) +- Cloud storage mounts (Google Drive, Dropbox, OneDrive) +- Local folders where you save scanned documents +- Shared team folders -### ⚙️ Configuration Options +## How It Works -| Environment Variable | Default | Description | -|---------------------|---------|-------------| -| `WATCH_FOLDER` | `./watch` | Path to the folder to monitor | -| `WATCH_INTERVAL_SECONDS` | `30` | Polling interval for network filesystems | -| 
`FILE_STABILITY_CHECK_MS` | `500` | Time to wait for file stability | -| `MAX_FILE_AGE_HOURS` | `none` | Skip files older than specified hours | -| `ALLOWED_FILE_TYPES` | `pdf,png,jpg,jpeg,tiff,bmp,txt,doc,docx` | Allowed file extensions | -| `FORCE_POLLING_WATCH` | `unset` | Force polling mode even for local filesystems | +1. **Point Readur to your folder** - Set the `WATCH_FOLDER` path to any directory you want monitored +2. **Drop files** - Add documents to that folder (PDFs, images, text files, Word docs) +3. **Automatic detection** - Readur notices new files within seconds (local) or minutes (network) +4. **OCR processing** - Files are automatically processed to extract searchable text +5. **Search and find** - Your documents become searchable in the Readur web interface + +## Key Features + +✅ **Works with any storage type** - Local drives, network shares, cloud mounts +✅ **Smart processing** - Only processes supported file types +✅ **Duplicate prevention** - Won't process the same file twice +✅ **Safe operation** - Never modifies or deletes your original files +✅ **Background processing** - Doesn't interrupt your workflow + +## Quick Setup + +### Basic Setup (Docker Compose) + +1. **Edit your docker-compose.yml**: +```yaml +services: + readur: + image: readur:latest + volumes: + # Mount your folder to the watch directory + - /path/to/your/documents:/app/watch + environment: + - WATCH_FOLDER=/app/watch +``` + +2. **Start Readur**: +```bash +docker compose up -d +``` + +3. **Start dropping files** into `/path/to/your/documents` - they'll be automatically processed! 
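Before pointing Readur at an existing folder, it can be useful to preview which files match the supported types. The sketch below is an illustration only, not Readur's own code: it hard-codes the default `ALLOWED_FILE_TYPES` list from this guide, assumes extensions are matched case-insensitively, and the helper names `is_eligible` and `preview_folder` are hypothetical.

```python
from pathlib import Path

# Default extension list copied from the ALLOWED_FILE_TYPES setting in this guide.
DEFAULT_ALLOWED = "pdf,png,jpg,jpeg,tiff,bmp,txt,doc,docx"


def is_eligible(filename, allowed=DEFAULT_ALLOWED):
    """Return True if the file's extension is in the comma-separated allow list.

    Case-insensitive matching is an assumption made for this sketch.
    """
    extensions = {ext.strip().lower() for ext in allowed.split(",") if ext.strip()}
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in extensions


def preview_folder(folder, allowed=DEFAULT_ALLOWED):
    """List files under `folder` (searched recursively) that match the allow list.

    Whether Readur itself descends into subfolders is not stated here;
    the recursion is only a convenience of this preview helper.
    """
    return sorted(
        str(path)
        for path in Path(folder).rglob("*")
        if path.is_file() and is_eligible(path.name, allowed)
    )
```

Calling `preview_folder("/path/to/your/documents")` returns the matching paths, which gives a rough idea of how much work the first scan would queue.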
+ +### Configuration Options + +| Setting | Default | What it does | +|---------|---------|-------------| +| `WATCH_FOLDER` | `./watch` | Which folder to monitor | +| `WATCH_INTERVAL_SECONDS` | `30` | How often to check for new files (network drives) | +| `MAX_FILE_AGE_HOURS` | _(none)_ | Ignore files older than this | +| `ALLOWED_FILE_TYPES` | `pdf,png,jpg,jpeg,tiff,bmp,txt,doc,docx` | Which file types to process | ## Usage diff --git a/docs/api-reference.md b/docs/api-reference.md new file mode 100644 index 0000000..8cdd193 --- /dev/null +++ b/docs/api-reference.md @@ -0,0 +1,618 @@ +# API Reference + +Readur provides a comprehensive REST API for integrating with external systems and building custom workflows. + +## Table of Contents + +- [Base URL](#base-url) +- [Authentication](#authentication) +- [Error Handling](#error-handling) +- [Rate Limiting](#rate-limiting) +- [Endpoints](#endpoints) + - [Authentication](#authentication-endpoints) + - [Documents](#document-endpoints) + - [Search](#search-endpoints) + - [OCR Queue](#ocr-queue-endpoints) + - [Settings](#settings-endpoints) + - [Sources](#sources-endpoints) + - [Labels](#labels-endpoints) + - [Users](#user-endpoints) +- [WebSocket API](#websocket-api) +- [Examples](#examples) + +## Base URL + +``` +http://localhost:8000/api +``` + +For production deployments, replace with your configured domain and ensure HTTPS is used. + +## Authentication + +Readur uses JWT (JSON Web Token) authentication. 
Include the token in the Authorization header: + +``` +Authorization: Bearer +``` + +### Obtaining a Token + +```bash +POST /api/auth/login +Content-Type: application/json + +{ + "username": "admin", + "password": "your_password" +} +``` + +Response: +```json +{ + "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", + "user": { + "id": 1, + "username": "admin", + "email": "admin@example.com", + "role": "admin" + } +} +``` + +## Error Handling + +All API errors follow a consistent format: + +```json +{ + "error": { + "code": "VALIDATION_ERROR", + "message": "Invalid request parameters", + "details": { + "field": "email", + "reason": "Invalid email format" + } + } +} +``` + +Common HTTP status codes: +- `200` - Success +- `201` - Created +- `400` - Bad Request +- `401` - Unauthorized +- `403` - Forbidden +- `404` - Not Found +- `422` - Validation Error +- `500` - Internal Server Error + +## Rate Limiting + +API requests are rate-limited to prevent abuse: +- Authenticated users: 1000 requests per hour +- Unauthenticated users: 100 requests per hour + +Rate limit headers: +``` +X-RateLimit-Limit: 1000 +X-RateLimit-Remaining: 999 +X-RateLimit-Reset: 1640995200 +``` + +## Endpoints + +### Authentication Endpoints + +#### Register New User + +```bash +POST /api/auth/register +Content-Type: application/json + +{ + "username": "john_doe", + "email": "john@example.com", + "password": "secure_password" +} +``` + +#### Login + +```bash +POST /api/auth/login +Content-Type: application/json + +{ + "username": "john_doe", + "password": "secure_password" +} +``` + +#### Get Current User + +```bash +GET /api/auth/me +Authorization: Bearer +``` + +#### Logout + +```bash +POST /api/auth/logout +Authorization: Bearer +``` + +### Document Endpoints + +#### Upload Document + +```bash +POST /api/documents +Authorization: Bearer +Content-Type: multipart/form-data + +file: +tags: ["invoice", "2024"] # Optional +``` + +Response: +```json +{ + "id": "550e8400-e29b-41d4-a716-446655440000", + 
"filename": "invoice_2024.pdf", + "mime_type": "application/pdf", + "size": 1048576, + "uploaded_at": "2024-01-01T00:00:00Z", + "ocr_status": "pending" +} +``` + +#### List Documents + +```bash +GET /api/documents?limit=50&offset=0&sort=-uploaded_at +Authorization: Bearer +``` + +Query parameters: +- `limit` - Number of results (default: 50, max: 100) +- `offset` - Pagination offset +- `sort` - Sort field (prefix with `-` for descending) +- `mime_type` - Filter by MIME type +- `ocr_status` - Filter by OCR status +- `tag` - Filter by tag + +#### Get Document Details + +```bash +GET /api/documents/{id} +Authorization: Bearer +``` + +#### Download Document + +```bash +GET /api/documents/{id}/download +Authorization: Bearer +``` + +#### Delete Document + +```bash +DELETE /api/documents/{id} +Authorization: Bearer +``` + +#### Update Document + +```bash +PATCH /api/documents/{id} +Authorization: Bearer +Content-Type: application/json + +{ + "tags": ["invoice", "paid", "2024"] +} +``` + +### Search Endpoints + +#### Search Documents + +```bash +GET /api/search?query=invoice&limit=20 +Authorization: Bearer +``` + +Query parameters: +- `query` - Search query (required) +- `limit` - Number of results +- `offset` - Pagination offset +- `mime_types` - Comma-separated MIME types +- `tags` - Comma-separated tags +- `date_from` - Start date (ISO 8601) +- `date_to` - End date (ISO 8601) + +Response: +```json +{ + "results": [ + { + "id": "550e8400-e29b-41d4-a716-446655440000", + "filename": "invoice_2024.pdf", + "snippet": "...invoice for services rendered in Q1 2024...", + "score": 0.95, + "highlights": ["invoice", "2024"] + } + ], + "total": 42, + "limit": 20, + "offset": 0 +} +``` + +#### Advanced Search + +```bash +POST /api/search/advanced +Authorization: Bearer +Content-Type: application/json + +{ + "query": "invoice", + "filters": { + "mime_types": ["application/pdf"], + "tags": ["unpaid"], + "date_range": { + "from": "2024-01-01", + "to": "2024-12-31" + }, + "file_size": 
{ + "min": 1024, + "max": 10485760 + } + }, + "options": { + "fuzzy": true, + "snippet_length": 200 + } +} +``` + +### OCR Queue Endpoints + +#### Get Queue Status + +```bash +GET /api/queue/status +Authorization: Bearer +``` + +Response: +```json +{ + "pending": 15, + "processing": 3, + "completed_today": 127, + "failed_today": 2, + "average_processing_time": 4.5 +} +``` + +#### Reprocess Document + +```bash +POST /api/documents/{id}/reprocess +Authorization: Bearer +``` + +#### Get Failed OCR Jobs + +```bash +GET /api/queue/failed +Authorization: Bearer +``` + +### Settings Endpoints + +#### Get User Settings + +```bash +GET /api/settings +Authorization: Bearer +``` + +#### Update User Settings + +```bash +PUT /api/settings +Authorization: Bearer +Content-Type: application/json + +{ + "ocr_language": "eng", + "search_results_per_page": 50, + "enable_notifications": true +} +``` + +### Sources Endpoints + +#### List Sources + +```bash +GET /api/sources +Authorization: Bearer +``` + +#### Create Source + +```bash +POST /api/sources +Authorization: Bearer +Content-Type: application/json + +{ + "name": "Network Drive", + "type": "local_folder", + "config": { + "path": "/mnt/network/documents", + "scan_interval": 3600 + }, + "enabled": true +} +``` + +#### Update Source + +```bash +PUT /api/sources/{id} +Authorization: Bearer +Content-Type: application/json + +{ + "enabled": false +} +``` + +#### Delete Source + +```bash +DELETE /api/sources/{id} +Authorization: Bearer +``` + +#### Sync Source + +```bash +POST /api/sources/{id}/sync +Authorization: Bearer +``` + +### Labels Endpoints + +#### List Labels + +```bash +GET /api/labels +Authorization: Bearer +``` + +#### Create Label + +```bash +POST /api/labels +Authorization: Bearer +Content-Type: application/json + +{ + "name": "Important", + "color": "#FF0000" +} +``` + +#### Update Label + +```bash +PUT /api/labels/{id} +Authorization: Bearer +Content-Type: application/json + +{ + "name": "Very Important", + "color": 
"#FF00FF" +} +``` + +#### Delete Label + +```bash +DELETE /api/labels/{id} +Authorization: Bearer +``` + +### User Endpoints + +#### List Users (Admin Only) + +```bash +GET /api/users +Authorization: Bearer +``` + +#### Get User + +```bash +GET /api/users/{id} +Authorization: Bearer +``` + +#### Update User + +```bash +PUT /api/users/{id} +Authorization: Bearer +Content-Type: application/json + +{ + "email": "newemail@example.com", + "role": "user" +} +``` + +#### Delete User (Admin Only) + +```bash +DELETE /api/users/{id} +Authorization: Bearer +``` + +## WebSocket API + +Connect to receive real-time updates: + +```javascript +const ws = new WebSocket('ws://localhost:8000/ws'); + +ws.onmessage = (event) => { + const data = JSON.parse(event.data); + console.log('Event:', data); +}; + +// Authenticate +ws.send(JSON.stringify({ + type: 'auth', + token: 'your_jwt_token' +})); +``` + +Event types: +- `document.uploaded` - New document uploaded +- `ocr.completed` - OCR processing completed +- `ocr.failed` - OCR processing failed +- `source.sync.completed` - Source sync finished + +## Examples + +### Python Example + +```python +import requests + +# Configuration +BASE_URL = "http://localhost:8000/api" +USERNAME = "admin" +PASSWORD = "your_password" + +# Login +response = requests.post(f"{BASE_URL}/auth/login", json={ + "username": USERNAME, + "password": PASSWORD +}) +token = response.json()["token"] +headers = {"Authorization": f"Bearer {token}"} + +# Upload document +with open("document.pdf", "rb") as f: + files = {"file": ("document.pdf", f, "application/pdf")} + response = requests.post( + f"{BASE_URL}/documents", + headers=headers, + files=files + ) + document_id = response.json()["id"] + +# Search documents +response = requests.get( + f"{BASE_URL}/search", + headers=headers, + params={"query": "invoice 2024"} +) +results = response.json()["results"] +``` + +### JavaScript Example + +```javascript +// Configuration +const BASE_URL = 'http://localhost:8000/api'; + 
+// Login +async function login(username, password) { + const response = await fetch(`${BASE_URL}/auth/login`, { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ username, password }) + }); + const data = await response.json(); + return data.token; +} + +// Upload document +async function uploadDocument(token, file) { + const formData = new FormData(); + formData.append('file', file); + + const response = await fetch(`${BASE_URL}/documents`, { + method: 'POST', + headers: { 'Authorization': `Bearer ${token}` }, + body: formData + }); + return response.json(); +} + +// Search documents +async function searchDocuments(token, query) { + const response = await fetch( + `${BASE_URL}/search?query=${encodeURIComponent(query)}`, + { + headers: { 'Authorization': `Bearer ${token}` } + } + ); + return response.json(); +} +``` + +### cURL Examples + +```bash +# Login +TOKEN=$(curl -s -X POST http://localhost:8000/api/auth/login \ + -H "Content-Type: application/json" \ + -d '{"username":"admin","password":"your_password"}' \ + | jq -r .token) + +# Upload document +curl -X POST http://localhost:8000/api/documents \ + -H "Authorization: Bearer $TOKEN" \ + -F "file=@document.pdf" + +# Search documents +curl -X GET "http://localhost:8000/api/search?query=invoice" \ + -H "Authorization: Bearer $TOKEN" + +# Get document +curl -X GET http://localhost:8000/api/documents/550e8400-e29b-41d4-a716-446655440000 \ + -H "Authorization: Bearer $TOKEN" +``` + +## OpenAPI Specification + +The complete OpenAPI specification is available at: +``` +GET /api/openapi.json +``` + +You can use this with tools like Swagger UI or to generate client libraries. + +## SDK Support + +Official SDKs are planned for: +- Python +- JavaScript/TypeScript +- Go +- Ruby + +Check the [GitHub repository](https://github.com/perfectra1n/readur) for the latest SDK availability. 
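Until official SDKs are available, a thin helper over the endpoints documented above may be enough for scripting. The following is a minimal sketch rather than part of Readur: `build_request` is a hypothetical helper name, and it assumes only the base URL, the `Authorization: Bearer` header, and the `/search?query=` endpoint shown in the examples above. It builds the request without sending it, so it can be inspected or passed to `urllib.request.urlopen`.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000/api"  # same base URL as the examples above


def build_request(token, path, params=None, method="GET"):
    """Build (but do not send) an authenticated request for an API path.

    Hypothetical helper for illustration; not an official Readur SDK call.
    """
    url = f"{BASE_URL}{path}"
    if params:
        # urlencode escapes spaces and special characters in query values
        url = f"{url}?{urlencode(params)}"
    return Request(url, method=method, headers={"Authorization": f"Bearer {token}"})


# Prepare a search request; pass it to urllib.request.urlopen() to send it.
req = build_request("your_jwt_token", "/search", {"query": "invoice 2024"})
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the helper easy to test without a running server.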
\ No newline at end of file diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 0000000..c79298a --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,261 @@ +# Configuration Guide + +This guide covers all configuration options available in Readur through environment variables and runtime settings. + +## Table of Contents + +- [Environment Variables](#environment-variables) + - [Core Configuration](#core-configuration) + - [File Storage & Upload](#file-storage--upload) + - [Watch Folder Configuration](#watch-folder-configuration) + - [OCR & Processing Settings](#ocr--processing-settings) + - [Search & Performance](#search--performance) + - [Data Management](#data-management) +- [Port Configuration](#port-configuration) +- [Example Configurations](#example-configurations) +- [Configuration Priority](#configuration-priority) +- [Runtime Settings vs Environment Variables](#runtime-settings-vs-environment-variables) +- [Database Tuning](#database-tuning) + +## Environment Variables + +All application settings can be configured via environment variables: + +### Core Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `DATABASE_URL` | `postgresql://readur:readur@localhost/readur` | PostgreSQL connection string | +| `JWT_SECRET` | `your-secret-key` | Secret key for JWT tokens ⚠️ **Change in production!** | +| `SERVER_ADDRESS` | `0.0.0.0:8000` | Server bind address and port | + +### File Storage & Upload + +| Variable | Default | Description | +|----------|---------|-------------| +| `UPLOAD_PATH` | `./uploads` | Document storage directory | +| `ALLOWED_FILE_TYPES` | `pdf,txt,doc,docx,png,jpg,jpeg` | Comma-separated allowed file extensions | + +### Watch Folder Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `WATCH_FOLDER` | `./watch` | Directory to monitor for new files | +| `WATCH_INTERVAL_SECONDS` | `30` | Polling interval for network 
filesystems (seconds) | +| `FILE_STABILITY_CHECK_MS` | `500` | Time to wait for file write completion (milliseconds) | +| `MAX_FILE_AGE_HOURS` | _(none)_ | Skip files older than this many hours | +| `FORCE_POLLING_WATCH` | _(none)_ | Force polling mode even for local filesystems | + +### OCR & Processing Settings + +*Note: These settings can also be configured per-user via the web interface* + +| Variable | Default | Description | +|----------|---------|-------------| +| `OCR_LANGUAGE` | `eng` | OCR language code (eng, fra, deu, spa, etc.) | +| `CONCURRENT_OCR_JOBS` | `4` | Maximum parallel OCR processes | +| `OCR_TIMEOUT_SECONDS` | `300` | OCR processing timeout per file | +| `MAX_FILE_SIZE_MB` | `50` | Maximum file size for processing | +| `AUTO_ROTATE_IMAGES` | `true` | Automatically rotate images for better OCR | +| `ENABLE_IMAGE_PREPROCESSING` | `true` | Apply image enhancement before OCR | + +### Search & Performance + +| Variable | Default | Description | +|----------|---------|-------------| +| `SEARCH_RESULTS_PER_PAGE` | `25` | Default number of search results per page | +| `SEARCH_SNIPPET_LENGTH` | `200` | Length of text snippets in search results | +| `FUZZY_SEARCH_THRESHOLD` | `0.8` | Similarity threshold for fuzzy search (0.0-1.0) | +| `MEMORY_LIMIT_MB` | `512` | Memory limit for OCR processes | +| `CPU_PRIORITY` | `normal` | CPU priority: `low`, `normal`, `high` | + +### Data Management + +| Variable | Default | Description | +|----------|---------|-------------| +| `RETENTION_DAYS` | _(none)_ | Auto-delete documents after N days | +| `ENABLE_AUTO_CLEANUP` | `false` | Enable automatic cleanup of old documents | +| `ENABLE_COMPRESSION` | `false` | Compress stored documents to save space | +| `ENABLE_BACKGROUND_OCR` | `true` | Process OCR in background queue | + +## Port Configuration + +Readur supports flexible port configuration: + +```bash +# Method 1: Specify full server address +SERVER_ADDRESS=0.0.0.0:8000 + +# Method 2: Use separate host and port 
(recommended) +SERVER_HOST=0.0.0.0 +SERVER_PORT=8000 + +# For development: Configure frontend port +CLIENT_PORT=5173 +BACKEND_PORT=8000 +``` + +## Example Configurations + +### Development Configuration + +```env +# Basic development setup +DATABASE_URL=postgresql://readur:readur@localhost/readur +JWT_SECRET=dev-secret-key-not-for-production +SERVER_ADDRESS=0.0.0.0:8000 +UPLOAD_PATH=./uploads +WATCH_FOLDER=./watch +OCR_LANGUAGE=eng +CONCURRENT_OCR_JOBS=2 +``` + +### Production Configuration + +```env +# Core settings +DATABASE_URL=postgresql://readur:secure_password@postgres:5432/readur +JWT_SECRET=your-very-long-random-secret-key-generated-with-openssl +SERVER_ADDRESS=0.0.0.0:8000 + +# File handling +UPLOAD_PATH=/app/uploads +ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,rtf,doc,docx + +# Watch folder for NFS mount +WATCH_FOLDER=/mnt/nfs/documents +WATCH_INTERVAL_SECONDS=60 +FILE_STABILITY_CHECK_MS=1000 +MAX_FILE_AGE_HOURS=168 +FORCE_POLLING_WATCH=1 + +# OCR optimization +OCR_LANGUAGE=eng +CONCURRENT_OCR_JOBS=8 +OCR_TIMEOUT_SECONDS=600 +MAX_FILE_SIZE_MB=200 +AUTO_ROTATE_IMAGES=true +ENABLE_IMAGE_PREPROCESSING=true + +# Performance tuning +MEMORY_LIMIT_MB=2048 +CPU_PRIORITY=high +ENABLE_COMPRESSION=true +ENABLE_BACKGROUND_OCR=true + +# Search optimization +SEARCH_RESULTS_PER_PAGE=50 +SEARCH_SNIPPET_LENGTH=300 +FUZZY_SEARCH_THRESHOLD=0.7 + +# Data management +RETENTION_DAYS=2555 # 7 years +ENABLE_AUTO_CLEANUP=true +``` + +### Network Filesystem Configuration + +```env +# For NFS mounts +WATCH_FOLDER=/mnt/nfs/documents +WATCH_INTERVAL_SECONDS=60 +FILE_STABILITY_CHECK_MS=1000 +FORCE_POLLING_WATCH=1 + +# For SMB/CIFS mounts +WATCH_FOLDER=/mnt/smb/shared +WATCH_INTERVAL_SECONDS=30 +FILE_STABILITY_CHECK_MS=2000 + +# For S3 mounts (using s3fs) +WATCH_FOLDER=/mnt/s3/bucket +WATCH_INTERVAL_SECONDS=120 +FILE_STABILITY_CHECK_MS=5000 +FORCE_POLLING_WATCH=1 +``` + +## Configuration Priority + +Settings are applied in this order (later values override earlier ones): + +1. 
**Application defaults** (built into the code) +2. **Environment variables** (system-wide configuration) +3. **User settings** (per-user database settings via web interface) + +This allows for flexible deployment where system administrators can set defaults while users can customize their experience. + +## Runtime Settings vs Environment Variables + +Some settings can be configured in two ways: + +1. **Environment Variables**: Set at container startup, affect the entire application +2. **User Settings**: Configured per-user via the web interface, stored in the database + +**Environment variables provide system-wide defaults.** User settings override these defaults for individual users where applicable. + +Settings configurable via web interface: +- OCR language preferences +- Search result limits +- File type restrictions +- OCR processing options +- Data retention policies + +## Database Tuning + +For better search performance with large document collections: + +```sql +-- Increase shared_buffers for better caching +ALTER SYSTEM SET shared_buffers = '256MB'; + +-- Optimize for full-text search +ALTER SYSTEM SET default_text_search_config = 'pg_catalog.english'; + +-- Restart PostgreSQL after changes +``` + +## Security Configuration + +### Generating Secure Secrets + +```bash +# Generate secure JWT secret (strip the line break openssl inserts after 64 base64 characters) +JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n') + +# Generate secure database password (hex avoids URL-special characters in DATABASE_URL) +DB_PASSWORD=$(openssl rand -hex 32) + +# Save to .env file +cat > .env << EOF +JWT_SECRET=${JWT_SECRET} +DB_PASSWORD=${DB_PASSWORD} +EOF +``` + +### Quick Reference - Essential Variables + +For a minimal production deployment, configure these essential variables: + +```bash +# Security (REQUIRED) +JWT_SECRET=your-secure-random-key-here +DATABASE_URL=postgresql://user:password@host:port/database + +# File Storage +UPLOAD_PATH=/app/uploads +WATCH_FOLDER=/path/to/mounted/folder + +# Watch Folder (for network mounts) +WATCH_INTERVAL_SECONDS=60 +FORCE_POLLING_WATCH=1 + 
+# Performance +CONCURRENT_OCR_JOBS=4 +MAX_FILE_SIZE_MB=100 +``` + +## Next Steps + +- Review [deployment options](deployment.md) for production setup +- Learn about [folder watching](WATCH_FOLDER.md) for automatic document ingestion +- Optimize [OCR performance](dev/OCR_OPTIMIZATION_GUIDE.md) for your use case \ No newline at end of file diff --git a/docs/deployment.md b/docs/deployment.md new file mode 100644 index 0000000..0b8a04b --- /dev/null +++ b/docs/deployment.md @@ -0,0 +1,403 @@ +# Deployment Guide + +This guide covers production deployment strategies, SSL setup, monitoring, backups, and best practices for running Readur in production. + +## Table of Contents + +- [Production Docker Compose](#production-docker-compose) +- [Network Filesystem Mounts](#network-filesystem-mounts) + - [NFS Mounts](#nfs-mounts) + - [SMB/CIFS Mounts](#smbcifs-mounts) + - [S3 Mounts](#s3-mounts) +- [SSL/HTTPS Setup](#sslhttps-setup) + - [Nginx Configuration](#nginx-configuration) + - [Traefik Configuration](#traefik-configuration) +- [Health Checks](#health-checks) +- [Backup Strategy](#backup-strategy) +- [Monitoring](#monitoring) +- [Deployment Platforms](#deployment-platforms) + - [Docker Swarm](#docker-swarm) + - [Kubernetes](#kubernetes) + - [Cloud Platforms](#cloud-platforms) +- [Security Considerations](#security-considerations) + +## Production Docker Compose + +For production deployments, create a custom `docker-compose.prod.yml`: + +```yaml +services: + readur: + image: readur:latest + ports: + - "8000:8000" + environment: + # Core Configuration + - DATABASE_URL=postgresql://readur:${DB_PASSWORD}@postgres:5432/readur + - JWT_SECRET=${JWT_SECRET} + - SERVER_ADDRESS=0.0.0.0:8000 + + # File Storage + - UPLOAD_PATH=/app/uploads + - WATCH_FOLDER=/app/watch + - ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,doc,docx + + # Watch Folder Settings + - WATCH_INTERVAL_SECONDS=30 + - FILE_STABILITY_CHECK_MS=500 + - MAX_FILE_AGE_HOURS=168 + + # OCR Configuration + - 
OCR_LANGUAGE=eng + - CONCURRENT_OCR_JOBS=4 + - OCR_TIMEOUT_SECONDS=300 + - MAX_FILE_SIZE_MB=100 + + # Performance Tuning + - MEMORY_LIMIT_MB=1024 + - CPU_PRIORITY=normal + - ENABLE_COMPRESSION=true + + volumes: + # Document storage + - ./data/uploads:/app/uploads + + # Watch folder - mount your network drives here + - /mnt/nfs/documents:/app/watch + # or SMB: - /mnt/smb/shared:/app/watch + # or S3: - /mnt/s3/bucket:/app/watch + + depends_on: + - postgres + restart: unless-stopped + + # Resource limits for production + deploy: + resources: + limits: + memory: 2G + cpus: '2.0' + reservations: + memory: 512M + cpus: '0.5' + + postgres: + image: postgres:15 + environment: + - POSTGRES_USER=readur + - POSTGRES_PASSWORD=${DB_PASSWORD} + - POSTGRES_DB=readur + - POSTGRES_INITDB_ARGS=--encoding=UTF-8 --lc-collate=en_US.UTF-8 --lc-ctype=en_US.UTF-8 + + volumes: + - postgres_data:/var/lib/postgresql/data + - ./postgres-config:/etc/postgresql/conf.d:ro + + # PostgreSQL optimization for document search + command: > + postgres + -c shared_buffers=256MB + -c effective_cache_size=1GB + -c max_connections=100 + -c default_text_search_config=pg_catalog.english + + restart: unless-stopped + + # Don't expose port in production + # ports: + # - "5433:5432" + +volumes: + postgres_data: + driver: local +``` + +Deploy with environment file: +```bash +# Create .env file with secrets (single-line JWT secret; hex keeps the DB password safe inside DATABASE_URL) +cat > .env << EOF +JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n') +DB_PASSWORD=$(openssl rand -hex 32) +EOF + +# Deploy +docker compose -f docker-compose.prod.yml --env-file .env up -d +``` + +## Network Filesystem Mounts + +### NFS Mounts + +```bash +# Mount NFS share +sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs/documents + +# Add to docker-compose.yml +volumes: + - /mnt/nfs/documents:/app/watch +environment: + - WATCH_INTERVAL_SECONDS=60 + - FILE_STABILITY_CHECK_MS=1000 + - FORCE_POLLING_WATCH=1 +``` + +### SMB/CIFS Mounts + +```bash +# Mount SMB share +sudo mount -t cifs //server/share /mnt/smb/shared -o 
username=user,password=pass + +# Docker volume configuration +volumes: + - /mnt/smb/shared:/app/watch +environment: + - WATCH_INTERVAL_SECONDS=30 + - FILE_STABILITY_CHECK_MS=2000 +``` + +### S3 Mounts + +```bash +# Mount S3 bucket using s3fs +s3fs mybucket /mnt/s3/bucket -o passwd_file=~/.passwd-s3fs + +# Docker configuration for S3 +volumes: + - /mnt/s3/bucket:/app/watch +environment: + - WATCH_INTERVAL_SECONDS=120 + - FILE_STABILITY_CHECK_MS=5000 + - FORCE_POLLING_WATCH=1 +``` + +## SSL/HTTPS Setup + +### Nginx Configuration + +```nginx +server { + listen 443 ssl http2; + server_name readur.yourdomain.com; + + ssl_certificate /path/to/cert.pem; + ssl_certificate_key /path/to/key.pem; + + location / { + proxy_pass http://localhost:8000; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + proxy_set_header X-Forwarded-Proto $scheme; + + # For file uploads + client_max_body_size 100M; + proxy_read_timeout 300s; + proxy_send_timeout 300s; + } +} +``` + +### Traefik Configuration + +```yaml +services: + readur: + labels: + - "traefik.enable=true" + - "traefik.http.routers.readur.rule=Host(`readur.yourdomain.com`)" + - "traefik.http.routers.readur.tls=true" + - "traefik.http.routers.readur.tls.certresolver=letsencrypt" +``` + +> 📘 **For more reverse proxy configurations** including Apache, Caddy, custom ports, load balancing, and advanced scenarios, see [REVERSE_PROXY.md](./REVERSE_PROXY.md). 
+ +## Health Checks + +Add health checks to your Docker configuration: + +```yaml +services: + readur: + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 40s +``` + +## Backup Strategy + +Create an automated backup script: + +```bash +#!/bin/bash +# backup.sh - Automated backup script + +BACKUP_DIR="/path/to/backups" +DATE=$(date +%Y%m%d_%H%M%S) + +# Create backup directory +mkdir -p "$BACKUP_DIR" + +# Backup database +docker exec readur-postgres-1 pg_dump -U readur readur | gzip > "$BACKUP_DIR/db_backup_$DATE.sql.gz" + +# Backup uploaded files +tar -czf "$BACKUP_DIR/uploads_backup_$DATE.tar.gz" -C ./data uploads/ + +# Clean old backups (keep 30 days) +find "$BACKUP_DIR" -name "db_backup_*.sql.gz" -mtime +30 -delete +find "$BACKUP_DIR" -name "uploads_backup_*.tar.gz" -mtime +30 -delete + +echo "Backup completed: $DATE" +``` + +Add to crontab for daily backups: +```bash +0 2 * * * /path/to/backup.sh >> /var/log/readur-backup.log 2>&1 +``` + +### Restore from Backup + +```bash +# Restore database +gunzip -c db_backup_20240101_020000.sql.gz | docker exec -i readur-postgres-1 psql -U readur readur + +# Restore files +tar -xzf uploads_backup_20240101_020000.tar.gz -C ./data +``` + +## Monitoring + +Monitor your deployment with Docker stats: + +```bash +# Real-time resource usage +docker stats + +# Container logs +docker compose logs -f readur + +# Watch folder activity +docker compose logs -f readur | grep watcher + +# PostgreSQL query performance (requires the pg_stat_statements extension; the column is total_exec_time on PostgreSQL 13+) +docker exec readur-postgres-1 psql -U readur -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;" +``` + +### Prometheus Metrics + +Readur exposes metrics at the `/metrics` endpoint: + +```yaml +# prometheus.yml +scrape_configs: + - job_name: 'readur' + static_configs: + - targets: ['readur:8000'] +``` + +## Deployment Platforms + +### Docker Swarm + +```yaml +version: '3.8' +services: + readur: + image: readur:latest + 
deploy: + replicas: 2 + restart_policy: + condition: on-failure + placement: + constraints: [node.role == worker] + networks: + - readur-network + secrets: + - jwt_secret + - db_password + +secrets: + jwt_secret: + external: true + db_password: + external: true +``` + +### Kubernetes + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: readur +spec: + replicas: 3 + selector: + matchLabels: + app: readur + template: + metadata: + labels: + app: readur # must match spec.selector.matchLabels + spec: + containers: + - name: readur + image: readur:latest + env: + - name: JWT_SECRET + valueFrom: + secretKeyRef: + name: readur-secrets + key: jwt-secret + resources: + limits: + memory: "2Gi" + cpu: "2" + requests: + memory: "512Mi" + cpu: "500m" +``` + +### Cloud Platforms + +- **AWS**: Use ECS with RDS PostgreSQL +- **Google Cloud**: Deploy to Cloud Run with Cloud SQL +- **Azure**: Use Container Instances with Azure Database +- **DigitalOcean**: App Platform with Managed Database + +## Security Considerations + +### Production Checklist + +- [ ] Change default admin password +- [ ] Generate strong JWT secret +- [ ] Use HTTPS/SSL in production +- [ ] Restrict database network access +- [ ] Set proper file permissions +- [ ] Enable firewall rules +- [ ] Regular security updates +- [ ] Monitor access logs +- [ ] Implement rate limiting +- [ ] Enable audit logging + +### Recommended Production Setup + +```bash +# Generate secure secrets (single-line JWT secret; hex keeps the DB password safe inside DATABASE_URL) +JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n') +DB_PASSWORD=$(openssl rand -hex 32) + +# Restrict file permissions +chmod 600 .env +chmod 700 ./data/uploads + +# Use read-only root filesystem +docker run --read-only --tmpfs /tmp ... 
+``` + +## Next Steps + +- Configure [monitoring and alerting](monitoring-usage) +- Review [security best practices](security) +- Set up [automated backups](#backup-strategy) +- Explore [database guardrails](dev/DATABASE_GUARDRAILS.md) \ No newline at end of file diff --git a/docs/DATABASE_GUARDRAILS.md b/docs/dev/DATABASE_GUARDRAILS.md similarity index 100% rename from docs/DATABASE_GUARDRAILS.md rename to docs/dev/DATABASE_GUARDRAILS.md diff --git a/docs/DEPLOYMENT_SUMMARY.md b/docs/dev/DEPLOYMENT_SUMMARY.md similarity index 100% rename from docs/DEPLOYMENT_SUMMARY.md rename to docs/dev/DEPLOYMENT_SUMMARY.md diff --git a/docs/OCR_OPTIMIZATION_GUIDE.md b/docs/dev/OCR_OPTIMIZATION_GUIDE.md similarity index 100% rename from docs/OCR_OPTIMIZATION_GUIDE.md rename to docs/dev/OCR_OPTIMIZATION_GUIDE.md diff --git a/docs/QUEUE_IMPROVEMENTS.md b/docs/dev/QUEUE_IMPROVEMENTS.md similarity index 100% rename from docs/QUEUE_IMPROVEMENTS.md rename to docs/dev/QUEUE_IMPROVEMENTS.md diff --git a/docs/dev/README.md b/docs/dev/README.md new file mode 100644 index 0000000..93e7dde --- /dev/null +++ b/docs/dev/README.md @@ -0,0 +1,47 @@ +# Developer Documentation + +This directory contains technical documentation for developers working on Readur. 
+ +## 📋 Table of Contents + +### 🏗️ Architecture & Design +- [**Architecture Overview**](architecture.md) - System design, components, and data flow +- [**Database Guardrails**](DATABASE_GUARDRAILS.md) - Concurrency safety and database best practices + +### 🛠️ Development +- [**Development Guide**](development.md) - Setup, contributing, code style guidelines +- [**Testing Guide**](TESTING.md) - Comprehensive testing strategy and instructions + +### ⚙️ Technical Guides +- [**OCR Optimization**](OCR_OPTIMIZATION_GUIDE.md) - Performance tuning and best practices +- [**Queue Improvements**](QUEUE_IMPROVEMENTS.md) - Background job processing architecture +- [**Deployment Summary**](DEPLOYMENT_SUMMARY.md) - Technical deployment overview + +## 🚀 Quick Start for Developers + +1. **Read the [Architecture Overview](architecture.md)** to understand the system design +2. **Follow the [Development Guide](development.md)** to set up your local environment +3. **Review the [Testing Guide](TESTING.md)** to understand our testing approach +4. 
**Check [Database Guardrails](DATABASE_GUARDRAILS.md)** for data safety patterns + +## 📖 Related User Documentation + +- [Installation Guide](../installation.md) - How to install and run Readur +- [Configuration Guide](../configuration.md) - Environment variables and settings +- [User Guide](../user-guide.md) - How to use Readur features +- [API Reference](../api-reference.md) - REST API documentation + +## 🤝 Contributing + +Please read our [Development Guide](development.md) for: +- Setting up your development environment +- Code style guidelines +- Testing requirements +- Pull request process + +## 🏷️ Document Categories + +- **📘 User Docs**: Installation, configuration, user guide +- **🔧 Operations**: Deployment, monitoring, troubleshooting +- **💻 Developer**: Architecture, development setup, testing +- **🔌 Integration**: API reference, webhooks, extensions \ No newline at end of file diff --git a/docs/TESTING.md b/docs/dev/TESTING.md similarity index 100% rename from docs/TESTING.md rename to docs/dev/TESTING.md diff --git a/docs/dev/architecture.md b/docs/dev/architecture.md new file mode 100644 index 0000000..9a3f644 --- /dev/null +++ b/docs/dev/architecture.md @@ -0,0 +1,350 @@ +# Architecture Overview + +This document provides a comprehensive overview of Readur's architecture, design decisions, and technical implementation details. 
+ +## Table of Contents + +- [System Architecture](#system-architecture) +- [Technology Stack](#technology-stack) +- [Component Overview](#component-overview) + - [Backend (Rust/Axum)](#backend-rustaxum) + - [Frontend (React)](#frontend-react) + - [Database (PostgreSQL)](#database-postgresql) + - [OCR Engine](#ocr-engine) +- [Data Flow](#data-flow) +- [Security Architecture](#security-architecture) +- [Performance Considerations](#performance-considerations) +- [Scalability](#scalability) +- [Design Patterns](#design-patterns) + +## System Architecture + +``` +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ React Frontend │────│ Rust Backend │────│ PostgreSQL DB │ +│ (Port 8000) │ │ (Axum API) │ │ (Port 5433) │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ + │ │ │ + │ ┌─────────────────┐ │ + └──────────────│ File Storage │─────────────┘ + │ + OCR Engine │ + └─────────────────┘ +``` + +### High-Level Components + +1. **Web Interface**: Modern React SPA with Material-UI +2. **API Server**: High-performance Rust backend using Axum +3. **Database**: PostgreSQL with full-text search capabilities +4. **File Storage**: Local or network-mounted filesystem +5. **OCR Processing**: Tesseract integration for text extraction +6. 
**Background Jobs**: Async task processing for OCR and file watching + +## Technology Stack + +### Backend +- **Language**: Rust (for performance and memory safety) +- **Web Framework**: Axum (async, fast, type-safe) +- **Database ORM**: SQLx (compile-time checked queries) +- **Authentication**: JWT tokens with bcrypt password hashing +- **Async Runtime**: Tokio +- **Serialization**: Serde + +### Frontend +- **Framework**: React 18 with TypeScript +- **UI Library**: Material-UI (MUI) +- **State Management**: React Context + Hooks +- **Build Tool**: Vite +- **HTTP Client**: Axios +- **Routing**: React Router + +### Infrastructure +- **Database**: PostgreSQL 14+ with pgvector extension +- **OCR**: Tesseract 4.0+ +- **Container**: Docker with multi-stage builds +- **Reverse Proxy**: Nginx/Traefik compatible + +## Component Overview + +### Backend (Rust/Axum) + +The backend is structured following clean architecture principles: + +``` +src/ +├── main.rs # Application entry and server setup +├── config.rs # Configuration management +├── models.rs # Domain models and DTOs +├── error.rs # Error handling +├── auth.rs # Authentication middleware +├── routes/ # HTTP route handlers +│ ├── auth.rs # Authentication endpoints +│ ├── documents.rs # Document CRUD operations +│ ├── search.rs # Search functionality +│ └── ... +├── db/ # Database operations +│ ├── documents.rs # Document queries +│ ├── users.rs # User queries +│ └── ... 
+├── services/ # Business logic +│ ├── ocr.rs # OCR processing +│ ├── file_service.rs # File management +│ └── watcher.rs # Folder watching +└── tests/ # Integration tests +``` + +Key design decisions: +- **Async-first**: All I/O operations are async +- **Type safety**: Leverages Rust's type system +- **Error handling**: Comprehensive error types +- **Dependency injection**: Clean separation of concerns + +### Frontend (React) + +The frontend follows a component-based architecture: + +``` +frontend/src/ +├── components/ # Reusable UI components +│ ├── DocumentList/ +│ ├── SearchBar/ +│ └── ... +├── pages/ # Page-level components +│ ├── Dashboard/ +│ ├── Documents/ +│ └── ... +├── services/ # API integration +│ ├── api.ts # Base API client +│ ├── auth.ts # Auth service +│ └── documents.ts # Document service +├── hooks/ # Custom React hooks +├── contexts/ # React contexts +└── utils/ # Utility functions +``` + +### Database (PostgreSQL) + +Schema design optimized for document management: + +```sql +-- Core tables +users # User accounts +documents # Document metadata +document_content # Extracted text content +document_tags # Many-to-many tags +sources # File sources (folders, S3, etc.) +ocr_queue # OCR processing queue + +-- Search optimization +document_search_index # Full-text search index +``` + +Key features: +- **Full-text search**: PostgreSQL's powerful search capabilities +- **JSONB fields**: Flexible metadata storage +- **Triggers**: Automatic search index updates +- **Views**: Optimized query patterns + +### OCR Engine + +OCR processing pipeline: + +1. **File Detection**: New files detected via upload or folder watch +2. **Queue Management**: Files added to processing queue +3. **Pre-processing**: Image enhancement and optimization +4. **Text Extraction**: Tesseract OCR with language detection +5. 
**Post-processing**: Text cleaning and formatting +6. **Database Storage**: Indexed for search + +## Data Flow + +### Document Upload Flow + +```mermaid +sequenceDiagram + User->>Frontend: Upload Document + Frontend->>API: POST /api/documents + API->>FileStorage: Save File + API->>Database: Create Document Record + API->>OCRQueue: Add to Queue + API-->>Frontend: Document Created + OCRWorker->>OCRQueue: Poll for Jobs + OCRWorker->>FileStorage: Read File + OCRWorker->>Tesseract: Extract Text + OCRWorker->>Database: Update with Content + OCRWorker->>Frontend: WebSocket Update +``` + +### Search Flow + +```mermaid +sequenceDiagram + User->>Frontend: Enter Search Query + Frontend->>API: GET /api/search + API->>Database: Full-text Search + Database->>API: Ranked Results + API->>Frontend: Search Results + Frontend->>User: Display Results +``` + +## Security Architecture + +### Authentication & Authorization + +- **JWT Tokens**: Stateless authentication +- **Role-Based Access**: Admin, User roles +- **Token Refresh**: Automatic token renewal +- **Password Security**: Bcrypt with salt rounds + +### API Security + +- **CORS**: Configurable allowed origins +- **Rate Limiting**: Prevent abuse +- **Input Validation**: Comprehensive validation +- **SQL Injection**: Parameterized queries via SQLx + +### File Security + +- **Upload Validation**: File type and size checks +- **Virus Scanning**: Optional ClamAV integration +- **Access Control**: Document-level permissions +- **Secure Storage**: Filesystem permissions + +## Performance Considerations + +### Backend Optimization + +- **Connection Pooling**: Database connection reuse +- **Async I/O**: Non-blocking operations +- **Caching**: In-memory caching for hot data +- **Query Optimization**: Indexed searches + +### Frontend Optimization + +- **Code Splitting**: Lazy loading of routes +- **Virtual Scrolling**: Large document lists +- **Memoization**: Prevent unnecessary re-renders +- **Service Workers**: Offline capability + +### 
OCR Optimization + +- **Parallel Processing**: Multiple concurrent jobs +- **Image Pre-processing**: Enhance OCR accuracy +- **Resource Limits**: Memory and CPU constraints +- **Queue Priority**: Smart job scheduling + +## Scalability + +### Horizontal Scaling + +```yaml +# Multiple backend instances +backend-1: + image: readur:latest + environment: + - INSTANCE_ID=1 + +backend-2: + image: readur:latest + environment: + - INSTANCE_ID=2 + +# Load balancer +nginx: + upstream backend { + server backend-1:8000; + server backend-2:8000; + } +``` + +### Database Scaling + +- **Read Replicas**: Distribute read load +- **Connection Pooling**: PgBouncer +- **Partitioning**: Time-based partitions +- **Archival**: Move old documents + +### Storage Scaling + +- **S3 Compatible**: Object storage support +- **CDN Integration**: Static file delivery +- **Distributed Storage**: GlusterFS/Ceph +- **Archive Tiering**: Hot/cold storage + +## Design Patterns + +### Backend Patterns + +1. **Repository Pattern**: Database abstraction +2. **Service Layer**: Business logic separation +3. **Middleware Chain**: Request processing +4. **Error Boundaries**: Graceful error handling + +### Frontend Patterns + +1. **Container/Presenter**: Component separation +2. **Custom Hooks**: Logic reuse +3. **Context Provider**: State management +4. **HOCs**: Cross-cutting concerns + +### Database Patterns + +1. **Soft Deletes**: Data preservation +2. **Audit Trails**: Change tracking +3. **Materialized Views**: Performance +4. 
**Event Sourcing**: Optional audit log + +## Future Architecture Considerations + +### Microservices Migration + +Potential service boundaries: +- Authentication Service +- Document Service +- OCR Service +- Search Service +- Notification Service + +### Event-Driven Architecture + +- Message Queue (RabbitMQ/Kafka) +- Event Sourcing +- CQRS Pattern +- Async communication + +### Cloud-Native Features + +- Kubernetes deployment +- Service mesh (Istio) +- Distributed tracing +- Cloud storage integration + +## Monitoring and Observability + +### Metrics + +- Prometheus metrics endpoint +- Custom business metrics +- Performance counters +- Resource utilization + +### Logging + +- Structured logging (JSON) +- Log aggregation ready +- Correlation IDs +- Debug levels + +### Tracing + +- OpenTelemetry support +- Distributed tracing +- Performance profiling +- Request tracking + +## Next Steps + +- Review [deployment options](deployment.md) +- Explore [performance tuning](OCR_OPTIMIZATION_GUIDE.md) +- Understand [database design](DATABASE_GUARDRAILS.md) +- Learn about [testing strategy](TESTING.md) \ No newline at end of file diff --git a/docs/dev/development.md b/docs/dev/development.md new file mode 100644 index 0000000..5f7ff1a --- /dev/null +++ b/docs/dev/development.md @@ -0,0 +1,434 @@ +# Development Guide + +This guide covers contributing to Readur, setting up a development environment, testing, and code style guidelines. 
+ +## Table of Contents + +- [Development Setup](#development-setup) + - [Prerequisites](#prerequisites) + - [Local Development](#local-development) + - [Development with Docker](#development-with-docker) +- [Project Structure](#project-structure) +- [Testing](#testing) + - [Backend Tests](#backend-tests) + - [Frontend Tests](#frontend-tests) + - [Integration Tests](#integration-tests) + - [E2E Tests](#e2e-tests) +- [Code Style](#code-style) + - [Rust Guidelines](#rust-guidelines) + - [Frontend Guidelines](#frontend-guidelines) +- [Contributing](#contributing) + - [Getting Started](#getting-started) + - [Pull Request Process](#pull-request-process) + - [Commit Guidelines](#commit-guidelines) +- [Debugging](#debugging) +- [Performance Profiling](#performance-profiling) + +## Development Setup + +### Prerequisites + +- Rust 1.70+ and Cargo +- Node.js 18+ and npm +- PostgreSQL 14+ +- Tesseract OCR 4.0+ +- Git + +### Local Development + +1. **Clone the repository**: +```bash +git clone https://github.com/perfectra1n/readur.git +cd readur +``` + +2. **Set up the database**: +```bash +# Create development database +sudo -u postgres psql +CREATE DATABASE readur_dev; +CREATE USER readur_dev WITH ENCRYPTED PASSWORD 'dev_password'; +GRANT ALL PRIVILEGES ON DATABASE readur_dev TO readur_dev; +\q +``` + +3. **Configure environment**: +```bash +# Copy example environment +cp .env.example .env.development + +# Edit with your settings +DATABASE_URL=postgresql://readur_dev:dev_password@localhost/readur_dev +JWT_SECRET=dev-secret-key +``` + +4. **Run database migrations**: +```bash +# Install sqlx-cli if needed +cargo install sqlx-cli + +# Run migrations +sqlx migrate run +``` + +5. **Start the backend**: +```bash +# Development mode with auto-reload +cargo watch -x run + +# Or without auto-reload +cargo run +``` + +6. 
**Start the frontend**: +```bash +cd frontend +npm install +npm run dev +``` + +### Development with Docker + +For a consistent development environment: + +```bash +# Start all services +docker compose -f docker-compose.yml -f docker-compose.dev.yml up + +# Backend available at: http://localhost:8000 +# Frontend dev server at: http://localhost:5173 +# PostgreSQL at: localhost:5433 +``` + +The development compose file includes: +- Volume mounts for hot reloading +- Exposed database port +- Debug logging enabled + +## Project Structure + +``` +readur/ +├── src/ # Rust backend source +│ ├── main.rs # Application entry point +│ ├── config.rs # Configuration management +│ ├── models.rs # Database models +│ ├── routes/ # API route handlers +│ ├── db/ # Database operations +│ ├── ocr.rs # OCR processing +│ └── tests/ # Integration tests +├── frontend/ # React frontend +│ ├── src/ +│ │ ├── components/ # React components +│ │ ├── pages/ # Page components +│ │ ├── services/ # API services +│ │ └── App.tsx # Main app component +│ └── tests/ # Frontend tests +├── migrations/ # Database migrations +├── docs/ # Documentation +└── tests/ # E2E and integration tests +``` + +## Testing + +Readur has comprehensive test coverage across unit, integration, and end-to-end tests. 
+ +### Backend Tests + +```bash +# Run all tests +cargo test + +# Run with output +cargo test -- --nocapture + +# Run specific test +cargo test test_document_upload + +# Run tests with coverage +cargo install cargo-tarpaulin +cargo tarpaulin --out Html +``` + +Test categories: +- **Unit tests**: In `src/tests/` +- **Integration tests**: In `tests/` +- **Database tests**: Require `TEST_DATABASE_URL` + +Example test: +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[tokio::test] + async fn test_document_creation() { + let doc = Document::new("test.pdf", "application/pdf"); + assert_eq!(doc.filename, "test.pdf"); + } +} +``` + +### Frontend Tests + +```bash +cd frontend + +# Run unit tests +npm test + +# Run with coverage +npm run test:coverage + +# Run in watch mode +npm run test:watch +``` + +Example test: +```typescript +import { render, screen } from '@testing-library/react'; +import DocumentList from './DocumentList'; + +test('renders document list', () => { + render(<DocumentList />); + expect(screen.getByText(/No documents/i)).toBeInTheDocument(); +}); +``` + +### Integration Tests + +```bash +# Run integration tests +docker compose -f docker-compose.test.yml up --abort-on-container-exit + +# Or manually +cargo test --test '*' -- --test-threads=1 +``` + +### E2E Tests + +Using Playwright for end-to-end testing: + +```bash +cd frontend + +# Install Playwright +npm run e2e:install + +# Run E2E tests +npm run e2e + +# Run in UI mode +npm run e2e:ui +``` + +## Code Style + +### Rust Guidelines + +We follow the official Rust style guide with some additions: + +```bash +# Format code +cargo fmt + +# Check linting +cargo clippy -- -D warnings + +# Check before committing +cargo fmt --check && cargo clippy +``` + +Style preferences: +- Use descriptive variable names +- Add documentation comments for public APIs +- Keep functions small and focused +- Use `Result` for error handling +- Prefer `&str` over `String` for function parameters + +### Frontend Guidelines + +```bash +# 
Format code +npm run format + +# Lint check +npm run lint + +# Type check +npm run type-check +``` + +Style preferences: +- Use functional components with hooks +- TypeScript for all new code +- Descriptive component and variable names +- Extract reusable logic into custom hooks +- Keep components focused and small + +## Contributing + +We welcome contributions! Please see our [Contributing Guide](../CONTRIBUTING.md) for details. + +### Getting Started + +1. **Fork the repository** +2. **Create a feature branch**: +```bash +git checkout -b feature/amazing-feature +``` + +3. **Make your changes** +4. **Add tests** for new functionality +5. **Ensure all tests pass**: +```bash +cargo test +cd frontend && npm test +``` + +6. **Commit your changes** (see commit guidelines below) +7. **Push to your fork**: +```bash +git push origin feature/amazing-feature +``` + +8. **Open a Pull Request** + +### Pull Request Process + +1. **Update documentation** for any changed functionality +2. **Add tests** covering new code +3. **Ensure CI passes** (automated checks) +4. **Request review** from maintainers +5. **Address feedback** promptly +6. **Squash commits** if requested + +### Commit Guidelines + +We use conventional commits for clear history: + +``` +feat: add bulk document export +fix: resolve OCR timeout on large files +docs: update API authentication section +test: add coverage for search filters +refactor: simplify document processing pipeline +perf: optimize database queries for search +chore: update dependencies +``` + +Format: +``` +<type>(<scope>): <subject> + +<body> + +<footer>