feat(docs): add more user facing docs, update README, and move dev docs to correct folder

perf3ct 2025-06-25 18:35:06 +00:00
parent 102e7d8b3f
commit a7883c1b63
15 changed files with 2659 additions and 779 deletions

787
README.md

@ -16,10 +16,6 @@ A powerful, modern document management system built with Rust and React. Readur
## 🚀 Quick Start
### Using Docker Compose (Recommended)
The fastest way to get Readur running:
```bash
# Clone the repository
git clone https://github.com/perfectra1n/readur
@ -38,278 +34,26 @@ open http://localhost:8000
> ⚠️ **Important**: Change the default admin password immediately after first login!
### What You Get
## 📚 Documentation
After deployment, you'll have:
- **Web Interface**: Modern document management UI at `http://localhost:8000`
- **PostgreSQL Database**: Document metadata and full-text search indexes
- **File Storage**: Persistent document storage with OCR processing
- **Watch Folder**: Automatic file ingestion from mounted directories
- **REST API**: Full API access for integrations
### Getting Started
- [📦 Installation Guide](docs/installation.md) - Docker & manual installation instructions
- [🔧 Configuration](docs/configuration.md) - Environment variables and settings
- [📖 User Guide](docs/user-guide.md) - How to use Readur effectively
## 🐳 Docker Deployment Guide
### Deployment & Operations
- [🚀 Deployment Guide](docs/deployment.md) - Production deployment, SSL, monitoring
- [🔄 Reverse Proxy Setup](docs/REVERSE_PROXY.md) - Nginx, Traefik, and more
- [📁 Watch Folder Guide](docs/WATCH_FOLDER.md) - Automatic document ingestion
### Production Docker Compose
### Development
- [🏗️ Developer Documentation](docs/dev/) - Architecture, development setup, testing
- [🔌 API Reference](docs/api-reference.md) - REST API documentation
For production deployments, create a custom `docker-compose.prod.yml`:
```yaml
services:
  readur:
    image: readur:latest
    ports:
      - "8000:8000"
    environment:
      # Core Configuration
      - DATABASE_URL=postgresql://readur:${DB_PASSWORD}@postgres:5432/readur
      - JWT_SECRET=${JWT_SECRET}
      - SERVER_ADDRESS=0.0.0.0:8000
      # File Storage
      - UPLOAD_PATH=/app/uploads
      - WATCH_FOLDER=/app/watch
      - ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,doc,docx
      # Watch Folder Settings
      - WATCH_INTERVAL_SECONDS=30
      - FILE_STABILITY_CHECK_MS=500
      - MAX_FILE_AGE_HOURS=168
      # OCR Configuration
      - OCR_LANGUAGE=eng
      - CONCURRENT_OCR_JOBS=4
      - OCR_TIMEOUT_SECONDS=300
      - MAX_FILE_SIZE_MB=100
      # Performance Tuning
      - MEMORY_LIMIT_MB=1024
      - CPU_PRIORITY=normal
      - ENABLE_COMPRESSION=true
    volumes:
      # Document storage
      - ./data/uploads:/app/uploads
      # Watch folder - mount your network drives here
      - /mnt/nfs/documents:/app/watch
      # or SMB: - /mnt/smb/shared:/app/watch
      # or S3: - /mnt/s3/bucket:/app/watch
    depends_on:
      - postgres
    restart: unless-stopped
    # Resource limits for production
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
        reservations:
          memory: 512M
          cpus: '0.5'
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=readur
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_DB=readur
      - POSTGRES_INITDB_ARGS=--encoding=UTF-8 --lc-collate=en_US.UTF-8 --lc-ctype=en_US.UTF-8
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres-config:/etc/postgresql/conf.d:ro
    # PostgreSQL optimization for document search
    command: >
      postgres
      -c shared_buffers=256MB
      -c effective_cache_size=1GB
      -c max_connections=100
      -c default_text_search_config=pg_catalog.english
    restart: unless-stopped
    # Don't expose port in production
    # ports:
    #   - "5433:5432"
volumes:
  postgres_data:
    driver: local
```
### Environment Variables
#### Port Configuration
Readur supports flexible port configuration:
```bash
# Method 1: Specify full server address
SERVER_ADDRESS=0.0.0.0:8000
# Method 2: Use separate host and port (recommended)
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
# For development: Configure frontend port
CLIENT_PORT=5173
BACKEND_PORT=8000
```
#### Security Configuration
Create a `.env` file for your secrets:
```bash
# Generate secure secrets (strip the line break openssl inserts into long base64 output)
JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
DB_PASSWORD=$(openssl rand -base64 32)
# Save to .env file
cat > .env << EOF
JWT_SECRET=${JWT_SECRET}
DB_PASSWORD=${DB_PASSWORD}
EOF
```
Deploy with:
```bash
docker compose -f docker-compose.prod.yml --env-file .env up -d
```
### Network Filesystem Mounts
#### NFS Mounts
```bash
# Mount NFS share
sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs/documents
# Add to docker-compose.yml
volumes:
  - /mnt/nfs/documents:/app/watch
environment:
  - WATCH_INTERVAL_SECONDS=60
  - FILE_STABILITY_CHECK_MS=1000
  - FORCE_POLLING_WATCH=1
```
#### SMB/CIFS Mounts
```bash
# Mount SMB share
sudo mount -t cifs //server/share /mnt/smb/shared -o username=user,password=pass
# Docker volume configuration
volumes:
  - /mnt/smb/shared:/app/watch
environment:
  - WATCH_INTERVAL_SECONDS=30
  - FILE_STABILITY_CHECK_MS=2000
```
#### S3 Mounts (using s3fs)
```bash
# Mount S3 bucket
s3fs mybucket /mnt/s3/bucket -o passwd_file=~/.passwd-s3fs
# Docker configuration for S3
volumes:
  - /mnt/s3/bucket:/app/watch
environment:
  - WATCH_INTERVAL_SECONDS=120
  - FILE_STABILITY_CHECK_MS=5000
  - FORCE_POLLING_WATCH=1
```
### SSL/HTTPS Setup
Use a reverse proxy like Nginx or Traefik:
#### Nginx Configuration
```nginx
server {
    listen 443 ssl http2;
    server_name readur.yourdomain.com;
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;
    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # For file uploads
        client_max_body_size 100M;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}
```
#### Traefik Configuration
```yaml
services:
  readur:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.readur.rule=Host(`readur.yourdomain.com`)"
      - "traefik.http.routers.readur.tls=true"
      - "traefik.http.routers.readur.tls.certresolver=letsencrypt"
```
> 📘 **For detailed reverse proxy configurations** including Apache, Caddy, custom ports, load balancing, and advanced scenarios, see [REVERSE_PROXY.md](./REVERSE_PROXY.md).
### Health Checks
Add health checks to your Docker configuration:
```yaml
services:
  readur:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
### Backup Strategy
```bash
#!/bin/bash
# backup.sh - Automated backup script
# Backup database
docker exec readur-postgres-1 pg_dump -U readur readur | gzip > backup_$(date +%Y%m%d_%H%M%S).sql.gz
# Backup uploaded files
tar -czf uploads_backup_$(date +%Y%m%d_%H%M%S).tar.gz -C ./data uploads/
# Clean old backups (keep 30 days)
find . -name "backup_*.sql.gz" -mtime +30 -delete
find . -name "uploads_backup_*.tar.gz" -mtime +30 -delete
```
### Monitoring
Monitor your deployment with Docker stats:
```bash
# Real-time resource usage
docker stats
# Container logs
docker compose logs -f readur
# Watch folder activity
docker compose logs -f readur | grep watcher
```
### Advanced Topics
- [🔍 OCR Optimization](docs/dev/OCR_OPTIMIZATION_GUIDE.md) - Improve OCR performance
- [🗄️ Database Best Practices](docs/dev/DATABASE_GUARDRAILS.md) - Concurrency and safety
- [📊 Queue Architecture](docs/dev/QUEUE_IMPROVEMENTS.md) - Background job processing
## 🏗️ Architecture
@ -327,495 +71,24 @@ docker compose logs -f readur | grep watcher
## 📋 System Requirements
### Minimum Requirements
- **CPU**: 2 cores
- **RAM**: 2GB
- **Storage**: 10GB free space
- **OS**: Linux, macOS, or Windows with Docker
### Minimum
- 2 CPU cores, 2GB RAM, 10GB storage
- Docker or manual installation prerequisites
### Recommended for Production
- **CPU**: 4+ cores
- **RAM**: 4GB+
- **Storage**: 50GB+ SSD
- **Network**: Stable internet connection for OCR processing
## 🛠️ Manual Installation
For development or custom deployments without Docker:
### Prerequisites
Install these dependencies on your system:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr tesseract-ocr-eng \
libtesseract-dev libleptonica-dev \
postgresql postgresql-contrib \
pkg-config libclang-dev
# macOS (requires Homebrew)
brew install tesseract leptonica postgresql rust nodejs npm
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Backend Setup
1. **Configure Database**:
```bash
# Create database and user
sudo -u postgres psql
CREATE DATABASE readur;
CREATE USER readur_user WITH ENCRYPTED PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE readur TO readur_user;
\q
```
2. **Environment Configuration**:
```bash
# Copy environment template
cp .env.example .env
# Edit configuration
nano .env
```
Required environment variables:
```env
DATABASE_URL=postgresql://readur_user:your_password@localhost/readur
JWT_SECRET=your-super-secret-jwt-key-change-this
SERVER_ADDRESS=0.0.0.0:8000
UPLOAD_PATH=./uploads
WATCH_FOLDER=./watch
ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,gif,bmp,tiff,txt,rtf,doc,docx
```
3. **Build and Run Backend**:
```bash
# Install dependencies and run
cargo build --release
cargo run
```
### Frontend Setup
1. **Install Dependencies**:
```bash
cd frontend
npm install
```
2. **Development Mode**:
```bash
npm run dev
# Frontend available at http://localhost:5173
```
3. **Production Build**:
```bash
npm run build
# Built files in frontend/dist/
```
## 📖 User Guide
### Getting Started
1. **First Login**: Use the default admin credentials to access the system
2. **Upload Documents**: Drag and drop files or use the upload button
3. **Wait for Processing**: OCR processing happens automatically in the background
4. **Search and Organize**: Use the powerful search features to find your documents
### Supported File Types
| Type | Extensions | OCR Support | Notes |
|------|-----------|-------------|-------|
| **PDF** | `.pdf` | ✅ | Text extraction + OCR for scanned pages |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.gif` | ✅ | Full OCR text extraction |
| **Text** | `.txt`, `.rtf` | ❌ | Direct text indexing |
| **Office** | `.doc`, `.docx` | ⚠️ | Limited support |
### Using the Interface
#### Dashboard
- **Document Statistics**: Total documents, storage usage, OCR status
- **Recent Activity**: Latest uploads and processing status
- **Quick Actions**: Fast access to upload and search
#### Document Management
- **List/Grid View**: Toggle between different viewing modes
- **Sorting**: Sort by date, name, size, or file type
- **Filtering**: Filter by tags, file types, and OCR status
- **Bulk Actions**: Select multiple documents for batch operations
#### Advanced Search
- **Full-text Search**: Search within document content
- **Metadata Filters**: Filter by upload date, file size, type
- **Tag System**: Organize documents with custom tags
- **OCR Status**: Find processed vs. pending documents
#### Folder Watching
- **Non-destructive**: Unlike paperless-ngx, source files remain untouched
- **Automatic Processing**: New files are detected and processed automatically
- **Configurable**: Set custom watch directories
### Tips for Best Results
1. **OCR Quality**: Higher resolution images (300+ DPI) produce better OCR results
2. **File Organization**: Use consistent naming conventions for easier searching
3. **Regular Backups**: Backup both database and file storage regularly
4. **Performance**: For large document collections, consider increasing server resources
## 🔧 Configuration
### Environment Variables
All application settings can be configured via environment variables:
#### Core Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | `postgresql://readur:readur@localhost/readur` | PostgreSQL connection string |
| `JWT_SECRET` | `your-secret-key` | Secret key for JWT tokens ⚠️ **Change in production!** |
| `SERVER_ADDRESS` | `0.0.0.0:8000` | Server bind address and port |
#### File Storage & Upload
| Variable | Default | Description |
|----------|---------|-------------|
| `UPLOAD_PATH` | `./uploads` | Document storage directory |
| `ALLOWED_FILE_TYPES` | `pdf,txt,doc,docx,png,jpg,jpeg` | Comma-separated allowed file extensions |
#### Watch Folder Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `WATCH_FOLDER` | `./watch` | Directory to monitor for new files |
| `WATCH_INTERVAL_SECONDS` | `30` | Polling interval for network filesystems (seconds) |
| `FILE_STABILITY_CHECK_MS` | `500` | Time to wait for file write completion (milliseconds) |
| `MAX_FILE_AGE_HOURS` | _(none)_ | Skip files older than this many hours |
| `FORCE_POLLING_WATCH` | _(none)_ | Force polling mode even for local filesystems |
#### OCR & Processing Settings
*Note: These settings can also be configured per-user via the web interface*
| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_LANGUAGE` | `eng` | OCR language code (eng, fra, deu, spa, etc.) |
| `CONCURRENT_OCR_JOBS` | `4` | Maximum parallel OCR processes |
| `OCR_TIMEOUT_SECONDS` | `300` | OCR processing timeout per file |
| `MAX_FILE_SIZE_MB` | `50` | Maximum file size for processing |
| `AUTO_ROTATE_IMAGES` | `true` | Automatically rotate images for better OCR |
| `ENABLE_IMAGE_PREPROCESSING` | `true` | Apply image enhancement before OCR |
#### Search & Performance
| Variable | Default | Description |
|----------|---------|-------------|
| `SEARCH_RESULTS_PER_PAGE` | `25` | Default number of search results per page |
| `SEARCH_SNIPPET_LENGTH` | `200` | Length of text snippets in search results |
| `FUZZY_SEARCH_THRESHOLD` | `0.8` | Similarity threshold for fuzzy search (0.0-1.0) |
| `MEMORY_LIMIT_MB` | `512` | Memory limit for OCR processes |
| `CPU_PRIORITY` | `normal` | CPU priority: `low`, `normal`, `high` |
#### Data Management
| Variable | Default | Description |
|----------|---------|-------------|
| `RETENTION_DAYS` | _(none)_ | Auto-delete documents after N days |
| `ENABLE_AUTO_CLEANUP` | `false` | Enable automatic cleanup of old documents |
| `ENABLE_COMPRESSION` | `false` | Compress stored documents to save space |
| `ENABLE_BACKGROUND_OCR` | `true` | Process OCR in background queue |
### Example Production Configuration
```env
# Core settings
DATABASE_URL=postgresql://readur:secure_password@postgres:5432/readur
JWT_SECRET=your-very-long-random-secret-key-generated-with-openssl
SERVER_ADDRESS=0.0.0.0:8000
# File handling
UPLOAD_PATH=/app/uploads
ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,rtf,doc,docx
# Watch folder for NFS mount
WATCH_FOLDER=/mnt/nfs/documents
WATCH_INTERVAL_SECONDS=60
FILE_STABILITY_CHECK_MS=1000
MAX_FILE_AGE_HOURS=168
FORCE_POLLING_WATCH=1
# OCR optimization
OCR_LANGUAGE=eng
CONCURRENT_OCR_JOBS=8
OCR_TIMEOUT_SECONDS=600
MAX_FILE_SIZE_MB=200
AUTO_ROTATE_IMAGES=true
ENABLE_IMAGE_PREPROCESSING=true
# Performance tuning
MEMORY_LIMIT_MB=2048
CPU_PRIORITY=high
ENABLE_COMPRESSION=true
ENABLE_BACKGROUND_OCR=true
# Search optimization
SEARCH_RESULTS_PER_PAGE=50
SEARCH_SNIPPET_LENGTH=300
FUZZY_SEARCH_THRESHOLD=0.7
# Data management
RETENTION_DAYS=2555 # 7 years
ENABLE_AUTO_CLEANUP=true
```
### Runtime Settings vs Environment Variables
Some settings can be configured in two ways:
1. **Environment Variables**: Set at container startup, affects the entire application
2. **User Settings**: Configured per-user via the web interface, stored in database
**Environment variables provide system-wide defaults** that override the application's built-in values. User settings override these defaults for individual users where applicable.
Settings configurable via web interface:
- OCR language preferences
- Search result limits
- File type restrictions
- OCR processing options
- Data retention policies
### Configuration Priority
Settings are applied in this order (later values override earlier ones):
1. **Application defaults** (built into the code)
2. **Environment variables** (system-wide configuration)
3. **User settings** (per-user database settings via web interface)
This allows for flexible deployment where system administrators can set defaults while users can customize their experience.
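The three-layer priority above can be sketched in a few lines of Python. This is an illustration of the ordering only, not Readur's actual implementation: the `APP_DEFAULTS` table, the `resolve_setting` helper, and the idea that user settings arrive as a plain mapping are all assumptions made for the example.

```python
from typing import Mapping, Optional

# Hypothetical built-in defaults; the real ones live in Readur's code.
APP_DEFAULTS = {"OCR_LANGUAGE": "eng", "SEARCH_RESULTS_PER_PAGE": "25"}


def resolve_setting(
    name: str,
    env: Mapping[str, str],
    user_settings: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the effective value: defaults < environment < user settings."""
    value = APP_DEFAULTS.get(name)          # 1. application default
    if name in env:                         # 2. environment variable
        value = env[name]
    if user_settings and name in user_settings:  # 3. per-user setting
        value = user_settings[name]
    return value
```

Passing `os.environ` as `env` would resolve against the real process environment; a plain dict keeps the sketch easy to test.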
### Quick Reference - Essential Variables
For a minimal production deployment, configure these essential variables:
```bash
# Security (REQUIRED)
JWT_SECRET=your-secure-random-key-here
DATABASE_URL=postgresql://user:password@host:port/database
# File Storage
UPLOAD_PATH=/app/uploads
WATCH_FOLDER=/path/to/mounted/folder
# Watch Folder (for network mounts)
WATCH_INTERVAL_SECONDS=60
FORCE_POLLING_WATCH=1
# Performance
CONCURRENT_OCR_JOBS=4
MAX_FILE_SIZE_MB=100
```
### Database Tuning
For better search performance with large document collections:
```sql
-- Increase shared_buffers for better caching
ALTER SYSTEM SET shared_buffers = '256MB';
-- Optimize for full-text search
ALTER SYSTEM SET default_text_search_config = 'pg_catalog.english';
-- Restart PostgreSQL after changes
```
## 🔌 API Reference
### Authentication Endpoints
```bash
# Register new user
POST /api/auth/register
Content-Type: application/json
{
"username": "john_doe",
"email": "john@example.com",
"password": "secure_password"
}
# Login
POST /api/auth/login
Content-Type: application/json
{
"username": "john_doe",
"password": "secure_password"
}
# Get current user
GET /api/auth/me
Authorization: Bearer <jwt_token>
```
### Document Management
```bash
# Upload document
POST /api/documents
Authorization: Bearer <jwt_token>
Content-Type: multipart/form-data
file: <binary_file_data>
# List documents
GET /api/documents?limit=50&offset=0
Authorization: Bearer <jwt_token>
# Download document
GET /api/documents/{id}/download
Authorization: Bearer <jwt_token>
```
### Search
```bash
# Search documents
GET /api/search?query=contract&limit=20
Authorization: Bearer <jwt_token>
# Advanced search with filters
GET /api/search?query=invoice&mime_types=application/pdf&tags=important
Authorization: Bearer <jwt_token>
```
## 🧪 Testing
### Run All Tests
```bash
# Backend tests
cargo test
# Frontend tests
cd frontend && npm test
# Integration tests with Docker
docker compose -f docker-compose.test.yml up --build
```
### Test Coverage
```bash
# Install cargo-tarpaulin for coverage
cargo install cargo-tarpaulin
# Generate coverage report
cargo tarpaulin --out Html
```
## 🔒 Security Considerations
### Production Deployment
1. **Change Default Credentials**: Update admin password immediately
2. **Use Strong JWT Secret**: Generate a secure random key
3. **Enable HTTPS**: Use a reverse proxy with SSL/TLS
4. **Database Security**: Use strong passwords and restrict network access
5. **File Permissions**: Ensure proper file system permissions
6. **Regular Updates**: Keep dependencies and base images updated
### Recommended Production Setup
```bash
# Use environment-specific secrets
JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
# Restrict database access
# Only allow connections from application container
# Use read-only file system where possible
# Mount uploads and watch folders as separate volumes
```
## 🚀 Deployment Options
### Docker Swarm
```yaml
version: '3.8'
services:
  readur:
    image: readur:latest
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
    networks:
      - readur-network
    secrets:
      - jwt_secret
      - db_password
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: readur
spec:
  replicas: 3
  selector:
    matchLabels:
      app: readur
  template:
    metadata:
      # Pod labels must match spec.selector.matchLabels
      labels:
        app: readur
    spec:
      containers:
        - name: readur
          image: readur:latest
          env:
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: readur-secrets
                  key: jwt-secret
```
### Cloud Platforms
- **AWS**: Use ECS with RDS PostgreSQL
- **Google Cloud**: Deploy to Cloud Run with Cloud SQL
- **Azure**: Use Container Instances with Azure Database
- **DigitalOcean**: App Platform with Managed Database
- 4+ CPU cores, 4GB+ RAM, 50GB+ SSD
- See [deployment guide](docs/deployment.md) for details
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) and [Development Setup](docs/dev/development.md) for details.
### Development Setup
## 🔒 Security
```bash
# Fork and clone the repository
git clone https://github.com/yourusername/readur.git
cd readur
# Create a feature branch
git checkout -b feature/amazing-feature
# Make your changes and test
cargo test
cd frontend && npm test
# Submit a pull request
```
### Code Style
- **Rust**: Follow `rustfmt` and `clippy` recommendations
- **Frontend**: Use Prettier and ESLint configurations
- **Commits**: Use conventional commit format
- Change default credentials immediately
- Use HTTPS in production
- Regular security updates
- See [deployment guide](docs/deployment.md#security-considerations) for security best practices
## 📝 License
@ -830,9 +103,9 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
## 📞 Support
- **Documentation**: Check this README and inline code comments
- **Issues**: Report bugs and request features on GitHub Issues
- **Discussions**: Join community discussions on GitHub Discussions
- **Documentation**: Start with the [User Guide](docs/user-guide.md)
- **Issues**: Report bugs on [GitHub Issues](https://github.com/perfectra1n/readur/issues)
- **Discussions**: Join our [GitHub Discussions](https://github.com/perfectra1n/readur/discussions)
---


@ -1,31 +1,68 @@
# Watch Folder Documentation
# Watch Folder Guide
The watch folder feature automatically monitors a directory for new OCR-able files and processes them without deleting the original files. This is perfect for scenarios where files are mounted from various filesystem types including NFS, SMB, S3, and local storage.
The watch folder feature automatically monitors a directory for new files and processes them with OCR, making them searchable in Readur. Your original files are never modified or deleted - Readur simply copies and processes them while leaving the originals untouched.
## Features
## What is Watch Folder?
### 🔄 Cross-Filesystem Compatibility
- **Automatic Detection**: Detects filesystem type and chooses optimal watching strategy
- **Local Filesystems**: Uses efficient inotify-based watching for ext4, NTFS, APFS, etc.
- **Network Filesystems**: Uses polling-based watching for NFS, SMB/CIFS, S3 mounts
- **Hybrid Fallback**: Gracefully falls back to polling if inotify fails
Watch folder allows you to:
- **Drop files anywhere** - Point Readur to any folder (local, network drive, cloud mount)
- **Automatic processing** - New files are automatically detected and processed
- **Non-destructive** - Original files remain exactly where you put them
- **Background operation** - Processing happens in the background while you continue working
### 📁 Smart File Processing
- **OCR-able File Detection**: Only processes supported file types (PDF, images, text, Word docs)
- **Duplicate Prevention**: Checks for existing files with same name and size
- **File Stability**: Waits for files to finish being written before processing
- **System File Exclusion**: Skips hidden files, temporary files, and system directories
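The file stability and duplicate checks listed above can be approximated as follows. This is a minimal sketch, assuming stability means "the file size did not change across one check interval" and that duplicates are tracked as `(name, size)` pairs; `wait_until_stable` and `is_duplicate` are hypothetical helpers, not functions taken from Readur.

```python
import os
import time


def wait_until_stable(path: str, check_ms: int = 500, max_checks: int = 20) -> bool:
    """Return True once the file size is unchanged across one check interval."""
    last_size = -1
    for _ in range(max_checks):
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file vanished or is not readable
        if size == last_size:
            return True   # no growth since the previous check
        last_size = size
        time.sleep(check_ms / 1000)
    return False          # still changing after max_checks attempts


def is_duplicate(name: str, size: int, known: set) -> bool:
    """Treat a file as already seen if its name and size match a known entry."""
    return (name, size) in known
```

The `check_ms` default mirrors the documented `FILE_STABILITY_CHECK_MS` default of 500; slower network mounts would likely want a larger value.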
Perfect for scenarios where you want to automatically process files from:
- Network drives (NFS, SMB shares)
- Cloud storage mounts (Google Drive, Dropbox, OneDrive)
- Local folders where you save scanned documents
- Shared team folders
### ⚙️ Configuration Options
## How It Works
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `WATCH_FOLDER` | `./watch` | Path to the folder to monitor |
| `WATCH_INTERVAL_SECONDS` | `30` | Polling interval for network filesystems |
| `FILE_STABILITY_CHECK_MS` | `500` | Time to wait for file stability |
| `MAX_FILE_AGE_HOURS` | `none` | Skip files older than specified hours |
| `ALLOWED_FILE_TYPES` | `pdf,png,jpg,jpeg,tiff,bmp,txt,doc,docx` | Allowed file extensions |
| `FORCE_POLLING_WATCH` | `unset` | Force polling mode even for local filesystems |
1. **Point Readur to your folder** - Set the `WATCH_FOLDER` path to any directory you want monitored
2. **Drop files** - Add documents to that folder (PDFs, images, text files, Word docs)
3. **Automatic detection** - Readur notices new files within seconds (local) or minutes (network)
4. **OCR processing** - Files are automatically processed to extract searchable text
5. **Search and find** - Your documents become searchable in the Readur web interface
## Key Features
**Works with any storage type** - Local drives, network shares, cloud mounts
**Smart processing** - Only processes supported file types
**Duplicate prevention** - Won't process the same file twice
**Safe operation** - Never modifies or deletes your original files
**Background processing** - Doesn't interrupt your workflow
## Quick Setup
### Basic Setup (Docker Compose)
1. **Edit your docker-compose.yml**:
```yaml
services:
  readur:
    image: readur:latest
    volumes:
      # Mount your folder to the watch directory
      - /path/to/your/documents:/app/watch
    environment:
      - WATCH_FOLDER=/app/watch
```
2. **Start Readur**:
```bash
docker compose up -d
```
3. **Start dropping files** into `/path/to/your/documents` - they'll be automatically processed!
### Configuration Options
| Setting | Default | What it does |
|---------|---------|-------------|
| `WATCH_FOLDER` | `./watch` | Which folder to monitor |
| `WATCH_INTERVAL_SECONDS` | `30` | How often to check for new files (network drives) |
| `MAX_FILE_AGE_HOURS` | _(none)_ | Ignore files older than this |
| `ALLOWED_FILE_TYPES` | `pdf,png,jpg,jpeg,tiff,bmp,txt,doc,docx` | Which file types to process |
## Usage

618
docs/api-reference.md Normal file

@ -0,0 +1,618 @@
# API Reference
Readur provides a comprehensive REST API for integrating with external systems and building custom workflows.
## Table of Contents
- [Base URL](#base-url)
- [Authentication](#authentication)
- [Error Handling](#error-handling)
- [Rate Limiting](#rate-limiting)
- [Endpoints](#endpoints)
- [Authentication](#authentication-endpoints)
- [Documents](#document-endpoints)
- [Search](#search-endpoints)
- [OCR Queue](#ocr-queue-endpoints)
- [Settings](#settings-endpoints)
- [Sources](#sources-endpoints)
- [Labels](#labels-endpoints)
- [Users](#user-endpoints)
- [WebSocket API](#websocket-api)
- [Examples](#examples)
## Base URL
```
http://localhost:8000/api
```
For production deployments, replace with your configured domain and ensure HTTPS is used.
## Authentication
Readur uses JWT (JSON Web Token) authentication. Include the token in the Authorization header:
```
Authorization: Bearer <jwt_token>
```
### Obtaining a Token
```bash
POST /api/auth/login
Content-Type: application/json
{
"username": "admin",
"password": "your_password"
}
```
Response:
```json
{
"token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
"user": {
"id": 1,
"username": "admin",
"email": "admin@example.com",
"role": "admin"
}
}
```
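A JWT consists of three base64url-encoded segments separated by dots. As a small illustration, the visible first segment of the sample token above can be decoded with the standard library; `decode_jwt_segment` is a helper name chosen for this sketch. Decoding does not verify the signature, so it is only useful for inspection.

```python
import base64
import json


def decode_jwt_segment(segment: str) -> dict:
    """Decode one base64url JWT segment (header or payload) into a dict."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


# The header segment shown in the sample response above.
header = decode_jwt_segment("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9")
print(header)  # {'alg': 'HS256', 'typ': 'JWT'}
```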
## Error Handling
All API errors follow a consistent format:
```json
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid request parameters",
"details": {
"field": "email",
"reason": "Invalid email format"
}
}
}
```
Common HTTP status codes:
- `200` - Success
- `201` - Created
- `400` - Bad Request
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `422` - Validation Error
- `500` - Internal Server Error
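Client code can turn the error format above into an exception. The sketch below assumes the body always nests its fields under an `error` key as in the sample; `ApiError` and `error_from_response` are hypothetical names, and the fallback values cover responses that do not follow the format.

```python
class ApiError(Exception):
    """Exception carrying the fields of an API error response."""

    def __init__(self, status: int, code: str, message: str, details=None):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.details = details or {}


def error_from_response(status: int, body: dict) -> ApiError:
    """Build an ApiError from an HTTP status and a decoded JSON body."""
    err = body.get("error") or {}
    return ApiError(
        status,
        err.get("code", "UNKNOWN"),      # fallback when the envelope is missing
        err.get("message", ""),
        err.get("details"),
    )
```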
## Rate Limiting
API requests are rate-limited to prevent abuse:
- Authenticated users: 1000 requests per hour
- Unauthenticated users: 100 requests per hour
Rate limit headers:
```
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1640995200
```
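A client can use these headers to decide how long to pause before retrying. The sketch assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value suggests but the text does not state; `seconds_until_retry` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def seconds_until_retry(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return 0 if requests remain, else the seconds left until the window resets."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # Assumption: the reset header is a Unix epoch timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

The optional `now` argument exists only to make the function deterministic in tests.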
## Endpoints
### Authentication Endpoints
#### Register New User
```bash
POST /api/auth/register
Content-Type: application/json
{
"username": "john_doe",
"email": "john@example.com",
"password": "secure_password"
}
```
#### Login
```bash
POST /api/auth/login
Content-Type: application/json
{
"username": "john_doe",
"password": "secure_password"
}
```
#### Get Current User
```bash
GET /api/auth/me
Authorization: Bearer <jwt_token>
```
#### Logout
```bash
POST /api/auth/logout
Authorization: Bearer <jwt_token>
```
### Document Endpoints
#### Upload Document
```bash
POST /api/documents
Authorization: Bearer <jwt_token>
Content-Type: multipart/form-data
file: <binary_file_data>
tags: ["invoice", "2024"] # Optional
```
Response:
```json
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "invoice_2024.pdf",
"mime_type": "application/pdf",
"size": 1048576,
"uploaded_at": "2024-01-01T00:00:00Z",
"ocr_status": "pending"
}
```
#### List Documents
```bash
GET /api/documents?limit=50&offset=0&sort=-uploaded_at
Authorization: Bearer <jwt_token>
```
Query parameters:
- `limit` - Number of results (default: 50, max: 100)
- `offset` - Pagination offset
- `sort` - Sort field (prefix with `-` for descending)
- `mime_type` - Filter by MIME type
- `ocr_status` - Filter by OCR status
- `tag` - Filter by tag
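The query parameters above can be assembled with the standard library rather than by string concatenation. `documents_url` is a hypothetical helper; clamping `limit` to 100 on the client side is a choice made for this sketch based on the documented maximum.

```python
from urllib.parse import urlencode


def documents_url(base_url: str, limit: int = 50, offset: int = 0, sort=None, **filters) -> str:
    """Build a list-documents URL from the documented query parameters."""
    params = {"limit": min(limit, 100), "offset": offset}  # documented max is 100
    if sort:
        params["sort"] = sort  # prefix with "-" for descending order
    # Optional filters such as mime_type, ocr_status, tag; skip unset ones.
    params.update({k: v for k, v in filters.items() if v is not None})
    return f"{base_url}/documents?{urlencode(params)}"


print(documents_url("http://localhost:8000/api", sort="-uploaded_at"))
```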
#### Get Document Details
```bash
GET /api/documents/{id}
Authorization: Bearer <jwt_token>
```
#### Download Document
```bash
GET /api/documents/{id}/download
Authorization: Bearer <jwt_token>
```
#### Delete Document
```bash
DELETE /api/documents/{id}
Authorization: Bearer <jwt_token>
```
#### Update Document
```bash
PATCH /api/documents/{id}
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"tags": ["invoice", "paid", "2024"]
}
```
### Search Endpoints
#### Search Documents
```bash
GET /api/search?query=invoice&limit=20
Authorization: Bearer <jwt_token>
```
Query parameters:
- `query` - Search query (required)
- `limit` - Number of results
- `offset` - Pagination offset
- `mime_types` - Comma-separated MIME types
- `tags` - Comma-separated tags
- `date_from` - Start date (ISO 8601)
- `date_to` - End date (ISO 8601)
Response:
```json
{
"results": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"filename": "invoice_2024.pdf",
"snippet": "...invoice for services rendered in Q1 2024...",
"score": 0.95,
"highlights": ["invoice", "2024"]
}
],
"total": 42,
"limit": 20,
"offset": 0
}
```
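The `total`, `limit`, and `offset` fields in the response are enough to page through all results. The sketch below keeps the HTTP call out of the picture: `fetch_page` is a caller-supplied function (an assumption of this example) that takes an offset and returns one decoded response.

```python
from typing import Callable, Iterator, Optional


def next_offset(page: dict) -> Optional[int]:
    """Return the offset of the following page, or None when results are exhausted."""
    if page["limit"] <= 0:
        return None  # guard against a loop that never advances
    nxt = page["offset"] + page["limit"]
    return nxt if nxt < page["total"] else None


def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every result by following offsets until the last page."""
    offset: Optional[int] = 0
    while offset is not None:
        page = fetch_page(offset)
        yield from page["results"]
        offset = next_offset(page)
```

With the sample response (`total` 42, `limit` 20), the offsets visited would be 0, 20, and 40.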
#### Advanced Search
```bash
POST /api/search/advanced
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"query": "invoice",
"filters": {
"mime_types": ["application/pdf"],
"tags": ["unpaid"],
"date_range": {
"from": "2024-01-01",
"to": "2024-12-31"
},
"file_size": {
"min": 1024,
"max": 10485760
}
},
"options": {
"fuzzy": true,
"snippet_length": 200
}
}
```
### OCR Queue Endpoints
#### Get Queue Status
```bash
GET /api/queue/status
Authorization: Bearer <jwt_token>
```
Response:
```json
{
"pending": 15,
"processing": 3,
"completed_today": 127,
"failed_today": 2,
"average_processing_time": 4.5
}
```
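The queue status can give a rough estimate of how long the backlog will take to clear. This is back-of-the-envelope arithmetic, not something the API provides: it assumes `average_processing_time` is in seconds per document and that the number of parallel workers equals `CONCURRENT_OCR_JOBS` (default 4). `estimate_drain_seconds` is a hypothetical helper.

```python
def estimate_drain_seconds(status: dict, workers: int = 4) -> float:
    """Estimate the seconds needed to clear pending and in-flight OCR jobs."""
    backlog = status["pending"] + status["processing"]
    # Assumption: average_processing_time is seconds per document.
    return backlog * status["average_processing_time"] / max(workers, 1)


sample = {"pending": 15, "processing": 3, "average_processing_time": 4.5}
print(estimate_drain_seconds(sample))  # (15 + 3) * 4.5 / 4 = 20.25
```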
#### Reprocess Document
```bash
POST /api/documents/{id}/reprocess
Authorization: Bearer <jwt_token>
```
#### Get Failed OCR Jobs
```bash
GET /api/queue/failed
Authorization: Bearer <jwt_token>
```
### Settings Endpoints
#### Get User Settings
```bash
GET /api/settings
Authorization: Bearer <jwt_token>
```
#### Update User Settings
```bash
PUT /api/settings
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"ocr_language": "eng",
"search_results_per_page": 50,
"enable_notifications": true
}
```
### Sources Endpoints
#### List Sources
```bash
GET /api/sources
Authorization: Bearer <jwt_token>
```
#### Create Source
```bash
POST /api/sources
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"name": "Network Drive",
"type": "local_folder",
"config": {
"path": "/mnt/network/documents",
"scan_interval": 3600
},
"enabled": true
}
```
#### Update Source
```bash
PUT /api/sources/{id}
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"enabled": false
}
```
#### Delete Source
```bash
DELETE /api/sources/{id}
Authorization: Bearer <jwt_token>
```
#### Sync Source
```bash
POST /api/sources/{id}/sync
Authorization: Bearer <jwt_token>
```
### Labels Endpoints
#### List Labels
```bash
GET /api/labels
Authorization: Bearer <jwt_token>
```
#### Create Label
```bash
POST /api/labels
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"name": "Important",
"color": "#FF0000"
}
```
#### Update Label
```bash
PUT /api/labels/{id}
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"name": "Very Important",
"color": "#FF00FF"
}
```
#### Delete Label
```bash
DELETE /api/labels/{id}
Authorization: Bearer <jwt_token>
```
### User Endpoints
#### List Users (Admin Only)
```bash
GET /api/users
Authorization: Bearer <jwt_token>
```
#### Get User
```bash
GET /api/users/{id}
Authorization: Bearer <jwt_token>
```
#### Update User
```bash
PUT /api/users/{id}
Authorization: Bearer <jwt_token>
Content-Type: application/json
{
"email": "newemail@example.com",
"role": "user"
}
```
#### Delete User (Admin Only)
```bash
DELETE /api/users/{id}
Authorization: Bearer <jwt_token>
```
## WebSocket API
Connect to receive real-time updates:
```javascript
const ws = new WebSocket('ws://localhost:8000/ws');

// Authenticate once the connection is open; calling send() earlier throws
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'auth',
    token: 'your_jwt_token'
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Event:', data);
};
```
Event types:
- `document.uploaded` - New document uploaded
- `ocr.completed` - OCR processing completed
- `ocr.failed` - OCR processing failed
- `source.sync.completed` - Source sync finished
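One way to route these events on the client side is a small dispatch table. The sketch below uses the event names listed above; the `type` and `document_id` payload fields and the handler bodies are hypothetical, since the message schema is not specified here.

```python
import json

# Event names come from the list above; payload fields are assumed.
HANDLERS = {
    "document.uploaded": lambda e: f"uploaded: {e.get('document_id')}",
    "ocr.completed": lambda e: f"ocr completed: {e.get('document_id')}",
    "ocr.failed": lambda e: f"ocr failed: {e.get('document_id')}",
    "source.sync.completed": lambda e: "source sync finished",
}


def dispatch(raw: str) -> str:
    """Parse one WebSocket message and route it by its (assumed) type field."""
    event = json.loads(raw)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return f"ignored unknown event: {event.get('type')}"
    return handler(event)


print(dispatch('{"type": "ocr.completed", "document_id": "abc"}'))
```

Unknown event types are ignored rather than raised, so a client built this way keeps working if new events are added later.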
## Examples
### Python Example
```python
import requests

# Configuration
BASE_URL = "http://localhost:8000/api"
USERNAME = "admin"
PASSWORD = "your_password"

# Login
response = requests.post(f"{BASE_URL}/auth/login", json={
    "username": USERNAME,
    "password": PASSWORD
})
response.raise_for_status()  # stop early on a failed login
token = response.json()["token"]
headers = {"Authorization": f"Bearer {token}"}

# Upload document
with open("document.pdf", "rb") as f:
    files = {"file": ("document.pdf", f, "application/pdf")}
    response = requests.post(
        f"{BASE_URL}/documents",
        headers=headers,
        files=files
    )
response.raise_for_status()
document_id = response.json()["id"]

# Search documents
response = requests.get(
    f"{BASE_URL}/search",
    headers=headers,
    params={"query": "invoice 2024"}
)
response.raise_for_status()
results = response.json()["results"]
```
### JavaScript Example
```javascript
// Configuration
const BASE_URL = 'http://localhost:8000/api';
// Login
async function login(username, password) {
const response = await fetch(`${BASE_URL}/auth/login`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ username, password })
});
const data = await response.json();
return data.token;
}
// Upload document
async function uploadDocument(token, file) {
const formData = new FormData();
formData.append('file', file);
const response = await fetch(`${BASE_URL}/documents`, {
method: 'POST',
headers: { 'Authorization': `Bearer ${token}` },
body: formData
});
return response.json();
}
// Search documents
async function searchDocuments(token, query) {
const response = await fetch(
`${BASE_URL}/search?query=${encodeURIComponent(query)}`,
{
headers: { 'Authorization': `Bearer ${token}` }
}
);
return response.json();
}
```
### cURL Examples
```bash
# Login
TOKEN=$(curl -s -X POST http://localhost:8000/api/auth/login \
-H "Content-Type: application/json" \
-d '{"username":"admin","password":"your_password"}' \
| jq -r .token)
# Upload document
curl -X POST http://localhost:8000/api/documents \
-H "Authorization: Bearer $TOKEN" \
-F "file=@document.pdf"
# Search documents
curl -X GET "http://localhost:8000/api/search?query=invoice" \
-H "Authorization: Bearer $TOKEN"
# Get document
curl -X GET http://localhost:8000/api/documents/550e8400-e29b-41d4-a716-446655440000 \
-H "Authorization: Bearer $TOKEN"
```
## OpenAPI Specification
The complete OpenAPI specification is available at:
```
GET /api/openapi.json
```
You can use this with tools like Swagger UI or to generate client libraries.
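For example, once the specification has been downloaded and parsed, a few lines of Python can list the operations it declares. The inline `spec` dictionary below is a hand-written stand-in, not Readur's real specification, and `list_operations` is an illustrative helper.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def list_operations(spec: dict) -> list[str]:
    """Return 'METHOD /path' strings for every operation in an OpenAPI document."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for key in sorted(item):
            # Path items may also hold non-method keys such as "parameters".
            if key in HTTP_METHODS:
                operations.append(f"{key.upper()} {path}")
    return operations


# Stand-in document; in practice, parse the JSON served at /api/openapi.json.
spec = {
    "openapi": "3.0.3",
    "paths": {
        "/api/labels": {"get": {}, "post": {}},
        "/api/queue/status": {"get": {}, "parameters": []},
    },
}
for line in list_operations(spec):
    print(line)
```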
## SDK Support
Official SDKs are planned for:
- Python
- JavaScript/TypeScript
- Go
- Ruby
Check the [GitHub repository](https://github.com/perfectra1n/readur) for the latest SDK availability.

261
docs/configuration.md Normal file

@ -0,0 +1,261 @@
# Configuration Guide
This guide covers all configuration options available in Readur through environment variables and runtime settings.
## Table of Contents
- [Environment Variables](#environment-variables)
  - [Core Configuration](#core-configuration)
  - [File Storage & Upload](#file-storage--upload)
  - [Watch Folder Configuration](#watch-folder-configuration)
  - [OCR & Processing Settings](#ocr--processing-settings)
  - [Search & Performance](#search--performance)
  - [Data Management](#data-management)
- [Port Configuration](#port-configuration)
- [Example Configurations](#example-configurations)
- [Configuration Priority](#configuration-priority)
- [Runtime Settings vs Environment Variables](#runtime-settings-vs-environment-variables)
- [Database Tuning](#database-tuning)
## Environment Variables
All application settings can be configured via environment variables:
### Core Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | `postgresql://readur:readur@localhost/readur` | PostgreSQL connection string |
| `JWT_SECRET` | `your-secret-key` | Secret key for JWT tokens ⚠️ **Change in production!** |
| `SERVER_ADDRESS` | `0.0.0.0:8000` | Server bind address and port |
### File Storage & Upload
| Variable | Default | Description |
|----------|---------|-------------|
| `UPLOAD_PATH` | `./uploads` | Document storage directory |
| `ALLOWED_FILE_TYPES` | `pdf,txt,doc,docx,png,jpg,jpeg` | Comma-separated allowed file extensions |
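To make the `ALLOWED_FILE_TYPES` format concrete, here is a minimal sketch of how a comma-separated extension list could be parsed and applied. The function names and the case-insensitive matching are assumptions for illustration; Readur's actual validation may differ.

```python
from pathlib import PurePath


def parse_allowed(raw: str) -> set[str]:
    """Turn 'pdf,txt,...' into a normalized set of extensions."""
    return {ext.strip().lower().lstrip(".") for ext in raw.split(",") if ext.strip()}


def is_allowed(filename: str, allowed: set[str]) -> bool:
    """Check only the final extension of the file name."""
    suffix = PurePath(filename).suffix.lower().lstrip(".")
    return suffix in allowed


allowed = parse_allowed("pdf,txt,doc,docx,png,jpg,jpeg")
print(is_allowed("Scan.PDF", allowed))        # matching extension, different case
print(is_allowed("archive.tar.gz", allowed))  # "gz" is not in the list
```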
### Watch Folder Configuration
| Variable | Default | Description |
|----------|---------|-------------|
| `WATCH_FOLDER` | `./watch` | Directory to monitor for new files |
| `WATCH_INTERVAL_SECONDS` | `30` | Polling interval for network filesystems (seconds) |
| `FILE_STABILITY_CHECK_MS` | `500` | Time to wait for file write completion (milliseconds) |
| `MAX_FILE_AGE_HOURS` | _(none)_ | Skip files older than this many hours |
| `FORCE_POLLING_WATCH` | _(none)_ | Force polling mode even for local filesystems |
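The idea behind `FILE_STABILITY_CHECK_MS` can be sketched as follows: sample a file's size and modification time, wait, and sample again. This is only an illustration of the concept under that assumption; it is not Readur's watcher code, and `is_stable` is a hypothetical helper.

```python
import os
import tempfile
import time


def is_stable(path: str, wait_ms: int = 500) -> bool:
    """Return True if size and mtime are unchanged after waiting wait_ms."""
    before = os.stat(path)
    time.sleep(wait_ms / 1000)
    after = os.stat(path)
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)


# Demo with a file that has already been fully written and closed.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"finished upload")
demo_path = tmp.name
stable = is_stable(demo_path, wait_ms=50)
os.remove(demo_path)
print(stable)
```

Slow network copies may pause for longer than the wait, which is why the network filesystem examples in this guide use larger values.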
### OCR & Processing Settings
*Note: These settings can also be configured per-user via the web interface*
| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_LANGUAGE` | `eng` | OCR language code (eng, fra, deu, spa, etc.) |
| `CONCURRENT_OCR_JOBS` | `4` | Maximum parallel OCR processes |
| `OCR_TIMEOUT_SECONDS` | `300` | OCR processing timeout per file |
| `MAX_FILE_SIZE_MB` | `50` | Maximum file size for processing |
| `AUTO_ROTATE_IMAGES` | `true` | Automatically rotate images for better OCR |
| `ENABLE_IMAGE_PREPROCESSING` | `true` | Apply image enhancement before OCR |
### Search & Performance
| Variable | Default | Description |
|----------|---------|-------------|
| `SEARCH_RESULTS_PER_PAGE` | `25` | Default number of search results per page |
| `SEARCH_SNIPPET_LENGTH` | `200` | Length of text snippets in search results |
| `FUZZY_SEARCH_THRESHOLD` | `0.8` | Similarity threshold for fuzzy search (0.0-1.0) |
| `MEMORY_LIMIT_MB` | `512` | Memory limit for OCR processes |
| `CPU_PRIORITY` | `normal` | CPU priority: `low`, `normal`, `high` |
### Data Management
| Variable | Default | Description |
|----------|---------|-------------|
| `RETENTION_DAYS` | _(none)_ | Auto-delete documents after N days |
| `ENABLE_AUTO_CLEANUP` | `false` | Enable automatic cleanup of old documents |
| `ENABLE_COMPRESSION` | `false` | Compress stored documents to save space |
| `ENABLE_BACKGROUND_OCR` | `true` | Process OCR in background queue |
## Port Configuration
Readur supports flexible port configuration:
```bash
# Method 1: Specify full server address
SERVER_ADDRESS=0.0.0.0:8000
# Method 2: Use separate host and port (recommended)
SERVER_HOST=0.0.0.0
SERVER_PORT=8000
# For development: Configure frontend port
CLIENT_PORT=5173
BACKEND_PORT=8000
```
## Example Configurations
### Development Configuration
```env
# Basic development setup
DATABASE_URL=postgresql://readur:readur@localhost/readur
JWT_SECRET=dev-secret-key-not-for-production
SERVER_ADDRESS=0.0.0.0:8000
UPLOAD_PATH=./uploads
WATCH_FOLDER=./watch
OCR_LANGUAGE=eng
CONCURRENT_OCR_JOBS=2
```
### Production Configuration
```env
# Core settings
DATABASE_URL=postgresql://readur:secure_password@postgres:5432/readur
JWT_SECRET=your-very-long-random-secret-key-generated-with-openssl
SERVER_ADDRESS=0.0.0.0:8000
# File handling
UPLOAD_PATH=/app/uploads
ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,rtf,doc,docx
# Watch folder for NFS mount
WATCH_FOLDER=/mnt/nfs/documents
WATCH_INTERVAL_SECONDS=60
FILE_STABILITY_CHECK_MS=1000
MAX_FILE_AGE_HOURS=168
FORCE_POLLING_WATCH=1
# OCR optimization
OCR_LANGUAGE=eng
CONCURRENT_OCR_JOBS=8
OCR_TIMEOUT_SECONDS=600
MAX_FILE_SIZE_MB=200
AUTO_ROTATE_IMAGES=true
ENABLE_IMAGE_PREPROCESSING=true
# Performance tuning
MEMORY_LIMIT_MB=2048
CPU_PRIORITY=high
ENABLE_COMPRESSION=true
ENABLE_BACKGROUND_OCR=true
# Search optimization
SEARCH_RESULTS_PER_PAGE=50
SEARCH_SNIPPET_LENGTH=300
FUZZY_SEARCH_THRESHOLD=0.7
# Data management (2555 days is roughly 7 years)
RETENTION_DAYS=2555
ENABLE_AUTO_CLEANUP=true
```
### Network Filesystem Configuration
```env
# For NFS mounts
WATCH_FOLDER=/mnt/nfs/documents
WATCH_INTERVAL_SECONDS=60
FILE_STABILITY_CHECK_MS=1000
FORCE_POLLING_WATCH=1
# For SMB/CIFS mounts
WATCH_FOLDER=/mnt/smb/shared
WATCH_INTERVAL_SECONDS=30
FILE_STABILITY_CHECK_MS=2000
# For S3 mounts (using s3fs)
WATCH_FOLDER=/mnt/s3/bucket
WATCH_INTERVAL_SECONDS=120
FILE_STABILITY_CHECK_MS=5000
FORCE_POLLING_WATCH=1
```
## Configuration Priority
Settings are applied in this order (later values override earlier ones):
1. **Application defaults** (built into the code)
2. **Environment variables** (system-wide configuration)
3. **User settings** (per-user database settings via web interface)
This allows for flexible deployment where system administrators can set defaults while users can customize their experience.
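The layering can be pictured with Python's `ChainMap`, which looks keys up from the highest-priority layer down. The setting names and the `effective_settings` helper are illustrative; the two default values mirror the tables above.

```python
from collections import ChainMap

# Lowest-priority layer: defaults taken from the tables above.
APP_DEFAULTS = {"ocr_language": "eng", "search_results_per_page": 25}


def effective_settings(env: dict, user: dict) -> ChainMap:
    # ChainMap searches left to right, so the highest-priority layer goes first.
    return ChainMap(user, env, APP_DEFAULTS)


settings = effective_settings(
    env={"search_results_per_page": 50},  # set by the administrator
    user={"ocr_language": "deu"},         # chosen in the web interface
)
print(settings["ocr_language"], settings["search_results_per_page"])
```

Here the user's language choice wins, while the page size falls back to the environment value because the user did not set one.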
## Runtime Settings vs Environment Variables
Some settings can be configured in two ways:
1. **Environment Variables**: Set at container startup, affects the entire application
2. **User Settings**: Configured per-user via the web interface, stored in database
**Environment variables provide system-wide defaults.** User settings override these defaults for individual users where applicable, as described in [Configuration Priority](#configuration-priority).
Settings configurable via web interface:
- OCR language preferences
- Search result limits
- File type restrictions
- OCR processing options
- Data retention policies
## Database Tuning
For better search performance with large document collections:
```sql
-- Increase shared_buffers for better caching
ALTER SYSTEM SET shared_buffers = '256MB';
-- Optimize for full-text search
ALTER SYSTEM SET default_text_search_config = 'pg_catalog.english';
-- Restart PostgreSQL after changes
```
## Security Configuration
### Generating Secure Secrets
```bash
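# Generate secure JWT secret (strip the line break openssl inserts into long base64 output)
JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
# Generate secure database password (hex avoids characters that need escaping in DATABASE_URL)
DB_PASSWORD=$(openssl rand -hex 32)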
# Save to .env file
cat > .env << EOF
JWT_SECRET=${JWT_SECRET}
DB_PASSWORD=${DB_PASSWORD}
EOF
```
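Where `openssl` is not available, Python's standard `secrets` module can produce comparable values. This is a sketch; `make_env_lines` is a hypothetical helper, and the lengths are a choice rather than a Readur requirement.

```python
import secrets


def make_env_lines() -> list[str]:
    """Build .env lines with freshly generated secrets."""
    jwt_secret = secrets.token_urlsafe(64)  # 64 random bytes as URL-safe text
    db_password = secrets.token_hex(32)     # 32 random bytes as 64 hex characters
    return [f"JWT_SECRET={jwt_secret}", f"DB_PASSWORD={db_password}"]


env_lines = make_env_lines()
# Print only the variable names so the secrets do not end up in terminal history.
print([line.split("=", 1)[0] for line in env_lines])
```

Both generators emit single-line values without `/`, `+`, or `=`, which keeps them safe to embed in a `.env` file or a connection URL.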
### Quick Reference - Essential Variables
For a minimal production deployment, configure these essential variables:
```bash
# Security (REQUIRED)
JWT_SECRET=your-secure-random-key-here
DATABASE_URL=postgresql://user:password@host:port/database
# File Storage
UPLOAD_PATH=/app/uploads
WATCH_FOLDER=/path/to/mounted/folder
# Watch Folder (for network mounts)
WATCH_INTERVAL_SECONDS=60
FORCE_POLLING_WATCH=1
# Performance
CONCURRENT_OCR_JOBS=4
MAX_FILE_SIZE_MB=100
```
## Next Steps
- Review [deployment options](deployment.md) for production setup
- Learn about [folder watching](WATCH_FOLDER.md) for automatic document ingestion
- Optimize [OCR performance](dev/OCR_OPTIMIZATION_GUIDE.md) for your use case

403
docs/deployment.md Normal file

@ -0,0 +1,403 @@
# Deployment Guide
This guide covers production deployment strategies, SSL setup, monitoring, backups, and best practices for running Readur in production.
## Table of Contents
- [Production Docker Compose](#production-docker-compose)
- [Network Filesystem Mounts](#network-filesystem-mounts)
  - [NFS Mounts](#nfs-mounts)
  - [SMB/CIFS Mounts](#smbcifs-mounts)
  - [S3 Mounts](#s3-mounts)
- [SSL/HTTPS Setup](#sslhttps-setup)
  - [Nginx Configuration](#nginx-configuration)
  - [Traefik Configuration](#traefik-configuration)
- [Health Checks](#health-checks)
- [Backup Strategy](#backup-strategy)
- [Monitoring](#monitoring)
- [Deployment Platforms](#deployment-platforms)
  - [Docker Swarm](#docker-swarm)
  - [Kubernetes](#kubernetes)
  - [Cloud Platforms](#cloud-platforms)
- [Security Considerations](#security-considerations)
## Production Docker Compose
For production deployments, create a custom `docker-compose.prod.yml`:
```yaml
services:
  readur:
    image: readur:latest
    ports:
      - "8000:8000"
    environment:
      # Core Configuration
      - DATABASE_URL=postgresql://readur:${DB_PASSWORD}@postgres:5432/readur
      - JWT_SECRET=${JWT_SECRET}
      - SERVER_ADDRESS=0.0.0.0:8000
      # File Storage
      - UPLOAD_PATH=/app/uploads
      - WATCH_FOLDER=/app/watch
      - ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,tiff,bmp,gif,txt,doc,docx
      # Watch Folder Settings
      - WATCH_INTERVAL_SECONDS=30
      - FILE_STABILITY_CHECK_MS=500
      - MAX_FILE_AGE_HOURS=168
      # OCR Configuration
      - OCR_LANGUAGE=eng
      - CONCURRENT_OCR_JOBS=4
      - OCR_TIMEOUT_SECONDS=300
      - MAX_FILE_SIZE_MB=100
      # Performance Tuning
      - MEMORY_LIMIT_MB=1024
      - CPU_PRIORITY=normal
      - ENABLE_COMPRESSION=true
    volumes:
      # Document storage
      - ./data/uploads:/app/uploads
      # Watch folder - mount your network drives here
      - /mnt/nfs/documents:/app/watch
      # or SMB: - /mnt/smb/shared:/app/watch
      # or S3: - /mnt/s3/bucket:/app/watch
    depends_on:
      - postgres
    restart: unless-stopped
    # Resource limits for production
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
        reservations:
          memory: 512M
          cpus: '0.5'
  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=readur
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_DB=readur
      - POSTGRES_INITDB_ARGS=--encoding=UTF-8 --lc-collate=en_US.UTF-8 --lc-ctype=en_US.UTF-8
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres-config:/etc/postgresql/conf.d:ro
    # PostgreSQL optimization for document search
    command: >
      postgres
      -c shared_buffers=256MB
      -c effective_cache_size=1GB
      -c max_connections=100
      -c default_text_search_config=pg_catalog.english
    restart: unless-stopped
    # Don't expose port in production
    # ports:
    #   - "5433:5432"
volumes:
  postgres_data:
    driver: local
```
Deploy with environment file:
```bash
# Create .env file with secrets
cat > .env << EOF
JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
DB_PASSWORD=$(openssl rand -hex 32)
EOF
# Deploy
docker compose -f docker-compose.prod.yml --env-file .env up -d
```
## Network Filesystem Mounts
### NFS Mounts
```bash
# Mount NFS share
sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs/documents
# Add to docker-compose.yml (YAML, under the readur service)
volumes:
  - /mnt/nfs/documents:/app/watch
environment:
  - WATCH_INTERVAL_SECONDS=60
  - FILE_STABILITY_CHECK_MS=1000
  - FORCE_POLLING_WATCH=1
```
### SMB/CIFS Mounts
```bash
# Mount SMB share
sudo mount -t cifs //server/share /mnt/smb/shared -o username=user,password=pass
# Docker volume configuration (YAML, under the readur service)
volumes:
  - /mnt/smb/shared:/app/watch
environment:
  - WATCH_INTERVAL_SECONDS=30
  - FILE_STABILITY_CHECK_MS=2000
```
### S3 Mounts
```bash
# Mount S3 bucket using s3fs
s3fs mybucket /mnt/s3/bucket -o passwd_file=~/.passwd-s3fs
# Docker configuration for S3 (YAML, under the readur service)
volumes:
  - /mnt/s3/bucket:/app/watch
environment:
  - WATCH_INTERVAL_SECONDS=120
  - FILE_STABILITY_CHECK_MS=5000
  - FORCE_POLLING_WATCH=1
```
## SSL/HTTPS Setup
### Nginx Configuration
```nginx
server {
listen 443 ssl http2;
server_name readur.yourdomain.com;
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
location / {
proxy_pass http://localhost:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# For file uploads
client_max_body_size 100M;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
}
```
### Traefik Configuration
```yaml
services:
  readur:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.readur.rule=Host(`readur.yourdomain.com`)"
      - "traefik.http.routers.readur.tls=true"
      - "traefik.http.routers.readur.tls.certresolver=letsencrypt"
```
> 📘 **For more reverse proxy configurations** including Apache, Caddy, custom ports, load balancing, and advanced scenarios, see [REVERSE_PROXY.md](./REVERSE_PROXY.md).
## Health Checks
Add health checks to your Docker configuration:
```yaml
services:
  readur:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
## Backup Strategy
Create an automated backup script:
```bash
#!/bin/bash
# backup.sh - Automated backup script
set -euo pipefail  # abort on errors, including a failed pg_dump inside the pipeline
BACKUP_DIR="/path/to/backups"
DATE=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Backup database
docker exec readur-postgres-1 pg_dump -U readur readur | gzip > "$BACKUP_DIR/db_backup_$DATE.sql.gz"
# Backup uploaded files
tar -czf "$BACKUP_DIR/uploads_backup_$DATE.tar.gz" -C ./data uploads/
# Clean old backups (keep 30 days)
find "$BACKUP_DIR" -name "db_backup_*.sql.gz" -mtime +30 -delete
find "$BACKUP_DIR" -name "uploads_backup_*.tar.gz" -mtime +30 -delete
echo "Backup completed: $DATE"
```
Add to crontab for daily backups:
```bash
0 2 * * * /path/to/backup.sh >> /var/log/readur-backup.log 2>&1
```
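Before relying on the cleanup step in the backup script, it can help to preview which files a 30-day retention window would remove. The sketch below approximates the `find ... -mtime +30` selection with a plain cutoff comparison; `expired_backups` is a hypothetical helper, and the file names and timestamps are made up.

```python
from datetime import datetime, timedelta


def expired_backups(files: dict[str, datetime], now: datetime, keep_days: int = 30) -> list[str]:
    """Return the names of backups whose modification time is past the cutoff."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(name for name, mtime in files.items() if mtime < cutoff)


now = datetime(2024, 3, 1, 2, 0, 0)
backups = {
    "db_backup_20240101_020000.sql.gz": datetime(2024, 1, 1, 2, 0, 0),
    "db_backup_20240215_020000.sql.gz": datetime(2024, 2, 15, 2, 0, 0),
    "uploads_backup_20240120_020000.tar.gz": datetime(2024, 1, 20, 2, 0, 0),
}
for name in expired_backups(backups, now):
    print("would delete:", name)
```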
### Restore from Backup
```bash
# Restore database
gunzip -c db_backup_20240101_020000.sql.gz | docker exec -i readur-postgres-1 psql -U readur readur
# Restore files
tar -xzf uploads_backup_20240101_020000.tar.gz -C ./data
```
## Monitoring
Monitor your deployment with Docker stats:
```bash
# Real-time resource usage
docker stats
# Container logs
docker compose logs -f readur
# Watch folder activity
docker compose logs -f readur | grep watcher
# PostgreSQL query performance (requires the pg_stat_statements extension; the column is total_exec_time on PostgreSQL 13+)
docker exec readur-postgres-1 psql -U readur -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
```
### Prometheus Metrics
Readur exposes metrics at the `/metrics` endpoint:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'readur'
    static_configs:
      - targets: ['readur:8000']
```
## Deployment Platforms
### Docker Swarm
```yaml
version: '3.8'
services:
  readur:
    image: readur:latest
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
      placement:
        constraints: [node.role == worker]
    networks:
      - readur-network
    secrets:
      - jwt_secret
      - db_password
# The network referenced above must be declared for the stack to deploy
networks:
  readur-network:
    driver: overlay
secrets:
  jwt_secret:
    external: true
  db_password:
    external: true
```
### Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: readur
spec:
  replicas: 3
  selector:
    matchLabels:
      app: readur
  template:
    metadata:
      labels:
        app: readur  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: readur
          image: readur:latest
          env:
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: readur-secrets
                  key: jwt-secret
          resources:
            limits:
              memory: "2Gi"
              cpu: "2"
            requests:
              memory: "512Mi"
              cpu: "500m"
```
### Cloud Platforms
- **AWS**: Use ECS with RDS PostgreSQL
- **Google Cloud**: Deploy to Cloud Run with Cloud SQL
- **Azure**: Use Container Instances with Azure Database
- **DigitalOcean**: App Platform with Managed Database
## Security Considerations
### Production Checklist
- [ ] Change default admin password
- [ ] Generate strong JWT secret
- [ ] Use HTTPS/SSL in production
- [ ] Restrict database network access
- [ ] Set proper file permissions
- [ ] Enable firewall rules
- [ ] Regular security updates
- [ ] Monitor access logs
- [ ] Implement rate limiting
- [ ] Enable audit logging
### Recommended Production Setup
```bash
# Generate secure secrets (single-line values that are safe in .env files and URLs)
JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')
DB_PASSWORD=$(openssl rand -hex 32)
# Restrict file permissions
chmod 600 .env
chmod 700 ./data/uploads
# Use read-only root filesystem
docker run --read-only --tmpfs /tmp ...
```
## Next Steps
- Configure [monitoring and alerting](#monitoring)
- Review [security best practices](#security-considerations)
- Set up [automated backups](#backup-strategy)
- Explore [database guardrails](dev/DATABASE_GUARDRAILS.md)

47
docs/dev/README.md Normal file

@ -0,0 +1,47 @@
# Developer Documentation
This directory contains technical documentation for developers working on Readur.
## 📋 Table of Contents
### 🏗️ Architecture & Design
- [**Architecture Overview**](architecture.md) - System design, components, and data flow
- [**Database Guardrails**](DATABASE_GUARDRAILS.md) - Concurrency safety and database best practices
### 🛠️ Development
- [**Development Guide**](development.md) - Setup, contributing, code style guidelines
- [**Testing Guide**](TESTING.md) - Comprehensive testing strategy and instructions
### ⚙️ Technical Guides
- [**OCR Optimization**](OCR_OPTIMIZATION_GUIDE.md) - Performance tuning and best practices
- [**Queue Improvements**](QUEUE_IMPROVEMENTS.md) - Background job processing architecture
- [**Deployment Summary**](DEPLOYMENT_SUMMARY.md) - Technical deployment overview
## 🚀 Quick Start for Developers
1. **Read the [Architecture Overview](architecture.md)** to understand the system design
2. **Follow the [Development Guide](development.md)** to set up your local environment
3. **Review the [Testing Guide](TESTING.md)** to understand our testing approach
4. **Check [Database Guardrails](DATABASE_GUARDRAILS.md)** for data safety patterns
## 📖 Related User Documentation
- [Installation Guide](../installation.md) - How to install and run Readur
- [Configuration Guide](../configuration.md) - Environment variables and settings
- [User Guide](../user-guide.md) - How to use Readur features
- [API Reference](../api-reference.md) - REST API documentation
## 🤝 Contributing
Please read our [Development Guide](development.md) for:
- Setting up your development environment
- Code style guidelines
- Testing requirements
- Pull request process
## 🏷️ Document Categories
- **📘 User Docs**: Installation, configuration, user guide
- **🔧 Operations**: Deployment, monitoring, troubleshooting
- **💻 Developer**: Architecture, development setup, testing
- **🔌 Integration**: API reference, webhooks, extensions

350
docs/dev/architecture.md Normal file

@ -0,0 +1,350 @@
# Architecture Overview
This document provides a comprehensive overview of Readur's architecture, design decisions, and technical implementation details.
## Table of Contents
- [System Architecture](#system-architecture)
- [Technology Stack](#technology-stack)
- [Component Overview](#component-overview)
  - [Backend (Rust/Axum)](#backend-rustaxum)
  - [Frontend (React)](#frontend-react)
  - [Database (PostgreSQL)](#database-postgresql)
  - [OCR Engine](#ocr-engine)
- [Data Flow](#data-flow)
- [Security Architecture](#security-architecture)
- [Performance Considerations](#performance-considerations)
- [Scalability](#scalability)
- [Design Patterns](#design-patterns)
## System Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ React Frontend  │────│  Rust Backend   │────│  PostgreSQL DB  │
│   (Port 8000)   │    │   (Axum API)    │    │   (Port 5433)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │             ┌─────────────────┐             │
         └─────────────│  File Storage   │─────────────┘
                       │  + OCR Engine   │
                       └─────────────────┘
```
### High-Level Components
1. **Web Interface**: Modern React SPA with Material-UI
2. **API Server**: High-performance Rust backend using Axum
3. **Database**: PostgreSQL with full-text search capabilities
4. **File Storage**: Local or network-mounted filesystem
5. **OCR Processing**: Tesseract integration for text extraction
6. **Background Jobs**: Async task processing for OCR and file watching
## Technology Stack
### Backend
- **Language**: Rust (for performance and memory safety)
- **Web Framework**: Axum (async, fast, type-safe)
- **Database Access**: SQLx (async SQL toolkit with compile-time checked queries, not a full ORM)
- **Authentication**: JWT tokens with bcrypt password hashing
- **Async Runtime**: Tokio
- **Serialization**: Serde
### Frontend
- **Framework**: React 18 with TypeScript
- **UI Library**: Material-UI (MUI)
- **State Management**: React Context + Hooks
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router
### Infrastructure
- **Database**: PostgreSQL 14+ with pgvector extension
- **OCR**: Tesseract 4.0+
- **Container**: Docker with multi-stage builds
- **Reverse Proxy**: Nginx/Traefik compatible
## Component Overview
### Backend (Rust/Axum)
The backend is structured following clean architecture principles:
```
src/
├── main.rs # Application entry and server setup
├── config.rs # Configuration management
├── models.rs # Domain models and DTOs
├── error.rs # Error handling
├── auth.rs # Authentication middleware
├── routes/ # HTTP route handlers
│ ├── auth.rs # Authentication endpoints
│ ├── documents.rs # Document CRUD operations
│ ├── search.rs # Search functionality
│ └── ...
├── db/ # Database operations
│ ├── documents.rs # Document queries
│ ├── users.rs # User queries
│ └── ...
├── services/ # Business logic
│ ├── ocr.rs # OCR processing
│ ├── file_service.rs # File management
│ └── watcher.rs # Folder watching
└── tests/ # Integration tests
```
Key design decisions:
- **Async-first**: All I/O operations are async
- **Type safety**: Leverages Rust's type system
- **Error handling**: Comprehensive error types
- **Dependency injection**: Clean separation of concerns
### Frontend (React)
The frontend follows a component-based architecture:
```
frontend/src/
├── components/ # Reusable UI components
│ ├── DocumentList/
│ ├── SearchBar/
│ └── ...
├── pages/ # Page-level components
│ ├── Dashboard/
│ ├── Documents/
│ └── ...
├── services/ # API integration
│ ├── api.ts # Base API client
│ ├── auth.ts # Auth service
│ └── documents.ts # Document service
├── hooks/ # Custom React hooks
├── contexts/ # React contexts
└── utils/ # Utility functions
```
### Database (PostgreSQL)
Schema design optimized for document management:
```sql
-- Core tables
--   users                  User accounts
--   documents              Document metadata
--   document_content       Extracted text content
--   document_tags          Many-to-many tags
--   sources                File sources (folders, S3, etc.)
--   ocr_queue              OCR processing queue
-- Search optimization
--   document_search_index  Full-text search index
```
Key features:
- **Full-text search**: PostgreSQL's powerful search capabilities
- **JSONB fields**: Flexible metadata storage
- **Triggers**: Automatic search index updates
- **Views**: Optimized query patterns
### OCR Engine
OCR processing pipeline:
1. **File Detection**: New files detected via upload or folder watch
2. **Queue Management**: Files added to processing queue
3. **Pre-processing**: Image enhancement and optimization
4. **Text Extraction**: Tesseract OCR with language detection
5. **Post-processing**: Text cleaning and formatting
6. **Database Storage**: Indexed for search
## Data Flow
### Document Upload Flow
```mermaid
sequenceDiagram
User->>Frontend: Upload Document
Frontend->>API: POST /api/documents
API->>FileStorage: Save File
API->>Database: Create Document Record
API->>OCRQueue: Add to Queue
API-->>Frontend: Document Created
OCRWorker->>OCRQueue: Poll for Jobs
OCRWorker->>FileStorage: Read File
OCRWorker->>Tesseract: Extract Text
OCRWorker->>Database: Update with Content
OCRWorker->>Frontend: WebSocket Update
```
### Search Flow
```mermaid
sequenceDiagram
User->>Frontend: Enter Search Query
Frontend->>API: GET /api/search
API->>Database: Full-text Search
Database->>API: Ranked Results
API->>Frontend: Search Results
Frontend->>User: Display Results
```
## Security Architecture
### Authentication & Authorization
- **JWT Tokens**: Stateless authentication
- **Role-Based Access**: Admin, User roles
- **Token Refresh**: Automatic token renewal
- **Password Security**: Bcrypt with salt rounds
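To show what stateless JWT authentication means in practice, the sketch below signs and verifies an HS256 token using only the Python standard library. It illustrates the general mechanism, not Readur's Rust implementation; the claim names (`sub`, `exp`) follow common JWT conventions and are assumptions here, and production code should use a maintained JWT library and also validate the header's `alg` field.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Create a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Check the signature and expiry, then return the claims."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", float("inf")) < (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


token = sign({"sub": "admin", "exp": 2000}, b"dev-secret-key")
print(verify(token, b"dev-secret-key", now=1000)["sub"])
```

Because the server only needs the secret to validate a token, no session state has to be stored, which is also why `JWT_SECRET` must be kept private and rotated if it leaks.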
### API Security
- **CORS**: Configurable allowed origins
- **Rate Limiting**: Prevent abuse
- **Input Validation**: Comprehensive validation
- **SQL Injection Prevention**: Parameterized queries via SQLx
### File Security
- **Upload Validation**: File type and size checks
- **Virus Scanning**: Optional ClamAV integration
- **Access Control**: Document-level permissions
- **Secure Storage**: Filesystem permissions
## Performance Considerations
### Backend Optimization
- **Connection Pooling**: Database connection reuse
- **Async I/O**: Non-blocking operations
- **Caching**: In-memory caching for hot data
- **Query Optimization**: Indexed searches
### Frontend Optimization
- **Code Splitting**: Lazy loading of routes
- **Virtual Scrolling**: Large document lists
- **Memoization**: Prevent unnecessary re-renders
- **Service Workers**: Offline capability
### OCR Optimization
- **Parallel Processing**: Multiple concurrent jobs
- **Image Pre-processing**: Enhance OCR accuracy
- **Resource Limits**: Memory and CPU constraints
- **Queue Priority**: Smart job scheduling
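The parallel-processing point can be illustrated with a bounded worker pool. Readur itself is written in Rust, so this Python sketch only shows the idea of capping concurrency with a value such as `CONCURRENT_OCR_JOBS`; `fake_ocr` is a stand-in for a real OCR call.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fake_ocr(path: str) -> str:
    # Stand-in for a real OCR invocation.
    return f"text from {path}"


def run_ocr_batch(paths: list[str], max_jobs: int) -> list[str]:
    """Process files with at most max_jobs running at the same time."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # map() returns results in input order even if jobs finish out of order.
        return list(pool.map(fake_ocr, paths))


max_jobs = int(os.environ.get("CONCURRENT_OCR_JOBS", "4"))
results = run_ocr_batch(["a.png", "b.png", "c.png"], max_jobs)
print(results)
```

Raising the limit helps only while CPU and memory are available, so it is usually tuned together with the memory limit.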
## Scalability
### Horizontal Scaling
```yaml
# Multiple backend instances
backend-1:
  image: readur:latest
  environment:
    - INSTANCE_ID=1
backend-2:
  image: readur:latest
  environment:
    - INSTANCE_ID=2
# Load balancer (nginx.conf excerpt, shown as comments because it is not YAML):
#   upstream backend {
#     server backend-1:8000;
#     server backend-2:8000;
#   }
```
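With no balancing method specified, an nginx `upstream` block distributes requests round-robin. The toy sketch below shows that rotation for the two backends named above; `make_picker` is a hypothetical helper and ignores health checks and weights.

```python
from itertools import cycle
from typing import Callable


def make_picker(servers: list[str]) -> Callable[[], str]:
    """Return a function that yields servers in round-robin order."""
    rotation = cycle(servers)
    return lambda: next(rotation)


pick = make_picker(["backend-1:8000", "backend-2:8000"])
for request_id in range(4):
    print(request_id, "->", pick())
```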
### Database Scaling
- **Read Replicas**: Distribute read load
- **Connection Pooling**: PgBouncer
- **Partitioning**: Time-based partitions
- **Archival**: Move old documents
### Storage Scaling
- **S3 Compatible**: Object storage support
- **CDN Integration**: Static file delivery
- **Distributed Storage**: GlusterFS/Ceph
- **Archive Tiering**: Hot/cold storage
## Design Patterns
### Backend Patterns
1. **Repository Pattern**: Database abstraction
2. **Service Layer**: Business logic separation
3. **Middleware Chain**: Request processing
4. **Error Boundaries**: Graceful error handling
### Frontend Patterns
1. **Container/Presenter**: Component separation
2. **Custom Hooks**: Logic reuse
3. **Context Provider**: State management
4. **HOCs**: Cross-cutting concerns
### Database Patterns
1. **Soft Deletes**: Data preservation
2. **Audit Trails**: Change tracking
3. **Materialized Views**: Performance
4. **Event Sourcing**: Optional audit log
## Future Architecture Considerations
### Microservices Migration
Potential service boundaries:
- Authentication Service
- Document Service
- OCR Service
- Search Service
- Notification Service
### Event-Driven Architecture
- Message Queue (RabbitMQ/Kafka)
- Event Sourcing
- CQRS Pattern
- Async communication
### Cloud-Native Features
- Kubernetes deployment
- Service mesh (Istio)
- Distributed tracing
- Cloud storage integration
## Monitoring and Observability
### Metrics
- Prometheus metrics endpoint
- Custom business metrics
- Performance counters
- Resource utilization
### Logging
- Structured logging (JSON)
- Log aggregation ready
- Correlation IDs
- Debug levels
### Tracing
- OpenTelemetry support
- Distributed tracing
- Performance profiling
- Request tracking
## Next Steps
- Review [deployment options](../deployment.md)
- Explore [performance tuning](OCR_OPTIMIZATION_GUIDE.md)
- Understand [database design](DATABASE_GUARDRAILS.md)
- Learn about [testing strategy](TESTING.md)

434
docs/dev/development.md Normal file

@ -0,0 +1,434 @@
# Development Guide
This guide covers contributing to Readur, setting up a development environment, testing, and code style guidelines.
## Table of Contents
- [Development Setup](#development-setup)
  - [Prerequisites](#prerequisites)
  - [Local Development](#local-development)
  - [Development with Docker](#development-with-docker)
- [Project Structure](#project-structure)
- [Testing](#testing)
  - [Backend Tests](#backend-tests)
  - [Frontend Tests](#frontend-tests)
  - [Integration Tests](#integration-tests)
  - [E2E Tests](#e2e-tests)
- [Code Style](#code-style)
  - [Rust Guidelines](#rust-guidelines)
  - [Frontend Guidelines](#frontend-guidelines)
- [Contributing](#contributing)
  - [Getting Started](#getting-started)
  - [Pull Request Process](#pull-request-process)
  - [Commit Guidelines](#commit-guidelines)
- [Debugging](#debugging)
- [Performance Profiling](#performance-profiling)
## Development Setup
### Prerequisites
- Rust 1.70+ and Cargo
- Node.js 18+ and npm
- PostgreSQL 14+
- Tesseract OCR 4.0+
- Git
### Local Development
1. **Clone the repository**:
```bash
git clone https://github.com/perfectra1n/readur.git
cd readur
```
2. **Set up the database**:
```bash
# Create the development database and user
sudo -u postgres psql <<'SQL'
CREATE DATABASE readur_dev;
CREATE USER readur_dev WITH ENCRYPTED PASSWORD 'dev_password';
GRANT ALL PRIVILEGES ON DATABASE readur_dev TO readur_dev;
SQL
```
3. **Configure environment**:
```bash
# Copy example environment
cp .env.example .env.development
# Then edit .env.development and set at least:
#   DATABASE_URL=postgresql://readur_dev:dev_password@localhost/readur_dev
#   JWT_SECRET=dev-secret-key
```
4. **Run database migrations**:
```bash
# Install sqlx-cli if needed
cargo install sqlx-cli
# Run migrations
sqlx migrate run
```
5. **Start the backend**:
```bash
# Development mode with auto-reload (requires cargo-watch: cargo install cargo-watch)
cargo watch -x run
# Or without auto-reload
cargo run
```
6. **Start the frontend**:
```bash
cd frontend
npm install
npm run dev
```
### Development with Docker
For a consistent development environment:
```bash
# Start all services
docker compose -f docker-compose.yml -f docker-compose.dev.yml up
# Backend available at: http://localhost:8000
# Frontend dev server at: http://localhost:5173
# PostgreSQL at: localhost:5433
```
The development compose file includes:
- Volume mounts for hot reloading
- Exposed database port
- Debug logging enabled
## Project Structure
```
readur/
├── src/ # Rust backend source
│ ├── main.rs # Application entry point
│ ├── config.rs # Configuration management
│ ├── models.rs # Database models
│ ├── routes/ # API route handlers
│ ├── db/ # Database operations
│ ├── ocr.rs # OCR processing
│ └── tests/ # Unit tests
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # React components
│ │ ├── pages/ # Page components
│ │ ├── services/ # API services
│ │ └── App.tsx # Main app component
│ └── tests/ # Frontend tests
├── migrations/ # Database migrations
├── docs/ # Documentation
└── tests/ # E2E and integration tests
```
## Testing
Readur's tests are split into unit, integration, and end-to-end suites.
### Backend Tests
```bash
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test
cargo test test_document_upload
# Run tests with coverage
cargo install cargo-tarpaulin
cargo tarpaulin --out Html
```
Test categories:
- **Unit tests**: In `src/tests/`
- **Integration tests**: In `tests/`
- **Database tests**: Require `TEST_DATABASE_URL`
Example test:
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_document_creation() {
        let doc = Document::new("test.pdf", "application/pdf");
        assert_eq!(doc.filename, "test.pdf");
    }
}
```
### Frontend Tests
```bash
cd frontend
# Run unit tests
npm test
# Run with coverage
npm run test:coverage
# Run in watch mode
npm run test:watch
```
Example test:
```typescript
import { render, screen } from '@testing-library/react';
import DocumentList from './DocumentList';

test('renders document list', () => {
  render(<DocumentList documents={[]} />);
  expect(screen.getByText(/No documents/i)).toBeInTheDocument();
});
```
### Integration Tests
```bash
# Run integration tests
docker compose -f docker-compose.test.yml up --abort-on-container-exit
# Or manually
cargo test --test '*' -- --test-threads=1
```
### E2E Tests
Using Playwright for end-to-end testing:
```bash
cd frontend
# Install Playwright
npm run e2e:install
# Run E2E tests
npm run e2e
# Run in UI mode
npm run e2e:ui
```
## Code Style
### Rust Guidelines
We follow the official Rust style guide with some additions:
```bash
# Format code
cargo fmt
# Check linting
cargo clippy -- -D warnings
# Check before committing
cargo fmt --check && cargo clippy -- -D warnings
```
Style preferences:
- Use descriptive variable names
- Add documentation comments for public APIs
- Keep functions small and focused
- Use `Result` for error handling
- Prefer `&str` over `String` for function parameters
### Frontend Guidelines
```bash
# Format code
npm run format
# Lint check
npm run lint
# Type check
npm run type-check
```
Style preferences:
- Use functional components with hooks
- TypeScript for all new code
- Descriptive component and variable names
- Extract reusable logic into custom hooks
- Keep components focused and small
## Contributing
We welcome contributions! Please see our [Contributing Guide](../CONTRIBUTING.md) for details.
### Getting Started
1. **Fork the repository**
2. **Create a feature branch**:
```bash
git checkout -b feature/amazing-feature
```
3. **Make your changes**
4. **Add tests** for new functionality
5. **Ensure all tests pass**:
```bash
cargo test
cd frontend && npm test
```
6. **Commit your changes** (see commit guidelines below)
7. **Push to your fork**:
```bash
git push origin feature/amazing-feature
```
8. **Open a Pull Request**
### Pull Request Process
1. **Update documentation** for any changed functionality
2. **Add tests** covering new code
3. **Ensure CI passes** (automated checks)
4. **Request review** from maintainers
5. **Address feedback** promptly
6. **Squash commits** if requested
### Commit Guidelines
We use conventional commits for clear history:
```
feat: add bulk document export
fix: resolve OCR timeout on large files
docs: update API authentication section
test: add coverage for search filters
refactor: simplify document processing pipeline
perf: optimize database queries for search
chore: update dependencies
```
Format:
```
<type>(<scope>): <subject>

<body>

<footer>
```
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation only
- `style`: Code style changes
- `refactor`: Code refactoring
- `perf`: Performance improvements
- `test`: Test additions/changes
- `chore`: Build process/auxiliary tool changes
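The header format above can be checked mechanically. The following is a minimal sketch of such a check, not a tool that ships with Readur: `CommitHeader` and `parse_header` are hypothetical names, the scope is treated as optional (as in the examples above), and only the types listed in this guide are accepted.

```rust
/// Commit types listed in this guide.
const TYPES: [&str; 8] = [
    "feat", "fix", "docs", "style", "refactor", "perf", "test", "chore",
];

/// Parsed first line of a commit message (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct CommitHeader<'a> {
    kind: &'a str,
    scope: Option<&'a str>,
    subject: &'a str,
}

/// Parse `<type>(<scope>): <subject>`; the scope is optional.
/// Returns `None` when the line does not follow the format.
fn parse_header(line: &str) -> Option<CommitHeader<'_>> {
    let (prefix, subject) = line.split_once(": ")?;
    let subject = subject.trim();
    if subject.is_empty() {
        return None;
    }
    let (kind, scope) = match prefix.split_once('(') {
        Some((kind, rest)) => {
            // A scope must be closed by ')' and must not be empty.
            let scope = rest.strip_suffix(')')?;
            if scope.is_empty() {
                return None;
            }
            (kind, Some(scope))
        }
        None => (prefix, None),
    };
    if !TYPES.contains(&kind) {
        return None;
    }
    Some(CommitHeader { kind, scope, subject })
}
```

For example, `parse_header("feat(docs): add user guide")` yields the type `feat` with scope `docs`, while `parse_header("feature: add thing")` is rejected because `feature` is not one of the listed types.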
## Debugging
### Backend Debugging
1. **Enable debug logging**:
```bash
RUST_LOG=debug cargo run
```
2. **Use VS Code debugger**:
```json
// .vscode/launch.json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "lldb",
      "request": "launch",
      "name": "Debug Readur",
      "cargo": {
        "args": ["build", "--bin=readur"],
        "filter": {
          "name": "readur",
          "kind": "bin"
        }
      },
      "args": [],
      "cwd": "${workspaceFolder}"
    }
  ]
}
```
3. **Database query logging**:
```bash
RUST_LOG=sqlx=debug cargo run
```
### Frontend Debugging
1. **React DevTools**: Install browser extension
2. **Redux DevTools**: For state debugging
3. **Network tab**: Monitor API calls
4. **Console debugging**: Strategic `console.log`
## Performance Profiling
### Backend Profiling
```bash
# CPU profiling with flamegraph
cargo install flamegraph
cargo flamegraph --bin readur
# Memory profiling
valgrind --tool=massif target/release/readur
```
### Frontend Profiling
1. Use Chrome DevTools Performance tab
2. React Profiler for component performance
3. Lighthouse for overall performance audit
### Database Profiling
```sql
-- Enable query timing
\timing on
-- Analyze query plan
EXPLAIN ANALYZE SELECT * FROM documents WHERE ...;
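-- Check slow queries (requires the pg_stat_statements extension;
-- the column is total_exec_time on PostgreSQL 13+)
SELECT * FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;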
```
## Additional Resources
- [Rust Book](https://doc.rust-lang.org/book/)
- [React Documentation](https://react.dev/)
- [PostgreSQL Documentation](https://www.postgresql.org/docs/)
- [Tesseract Documentation](https://tesseract-ocr.github.io/)
- [Testing Guide](TESTING.md)
## Getting Help
- **GitHub Issues**: For bug reports and feature requests
- **GitHub Discussions**: For questions and community support
- **Discord**: Join our community server (link in README)
## License
By contributing to Readur, you agree that your contributions will be licensed under the MIT License.

docs/installation.md Normal file

@ -0,0 +1,175 @@
# Installation Guide
This guide covers various methods to install and run Readur, from quick Docker deployment to manual installation.
## Table of Contents
- [Quick Start with Docker Compose](#quick-start-with-docker-compose)
- [System Requirements](#system-requirements)
- [Manual Installation](#manual-installation)
- [Prerequisites](#prerequisites)
- [Backend Setup](#backend-setup)
- [Frontend Setup](#frontend-setup)
- [Verifying Installation](#verifying-installation)
## Quick Start with Docker Compose
The fastest way to get Readur running:
```bash
# Clone the repository
git clone https://github.com/perfectra1n/readur
cd readur
# Start all services
docker compose up --build -d
# Access the application (macOS; on Linux use xdg-open or open the URL in a browser)
open http://localhost:8000
```
**Default login credentials:**
- Username: `admin`
- Password: `readur2024`
> ⚠️ **Important**: Change the default admin password immediately after first login!
### What You Get
After deployment, you'll have:
- **Web Interface**: Modern document management UI at `http://localhost:8000`
- **PostgreSQL Database**: Document metadata and full-text search indexes
- **File Storage**: Persistent document storage with OCR processing
- **Watch Folder**: Automatic file ingestion from mounted directories
- **REST API**: Full API access for integrations
## System Requirements
### Minimum Requirements
- **CPU**: 2 cores
- **RAM**: 2GB
- **Storage**: 10GB free space
- **OS**: Linux, macOS, or Windows with Docker
### Recommended for Production
- **CPU**: 4+ cores
- **RAM**: 4GB+
- **Storage**: 50GB+ SSD
- **Network**: Stable internet connection for OCR processing
## Manual Installation
For development or custom deployments without Docker:
### Prerequisites
Install these dependencies on your system:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
  tesseract-ocr tesseract-ocr-eng \
  libtesseract-dev libleptonica-dev \
  postgresql postgresql-contrib \
  pkg-config libclang-dev
# macOS (requires Homebrew; the node formula includes npm)
brew install tesseract leptonica postgresql rust node
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Backend Setup
1. **Configure Database**:
```bash
# Create database and user
sudo -u postgres psql <<'SQL'
CREATE DATABASE readur;
CREATE USER readur_user WITH ENCRYPTED PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE readur TO readur_user;
SQL
```
2. **Environment Configuration**:
```bash
# Copy environment template
cp .env.example .env
# Edit configuration
nano .env
```
Required environment variables:
```env
DATABASE_URL=postgresql://readur_user:your_password@localhost/readur
JWT_SECRET=your-super-secret-jwt-key-change-this
SERVER_ADDRESS=0.0.0.0:8000
UPLOAD_PATH=./uploads
WATCH_FOLDER=./watch
ALLOWED_FILE_TYPES=pdf,png,jpg,jpeg,gif,bmp,tiff,txt,rtf,doc,docx
```
3. **Build and Run Backend**:
```bash
# Build an optimized binary, then run it
cargo build --release
cargo run --release
```
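To make the `ALLOWED_FILE_TYPES` setting from step 2 more concrete, here is a minimal sketch of how a comma-separated extension list could be parsed and applied to a file name. The function names are hypothetical and the case-insensitive matching is an assumption; Readur's actual handling may differ.

```rust
use std::collections::HashSet;
use std::path::Path;

/// Parse a comma-separated list such as "pdf,png,jpg" into a set of
/// lowercase extensions. Blank entries and leading dots are ignored.
fn parse_allowed_types(raw: &str) -> HashSet<String> {
    raw.split(',')
        .map(|ext| ext.trim().trim_start_matches('.').to_ascii_lowercase())
        .filter(|ext| !ext.is_empty())
        .collect()
}

/// Return true when the file name has an extension in the allowed set.
/// Matching is case-insensitive (an assumption of this sketch).
fn is_allowed(filename: &str, allowed: &HashSet<String>) -> bool {
    Path::new(filename)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| allowed.contains(&ext.to_ascii_lowercase()))
        .unwrap_or(false)
}
```

With the example value shown above, `is_allowed("Scan.PDF", &allowed)` is true, while `archive.zip` and an extensionless `README` are rejected.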
### Frontend Setup
1. **Install Dependencies**:
```bash
cd frontend
npm install
```
2. **Development Mode**:
```bash
npm run dev
# Frontend available at http://localhost:5173
```
3. **Production Build**:
```bash
npm run build
# Built files in frontend/dist/
```
## Verifying Installation
After installation, verify everything is working:
1. **Check Backend Health**:
```bash
curl http://localhost:8000/api/health
```
2. **Access Web Interface**:
- Navigate to `http://localhost:8000`
- Log in with default credentials
- Upload a test document
3. **Verify Database Connection**:
```bash
# For Docker installation
docker exec -it readur-postgres-1 psql -U readur -c "\dt"
# For manual installation
psql -U readur_user -d readur -c "\dt"
```
4. **Check OCR Functionality**:
- Upload a PDF or image file
- Wait for processing to complete
- Search for text content from the uploaded file
## Next Steps
- [Configure Readur](configuration.md) for your specific needs
- Set up [production deployment](deployment.md) with SSL and proper security
- Read the [User Guide](user-guide.md) to learn about all features
- Explore the [API Reference](api-reference.md) for integrations

docs/user-guide.md Normal file

@ -0,0 +1,282 @@
# User Guide
A comprehensive guide to using Readur's features for document management, OCR processing, and search.
## Table of Contents
- [Getting Started](#getting-started)
- [Supported File Types](#supported-file-types)
- [Using the Interface](#using-the-interface)
- [Dashboard](#dashboard)
- [Document Management](#document-management)
- [Advanced Search](#advanced-search)
- [Folder Watching](#folder-watching)
- [Document Upload](#document-upload)
- [OCR Processing](#ocr-processing)
- [Search Features](#search-features)
- [Tags and Organization](#tags-and-organization)
- [User Settings](#user-settings)
- [Tips for Best Results](#tips-for-best-results)
## Getting Started
1. **First Login**:
- Navigate to `http://localhost:8000` (or your configured URL)
- Use the default admin credentials (username: `admin`, password: `readur2024`)
- **Important**: Change the default password immediately
2. **Initial Setup**:
- Configure your user preferences
- Set OCR language if different from English
- Adjust search and display settings
3. **Quick Start**:
- Upload your first document using drag-and-drop or the upload button
- Wait for OCR processing to complete
- Search for content within your documents
## Supported File Types
| Type | Extensions | OCR Support | Notes |
|------|-----------|-------------|-------|
| **PDF** | `.pdf` | ✅ | Text extraction + OCR for scanned pages |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.gif` | ✅ | Full OCR text extraction |
| **Text** | `.txt`, `.rtf` | ❌ | Direct text indexing |
| **Office** | `.doc`, `.docx` | ⚠️ | Limited support |
## Using the Interface
### Dashboard
The dashboard provides an overview of your document system:
- **Document Statistics**:
- Total documents in the system
- Storage usage breakdown
- OCR processing status
- Recent activity timeline
- **Quick Actions**:
- Upload new documents
- Quick search bar
- Access to recent documents
- System notifications
### Document Management
#### List/Grid View
- **List View**: Detailed document information in a table format
- **Grid View**: Visual thumbnails for quick browsing
- Toggle between views using the view selector in the top toolbar
#### Sorting Options
- Upload date (newest/oldest first)
- File name (A-Z/Z-A)
- File size (largest/smallest)
- Document type
- OCR status
#### Filtering
- By file type (PDF, images, text)
- By OCR status (completed, pending, failed)
- By date range
- By tags
- By source (uploaded, watched folder)
#### Bulk Actions
1. Select multiple documents using checkboxes
2. Available bulk actions:
- Delete selected documents
- Add/remove tags
- Export document list
- Reprocess OCR
### Advanced Search
Readur offers powerful search capabilities:
#### Full-Text Search
- Search within document content
- Automatic stemming and fuzzy matching
- Phrase search with quotes: `"exact phrase"`
- Exclude terms with minus: `-excluded`
#### Search Filters
- **Date Range**: Find documents from specific time periods
- **File Type**: Limit search to specific formats
- **File Size**: Filter by document size
- **OCR Status**: Only search processed documents
- **Tags**: Search within tagged documents
#### Search Syntax
```
invoice 2024 # Find documents with both terms
"quarterly report" # Exact phrase search
invoice -draft # Exclude drafts
tag:important invoice # Search within tagged documents
type:pdf contract # Search only PDFs
```
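For readers building integrations, the syntax above can be broken down into four kinds of tokens: plain terms, quoted phrases, excluded terms, and `field:value` filters. The sketch below shows one way to split a query string along those lines. It is an illustration only: the `Query` type and `parse_query` function are hypothetical and do not describe Readur's internal search parser.

```rust
/// Parts of a search query (hypothetical type for illustration).
#[derive(Debug, Default, PartialEq)]
struct Query {
    terms: Vec<String>,
    phrases: Vec<String>,
    excluded: Vec<String>,
    filters: Vec<(String, String)>,
}

/// Split a query into terms, "quoted phrases", -excluded terms,
/// and field:value filters.
fn parse_query(input: &str) -> Query {
    let mut q = Query::default();
    let mut rest = input.trim();
    while !rest.is_empty() {
        if let Some(after_quote) = rest.strip_prefix('"') {
            // Quoted phrase: take everything up to the closing quote
            // (or to the end of the input if the quote is never closed).
            let end = after_quote.find('"').unwrap_or(after_quote.len());
            let phrase = after_quote[..end].trim();
            if !phrase.is_empty() {
                q.phrases.push(phrase.to_string());
            }
            let after = &after_quote[end..];
            rest = after.strip_prefix('"').unwrap_or(after).trim_start();
            continue;
        }
        // Unquoted token: runs until the next whitespace character.
        let end = rest.find(char::is_whitespace).unwrap_or(rest.len());
        let token = &rest[..end];
        rest = rest[end..].trim_start();
        if let Some(term) = token.strip_prefix('-') {
            if !term.is_empty() {
                q.excluded.push(term.to_string());
            }
        } else if let Some((field, value)) = token.split_once(':') {
            if !field.is_empty() && !value.is_empty() {
                q.filters.push((field.to_string(), value.to_string()));
            } else {
                q.terms.push(token.to_string());
            }
        } else {
            q.terms.push(token.to_string());
        }
    }
    q
}
```

Under these rules, `tag:important invoice -draft "quarterly report"` yields one filter (`tag` = `important`), one term (`invoice`), one excluded term (`draft`), and one phrase (`quarterly report`).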
### Folder Watching
The folder watching feature automatically imports documents:
1. **Non-destructive**: Source files remain untouched
2. **Automatic Processing**: New files are detected and processed
3. **Configurable Intervals**: Adjust scan frequency
4. **Multiple Sources**: Watch local folders, network drives, cloud storage
#### Setting Up Watch Folders
1. Go to Settings → Sources
2. Add a new source with type "Local Folder"
3. Configure the path and scan interval
4. Enable/disable the source as needed
## Document Upload
### Manual Upload
1. Click the upload button or drag files to the upload area
2. Select one or multiple files
3. Add tags during upload (optional)
4. Click "Upload" to start processing
### Drag and Drop
- Drag files directly from your file manager
- Drop anywhere on the document list page
- Multiple files can be dropped at once
### Upload Limits
- Maximum file size: Configurable (default 50MB)
- Supported formats: See [Supported File Types](#supported-file-types)
- Batch upload: Up to 100 files at once
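A client that uploads through the API may want to check these limits before sending anything. The sketch below validates a batch of file sizes against the limits listed above. It is not Readur's server-side logic: the names are hypothetical, and treating "50MB" as 50 MiB is an assumption.

```rust
/// Default per-file limit from this guide, interpreted here as 50 MiB
/// (an assumption; the server may count 50 * 10^6 bytes instead).
const DEFAULT_MAX_FILE_BYTES: u64 = 50 * 1024 * 1024;
/// Batch limit from this guide.
const MAX_BATCH_FILES: usize = 100;

/// Reasons a batch would be rejected (hypothetical type).
#[derive(Debug, PartialEq)]
enum UploadError {
    EmptyBatch,
    TooManyFiles(usize),
    FileTooLarge { index: usize, size_bytes: u64 },
}

/// Check a batch of file sizes (in bytes) against the limits.
/// Reports the first oversized file by its position in the batch.
fn validate_batch(sizes: &[u64], max_file_bytes: u64) -> Result<(), UploadError> {
    if sizes.is_empty() {
        return Err(UploadError::EmptyBatch);
    }
    if sizes.len() > MAX_BATCH_FILES {
        return Err(UploadError::TooManyFiles(sizes.len()));
    }
    match sizes.iter().position(|&size| size > max_file_bytes) {
        Some(index) => Err(UploadError::FileTooLarge {
            index,
            size_bytes: sizes[index],
        }),
        None => Ok(()),
    }
}
```

Because the maximum file size is configurable, it is passed in as a parameter rather than hard-coded; `DEFAULT_MAX_FILE_BYTES` only covers the default case.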
## OCR Processing
### Automatic OCR
- Starts automatically after upload
- Processes documents in background
- Priority queue for smaller files
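The effect of prioritizing smaller files can be pictured with a min-heap keyed on file size. The sketch below is a simplified model under that assumption, not Readur's actual queue implementation; `OcrQueue` is a hypothetical name, and breaking ties by arrival order is a choice made for this example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy OCR queue: the smallest file is processed first, and files of
/// equal size are processed in arrival order (hypothetical type).
struct OcrQueue {
    // Reverse turns the max-heap into a min-heap on (size, sequence).
    heap: BinaryHeap<Reverse<(u64, u64, String)>>,
    next_seq: u64,
}

impl OcrQueue {
    fn new() -> Self {
        OcrQueue {
            heap: BinaryHeap::new(),
            next_seq: 0,
        }
    }

    /// Enqueue a file with its size in bytes.
    fn push(&mut self, filename: &str, size_bytes: u64) {
        self.heap
            .push(Reverse((size_bytes, self.next_seq, filename.to_string())));
        self.next_seq += 1;
    }

    /// Dequeue the next file to process, if any.
    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|Reverse((_, _, name))| name)
    }
}
```

In this model a 200 KB receipt uploaded after a 40 MB scan is still dequeued first, which is why small uploads tend to become searchable quickly even when large files are waiting.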
### OCR Settings
- **Language**: Select from 100+ languages
- **Preprocessing**: Enable image enhancement
- **Auto-rotation**: Correct document orientation
- **Quality**: Balance between speed and accuracy
### OCR Status Indicators
- 🟢 **Completed**: Full text extracted
- 🟡 **Processing**: OCR in progress
- 🔴 **Failed**: Error during processing
- ⚪ **Pending**: Waiting in queue
## Search Features
### Quick Search
- Available in the header on all pages
- Instant results as you type
- Shows top 5 matches with snippets
### Advanced Search Page
- Full search interface with all filters
- Export search results
- Save frequently used searches
- Search history
### Search Tips
1. Use quotes for exact phrases
2. Combine filters for precise results
3. Use wildcards: `inv*` matches invoice, inventory
4. Search in specific fields: `filename:report`
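The wildcard in tip 3 behaves like a prefix match. A minimal sketch of that idea follows, assuming a single trailing `*` and case-insensitive comparison; the function name is hypothetical and Readur's search engine may treat wildcards differently.

```rust
/// Match a word against a pattern that may end in a single `*`.
/// `inv*` matches any word starting with "inv"; without `*` the
/// whole word must match. Comparison ignores case.
fn matches_wildcard(pattern: &str, word: &str) -> bool {
    let pattern = pattern.to_lowercase();
    let word = word.to_lowercase();
    match pattern.strip_suffix('*') {
        Some(prefix) => word.starts_with(prefix),
        None => word == pattern,
    }
}
```

So `inv*` matches both `Invoice` and `inventory`, while the plain pattern `invoice` does not match `invoices`.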
## Tags and Organization
### Creating Tags
1. Select document(s)
2. Click "Add Tag"
3. Enter tag name or select existing
4. Tags are color-coded for easy identification
### Tag Management
- Rename tags globally
- Merge similar tags
- Delete unused tags
- Set tag colors
### Smart Collections
Create saved searches based on:
- Tag combinations
- Date ranges
- File types
- Custom criteria
## User Settings
### Personal Preferences
- **Display**: List/grid default view
- **Language**: Interface language
- **Time Zone**: For accurate timestamps
- **Notifications**: Email/in-app alerts
### OCR Preferences
- Default OCR language
- Processing priority
- Image preprocessing options
- Batch size limits
### Search Settings
- Results per page
- Default sort order
- Snippet length
- Fuzzy search threshold
## Tips for Best Results
### OCR Quality
1. **Higher Resolution**: 300+ DPI produces better OCR results
2. **Clean Scans**: Avoid skewed or dirty documents
3. **Good Lighting**: For photo captures, ensure even lighting
4. **Text Contrast**: Black text on white background works best
### File Organization
1. **Consistent Naming**: Use descriptive, consistent file names
2. **Regular Uploads**: Don't let documents pile up
3. **Use Tags**: Tag documents immediately after upload
4. **Folder Structure**: Organize watch folders logically
### Search Optimization
1. **Use Filters**: Combine text search with filters
2. **Save Searches**: Save frequently used search queries
3. **Learn Syntax**: Master search operators for better results
4. **Index Regularly**: Ensure all documents are processed
### Performance Tips
1. **Batch Processing**: Upload similar documents together
2. **Off-Peak Hours**: Schedule large uploads during low-usage times
3. **Monitor Queue**: Check OCR queue status regularly
4. **Clean Up**: Remove outdated documents periodically
## Troubleshooting
### Common Issues
**OCR Not Starting**
- Check file size limits
- Verify supported file format
- Ensure OCR service is running
**Search Not Finding Documents**
- Confirm OCR completed successfully
- Check search syntax
- Try broader search terms
**Slow Performance**
- Review concurrent OCR job settings
- Check system resources
- Consider increasing memory limits
## Next Steps
- Explore the [API Reference](api-reference.md) for automation
- Learn about [advanced configuration](configuration.md)
- Set up [automated workflows](WATCH_FOLDER.md)
- Optimize [OCR performance](dev/OCR_OPTIMIZATION_GUIDE.md)