# Architecture Overview

This document provides a comprehensive overview of Readur's architecture, design decisions, and technical implementation details.

## Table of Contents

- [System Architecture](#system-architecture)
- [Technology Stack](#technology-stack)
- [Component Overview](#component-overview)
  - [Backend (Rust/Axum)](#backend-rustaxum)
  - [Frontend (React)](#frontend-react)
  - [Database (PostgreSQL)](#database-postgresql)
  - [OCR Engine](#ocr-engine)
- [Data Flow](#data-flow)
- [Security Architecture](#security-architecture)
- [Performance Considerations](#performance-considerations)
- [Scalability](#scalability)
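- [Design Patterns](#design-patterns)
- [Future Architecture Considerations](#future-architecture-considerations)
- [Monitoring and Observability](#monitoring-and-observability)
- [Next Steps](#next-steps)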

## System Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  React Frontend │────│   Rust Backend  │────│  PostgreSQL DB  │
│   (Port 8000)   │    │   (Axum API)    │    │   (Port 5433)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │              ┌─────────────────┐             │
         └──────────────│  File Storage   │─────────────┘
                        │  + OCR Engine   │
                        └─────────────────┘
```

### High-Level Components

1. **Web Interface**: Modern React SPA with Material-UI
2. **API Server**: High-performance Rust backend using Axum
3. **Database**: PostgreSQL with full-text search capabilities
4. **File Storage**: Local or network-mounted filesystem
5. **OCR Processing**: Tesseract integration for text extraction
6. **Background Jobs**: Async task processing for OCR and file watching

## Technology Stack

### Backend
- **Language**: Rust (for performance and memory safety)
- **Web Framework**: Axum (async, fast, type-safe)
- **Database Access**: SQLx (an async SQL toolkit with compile-time checked queries, not an ORM)
- **Authentication**: JWT tokens with bcrypt password hashing
- **Async Runtime**: Tokio
- **Serialization**: Serde

### Frontend
- **Framework**: React 18 with TypeScript
- **UI Library**: Material-UI (MUI)
- **State Management**: React Context + Hooks
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router

### Infrastructure
- **Database**: PostgreSQL 14+ with pgvector extension
- **OCR**: Tesseract 4.0+
- **Container**: Docker with multi-stage builds
- **Reverse Proxy**: Nginx/Traefik compatible

## Component Overview

### Backend (Rust/Axum)

The backend is structured following clean architecture principles:

```
src/
├── main.rs              # Application entry and server setup
├── config.rs            # Configuration management
├── models.rs            # Domain models and DTOs
├── error.rs             # Error handling
├── auth.rs              # Authentication middleware
├── routes/              # HTTP route handlers
│   ├── auth.rs         # Authentication endpoints
│   ├── documents.rs    # Document CRUD operations
│   ├── search.rs       # Search functionality
│   └── ...
├── db/                  # Database operations
│   ├── documents.rs    # Document queries
│   ├── users.rs        # User queries
│   └── ...
├── services/            # Business logic
│   ├── ocr.rs          # OCR processing
│   ├── file_service.rs # File management
│   └── watcher.rs      # Folder watching
└── tests/               # Integration tests
```

Key design decisions:
- **Async-first**: All I/O operations are async
- **Type safety**: Leverages Rust's type system
- **Error handling**: Comprehensive error types
- **Dependency injection**: Clean separation of concerns
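As an illustration of the "comprehensive error types" decision, the sketch below shows one way a single application error enum can map to HTTP status codes. It is a minimal std-only example: `AppError`, its variants, and `status_code` are hypothetical names chosen for illustration and are not taken from Readur's `error.rs`.

```rust
use std::fmt;

/// Hypothetical application error type; the variants are illustrative only.
#[derive(Debug, PartialEq)]
pub enum AppError {
    NotFound(String),
    Unauthorized,
    Validation(String),
    Internal(String),
}

impl AppError {
    /// Map each variant to the HTTP status code a handler would return.
    pub fn status_code(&self) -> u16 {
        match self {
            AppError::NotFound(_) => 404,
            AppError::Unauthorized => 401,
            AppError::Validation(_) => 400,
            AppError::Internal(_) => 500,
        }
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::NotFound(what) => write!(f, "{what} not found"),
            AppError::Unauthorized => write!(f, "authentication required"),
            AppError::Validation(msg) => write!(f, "invalid input: {msg}"),
            AppError::Internal(msg) => write!(f, "internal error: {msg}"),
        }
    }
}

impl std::error::Error for AppError {}

/// Lets the `?` operator convert I/O failures into `AppError`.
impl From<std::io::Error> for AppError {
    fn from(err: std::io::Error) -> Self {
        AppError::Internal(err.to_string())
    }
}
```

Keeping the status mapping in one place means route handlers can return `Result<_, AppError>` and rely on `?` instead of building error responses by hand.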

### Frontend (React)

The frontend follows a component-based architecture:

```
frontend/src/
├── components/          # Reusable UI components
│   ├── DocumentList/
│   ├── SearchBar/
│   └── ...
├── pages/              # Page-level components
│   ├── Dashboard/
│   ├── Documents/
│   └── ...
├── services/           # API integration
│   ├── api.ts         # Base API client
│   ├── auth.ts        # Auth service
│   └── documents.ts   # Document service
├── hooks/              # Custom React hooks
├── contexts/           # React contexts
└── utils/              # Utility functions
```

### Database (PostgreSQL)

Schema design optimized for document management:

```sql
-- Core tables
users                  -- User accounts
documents              -- Document metadata
document_content       -- Extracted text content
document_tags          -- Many-to-many tags
sources                -- File sources (folders, S3, etc.)
ocr_queue              -- OCR processing queue

-- Search optimization
document_search_index  -- Full-text search index
```

Key features:
- **Full-text search**: PostgreSQL's built-in full-text search (`tsvector`/`tsquery`) with ranked results
- **JSONB fields**: Flexible metadata storage
- **Triggers**: Automatic search index updates
- **Views**: Optimized query patterns
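To make the full-text search feature concrete, here is a hedged sketch of what a ranked search query could look like when held as a Rust constant. The tables `documents` and `document_content` come from the list above, but the column names (`id`, `document_id`, `content`, `user_id`), the `'english'` configuration, and the `clamp_limit` helper are assumptions for illustration, not Readur's actual schema or code.

```rust
/// Hypothetical ranked search query. User input is only ever passed through
/// the placeholders ($1 = query text, $2 = user id, $3 = row limit).
pub const SEARCH_SQL: &str = "\
SELECT d.id,
       ts_rank(to_tsvector('english', c.content),
               plainto_tsquery('english', $1)) AS rank
FROM documents d
JOIN document_content c ON c.document_id = d.id
WHERE d.user_id = $2
  AND to_tsvector('english', c.content) @@ plainto_tsquery('english', $1)
ORDER BY rank DESC
LIMIT $3";

/// Clamp a client-supplied page size to a sane range before binding it as $3.
/// The bounds 1 and 100 are assumed values.
pub fn clamp_limit(requested: i64) -> i64 {
    requested.clamp(1, 100)
}
```

With SQLx, the three values would be supplied through `.bind(...)` calls rather than string formatting, which is also what keeps the query safe from SQL injection.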

### OCR Engine

OCR processing pipeline:

1. **File Detection**: New files detected via upload or folder watch
2. **Queue Management**: Files added to processing queue
3. **Pre-processing**: Image enhancement and optimization
4. **Text Extraction**: Tesseract OCR with language detection
5. **Post-processing**: Text cleaning and formatting
6. **Database Storage**: Indexed for search
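The six steps above can be modelled as a small state machine. The sketch below is one possible shape, using only the standard library; `OcrStage`, `run_to_completion`, and `clean_text` are hypothetical names, and the whitespace normalization merely stands in for whatever post-processing Readur actually performs.

```rust
/// Stages mirror the six pipeline steps listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OcrStage {
    Detected,
    Queued,
    PreProcessing,
    Extracting,
    PostProcessing,
    Stored,
}

impl OcrStage {
    /// Return the stage that follows this one, or `None` once text is stored.
    pub fn next(self) -> Option<OcrStage> {
        match self {
            OcrStage::Detected => Some(OcrStage::Queued),
            OcrStage::Queued => Some(OcrStage::PreProcessing),
            OcrStage::PreProcessing => Some(OcrStage::Extracting),
            OcrStage::Extracting => Some(OcrStage::PostProcessing),
            OcrStage::PostProcessing => Some(OcrStage::Stored),
            OcrStage::Stored => None,
        }
    }
}

/// Walk a job through every remaining stage and return the path taken.
pub fn run_to_completion(start: OcrStage) -> Vec<OcrStage> {
    let mut path = vec![start];
    let mut current = start;
    while let Some(next) = current.next() {
        path.push(next);
        current = next;
    }
    path
}

/// Example post-processing step: collapse runs of whitespace in raw OCR output.
pub fn clean_text(raw: &str) -> String {
    raw.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Encoding the order in `next` makes it impossible for a job to skip a stage, which is useful when the current stage is persisted in a queue table and resumed after a restart.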

## Data Flow

### Document Upload Flow

```mermaid
sequenceDiagram
    User->>Frontend: Upload Document
    Frontend->>API: POST /api/documents
    API->>FileStorage: Save File
    API->>Database: Create Document Record
    API->>OCRQueue: Add to Queue
    API-->>Frontend: Document Created
    OCRWorker->>OCRQueue: Poll for Jobs
    OCRWorker->>FileStorage: Read File
    OCRWorker->>Tesseract: Extract Text
    OCRWorker->>Database: Update with Content
    OCRWorker->>Frontend: WebSocket Update
```

### Search Flow

```mermaid
sequenceDiagram
    User->>Frontend: Enter Search Query
    Frontend->>API: GET /api/search
    API->>Database: Full-text Search
    Database-->>API: Ranked Results
    API-->>Frontend: Search Results
    Frontend-->>User: Display Results
```

## Security Architecture

### Authentication & Authorization

- **JWT Tokens**: Stateless authentication
- **Role-Based Access**: Admin, User roles
- **Token Refresh**: Automatic token renewal
- **Password Security**: bcrypt hashing with a per-password salt and a tunable cost factor (rounds)
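A role check on top of already-verified token claims might look like the sketch below. Token parsing and signature verification are deliberately left out. The `Claims` fields and the rule that admins may read any document are assumptions made for this example, not a description of Readur's actual policy.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    User,
}

/// Claims a verified token might carry; the field names are hypothetical.
pub struct Claims {
    pub user_id: u64,
    pub role: Role,
    /// Expiry as Unix seconds.
    pub expires_at: u64,
}

/// Decide whether the caller may access a document owned by `owner_id`.
/// Assumed policy: expired tokens are rejected, admins may access anything,
/// and regular users may access only their own documents.
pub fn can_access(claims: &Claims, owner_id: u64, now: u64) -> bool {
    if now >= claims.expires_at {
        return false;
    }
    claims.role == Role::Admin || claims.user_id == owner_id
}
```

Passing `now` in as a parameter keeps the check deterministic and easy to unit test.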

### API Security

- **CORS**: Configurable allowed origins
- **Rate Limiting**: Prevent abuse
- **Input Validation**: Comprehensive validation
- **SQL Injection Prevention**: Parameterized queries via SQLx
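Rate limiting can be implemented in several ways. The sketch below uses a token bucket, which is a common choice but is not confirmed to be what Readur uses. `TokenBucket` and its parameters are hypothetical, and time is passed in as seconds so the behaviour is deterministic.

```rust
/// Minimal token bucket: holds up to `capacity` tokens and refills at
/// `refill_per_sec`. Each allowed request consumes one token.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    /// Timestamp of the last refill, in seconds, supplied by the caller.
    last_refill: f64,
}

impl TokenBucket {
    /// Create a bucket that starts full.
    pub fn new(capacity: f64, refill_per_sec: f64, now: f64) -> Self {
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Return `true` if a request arriving at `now` is allowed.
    pub fn try_acquire(&mut self, now: f64) -> bool {
        // Refill in proportion to elapsed time, ignoring clock regressions.
        if now > self.last_refill {
            let elapsed = now - self.last_refill;
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In a real server one bucket would typically be kept per client key (IP address or user ID) behind a mutex or concurrent map.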

### File Security

- **Upload Validation**: File type and size checks
- **Virus Scanning**: Optional ClamAV integration
- **Access Control**: Document-level permissions
- **Secure Storage**: Filesystem permissions
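The file type and size checks can be sketched as a single validation function. The 50 MiB limit and the extension allow-list below are assumed values for illustration, and `validate_upload` and `UploadError` are hypothetical names. An extension check alone is weak, so a production implementation would usually inspect the file contents as well.

```rust
use std::path::Path;

/// Assumed upload limit (50 MiB); not taken from Readur's configuration.
const MAX_UPLOAD_BYTES: u64 = 50 * 1024 * 1024;
/// Assumed allow-list of extensions, compared case-insensitively.
const ALLOWED_EXTENSIONS: [&str; 5] = ["pdf", "png", "jpg", "jpeg", "txt"];

#[derive(Debug, PartialEq)]
pub enum UploadError {
    Empty,
    TooLarge { size_bytes: u64 },
    UnsupportedType(String),
}

/// Validate an upload by size and extension, returning the normalized
/// (lowercase) extension on success.
pub fn validate_upload(filename: &str, size_bytes: u64) -> Result<String, UploadError> {
    if size_bytes == 0 {
        return Err(UploadError::Empty);
    }
    if size_bytes > MAX_UPLOAD_BYTES {
        return Err(UploadError::TooLarge { size_bytes });
    }
    // Files without an extension yield an empty string and are rejected.
    let ext = Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();
    if ALLOWED_EXTENSIONS.iter().any(|allowed| *allowed == ext.as_str()) {
        Ok(ext)
    } else {
        Err(UploadError::UnsupportedType(ext))
    }
}
```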

## Performance Considerations

### Backend Optimization

- **Connection Pooling**: Database connection reuse
- **Async I/O**: Non-blocking operations
- **Caching**: In-memory caching for hot data
- **Query Optimization**: Indexed searches
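For the in-memory caching point, a time-to-live cache is one straightforward approach. The sketch below is a simplified, single-threaded illustration with hypothetical names (`TtlCache`); a real backend would wrap it in a lock or use a concurrent cache crate. The current time is passed in explicitly so expiry can be tested without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Simple TTL cache keyed by string; entries expire `ttl` after insertion.
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Store `value` under `key`, stamped with the caller-supplied time.
    pub fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Return a clone of the value if present and still fresh.
    /// Expired entries are evicted on access.
    pub fn get(&mut self, key: &str, now: Instant) -> Option<V> {
        let expired = match self.entries.get(key) {
            Some((stored_at, value)) => {
                if now.saturating_duration_since(*stored_at) < self.ttl {
                    return Some(value.clone());
                }
                true
            }
            None => false,
        };
        if expired {
            self.entries.remove(key);
        }
        None
    }
}
```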

### Frontend Optimization

- **Code Splitting**: Lazy loading of routes
- **Virtual Scrolling**: Large document lists
- **Memoization**: Prevent unnecessary re-renders
- **Service Workers**: Offline capability

### OCR Optimization

- **Parallel Processing**: Multiple concurrent jobs
- **Image Pre-processing**: Enhance OCR accuracy
- **Resource Limits**: Memory and CPU constraints
- **Queue Priority**: Smart job scheduling
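Priority-based job scheduling can be sketched with the standard library's `BinaryHeap`. The ordering rule here (higher priority first, then oldest first) is an assumption about what "smart job scheduling" could mean, and `OcrJob` with its fields is a hypothetical type rather than Readur's queue record.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical OCR job record.
#[derive(Debug, PartialEq, Eq)]
pub struct OcrJob {
    /// Higher values run first.
    pub priority: u8,
    /// Monotonic sequence number; lower means enqueued earlier.
    pub enqueued_seq: u64,
    pub document_id: u64,
}

impl Ord for OcrJob {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            // Reverse the sequence comparison so older jobs rank higher.
            .then_with(|| other.enqueued_seq.cmp(&self.enqueued_seq))
            // Final tiebreak keeps `Ord` consistent with the derived `Eq`.
            .then_with(|| self.document_id.cmp(&other.document_id))
    }
}

impl PartialOrd for OcrJob {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Pop every job from a max-heap and return document IDs in execution order.
pub fn drain_in_order(jobs: Vec<OcrJob>) -> Vec<u64> {
    let mut heap = BinaryHeap::from(jobs);
    let mut order = Vec::new();
    while let Some(job) = heap.pop() {
        order.push(job.document_id);
    }
    order
}
```

The sequence-number tiebreak prevents starvation among jobs of equal priority, since the heap alone gives no FIFO guarantee.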

## Scalability

### Horizontal Scaling

```yaml
# Multiple backend instances
backend-1:
  image: readur:latest
  environment:
    - INSTANCE_ID=1

backend-2:
  image: readur:latest
  environment:
    - INSTANCE_ID=2

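# Load balancer (Nginx). The upstream block is Nginx configuration, not YAML,
# so it is shown here as a comment and belongs in the Nginx config file:
nginx:
  # upstream backend {
  #   server backend-1:8000;
  #   server backend-2:8000;
  # }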
```

### Database Scaling

- **Read Replicas**: Distribute read load
- **Connection Pooling**: PgBouncer
- **Partitioning**: Time-based partitions
- **Archival**: Move old documents

### Storage Scaling

- **S3 Compatible**: Object storage support
- **CDN Integration**: Static file delivery
- **Distributed Storage**: GlusterFS/Ceph
- **Archive Tiering**: Hot/cold storage

## Design Patterns

### Backend Patterns

1. **Repository Pattern**: Database abstraction
2. **Service Layer**: Business logic separation
3. **Middleware Chain**: Request processing
4. **Error Boundaries**: Graceful error handling
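The repository and service-layer patterns can be illustrated with a trait and an in-memory implementation. This is a synchronous, std-only sketch: the real backend is async and backed by SQLx, and `DocumentRepository`, `InMemoryRepo`, and `DocumentService` are hypothetical names used only for this example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub id: u64,
    pub title: String,
}

/// Repository abstraction: the service depends on this trait,
/// not on a concrete database.
pub trait DocumentRepository {
    fn insert(&mut self, title: &str) -> Document;
    fn find(&self, id: u64) -> Option<Document>;
}

/// In-memory implementation, handy for tests.
#[derive(Default)]
pub struct InMemoryRepo {
    next_id: u64,
    docs: HashMap<u64, Document>,
}

impl DocumentRepository for InMemoryRepo {
    fn insert(&mut self, title: &str) -> Document {
        self.next_id += 1;
        let doc = Document {
            id: self.next_id,
            title: title.to_string(),
        };
        self.docs.insert(doc.id, doc.clone());
        doc
    }

    fn find(&self, id: u64) -> Option<Document> {
        self.docs.get(&id).cloned()
    }
}

/// Service layer: holds business rules and receives its repository by injection.
pub struct DocumentService<R: DocumentRepository> {
    repo: R,
}

impl<R: DocumentRepository> DocumentService<R> {
    pub fn new(repo: R) -> Self {
        Self { repo }
    }

    /// Business rule (assumed): titles are trimmed and must not be empty.
    pub fn create(&mut self, title: &str) -> Result<Document, String> {
        let trimmed = title.trim();
        if trimmed.is_empty() {
            return Err("title must not be empty".to_string());
        }
        Ok(self.repo.insert(trimmed))
    }

    pub fn get(&self, id: u64) -> Option<Document> {
        self.repo.find(id)
    }
}
```

Because the service is generic over the trait, swapping the in-memory repository for a PostgreSQL-backed one requires no change to the business logic.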

### Frontend Patterns

1. **Container/Presenter**: Component separation
2. **Custom Hooks**: Logic reuse
3. **Context Provider**: State management
4. **HOCs**: Cross-cutting concerns

### Database Patterns

1. **Soft Deletes**: Data preservation
2. **Audit Trails**: Change tracking
3. **Materialized Views**: Performance
4. **Event Sourcing**: Optional audit log

## Future Architecture Considerations

### Microservices Migration

Potential service boundaries:
- Authentication Service
- Document Service
- OCR Service
- Search Service
- Notification Service

### Event-Driven Architecture

- Message Queue (RabbitMQ/Kafka)
- Event Sourcing
- CQRS Pattern
- Async communication

### Cloud-Native Features

- Kubernetes deployment
- Service mesh (Istio)
- Distributed tracing
- Cloud storage integration

## Monitoring and Observability

### Metrics

- Prometheus metrics endpoint
- Custom business metrics
- Performance counters
- Resource utilization

### Logging

- Structured logging (JSON)
- Log aggregation ready
- Correlation IDs
- Debug levels

### Tracing

- OpenTelemetry support
- Distributed tracing
- Performance profiling
- Request tracking

## Next Steps

- Review [deployment options](../deployment.md)
- Explore [performance tuning](OCR_OPTIMIZATION_GUIDE.md)
- Understand [database design](DATABASE_GUARDRAILS.md)
- Learn about [testing strategy](TESTING.md)