# Architecture Overview

This document provides a comprehensive overview of Readur's architecture, design decisions, and technical implementation details.

## Table of Contents

- [System Architecture](#system-architecture)
- [Technology Stack](#technology-stack)
- [Component Overview](#component-overview)
  - [Backend (Rust/Axum)](#backend-rustaxum)
  - [Frontend (React)](#frontend-react)
  - [Database (PostgreSQL)](#database-postgresql)
  - [OCR Engine](#ocr-engine)
- [Data Flow](#data-flow)
- [Security Architecture](#security-architecture)
- [Performance Considerations](#performance-considerations)
- [Scalability](#scalability)
- [Design Patterns](#design-patterns)
- [Future Architecture Considerations](#future-architecture-considerations)
- [Monitoring and Observability](#monitoring-and-observability)
- [Next Steps](#next-steps)

## System Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ React Frontend  │────│  Rust Backend   │────│  PostgreSQL DB  │
│   (Port 8000)   │    │   (Axum API)    │    │   (Port 5433)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │             ┌─────────────────┐             │
         └─────────────│  File Storage   │─────────────┘
                       │  + OCR Engine   │
                       └─────────────────┘
```

### High-Level Components

1. **Web Interface**: Modern React SPA with Material-UI
2. **API Server**: High-performance Rust backend using Axum
3. **Database**: PostgreSQL with full-text search capabilities
4. **File Storage**: Local or network-mounted filesystem
5. **OCR Processing**: Tesseract integration for text extraction
6. **Background Jobs**: Async task processing for OCR and file watching

## Technology Stack

### Backend
- **Language**: Rust (for performance and memory safety)
- **Web Framework**: Axum (async, fast, type-safe)
- **Database Access**: SQLx (async SQL toolkit with compile-time checked queries)
- **Authentication**: JWT tokens with bcrypt password hashing
- **Async Runtime**: Tokio
- **Serialization**: Serde

### Frontend
- **Framework**: React 18 with TypeScript
- **UI Library**: Material-UI (MUI)
- **State Management**: React Context + Hooks
- **Build Tool**: Vite
- **HTTP Client**: Axios
- **Routing**: React Router

### Infrastructure
- **Database**: PostgreSQL 14+ with pgvector extension
- **OCR**: Tesseract 4.0+
- **Container**: Docker with multi-stage builds
- **Reverse Proxy**: Nginx/Traefik compatible

## Component Overview

### Backend (Rust/Axum)

The backend is structured following clean architecture principles:

```
src/
├── main.rs              # Application entry and server setup
├── config.rs            # Configuration management
├── models.rs            # Domain models and DTOs
├── error.rs             # Error handling
├── auth.rs              # Authentication middleware
├── routes/              # HTTP route handlers
│   ├── auth.rs          # Authentication endpoints
│   ├── documents.rs     # Document CRUD operations
│   ├── search.rs        # Search functionality
│   └── ...
├── db/                  # Database operations
│   ├── documents.rs     # Document queries
│   ├── users.rs         # User queries
│   └── ...
├── services/            # Business logic
│   ├── ocr.rs           # OCR processing
│   ├── file_service.rs  # File management
│   └── watcher.rs       # Folder watching
└── tests/               # Integration tests
```

Key design decisions:
- **Async-first**: All I/O operations are async
- **Type safety**: Leverages Rust's type system
- **Error handling**: Comprehensive error types
- **Dependency injection**: Clean separation of concerns

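
The "comprehensive error types" decision can be illustrated with a minimal sketch. The `AppError` enum, its variants, the status-code mapping, and `parse_document_id` are all hypothetical names chosen for illustration; they are not taken from Readur's `error.rs`. The sketch uses only the standard library and leaves out any Axum integration.

```rust
use std::fmt;

/// Hypothetical application error; the variants are illustrative assumptions.
#[derive(Debug, PartialEq)]
pub enum AppError {
    NotFound(String),
    Unauthorized,
    Validation(String),
    Internal(String),
}

impl AppError {
    /// Map each error to the HTTP status code a handler could return.
    pub fn status_code(&self) -> u16 {
        match self {
            AppError::NotFound(_) => 404,
            AppError::Unauthorized => 401,
            AppError::Validation(_) => 400,
            AppError::Internal(_) => 500,
        }
    }
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::NotFound(what) => write!(f, "{} not found", what),
            AppError::Unauthorized => write!(f, "authentication required"),
            AppError::Validation(msg) => write!(f, "invalid input: {}", msg),
            AppError::Internal(msg) => write!(f, "internal error: {}", msg),
        }
    }
}

impl std::error::Error for AppError {}

/// Low-level errors convert into the application error, so `?` works in services.
impl From<std::num::ParseIntError> for AppError {
    fn from(err: std::num::ParseIntError) -> Self {
        AppError::Validation(err.to_string())
    }
}

/// Example service function: parse a document ID from a path segment.
pub fn parse_document_id(raw: &str) -> Result<u64, AppError> {
    let id = raw.trim().parse::<u64>()?;
    Ok(id)
}
```

With a single error type, every handler can return `Result<T, AppError>` and the mapping to an HTTP response lives in one place.
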
### Frontend (React)

The frontend follows a component-based architecture:

```
frontend/src/
├── components/          # Reusable UI components
│   ├── DocumentList/
│   ├── SearchBar/
│   └── ...
├── pages/               # Page-level components
│   ├── Dashboard/
│   ├── Documents/
│   └── ...
├── services/            # API integration
│   ├── api.ts           # Base API client
│   ├── auth.ts          # Auth service
│   └── documents.ts     # Document service
├── hooks/               # Custom React hooks
├── contexts/            # React contexts
└── utils/               # Utility functions
```

### Database (PostgreSQL)

The schema is designed for document management:

```sql
-- Core tables
users                  -- User accounts
documents              -- Document metadata
document_content       -- Extracted text content
document_tags          -- Many-to-many tags
sources                -- File sources (folders, S3, etc.)
ocr_queue              -- OCR processing queue

-- Search optimization
document_search_index  -- Full-text search index
```

Key features:
- **Full-text search**: PostgreSQL's built-in full-text search
- **JSONB fields**: Flexible metadata storage
- **Triggers**: Automatic search index updates
- **Views**: Optimized query patterns

### OCR Engine

OCR processing pipeline:

1. **File Detection**: New files detected via upload or folder watch
2. **Queue Management**: Files added to processing queue
3. **Pre-processing**: Image enhancement and optimization
4. **Text Extraction**: Tesseract OCR with language detection
5. **Post-processing**: Text cleaning and formatting
6. **Database Storage**: Indexed for search

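
The six steps above can be sketched as a small state machine. This is an illustrative model only: `Stage`, `OcrJob`, and `OcrQueue` are hypothetical names, the queue is an in-memory FIFO rather than a database table, and the image processing and Tesseract calls are omitted.

```rust
use std::collections::VecDeque;

/// Stages mirroring the six pipeline steps listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    Detected,
    Queued,
    Preprocessed,
    Extracted,
    Postprocessed,
    Stored,
}

impl Stage {
    /// Return the stage that follows this one, or `None` once the job is stored.
    pub fn next(self) -> Option<Stage> {
        match self {
            Stage::Detected => Some(Stage::Queued),
            Stage::Queued => Some(Stage::Preprocessed),
            Stage::Preprocessed => Some(Stage::Extracted),
            Stage::Extracted => Some(Stage::Postprocessed),
            Stage::Postprocessed => Some(Stage::Stored),
            Stage::Stored => None,
        }
    }
}

#[derive(Debug)]
pub struct OcrJob {
    pub path: String,
    pub stage: Stage,
}

/// Hypothetical in-memory FIFO queue (assumption: a real queue would be persisted).
#[derive(Default)]
pub struct OcrQueue {
    jobs: VecDeque<OcrJob>,
}

impl OcrQueue {
    /// Steps 1-2: a detected file is placed on the queue.
    pub fn enqueue(&mut self, path: &str) {
        self.jobs.push_back(OcrJob {
            path: path.to_string(),
            stage: Stage::Queued,
        });
    }

    /// Steps 3-6: take the oldest job and advance it through every remaining stage.
    /// The actual work for each stage is left out of this sketch.
    pub fn process_next(&mut self) -> Option<OcrJob> {
        let mut job = self.jobs.pop_front()?;
        while let Some(stage) = job.stage.next() {
            job.stage = stage;
        }
        Some(job)
    }

    /// Number of jobs still waiting.
    pub fn len(&self) -> usize {
        self.jobs.len()
    }
}
```

Modelling the stages explicitly makes it straightforward to record where a job failed and to resume from that point.
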
## Data Flow

### Document Upload Flow

```mermaid
sequenceDiagram
    User->>Frontend: Upload Document
    Frontend->>API: POST /api/documents
    API->>FileStorage: Save File
    API->>Database: Create Document Record
    API->>OCRQueue: Add to Queue
    API-->>Frontend: Document Created
    OCRWorker->>OCRQueue: Poll for Jobs
    OCRWorker->>FileStorage: Read File
    OCRWorker->>Tesseract: Extract Text
    OCRWorker->>Database: Update with Content
    OCRWorker->>Frontend: WebSocket Update
```

### Search Flow

```mermaid
sequenceDiagram
    User->>Frontend: Enter Search Query
    Frontend->>API: GET /api/search
    API->>Database: Full-text Search
    Database->>API: Ranked Results
    API->>Frontend: Search Results
    Frontend->>User: Display Results
```

## Security Architecture

### Authentication & Authorization

- **JWT Tokens**: Stateless authentication
- **Role-Based Access**: Admin and User roles
- **Token Refresh**: Automatic token renewal
- **Password Security**: bcrypt hashing with a per-password salt and configurable cost (rounds)

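
As a rough sketch of role-based access with the two roles named above, the check below combines a role with document ownership. The policy itself (admins may do anything, users may act only on their own documents) and the names `Principal`, `Action`, and `is_allowed` are assumptions made for illustration, not Readur's actual rules.

```rust
/// Roles named in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    User,
}

/// Hypothetical authenticated caller, e.g. built from validated JWT claims.
#[derive(Debug)]
pub struct Principal {
    pub user_id: u64,
    pub role: Role,
}

/// Hypothetical actions a request might perform.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    ReadDocument,
    DeleteDocument,
    ManageUsers,
}

/// Assumed policy: admins may do anything; users may act only on documents
/// they own and may never manage users.
pub fn is_allowed(principal: &Principal, action: Action, owner_id: u64) -> bool {
    match (principal.role, action) {
        (Role::Admin, _) => true,
        (Role::User, Action::ManageUsers) => false,
        (Role::User, _) => principal.user_id == owner_id,
    }
}
```

Keeping the decision in one pure function makes the policy easy to test without starting a server.
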
### API Security

- **CORS**: Configurable allowed origins
- **Rate Limiting**: Prevent abuse
- **Input Validation**: Comprehensive validation
- **SQL Injection Protection**: Parameterized queries via SQLx

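
One common way to implement rate limiting is a token bucket. The sketch below is a generic, standard-library-only version; `TokenBucket` and its parameters are hypothetical, and the document does not state which algorithm Readur actually uses. Time is passed in explicitly so the behaviour is deterministic in tests.

```rust
use std::time::Instant;

/// Generic token bucket: holds up to `capacity` tokens and regains
/// `refill_per_sec` tokens every second.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Create a full bucket whose clock starts at `now`.
    pub fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity: f64::from(capacity),
            tokens: f64::from(capacity),
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Try to consume one token at time `now`. A `false` result means the
    /// request should be rejected (for example with HTTP 429).
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        if now > self.last_refill {
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = now;
        }
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In a server, one bucket would typically be kept per client key (IP address or user ID) behind a lock or in a concurrent map.
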
### File Security

- **Upload Validation**: File type and size checks
- **Virus Scanning**: Optional ClamAV integration
- **Access Control**: Document-level permissions
- **Secure Storage**: Filesystem permissions

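
Upload validation can be sketched as a pure function over the file name and size. The size limit, the extension allow-list, and the names `validate_upload` and `UploadError` are assumptions for illustration; real values would come from configuration.

```rust
use std::path::Path;

/// Assumed limit (50 MiB); a real value would come from configuration.
pub const MAX_UPLOAD_BYTES: u64 = 50 * 1024 * 1024;

/// Assumed allow-list of lowercase file extensions.
pub const ALLOWED_EXTENSIONS: [&str; 5] = ["pdf", "png", "jpg", "jpeg", "txt"];

#[derive(Debug, PartialEq)]
pub enum UploadError {
    TooLarge { size: u64, limit: u64 },
    UnsupportedType(String),
    MissingExtension,
}

/// Check size first, then the (case-insensitive) file extension.
pub fn validate_upload(filename: &str, size: u64) -> Result<(), UploadError> {
    if size > MAX_UPLOAD_BYTES {
        return Err(UploadError::TooLarge {
            size,
            limit: MAX_UPLOAD_BYTES,
        });
    }
    let ext = Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .ok_or(UploadError::MissingExtension)?;
    if ALLOWED_EXTENSIONS.contains(&ext.as_str()) {
        Ok(())
    } else {
        Err(UploadError::UnsupportedType(ext))
    }
}
```

An extension check alone is easy to bypass, so a production check would normally also inspect the file's leading bytes before handing it to the OCR pipeline.
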
## Performance Considerations

### Backend Optimization

- **Connection Pooling**: Database connection reuse
- **Async I/O**: Non-blocking operations
- **Caching**: In-memory caching for hot data
- **Query Optimization**: Indexed searches

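
In-memory caching for hot data can be as simple as a map with expiry times. The `TtlCache` below is a hypothetical, single-threaded sketch using only the standard library; it is not Readur's cache, and a shared cache in an async server would additionally need a lock or a concurrent map.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// Minimal time-to-live cache: each entry stores its value and expiry time.
pub struct TtlCache<K, V> {
    ttl: Duration,
    entries: HashMap<K, (V, Instant)>,
}

impl<K: Hash + Eq, V: Clone> TtlCache<K, V> {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Store `value` under `key`; it expires `ttl` after `now`.
    pub fn insert(&mut self, key: K, value: V, now: Instant) {
        self.entries.insert(key, (value, now + self.ttl));
    }

    /// Return a clone of a live entry. Expired entries are evicted on access.
    pub fn get(&mut self, key: &K, now: Instant) -> Option<V> {
        let expired = match self.entries.get(key) {
            Some((value, expires_at)) if now < *expires_at => return Some(value.clone()),
            Some(_) => true,
            None => false,
        };
        if expired {
            self.entries.remove(key);
        }
        None
    }
}
```

Passing `now` explicitly keeps the cache deterministic under test; callers in production code would pass `Instant::now()`.
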
### Frontend Optimization

- **Code Splitting**: Lazy loading of routes
- **Virtual Scrolling**: Large document lists
- **Memoization**: Prevent unnecessary re-renders
- **Service Workers**: Offline capability

### OCR Optimization

- **Parallel Processing**: Multiple concurrent jobs
- **Image Pre-processing**: Enhance OCR accuracy
- **Resource Limits**: Memory and CPU constraints
- **Queue Priority**: Smart job scheduling

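
Queue priority can be modelled with a binary heap, as in the sketch below. It assumes that higher numeric priority runs first and that jobs with equal priority run in arrival order; `QueuedJob`, `PriorityQueue`, and this ordering rule are illustrative assumptions rather than Readur's scheduler.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, PartialEq, Eq)]
pub struct QueuedJob {
    /// Higher values are processed first.
    pub priority: u8,
    /// Enqueue order; lower values win among equal priorities.
    pub sequence: u64,
    pub document: String,
}

impl Ord for QueuedJob {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            // Reverse the sequence comparison so earlier jobs rank higher.
            .then_with(|| other.sequence.cmp(&self.sequence))
            .then_with(|| self.document.cmp(&other.document))
    }
}

impl PartialOrd for QueuedJob {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Max-heap of jobs plus a counter that records arrival order.
#[derive(Default)]
pub struct PriorityQueue {
    heap: BinaryHeap<QueuedJob>,
    next_sequence: u64,
}

impl PriorityQueue {
    pub fn push(&mut self, document: &str, priority: u8) {
        let job = QueuedJob {
            priority,
            sequence: self.next_sequence,
            document: document.to_string(),
        };
        self.next_sequence += 1;
        self.heap.push(job);
    }

    /// Remove and return the job that should run next.
    pub fn pop(&mut self) -> Option<QueuedJob> {
        self.heap.pop()
    }
}
```

The sequence counter is what prevents equal-priority jobs from being reordered, since `BinaryHeap` itself gives no stability guarantee.
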
## Scalability

### Horizontal Scaling

```yaml
# Multiple backend instances
backend-1:
  image: readur:latest
  environment:
    - INSTANCE_ID=1

backend-2:
  image: readur:latest
  environment:
    - INSTANCE_ID=2

# Load balancer: the nginx configuration (not YAML) defines an upstream
# that spreads requests across the instances above, for example:
#   upstream backend {
#     server backend-1:8000;
#     server backend-2:8000;
#   }
```

### Database Scaling

- **Read Replicas**: Distribute read load
- **Connection Pooling**: PgBouncer
- **Partitioning**: Time-based partitions
- **Archival**: Move old documents

### Storage Scaling

- **S3 Compatible**: Object storage support
- **CDN Integration**: Static file delivery
- **Distributed Storage**: GlusterFS/Ceph
- **Archive Tiering**: Hot/cold storage

## Design Patterns

### Backend Patterns

1. **Repository Pattern**: Database abstraction
2. **Service Layer**: Business logic separation
3. **Middleware Chain**: Request processing
4. **Error Boundaries**: Graceful error handling

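
The repository and service-layer patterns can be sketched together. In the example below, the `DocumentRepository` trait, the in-memory implementation, and `register_upload` are hypothetical names for illustration; a real repository would wrap SQLx queries instead of a `HashMap`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Document {
    pub id: u64,
    pub filename: String,
}

/// Repository abstraction: services depend on this trait, not on a database.
pub trait DocumentRepository {
    fn insert(&mut self, filename: &str) -> Document;
    fn find_by_id(&self, id: u64) -> Option<Document>;
}

/// In-memory implementation, useful for tests (assumption: the production
/// implementation would issue SQL queries instead).
#[derive(Default)]
pub struct InMemoryDocuments {
    next_id: u64,
    rows: HashMap<u64, Document>,
}

impl DocumentRepository for InMemoryDocuments {
    fn insert(&mut self, filename: &str) -> Document {
        self.next_id += 1;
        let doc = Document {
            id: self.next_id,
            filename: filename.to_string(),
        };
        self.rows.insert(doc.id, doc.clone());
        doc
    }

    fn find_by_id(&self, id: u64) -> Option<Document> {
        self.rows.get(&id).cloned()
    }
}

/// Service-layer function written against the trait: it normalizes the
/// file name and returns the new document's ID.
pub fn register_upload(repo: &mut impl DocumentRepository, filename: &str) -> u64 {
    repo.insert(filename.trim()).id
}
```

Because the service only sees the trait, the same business logic runs unchanged against the in-memory store in tests and a database-backed store in production.
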
### Frontend Patterns

1. **Container/Presenter**: Component separation
2. **Custom Hooks**: Logic reuse
3. **Context Provider**: State management
4. **HOCs**: Cross-cutting concerns

### Database Patterns

1. **Soft Deletes**: Data preservation
2. **Audit Trails**: Change tracking
3. **Materialized Views**: Performance
4. **Event Sourcing**: Optional audit log

## Future Architecture Considerations

### Microservices Migration

Potential service boundaries:
- Authentication Service
- Document Service
- OCR Service
- Search Service
- Notification Service

### Event-Driven Architecture

- Message Queue (RabbitMQ/Kafka)
- Event Sourcing
- CQRS Pattern
- Async communication

### Cloud-Native Features

- Kubernetes deployment
- Service mesh (Istio)
- Distributed tracing
- Cloud storage integration

## Monitoring and Observability

### Metrics

- Prometheus metrics endpoint
- Custom business metrics
- Performance counters
- Resource utilization

### Logging

- Structured logging (JSON)
- Log aggregation ready
- Correlation IDs
- Debug levels

### Tracing

- OpenTelemetry support
- Distributed tracing
- Performance profiling
- Request tracking

## Next Steps

- Review [deployment options](deployment.md)
- Explore [performance tuning](OCR_OPTIMIZATION_GUIDE.md)
- Understand [database design](DATABASE_GUARDRAILS.md)
- Learn about [testing strategy](TESTING.md)