# S3 Storage Backend Guide for Readur
## Overview
Starting with version 2.5.4, Readur supports Amazon S3 and S3-compatible storage services as an alternative to local filesystem storage. The backend works with AWS S3, MinIO, Wasabi, Backblaze B2, and other S3-compatible services, and includes automatic multipart upload for files larger than 100MB, structured storage paths organized by year and month, and automatic retries with exponential backoff.
This guide provides comprehensive instructions for configuring, deploying, and managing Readur with S3 storage.
### Key Benefits
- **Scalability**: Unlimited storage capacity without local disk constraints
- **Durability**: 99.999999999% (11 9's) durability with AWS S3
- **Cost-Effective**: Pay only for what you use with various storage tiers
- **Global Access**: Access documents from anywhere with proper credentials
- **Backup**: Built-in versioning and cross-region replication capabilities
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Configuration](#configuration)
3. [Migration from Local Storage](#migration-from-local-storage)
4. [Storage Structure](#storage-structure)
5. [Performance Optimization](#performance-optimization)
6. [Troubleshooting](#troubleshooting)
7. [Best Practices](#best-practices)
## Prerequisites
Before configuring S3 storage, ensure you have:
1. **S3 Bucket Access**
- An AWS S3 bucket or S3-compatible service (MinIO, Wasabi, Backblaze B2, etc.)
- Access Key ID and Secret Access Key with appropriate permissions
- Bucket name and region information
2. **Required S3 Permissions**
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:AbortMultipartUpload"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```
The multipart calls (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`) and the `HeadObject`/`HeadBucket` checks are covered by `s3:PutObject`, `s3:GetObject`, and `s3:ListBucket`, so no additional IAM actions are needed.
3. **Readur Build Requirements**
- Readur must be compiled with the `s3` feature flag enabled
- Build command: `cargo build --release --features s3`
## Configuration
### Environment Variables
Configure S3 storage by setting the following environment variables:
```bash
# Enable S3 storage backend
S3_ENABLED=true
# Required S3 credentials
S3_BUCKET_NAME=readur-documents
S3_ACCESS_KEY_ID=your-access-key-id
S3_SECRET_ACCESS_KEY=your-secret-access-key
S3_REGION=us-east-1
# Optional: For S3-compatible services (MinIO, Wasabi, etc.)
S3_ENDPOINT=https://s3-compatible-endpoint.com
```
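If you run Readur in a container, the same variables can be passed at start-up. A minimal sketch, assuming an image tagged `readur:latest` and the default port 8000 from `SERVER_ADDRESS` (database and other required variables omitted for brevity):
```bash
docker run -d --name readur \
  -e S3_ENABLED=true \
  -e S3_BUCKET_NAME=readur-documents \
  -e S3_ACCESS_KEY_ID=your-access-key-id \
  -e S3_SECRET_ACCESS_KEY=your-secret-access-key \
  -e S3_REGION=us-east-1 \
  -p 8000:8000 \
  readur:latest
```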
### Configuration File Example (.env)
```bash
# Database Configuration
DATABASE_URL=postgresql://readur:password@localhost/readur
# Server Configuration
SERVER_ADDRESS=0.0.0.0:8000
JWT_SECRET=your-secure-jwt-secret
# S3 Storage Configuration
S3_ENABLED=true
S3_BUCKET_NAME=readur-production
S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
S3_REGION=us-west-2
# Optional S3 endpoint for compatible services
# S3_ENDPOINT=https://minio.example.com
# Upload Configuration
UPLOAD_PATH=./temp_uploads
MAX_FILE_SIZE_MB=500
```
### S3-Compatible Services Configuration
#### MinIO
```bash
S3_ENABLED=true
S3_BUCKET_NAME=readur-bucket
S3_ACCESS_KEY_ID=minioadmin
S3_SECRET_ACCESS_KEY=minioadmin
S3_REGION=us-east-1
S3_ENDPOINT=http://localhost:9000
```
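The bucket must exist before Readur can use it. With MinIO it can be created using the `mc` client; a short example assuming the endpoint and default credentials shown above:
```bash
# Register the MinIO endpoint under an alias and create the bucket
mc alias set readur-minio http://localhost:9000 minioadmin minioadmin
mc mb readur-minio/readur-bucket
mc ls readur-minio
```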
#### Wasabi
```bash
S3_ENABLED=true
S3_BUCKET_NAME=readur-bucket
S3_ACCESS_KEY_ID=your-wasabi-key
S3_SECRET_ACCESS_KEY=your-wasabi-secret
S3_REGION=us-east-1
S3_ENDPOINT=https://s3.wasabisys.com
```
#### Backblaze B2
```bash
S3_ENABLED=true
S3_BUCKET_NAME=readur-bucket
S3_ACCESS_KEY_ID=your-b2-key-id
S3_SECRET_ACCESS_KEY=your-b2-application-key
S3_REGION=us-west-002
S3_ENDPOINT=https://s3.us-west-002.backblazeb2.com
```
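Whichever provider you use, it is worth confirming that the credentials and endpoint work before starting Readur. A quick check with the AWS CLI, substituting the endpoint and bucket from your configuration (the Wasabi values above are used here only as an example):
```bash
export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
# Confirm the bucket is reachable and writable through the configured endpoint
aws s3api head-bucket --bucket readur-bucket --endpoint-url https://s3.wasabisys.com
echo "connectivity check" | aws s3 cp - s3://readur-bucket/readur-connectivity-check.txt \
  --endpoint-url https://s3.wasabisys.com
aws s3 rm s3://readur-bucket/readur-connectivity-check.txt --endpoint-url https://s3.wasabisys.com
```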
## Migration from Local Storage
### Using the Migration Tool
Readur includes a migration utility to transfer existing local files to S3:
1. **Prepare for Migration**
```bash
# Backup your database first
pg_dump readur > readur_backup.sql
# Set S3 configuration
export S3_ENABLED=true
export S3_BUCKET_NAME=readur-production
export S3_ACCESS_KEY_ID=your-key
export S3_SECRET_ACCESS_KEY=your-secret
export S3_REGION=us-east-1
```
2. **Run Dry Run First**
```bash
# Preview what will be migrated
cargo run --bin migrate_to_s3 --features s3 -- --dry-run
```
3. **Execute Migration**
```bash
# Migrate all files
cargo run --bin migrate_to_s3 --features s3
# Migrate with options:
#   --delete-local     delete local files after successful upload
#   --limit 100        limit to 100 files (for testing)
#   --enable-rollback  enable automatic rollback on failure
cargo run --bin migrate_to_s3 --features s3 -- \
  --delete-local \
  --limit 100 \
  --enable-rollback
```
4. **Migrate Specific User's Files**
```bash
cargo run --bin migrate_to_s3 --features s3 -- \
  --user-id 550e8400-e29b-41d4-a716-446655440000
```
5. **Resume Failed Migration**
```bash
# Resume from specific document ID
cargo run --bin migrate_to_s3 --features s3 -- \
  --resume-from 550e8400-e29b-41d4-a716-446655440001
```
### Migration Process Details
The migration tool performs the following steps:
1. Connects to database and S3
2. Identifies all documents with local file paths
3. For each document:
- Reads the local file
- Uploads to S3 with structured path
- Updates database with S3 path
- Migrates associated thumbnails and processed images
- Optionally deletes local files
4. Tracks migration state for recovery
5. Supports rollback on failure
### Post-Migration Verification
```sql
-- Check migrated documents
SELECT
  COUNT(*) FILTER (WHERE file_path LIKE 's3://%') AS s3_documents,
  COUNT(*) FILTER (WHERE file_path NOT LIKE 's3://%') AS local_documents
FROM documents;
-- Find any remaining local files
SELECT id, filename, file_path
FROM documents
WHERE file_path NOT LIKE 's3://%'
LIMIT 10;
```
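As a rough cross-check, the number of objects under the `documents/` prefix should line up with the database count. A sketch using the AWS CLI and `psql`, assuming the bucket name and `DATABASE_URL` from the configuration examples above:
```bash
# Count document objects in the bucket
aws s3 ls s3://readur-production/documents/ --recursive | wc -l
# Count documents the database records as stored on S3
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM documents WHERE file_path LIKE 's3://%';"
```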
## Storage Structure
### S3 Path Organization
Readur uses a structured path format in S3:
```
bucket-name/
├── documents/
│   └── {user_id}/
│       └── {year}/
│           └── {month}/
│               └── {document_id}.{extension}
├── thumbnails/
│   └── {user_id}/
│       └── {document_id}_thumb.jpg
└── processed_images/
    └── {user_id}/
        └── {document_id}_processed.png
```
### Example Paths
```
readur-production/
├── documents/
│   └── 550e8400-e29b-41d4-a716-446655440000/
│       └── 2024/
│           └── 03/
│               ├── 123e4567-e89b-12d3-a456-426614174000.pdf
│               └── 987fcdeb-51a2-43f1-b321-123456789abc.docx
├── thumbnails/
│   └── 550e8400-e29b-41d4-a716-446655440000/
│       ├── 123e4567-e89b-12d3-a456-426614174000_thumb.jpg
│       └── 987fcdeb-51a2-43f1-b321-123456789abc_thumb.jpg
└── processed_images/
    └── 550e8400-e29b-41d4-a716-446655440000/
        ├── 123e4567-e89b-12d3-a456-426614174000_processed.png
        └── 987fcdeb-51a2-43f1-b321-123456789abc_processed.png
```
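Because each key embeds the user ID and upload date, listings can be scoped narrowly. For example, to list one user's documents for March 2024 using the example IDs above:
```bash
aws s3 ls s3://readur-production/documents/550e8400-e29b-41d4-a716-446655440000/2024/03/
```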
## Performance Optimization
### Multipart Upload
Readur automatically uses multipart upload for files larger than 100MB:
- **Chunk Size**: 16MB per part
- **Automatic Retry**: Exponential backoff with up to 3 retries
- **Progress Tracking**: Real-time upload progress via WebSocket
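If a large transfer is interrupted past the retry limit, the bucket can be left holding incomplete multipart uploads that still incur storage costs. They can be listed and, if stale, aborted with the AWS CLI (the key and upload ID below are placeholders taken from the listing output):
```bash
# Show any multipart uploads that never completed
aws s3api list-multipart-uploads --bucket readur-production
# Abort a stale upload using its key and upload ID
aws s3api abort-multipart-upload \
  --bucket readur-production \
  --key documents/550e8400-e29b-41d4-a716-446655440000/2024/03/example.pdf \
  --upload-id EXAMPLE_UPLOAD_ID
```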
### Network Optimization
1. **Region Selection**: Choose S3 region closest to your Readur server
2. **Transfer Acceleration**: Enable S3 Transfer Acceleration for global users (see the example after this list)
3. **CloudFront CDN**: Use CloudFront for serving frequently accessed documents
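Transfer Acceleration has to be switched on per bucket before it takes effect; a one-line example for the bucket name used elsewhere in this guide:
```bash
aws s3api put-bucket-accelerate-configuration \
  --bucket readur-production \
  --accelerate-configuration Status=Enabled
```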
### Caching Strategy
```nginx
# Nginx caching for S3-backed document downloads.
# Assumes a cache zone is defined in the http block, e.g.:
#   proxy_cache_path /var/cache/nginx/readur levels=1:2 keys_zone=readur_docs:10m max_size=1g;
# The proxy_pass target matches the default SERVER_ADDRESS port (8000).
location /api/documents/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_cache readur_docs;
    proxy_cache_valid 200 1h;
    proxy_cache_valid 404 1m;
    proxy_cache_bypass $http_authorization;
    add_header X-Cache-Status $upstream_cache_status;
}
```
## Troubleshooting
### Common Issues and Solutions
#### 1. S3 Connection Errors
**Error**: "Failed to access S3 bucket"
**Solution**:
```bash
# Verify credentials
aws s3 ls s3://your-bucket-name --profile readur
# Check IAM permissions
aws iam get-user-policy --user-name readur-user --policy-name ReadurS3Policy
# Test connectivity
curl -I https://s3.amazonaws.com/your-bucket-name
```
#### 2. Upload Failures
**Error**: "Failed to store file: RequestTimeout"
**Solution**:
- Check network connectivity
- Verify S3 endpoint configuration
- Increase timeout values if using S3-compatible service
- Monitor S3 request metrics in AWS CloudWatch
#### 3. Permission Denied
**Error**: "AccessDenied: Access Denied"
**Solution**:
```bash
# Verify bucket policy
aws s3api get-bucket-policy --bucket your-bucket-name
# Check object ACLs
aws s3api get-object-acl --bucket your-bucket-name --key test-object
# Ensure CORS configuration for web access
aws s3api put-bucket-cors --bucket your-bucket-name --cors-configuration file://cors.json
```
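The `cors.json` referenced above is not shipped with Readur; a minimal example, assuming the web UI is served from `https://readur.example.com`:
```bash
cat > cors.json <<'EOF'
{
  "CORSRules": [{
    "AllowedOrigins": ["https://readur.example.com"],
    "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
    "AllowedHeaders": ["*"],
    "MaxAgeSeconds": 3000
  }]
}
EOF
```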
#### 4. Migration Stuck
**Problem**: Migration process hangs or fails repeatedly
**Solution**:
```bash
# Check migration state
cat migration_state.json | jq '.failed_migrations'
# Resume from last successful migration
LAST_SUCCESS=$(cat migration_state.json | jq -r '.completed_migrations[-1].document_id')
cargo run --bin migrate_to_s3 --features s3 -- --resume-from $LAST_SUCCESS
# Force rollback if needed
cargo run --bin migrate_to_s3 --features s3 -- --rollback
```
### Debugging S3 Operations
Enable detailed S3 logging:
```bash
# Set environment variables for debugging
export RUST_LOG=readur=debug,aws_sdk_s3=debug
export AWS_SDK_LOAD_CONFIG=true
# Run Readur with debug logging
cargo run --features s3
```
### Performance Monitoring
Monitor S3 performance metrics:
```sql
-- Query document upload times
SELECT
  DATE(created_at) AS upload_date,
  AVG(file_size / 1024.0 / 1024.0) AS avg_size_mb,
  COUNT(*) AS documents_uploaded,
  AVG(EXTRACT(EPOCH FROM (updated_at - created_at))) AS avg_processing_time_seconds
FROM documents
WHERE file_path LIKE 's3://%'
GROUP BY DATE(created_at)
ORDER BY upload_date DESC;
```
## Best Practices
### 1. Security
- **Encryption**: Enable S3 server-side encryption (SSE-S3 or SSE-KMS)
- **Access Control**: Use IAM roles instead of access keys when possible
- **Bucket Policies**: Implement least-privilege bucket policies
- **VPC Endpoints**: Use VPC endpoints for private S3 access (see the example after the encryption snippet below)
```bash
# Enable default encryption on bucket
aws s3api put-bucket-encryption \
  --bucket readur-production \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }]
  }'
```
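For the VPC endpoint recommendation above, a gateway endpoint keeps S3 traffic off the public internet; an illustrative command with placeholder VPC and route table IDs:
```bash
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-west-2.s3 \
  --route-table-ids rtb-0123456789abcdef0
```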
### 2. Cost Optimization
- **Lifecycle Policies**: Archive old documents to Glacier
- **Intelligent-Tiering**: Enable for automatic cost optimization
- **Request Metrics**: Monitor and optimize S3 request patterns
```json
{
  "Rules": [{
    "ID": "ArchiveOldDocuments",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "Transitions": [{
      "Days": 90,
      "StorageClass": "GLACIER"
    }],
    "NoncurrentVersionTransitions": [{
      "NoncurrentDays": 30,
      "StorageClass": "GLACIER"
    }]
  }]
}
```
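One way to apply the policy above is to save it as `lifecycle.json` and push it with the AWS CLI:
```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket readur-production \
  --lifecycle-configuration file://lifecycle.json
```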
### 3. Reliability
- **Versioning**: Enable S3 versioning for document recovery
- **Cross-Region Replication**: Set up for disaster recovery
- **Backup Strategy**: Regular backups to separate bucket or region
```bash
# Enable versioning
aws s3api put-bucket-versioning \
  --bucket readur-production \
  --versioning-configuration Status=Enabled
# Set up replication
aws s3api put-bucket-replication \
  --bucket readur-production \
  --replication-configuration file://replication.json
```
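The `replication.json` referenced above has to be supplied by you. A minimal sketch, assuming versioning is already enabled on both buckets and using placeholder names for the replication role and destination bucket:
```bash
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/readur-replication-role",
  "Rules": [{
    "ID": "replicate-all",
    "Prefix": "",
    "Status": "Enabled",
    "Destination": {
      "Bucket": "arn:aws:s3:::readur-production-replica"
    }
  }]
}
EOF
```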
### 4. Monitoring
Set up CloudWatch alarms for:
- High error rates
- Unusual request patterns
- Storage quota approaching
- Failed multipart uploads
```bash
# Create CloudWatch alarm for S3 4xx errors
# (requires S3 request metrics to be enabled on the bucket)
aws cloudwatch put-metric-alarm \
  --alarm-name readur-s3-errors \
  --alarm-description "Alert on S3 4xx errors" \
  --metric-name 4xxErrors \
  --namespace AWS/S3 \
  --dimensions Name=BucketName,Value=readur-production Name=FilterId,Value=EntireBucket \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold
```
### 5. Compliance
- **Data Residency**: Ensure S3 region meets data residency requirements
- **Audit Logging**: Enable S3 access logging and AWS CloudTrail
- **Retention Policies**: Implement compliant data retention policies
- **GDPR Compliance**: Implement proper data deletion procedures
```bash
# Enable access logging
aws s3api put-bucket-logging \
  --bucket readur-production \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "readur-logs",
      "TargetPrefix": "s3-access/"
    }
  }'
```
## Next Steps
- Review the [Configuration Reference](./configuration-reference.md) for all S3 options
- Explore [S3 Troubleshooting Guide](./s3-troubleshooting.md) for common issues and solutions
- Check [Migration Guide](./migration-guide.md) for moving from local to S3 storage
- Read [Deployment Guide](./deployment.md) for production deployment best practices