# Migration Guide: Local Storage to S3
## Overview
This guide provides step-by-step instructions for migrating your Readur installation from local filesystem storage to S3 storage. The migration process is designed to be safe, resumable, and reversible.
## Pre-Migration Checklist
### 1. System Requirements
- [ ] Readur compiled with S3 feature: `cargo build --release --features s3`
- [ ] Sufficient disk space for temporary operations (at least 2x largest file)
- [ ] Network bandwidth for uploading all documents to S3
- [ ] AWS CLI installed and configured (for verification)
### 2. S3 Prerequisites
- [ ] S3 bucket created and accessible
- [ ] IAM user with appropriate permissions
- [ ] Access keys generated and tested
- [ ] Bucket region identified
- [ ] Encryption settings configured (if required)
- [ ] Lifecycle policies reviewed
### 3. Backup Requirements
- [ ] Database backed up
- [ ] Local files backed up (optional but recommended)
- [ ] Configuration files saved
- [ ] Document count and total size recorded
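Several of the checks above can be scripted. The sketch below is a minimal pre-flight check, assuming `DATABASE_URL` and `S3_BUCKET_NAME` are exported, GNU `df` is available, and the script is run from the volume that will hold temporary files; adjust names and paths to your deployment.
```bash
#!/bin/bash
# preflight.sh - pre-migration sanity checks (sketch; adjust bucket, paths, env vars)
set -euo pipefail

: "${DATABASE_URL:?DATABASE_URL must be set}"
: "${S3_BUCKET_NAME:?S3_BUCKET_NAME must be set}"

# 1. Is the bucket reachable with the CLI's current credentials?
aws s3api head-bucket --bucket "$S3_BUCKET_NAME" && echo "OK: bucket reachable"

# 2. Is there at least 2x the largest stored file free for temporary operations?
LARGEST_BYTES=$(psql "$DATABASE_URL" -t -A -c "SELECT COALESCE(MAX(file_size), 0) FROM documents")
FREE_BYTES=$(df --output=avail -B1 . | tail -1 | tr -d ' ')
if [ "$FREE_BYTES" -lt $((LARGEST_BYTES * 2)) ]; then
    echo "WARNING: less than 2x the largest file (${LARGEST_BYTES} bytes) is free on this volume"
fi

# 3. Record document count and total size for post-migration comparison
psql "$DATABASE_URL" -t -A -c \
    "SELECT COUNT(*), COALESCE(SUM(file_size), 0) FROM documents" | tee pre_migration_counts.txt
```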
## Migration Process
### Step 1: Prepare Environment
#### 1.1 Backup Database
```bash
# Create timestamped backup
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
pg_dump $DATABASE_URL > readur_backup_${BACKUP_DATE}.sql
# Verify backup (pg_restore --list only works on custom-format dumps; this is plain SQL)
head -20 readur_backup_${BACKUP_DATE}.sql
```
#### 1.2 Document Current State
```sql
-- Record current statistics
SELECT
    COUNT(*) AS total_documents,
    SUM(file_size) / 1024.0 / 1024.0 / 1024.0 AS total_size_gb,
    COUNT(DISTINCT user_id) AS unique_users
FROM documents;
-- Save document list
\copy (SELECT id, filename, file_path, file_size FROM documents) TO 'documents_pre_migration.csv' CSV HEADER;
```
#### 1.3 Calculate Migration Time
```bash
# Estimate migration duration
TOTAL_SIZE_GB=100 # From query above
UPLOAD_SPEED_MBPS=100 # Your upload speed
ESTIMATED_HOURS=$(echo "scale=2; ($TOTAL_SIZE_GB * 1024 * 8) / ($UPLOAD_SPEED_MBPS * 3600)" | bc)
echo "Estimated migration time: $ESTIMATED_HOURS hours"
```
### Step 2: Configure S3
#### 2.1 Create S3 Bucket
```bash
# Create bucket (us-east-1 is the default region and must NOT be passed as a
# LocationConstraint; for any other region add:
#   --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket \
    --bucket readur-production \
    --region us-east-1

# Enable versioning
aws s3api put-bucket-versioning \
    --bucket readur-production \
    --versioning-configuration Status=Enabled

# Enable encryption at rest
aws s3api put-bucket-encryption \
    --bucket readur-production \
    --server-side-encryption-configuration '{
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "AES256"
            }
        }]
    }'
```
#### 2.2 Set Up IAM User
```bash
# Create policy file
cat > readur-s3-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::readur-production"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:GetObjectVersion",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::readur-production/*"
        }
    ]
}
EOF
# Create IAM user and attach policy
aws iam create-user --user-name readur-s3-user
aws iam put-user-policy \
    --user-name readur-s3-user \
    --policy-name ReadurS3Access \
    --policy-document file://readur-s3-policy.json
# Generate access keys
aws iam create-access-key --user-name readur-s3-user > s3-credentials.json
```
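Before wiring the new keys into Readur, it is worth confirming they actually work (fresh IAM keys can take a few seconds to propagate). A quick sketch, assuming `jq` is installed and using an arbitrary test object name:
```bash
# Load the freshly generated keys into the environment for this shell only
export AWS_ACCESS_KEY_ID=$(jq -r '.AccessKey.AccessKeyId' s3-credentials.json)
export AWS_SECRET_ACCESS_KEY=$(jq -r '.AccessKey.SecretAccessKey' s3-credentials.json)

# Round-trip a small object using only the new credentials
aws s3api head-bucket --bucket readur-production
echo "credential test" > /tmp/credential-test.txt
aws s3 cp /tmp/credential-test.txt s3://readur-production/credential-test.txt
aws s3 rm s3://readur-production/credential-test.txt

# Unset AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY afterwards if you normally
# use different CLI credentials
```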
#### 2.3 Configure Readur for S3
```bash
# Add to .env file (replace the example keys below with the values from s3-credentials.json)
cat >> .env << 'EOF'
# S3 Configuration
S3_ENABLED=true
S3_BUCKET_NAME=readur-production
S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
S3_REGION=us-east-1
EOF
# Confirm the bucket name and region are correct
# (note: the aws CLI uses its own credentials here, not the S3_* variables above)
source .env
aws s3 ls s3://$S3_BUCKET_NAME --region $S3_REGION
```
### Step 3: Run Migration
#### 3.1 Dry Run
```bash
# Preview migration without making changes
cargo run --bin migrate_to_s3 --features s3 -- --dry-run
# Review output
# Expected output:
# 🔍 DRY RUN - Would migrate the following files:
# - document1.pdf (User: 123e4567..., Size: 2.5 MB)
# - report.docx (User: 987fcdeb..., Size: 1.2 MB)
# 💡 Run without --dry-run to perform actual migration
```
#### 3.2 Partial Migration (Testing)
```bash
# Migrate only 10 files first
cargo run --bin migrate_to_s3 --features s3 -- --limit 10
# Verify migrated files
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive | head -20
# Check database updates
psql $DATABASE_URL -c "SELECT id, filename, file_path FROM documents WHERE file_path LIKE 's3://%' LIMIT 10;"
```
#### 3.3 Full Migration
```bash
# Run full migration with progress tracking
cargo run --bin migrate_to_s3 --features s3 -- \
    --enable-rollback \
    2>&1 | tee migration_$(date +%Y%m%d_%H%M%S).log
# Monitor progress in another terminal
watch -n 5 'cat migration_state.json | jq "{processed: .processed_files, total: .total_files, failed: .failed_migrations | length}"'
```
#### 3.4 Migration with Local File Deletion
```bash
# Only after verifying successful migration
cargo run --bin migrate_to_s3 --features s3 -- \
    --delete-local \
    --enable-rollback
```
### Step 4: Verify Migration
#### 4.1 Database Verification
```sql
-- Check migration completeness
SELECT
    COUNT(*) FILTER (WHERE file_path LIKE 's3://%') AS s3_documents,
    COUNT(*) FILTER (WHERE file_path NOT LIKE 's3://%') AS local_documents,
    COUNT(*) AS total_documents
FROM documents;
-- Find any failed migrations
SELECT id, filename, file_path
FROM documents
WHERE file_path NOT LIKE 's3://%'
ORDER BY created_at DESC
LIMIT 20;
-- Verify path format
SELECT
    substring(file_path from 1 for 50) AS path_prefix,
    COUNT(*) AS document_count
FROM documents
GROUP BY path_prefix
ORDER BY document_count DESC;
```
#### 4.2 S3 Verification
```bash
# Count objects in S3
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive --summarize | grep "Total Objects"
# Verify file structure
aws s3 ls s3://$S3_BUCKET_NAME/ --recursive | head -50
# Check specific document
DOCUMENT_ID="123e4567-e89b-12d3-a456-426614174000"
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive | grep $DOCUMENT_ID
```
#### 4.3 Application Testing
```bash
# Restart Readur with S3 configuration
systemctl restart readur
# Test document upload
curl -X POST https://readur.example.com/api/documents \
    -H "Authorization: Bearer $TOKEN" \
    -F "file=@test-document.pdf"

# Test document retrieval (set DOCUMENT_ID to the id returned by the upload above)
curl -X GET https://readur.example.com/api/documents/$DOCUMENT_ID/download \
    -H "Authorization: Bearer $TOKEN" \
    -o downloaded-test.pdf

# Verify the downloaded file matches the upload
md5sum test-document.pdf downloaded-test.pdf
```
### Step 5: Post-Migration Tasks
#### 5.1 Update Backup Procedures
```bash
# Create S3 backup script
cat > backup-s3.sh << 'EOF'
#!/bin/bash
# Backup S3 data to another bucket
BACKUP_BUCKET="readur-backup-$(date +%Y%m%d)"
aws s3api create-bucket --bucket $BACKUP_BUCKET --region us-east-1
aws s3 sync s3://readur-production s3://$BACKUP_BUCKET --storage-class GLACIER
EOF
chmod +x backup-s3.sh
```
#### 5.2 Set Up Monitoring
```bash
# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
    --dashboard-name ReadurS3 \
    --dashboard-body file://cloudwatch-dashboard.json
```
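The command above expects a `cloudwatch-dashboard.json` body file that is not shown elsewhere in this guide. Below is a minimal sketch charting the daily S3 storage metrics; the bucket name, region, and widget layout are assumptions to adjust.
```bash
cat > cloudwatch-dashboard.json << 'EOF'
{
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Readur bucket size (bytes)",
                "region": "us-east-1",
                "period": 86400,
                "stat": "Average",
                "metrics": [
                    ["AWS/S3", "BucketSizeBytes", "BucketName", "readur-production", "StorageType", "StandardStorage"]
                ]
            }
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Readur object count",
                "region": "us-east-1",
                "period": 86400,
                "stat": "Average",
                "metrics": [
                    ["AWS/S3", "NumberOfObjects", "BucketName", "readur-production", "StorageType", "AllStorageTypes"]
                ]
            }
        }
    ]
}
EOF
```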
#### 5.3 Clean Up Local Storage
```bash
# After confirming successful migration
# Remove old upload directories (CAREFUL!)
du -sh ./uploads ./thumbnails ./processed_images
# Archive before deletion
tar -czf pre_migration_files_$(date +%Y%m%d).tar.gz ./uploads ./thumbnails ./processed_images
# Remove directories
rm -rf ./uploads/* ./thumbnails/* ./processed_images/*
```
## Rollback Procedures
### Automatic Rollback
If migration fails with `--enable-rollback`, the tool automatically:
1. Restores database paths to their original values
2. Deletes the S3 objects it had uploaded
3. Saves the rollback state to `rollback_errors.json`
### Manual Rollback
#### Step 1: Restore Database
```sql
-- Revert file paths to local (assumes files were stored under ./uploads/ before migration)
UPDATE documents
SET file_path = regexp_replace(file_path, '^s3://[^/]+/', './uploads/')
WHERE file_path LIKE 's3://%';

-- Or restore the pre-migration dump from the shell instead:
--   psql $DATABASE_URL < readur_backup_${BACKUP_DATE}.sql
```
#### Step 2: Remove S3 Objects
```bash
# Delete all migrated objects
aws s3 rm s3://$S3_BUCKET_NAME/documents/ --recursive
aws s3 rm s3://$S3_BUCKET_NAME/thumbnails/ --recursive
aws s3 rm s3://$S3_BUCKET_NAME/processed_images/ --recursive
```
#### Step 3: Restore Configuration
```bash
# Disable S3 in configuration
sed -i 's/S3_ENABLED=true/S3_ENABLED=false/' .env
# Restart application
systemctl restart readur
```
## Troubleshooting Migration Issues
### Issue: Migration Hangs
```bash
# Check current progress
tail -f migration_*.log
# View migration state
cat migration_state.json | jq '.processed_files, .failed_migrations'
# Resume from last successful
LAST_ID=$(cat migration_state.json | jq -r '.completed_migrations[-1].document_id')
cargo run --bin migrate_to_s3 --features s3 -- --resume-from $LAST_ID
```
### Issue: Permission Errors
```bash
# Verify IAM permissions with a test upload
echo "permission test" > /tmp/test.txt
aws s3api put-object \
    --bucket $S3_BUCKET_NAME \
    --key test.txt \
    --body /tmp/test.txt
aws s3 rm s3://$S3_BUCKET_NAME/test.txt
# Check bucket policy
aws s3api get-bucket-policy --bucket $S3_BUCKET_NAME
```
### Issue: Network Timeouts
```bash
# Use screen/tmux for long migrations
screen -S migration
cargo run --bin migrate_to_s3 --features s3
# Detach: Ctrl+A, D
# Reattach: screen -r migration
```
## Migration Optimization
### Parallel Upload
```bash
# Split migration by user
for USER_ID in $(psql $DATABASE_URL -t -A -c "SELECT DISTINCT user_id FROM documents"); do
    cargo run --bin migrate_to_s3 --features s3 -- --user-id $USER_ID &
done
wait
```
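The loop above starts one migration process per user with no upper bound, which can saturate memory or uplink on installations with many users. A sketch that caps concurrency with `xargs -P`, reusing the `--user-id` flag shown above (the limit of 4 is an arbitrary assumption):
```bash
# Run at most 4 per-user migrations at a time
psql $DATABASE_URL -t -A -c "SELECT DISTINCT user_id FROM documents" | \
    xargs -P 4 -I {} cargo run --bin migrate_to_s3 --features s3 -- --user-id {}
```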
### Bandwidth Management
```bash
# Limit upload bandwidth (if needed)
trickle -u 10240 cargo run --bin migrate_to_s3 --features s3
```
### Progress Monitoring
```bash
# Real-time statistics
watch -n 10 'echo "=== Migration Progress ===" && \
cat migration_state.json | jq "{
progress_pct: ((.processed_files / .total_files) * 100),
processed: .processed_files,
total: .total_files,
failed: .failed_migrations | length,
elapsed: now - (.started_at | fromdate),
rate_per_hour: (.processed_files / ((now - (.started_at | fromdate)) / 3600))
}"'
```
## Post-Migration Validation
### Data Integrity Check
```bash
# Generate checksums for S3 objects
aws s3api list-objects-v2 --bucket $S3_BUCKET_NAME --prefix documents/ \
    --query 'Contents[].{Key:Key, ETag:ETag}' \
    --output json > s3_checksums.json
# Compare with database
psql $DATABASE_URL -c "SELECT id, file_path, file_hash FROM documents" > db_checksums.txt
```
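Note that S3 ETags equal an object's MD5 only for single-part uploads, so comparing ETags directly against `file_hash` may not be meaningful. A spot-check sketch that streams a random sample from S3 and re-hashes it locally, assuming `file_hash` stores a hex digest compatible with `sha256sum` (adjust the digest command if your schema differs):
```bash
# Spot-check 20 random documents: stream from S3 and compare hashes
psql $DATABASE_URL -t -A -F ' ' -c \
    "SELECT file_path, file_hash FROM documents WHERE file_path LIKE 's3://%' ORDER BY random() LIMIT 20" | \
while read -r FILE_PATH DB_HASH; do
    S3_HASH=$(aws s3 cp "$FILE_PATH" - | sha256sum | awk '{print $1}')
    if [ "$S3_HASH" != "$DB_HASH" ]; then
        echo "MISMATCH: $FILE_PATH"
    fi
done
```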
### Performance Testing
```bash
# Benchmark S3 retrieval
time for i in {1..100}; do
    curl -s https://readur.example.com/api/documents/random/download > /dev/null
done
```
## Success Criteria
Migration is considered successful when:
- [ ] All documents have S3 paths in database
- [ ] No failed migrations in migration_state.json
- [ ] Application can upload new documents to S3
- [ ] Application can retrieve existing documents from S3
- [ ] Thumbnails and processed images are accessible
- [ ] Performance meets acceptable thresholds
- [ ] Backup procedures are updated and tested
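Several of these criteria can be checked mechanically with the same queries and state file used earlier; a quick sketch:
```bash
# Documents still pointing at local paths (should be 0)
psql $DATABASE_URL -t -A -c "SELECT COUNT(*) FROM documents WHERE file_path NOT LIKE 's3://%'"

# Failed migrations recorded by the migration tool (should be 0)
jq '.failed_migrations | length' migration_state.json
```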
## Next Steps
1. Monitor S3 costs and usage
2. Implement CloudFront CDN if needed
3. Set up cross-region replication for disaster recovery
4. Configure S3 lifecycle policies for cost optimization
5. Update documentation and runbooks
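For item 4, a lifecycle rule sketch that transitions documents to Infrequent Access after 90 days; the prefix, timing, and storage class are assumptions to review against your access patterns and retrieval latency needs:
```bash
aws s3api put-bucket-lifecycle-configuration \
    --bucket readur-production \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "documents-to-ia",
            "Status": "Enabled",
            "Filter": { "Prefix": "documents/" },
            "Transitions": [{ "Days": 90, "StorageClass": "STANDARD_IA" }]
        }]
    }'
```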