Migration Guide: Local Storage to S3

Overview

This guide provides step-by-step instructions for migrating your Readur installation from local filesystem storage to S3 storage. The migration process is designed to be safe, resumable, and reversible.

Pre-Migration Checklist

1. System Requirements

  • Readur compiled with S3 feature: cargo build --release --features s3
  • Sufficient disk space for temporary operations (at least 2x largest file)
  • Network bandwidth for uploading all documents to S3
  • AWS CLI installed and configured (for verification)
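
A quick pre-flight check for these requirements (a sketch assuming GNU find/coreutils, a non-empty ./uploads directory, and the default local layout; adjust paths to your installation):

# Pre-flight check: AWS CLI available and at least 2x the largest stored file free on disk
command -v aws >/dev/null || echo "WARNING: AWS CLI not found in PATH"

LARGEST_BYTES=$(find ./uploads -type f -printf '%s\n' | sort -n | tail -1)
FREE_BYTES=$(( $(df -k . | awk 'NR==2 {print $4}') * 1024 ))
if [ "$FREE_BYTES" -lt "$((LARGEST_BYTES * 2))" ]; then
    echo "WARNING: free space is below 2x the largest stored file"
fi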

2. S3 Prerequisites

  • S3 bucket created and accessible
  • IAM user with appropriate permissions
  • Access keys generated and tested
  • Bucket region identified
  • Encryption settings configured (if required)
  • Lifecycle policies reviewed
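
These can be verified up front with the AWS CLI (bucket name as created later in this guide; swap in your own):

# Confirm the credentials work and the bucket is reachable
aws sts get-caller-identity
aws s3api head-bucket --bucket readur-production
aws s3api get-bucket-location --bucket readur-production
aws s3api get-bucket-encryption --bucket readur-production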

3. Backup Requirements

  • Database backed up
  • Local files backed up (optional but recommended)
  • Configuration files saved
  • Document count and total size recorded

Migration Process

Step 1: Prepare Environment

1.1 Backup Database

# Create timestamped backup
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
pg_dump $DATABASE_URL > readur_backup_${BACKUP_DATE}.sql

# Verify backup (plain-format dumps are inspected directly and restored with psql, not pg_restore)
head -20 readur_backup_${BACKUP_DATE}.sql
grep -c "CREATE TABLE" readur_backup_${BACKUP_DATE}.sql

1.2 Document Current State

-- Record current statistics
SELECT 
    COUNT(*) as total_documents,
    SUM(file_size) / 1024.0 / 1024.0 / 1024.0 as total_size_gb,
    COUNT(DISTINCT user_id) as unique_users
FROM documents;

-- Save document list
\copy (SELECT id, filename, file_path, file_size FROM documents) TO 'documents_pre_migration.csv' CSV HEADER;
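
It also helps to capture a filesystem-level baseline to cross-check these totals after the migration (directory names assume the default local layout used later in this guide):

# On-disk totals and file count for later comparison
du -sh ./uploads ./thumbnails ./processed_images
find ./uploads -type f | wc -l

# Row count of the exported document list (includes the CSV header line)
wc -l documents_pre_migration.csv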

1.3 Calculate Migration Time

# Estimate migration duration
TOTAL_SIZE_GB=100  # From query above
UPLOAD_SPEED_MBPS=100  # Your upload speed
ESTIMATED_HOURS=$(echo "scale=2; ($TOTAL_SIZE_GB * 1024 * 8) / ($UPLOAD_SPEED_MBPS * 3600)" | bc)
echo "Estimated migration time: $ESTIMATED_HOURS hours"

Step 2: Configure S3

2.1 Create S3 Bucket

# Create bucket (us-east-1 is the default location and must not be passed as a
# LocationConstraint; for any other region, add
# --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket \
    --bucket readur-production \
    --region us-east-1

# Enable versioning
aws s3api put-bucket-versioning \
    --bucket readur-production \
    --versioning-configuration Status=Enabled

# Enable encryption
aws s3api put-bucket-encryption \
    --bucket readur-production \
    --server-side-encryption-configuration '{
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "AES256"
            }
        }]
    }'

2.2 Set Up IAM User

# Create policy file
cat > readur-s3-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::readur-production"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:GetObjectVersion",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::readur-production/*"
        }
    ]
}
EOF

# Create IAM user and attach policy
aws iam create-user --user-name readur-s3-user
aws iam put-user-policy \
    --user-name readur-s3-user \
    --policy-name ReadurS3Access \
    --policy-document file://readur-s3-policy.json

# Generate access keys
aws iam create-access-key --user-name readur-s3-user > s3-credentials.json

2.3 Configure Readur for S3

# Add to .env file (the keys below are AWS documentation placeholders;
# use the real values from s3-credentials.json generated in 2.2)
cat >> .env << 'EOF'
# S3 Configuration
S3_ENABLED=true
S3_BUCKET_NAME=readur-production
S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
S3_REGION=us-east-1
EOF

# Test configuration
source .env
aws s3 ls s3://$S3_BUCKET_NAME --region $S3_REGION

Step 3: Run Migration

3.1 Dry Run

# Preview migration without making changes
cargo run --bin migrate_to_s3 --features s3 -- --dry-run

# Review output
# Expected output:
# 🔍 DRY RUN - Would migrate the following files:
#   - document1.pdf (User: 123e4567..., Size: 2.5 MB)
#   - report.docx (User: 987fcdeb..., Size: 1.2 MB)
# 💡 Run without --dry-run to perform actual migration

3.2 Partial Migration (Testing)

# Migrate only 10 files first
cargo run --bin migrate_to_s3 --features s3 -- --limit 10

# Verify migrated files
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive | head -20

# Check database updates
psql $DATABASE_URL -c "SELECT id, filename, file_path FROM documents WHERE file_path LIKE 's3://%' LIMIT 10;"

3.3 Full Migration

# Run full migration with progress tracking
cargo run --bin migrate_to_s3 --features s3 -- \
    --enable-rollback \
    2>&1 | tee migration_$(date +%Y%m%d_%H%M%S).log

# Monitor progress in another terminal
watch -n 5 'cat migration_state.json | jq "{processed: .processed_files, total: .total_files, failed: .failed_migrations | length}"'

3.4 Migration with Local File Deletion

# Only after verifying successful migration
cargo run --bin migrate_to_s3 --features s3 -- \
    --delete-local \
    --enable-rollback

Step 4: Verify Migration

4.1 Database Verification

-- Check migration completeness
SELECT 
    COUNT(*) FILTER (WHERE file_path LIKE 's3://%') as s3_documents,
    COUNT(*) FILTER (WHERE file_path NOT LIKE 's3://%') as local_documents,
    COUNT(*) as total_documents
FROM documents;

-- Find any failed migrations
SELECT id, filename, file_path 
FROM documents 
WHERE file_path NOT LIKE 's3://%'
ORDER BY created_at DESC
LIMIT 20;

-- Verify path format
SELECT
    substring(file_path from 1 for 50) as path_prefix,
    COUNT(*) as document_count
FROM documents
GROUP BY path_prefix
ORDER BY document_count DESC;

4.2 S3 Verification

# Count objects in S3
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive --summarize | grep "Total Objects"

# Verify file structure
aws s3 ls s3://$S3_BUCKET_NAME/ --recursive | head -50

# Check specific document
DOCUMENT_ID="123e4567-e89b-12d3-a456-426614174000"
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive | grep $DOCUMENT_ID

4.3 Application Testing

# Restart Readur with S3 configuration
systemctl restart readur

# Test document upload
curl -X POST https://readur.example.com/api/documents \
    -H "Authorization: Bearer $TOKEN" \
    -F "file=@test-document.pdf"

# Test document retrieval
curl -X GET https://readur.example.com/api/documents/$DOCUMENT_ID/download \
    -H "Authorization: Bearer $TOKEN" \
    -o downloaded-test.pdf

# Verify downloaded file
md5sum test-document.pdf downloaded-test.pdf

Step 5: Post-Migration Tasks

5.1 Update Backup Procedures

# Create S3 backup script
cat > backup-s3.sh << 'EOF'
#!/bin/bash
# Backup S3 data to another bucket
BACKUP_BUCKET="readur-backup-$(date +%Y%m%d)"
aws s3api create-bucket --bucket $BACKUP_BUCKET --region us-east-1
aws s3 sync s3://readur-production s3://$BACKUP_BUCKET --storage-class GLACIER
EOF

chmod +x backup-s3.sh

5.2 Set Up Monitoring

# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
    --dashboard-name ReadurS3 \
    --dashboard-body file://cloudwatch-dashboard.json
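
The cloudwatch-dashboard.json referenced above is not shown in this guide; a minimal example that charts bucket size and object count (using the bucket and region from Step 2.1) could look like this:

cat > cloudwatch-dashboard.json << 'EOF'
{
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "metrics": [
                    ["AWS/S3", "BucketSizeBytes", "BucketName", "readur-production", "StorageType", "StandardStorage"],
                    ["AWS/S3", "NumberOfObjects", "BucketName", "readur-production", "StorageType", "AllStorageTypes"]
                ],
                "period": 86400,
                "stat": "Average",
                "region": "us-east-1",
                "title": "Readur S3 storage"
            }
        }
    ]
}
EOF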

5.3 Clean Up Local Storage

# After confirming successful migration
# Remove old upload directories (CAREFUL!)
du -sh ./uploads ./thumbnails ./processed_images

# Archive before deletion
tar -czf pre_migration_files_$(date +%Y%m%d).tar.gz ./uploads ./thumbnails ./processed_images

# Remove directories
rm -rf ./uploads/* ./thumbnails/* ./processed_images/*

Rollback Procedures

Automatic Rollback

If a migration started with --enable-rollback fails, the tool automatically:

  1. Restores database paths to their original values
  2. Deletes the S3 objects that were uploaded
  3. Saves the rollback state to rollback_errors.json
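
If anything could not be rolled back, review rollback_errors.json before retrying. The exact structure of that file is determined by the migration tool; a generic inspection:

# Pretty-print the rollback error report and count its entries
jq '.' rollback_errors.json
jq 'length' rollback_errors.json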

Manual Rollback

Step 1: Restore Database

-- Revert file paths to local (adjust the replacement prefix if your original
-- local layout differs from ./uploads/)
UPDATE documents 
SET file_path = regexp_replace(file_path, '^s3://[^/]+/', './uploads/')
WHERE file_path LIKE 's3://%';

-- Or restore the database from the backup taken in Step 1.1
psql $DATABASE_URL < readur_backup_${BACKUP_DATE}.sql

Step 2: Remove S3 Objects

# Delete all migrated objects
aws s3 rm s3://$S3_BUCKET_NAME/documents/ --recursive
aws s3 rm s3://$S3_BUCKET_NAME/thumbnails/ --recursive
aws s3 rm s3://$S3_BUCKET_NAME/processed_images/ --recursive
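
Because versioning was enabled in Step 2.1, aws s3 rm only adds delete markers; older versions remain billable until removed. A quick check of what is left:

# List remaining current objects
aws s3 ls s3://$S3_BUCKET_NAME --recursive --summarize | grep "Total Objects"

# Older versions and delete markers are listed separately
aws s3api list-object-versions --bucket $S3_BUCKET_NAME --max-items 10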

Step 3: Restore Configuration

# Disable S3 in configuration
sed -i 's/S3_ENABLED=true/S3_ENABLED=false/' .env

# Restart application
systemctl restart readur

Troubleshooting Migration Issues

Issue: Migration Hangs

# Check current progress
tail -f migration_*.log

# View migration state
cat migration_state.json | jq '.processed_files, .failed_migrations'

# Resume from last successful
LAST_ID=$(cat migration_state.json | jq -r '.completed_migrations[-1].document_id')
cargo run --bin migrate_to_s3 --features s3 -- --resume-from $LAST_ID

Issue: Permission Errors

# Verify IAM permissions with a test upload
echo "permission test" > /tmp/test.txt
aws s3api put-object \
    --bucket $S3_BUCKET_NAME \
    --key test.txt \
    --body /tmp/test.txt
aws s3api delete-object --bucket $S3_BUCKET_NAME --key test.txt

# Check bucket policy (an error here is expected when access is granted only
# through the IAM user policy rather than a bucket policy)
aws s3api get-bucket-policy --bucket $S3_BUCKET_NAME
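
Permissions can also be checked without touching the bucket by simulating the IAM policy (run with credentials that have iam:SimulatePrincipalPolicy; replace 123456789012 with your account ID):

aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::123456789012:user/readur-s3-user \
    --action-names s3:PutObject s3:GetObject s3:DeleteObject \
    --resource-arns "arn:aws:s3:::readur-production/*" \
    --query 'EvaluationResults[].{Action: EvalActionName, Decision: EvalDecision}'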

Issue: Network Timeouts

# Use screen/tmux for long migrations
screen -S migration
cargo run --bin migrate_to_s3 --features s3

# Detach: Ctrl+A, D
# Reattach: screen -r migration

Migration Optimization

Parallel Upload

# Split migration by user (check first that the migration tool can safely run
# in parallel, e.g. that migration_state.json is not shared between instances)
for USER_ID in $(psql $DATABASE_URL -t -A -c "SELECT DISTINCT user_id FROM documents"); do
    cargo run --bin migrate_to_s3 --features s3 -- --user-id $USER_ID &
done
wait

Bandwidth Management

# Limit upload bandwidth if needed (trickle takes a rate in KB/s; 10240 ≈ 10 MB/s)
trickle -u 10240 cargo run --bin migrate_to_s3 --features s3

Progress Monitoring

# Real-time statistics
watch -n 10 'echo "=== Migration Progress ===" && \
    cat migration_state.json | jq "{
        progress_pct: ((.processed_files / .total_files) * 100),
        processed: .processed_files,
        total: .total_files,
        failed: .failed_migrations | length,
        elapsed: now - (.started_at | fromdate),
        rate_per_hour: (.processed_files / ((now - (.started_at | fromdate)) / 3600))
    }"'

Post-Migration Validation

Data Integrity Check

# Generate checksums for S3 objects
aws s3api list-objects-v2 --bucket $S3_BUCKET_NAME --prefix documents/ \
    --query 'Contents[].{Key:Key, ETag:ETag}' \
    --output json > s3_checksums.json

# Compare with database
psql $DATABASE_URL -c "SELECT id, file_path, file_hash FROM documents" > db_checksums.txt

Performance Testing

# Benchmark document retrieval through the API (substitute real document IDs
# from your instance if a 'random' endpoint is not available)
time for i in {1..100}; do
    curl -s https://readur.example.com/api/documents/random/download > /dev/null
done

Success Criteria

Migration is considered successful when:

  • All documents have S3 paths in database
  • No failed migrations in migration_state.json
  • Application can upload new documents to S3
  • Application can retrieve existing documents from S3
  • Thumbnails and processed images are accessible
  • Performance meets acceptable thresholds
  • Backup procedures are updated and tested
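
The database and S3 criteria can be checked with a short script (a sketch; the object count covers only the documents/ prefix, so thumbnails and processed images should be checked separately):

# Quick success-criteria check
LOCAL_LEFT=$(psql $DATABASE_URL -t -A -c \
    "SELECT COUNT(*) FROM documents WHERE file_path NOT LIKE 's3://%'")
FAILED=$(jq '.failed_migrations | length' migration_state.json)
S3_DOCS=$(aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive --summarize |
    awk '/Total Objects/ {print $3}')

echo "Documents still on local paths: $LOCAL_LEFT (expect 0)"
echo "Failed migrations recorded:     $FAILED (expect 0)"
echo "Objects under documents/ in S3: $S3_DOCS"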

Next Steps

  1. Monitor S3 costs and usage
  2. Implement CloudFront CDN if needed
  3. Set up cross-region replication for disaster recovery
  4. Configure S3 lifecycle policies for cost optimization
  5. Update documentation and runbooks