# Migration Guide: Local Storage to S3
## Overview
This guide provides step-by-step instructions for migrating your Readur installation from local filesystem storage to S3 storage. The migration process is designed to be safe, resumable, and reversible.
## Pre-Migration Checklist
### 1. System Requirements
- [ ] Readur compiled with S3 feature: `cargo build --release --features s3`
- [ ] Sufficient disk space for temporary operations (at least 2x largest file)
- [ ] Network bandwidth for uploading all documents to S3
- [ ] AWS CLI installed and configured (for verification)
### 2. S3 Prerequisites
- [ ] S3 bucket created and accessible
- [ ] IAM user with appropriate permissions
- [ ] Access keys generated and tested
- [ ] Bucket region identified
- [ ] Encryption settings configured (if required)
- [ ] Lifecycle policies reviewed
### 3. Backup Requirements
- [ ] Database backed up
- [ ] Local files backed up (optional but recommended)
- [ ] Configuration files saved
- [ ] Document count and total size recorded
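Several of the checks above can be scripted. The sketch below is a minimal pre-flight check, assuming `DATABASE_URL` and `S3_BUCKET_NAME` are exported, GNU `df` is available, and the script is run from the volume that will hold temporary files; adjust names and paths to your deployment.
```bash
#!/bin/bash
# preflight.sh - pre-migration sanity checks (sketch; adjust bucket, paths, env vars)
set -euo pipefail

: "${DATABASE_URL:?DATABASE_URL must be set}"
: "${S3_BUCKET_NAME:?S3_BUCKET_NAME must be set}"

# 1. Is the bucket reachable with the CLI's current credentials?
aws s3api head-bucket --bucket "$S3_BUCKET_NAME" && echo "OK: bucket reachable"

# 2. Is there at least 2x the largest stored file free for temporary operations?
LARGEST_BYTES=$(psql "$DATABASE_URL" -t -A -c "SELECT COALESCE(MAX(file_size), 0) FROM documents")
FREE_BYTES=$(df --output=avail -B1 . | tail -1 | tr -d ' ')
if [ "$FREE_BYTES" -lt $((LARGEST_BYTES * 2)) ]; then
    echo "WARNING: less than 2x the largest file (${LARGEST_BYTES} bytes) is free on this volume"
fi

# 3. Record document count and total size for post-migration comparison
psql "$DATABASE_URL" -t -A -c \
    "SELECT COUNT(*), COALESCE(SUM(file_size), 0) FROM documents" | tee pre_migration_counts.txt
```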
## Migration Process
### Step 1: Prepare Environment
#### 1.1 Backup Database
```bash
# Create timestamped backup
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
pg_dump $DATABASE_URL > readur_backup_${BACKUP_DATE}.sql
# Verify backup (pg_restore --list only works on custom-format dumps; this is plain SQL)
head -20 readur_backup_${BACKUP_DATE}.sql
```
#### 1.2 Document Current State
```sql
-- Record current statistics
SELECT
    COUNT(*) AS total_documents,
    SUM(file_size) / 1024.0 / 1024.0 / 1024.0 AS total_size_gb,
    COUNT(DISTINCT user_id) AS unique_users
FROM documents;
-- Save document list
\copy (SELECT id, filename, file_path, file_size FROM documents) TO 'documents_pre_migration.csv' CSV HEADER;
```
#### 1.3 Calculate Migration Time
```bash
# Estimate migration duration
TOTAL_SIZE_GB=100 # From query above
UPLOAD_SPEED_MBPS=100 # Your upload speed
ESTIMATED_HOURS=$(echo "scale=2; ($TOTAL_SIZE_GB * 1024 * 8) / ($UPLOAD_SPEED_MBPS * 3600)" | bc)
echo "Estimated migration time: $ESTIMATED_HOURS hours"
```
### Step 2: Configure S3
#### 2.1 Create S3 Bucket
```bash
# Create bucket (us-east-1 is the default region and must NOT be passed as a
# LocationConstraint; for any other region add:
#   --create-bucket-configuration LocationConstraint=<region>)
aws s3api create-bucket \
    --bucket readur-production \
    --region us-east-1

# Enable versioning
aws s3api put-bucket-versioning \
    --bucket readur-production \
    --versioning-configuration Status=Enabled

# Enable encryption at rest
aws s3api put-bucket-encryption \
    --bucket readur-production \
    --server-side-encryption-configuration '{
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "AES256"
            }
        }]
    }'
```
#### 2.2 Set Up IAM User
```bash
# Create policy file
cat > readur-s3-policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::readur-production"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:GetObjectVersion",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::readur-production/*"
        }
    ]
}
EOF
# Create IAM user and attach policy
aws iam create-user --user-name readur-s3-user
aws iam put-user-policy \
    --user-name readur-s3-user \
    --policy-name ReadurS3Access \
    --policy-document file://readur-s3-policy.json
# Generate access keys
aws iam create-access-key --user-name readur-s3-user > s3-credentials.json
```
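Before wiring the new keys into Readur, it is worth confirming they actually work (fresh IAM keys can take a few seconds to propagate). A quick sketch, assuming `jq` is installed and using an arbitrary test object name:
```bash
# Load the freshly generated keys into the environment for this shell only
export AWS_ACCESS_KEY_ID=$(jq -r '.AccessKey.AccessKeyId' s3-credentials.json)
export AWS_SECRET_ACCESS_KEY=$(jq -r '.AccessKey.SecretAccessKey' s3-credentials.json)

# Round-trip a small object using only the new credentials
aws s3api head-bucket --bucket readur-production
echo "credential test" > /tmp/credential-test.txt
aws s3 cp /tmp/credential-test.txt s3://readur-production/credential-test.txt
aws s3 rm s3://readur-production/credential-test.txt

# Unset AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY afterwards if you normally
# use different CLI credentials
```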
#### 2.3 Configure Readur for S3
```bash
# Add to .env file (replace the example keys below with the values from s3-credentials.json)
cat >> .env << 'EOF'
# S3 Configuration
S3_ENABLED=true
S3_BUCKET_NAME=readur-production
S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
S3_REGION=us-east-1
EOF
# Confirm the bucket name and region are correct
# (note: the aws CLI uses its own credentials here, not the S3_* variables above)
source .env
aws s3 ls s3://$S3_BUCKET_NAME --region $S3_REGION
```
### Step 3: Run Migration
#### 3.1 Dry Run
```bash
# Preview migration without making changes
cargo run --bin migrate_to_s3 --features s3 -- --dry-run
# Review output
# Expected output:
# 🔍 DRY RUN - Would migrate the following files:
# - document1.pdf (User: 123e4567..., Size: 2.5 MB)
# - report.docx (User: 987fcdeb..., Size: 1.2 MB)
# 💡 Run without --dry-run to perform actual migration
```
#### 3.2 Partial Migration (Testing)
```bash
# Migrate only 10 files first
cargo run --bin migrate_to_s3 --features s3 -- --limit 10
# Verify migrated files
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive | head -20
# Check database updates
psql $DATABASE_URL -c "SELECT id, filename, file_path FROM documents WHERE file_path LIKE 's3://%' LIMIT 10;"
```
#### 3.3 Full Migration
```bash
# Run full migration with progress tracking
cargo run --bin migrate_to_s3 --features s3 -- \
    --enable-rollback \
    2>&1 | tee migration_$(date +%Y%m%d_%H%M%S).log
# Monitor progress in another terminal
watch -n 5 'cat migration_state.json | jq "{processed: .processed_files, total: .total_files, failed: .failed_migrations | length}"'
```
#### 3.4 Migration with Local File Deletion
```bash
# Only after verifying successful migration
cargo run --bin migrate_to_s3 --features s3 -- \
    --delete-local \
    --enable-rollback
```
### Step 4: Verify Migration
#### 4.1 Database Verification
```sql
-- Check migration completeness
SELECT
    COUNT(*) FILTER (WHERE file_path LIKE 's3://%') AS s3_documents,
    COUNT(*) FILTER (WHERE file_path NOT LIKE 's3://%') AS local_documents,
    COUNT(*) AS total_documents
FROM documents;
-- Find any failed migrations
SELECT id, filename, file_path
FROM documents
WHERE file_path NOT LIKE 's3://%'
ORDER BY created_at DESC
LIMIT 20;
-- Verify path format
SELECT
    substring(file_path from 1 for 50) AS path_prefix,
    COUNT(*) AS document_count
FROM documents
GROUP BY path_prefix
ORDER BY document_count DESC;
```
#### 4.2 S3 Verification
```bash
# Count objects in S3
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive --summarize | grep "Total Objects"
# Verify file structure
aws s3 ls s3://$S3_BUCKET_NAME/ --recursive | head -50
# Check specific document
DOCUMENT_ID="123e4567-e89b-12d3-a456-426614174000"
aws s3 ls s3://$S3_BUCKET_NAME/documents/ --recursive | grep $DOCUMENT_ID
```
#### 4.3 Application Testing
```bash
# Restart Readur with S3 configuration
systemctl restart readur
# Test document upload
curl -X POST https://readur.example.com/api/documents \
    -H "Authorization: Bearer $TOKEN" \
    -F "file=@test-document.pdf"

# Test document retrieval (set DOCUMENT_ID to the id returned by the upload above)
curl -X GET https://readur.example.com/api/documents/$DOCUMENT_ID/download \
    -H "Authorization: Bearer $TOKEN" \
    -o downloaded-test.pdf

# Verify the downloaded file matches the upload
md5sum test-document.pdf downloaded-test.pdf
```
### Step 5: Post-Migration Tasks
#### 5.1 Update Backup Procedures
```bash
# Create S3 backup script
cat > backup-s3.sh << 'EOF'
#!/bin/bash
# Backup S3 data to another bucket
BACKUP_BUCKET="readur-backup-$(date +%Y%m%d)"
aws s3api create-bucket --bucket $BACKUP_BUCKET --region us-east-1
aws s3 sync s3://readur-production s3://$BACKUP_BUCKET --storage-class GLACIER
EOF
chmod +x backup-s3.sh
```
#### 5.2 Set Up Monitoring
```bash
# Create CloudWatch dashboard
aws cloudwatch put-dashboard \
    --dashboard-name ReadurS3 \
    --dashboard-body file://cloudwatch-dashboard.json
```
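The command above expects a `cloudwatch-dashboard.json` body file that is not shown elsewhere in this guide. Below is a minimal sketch charting the daily S3 storage metrics; the bucket name, region, and widget layout are assumptions to adjust.
```bash
cat > cloudwatch-dashboard.json << 'EOF'
{
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Readur bucket size (bytes)",
                "region": "us-east-1",
                "period": 86400,
                "stat": "Average",
                "metrics": [
                    ["AWS/S3", "BucketSizeBytes", "BucketName", "readur-production", "StorageType", "StandardStorage"]
                ]
            }
        },
        {
            "type": "metric",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Readur object count",
                "region": "us-east-1",
                "period": 86400,
                "stat": "Average",
                "metrics": [
                    ["AWS/S3", "NumberOfObjects", "BucketName", "readur-production", "StorageType", "AllStorageTypes"]
                ]
            }
        }
    ]
}
EOF
```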
#### 5.3 Clean Up Local Storage
```bash
# After confirming successful migration
# Remove old upload directories (CAREFUL!)
du -sh ./uploads ./thumbnails ./processed_images
# Archive before deletion
tar -czf pre_migration_files_$(date +%Y%m%d).tar.gz ./uploads ./thumbnails ./processed_images
# Remove directories
rm -rf ./uploads/* ./thumbnails/* ./processed_images/*
```
## Rollback Procedures
### Automatic Rollback
If migration fails with `--enable-rollback`, the tool automatically:
1. Restores database paths to their original values
2. Deletes the S3 objects it had uploaded
3. Saves the rollback state to `rollback_errors.json`
### Manual Rollback
#### Step 1: Restore Database
```sql
-- Revert file paths to local (assumes files were stored under ./uploads/ before migration)
UPDATE documents
SET file_path = regexp_replace(file_path, '^s3://[^/]+/', './uploads/')
WHERE file_path LIKE 's3://%';

-- Or restore the pre-migration dump from the shell instead:
--   psql $DATABASE_URL < readur_backup_${BACKUP_DATE}.sql
```
#### Step 2: Remove S3 Objects
```bash
# Delete all migrated objects
aws s3 rm s3://$S3_BUCKET_NAME/documents/ --recursive
aws s3 rm s3://$S3_BUCKET_NAME/thumbnails/ --recursive
aws s3 rm s3://$S3_BUCKET_NAME/processed_images/ --recursive
```
#### Step 3: Restore Configuration
```bash
# Disable S3 in configuration
sed -i 's/S3_ENABLED=true/S3_ENABLED=false/' .env
# Restart application
systemctl restart readur
```
## Troubleshooting Migration Issues
### Issue: Migration Hangs
```bash
# Check current progress
tail -f migration_*.log
# View migration state
cat migration_state.json | jq '.processed_files, .failed_migrations'
# Resume from last successful
LAST_ID=$(cat migration_state.json | jq -r '.completed_migrations[-1].document_id')
cargo run --bin migrate_to_s3 --features s3 -- --resume-from $LAST_ID
```
### Issue: Permission Errors
```bash
# Verify IAM permissions with a test upload
echo "permission test" > /tmp/test.txt
aws s3api put-object \
    --bucket $S3_BUCKET_NAME \
    --key test.txt \
    --body /tmp/test.txt
aws s3 rm s3://$S3_BUCKET_NAME/test.txt
# Check bucket policy
aws s3api get-bucket-policy --bucket $S3_BUCKET_NAME
```
### Issue: Network Timeouts
```bash
# Use screen/tmux for long migrations
screen -S migration
cargo run --bin migrate_to_s3 --features s3
# Detach: Ctrl+A, D
# Reattach: screen -r migration
```
## Migration Optimization
### Parallel Upload
```bash
# Split migration by user
for USER_ID in $(psql $DATABASE_URL -t -A -c "SELECT DISTINCT user_id FROM documents"); do
    cargo run --bin migrate_to_s3 --features s3 -- --user-id $USER_ID &
done
wait
```
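The loop above starts one migration process per user with no upper bound, which can saturate memory or uplink on installations with many users. A sketch that caps concurrency with `xargs -P`, reusing the `--user-id` flag shown above (the limit of 4 is an arbitrary assumption):
```bash
# Run at most 4 per-user migrations at a time
psql $DATABASE_URL -t -A -c "SELECT DISTINCT user_id FROM documents" | \
    xargs -P 4 -I {} cargo run --bin migrate_to_s3 --features s3 -- --user-id {}
```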
### Bandwidth Management
```bash
# Limit upload bandwidth (if needed)
trickle -u 10240 cargo run --bin migrate_to_s3 --features s3
```
### Progress Monitoring
```bash
# Real-time statistics
watch -n 10 'echo "=== Migration Progress ===" && \
cat migration_state.json | jq "{
progress_pct: ((.processed_files / .total_files) * 100),
processed: .processed_files,
total: .total_files,
failed: .failed_migrations | length,
elapsed: now - (.started_at | fromdate),
rate_per_hour: (.processed_files / ((now - (.started_at | fromdate)) / 3600))
}"'
```
## Post-Migration Validation
### Data Integrity Check
```bash
# Generate checksums for S3 objects
aws s3api list-objects-v2 --bucket $S3_BUCKET_NAME --prefix documents/ \
    --query 'Contents[].{Key:Key, ETag:ETag}' \
    --output json > s3_checksums.json
# Compare with database
psql $DATABASE_URL -c "SELECT id, file_path, file_hash FROM documents" > db_checksums.txt
```
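Note that S3 ETags equal an object's MD5 only for single-part uploads, so comparing ETags directly against `file_hash` may not be meaningful. A spot-check sketch that streams a random sample from S3 and re-hashes it locally, assuming `file_hash` stores a hex digest compatible with `sha256sum` (adjust the digest command if your schema differs):
```bash
# Spot-check 20 random documents: stream from S3 and compare hashes
psql $DATABASE_URL -t -A -F ' ' -c \
    "SELECT file_path, file_hash FROM documents WHERE file_path LIKE 's3://%' ORDER BY random() LIMIT 20" | \
while read -r FILE_PATH DB_HASH; do
    S3_HASH=$(aws s3 cp "$FILE_PATH" - | sha256sum | awk '{print $1}')
    if [ "$S3_HASH" != "$DB_HASH" ]; then
        echo "MISMATCH: $FILE_PATH"
    fi
done
```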
### Performance Testing
```bash
# Benchmark S3 retrieval
time for i in {1..100}; do
    curl -s https://readur.example.com/api/documents/random/download > /dev/null
done
```
## Success Criteria
Migration is considered successful when:
- [ ] All documents have S3 paths in database
- [ ] No failed migrations in migration_state.json
- [ ] Application can upload new documents to S3
- [ ] Application can retrieve existing documents from S3
- [ ] Thumbnails and processed images are accessible
- [ ] Performance meets acceptable thresholds
- [ ] Backup procedures are updated and tested
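Several of these criteria can be checked mechanically with the same queries and state file used earlier; a quick sketch:
```bash
# Documents still pointing at local paths (should be 0)
psql $DATABASE_URL -t -A -c "SELECT COUNT(*) FROM documents WHERE file_path NOT LIKE 's3://%'"

# Failed migrations recorded by the migration tool (should be 0)
jq '.failed_migrations | length' migration_state.json
```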
## Next Steps
1. Monitor S3 costs and usage
2. Implement CloudFront CDN if needed
3. Set up cross-region replication for disaster recovery
4. Configure S3 lifecycle policies for cost optimization
5. Update documentation and runbooks
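For item 4, a lifecycle rule sketch that transitions documents to Infrequent Access after 90 days; the prefix, timing, and storage class are assumptions to review against your access patterns and retrieval latency needs:
```bash
aws s3api put-bucket-lifecycle-configuration \
    --bucket readur-production \
    --lifecycle-configuration '{
        "Rules": [{
            "ID": "documents-to-ia",
            "Status": "Enabled",
            "Filter": { "Prefix": "documents/" },
            "Transitions": [{ "Days": 90, "StorageClass": "STANDARD_IA" }]
        }]
    }'
```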