Sources Guide

Readur's Sources feature automates document ingestion from external storage systems. This guide covers each supported source type and how to configure it.

Overview

Sources allow Readur to automatically discover, download, and process documents from external storage systems. Key features include:

  • Multi-Protocol Support: WebDAV, Local Folders, and S3-compatible storage
  • Automated Syncing: Scheduled synchronization with configurable intervals
  • Health Monitoring: Proactive monitoring and validation of source connections
  • Intelligent Processing: Duplicate detection, incremental syncs, and OCR integration
  • Real-time Status: Live sync progress and comprehensive statistics

How Sources Work

  1. Configuration: Set up a source with connection details and preferences
  2. Discovery: Readur scans the source for supported file types
  3. Synchronization: New and changed files are downloaded and processed
  4. OCR Processing: Documents are automatically queued for text extraction
  5. Search Integration: Processed documents become searchable in your collection

Source Types

WebDAV Sources

WebDAV sources connect to cloud storage services and self-hosted servers that support the WebDAV protocol.

Supported WebDAV Servers

| Server Type | Status | Notes |
|---|---|---|
| Nextcloud | Fully Supported | Optimized discovery and authentication |
| ownCloud | Fully Supported | Native integration with server detection |
| Apache WebDAV | Supported | Generic WebDAV implementation |
| nginx WebDAV | Supported | Works with nginx dav module |
| Box.com | ⚠️ Limited | Basic WebDAV support |
| Other WebDAV | Supported | Generic WebDAV protocol compliance |

WebDAV Configuration

Required Fields:

  • Name: Descriptive name for the source
  • Server URL: Full WebDAV server URL (e.g., https://cloud.example.com/remote.php/dav/files/username/)
  • Username: WebDAV authentication username
  • Password: WebDAV authentication password or app password

Optional Configuration:

  • Watch Folders: Specific directories to monitor (leave empty to sync entire accessible space)
  • File Extensions: Limit to specific file types (default: all supported types)
  • Auto Sync: Enable automatic scheduled synchronization
  • Sync Interval: How often to check for changes (15 minutes to 24 hours)
  • Server Type: Specify server type for optimizations (auto-detected)

Setting Up WebDAV Sources

  1. Navigate to Sources: Go to Settings → Sources in the Readur interface
  2. Add New Source: Click "Add Source" and select "WebDAV"
  3. Configure Connection:
    Name: My Nextcloud Documents
    Server URL: https://cloud.mycompany.com/remote.php/dav/files/john/
    Username: john
    Password: app-password-here
    
  4. Test Connection: Use the "Test Connection" button to verify credentials
  5. Configure Folders: Specify directories to monitor:
    Watch Folders:
    - Documents/
    - Projects/2024/
    - Invoices/
    
  6. Set Sync Schedule: Choose automatic sync interval (recommended: 30 minutes)
  7. Save and Sync: Save configuration and trigger initial sync
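
The "Test Connection" step can also be reproduced outside Readur when debugging credentials. The sketch below is not Readur's implementation; it assumes the server accepts HTTP Basic authentication and answers a depth-1 PROPFIND with 207 Multi-Status. The function names `build_propfind_request` and `check_webdav` are hypothetical.

```python
import base64
import urllib.request


def build_propfind_request(server_url, username, password):
    """Build a depth-1 PROPFIND request for a WebDAV collection (assumes Basic auth)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        server_url,
        method="PROPFIND",
        headers={
            "Authorization": f"Basic {token}",
            "Depth": "1",  # the collection itself plus its direct children
        },
    )


def check_webdav(server_url, username, password, timeout=10.0):
    """Send the request and return the HTTP status.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a wrong
    password surfaces as an exception carrying code 401.
    """
    request = build_propfind_request(server_url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `check_webdav(...)` with the Server URL, Username, and app password from step 3 should return 207 on a server that behaves as assumed above.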

WebDAV Best Practices

  • Use App Passwords: Create dedicated app passwords instead of using main account passwords
  • Limit Scope: Specify watch folders to avoid syncing unnecessary files
  • Server Optimization: Let Readur auto-detect server type for optimal performance
  • Network Considerations: Use longer sync intervals for slow connections

Local Folder Sources

Local folder sources monitor directories on the Readur server's filesystem, including mounted network drives.

Use Cases

  • Watch Folders: Monitor directories where documents are dropped
  • Network Mounts: Sync from NFS, SMB/CIFS, or other mounted filesystems
  • Batch Processing: Automatically process documents placed in specific folders
  • Archive Integration: Monitor existing document archives

Local Folder Configuration

Required Fields:

  • Name: Descriptive name for the source
  • Watch Folders: Absolute paths of the directories to monitor

Optional Configuration:

  • File Extensions: Filter by specific file types
  • Auto Sync: Enable scheduled monitoring
  • Sync Interval: Frequency of directory scans
  • Recursive: Include subdirectories in scans
  • Follow Symlinks: Follow symbolic links (use with caution)
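
To illustrate how the File Extensions, Recursive, and Follow Symlinks options interact, here is a minimal sketch of a directory scan. It is an assumption about the general technique, not Readur's scanner; `scan_folder` and its parameters are hypothetical names.

```python
import os
from pathlib import Path


def scan_folder(root, extensions, recursive=True, follow_symlinks=False):
    """Return sorted paths under root whose extension is in `extensions`."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    matches = []
    # followlinks=True can loop forever on cyclic symlinks, hence "use with caution".
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        if not recursive:
            dirnames.clear()  # stops os.walk from descending into subdirectories
        for name in filenames:
            if Path(name).suffix.lower().lstrip(".") in wanted:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Extension matching here is case-insensitive, so `scan.PDF` and `scan.pdf` are treated alike; whether Readur does the same is not stated in this guide.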

Setting Up Local Folder Sources

  1. Prepare Directory: Ensure the directory exists and is accessible

    # Create watch folder
    mkdir -p /mnt/documents/inbox
    
    # Set permissions (if needed)
    chmod 755 /mnt/documents/inbox
    
  2. Configure Source:

    Name: Document Inbox
    Watch Folders: /mnt/documents/inbox
    File Extensions: pdf,jpg,png,txt,docx
    Auto Sync: Enabled
    Sync Interval: 5 minutes
    Recursive: Yes
    
  3. Test Setup: Place a test document in the folder and verify detection

Network Mount Examples

NFS Mount:

# Create the mount point, then mount the NFS share
sudo mkdir -p /mnt/nfs-docs
sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs-docs

# Then configure in Readur:
#   Watch Folders: /mnt/nfs-docs/inbox

SMB/CIFS Mount:

# Create the mount point, then mount the SMB share (prompts for the password)
sudo mkdir -p /mnt/smb-docs
sudo mount -t cifs //server/documents /mnt/smb-docs -o username=user

# Then configure in Readur:
#   Watch Folders: /mnt/smb-docs/processing

S3 Sources

S3 sources connect to Amazon S3 or S3-compatible storage services for document synchronization.

Supported S3 Services

| Service | Status | Configuration |
|---|---|---|
| Amazon S3 | Fully Supported | Standard AWS configuration |
| MinIO | Fully Supported | Custom endpoint URL |
| DigitalOcean Spaces | Supported | S3-compatible API |
| Wasabi | Supported | Custom endpoint configuration |
| Google Cloud Storage | ⚠️ Limited | S3-compatible mode only |

S3 Configuration

Required Fields:

  • Name: Descriptive name for the source
  • Bucket Name: S3 bucket to monitor
  • Region: AWS region (e.g., us-east-1)
  • Access Key ID: AWS/S3 access key
  • Secret Access Key: AWS/S3 secret key

Optional Configuration:

  • Endpoint URL: Custom endpoint for S3-compatible services
  • Prefix: Bucket path prefix to limit scope
  • Watch Folders: Specific S3 "directories" to monitor
  • File Extensions: Filter by file types
  • Auto Sync: Enable scheduled synchronization
  • Sync Interval: Frequency of bucket scans

Setting Up S3 Sources

  1. Prepare S3 Bucket: Ensure bucket exists and credentials have access

  2. Configure Source:

    Name: Company Documents S3
    Bucket Name: company-documents
    Region: us-west-2
    Access Key ID: AKIAIOSFODNN7EXAMPLE
    Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    Prefix: documents/
    Watch Folders: 
    - invoices/
    - contracts/
    - reports/
    
  3. Test Connection: Verify credentials and bucket access

S3-Compatible Services

MinIO Configuration:

Endpoint URL: https://minio.example.com:9000
Bucket Name: documents
Region: us-east-1  (can be any value for MinIO)

DigitalOcean Spaces:

Endpoint URL: https://nyc3.digitaloceanspaces.com
Bucket Name: my-documents
Region: nyc3

Getting Started

Adding Your First Source

  1. Access Sources Management: Navigate to Settings → Sources
  2. Choose Source Type: Select WebDAV, Local Folder, or S3 based on your needs
  3. Configure Connection: Enter required credentials and connection details
  4. Test Connection: Verify connectivity before saving
  5. Configure Sync: Set up folders to monitor and sync schedule
  6. Initial Sync: Trigger first synchronization to import existing documents

Quick Setup Examples

Nextcloud WebDAV

Name: Nextcloud Documents
Server URL: https://cloud.company.com/remote.php/dav/files/username/
Username: username
Password: app-password
Watch Folders: Documents/, Shared/
Auto Sync: Every 30 minutes

Local Network Drive

Name: Network Archive
Watch Folders: /mnt/network/documents
File Extensions: pdf,doc,docx,txt
Recursive: Yes
Auto Sync: Every 15 minutes

AWS S3 Bucket

Name: AWS Document Bucket
Bucket: company-docs-bucket
Region: us-east-1
Access Key: [AWS Access Key]
Secret Key: [AWS Secret Key]
Prefix: active-documents/
Auto Sync: Every 1 hour

Configuration

Sync Settings

Sync Intervals:

  • Real-time: Immediate processing (local folders only)
  • 5-15 minutes: High-frequency monitoring
  • 30-60 minutes: Standard monitoring (recommended)
  • 2-24 hours: Low-frequency, large dataset sync

File Filtering:

  • File Extensions: pdf,jpg,jpeg,png,txt,doc,docx,rtf
  • Size Limits: Configurable maximum file size (default: 50MB)
  • Path Exclusions: Skip specific directories or file patterns
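
The three filters above can be read as a single yes/no decision per file. The following sketch shows that decision under some assumptions: "50MB" is interpreted as 50 MiB, exclusions are shell-style glob patterns, and `should_sync` is a hypothetical name rather than a Readur function.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

DEFAULT_EXTENSIONS = frozenset({"pdf", "jpg", "jpeg", "png", "txt", "doc", "docx", "rtf"})
DEFAULT_MAX_BYTES = 50 * 1024 * 1024  # assumption: "50MB" means 50 MiB


def should_sync(path, size_bytes, extensions=DEFAULT_EXTENSIONS,
                max_bytes=DEFAULT_MAX_BYTES, exclude_patterns=()):
    """Return True when a file passes the extension, size, and exclusion filters."""
    extension = PurePosixPath(path).suffix.lower().lstrip(".")
    if extension not in extensions:
        return False
    if size_bytes > max_bytes:
        return False
    # In fnmatch patterns "*" also matches "/", so "*/tmp/*" excludes any tmp directory.
    return not any(fnmatchcase(path, pattern) for pattern in exclude_patterns)
```

For example, `should_sync("inbox/tmp/scan.pdf", 1024, exclude_patterns=["*/tmp/*"])` rejects the file because of the exclusion even though its type and size are acceptable.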

Advanced Configuration

Concurrency Settings:

  • Concurrent Files: Number of files processed simultaneously (default: 5)
  • Network Timeout: Connection timeout for network sources
  • Retry Logic: Automatic retry for failed downloads
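
As an illustration of how a concurrency limit and retry logic fit together, here is a minimal sketch using a thread pool and exponential backoff. The retry count, delays, and the names `with_retries` and `process_all` are assumptions for illustration; only the default of 5 concurrent files comes from this guide.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def with_retries(task, attempts=3, base_delay=0.5):
    """Call task() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return task()
        except OSError:  # covers ConnectionError and TimeoutError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


def process_all(paths, download, concurrent_files=5, attempts=3, base_delay=0.5):
    """Run download(path) for every path, at most `concurrent_files` at a time."""
    with ThreadPoolExecutor(max_workers=concurrent_files) as pool:
        futures = {
            path: pool.submit(with_retries, partial(download, path), attempts, base_delay)
            for path in paths
        }
    # Leaving the with-block waits for all workers; result() re-raises any final failure.
    return {path: future.result() for path, future in futures.items()}
```

Lowering `concurrent_files` trades throughput for lower load on the source server, which is the adjustment suggested for throttled or high-latency sources.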

Deduplication:

  • Hash-based: SHA-256 content hashing prevents duplicate storage
  • Cross-source: Duplicates detected across all sources
  • Metadata Preservation: Tracks file origins while avoiding storage duplication
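
The idea behind hash-based, cross-source deduplication can be sketched as an index keyed by SHA-256 digest that remembers every origin of a given file body. This is an illustrative model, not Readur's storage layer; `sha256_of` and `DedupIndex` are hypothetical names.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large documents never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class DedupIndex:
    """Maps a content hash to every (source, path) where that content was seen."""

    def __init__(self):
        self.origins = {}

    def register(self, source, path):
        """Record an origin; return True only when the content is new and should be stored."""
        key = sha256_of(path)
        is_new = key not in self.origins
        self.origins.setdefault(key, []).append((source, path))
        return is_new
```

Because the key is derived from content rather than file name or location, the same invoice arriving through WebDAV and S3 is stored once while both origins remain traceable.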

Sync Operations

Manual Sync

Trigger Immediate Sync:

  1. Navigate to Sources page
  2. Find the source to sync
  3. Click the "Sync Now" button
  4. Monitor progress in real-time

Deep Scan:

  • Forces complete re-scan of entire source
  • Useful for detecting changes in large directories
  • Automatically triggered periodically

Sync Status

Status Indicators:

  • 🟢 Idle: Source ready, no sync in progress
  • 🟡 Syncing: Active synchronization in progress
  • 🔴 Error: Sync failed, requires attention
  • Disabled: Source disabled, no automatic sync

Progress Information:

  • Files discovered vs. processed
  • Current operation (scanning, downloading, processing)
  • Estimated completion time
  • Transfer speeds and statistics

Stopping Sync

Graceful Cancellation:

  1. Click "Stop Sync" button during active sync
  2. Current file processing completes
  3. Sync stops cleanly without corruption
  4. Partial progress is saved
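
Graceful cancellation of this kind is commonly built on a stop flag that is checked between files rather than in the middle of one. The sketch below demonstrates the pattern with `threading.Event`; it is an assumption about the general approach, and `run_sync` is a hypothetical name.

```python
import threading


def run_sync(files, process, stop_requested):
    """Process files in order; return those completed before a stop was requested."""
    completed = []
    for path in files:
        if stop_requested.is_set():
            break  # checked between files, so the file in progress always finishes
        process(path)
        completed.append(path)  # partial progress is kept for the next run
    return completed
```

A "Stop Sync" handler would call `stop_requested.set()` from another thread; the loop then exits at the next file boundary and returns the completed list.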

Health Monitoring

Health Scores

Sources are continuously monitored and assigned health scores (0-100):

  • 90-100: Excellent - No issues detected
  • 75-89: ⚠️ Good - Minor issues or warnings
  • 50-74: ⚠️ Fair - Moderate issues requiring attention
  • 25-49: Poor - Significant problems
  • 0-24: Critical - Severe issues, manual intervention required
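
The score bands above translate directly into a lookup, which can be handy when scripting alerts around health data. The function name `health_label` is hypothetical, and how Readur computes the score itself is not covered here.

```python
def health_label(score):
    """Map a 0-100 health score to the band names used in this guide."""
    if not 0 <= score <= 100:
        raise ValueError(f"health score out of range: {score}")
    for floor, label in ((90, "Excellent"), (75, "Good"), (50, "Fair"), (25, "Poor")):
        if score >= floor:
            return label
    return "Critical"
```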

Health Checks

Automatic Validation (every 30 minutes):

  • Connection testing
  • Credential verification
  • Configuration validation
  • Sync pattern analysis
  • Error rate monitoring

Common Health Issues:

  • Authentication failures
  • Network connectivity problems
  • Permission or access issues
  • Configuration errors
  • Rate limiting or throttling

Health Notifications

Alert Types:

  • Connection failures
  • Authentication expiry
  • Sync errors
  • Performance degradation
  • Configuration warnings

Troubleshooting

Common Issues

WebDAV Connection Problems

Symptom: "Connection failed" or authentication errors

Solutions:

  1. Verify server URL format:

    • Nextcloud: https://server.com/remote.php/dav/files/username/
    • ownCloud: https://server.com/remote.php/dav/files/username/
    • Generic: https://server.com/webdav/
  2. Check credentials:

    • Use app passwords instead of main passwords
    • Verify username/password combination
    • Test credentials in web browser or WebDAV client
  3. Network issues:

    • Verify server is accessible from Readur
    • Check firewall and SSL certificate issues
    • Test with curl: curl -u username:password -X PROPFIND -H "Depth: 0" https://server.com/webdav/

Local Folder Issues

Symptom: "Permission denied" or "Directory not found"

Solutions:

  1. Check directory permissions:

    ls -la /path/to/watch/folder
    chmod 755 /path/to/watch/folder  # If needed
    
  2. Verify path exists:

    stat /path/to/watch/folder
    
  3. For network mounts:

    mount | grep /path/to/mount  # Verify mount
    ls -la /path/to/mount        # Test access
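
The same checks can be scripted so they run before each scan. This is a minimal sketch, assuming the script runs as the same user as the Readur process; `check_watch_folder` is a hypothetical helper, not part of Readur.

```python
import os


def check_watch_folder(path):
    """Return a list of problems; an empty list means the folder looks usable."""
    if not os.path.isabs(path):
        return ["path is not absolute"]
    if not os.path.isdir(path):
        return ["directory not found"]
    problems = []
    # Read permission lists entries; execute permission is needed to enter the directory.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append("permission denied")
    return problems
```

An unmounted network share typically shows up as an empty but existing mount point, so `os.path.ismount(path)` on the mount point itself is a useful extra check for the third case.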
    

S3 Access Problems

Symptom: "Access denied" or "Bucket not found"

Solutions:

  1. Verify credentials and permissions:

    aws s3 ls s3://bucket-name --profile your-profile
    
  2. Check bucket policy and IAM permissions

  3. Verify region configuration matches bucket region

  4. For S3-compatible services, ensure correct endpoint URL

Performance Issues

Slow Sync Performance

Causes and Solutions:

  1. Large file sizes: Increase timeout values, consider file size limits
  2. Network latency: Reduce concurrent connections, increase intervals
  3. Server throttling: Implement longer delays between requests
  4. Large directories: Use watch folders to limit scope

High Resource Usage

Optimization Strategies:

  1. Reduce concurrency: Lower concurrent file processing
  2. Increase intervals: Less frequent sync checks
  3. Filter files: Limit to specific file types and sizes
  4. Stagger syncs: Avoid multiple sources syncing simultaneously
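
One simple way to stagger syncs is to spread the sources evenly across a shared interval. The sketch below computes start offsets for that scheme; it is an illustration only, since Readur's scheduler is not described in this guide, and `staggered_offsets` is a hypothetical name.

```python
def staggered_offsets(source_names, interval_minutes):
    """Spread sources evenly over one interval; returns name -> start offset in minutes."""
    count = len(source_names)
    if count == 0:
        return {}
    step = interval_minutes / count
    return {name: round(index * step, 2) for index, name in enumerate(source_names)}
```

With three sources on a 30-minute interval, the offsets come out 10 minutes apart, so no two sources begin scanning at the same moment.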

Error Recovery

Automatic Recovery:

  • Failed files are automatically retried
  • Temporary network issues are handled gracefully
  • Sync resumes from last successful point

Manual Recovery:

  1. Check source health status
  2. Review error logs in source details
  3. Test connection manually
  4. Trigger deep scan to reset sync state

Best Practices

Security

  1. Use Dedicated Credentials: Create app-specific passwords and access keys
  2. Limit Permissions: Grant minimum required access to source accounts
  3. Regular Rotation: Periodically update passwords and access keys
  4. Network Security: Use HTTPS/TLS for all connections

Performance

  1. Strategic Scheduling: Stagger sync times for multiple sources
  2. Scope Limitation: Use watch folders to limit sync scope
  3. File Filtering: Exclude unnecessary file types and large files
  4. Monitor Resources: Watch CPU, memory, and network usage

Organization

  1. Descriptive Names: Use clear, descriptive source names
  2. Consistent Structure: Maintain consistent folder organization
  3. Documentation: Document source purposes and configurations
  4. Regular Maintenance: Periodically review and clean up sources

Reliability

  1. Health Monitoring: Regularly check source health scores
  2. Backup Configuration: Document source configurations
  3. Test Scenarios: Periodically test sync and recovery procedures
  4. Monitor Logs: Review sync logs for patterns or issues

Next Steps