Sources Guide
Readur's Sources feature automates document ingestion from external storage systems. This guide covers each supported source type (WebDAV, local folders, and S3-compatible storage) and its configuration.
Table of Contents
- Overview
- Source Types
- Getting Started
- Configuration
- Sync Operations
- Health Monitoring
- Troubleshooting
- Best Practices
Overview
Sources allow Readur to automatically discover, download, and process documents from external storage systems. Key features include:
- Multi-Protocol Support: WebDAV, Local Folders, and S3-compatible storage
- Automated Syncing: Scheduled synchronization with configurable intervals
- Health Monitoring: Proactive monitoring and validation of source connections
- Intelligent Processing: Duplicate detection, incremental syncs, and OCR integration
- Real-time Status: Live sync progress via WebSocket connections
- Per-User Watch Directories: Individual watch folders for each user (v2.5.4+)
How Sources Work
- Configuration: Set up a source with connection details and preferences
- Discovery: Readur scans the source for supported file types
- Synchronization: New and changed files are downloaded and processed
- OCR Processing: Documents are automatically queued for text extraction
- Search Integration: Processed documents become searchable in your collection
Source Types
WebDAV Sources
WebDAV sources connect to cloud storage services and self-hosted servers that support the WebDAV protocol.
Supported WebDAV Servers
| Server Type | Status | Notes |
|---|---|---|
| Nextcloud | ✅ Fully Supported | Optimized discovery and authentication |
| ownCloud | ✅ Fully Supported | Native integration with server detection |
| Apache WebDAV | ✅ Supported | Generic WebDAV implementation |
| nginx WebDAV | ✅ Supported | Works with nginx dav module |
| Box.com | ⚠️ Limited | Basic WebDAV support |
| Other WebDAV | ✅ Supported | Generic WebDAV protocol compliance |
WebDAV Configuration
Required Fields:
- Name: Descriptive name for the source
- Server URL: Full WebDAV server URL (e.g., https://cloud.example.com/remote.php/dav/files/username/)
- Username: WebDAV authentication username
- Password: WebDAV authentication password or app password
Optional Configuration:
- Watch Folders: Specific directories to monitor (leave empty to sync entire accessible space)
- File Extensions: Limit to specific file types (default: all supported types)
- Auto Sync: Enable automatic scheduled synchronization
- Sync Interval: How often to check for changes (15 minutes to 24 hours)
- Server Type: Specify server type for optimizations (auto-detected)
Setting Up WebDAV Sources
- Navigate to Sources: Go to Settings → Sources in the Readur interface
- Add New Source: Click "Add Source" and select "WebDAV"
- Configure Connection:
    Name: My Nextcloud Documents
    Server URL: https://cloud.mycompany.com/remote.php/dav/files/john/
    Username: john
    Password: app-password-here
- Test Connection: Use the "Test Connection" button to verify credentials
- Configure Folders: Specify directories to monitor:
    Watch Folders:
    - Documents/
    - Projects/2024/
    - Invoices/
- Set Sync Schedule: Choose automatic sync interval (recommended: 30 minutes)
- Save and Sync: Save configuration and trigger initial sync
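The Server URL in the steps above follows the Nextcloud pattern `/remote.php/dav/files/<username>/`. As a minimal sketch, the URL can be assembled from a server address and a username. The helper name `nextcloudDavUrl` is hypothetical and not part of Readur, and the sketch assumes Nextcloud is served from the root of the host rather than a subdirectory.

```javascript
// Hypothetical helper (not part of Readur): builds the Nextcloud-style
// WebDAV URL used in this guide from a server address and a username.
// Assumes Nextcloud is installed at the root of the host.
function nextcloudDavUrl(serverBase, username) {
  const url = new URL(serverBase);
  if (url.protocol !== "https:") {
    // This guide recommends HTTPS/TLS for all connections.
    throw new Error("Use an https:// server address for WebDAV sources");
  }
  // encodeURIComponent keeps usernames containing spaces or '@' valid in a path.
  url.pathname = `/remote.php/dav/files/${encodeURIComponent(username)}/`;
  return url.toString();
}

console.log(nextcloudDavUrl("https://cloud.mycompany.com", "john"));
```

The result can be pasted into the Server URL field and checked with the "Test Connection" button before saving.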
WebDAV Best Practices
- Use App Passwords: Create dedicated app passwords instead of using main account passwords
- Limit Scope: Specify watch folders to avoid syncing unnecessary files
- Server Optimization: Let Readur auto-detect server type for optimal performance
- Network Considerations: Use longer sync intervals for slow connections
Local Folder Sources
Local folder sources monitor directories on the Readur server's filesystem, including mounted network drives.
Use Cases
- Watch Folders: Monitor directories where documents are dropped
- Network Mounts: Sync from NFS, SMB/CIFS, or other mounted filesystems
- Batch Processing: Automatically process documents placed in specific folders
- Archive Integration: Monitor existing document archives
- Per-User Ingestion: Individual watch directories for each user (v2.5.4+)
Local Folder Configuration
Required Fields:
- Name: Descriptive name for the source
- Watch Folders: Absolute paths to monitor directories
Optional Configuration:
- File Extensions: Filter by specific file types
- Auto Sync: Enable scheduled monitoring
- Sync Interval: Frequency of directory scans
- Recursive: Include subdirectories in scans
- Follow Symlinks: Follow symbolic links (use with caution)
Setting Up Local Folder Sources
- Prepare Directory: Ensure the directory exists and is accessible
    # Create watch folder
    mkdir -p /mnt/documents/inbox
    # Set permissions (if needed)
    chmod 755 /mnt/documents/inbox
- Configure Source:
    Name: Document Inbox
    Watch Folders: /mnt/documents/inbox
    File Extensions: pdf,jpg,png,txt,docx
    Auto Sync: Enabled
    Sync Interval: 5 minutes
    Recursive: Yes
- Test Setup: Place a test document in the folder and verify detection
Network Mount Examples
NFS Mount:
# Mount NFS share
sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs-docs
# Configure in Readur
Watch Folders: /mnt/nfs-docs/inbox
SMB/CIFS Mount:
# Mount SMB share
sudo mount -t cifs //server/documents /mnt/smb-docs -o username=user
# Configure in Readur
Watch Folders: /mnt/smb-docs/processing
Per-User Watch Directories (v2.5.4+)
Each user can have their own dedicated watch directory for automatic document ingestion. This feature is ideal for multi-tenant deployments, department separation, and maintaining clear data boundaries.
Configuration:
# Enable per-user watch directories
ENABLE_PER_USER_WATCH=true
USER_WATCH_BASE_DIR=/data/user_watches
Directory Structure:
/data/user_watches/
├── john_doe/
│ ├── invoice.pdf
│ └── report.docx
├── jane_smith/
│ └── presentation.pptx
└── admin/
└── policy.pdf
API Management:
# Get user watch directory info
GET /api/users/{userId}/watch-directory
# Create/ensure watch directory exists
POST /api/users/{userId}/watch-directory
{
"ensure_created": true
}
# Delete user watch directory
DELETE /api/users/{userId}/watch-directory
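The three calls above can be wrapped in a small client. This is a sketch only: the endpoint paths and the `ensure_created` body come from this guide, while the helper names, the `Bearer` token header, and the base URL are assumptions that may not match how a given Readur deployment authenticates.

```javascript
// Hypothetical helper: describes one of the watch-directory calls listed above.
function watchDirectoryRequest(baseUrl, userId, action) {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const url = `${root}/api/users/${encodeURIComponent(userId)}/watch-directory`;
  switch (action) {
    case "info":
      return { url, method: "GET" };
    case "ensure":
      return {
        url,
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ ensure_created: true }),
      };
    case "delete":
      return { url, method: "DELETE" };
    default:
      throw new Error(`Unknown action: ${action}`);
  }
}

// Sends a described request with fetch (Node 18+ or a browser).
// Assumption: the API accepts a bearer token; adjust to your deployment.
async function send(request, token) {
  const { url, ...init } = request;
  init.headers = { ...init.headers, Authorization: `Bearer ${token}` };
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`${init.method} ${url} failed with HTTP ${response.status}`);
  }
  return response;
}
```

Keeping the request description separate from the network call makes it easy to log or inspect what would be sent, for example `watchDirectoryRequest("https://readur.example.com", "42", "ensure")`.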
Use Cases:
- Multi-tenant deployments: Isolate document ingestion per customer
- Department separation: Each department has its own ingestion folder
- Compliance: Maintain clear data separation between users
- Automation: Connect scanners or automation tools to user-specific folders
S3 Sources
S3 sources connect to Amazon S3 or S3-compatible storage services for document synchronization.
📖 Complete S3 Guide: For detailed S3 storage backend configuration, migration from local storage, and advanced features, see the S3 Storage Guide.
Supported S3 Services
| Service | Status | Configuration |
|---|---|---|
| Amazon S3 | ✅ Fully Supported | Standard AWS configuration |
| MinIO | ✅ Fully Supported | Custom endpoint URL |
| DigitalOcean Spaces | ✅ Supported | S3-compatible API |
| Wasabi | ✅ Supported | Custom endpoint configuration |
| Google Cloud Storage | ⚠️ Limited | S3-compatible mode only |
S3 Configuration
Required Fields:
- Name: Descriptive name for the source
- Bucket Name: S3 bucket to monitor
- Region: AWS region (e.g., us-east-1)
- Access Key ID: AWS/S3 access key
- Secret Access Key: AWS/S3 secret key
Optional Configuration:
- Endpoint URL: Custom endpoint for S3-compatible services
- Prefix: Bucket path prefix to limit scope
- Watch Folders: Specific S3 "directories" to monitor
- File Extensions: Filter by file types
- Auto Sync: Enable scheduled synchronization
- Sync Interval: Frequency of bucket scans
Setting Up S3 Sources
- Prepare S3 Bucket: Ensure bucket exists and credentials have access
- Configure Source:
    Name: Company Documents S3
    Bucket Name: company-documents
    Region: us-west-2
    Access Key ID: AKIAIOSFODNN7EXAMPLE
    Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    Prefix: documents/
    Watch Folders:
    - invoices/
    - contracts/
    - reports/
- Test Connection: Verify credentials and bucket access
S3-Compatible Services
MinIO Configuration:
Endpoint URL: https://minio.example.com:9000
Bucket Name: documents
Region: us-east-1 (can be any value for MinIO)
DigitalOcean Spaces:
Endpoint URL: https://nyc3.digitaloceanspaces.com
Bucket Name: my-documents
Region: nyc3
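Before entering S3 settings in the interface, it can help to check them for gaps. The sketch below lists the required fields named in this guide and reports what is missing. The camelCase property names and the function `checkS3Source` are hypothetical and chosen for illustration; they are not Readur's actual configuration schema.

```javascript
// Required fields as listed in this guide; property names are hypothetical.
const REQUIRED_S3_FIELDS = ["name", "bucketName", "region", "accessKeyId", "secretAccessKey"];

// Returns a list of human-readable problems; an empty list means the
// settings look complete. It does not contact the storage service.
function checkS3Source(config) {
  const problems = REQUIRED_S3_FIELDS
    .filter((field) => !config[field] || String(config[field]).trim() === "")
    .map((field) => `missing required field: ${field}`);

  // The endpoint is optional, but S3-compatible services need a valid URL.
  if (config.endpointUrl !== undefined) {
    try {
      new URL(config.endpointUrl);
    } catch {
      problems.push(`endpointUrl is not a valid URL: ${config.endpointUrl}`);
    }
  }
  return problems;
}

const minio = {
  name: "MinIO Documents",
  endpointUrl: "https://minio.example.com:9000",
  bucketName: "documents",
  region: "us-east-1",
  accessKeyId: "EXAMPLE",
  secretAccessKey: "EXAMPLE",
};
console.log(checkS3Source(minio));
```

A check like this only catches incomplete input; credentials and bucket access still need to be verified with the "Test Connection" step.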
Getting Started
Adding Your First Source
- Access Sources Management: Navigate to Settings → Sources
- Choose Source Type: Select WebDAV, Local Folder, or S3 based on your needs
- Configure Connection: Enter required credentials and connection details
- Test Connection: Verify connectivity before saving
- Configure Sync: Set up folders to monitor and sync schedule
- Initial Sync: Trigger first synchronization to import existing documents
Quick Setup Examples
Nextcloud WebDAV
Name: Nextcloud Documents
Server URL: https://cloud.company.com/remote.php/dav/files/username/
Username: username
Password: app-password
Watch Folders: Documents/, Shared/
Auto Sync: Every 30 minutes
Local Network Drive
Name: Network Archive
Watch Folders: /mnt/network/documents
File Extensions: pdf,doc,docx,txt
Recursive: Yes
Auto Sync: Every 15 minutes
AWS S3 Bucket
Name: AWS Document Bucket
Bucket: company-docs-bucket
Region: us-east-1
Access Key: [AWS Access Key]
Secret Key: [AWS Secret Key]
Prefix: active-documents/
Auto Sync: Every 1 hour
Configuration
Sync Settings
Sync Intervals:
- Real-time: Immediate processing (local folders only)
- 5-15 minutes: High-frequency monitoring
- 30-60 minutes: Standard monitoring (recommended)
- 2-24 hours: Low-frequency, large dataset sync
File Filtering:
- File Extensions: pdf,jpg,jpeg,png,txt,doc,docx,rtf
- Size Limits: Configurable maximum file size (default: 50MB)
- Path Exclusions: Skip specific directories or file patterns
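The three filters above can be pictured as a single predicate applied to each discovered file. This is an illustrative sketch, not Readur's implementation: the function `shouldSync`, its option names, and the reading of "50MB" as 50 × 1024 × 1024 bytes are all assumptions.

```javascript
const DEFAULT_EXTENSIONS = ["pdf", "jpg", "jpeg", "png", "txt", "doc", "docx", "rtf"];
// Assumption: the 50MB default is interpreted as 50 MiB here.
const DEFAULT_MAX_BYTES = 50 * 1024 * 1024;

// Hypothetical predicate: true when a discovered file passes all filters.
// `file` is { path, size } with size in bytes.
function shouldSync(file, options = {}) {
  const {
    extensions = DEFAULT_EXTENSIONS,
    maxBytes = DEFAULT_MAX_BYTES,
    excludedPrefixes = [],
  } = options;

  // Extension check, case-insensitive ("scan.PDF" counts as pdf).
  const dot = file.path.lastIndexOf(".");
  const ext = dot === -1 ? "" : file.path.slice(dot + 1).toLowerCase();
  if (!extensions.includes(ext)) return false;

  // Size limit.
  if (file.size > maxBytes) return false;

  // Path exclusions, modelled here as simple directory prefixes.
  return !excludedPrefixes.some((prefix) => file.path.startsWith(prefix));
}

console.log(shouldSync({ path: "Invoices/2024/scan.PDF", size: 120000 }));
```

Real exclusion rules may support glob patterns rather than plain prefixes; the prefix match is only the simplest stand-in.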
Advanced Configuration
Concurrency Settings:
- Concurrent Files: Number of files processed simultaneously (default: 5)
- Network Timeout: Connection timeout for network sources
- Retry Logic: Automatic retry for failed downloads
Deduplication:
- Hash-based: SHA-256 content hashing prevents duplicate storage
- Cross-source: Duplicates detected across all sources
- Metadata Preservation: Tracks file origins while avoiding storage duplication
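Hash-based deduplication can be sketched in a few lines with Node's built-in `crypto` module. The `DedupIndex` class and the origin strings below are hypothetical; the sketch only shows the idea of storing content once per SHA-256 hash while remembering every place it was found, and says nothing about how Readur stores its index.

```javascript
const { createHash } = require("crypto");

// Hypothetical in-memory index: one entry per distinct content hash.
class DedupIndex {
  constructor() {
    this.byHash = new Map(); // hex SHA-256 -> array of origins
  }

  // Records where the content was found. Returns true when the content
  // is new (store it), false when it duplicates something already seen.
  add(content, origin) {
    const hash = createHash("sha256").update(content).digest("hex");
    const origins = this.byHash.get(hash);
    if (origins) {
      origins.push(origin); // keep the origin, skip the duplicate storage
      return false;
    }
    this.byHash.set(hash, [origin]);
    return true;
  }
}

const index = new DedupIndex();
console.log(index.add(Buffer.from("invoice body"), "webdav:Documents/invoice.pdf"));
console.log(index.add(Buffer.from("invoice body"), "s3:invoices/invoice.pdf"));
```

Because the key is the content hash rather than the file name or source, the same document arriving through WebDAV and S3 is stored once, which matches the cross-source behaviour described above.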
Sync Operations
Manual Sync
Trigger Immediate Sync:
- Navigate to Sources page
- Find the source to sync
- Click the "Sync Now" button
- Monitor progress in real-time
Deep Scan:
- Forces complete re-scan of entire source
- Useful for detecting changes in large directories
- Automatically triggered periodically
Sync Status
Status Indicators:
- 🟢 Idle: Source ready, no sync in progress
- 🟡 Syncing: Active synchronization in progress
- 🔴 Error: Sync failed, requires attention
- ⚪ Disabled: Source disabled, no automatic sync
Progress Information:
- Files discovered vs. processed
- Current operation (scanning, downloading, processing)
- Estimated completion time
- Transfer speeds and statistics
Real-Time Sync Progress (v2.5.4+)
Readur uses WebSocket connections for real-time sync progress updates, providing lower latency and bidirectional communication compared to the previous Server-Sent Events implementation.
WebSocket Connection:
// Connect to the sync progress WebSocket (replace {sourceId} with the source's ID)
const ws = new WebSocket('wss://readur.example.com/api/sources/{sourceId}/sync/progress');
ws.onmessage = (event) => {
  const progress = JSON.parse(event.data);
  // The completion percentage is carried in the event's "progress" field
  console.log(`Sync progress: ${progress.progress}%`);
};
Progress Event Format:
{
"phase": "discovering",
"progress": 45,
"current_file": "document.pdf",
"total_files": 150,
"processed_files": 68,
"status": "in_progress"
}
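A client can turn each event into a one-line status message. The sketch below uses only the fields shown in the event above; the function name `summarizeProgress` is hypothetical, and it recomputes the percentage from the file counts rather than assuming exactly how the `progress` field is derived.

```javascript
// Hypothetical formatter for a raw progress message (a JSON string).
function summarizeProgress(rawMessage) {
  const event = JSON.parse(rawMessage);
  const { phase, status, current_file, total_files, processed_files } = event;

  // Guard against division by zero while discovery has not found files yet.
  const percent = total_files > 0
    ? Math.round((processed_files / total_files) * 100)
    : 0;

  return `${phase} (${status}): ${processed_files}/${total_files} files, ${percent}% - ${current_file}`;
}

const sample = JSON.stringify({
  phase: "discovering",
  progress: 45,
  current_file: "document.pdf",
  total_files: 150,
  processed_files: 68,
  status: "in_progress",
});
console.log(summarizeProgress(sample));
```

For the sample event, 68 of 150 files rounds to 45%, which agrees with the `progress` value in the example.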
Benefits:
- Bidirectional communication for interactive control
- 50% reduction in bandwidth compared to SSE
- Automatic reconnection handling
- Lower server CPU usage
Stopping Sync
Graceful Cancellation:
- Click "Stop Sync" button during active sync
- Current file processing completes
- Sync stops cleanly without corruption
- Partial progress is saved
Health Monitoring
Health Scores
Sources are continuously monitored and assigned health scores (0-100):
- 90-100: ✅ Excellent - No issues detected
- 75-89: ⚠️ Good - Minor issues or warnings
- 50-74: ⚠️ Fair - Moderate issues requiring attention
- 25-49: ❌ Poor - Significant problems
- 0-24: ❌ Critical - Severe issues, manual intervention required
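The score bands above map directly to a lookup. This is a small sketch for scripts or dashboards that consume health scores; the function name `healthLabel` is hypothetical, and how the score is obtained is outside its scope.

```javascript
// Maps a 0-100 health score to the band names used in this guide.
function healthLabel(score) {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`Health score must be between 0 and 100, got ${score}`);
  }
  if (score >= 90) return "Excellent";
  if (score >= 75) return "Good";
  if (score >= 50) return "Fair";
  if (score >= 25) return "Poor";
  return "Critical";
}

console.log(healthLabel(82));
```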
Health Checks
Automatic Validation (every 30 minutes):
- Connection testing
- Credential verification
- Configuration validation
- Sync pattern analysis
- Error rate monitoring
Common Health Issues:
- Authentication failures
- Network connectivity problems
- Permission or access issues
- Configuration errors
- Rate limiting or throttling
Health Notifications
Alert Types:
- Connection failures
- Authentication expiry
- Sync errors
- Performance degradation
- Configuration warnings
Troubleshooting
Common Issues
WebDAV Connection Problems
Symptom: "Connection failed" or authentication errors
Solutions:
- Verify server URL format:
    - Nextcloud: https://server.com/remote.php/dav/files/username/
    - ownCloud: https://server.com/remote.php/dav/files/username/
    - Generic: https://server.com/webdav/
- Check credentials:
    - Use app passwords instead of main passwords
    - Verify username/password combination
    - Test credentials in web browser or WebDAV client
- Network issues:
    - Verify server is accessible from Readur
    - Check firewall and SSL certificate issues
    - Test with curl:
        curl -u username:password https://server.com/webdav/
Local Folder Issues
Symptom: "Permission denied" or "Directory not found"
Solutions:
- Check directory permissions:
    ls -la /path/to/watch/folder
    chmod 755 /path/to/watch/folder  # If needed
- Verify path exists:
    stat /path/to/watch/folder
- For network mounts:
    mount | grep /path/to/mount  # Verify mount
    ls -la /path/to/mount  # Test access
S3 Access Problems
Symptom: "Access denied" or "Bucket not found"
Solutions:
- Verify credentials and permissions:
    aws s3 ls s3://bucket-name --profile your-profile
- Check bucket policy and IAM permissions
- Verify region configuration matches bucket region
- For S3-compatible services, ensure correct endpoint URL
Performance Issues
Slow Sync Performance
Causes and Solutions:
- Large file sizes: Increase timeout values, consider file size limits
- Network latency: Reduce concurrent connections, increase intervals
- Server throttling: Implement longer delays between requests
- Large directories: Use watch folders to limit scope
High Resource Usage
Optimization Strategies:
- Reduce concurrency: Lower concurrent file processing
- Increase intervals: Less frequent sync checks
- Filter files: Limit to specific file types and sizes
- Stagger syncs: Avoid multiple sources syncing simultaneously
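One simple way to stagger syncs is to spread source start times evenly across the shared interval. The sketch below computes such offsets; the function name `staggerOffsets` is hypothetical, and applying the offsets (for example by enabling each source's auto sync at a different time) is left to the administrator, since this guide does not describe a built-in offset setting.

```javascript
// Spreads sources evenly across one sync interval.
// Returns the number of minutes each source should start after the first.
function staggerOffsets(sourceNames, intervalMinutes) {
  const step = intervalMinutes / sourceNames.length;
  return sourceNames.map((name, index) => ({
    name,
    offsetMinutes: Math.round(index * step),
  }));
}

console.log(staggerOffsets(["Nextcloud Documents", "Network Archive", "AWS Document Bucket"], 30));
```

With three sources on a 30-minute interval, the offsets come out as 0, 10, and 20 minutes, so no two sources begin scanning at the same moment.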
Error Recovery
Automatic Recovery:
- Failed files are automatically retried
- Temporary network issues are handled gracefully
- Sync resumes from last successful point
Manual Recovery:
- Check source health status
- Review error logs in source details
- Test connection manually
- Trigger deep scan to reset sync state
Best Practices
Security
- Use Dedicated Credentials: Create app-specific passwords and access keys
- Limit Permissions: Grant minimum required access to source accounts
- Regular Rotation: Periodically update passwords and access keys
- Network Security: Use HTTPS/TLS for all connections
Performance
- Strategic Scheduling: Stagger sync times for multiple sources
- Scope Limitation: Use watch folders to limit sync scope
- File Filtering: Exclude unnecessary file types and large files
- Monitor Resources: Watch CPU, memory, and network usage
Organization
- Descriptive Names: Use clear, descriptive source names
- Consistent Structure: Maintain consistent folder organization
- Documentation: Document source purposes and configurations
- Regular Maintenance: Periodically review and clean up sources
Reliability
- Health Monitoring: Regularly check source health scores
- Backup Configuration: Document source configurations
- Test Scenarios: Periodically test sync and recovery procedures
- Monitor Logs: Review sync logs for patterns or issues
Next Steps
- Configure notifications for sync events
- Set up advanced search to find synced documents
- Review OCR optimization for processing improvements
- Explore labels and organization for document management