498 lines
15 KiB
Markdown
498 lines
15 KiB
Markdown
# Sources Guide
|
|
|
|
Readur's Sources feature provides powerful automated document ingestion from multiple external storage systems. This comprehensive guide covers all supported source types and their configuration.
|
|
|
|
## Table of Contents
|
|
|
|
- [Overview](#overview)
|
|
- [Source Types](#source-types)
|
|
- [WebDAV Sources](#webdav-sources)
|
|
- [Local Folder Sources](#local-folder-sources)
|
|
- [S3 Sources](#s3-sources)
|
|
- [Getting Started](#getting-started)
|
|
- [Configuration](#configuration)
|
|
- [Sync Operations](#sync-operations)
|
|
- [Health Monitoring](#health-monitoring)
|
|
- [Troubleshooting](#troubleshooting)
|
|
- [Best Practices](#best-practices)
|
|
|
|
## Overview
|
|
|
|
Sources allow Readur to automatically discover, download, and process documents from external storage systems. Key features include:
|
|
|
|
- **Multi-Protocol Support**: WebDAV, Local Folders, and S3-compatible storage
|
|
- **Automated Syncing**: Scheduled synchronization with configurable intervals
|
|
- **Health Monitoring**: Proactive monitoring and validation of source connections
|
|
- **Intelligent Processing**: Duplicate detection, incremental syncs, and OCR integration
|
|
- **Real-time Status**: Live sync progress and comprehensive statistics
|
|
|
|
### How Sources Work
|
|
|
|
1. **Configuration**: Set up a source with connection details and preferences
|
|
2. **Discovery**: Readur scans the source for supported file types
|
|
3. **Synchronization**: New and changed files are downloaded and processed
|
|
4. **OCR Processing**: Documents are automatically queued for text extraction
|
|
5. **Search Integration**: Processed documents become searchable in your collection
|
|
|
|
## Source Types
|
|
|
|
### WebDAV Sources
|
|
|
|
WebDAV sources connect to cloud storage services and self-hosted servers that support the WebDAV protocol.
|
|
|
|
#### Supported WebDAV Servers
|
|
|
|
| Server Type | Status | Notes |
|
|
|-------------|--------|-------|
|
|
| **Nextcloud** | ✅ Fully Supported | Optimized discovery and authentication |
|
|
| **ownCloud** | ✅ Fully Supported | Native integration with server detection |
|
|
| **Apache WebDAV** | ✅ Supported | Generic WebDAV implementation |
|
|
| **nginx WebDAV** | ✅ Supported | Works with nginx dav module |
|
|
| **Box.com** | ⚠️ Limited | Basic WebDAV support |
|
|
| **Other WebDAV** | ✅ Supported | Generic WebDAV protocol compliance |
|
|
|
|
#### WebDAV Configuration
|
|
|
|
**Required Fields:**
|
|
- **Name**: Descriptive name for the source
|
|
- **Server URL**: Full WebDAV server URL (e.g., `https://cloud.example.com/remote.php/dav/files/username/`)
|
|
- **Username**: WebDAV authentication username
|
|
- **Password**: WebDAV authentication password or app password
|
|
|
|
**Optional Configuration:**
|
|
- **Watch Folders**: Specific directories to monitor (leave empty to sync entire accessible space)
|
|
- **File Extensions**: Limit to specific file types (default: all supported types)
|
|
- **Auto Sync**: Enable automatic scheduled synchronization
|
|
- **Sync Interval**: How often to check for changes (15 minutes to 24 hours)
|
|
- **Server Type**: Specify server type for optimizations (auto-detected)
|
|
|
|
#### Setting Up WebDAV Sources
|
|
|
|
1. **Navigate to Sources**: Go to Settings → Sources in the Readur interface
|
|
2. **Add New Source**: Click "Add Source" and select "WebDAV"
|
|
3. **Configure Connection**:
|
|
```
|
|
Name: My Nextcloud Documents
|
|
Server URL: https://cloud.mycompany.com/remote.php/dav/files/john/
|
|
Username: john
|
|
Password: app-password-here
|
|
```
|
|
4. **Test Connection**: Use the "Test Connection" button to verify credentials
|
|
5. **Configure Folders**: Specify directories to monitor:
|
|
```
|
|
Watch Folders:
|
|
- Documents/
|
|
- Projects/2024/
|
|
- Invoices/
|
|
```
|
|
6. **Set Sync Schedule**: Choose automatic sync interval (recommended: 30 minutes)
|
|
7. **Save and Sync**: Save configuration and trigger initial sync
|
|
|
|
#### WebDAV Best Practices
|
|
|
|
- **Use App Passwords**: Create dedicated app passwords instead of using main account passwords
|
|
- **Limit Scope**: Specify watch folders to avoid syncing unnecessary files
|
|
- **Server Optimization**: Let Readur auto-detect server type for optimal performance
|
|
- **Network Considerations**: Use longer sync intervals for slow connections
|
|
|
|
### Local Folder Sources
|
|
|
|
Local folder sources monitor directories on the Readur server's filesystem, including mounted network drives.
|
|
|
|
#### Use Cases
|
|
|
|
- **Watch Folders**: Monitor directories where documents are dropped
|
|
- **Network Mounts**: Sync from NFS, SMB/CIFS, or other mounted filesystems
|
|
- **Batch Processing**: Automatically process documents placed in specific folders
|
|
- **Archive Integration**: Monitor existing document archives
|
|
|
|
#### Local Folder Configuration
|
|
|
|
**Required Fields:**
|
|
- **Name**: Descriptive name for the source
|
|
- **Watch Folders**: Absolute paths to monitor directories
|
|
|
|
**Optional Configuration:**
|
|
- **File Extensions**: Filter by specific file types
|
|
- **Auto Sync**: Enable scheduled monitoring
|
|
- **Sync Interval**: Frequency of directory scans
|
|
- **Recursive**: Include subdirectories in scans
|
|
- **Follow Symlinks**: Follow symbolic links (use with caution)
|
|
|
|
#### Setting Up Local Folder Sources
|
|
|
|
1. **Prepare Directory**: Ensure the directory exists and is accessible
|
|
```bash
|
|
# Create watch folder
|
|
mkdir -p /mnt/documents/inbox
|
|
|
|
# Set permissions (if needed)
|
|
chmod 755 /mnt/documents/inbox
|
|
```
|
|
|
|
2. **Configure Source**:
|
|
```
|
|
Name: Document Inbox
|
|
Watch Folders: /mnt/documents/inbox
|
|
File Extensions: pdf,jpg,png,txt,docx
|
|
Auto Sync: Enabled
|
|
Sync Interval: 5 minutes
|
|
Recursive: Yes
|
|
```
|
|
|
|
3. **Test Setup**: Place a test document in the folder and verify detection
|
|
|
|
#### Network Mount Examples
|
|
|
|
**NFS Mount:**
|
|
```bash
|
|
# Mount NFS share
|
|
sudo mount -t nfs 192.168.1.100:/documents /mnt/nfs-docs
|
|
|
|
# Configure in Readur
|
|
Watch Folders: /mnt/nfs-docs/inbox
|
|
```
|
|
|
|
**SMB/CIFS Mount:**
|
|
```bash
|
|
# Mount SMB share
|
|
sudo mount -t cifs //server/documents /mnt/smb-docs -o username=user
|
|
|
|
# Configure in Readur
|
|
Watch Folders: /mnt/smb-docs/processing
|
|
```
|
|
|
|
### S3 Sources
|
|
|
|
S3 sources connect to Amazon S3 or S3-compatible storage services for document synchronization.
|
|
|
|
#### Supported S3 Services
|
|
|
|
| Service | Status | Configuration |
|
|
|---------|--------|---------------|
|
|
| **Amazon S3** | ✅ Fully Supported | Standard AWS configuration |
|
|
| **MinIO** | ✅ Fully Supported | Custom endpoint URL |
|
|
| **DigitalOcean Spaces** | ✅ Supported | S3-compatible API |
|
|
| **Wasabi** | ✅ Supported | Custom endpoint configuration |
|
|
| **Google Cloud Storage** | ⚠️ Limited | S3-compatible mode only |
|
|
|
|
#### S3 Configuration
|
|
|
|
**Required Fields:**
|
|
- **Name**: Descriptive name for the source
|
|
- **Bucket Name**: S3 bucket to monitor
|
|
- **Region**: AWS region (e.g., `us-east-1`)
|
|
- **Access Key ID**: AWS/S3 access key
|
|
- **Secret Access Key**: AWS/S3 secret key
|
|
|
|
**Optional Configuration:**
|
|
- **Endpoint URL**: Custom endpoint for S3-compatible services
|
|
- **Prefix**: Bucket path prefix to limit scope
|
|
- **Watch Folders**: Specific S3 "directories" to monitor
|
|
- **File Extensions**: Filter by file types
|
|
- **Auto Sync**: Enable scheduled synchronization
|
|
- **Sync Interval**: Frequency of bucket scans
|
|
|
|
#### Setting Up S3 Sources
|
|
|
|
1. **Prepare S3 Bucket**: Ensure bucket exists and credentials have access
|
|
2. **Configure Source**:
|
|
```
|
|
Name: Company Documents S3
|
|
Bucket Name: company-documents
|
|
Region: us-west-2
|
|
Access Key ID: AKIAIOSFODNN7EXAMPLE
|
|
Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
|
|
Prefix: documents/
|
|
Watch Folders:
|
|
- invoices/
|
|
- contracts/
|
|
- reports/
|
|
```
|
|
|
|
3. **Test Connection**: Verify credentials and bucket access
|
|
|
|
#### S3-Compatible Services
|
|
|
|
**MinIO Configuration:**
|
|
```
|
|
Endpoint URL: https://minio.example.com:9000
|
|
Bucket Name: documents
|
|
Region: us-east-1 (can be any value for MinIO)
|
|
```
|
|
|
|
**DigitalOcean Spaces:**
|
|
```
|
|
Endpoint URL: https://nyc3.digitaloceanspaces.com
|
|
Bucket Name: my-documents
|
|
Region: nyc3
|
|
```
|
|
|
|
## Getting Started
|
|
|
|
### Adding Your First Source
|
|
|
|
1. **Access Sources Management**: Navigate to Settings → Sources
|
|
2. **Choose Source Type**: Select WebDAV, Local Folder, or S3 based on your needs
|
|
3. **Configure Connection**: Enter required credentials and connection details
|
|
4. **Test Connection**: Verify connectivity before saving
|
|
5. **Configure Sync**: Set up folders to monitor and sync schedule
|
|
6. **Initial Sync**: Trigger first synchronization to import existing documents
|
|
|
|
### Quick Setup Examples
|
|
|
|
#### Nextcloud WebDAV
|
|
```
|
|
Name: Nextcloud Documents
|
|
Server URL: https://cloud.company.com/remote.php/dav/files/username/
|
|
Username: username
|
|
Password: app-password
|
|
Watch Folders: Documents/, Shared/
|
|
Auto Sync: Every 30 minutes
|
|
```
|
|
|
|
#### Local Network Drive
|
|
```
|
|
Name: Network Archive
|
|
Watch Folders: /mnt/network/documents
|
|
File Extensions: pdf,doc,docx,txt
|
|
Recursive: Yes
|
|
Auto Sync: Every 15 minutes
|
|
```
|
|
|
|
#### AWS S3 Bucket
|
|
```
|
|
Name: AWS Document Bucket
|
|
Bucket: company-docs-bucket
|
|
Region: us-east-1
|
|
Access Key: [AWS Access Key]
|
|
Secret Key: [AWS Secret Key]
|
|
Prefix: active-documents/
|
|
Auto Sync: Every 1 hour
|
|
```
|
|
|
|
## Configuration
|
|
|
|
### Sync Settings
|
|
|
|
**Sync Intervals:**
|
|
- **Real-time**: Immediate processing (local folders only)
|
|
- **5-15 minutes**: High-frequency monitoring
|
|
- **30-60 minutes**: Standard monitoring (recommended)
|
|
- **2-24 hours**: Low-frequency, large dataset sync
|
|
|
|
**File Filtering:**
|
|
- **File Extensions**: `pdf,jpg,jpeg,png,txt,doc,docx,rtf`
|
|
- **Size Limits**: Configurable maximum file size (default: 50MB)
|
|
- **Path Exclusions**: Skip specific directories or file patterns
|
|
|
|
### Advanced Configuration
|
|
|
|
**Concurrency Settings:**
|
|
- **Concurrent Files**: Number of files processed simultaneously (default: 5)
|
|
- **Network Timeout**: Connection timeout for network sources
|
|
- **Retry Logic**: Automatic retry for failed downloads
|
|
|
|
**Deduplication:**
|
|
- **Hash-based**: SHA-256 content hashing prevents duplicate storage
|
|
- **Cross-source**: Duplicates detected across all sources
|
|
- **Metadata Preservation**: Tracks file origins while avoiding storage duplication
|
|
|
|
## Sync Operations
|
|
|
|
### Manual Sync
|
|
|
|
**Trigger Immediate Sync:**
|
|
1. Navigate to Sources page
|
|
2. Find the source to sync
|
|
3. Click the "Sync Now" button
|
|
4. Monitor progress in real-time
|
|
|
|
**Deep Scan:**
|
|
- Forces complete re-scan of entire source
|
|
- Useful for detecting changes in large directories
|
|
- Automatically triggered periodically
|
|
|
|
### Sync Status
|
|
|
|
**Status Indicators:**
|
|
- 🟢 **Idle**: Source ready, no sync in progress
|
|
- 🟡 **Syncing**: Active synchronization in progress
|
|
- 🔴 **Error**: Sync failed, requires attention
|
|
- ⚪ **Disabled**: Source disabled, no automatic sync
|
|
|
|
**Progress Information:**
|
|
- Files discovered vs. processed
|
|
- Current operation (scanning, downloading, processing)
|
|
- Estimated completion time
|
|
- Transfer speeds and statistics
|
|
|
|
### Stopping Sync
|
|
|
|
**Graceful Cancellation:**
|
|
1. Click "Stop Sync" button during active sync
|
|
2. Current file processing completes
|
|
3. Sync stops cleanly without corruption
|
|
4. Partial progress is saved
|
|
|
|
## Health Monitoring
|
|
|
|
### Health Scores
|
|
|
|
Sources are continuously monitored and assigned health scores (0-100):
|
|
|
|
- **90-100**: ✅ Excellent - No issues detected
|
|
- **75-89**: ⚠️ Good - Minor issues or warnings
|
|
- **50-74**: ⚠️ Fair - Moderate issues requiring attention
|
|
- **25-49**: ❌ Poor - Significant problems
|
|
- **0-24**: ❌ Critical - Severe issues, manual intervention required
|
|
|
|
### Health Checks
|
|
|
|
**Automatic Validation** (every 30 minutes):
|
|
- Connection testing
|
|
- Credential verification
|
|
- Configuration validation
|
|
- Sync pattern analysis
|
|
- Error rate monitoring
|
|
|
|
**Common Health Issues:**
|
|
- Authentication failures
|
|
- Network connectivity problems
|
|
- Permission or access issues
|
|
- Configuration errors
|
|
- Rate limiting or throttling
|
|
|
|
### Health Notifications
|
|
|
|
**Alert Types:**
|
|
- Connection failures
|
|
- Authentication expires
|
|
- Sync errors
|
|
- Performance degradation
|
|
- Configuration warnings
|
|
|
|
## Troubleshooting
|
|
|
|
### Common Issues
|
|
|
|
#### WebDAV Connection Problems
|
|
|
|
**Symptom**: "Connection failed" or authentication errors
|
|
**Solutions**:
|
|
1. Verify server URL format:
|
|
- Nextcloud: `https://server.com/remote.php/dav/files/username/`
|
|
- ownCloud: `https://server.com/remote.php/dav/files/username/`
|
|
- Generic: `https://server.com/webdav/`
|
|
|
|
2. Check credentials:
|
|
- Use app passwords instead of main passwords
|
|
- Verify username/password combination
|
|
- Test credentials in web browser or WebDAV client
|
|
|
|
3. Network issues:
|
|
- Verify server is accessible from Readur
|
|
- Check firewall and SSL certificate issues
|
|
- Test with curl: `curl -u username:password https://server.com/webdav/`
|
|
|
|
#### Local Folder Issues
|
|
|
|
**Symptom**: "Permission denied" or "Directory not found"
|
|
**Solutions**:
|
|
1. Check directory permissions:
|
|
```bash
|
|
ls -la /path/to/watch/folder
|
|
chmod 755 /path/to/watch/folder # If needed
|
|
```
|
|
|
|
2. Verify path exists:
|
|
```bash
|
|
stat /path/to/watch/folder
|
|
```
|
|
|
|
3. For network mounts:
|
|
```bash
|
|
mount | grep /path/to/mount # Verify mount
|
|
ls -la /path/to/mount # Test access
|
|
```
|
|
|
|
#### S3 Access Problems
|
|
|
|
**Symptom**: "Access denied" or "Bucket not found"
|
|
**Solutions**:
|
|
1. Verify credentials and permissions:
|
|
```bash
|
|
aws s3 ls s3://bucket-name --profile your-profile
|
|
```
|
|
|
|
2. Check bucket policy and IAM permissions
|
|
3. Verify region configuration matches bucket region
|
|
4. For S3-compatible services, ensure correct endpoint URL
|
|
|
|
### Performance Issues
|
|
|
|
#### Slow Sync Performance
|
|
|
|
**Causes and Solutions**:
|
|
1. **Large file sizes**: Increase timeout values, consider file size limits
|
|
2. **Network latency**: Reduce concurrent connections, increase intervals
|
|
3. **Server throttling**: Implement longer delays between requests
|
|
4. **Large directories**: Use watch folders to limit scope
|
|
|
|
#### High Resource Usage
|
|
|
|
**Optimization Strategies**:
|
|
1. **Reduce concurrency**: Lower concurrent file processing
|
|
2. **Increase intervals**: Less frequent sync checks
|
|
3. **Filter files**: Limit to specific file types and sizes
|
|
4. **Stagger syncs**: Avoid multiple sources syncing simultaneously
|
|
|
|
### Error Recovery
|
|
|
|
**Automatic Recovery:**
|
|
- Failed files are automatically retried
|
|
- Temporary network issues are handled gracefully
|
|
- Sync resumes from last successful point
|
|
|
|
**Manual Recovery:**
|
|
1. Check source health status
|
|
2. Review error logs in source details
|
|
3. Test connection manually
|
|
4. Trigger deep scan to reset sync state
|
|
|
|
## Best Practices
|
|
|
|
### Security
|
|
|
|
1. **Use Dedicated Credentials**: Create app-specific passwords and access keys
|
|
2. **Limit Permissions**: Grant minimum required access to source accounts
|
|
3. **Regular Rotation**: Periodically update passwords and access keys
|
|
4. **Network Security**: Use HTTPS/TLS for all connections
|
|
|
|
### Performance
|
|
|
|
1. **Strategic Scheduling**: Stagger sync times for multiple sources
|
|
2. **Scope Limitation**: Use watch folders to limit sync scope
|
|
3. **File Filtering**: Exclude unnecessary file types and large files
|
|
4. **Monitor Resources**: Watch CPU, memory, and network usage
|
|
|
|
### Organization
|
|
|
|
1. **Descriptive Names**: Use clear, descriptive source names
|
|
2. **Consistent Structure**: Maintain consistent folder organization
|
|
3. **Documentation**: Document source purposes and configurations
|
|
4. **Regular Maintenance**: Periodically review and clean up sources
|
|
|
|
### Reliability
|
|
|
|
1. **Health Monitoring**: Regularly check source health scores
|
|
2. **Backup Configuration**: Document source configurations
|
|
3. **Test Scenarios**: Periodically test sync and recovery procedures
|
|
4. **Monitor Logs**: Review sync logs for patterns or issues
|
|
|
|
## Next Steps
|
|
|
|
- Configure [notifications](notifications.md) for sync events
|
|
- Set up [advanced search](advanced-search.md) to find synced documents
|
|
- Review [OCR optimization](dev/OCR_OPTIMIZATION_GUIDE.md) for processing improvements
|
|
- Explore [labels and organization](labels-and-organization.md) for document management |