263 lines
6.9 KiB
Markdown
263 lines
6.9 KiB
Markdown
# Watch Folder Guide
|
|
|
|
The watch folder feature automatically monitors a directory for new files and processes them with OCR, making them searchable in Readur. Your original files are never modified or deleted - Readur simply copies and processes them while leaving the originals untouched.
|
|
|
|
## What is Watch Folder?
|
|
|
|
Watch folder allows you to:
|
|
- **Drop files anywhere** - Point Readur to any folder (local, network drive, cloud mount)
|
|
- **Automatic processing** - New files are automatically detected and processed
|
|
- **Non-destructive** - Original files remain exactly where you put them
|
|
- **Background operation** - Processing happens in the background while you continue working
|
|
|
|
Perfect for scenarios where you want to automatically process files from:
|
|
- Network drives (NFS, SMB shares)
|
|
- Cloud storage mounts (Google Drive, Dropbox, OneDrive)
|
|
- Local folders where you save scanned documents
|
|
- Shared team folders
|
|
|
|
## How It Works
|
|
|
|
1. **Point Readur to your folder** - Set the `WATCH_FOLDER` path to any directory you want monitored
|
|
2. **Drop files** - Add documents to that folder (PDFs, images, text files, Word docs)
|
|
3. **Automatic detection** - Readur notices new files within seconds (local) or minutes (network)
|
|
4. **OCR processing** - Files are automatically processed to extract searchable text
|
|
5. **Search and find** - Your documents become searchable in the Readur web interface
|
|
|
|
## Key Features
|
|
|
|
✅ **Works with any storage type** - Local drives, network shares, cloud mounts
|
|
✅ **Smart processing** - Only processes supported file types
|
|
✅ **Duplicate prevention** - Won't process the same file twice
|
|
✅ **Safe operation** - Never modifies or deletes your original files
|
|
✅ **Background processing** - Doesn't interrupt your workflow
|
|
|
|
## Quick Setup
|
|
|
|
### Basic Setup (Docker Compose)
|
|
|
|
1. **Edit your docker-compose.yml**:
|
|
```yaml
|
|
services:
|
|
readur:
|
|
image: readur:latest
|
|
volumes:
|
|
# Mount your folder to the watch directory
|
|
- /path/to/your/documents:/app/watch
|
|
environment:
|
|
- WATCH_FOLDER=/app/watch
|
|
```
|
|
|
|
2. **Start Readur**:
|
|
```bash
|
|
docker compose up -d
|
|
```
|
|
|
|
3. **Start dropping files** into `/path/to/your/documents` - they'll be automatically processed!
|
|
|
|
### Configuration Options
|
|
|
|
| Setting | Default | What it does |
|
|
|---------|---------|-------------|
|
|
| `WATCH_FOLDER` | `./watch` | Which folder to monitor |
|
|
| `WATCH_INTERVAL_SECONDS` | `30` | How often to check for new files (network drives) |
|
|
| `MAX_FILE_AGE_HOURS` | _(none)_ | Ignore files older than this |
|
|
| `ALLOWED_FILE_TYPES` | `pdf,png,jpg,jpeg,tiff,bmp,txt,doc,docx` | Which file types to process |
|
|
|
|
## Usage
|
|
|
|
### Basic Setup
|
|
|
|
1. **Set the watch folder path:**
|
|
```bash
|
|
export WATCH_FOLDER=/path/to/your/mounted/folder
|
|
```
|
|
|
|
2. **Start the application:**
|
|
```bash
|
|
./readur
|
|
```
|
|
|
|
3. **Copy files to the watch folder:**
|
|
The application will automatically detect and process new files.
|
|
|
|
### Docker Usage
|
|
|
|
```dockerfile
|
|
# Mount your folder to the container's watch directory
|
|
docker run -d \
|
|
-v /path/to/your/files:/app/watch \
|
|
-e WATCH_FOLDER=/app/watch \
|
|
-e WATCH_INTERVAL_SECONDS=60 \
|
|
readur:latest
|
|
```
|
|
|
|
### Docker Compose
|
|
|
|
```yaml
|
|
services:
|
|
readur:
|
|
image: readur:latest
|
|
volumes:
|
|
- /mnt/nfs/documents:/app/watch
|
|
- readur_uploads:/app/uploads
|
|
environment:
|
|
WATCH_FOLDER: /app/watch
|
|
WATCH_INTERVAL_SECONDS: 30
|
|
FILE_STABILITY_CHECK_MS: 1000
|
|
MAX_FILE_AGE_HOURS: 168 # 1 week
|
|
ports:
|
|
- "8000:8000"
|
|
```
|
|
|
|
## Filesystem-Specific Configuration
|
|
|
|
### NFS Mounts
|
|
```bash
|
|
# Recommended settings for NFS
|
|
export WATCH_INTERVAL_SECONDS=60
|
|
export FILE_STABILITY_CHECK_MS=1000
|
|
export FORCE_POLLING_WATCH=1
|
|
```
|
|
|
|
### SMB/CIFS Mounts
|
|
```bash
|
|
# Recommended settings for SMB
|
|
export WATCH_INTERVAL_SECONDS=30
|
|
export FILE_STABILITY_CHECK_MS=2000
|
|
```
|
|
|
|
### S3 Mounts (s3fs, goofys, etc.)
|
|
```bash
|
|
# Recommended settings for S3
|
|
export WATCH_INTERVAL_SECONDS=120
|
|
export FILE_STABILITY_CHECK_MS=5000
|
|
export FORCE_POLLING_WATCH=1
|
|
```
|
|
|
|
### Local Filesystems
|
|
```bash
|
|
# Optimal settings for local storage (default behavior)
|
|
# No special configuration needed - uses inotify automatically
|
|
```
|
|
|
|
## Supported File Types
|
|
|
|
The watch folder processes these file types for OCR:
|
|
|
|
- **PDF**: `*.pdf`
|
|
- **Images**: `*.png`, `*.jpg`, `*.jpeg`, `*.tiff`, `*.bmp`, `*.gif`
|
|
- **Text**: `*.txt`
|
|
- **Word Documents**: `*.doc`, `*.docx`
|
|
|
|
## File Processing Priority
|
|
|
|
Files are prioritized for OCR processing based on:
|
|
|
|
1. **File Size**: Smaller files get higher priority
|
|
2. **File Type**: Images > Text files > PDFs > Word documents
|
|
3. **Queue Time**: Older items get higher priority within the same size/type category
|
|
|
|
## Monitoring and Logs
|
|
|
|
The application provides detailed logging for watch folder operations:
|
|
|
|
```
|
|
INFO readur::watcher: Starting hybrid folder watcher on: /app/watch
|
|
INFO readur::watcher: Using watch strategy: Hybrid
|
|
INFO readur::watcher: Started polling-based watcher on: /app/watch
|
|
INFO readur::watcher: Processing new file: "/app/watch/document.pdf"
|
|
INFO readur::watcher: Successfully queued file for OCR: document.pdf (size: 2048 bytes)
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Files Not Being Detected
|
|
|
|
1. **Check permissions:**
|
|
```bash
|
|
ls -la /path/to/watch/folder
|
|
chmod 755 /path/to/watch/folder
|
|
```
|
|
|
|
2. **Verify file types:**
|
|
```bash
|
|
# Only supported file types are processed
|
|
echo $ALLOWED_FILE_TYPES
|
|
```
|
|
|
|
3. **Check file stability:**
|
|
```bash
|
|
# Increase stability check time for slow networks
|
|
export FILE_STABILITY_CHECK_MS=2000
|
|
```
|
|
|
|
### High CPU Usage
|
|
|
|
1. **Increase polling interval:**
|
|
```bash
|
|
export WATCH_INTERVAL_SECONDS=120
|
|
```
|
|
|
|
2. **Limit file age:**
|
|
```bash
|
|
export MAX_FILE_AGE_HOURS=24
|
|
```
|
|
|
|
### Network Mount Issues
|
|
|
|
1. **Force polling mode:**
|
|
```bash
|
|
export FORCE_POLLING_WATCH=1
|
|
```
|
|
|
|
2. **Increase stability check:**
|
|
```bash
|
|
export FILE_STABILITY_CHECK_MS=5000
|
|
```
|
|
|
|
## Testing
|
|
|
|
Use the provided test script to verify functionality:
|
|
|
|
```bash
|
|
./test_watch_folder.sh
|
|
```
|
|
|
|
This creates sample files in the watch folder for testing.
|
|
|
|
## Security Considerations
|
|
|
|
- Files are copied to a secure upload directory, not processed in-place
|
|
- Original files in the watch folder are never modified or deleted
|
|
- System files and hidden files are automatically excluded
|
|
- File size limits prevent processing of excessively large files (>500MB)
|
|
|
|
## Performance
|
|
|
|
- **Local filesystems**: Near-instant detection via inotify
|
|
- **Network filesystems**: Detection within polling interval (default 30s)
|
|
- **Concurrent processing**: Multiple files processed simultaneously
|
|
- **Memory efficient**: Streams large files without loading entirely into memory
|
|
|
|
## Examples
|
|
|
|
### Basic File Drop
|
|
```bash
|
|
# Copy a file to the watch folder
|
|
cp document.pdf /app/watch/
|
|
# File will be automatically detected and processed
|
|
```
|
|
|
|
### Batch Processing
|
|
```bash
|
|
# Copy multiple files
|
|
cp *.pdf /app/watch/
|
|
# All supported files will be queued for processing
|
|
```
|
|
|
|
### Real-time Monitoring
|
|
```bash
|
|
# Watch the logs for processing updates
|
|
docker logs -f readur-container | grep watcher
|
|
``` |