# Per-User Watch Directories Documentation
## Table of Contents
1. [Overview](#overview)
2. [Architecture and Components](#architecture-and-components)
3. [Prerequisites and Requirements](#prerequisites-and-requirements)
4. [Administrator Setup Guide](#administrator-setup-guide)
5. [User Guide](#user-guide)
6. [API Reference](#api-reference)
7. [Configuration Reference](#configuration-reference)
8. [Security Considerations](#security-considerations)
9. [Troubleshooting](#troubleshooting)
10. [Examples and Best Practices](#examples-and-best-practices)
## Overview
The Per-User Watch Directories feature in Readur allows each user to have their own dedicated folder for automatic document ingestion. When enabled, documents placed in a user's watch directory are automatically processed, OCR'd, and associated with that specific user's account.
### Key Benefits
- **User Isolation**: Each user's documents remain private and separate
- **Automatic Attribution**: Documents are automatically assigned to the correct user
- **Simplified Workflow**: Users can drop files into their folder without manual upload
- **Batch Processing**: Process multiple documents simultaneously
- **Integration Support**: Works with network shares, sync tools, and automated workflows
### How It Works
1. Administrator enables per-user watch directories in configuration
2. System creates a dedicated folder for each user (e.g., `/data/user_watch/username/`)
3. Users place documents in their watch folder
4. Readur's file watcher detects new files
5. Documents are automatically ingested and associated with the user
6. OCR processing extracts text for searching
7. Documents appear in the user's library
## Architecture and Components
### System Components
1. **UserWatchService** (`src/services/user_watch_service.rs`)
   - Manages user-specific watch directories
   - Handles directory creation, validation, and cleanup
   - Provides secure path operations
2. **UserWatchManager** (`src/scheduling/user_watch_manager.rs`)
   - Coordinates between file watcher and user management
   - Maps file paths to users
   - Manages user cache for performance
3. **File Watcher** (`src/scheduling/watcher.rs`)
   - Monitors both global and per-user directories
   - Determines file ownership based on directory location
   - Triggers document ingestion pipeline
4. **API Endpoints** (`src/routes/users.rs`)
   - REST API for managing user watch directories
   - Provides status, creation, and deletion operations
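
To illustrate how a file path can be mapped to its owner from the directory layout, here is a minimal Python sketch. It is not Readur's actual implementation (the real logic lives in the Rust sources listed above); the function name `owner_for_path` is hypothetical, and the sketch assumes both paths are already absolute and normalized.

```python
from pathlib import PurePosixPath
from typing import Optional


def owner_for_path(base_dir: str, file_path: str) -> Optional[str]:
    """Return the username implied by file_path, or None if no user owns it.

    Hypothetical helper: the first path component below base_dir is taken
    as the username, mirroring the directory-per-user layout.
    """
    base = PurePosixPath(base_dir)
    path = PurePosixPath(file_path)
    try:
        relative = path.relative_to(base)
    except ValueError:
        return None  # the file is not under the per-user base directory
    # A file must sit inside a user folder: "<username>/<file>" at minimum.
    if len(relative.parts) < 2:
        return None
    return relative.parts[0]
```

Files in nested subfolders (for example `/data/user_watch/bob/2024/invoice.pdf`) still resolve to the top-level folder name, which is the property the per-user layout depends on.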
### Directory Structure
```
user_watch_base_dir/          # Base directory (configurable)
├── alice/                    # User alice's watch directory
│   ├── document1.pdf
│   └── report.docx
├── bob/                      # User bob's watch directory
│   └── invoice.pdf
└── charlie/                  # User charlie's watch directory
    ├── presentation.pptx
    └── notes.txt
```
## Prerequisites and Requirements
### System Requirements
- **Operating System**: Linux, macOS, or Windows with proper file permissions
- **Storage**: Sufficient disk space for user directories and documents
- **File System**: Support for directory permissions (recommended: ext4, NTFS, APFS)
- **Readur Version**: 2.5.4 or later
### Software Requirements
- PostgreSQL database
- Readur server with file watching enabled
- Proper file system permissions for the Readur process
### Network Requirements (Optional)
- Network file system support (NFS, SMB/CIFS) for remote directories
- Stable network connection for remote file access
## Administrator Setup Guide
### Step 1: Enable Per-User Watch Directories
Edit your `.env` file or set environment variables:
```bash
# Enable the feature
ENABLE_PER_USER_WATCH=true
# Set the base directory for user watch folders
USER_WATCH_BASE_DIR=/data/user_watch
# Configure watch interval (optional, default: 60 seconds)
WATCH_INTERVAL_SECONDS=30
# Set file stability check (optional, default: 2000ms)
FILE_STABILITY_CHECK_MS=3000
# Set maximum file age to process (optional, default: 24 hours)
MAX_FILE_AGE_HOURS=48
```
### Step 2: Create Base Directory
Ensure the base directory exists with proper permissions:
```bash
# Create the base directory
sudo mkdir -p /data/user_watch
# Set ownership to the user running Readur
sudo chown readur:readur /data/user_watch
# Set permissions (owner: read/write/execute; group and others: read/execute)
sudo chmod 755 /data/user_watch
```
### Step 3: Configure Directory Permissions
For production environments, configure appropriate permissions:
```bash
# Option 1: Shared group access
sudo groupadd readur-users
sudo usermod -a -G readur-users readur
sudo chgrp -R readur-users /data/user_watch
sudo chmod -R 2775 /data/user_watch # SGID bit ensures new files inherit group
# Option 2: ACL-based permissions (more granular)
sudo setfacl -R -m u:readur:rwx /data/user_watch
sudo setfacl -R -d -m u:readur:rwx /data/user_watch
```
### Step 4: Network Share Setup (Optional)
To allow users to access their watch directories via network shares:
#### SMB/CIFS Share Configuration
```ini
# /etc/samba/smb.conf
[readur-watch]
path = /data/user_watch
valid users = @readur-users
writable = yes
browseable = yes
create mask = 0660
directory mask = 0770
force group = readur-users
```
#### NFS Export Configuration
```bash
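# /etc/exports
# Note: "*" exports to every host, and no_root_squash lets a remote root user
# act as root on this export. Restrict the client list to trusted hosts in production.
/data/user_watch *(rw,sync,no_subtree_check,no_root_squash)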
```
### Step 5: Restart Readur
After configuration, restart the Readur service:
```bash
# Systemd
sudo systemctl restart readur
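# Docker (recreate the container so changed environment variables take effect;
# a plain "restart" reuses the existing container configuration)
docker-compose up -d readur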
# Direct execution
# Stop the current process and start with new configuration
```
### Step 6: Verify Configuration
Check the Readur logs to confirm per-user watch is enabled:
```bash
# Check logs for confirmation (matches both startup lines shown below)
grep -E "Per-user watch enabled|User watch base directory" /var/log/readur/readur.log
# Expected output:
# ✅ Per-user watch enabled: true
# 📂 User watch base directory: /data/user_watch
```
## User Guide
### Accessing Your Watch Directory
#### Method 1: Direct File System Access
If you have direct access to the server:
```bash
# Navigate to your watch directory
cd /data/user_watch/your-username/
# Copy files
cp ~/Documents/*.pdf /data/user_watch/your-username/
# Move files
mv ~/Downloads/report.docx /data/user_watch/your-username/
```
#### Method 2: Network Share Access
Access via SMB/CIFS on Windows:
1. Open File Explorer
2. Type in address bar: `\\server-name\readur-watch\your-username`
3. Drag and drop files into your folder
Access via SMB/CIFS on macOS:
1. Open Finder
2. Press Cmd+K
3. Enter: `smb://server-name/readur-watch/your-username`
4. Drag and drop files into your folder
#### Method 3: Sync Tools
Use synchronization tools for automatic uploads:
```bash
# Using rsync
rsync -avz ~/Documents/*.pdf server:/data/user_watch/your-username/
# Using rclone
rclone copy ~/Documents server:user_watch/your-username/
# Using Syncthing (configure folder sync)
# Add /data/user_watch/your-username as a sync folder
```
### Managing Your Watch Directory via Web Interface
1. **Check Directory Status**
   - Navigate to Settings → Watch Folder
   - View your watch directory path and status
   - See if directory exists and is enabled
2. **Create Your Directory**
   - Click "Create Watch Directory" button
   - System will create your personal folder
   - Confirmation message will appear
3. **View Directory Path**
   - Your directory path is displayed
   - Copy path for reference
   - Share with IT for network access setup
### Supported File Types
Place any of these file types in your watch directory:
- **Documents**: PDF, TXT, DOC, DOCX, ODT, RTF
- **Images**: PNG, JPG, JPEG, TIFF, BMP
- **Presentations**: PPT, PPTX, ODP
- **Spreadsheets**: XLS, XLSX, ODS
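
If you script uploads, it can help to filter files before copying them into the watch directory. The following is a minimal sketch that checks a filename against the extensions listed above; the helper name `is_supported` is hypothetical, and the server's effective list may differ depending on its configuration.

```python
from pathlib import PurePath

# Extensions taken from the list above; the server-side list is
# configuration-dependent, so treat this set as an assumption.
SUPPORTED_EXTENSIONS = frozenset({
    "pdf", "txt", "doc", "docx", "odt", "rtf",   # documents
    "png", "jpg", "jpeg", "tiff", "bmp",         # images
    "ppt", "pptx", "odp",                        # presentations
    "xls", "xlsx", "ods",                        # spreadsheets
})


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is in the supported set."""
    suffix = PurePath(filename).suffix  # includes the leading dot, or "" if none
    return suffix[1:].lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `REPORT.PDF` and `report.pdf` are treated the same way.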
### File Processing Workflow
1. **File Detection**: System scans for new files every `WATCH_INTERVAL_SECONDS` (default: 60 seconds)
2. **Stability Check**: Waits until the file size stops changing for `FILE_STABILITY_CHECK_MS` (default: 2000 ms)
3. **Validation**: Verifies file type and size
4. **Ingestion**: Creates document record in database
5. **OCR Queue**: Adds to processing queue
6. **Text Extraction**: OCR processes the document
7. **Search Index**: Document becomes searchable
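
The stability check in step 2 exists so that a file still being copied is not ingested half-written. Below is a rough Python sketch of one way such a check can work, by comparing file sizes across a wait interval. It is an illustration under assumptions, not Readur's code: the function name `wait_until_stable` and the `timeout_s` parameter are invented for this example.

```python
import os
import time


def wait_until_stable(path: str, stability_ms: int = 2000, timeout_s: float = 60.0) -> bool:
    """Return True once the file size is unchanged across one stability window.

    Returns False if the file keeps changing until timeout_s elapses.
    """
    deadline = time.monotonic() + timeout_s
    last_size = os.path.getsize(path)
    while time.monotonic() < deadline:
        time.sleep(stability_ms / 1000)
        size = os.path.getsize(path)
        if size == last_size:
            return True  # no growth during the window: assume the write finished
        last_size = size
    return False
```

A size-based check cannot detect a writer that pauses longer than the window, which is why slow network copies may need a larger `FILE_STABILITY_CHECK_MS`.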
### Best Practices for Users
1. **File Naming**: Use descriptive names for easier identification
2. **File Size**: Keep files under 50MB for optimal processing
3. **Batch Upload**: You can place multiple files in the folder at once
4. **Organization**: Create subfolders within your watch directory
5. **Patience**: Allow 1-5 minutes for processing depending on file size
## API Reference
### Get User Watch Directory Information
Retrieve information about a user's watch directory.
**Endpoint**: `GET /api/users/{user_id}/watch-directory`
**Headers**:
```http
Authorization: Bearer {jwt_token}
```
**Response** (200 OK):
```json
{
  "user_id": "550e8400-e29b-41d4-a716-446655440000",
  "username": "alice",
  "watch_directory_path": "/data/user_watch/alice",
  "exists": true,
  "enabled": true
}
```
**Error Responses**:
- `401 Unauthorized`: Missing or invalid authentication
- `403 Forbidden`: Insufficient permissions
- `404 Not Found`: User not found
- `500 Internal Server Error`: Per-user watch disabled
### Create User Watch Directory
Create or ensure a user's watch directory exists.
**Endpoint**: `POST /api/users/{user_id}/watch-directory`
**Headers**:
```http
Authorization: Bearer {jwt_token}
Content-Type: application/json
```
**Request Body**:
```json
{
  "ensure_created": true
}
```
**Response** (200 OK):
```json
{
  "success": true,
  "message": "Watch directory ready for user 'alice'",
  "watch_directory_path": "/data/user_watch/alice"
}
```
**Error Responses**:
- `401 Unauthorized`: Missing or invalid authentication
- `403 Forbidden`: Insufficient permissions
- `404 Not Found`: User not found
- `500 Internal Server Error`: Creation failed or feature disabled
### Delete User Watch Directory
Remove a user's watch directory and its contents.
**Endpoint**: `DELETE /api/users/{user_id}/watch-directory`
**Headers**:
```http
Authorization: Bearer {jwt_token}
```
**Note**: Only administrators can delete watch directories.
**Response** (200 OK):
```json
{
  "success": true,
  "message": "Watch directory removed for user 'alice'",
  "watch_directory_path": null
}
```
**Error Responses**:
- `401 Unauthorized`: Missing or invalid authentication
- `403 Forbidden`: Admin access required
- `404 Not Found`: User not found
- `500 Internal Server Error`: Deletion failed
### API Usage Examples
#### Python Example
```python
import requests

# Configuration
base_url = "https://readur.example.com/api"
token = "your-jwt-token"
user_id = "550e8400-e29b-41d4-a716-446655440000"
headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json",
}

# Get watch directory info
response = requests.get(
    f"{base_url}/users/{user_id}/watch-directory",
    headers=headers,
)
response.raise_for_status()  # Fail early on 4xx/5xx instead of parsing an error body
info = response.json()
print(f"Watch directory: {info['watch_directory_path']}")
print(f"Exists: {info['exists']}")

# Create watch directory
response = requests.post(
    f"{base_url}/users/{user_id}/watch-directory",
    headers=headers,
    json={"ensure_created": True},
)
response.raise_for_status()
result = response.json()
if result["success"]:
    print(f"Created: {result['watch_directory_path']}")
```
#### JavaScript/TypeScript Example
```typescript
// Using the provided API service
import { userWatchService } from './services/api';

// Get watch directory information
const getWatchInfo = async (userId: string) => {
  try {
    const response = await userWatchService.getUserWatchDirectory(userId);
    console.log('Watch directory:', response.data.watch_directory_path);
    console.log('Exists:', response.data.exists);
    return response.data;
  } catch (error) {
    console.error('Failed to get watch directory info:', error);
  }
};

// Create watch directory
const createWatchDirectory = async (userId: string) => {
  try {
    const response = await userWatchService.createUserWatchDirectory(userId);
    if (response.data.success) {
      console.log('Created:', response.data.watch_directory_path);
    }
    return response.data;
  } catch (error) {
    console.error('Failed to create watch directory:', error);
  }
};
```
#### cURL Examples
```bash
# Get watch directory information
curl -X GET "https://readur.example.com/api/users/${USER_ID}/watch-directory" \
  -H "Authorization: Bearer ${TOKEN}"

# Create watch directory
curl -X POST "https://readur.example.com/api/users/${USER_ID}/watch-directory" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"ensure_created": true}'

# Delete watch directory (admin only)
curl -X DELETE "https://readur.example.com/api/users/${USER_ID}/watch-directory" \
  -H "Authorization: Bearer ${TOKEN}"
```
## Configuration Reference
### Environment Variables
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `ENABLE_PER_USER_WATCH` | Boolean | `false` | Enable/disable per-user watch directories |
| `USER_WATCH_BASE_DIR` | String | `./user_watch` | Base directory for all user watch folders |
| `WATCH_INTERVAL_SECONDS` | Integer | `60` | How often to scan for new files (seconds) |
| `FILE_STABILITY_CHECK_MS` | Integer | `2000` | Time to wait for file size stability (milliseconds) |
| `MAX_FILE_AGE_HOURS` | Integer | `24` | Maximum age of files to process (hours) |
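
For tooling that needs to mirror these settings (for example, a deployment check script), the table can be expressed as a small loader. This is a sketch under assumptions: the `WatchConfig` class and `load_watch_config` function are hypothetical, and treating only the literal string `true` as enabled is a guess about how the boolean is parsed.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WatchConfig:
    enabled: bool
    base_dir: str
    interval_seconds: int
    stability_check_ms: int
    max_file_age_hours: int


def load_watch_config(env: Mapping[str, str] = os.environ) -> WatchConfig:
    """Read the per-user watch settings, falling back to the documented defaults."""
    return WatchConfig(
        # Assumption: only "true" (any case) enables the feature.
        enabled=env.get("ENABLE_PER_USER_WATCH", "false").strip().lower() == "true",
        base_dir=env.get("USER_WATCH_BASE_DIR", "./user_watch"),
        interval_seconds=int(env.get("WATCH_INTERVAL_SECONDS", "60")),
        stability_check_ms=int(env.get("FILE_STABILITY_CHECK_MS", "2000")),
        max_file_age_hours=int(env.get("MAX_FILE_AGE_HOURS", "24")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.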
### Configuration Validation
The system performs several validation checks:
1. **Path Validation**: Ensures paths are distinct and non-overlapping
2. **Directory Conflicts**: Prevents `USER_WATCH_BASE_DIR` from being:
   - The same as `UPLOAD_PATH`
   - The same as `WATCH_FOLDER`
   - Inside `UPLOAD_PATH`
   - Containing `UPLOAD_PATH`
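
The conflict rules above amount to a few path comparisons. The sketch below shows one way to express them in Python; it is illustrative only, the function name `validate_user_watch_base` and the message strings are invented, and it assumes all three paths are absolute and already normalized.

```python
from pathlib import PurePosixPath
from typing import List


def validate_user_watch_base(user_watch_base: str, upload_path: str, watch_folder: str) -> List[str]:
    """Return a list of conflict descriptions; an empty list means no conflict."""
    base = PurePosixPath(user_watch_base)
    upload = PurePosixPath(upload_path)
    watch = PurePosixPath(watch_folder)
    problems: List[str] = []

    if base == upload:
        problems.append("USER_WATCH_BASE_DIR is the same as UPLOAD_PATH")
    elif upload in base.parents:
        problems.append("USER_WATCH_BASE_DIR is inside UPLOAD_PATH")
    elif base in upload.parents:
        problems.append("USER_WATCH_BASE_DIR contains UPLOAD_PATH")

    if base == watch:
        problems.append("USER_WATCH_BASE_DIR is the same as WATCH_FOLDER")
    return problems
```

Comparing path components rather than string prefixes avoids false positives such as `/data/uploads2` being treated as inside `/data/uploads`.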
### Docker Configuration
When using Docker, mount the user watch directory:
```yaml
version: '3.8'
services:
  readur:
    image: readur:latest
    environment:
      - ENABLE_PER_USER_WATCH=true
      - USER_WATCH_BASE_DIR=/app/user_watch
      - WATCH_INTERVAL_SECONDS=30
    volumes:
      - ./user_watch:/app/user_watch
      - ./uploads:/app/uploads
      - ./watch:/app/watch
    ports:
      - "8000:8000"
```
### Kubernetes Configuration
For Kubernetes deployments:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: readur-config
data:
  ENABLE_PER_USER_WATCH: "true"
  USER_WATCH_BASE_DIR: "/data/user_watch"
  WATCH_INTERVAL_SECONDS: "30"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: readur
spec:
  # apps/v1 Deployments require a selector that matches the pod template labels;
  # the "app: readur" label is an example value.
  selector:
    matchLabels:
      app: readur
  template:
    metadata:
      labels:
        app: readur
    spec:
      containers:
        - name: readur
          image: readur:latest
          envFrom:
            - configMapRef:
                name: readur-config
          volumeMounts:
            - name: user-watch
              mountPath: /data/user_watch
      volumes:
        - name: user-watch
          persistentVolumeClaim:
            claimName: readur-user-watch-pvc
```
## Security Considerations
### Username Validation
The system enforces strict username validation to prevent security issues:
- **Length**: 1-64 characters
- **Allowed Characters**: Alphanumeric, underscore (_), dash (-)
- **Prohibited Patterns**:
  - Path traversal attempts (.., /)
  - Hidden directories (starting with .)
  - Null bytes or special characters
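
These rules can be captured with a single allow-list pattern, since anything outside the permitted character set (dots, slashes, null bytes) is rejected automatically. The following Python sketch is an illustration, not Readur's validator; the name `is_valid_username` is hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

# 1-64 characters from: ASCII letters, digits, underscore, dash.
_USERNAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_username(name: str) -> bool:
    """Return True if name satisfies the length and character rules above."""
    # fullmatch requires the entire string to match, so a trailing newline
    # or any other stray character causes rejection.
    return _USERNAME_RE.fullmatch(name) is not None
```

An allow-list is generally safer than trying to enumerate dangerous patterns, because inputs such as `..`, `.hidden`, or `a/b` fail without needing dedicated checks.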
### Directory Permissions
1. **User Isolation**: Each user's directory is separate
2. **Permission Model**: 755 (owner: rwx, group: r-x, others: r-x)
3. **Ownership**: Readur process owns all directories
4. **SGID Bit**: Optional for group inheritance
### Path Security
- **Canonicalization**: All paths are canonicalized to prevent traversal
- **Boundary Checking**: Files must be within designated directories
- **Validation**: Extracted usernames are validated before use
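
Canonicalization followed by a boundary check can be sketched in a few lines of Python. This is an illustration of the general technique rather than Readur's code; the function name `is_within_base` is hypothetical.

```python
from pathlib import Path


def is_within_base(base_dir: str, candidate: str) -> bool:
    """Return True if candidate resolves to a location strictly inside base_dir."""
    # resolve() makes the path absolute, collapses ".." segments, and follows
    # symlinks, so the comparison is made on canonical paths.
    base = Path(base_dir).resolve()
    target = Path(candidate).resolve()
    return base in target.parents
```

Because the check runs on the resolved path, an input such as `/data/user_watch/alice/../../etc/passwd` is rejected even though it textually starts with the base directory.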
### Access Control
- **API Protection**: JWT authentication required
- **Permission Levels**:
  - Users: Can only access their own directory
  - Admins: Can manage all directories
- **Directory Creation**: Users can create their own, admins can create any
- **Directory Deletion**: Admin-only operation
### Audit Considerations
1. **Logging**: All directory operations are logged
2. **File Attribution**: Documents tracked to source user
3. **Access Tracking**: API access logged with user context
## Troubleshooting
### Common Issues and Solutions
#### Issue: Per-user watch directories not working
**Symptoms**: Files in user directories are not processed
**Solutions**:
1. Verify the feature is enabled:
   ```bash
   grep ENABLE_PER_USER_WATCH .env
   # Should show: ENABLE_PER_USER_WATCH=true
   ```
2. Check that the base directory exists and has the correct ownership and permissions:
   ```bash
   ls -la /data/user_watch
   # Should show readur as owner with 755 permissions
   ```
3. Review the application logs for watch-directory errors:
   ```bash
   grep -i "user watch" /var/log/readur/readur.log
   ```
#### Issue: "User watch service not initialized" error
**Symptoms**: API returns 500 error when accessing watch directories
**Solutions**:
1. Ensure ENABLE_PER_USER_WATCH=true in configuration
2. Restart Readur service
3. Check initialization logs for errors
#### Issue: Files not being detected
**Symptoms**: Files placed in watch directory are not processed
**Solutions**:
1. Check file permissions:
   ```bash
   ls -la /data/user_watch/username/
   # Files should be readable by readur user
   ```
2. Verify file type is supported:
   ```bash
   echo $ALLOWED_FILE_TYPES
   # Ensure your file extension is included
   ```
3. Check file age restriction:
   ```bash
   # Files older than MAX_FILE_AGE_HOURS are ignored.
   # List files older than 24 hours (1440 minutes), the default limit:
   find /data/user_watch -type f -mmin +1440
   ```
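
The age restriction in step 3 is a simple comparison between a file's age and `MAX_FILE_AGE_HOURS`. The Python sketch below shows the arithmetic; it assumes age is measured from the file's modification time, which is a guess about the watcher's behavior, and the function name `is_too_old` is hypothetical.

```python
import time
from typing import Optional


def is_too_old(mtime: float, max_file_age_hours: int = 24, now: Optional[float] = None) -> bool:
    """Return True if a file modified at mtime exceeds the age limit.

    mtime and now are Unix timestamps in seconds; now defaults to the current time.
    """
    current = time.time() if now is None else now
    return (current - mtime) > max_file_age_hours * 3600
```

For an existing file, `os.path.getmtime(path)` supplies the `mtime` argument. Copy tools that preserve timestamps (such as `rsync -a`) can therefore cause old files to be skipped even though they were just placed in the folder.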
#### Issue: Permission denied errors
**Symptoms**: Users cannot write to their watch directories
**Solutions**:
1. Fix directory ownership:
   ```bash
   sudo chown -R readur:readur /data/user_watch
   ```
2. Set correct permissions:
   ```bash
   sudo chmod -R 755 /data/user_watch
   ```
3. For shared access, use group permissions:
   ```bash
   sudo chmod -R 775 /data/user_watch
   sudo chgrp -R readur-users /data/user_watch
   ```
#### Issue: Duplicate documents created
**Symptoms**: Same file creates multiple documents
**Solutions**:
1. Ensure file stability check is adequate:
   ```bash
   # Increase if files are still being written
   FILE_STABILITY_CHECK_MS=5000
   ```
2. Check for file system issues (timestamps, inode changes)
3. Review deduplication settings in configuration
### Diagnostic Commands
```bash
# Check if user watch is enabled
curl -H "Authorization: Bearer $TOKEN" \
https://readur.example.com/api/users/$USER_ID/watch-directory
# List all user directories
ls -la /data/user_watch/
# Check file watcher logs
journalctl -u readur | grep -i "watch"
# Monitor file processing in real-time
tail -f /var/log/readur/readur.log | grep -E "(Processing new file|watch)"
# Check directory permissions
namei -l /data/user_watch/username/
# Find recently modified files
find /data/user_watch -type f -mmin -60
# Check disk space
df -h /data/user_watch
```
## Examples and Best Practices
### Example 1: Small Team Setup
For a team of 5-10 users with local file access:
```bash
# .env configuration
ENABLE_PER_USER_WATCH=true
USER_WATCH_BASE_DIR=/srv/readur/user_watches
WATCH_INTERVAL_SECONDS=60
FILE_STABILITY_CHECK_MS=2000
MAX_FILE_AGE_HOURS=72
# Directory structure
/srv/readur/user_watches/
├── alice/
├── bob/
├── charlie/
├── diana/
└── edward/
```
### Example 2: Enterprise Network Share Integration
For larger organizations with network shares:
```bash
# Mount network share
sudo mount -t cifs //fileserver/readur /mnt/readur \
  -o username=readur,domain=COMPANY
# .env configuration
ENABLE_PER_USER_WATCH=true
USER_WATCH_BASE_DIR=/mnt/readur/user_watches
# Slower scan interval for network storage
WATCH_INTERVAL_SECONDS=120
# Longer stability window to absorb network delays
FILE_STABILITY_CHECK_MS=5000
```
### Example 3: Automated Document Workflow
Script for automatic document routing:
```python
#!/usr/bin/env python3
"""
Auto-route documents to user watch directories based on metadata
"""
import os
import shutil
from pathlib import Path


def route_document(file_path, user_mapping):
    """Route document to appropriate user watch directory"""
    # Extract metadata (example: from filename)
    filename = os.path.basename(file_path)

    # Determine target user (implement your logic)
    if "invoice" in filename.lower():
        target_user = "accounting"
    elif "report" in filename.lower():
        target_user = "management"
    else:
        target_user = "general"

    # Move to user's watch directory
    user_watch_dir = Path(f"/data/user_watch/{target_user}")
    if user_watch_dir.exists():
        dest = user_watch_dir / filename
        shutil.move(file_path, dest)
        print(f"Moved {filename} to {target_user}'s watch directory")
    else:
        print(f"User {target_user} watch directory does not exist")


# Monitor incoming directory
incoming_dir = Path("/srv/incoming")
for file_path in incoming_dir.glob("*.pdf"):
    route_document(file_path, user_mapping={})
```
### Example 4: Bulk User Setup
PowerShell script for creating multiple user directories:
```powershell
# bulk-create-watch-dirs.ps1
$baseUrl = "https://readur.example.com/api"
$adminToken = "your-admin-token"
$users = @("alice", "bob", "charlie", "diana", "edward")

foreach ($username in $users) {
    # Get user ID
    $userResponse = Invoke-RestMethod `
        -Uri "$baseUrl/users" `
        -Headers @{Authorization="Bearer $adminToken"}
    $user = $userResponse | Where-Object {$_.username -eq $username}

    if ($user) {
        # Create watch directory
        $body = @{ensure_created=$true} | ConvertTo-Json
        $result = Invoke-RestMethod `
            -Method Post `
            -Uri "$baseUrl/users/$($user.id)/watch-directory" `
            -Headers @{
                Authorization="Bearer $adminToken"
                "Content-Type"="application/json"
            } `
            -Body $body
        Write-Host "Created watch directory for $username at $($result.watch_directory_path)"
    }
}
```
### Best Practices Summary
#### For Administrators
1. **Capacity Planning**: Allocate 1-5GB per user for watch directories
2. **Backup Strategy**: Include user watch directories in backup plans
3. **Monitoring**: Set up alerts for disk space and processing failures
4. **Documentation**: Maintain user guide with network paths
5. **Testing**: Test with various file types and sizes before deployment
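
The capacity guideline in item 1 translates into a simple range estimate. The sketch below only multiplies the per-user figures by the number of users; the function name `estimate_watch_storage_gb` is hypothetical, and real usage depends on how quickly processed files are cleaned up.

```python
from typing import Tuple


def estimate_watch_storage_gb(user_count: int, low_gb: float = 1, high_gb: float = 5) -> Tuple[float, float]:
    """Return the (minimum, maximum) storage estimate in GB for user_count users."""
    return (user_count * low_gb, user_count * high_gb)
```

For a team of 10 users this gives a range of 10 to 50 GB for the watch directories alone, excluding the upload store and database.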
#### For Users
1. **File Organization**: Use meaningful filenames and folder structure
2. **File Formats**: Prefer PDF for best OCR results
3. **Batch Processing**: Group related documents for upload
4. **Size Limits**: Split large documents if over 50MB
5. **Patience**: Allow processing time before expecting search results
#### For Developers
1. **API Integration**: Use provided client libraries when available
2. **Error Handling**: Implement retry logic for transient failures
3. **Validation**: Validate file types before placing in watch directories
4. **Monitoring**: Track processing status via WebSocket updates
5. **Caching**: Cache user directory paths to reduce API calls
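
As an illustration of the retry advice in item 2, here is a minimal exponential-backoff helper in Python. It is a generic sketch, not part of any Readur client library; the name `retry` and its parameters are invented for this example, and which exceptions count as transient is an assumption you should adjust for your HTTP client.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call operation, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay_s, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay_s * (2 ** attempt))
    raise AssertionError("unreachable")
```

Wrap an API call in a zero-argument function or lambda and pass it as `operation`. Errors such as `403 Forbidden` or `404 Not Found` are not transient and should not be retried.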
### Performance Optimization
1. **File System**: Use SSD storage for watch directories
2. **Network**: Minimize latency for network-mounted directories
3. **Scheduling**: Adjust watch interval based on usage patterns
4. **Concurrency**: Configure OCR workers based on CPU cores
5. **Cleanup**: Implement retention policies for processed files
### Migration from Global Watch Directory
To migrate from a single global watch directory to per-user directories:
1. **Preparation**:
   ```bash
   # Backup existing watch directory
   tar -czf watch_backup.tar.gz /data/watch/
   ```
2. **Enable Feature**:
   ```bash
   # Update configuration
   ENABLE_PER_USER_WATCH=true
   USER_WATCH_BASE_DIR=/data/user_watch
   ```
3. **Create User Directories**:
   ```bash
   # Script to create directories for existing users
   for user in $(psql -d readur -c "SELECT username FROM users" -t); do
     mkdir -p "/data/user_watch/$user"
     chown readur:readur "/data/user_watch/$user"
   done
   ```
4. **Migrate Documents** (optional):
   - Keep existing documents in place
   - Or reassign to appropriate users through the UI
5. **Update Documentation**:
   - Notify users of new directory locations
   - Update any automation scripts
   - Revise backup procedures