83 lines
3.2 KiB
Markdown
83 lines
3.2 KiB
Markdown
# 📤 Smart File Upload Guide
|
|
|
|
Readur provides an intuitive drag-and-drop file upload system that supports multiple document formats and batch processing.
|
|
|
|
## Supported File Types
|
|
|
|
- **PDF Files** (.pdf) - Direct text extraction and OCR for scanned PDFs
|
|
- **Images** (.png, .jpg, .jpeg, .tiff, .bmp, .webp) - Full OCR text extraction
|
|
- **Text Files** (.txt, .rtf) - Direct text import
|
|
- **Office Documents** (.docx, .doc, .xlsx, .xls, .pptx, .ppt) - Text extraction and OCR
|
|
|
|
## Upload Methods
|
|
|
|
### Drag & Drop
|
|
1. Navigate to the main dashboard
|
|
2. Drag files from your computer directly onto the upload area
|
|
3. Multiple files can be selected and dropped simultaneously
|
|
4. Progress indicators show upload and processing status
|
|
|
|
### Browse & Select
|
|
1. Click the "Upload Documents" button
|
|
2. Use the file browser to select one or multiple files
|
|
3. Click "Open" to begin the upload process
|
|
|
|
## Batch Processing
|
|
|
|
- Upload multiple files at once for efficient processing
|
|
- Each file is processed independently for OCR and text extraction
|
|
- Real-time status updates show processing progress
|
|
- Failed uploads can be retried individually
|
|
|
|
## Processing Pipeline
|
|
|
|
1. **File Validation** - Verify file type and size limits
|
|
2. **Enhanced File Type Detection** (v2.5.4+) - Magic number detection using Rust 'infer' crate
|
|
3. **Storage** - Secure file storage with backup (local or S3)
|
|
4. **OCR Processing** - Automatic text extraction using Tesseract
|
|
5. **Indexing** - Full-text search indexing in PostgreSQL
|
|
6. **Metadata Extraction** - File properties and document information
|
|
|
|
### Enhanced File Type Detection (v2.5.4+)
|
|
|
|
Readur now uses content-based file type detection rather than relying solely on file extensions:
|
|
|
|
- **Magic Number Detection**: Identifies files by their content signature, not just extension
|
|
- **Broader Format Support**: Automatically recognizes more document and image formats
|
|
- **Security Enhancement**: Prevents malicious files with incorrect extensions from being processed
|
|
- **Performance**: Fast, native Rust implementation for minimal overhead
|
|
|
|
**Automatically Detected Formats:**
|
|
- Documents: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP
|
|
- Images: PNG, JPEG, GIF, BMP, TIFF, WebP, HEIC
|
|
- Archives: ZIP, RAR, 7Z, TAR, GZ
|
|
- Text: TXT, MD, CSV, JSON, XML
|
|
|
|
This enhancement ensures files are correctly identified even when extensions are missing or incorrect, improving both reliability and security.
|
|
|
|
## Best Practices
|
|
|
|
- **File Size**: Keep individual files under 50MB for optimal performance
|
|
- **File Names**: Use descriptive names for better organization
|
|
- **Batch Size**: Upload 10-20 files at once for best performance
|
|
- **Network**: Stable internet connection recommended for large uploads
|
|
|
|
## Troubleshooting
|
|
|
|
### Upload Fails
|
|
- Check file size limits
|
|
- Verify file format is supported
|
|
- Ensure stable internet connection
|
|
- Try uploading fewer files at once
|
|
|
|
### OCR Issues
|
|
- Ensure images have good contrast and resolution
|
|
- PDF files may need higher quality scans
|
|
- Check the [OCR Optimization Guide](dev/OCR_OPTIMIZATION_GUIDE.md) for advanced tips
|
|
|
|
## Security
|
|
|
|
- All uploads are scanned for malicious content
|
|
- Files are stored securely with proper access controls
|
|
- User permissions apply to all uploaded documents
|
|
- Automatic backup ensures data safety |