Readur/docs/office-document-support.md

239 lines
6.5 KiB
Markdown

# Office Document Support
Readur provides comprehensive support for extracting text from Microsoft Office documents, enabling full-text search and content analysis across your document library.
## Supported Formats
### Modern Office Formats (Native Support)
These formats are fully supported without any additional dependencies:
- **DOCX** - Word documents (Office 2007+)
- Full text extraction from document body
- Section and paragraph structure preservation
- Header and footer content extraction
- **XLSX** - Excel spreadsheets (Office 2007+)
- Text extraction from all worksheets
- Cell content with proper formatting
- Sheet names and structure preservation
### Legacy Office Formats (External Tools Required)
These older formats require external tools for text extraction:
- **DOC** - Legacy Word documents (Office 97-2003)
- Requires `antiword`, `catdoc`, or `wvText`
- Binary format parsing via external tools
- **XLS** - Legacy Excel spreadsheets (Office 97-2003)
- Currently returns an error suggesting conversion to XLSX
## Installation
### Docker Installation
The official Docker image includes all necessary dependencies:
```bash
docker pull readur/readur:latest
```
The Docker image includes `antiword` and `catdoc` pre-installed for legacy DOC support.
### Manual Installation
#### For Modern Formats (DOCX, XLSX)
No additional dependencies required - these formats are parsed using built-in XML processing.
#### For Legacy DOC Files
Install one of the following tools:
**Ubuntu/Debian:**
```bash
# Option 1: antiword (recommended, lightweight)
sudo apt-get install antiword
# Option 2: catdoc (good alternative)
sudo apt-get install catdoc
# Option 3: wv (includes wvText)
sudo apt-get install wv
```
**macOS:**
```bash
# Option 1: antiword
brew install antiword
# Option 2: catdoc
brew install catdoc
# Option 3: wv
brew install wv
```
**Alpine Linux:**
```bash
# Option 1: antiword
apk add antiword
# Option 2: catdoc
apk add catdoc
```
## How It Works
### Modern Office Format Processing (DOCX/XLSX)
1. **ZIP Extraction**: Modern Office files are ZIP archives containing XML files
2. **XML Parsing**: Secure XML parser extracts text content
3. **Content Assembly**: Text from different document parts is assembled
4. **Cleaning**: Excessive whitespace and formatting artifacts are removed
### Legacy DOC Processing
1. **Tool Detection**: System checks for available tools (antiword, catdoc, wvText)
2. **External Processing**: Selected tool converts DOC to plain text
3. **Security Validation**: File paths are validated to prevent injection attacks
4. **Timeout Protection**: 30-second timeout prevents hanging processes
5. **Text Cleaning**: Output is sanitized and normalized
## Configuration
### Timeout Settings
Office document extraction timeout can be configured in user settings:
- **Default**: 120 seconds
- **Range**: 1-600 seconds
- **Applies to**: DOCX and XLSX processing
### Error Handling
When processing fails, Readur provides helpful error messages:
- **Missing Tools**: Instructions for installing required tools
- **File Too Large**: Suggestions for file size reduction
- **Corrupted Files**: Guidance on file repair options
- **Unsupported Formats**: Conversion recommendations
## Security Features
### Built-in Protections
1. **ZIP Bomb Protection**: Limits decompressed size to prevent resource exhaustion
2. **Path Validation**: Prevents directory traversal and injection attacks
3. **XML Security**: Entity expansion and external entity attacks prevented
4. **Process Isolation**: External tools run with limited permissions
5. **Timeout Enforcement**: Prevents infinite processing loops
### File Size Limits
- **Maximum Office Document Size**: 50MB
- **Maximum Decompressed Size**: 500MB (ZIP bomb protection)
- **Compression Ratio Limit**: 100:1
## Performance Considerations
### Processing Speed
Typical extraction times:
- **DOCX (1-10 pages)**: 50-200ms
- **DOCX (100+ pages)**: 500-2000ms
- **XLSX (small)**: 100-300ms
- **XLSX (large)**: 1000-5000ms
- **DOC (via antiword)**: 100-500ms
### Resource Usage
- **Memory**: ~10-50MB per document during processing
- **CPU**: Single-threaded extraction, minimal impact
- **Disk**: Temporary files cleaned automatically
## Troubleshooting
### Common Issues
#### "No DOC extraction tools available"
**Solution**: Install antiword or catdoc as described above.
#### "Document processing timed out"
**Possible causes**:
- Very large or complex document
- Corrupted file structure
- System resource constraints
**Solutions**:
1. Increase timeout in settings
2. Convert to PDF format
3. Split large documents
#### "Document format not supported"
**Affected formats**: PPT, PPTX, and other Office formats
**Solution**: Convert to supported format (PDF, DOCX, TXT)
### Verification
To verify Office document support:
```bash
# Check for DOC support
which antiword || which catdoc || echo "No DOC tools installed"
# Test extraction (Docker)
docker exec readur-container antiword -v
# Test extraction (Manual)
antiword test.doc
```
## Best Practices
1. **Prefer Modern Formats**: Use DOCX over DOC when possible
2. **Convert Legacy Files**: Batch convert DOC to DOCX for better performance
3. **Monitor File Sizes**: Large Office files may need splitting
4. **Regular Updates**: Keep external tools updated for security
5. **Test Extraction**: Verify text extraction quality after setup
## Migration from DOC to DOCX
For better performance and reliability, consider converting legacy DOC files:
### Using LibreOffice (Batch Conversion)
```bash
libreoffice --headless --convert-to docx *.doc
```
### Using Microsoft Word (Windows)
PowerShell script for batch conversion available in `/scripts/convert-doc-to-docx.ps1`
## API Usage
### Upload Office Document
```bash
curl -X POST http://localhost:8000/api/documents/upload \
-H "Authorization: Bearer YOUR_TOKEN" \
-F "file=@document.docx"
```
### Check Processing Status
```bash
curl http://localhost:8000/api/documents/{id}/status \
-H "Authorization: Bearer YOUR_TOKEN"
```
## Future Enhancements
Planned improvements for Office document support:
- [ ] Native DOC parsing (without external tools)
- [ ] PowerPoint (PPTX/PPT) support
- [ ] Table structure preservation
- [ ] Embedded image extraction
- [ ] Style and formatting metadata
- [ ] Track changes and comments extraction
## Related Documentation
- [File Upload Guide](./file-upload-guide.md)
- [OCR Optimization Guide](./dev/OCR_OPTIMIZATION_GUIDE.md)
- [Advanced Search](./advanced-search.md)
- [Configuration Reference](./configuration-reference.md)