feat(docs): add docs about multiple ocr languages

perf3ct 2025-07-21 23:34:57 +00:00
parent 98c3bd50ef
commit 2e6c1ef238
2 changed files with 249 additions and 0 deletions


@@ -54,6 +54,7 @@ open http://localhost:8000
- [👥 User Management](docs/user-management-guide.md) - Authentication, roles, and administration
- [🏷️ Labels & Organization](docs/labels-and-organization.md) - Document tagging and categorization
- [🔎 Advanced Search](docs/advanced-search.md) - Search modes, syntax, and optimization
- [🌍 Multi-Language OCR Guide](docs/multi-language-ocr-guide.md) - Process documents in multiple languages simultaneously
- [🔐 OIDC Setup](docs/oidc-setup.md) - Single Sign-On integration
### Deployment & Operations
@@ -69,6 +70,7 @@ open http://localhost:8000
- [🔍 OCR Optimization](docs/dev/OCR_OPTIMIZATION_GUIDE.md) - Improve OCR performance
- [🗄️ Database Best Practices](docs/dev/DATABASE_GUARDRAILS.md) - Concurrency and safety
- [📊 Queue Architecture](docs/dev/QUEUE_IMPROVEMENTS.md) - Background job processing
- [⚠️ Error System Guide](docs/dev/ERROR_SYSTEM.md) - Comprehensive error handling architecture
## 🏗️ Architecture


@@ -0,0 +1,247 @@
# Multi-Language OCR Guide
Readur can run OCR with several language models at once, which improves text extraction accuracy for documents that contain more than one language.
## 🌍 Overview
The multi-language OCR system allows you to:
- **Process documents in up to 4 languages simultaneously** for best results
- **Set preferred languages** that apply to all your document uploads
- **Retry failed OCR** with different language combinations
- **Automatically optimize** text extraction by using multiple language models
## 🚀 Getting Started
### Setting Your Language Preferences
1. **Navigate to Settings** in your account
2. **Select the OCR Languages** section
3. **Choose up to 4 preferred languages** - these will be used for all new uploads
4. **Set a primary language** - this language gets processing priority
5. **Save your preferences**
**Example preferred language setup:**
- Primary: English (`eng`)
- Additional: Spanish (`spa`), French (`fra`)
- Result: Documents are processed with English as the priority language, with Spanish and French also recognized
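The two rules in this setup (at most 4 languages, one of them primary) can be checked locally before you save anything. The following is a minimal sketch and not part of Readur: `order_languages` is a hypothetical helper name, and the requirement that the primary language also appears in the selection is an assumption based on the example above.

```bash
# Hypothetical helper (not part of Readur): validate a language selection
# and print it with the primary language first.
# Usage: order_languages PRIMARY LANG [LANG...]
order_languages() {
  local primary found rest lang
  # PRIMARY plus 1 to 4 selected languages (the documented limit is 4).
  if [ "$#" -lt 2 ] || [ "$#" -gt 5 ]; then
    echo "error: choose between 1 and 4 languages" >&2
    return 1
  fi
  primary=$1
  shift
  found=no
  rest=""
  for lang in "$@"; do
    if [ "$lang" = "$primary" ]; then
      found=yes            # assumption: the primary must be in the selection
    else
      rest="$rest $lang"
    fi
  done
  if [ "$found" = no ]; then
    echo "error: primary language must be one of the selected languages" >&2
    return 1
  fi
  # Primary first, then the remaining languages in their original order.
  echo "$primary$rest"
}
```

For example, `order_languages eng spa eng fra` prints `eng spa fra`, while a selection of five languages, or a primary that is not in the selection, returns a non-zero status.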
### Language Selection During Upload
When uploading documents, you can:
1. **Use your default preferences** - no action needed
2. **Override for specific documents:**
   - Click the language selector in the upload area
   - Choose different languages for this upload session
   - These languages will be applied to all files in the current upload
## 📋 Available Languages
Readur supports 67+ languages including:
### Major World Languages
- **English** (`eng`) - Default and most reliable
- **Spanish** (`spa`) - Excellent accuracy
- **French** (`fra`) - High quality results
- **German** (`deu`) - Strong performance
- **Italian** (`ita`) - Good accuracy
- **Portuguese** (`por`) - Reliable processing
- **Russian** (`rus`) - Solid results
### Asian Languages
- **Chinese Simplified** (`chi_sim`)
- **Chinese Traditional** (`chi_tra`)
- **Japanese** (`jpn`)
- **Korean** (`kor`)
- **Hindi** (`hin`)
- **Thai** (`tha`)
- **Vietnamese** (`vie`)
### European Languages
- **Dutch** (`nld`)
- **Swedish** (`swe`)
- **Norwegian** (`nor`)
- **Danish** (`dan`)
- **Finnish** (`fin`)
- **Polish** (`pol`)
- **Czech** (`ces`)
### And Many More
Including Arabic (`ara`), Hebrew (`heb`), Turkish (`tur`), and dozens of other languages.
> **Tip:** For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: `GET /api/ocr/languages`
## 🛠️ Using the API
### Get Available Languages
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://your-readur-instance.com/api/ocr/languages
```
**Response:**
```json
{
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa",
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
}
```
### Update Language Preferences
```bash
curl -X PUT \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "preferred_languages": ["eng", "spa", "fra"],
    "primary_language": "eng"
  }' \
  https://your-readur-instance.com/api/settings
```
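If you script this request, it can help to build the JSON body from a list of codes rather than editing it by hand. The sketch below is one way to do that in plain shell. `preferences_payload` is a hypothetical helper name, and it assumes the codes are plain ASCII language codes, so no JSON escaping is performed.

```bash
# Hypothetical helper (not part of Readur): print the JSON body used by the
# PUT /api/settings example above.
# Usage: preferences_payload PRIMARY LANG [LANG...]
preferences_payload() {
  local primary list lang
  [ "$#" -ge 2 ] || { echo "usage: preferences_payload PRIMARY LANG [LANG...]" >&2; return 1; }
  primary=$1
  shift
  list=""
  for lang in "$@"; do
    # Append each code as a quoted JSON string, separated by ", ".
    list="${list:+$list, }\"$lang\""
  done
  printf '{"preferred_languages": [%s], "primary_language": "%s"}\n' "$list" "$primary"
}

# Possible use with the request shown above:
#   curl -X PUT -H "Authorization: Bearer YOUR_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(preferences_payload eng eng spa fra)" \
#     https://your-readur-instance.com/api/settings
```

Calling `preferences_payload eng eng spa fra` prints `{"preferred_languages": ["eng", "spa", "fra"], "primary_language": "eng"}`.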
### Retry OCR with Different Languages
```bash
curl -X POST \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "languages": ["eng", "deu"]
  }' \
  https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
```
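When retrying many documents, a small wrapper with a dry-run mode lets you inspect each request before sending it. This is only a sketch: `retry_ocr` is a hypothetical function, and `READUR_URL`, `READUR_TOKEN`, and `CURL` are environment variable names chosen for this example, not settings that Readur defines.

```bash
# Hypothetical wrapper (not part of Readur) around the retry request above.
# Assumed environment variables: READUR_URL (base URL) and READUR_TOKEN.
# Usage: retry_ocr DOCUMENT_ID LANG [LANG...]
retry_ocr() {
  local doc_id langs lang
  [ "$#" -ge 2 ] || { echo "usage: retry_ocr DOCUMENT_ID LANG [LANG...]" >&2; return 1; }
  doc_id=$1
  shift
  langs=""
  for lang in "$@"; do
    langs="${langs:+$langs, }\"$lang\""
  done
  # Setting CURL=echo prints the arguments instead of sending the request.
  ${CURL:-curl} -X POST \
    -H "Authorization: Bearer $READUR_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"languages\": [$langs]}" \
    "$READUR_URL/api/documents/$doc_id/ocr/retry"
}
```

A dry run such as `CURL=echo retry_ocr DOCUMENT_ID eng deu` prints the headers, body, and URL that would be sent, which is a cheap way to confirm the language list before triggering reprocessing.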
## 🎯 Best Practices
### Language Selection Strategy
**For Mixed-Language Documents:**
- Choose 2-3 languages that appear in your document
- Include English as a fallback if the document may contain any English text (the English model is the most reliable)
- Put the dominant language first as your primary language
**Examples:**
- **Business document with English/Spanish:** `["eng", "spa"]`
- **European legal document:** `["eng", "fra", "deu"]`
- **Academic paper citing Spanish and Italian sources:** `["eng", "spa", "ita"]`
### Performance Optimization
**Do:**
- ✅ Limit to 2-4 languages for best performance
- ✅ Include English when processing mixed content
- ✅ Use specific language combinations for consistent document types
- ✅ Set realistic expectations for complex multilingual documents
**Don't:**
- ❌ Select languages not present in your documents
- ❌ Use more than 4 languages simultaneously
- ❌ Expect perfect results with very low-quality scans
- ❌ Mix completely unrelated language families unnecessarily
## 🔄 Retrying OCR Processing
If OCR results are poor, you can retry with different languages:
### Via Web Interface
1. **Navigate to the document** with poor OCR results
2. **Click the "Retry OCR"** button
3. **Select different languages** that better match your document
4. **Start the retry process**
### Common Retry Scenarios
**Scenario 1: Wrong Language Selected**
- Original: English-only processing of a Spanish document
- Solution: Retry with `["spa", "eng"]`
**Scenario 2: Mixed Language Document**
- Original: Single language processing
- Solution: Add 2-3 relevant languages
**Scenario 3: Poor Quality Scan**
- Original: Fast processing with limited languages
- Solution: Try with primary language + English fallback
## 📊 Monitoring OCR Results
### Understanding OCR Confidence
- **90%+** - Excellent results, high accuracy
- **70-89%** - Good results, minor errors possible
- **50-69%** - Moderate results, review recommended
- **Below 50%** - Poor results, consider retry with different languages
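These bands are easy to apply in a script that triages documents for review or retry. The sketch below assumes the confidence is available as a percentage between 0 and 100. `classify_confidence` is a hypothetical helper name, and how you obtain the confidence value for a document is not covered here.

```bash
# Hypothetical helper (not part of Readur): map a confidence percentage
# to the bands listed above.
# Usage: classify_confidence PERCENT   (e.g. 87 or 87.4)
classify_confidence() {
  local pct
  # Drop any fractional part so the value can be compared as an integer.
  pct=${1%%.*}
  pct=${pct:-0}
  if [ "$pct" -ge 90 ]; then
    echo "excellent"
  elif [ "$pct" -ge 70 ]; then
    echo "good"
  elif [ "$pct" -ge 50 ]; then
    echo "moderate: review recommended"
  else
    echo "poor: consider retrying with different languages"
  fi
}
```

Because the fractional part is truncated, a value such as `89.9` falls into the "good" band rather than "excellent".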
### Language-Specific Performance
Accuracy varies by language and script:
- **English and Romance languages** (English, Spanish, French): Highest accuracy
- **Other Germanic languages** (German, Dutch): Very good accuracy
- **East Asian languages** (Chinese, Japanese): Good accuracy with proper font recognition
- **Right-to-left scripts** (Arabic, Hebrew): Moderate accuracy, depends on text quality
## 🐛 Troubleshooting
### Common Issues
**Problem:** "Language not available" error
**Solution:**
- Check language code spelling (e.g., `eng` not `english`)
- Verify language is installed on the server
- Contact administrator if language should be available
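A quick local check can catch the most common spelling mistake (a full language name instead of a code) before a request is sent. The sketch below only tests the shape of a code and cannot tell whether the language is installed. The pattern (three lowercase letters, optionally followed by an underscore suffix as in `chi_sim`) is an assumption inferred from the codes listed in this guide, and `looks_like_language_code` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of Readur): succeed if the argument has the
# shape of the language codes used in this guide, e.g. eng or chi_sim.
# The ( ... ) body runs in a subshell, so the LC_ALL change stays local.
looks_like_language_code() (
  LC_ALL=C   # make [a-z] match ASCII lowercase letters only
  case "$1" in
    *[!a-z_]*) false ;;                                  # uppercase, digits, spaces
    [a-z][a-z][a-z] | [a-z][a-z][a-z]_[a-z]*) true ;;    # eng, chi_sim
    *) false ;;                                          # english, en, empty string
  esac
)
```

For instance, `looks_like_language_code eng` and `looks_like_language_code chi_sim` succeed, while `english` and `ENG` are rejected. A code that passes this check may still be unavailable on your server, so `GET /api/ocr/languages` remains the authoritative list.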
**Problem:** Poor OCR results despite correct language
**Solutions:**
- Ensure document scan quality is sufficient (300+ DPI recommended)
- Try adding English as a fallback language
- Consider document preprocessing (contrast, rotation correction)
- Retry with fewer languages, limited to those that actually appear in the document
**Problem:** Slow processing with multiple languages
**Solutions:**
- Reduce number of selected languages to 2-3
- Use only languages that are present in your document
- Consider processing during off-peak hours
### Getting Help
If you're experiencing issues:
1. **Check the OCR health endpoint** - `GET /api/ocr/health`
2. **Review your language selection** - ensure languages match document content
3. **Try with English fallback** - adds reliability to processing
4. **Contact support** with document ID and language combination used
## 🔮 Advanced Features
### Planned Enhancements
- **Auto-language detection**: Automatic suggestion of optimal language combinations
- **Custom language models**: Upload your own specialized language data
- **Batch language updates**: Change languages for multiple documents at once
- **Language-specific confidence thresholds**: Fine-tune accuracy requirements per language
### Integration Options
The multi-language OCR system integrates with:
- **Document management workflows**
- **Automated processing pipelines**
- **Third-party applications via REST API**
- **Webhook notifications for completion**
## 📚 Additional Resources
- **API Documentation**: Complete endpoint reference
- **Language Codes Reference**: Full list of supported language codes
- **Performance Guidelines**: Optimization recommendations
- **Migration Guide**: Upgrading from single-language setup
---
**Need Help?** Contact support or check the system health dashboard for real-time OCR capability status.