# Multi-Language OCR Guide
Readur can run OCR with several languages at once, which improves text extraction on documents that mix languages.
## 🌍 Overview
The multi-language OCR system allows you to:
- **Process documents in up to 4 languages simultaneously** for best results
- **Set preferred languages** that apply to all your document uploads
- **Retry failed OCR** with different language combinations
- **Improve text extraction** by applying several language models to the same document
## 🚀 Getting Started
### Setting Your Language Preferences
1. **Navigate to Settings** in your account
2. **Select OCR Languages** section
3. **Choose up to 4 preferred languages** - these will be used for all new uploads
4. **Set a primary language** - this language gets processing priority
5. **Save your preferences**
**Example preferred language setup:**
- Primary: English (`eng`)
- Additional: Spanish (`spa`), French (`fra`)
- Result: Documents processed with English priority, plus Spanish and French recognition
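The limits above can be checked before saving. The following is a minimal sketch, not part of Readur: `validate_ocr_prefs` is a hypothetical helper, and it assumes the primary language must be one of the preferred languages, which matches the example setup but is not stated as a rule.

```bash
# Hypothetical helper: check a preference set against the limits described above.
# Usage: validate_ocr_prefs PRIMARY LANG [LANG...]
validate_ocr_prefs() {
  primary=$1
  shift
  # Between one and four languages may be selected.
  if [ "$#" -lt 1 ] || [ "$#" -gt 4 ]; then
    echo "error: choose between 1 and 4 languages (got $#)" >&2
    return 1
  fi
  # Assumption: the primary language must appear in the selected list.
  for lang in "$@"; do
    if [ "$lang" = "$primary" ]; then
      return 0
    fi
  done
  echo "error: primary language '$primary' is not in the selected list" >&2
  return 1
}

validate_ocr_prefs eng eng spa fra && echo "preferences look valid"
```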
### Language Selection During Upload
When uploading documents, you can:
1. **Use your default preferences** - no action needed
2. **Override for specific documents:**
   - Click the language selector in the upload area
   - Choose different languages for this upload session
   - These languages will be applied to all files in the current upload
## 📋 Available Languages
Readur supports 67+ languages including:
### Major World Languages
- **English** (`eng`) - Default and most reliable
- **Spanish** (`spa`) - Excellent accuracy
- **French** (`fra`) - High quality results
- **German** (`deu`) - Strong performance
- **Italian** (`ita`) - Good accuracy
- **Portuguese** (`por`) - Reliable processing
- **Russian** (`rus`) - Solid results
### Asian Languages
- **Chinese Simplified** (`chi_sim`)
- **Chinese Traditional** (`chi_tra`)
- **Japanese** (`jpn`)
- **Korean** (`kor`)
- **Hindi** (`hin`)
- **Thai** (`tha`)
- **Vietnamese** (`vie`)
### European Languages
- **Dutch** (`nld`)
- **Swedish** (`swe`)
- **Norwegian** (`nor`)
- **Danish** (`dan`)
- **Finnish** (`fin`)
- **Polish** (`pol`)
- **Czech** (`ces`)
### And Many More
Including Arabic (`ara`), Hebrew (`heb`), Turkish (`tur`), and dozens of other languages.
> **Tip:** For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: `GET /api/ocr/languages`
## 🛠️ Using the API
### Get Available Languages
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://your-readur-instance.com/api/ocr/languages
```
**Response:**
```json
{
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa",
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
}
```
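To list only the languages that are installed, the response can be filtered in the shell. This is a rough sketch: `installed_codes` is a hypothetical helper, and it assumes `"code"` comes before `"installed"` inside each object, as in the sample response. A JSON-aware parser would be more robust where one is available.

```bash
# Hypothetical helper: read the JSON response on stdin and print the code
# of every language whose "installed" flag is true.
installed_codes() {
  # Put each key/value pair on its own line, then remember the most recent
  # "code" value and print it when the matching "installed": true is seen.
  tr ',{}' '\n\n\n' | awk -F'"' '
    /"code"/ { code = $4 }
    /"installed"[ \t]*:[ \t]*true/ {
      if (code != "") print code
      code = ""
    }
  '
}

# Example with an inline response; only "eng" is marked as installed.
installed_codes <<'EOF'
{"available_languages": [
  {"code": "eng", "name": "English", "installed": true},
  {"code": "spa", "name": "Spanish", "installed": false}
]}
EOF
```

Against a live instance, the output of the `curl` command above could be piped into `installed_codes` in the same way.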
### Update Language Preferences
```bash
curl -X PUT \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "preferred_languages": ["eng", "spa", "fra"],
    "primary_language": "eng"
  }' \
  https://your-readur-instance.com/api/settings
```
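When the language list comes from a script rather than being typed by hand, the request body can be generated. This is a sketch only: `build_prefs_payload` is a hypothetical helper that reproduces the payload shape shown above, and it does no JSON escaping because language codes are plain ASCII.

```bash
# Hypothetical helper: build the settings payload from a primary language
# followed by the full list of preferred languages.
# Usage: build_prefs_payload PRIMARY LANG [LANG...]
build_prefs_payload() {
  primary=$1
  shift
  list=""
  for lang in "$@"; do
    # Join the codes as a comma-separated list of quoted JSON strings.
    list="${list:+$list, }\"$lang\""
  done
  printf '{"preferred_languages": [%s], "primary_language": "%s"}\n' "$list" "$primary"
}

build_prefs_payload eng eng spa fra
# prints: {"preferred_languages": ["eng", "spa", "fra"], "primary_language": "eng"}
```

The result can be passed to the request above with `-d "$(build_prefs_payload eng eng spa fra)"`.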
### Retry OCR with Different Languages
```bash
curl -X POST \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "languages": ["eng", "deu"]
  }' \
  https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
```
## 🎯 Best Practices
### Language Selection Strategy
**For Mixed-Language Documents:**
- Choose 2-3 languages that appear in your document
- Always include English as a fallback (most reliable)
- Put the dominant language first as your primary language
**Examples:**
- **Business document with English/Spanish:** `["eng", "spa"]`
- **European legal document:** `["eng", "fra", "deu"]`
- **Academic paper with multiple references:** `["eng", "spa", "ita"]`
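The selection rules above (dominant language first, English as a fallback, no more than four languages) can be sketched as a small shell function. `pick_languages` is a hypothetical helper, not a Readur command. Note that when four non-English languages are given, the four-language limit takes precedence and English is left out.

```bash
# Hypothetical helper: print a language list with the dominant language first,
# duplicates removed, English appended as a fallback, and at most four entries.
# Usage: pick_languages DOMINANT [OTHER...]
pick_languages() {
  result=""
  count=0
  for lang in "$@" eng; do
    # Skip codes that are already in the list.
    case " $result " in
      *" $lang "*) continue ;;
    esac
    # Stop once the four-language limit is reached.
    [ "$count" -ge 4 ] && break
    result="${result:+$result }$lang"
    count=$((count + 1))
  done
  echo "$result"
}

pick_languages spa eng   # prints: spa eng
pick_languages fra deu   # prints: fra deu eng
```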
### Performance Optimization
**Do:**
- ✅ Limit to 2-4 languages for best performance
- ✅ Include English when processing mixed content
- ✅ Use specific language combinations for consistent document types
- ✅ Set realistic expectations for complex multilingual documents
**Don't:**
- ❌ Select languages not present in your documents (other than an English fallback)
- ❌ Use more than 4 languages simultaneously
- ❌ Expect perfect results with very low-quality scans
- ❌ Mix completely unrelated language families unnecessarily
## 🔄 Retrying OCR Processing
If OCR results are poor, you can retry with different languages:
### Via Web Interface
1. **Navigate to the document** with poor OCR results
2. **Click "Retry OCR"** button
3. **Select different languages** that better match your document
4. **Start retry process**
### Common Retry Scenarios
**Scenario 1: Wrong Language Selected**
- Original: English-only processing of Spanish document
- Solution: Retry with `["spa", "eng"]`
**Scenario 2: Mixed Language Document**
- Original: Single language processing
- Solution: Add 2-3 relevant languages
**Scenario 3: Poor Quality Scan**
- Original: Fast processing with limited languages
- Solution: Try with primary language + English fallback
## 📊 Monitoring OCR Results
### Understanding OCR Confidence
- **90%+** - Excellent results, high accuracy
- **70-89%** - Good results, minor errors possible
- **50-69%** - Moderate results, review recommended
- **Below 50%** - Poor results, consider retry with different languages
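In a script, these bands can be turned into a simple decision. The sketch below is illustrative: `confidence_band` is a hypothetical helper, and it assumes the confidence is available as a number from 0 to 100 (any fractional part is dropped before comparison).

```bash
# Hypothetical helper: map an OCR confidence value (0-100) to the bands above.
confidence_band() {
  # Drop any fractional part so that plain integer comparison works.
  conf=${1%%.*}
  if [ "$conf" -ge 90 ]; then
    echo "excellent"
  elif [ "$conf" -ge 70 ]; then
    echo "good"
  elif [ "$conf" -ge 50 ]; then
    echo "moderate: review recommended"
  else
    echo "poor: consider retrying with different languages"
  fi
}

confidence_band 72   # prints: good
```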
### Language-Specific Performance
Different languages have varying accuracy rates:
- **English and Romance languages** (English, Spanish, French): Highest accuracy
- **Other Germanic languages** (German, Dutch): Very good accuracy
- **Asian languages** (Chinese, Japanese): Good accuracy with proper font recognition
- **Arabic/Hebrew scripts**: Moderate accuracy, depends on text quality
## 🐛 Troubleshooting
### Common Issues
**Problem:** "Language not available" error
**Solution:**
- Check language code spelling (e.g., `eng` not `english`)
- Verify language is installed on the server
- Contact administrator if language should be available
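A quick format check catches the most common spelling mistake before a request is sent. This is a sketch under an assumption: `is_lang_code` is a hypothetical helper that only accepts the two shapes of code used in this guide, three lowercase letters (`eng`) or two three-letter groups joined by an underscore (`chi_sim`). It does not check whether the language is installed on the server.

```bash
# Hypothetical helper: succeed if the argument looks like a language code
# of the form used in this guide (for example "eng" or "chi_sim").
is_lang_code() {
  case $1 in
    [a-z][a-z][a-z]) return 0 ;;
    [a-z][a-z][a-z]_[a-z][a-z][a-z]) return 0 ;;
    *) return 1 ;;
  esac
}

for code in eng english chi_sim; do
  if is_lang_code "$code"; then
    echo "$code: format ok"
  else
    echo "$code: not a valid code format"
  fi
done
```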
**Problem:** Poor OCR results despite correct language
**Solutions:**
- Ensure document scan quality is sufficient (300+ DPI recommended)
- Try adding English as a fallback language
- Consider document preprocessing (contrast, rotation correction)
- Retry with fewer languages for better performance
**Problem:** Slow processing with multiple languages
**Solutions:**
- Reduce number of selected languages to 2-3
- Use only languages that are present in your document
- Consider processing during off-peak hours
### Getting Help
If you're experiencing issues:
1. **Check the OCR health endpoint** - `GET /api/ocr/health`
2. **Review your language selection** - ensure languages match document content
3. **Try with English fallback** - adds reliability to processing
4. **Contact support** with document ID and language combination used
## 🔮 Advanced Features
### Planned Enhancements
- **Auto-language detection**: Automatic suggestion of optimal language combinations
- **Custom language models**: Upload your own specialized language data
- **Batch language updates**: Change languages for multiple documents at once
- **Language-specific confidence thresholds**: Fine-tune accuracy requirements per language
### Integration Options
The multi-language OCR system integrates with:
- **Document management workflows**
- **Automated processing pipelines**
- **Third-party applications via REST API**
- **Webhook notifications for completion**
## 📚 Additional Resources
- **API Documentation**: Complete endpoint reference
- **Language Codes Reference**: Full list of supported language codes
- **Performance Guidelines**: Optimization recommendations
- **Migration Guide**: Upgrading from single-language setup
---
**Need Help?** Contact support or check the system health dashboard for real-time OCR capability status.