# Multi-Language OCR Guide

Readur can run OCR on a document with several languages at once, which improves text extraction accuracy for multilingual content.

## 🌍 Overview

The multi-language OCR system allows you to:

- **Process documents in up to 4 languages simultaneously** for best results
- **Set preferred languages** that apply to all your document uploads
- **Retry failed OCR** with different language combinations
- **Automatically optimize** text extraction by using multiple language models

## 🚀 Getting Started

### Setting Your Language Preferences

1. **Navigate to Settings** in your account
2. **Select the OCR Languages section**
3. **Choose up to 4 preferred languages** - these will be used for all new uploads
4. **Set a primary language** - this language gets processing priority
5. **Save your preferences**

**Example preferred language setup:**
- Primary: English (`eng`)
- Additional: Spanish (`spa`), French (`fra`)
- Result: Documents are processed with English priority, plus Spanish and French recognition

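
The rules in the steps above can be checked before saving. The following is a minimal sketch, not Readur's own validation: the function name `validate_preferences` is hypothetical, and it assumes that the primary language must also appear in the preferred list and that at least one language is required (the guide's example is consistent with this but does not state it).

```python
MAX_LANGUAGES = 4  # limit stated in this guide


def validate_preferences(preferred, primary):
    """Return a list of problems; an empty list means the setup looks valid."""
    problems = []
    if not preferred:
        # Assumption: at least one language has to be selected.
        problems.append("select at least one language")
    if len(preferred) > MAX_LANGUAGES:
        problems.append(
            f"at most {MAX_LANGUAGES} languages allowed, got {len(preferred)}"
        )
    if len(set(preferred)) != len(preferred):
        problems.append("duplicate language codes")
    if primary not in preferred:
        # Assumption: the primary language is one of the preferred languages.
        problems.append(
            f"primary language {primary!r} is not in the preferred list"
        )
    return problems


print(validate_preferences(["eng", "spa", "fra"], "eng"))  # prints []
```

Returning a list of problems rather than raising on the first one lets a settings form show every issue at once.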
### Language Selection During Upload

When uploading documents, you can:

1. **Use your default preferences** - no action needed
2. **Override for specific documents:**
   - Click the language selector in the upload area
   - Choose different languages for this upload session
   - These languages will be applied to all files in the current upload

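
The override behavior described above amounts to "use the session override if one was chosen, otherwise the saved defaults, and apply the result to every file in the upload". A small sketch of that logic, with the hypothetical helper `languages_for_upload` and made-up file names:

```python
def languages_for_upload(default_languages, override=None):
    """Pick the languages for one upload session.

    An override chosen in the upload area wins; otherwise the saved
    preferences are used unchanged.
    """
    chosen = override if override else default_languages
    return list(chosen)


# Illustrative values only: saved defaults plus a per-session override.
session_languages = languages_for_upload(
    ["eng", "spa", "fra"], override=["deu", "eng"]
)

# Every file in the same upload session gets the same language list.
files = ["invoice.pdf", "contract.pdf"]
plan = {name: session_languages for name in files}
```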
## 📋 Available Languages

Readur supports 67+ languages, including:

### Major World Languages
- **English** (`eng`) - Default and most reliable
- **Spanish** (`spa`) - Excellent accuracy
- **French** (`fra`) - High quality results
- **German** (`deu`) - Strong performance
- **Italian** (`ita`) - Good accuracy
- **Portuguese** (`por`) - Reliable processing
- **Russian** (`rus`) - Solid results

### Asian Languages
- **Chinese Simplified** (`chi_sim`)
- **Chinese Traditional** (`chi_tra`)
- **Japanese** (`jpn`)
- **Korean** (`kor`)
- **Hindi** (`hin`)
- **Thai** (`tha`)
- **Vietnamese** (`vie`)

### European Languages
- **Dutch** (`nld`)
- **Swedish** (`swe`)
- **Norwegian** (`nor`)
- **Danish** (`dan`)
- **Finnish** (`fin`)
- **Polish** (`pol`)
- **Czech** (`ces`)

### And Many More
Including Arabic (`ara`), Hebrew (`heb`), Turkish (`tur`), and dozens of other languages.

> **Tip:** For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: `GET /api/ocr/languages`

## 🛠️ Using the API

### Get Available Languages
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  https://your-readur-instance.com/api/ocr/languages
```

**Response:**
```json
{
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa",
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
}
```

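
A client will usually want only the codes that are actually installed. The sketch below parses a response shaped like the sample above; the field names come from that sample, while the helper name `installed_codes` is hypothetical and a real response may carry more fields.

```python
import json

# Sample body copied from the response shown in this guide.
sample = """
{
  "available_languages": [
    {"code": "eng", "name": "English", "installed": true},
    {"code": "spa", "name": "Spanish", "installed": true}
  ],
  "current_user_language": "eng"
}
"""


def installed_codes(response_text):
    """Return the codes of all languages marked as installed."""
    data = json.loads(response_text)
    return [
        lang["code"]
        for lang in data["available_languages"]
        if lang["installed"]
    ]


print(installed_codes(sample))  # prints ['eng', 'spa']
```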
### Update Language Preferences
```bash
curl -X PUT \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "preferred_languages": ["eng", "spa", "fra"],
    "primary_language": "eng"
  }' \
  https://your-readur-instance.com/api/settings
```

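
The same request can be assembled from a script. This sketch uses only Python's standard library and mirrors the curl call above; the helper name `build_preferences_request` is hypothetical, and the request is built but deliberately not sent so the example runs offline.

```python
import json
import urllib.request


def build_preferences_request(base_url, token, preferred, primary):
    """Build (but do not send) the PUT request for /api/settings."""
    body = json.dumps(
        {"preferred_languages": preferred, "primary_language": primary}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/settings",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_preferences_request(
    "https://your-readur-instance.com", "YOUR_TOKEN", ["eng", "spa", "fra"], "eng"
)
# urllib.request.urlopen(request) would send it to a real instance.
```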
### Retry OCR with Different Languages
```bash
curl -X POST \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "languages": ["eng", "deu"]
  }' \
  https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
```

## 🎯 Best Practices

### Language Selection Strategy

**For Mixed-Language Documents:**
- Choose 2-3 languages that appear in your document
- Always include English as a fallback (most reliable)
- Put the dominant language first as your primary language

**Examples:**
- **Business document with English/Spanish:** `["eng", "spa"]`
- **European legal document:** `["eng", "fra", "deu"]`
- **Academic paper with multiple references:** `["eng", "spa", "ita"]`

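
The strategy above (dominant language first, English as fallback, no more than 4 languages) can be written down as a small helper. This is an illustrative sketch; `build_language_list` and its parameters are hypothetical names, and placing the fallback last is a choice made here, not something the guide prescribes.

```python
def build_language_list(dominant, others=(), fallback="eng", limit=4):
    """Order languages: dominant first, then others, fallback last.

    Duplicates are dropped and the result never exceeds `limit` entries
    (assumes limit >= 2).
    """
    # dict.fromkeys keeps the first occurrence of each code, in order.
    rest = [c for c in dict.fromkeys(others) if c not in (dominant, fallback)]
    if dominant == fallback:
        return [dominant, *rest][:limit]
    # Reserve one slot for the dominant language and one for the fallback.
    return [dominant, *rest[: limit - 2], fallback]


print(build_language_list("spa"))  # prints ['spa', 'eng']
```

When more languages are passed than fit, the extra ones are dropped rather than the fallback, which keeps the "always include English" advice intact.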
### Performance Optimization

**Do:**
- ✅ Limit to 2-4 languages for best performance
- ✅ Include English when processing mixed content
- ✅ Use specific language combinations for consistent document types
- ✅ Set realistic expectations for complex multilingual documents

**Don't:**
- ❌ Select languages not present in your documents
- ❌ Use more than 4 languages simultaneously
- ❌ Expect perfect results with very low-quality scans
- ❌ Mix completely unrelated language families unnecessarily

## 🔄 Retrying OCR Processing

If OCR results are poor, you can retry with different languages:

### Via Web Interface
1. **Navigate to the document** with poor OCR results
2. **Click the "Retry OCR" button**
3. **Select different languages** that better match your document
4. **Start the retry process**

### Common Retry Scenarios

**Scenario 1: Wrong Language Detected**
- Original: English-only processing of a Spanish document
- Solution: Retry with `["spa", "eng"]`

**Scenario 2: Mixed Language Document**
- Original: Single-language processing
- Solution: Add 2-3 relevant languages

**Scenario 3: Poor Quality Scan**
- Original: Fast processing with limited languages
- Solution: Try the primary language plus English as a fallback

## 📊 Monitoring OCR Results

### Understanding OCR Confidence
- **90%+** - Excellent results, high accuracy
- **70-89%** - Good results, minor errors possible
- **50-69%** - Moderate results, review recommended
- **Below 50%** - Poor results, consider retry with different languages

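
These bands translate directly into a lookup that an automated pipeline could use to decide when to retry. A minimal sketch, assuming the confidence is available as a percentage between 0 and 100; the function names are hypothetical, and fractional values such as 89.5 fall into the lower of the two neighboring bands.

```python
def confidence_band(confidence):
    """Map an OCR confidence percentage to the bands used in this guide."""
    if confidence >= 90:
        return "excellent"
    if confidence >= 70:
        return "good"
    if confidence >= 50:
        return "moderate"
    return "poor"


def should_retry(confidence):
    """Suggest a retry with different languages only for poor results."""
    return confidence_band(confidence) == "poor"


print(confidence_band(72), should_retry(72))  # prints: good False
```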
### Language-Specific Performance
Different languages have varying accuracy rates:
- **Latin-based scripts** (English, Spanish, French): Highest accuracy
- **Germanic languages** (German, Dutch): Very good accuracy
- **Asian languages** (Chinese, Japanese): Good accuracy with proper font recognition
- **Arabic/Hebrew scripts**: Moderate accuracy, depends on text quality

## 🐛 Troubleshooting

### Common Issues

**Problem:** "Language not available" error
**Solution:**
- Check the language code spelling (e.g., `eng` not `english`)
- Verify the language is installed on the server
- Contact your administrator if the language should be available

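
The first two checks can be automated against the `available_languages` list returned by `GET /api/ocr/languages`. The sketch below is an assumption-laden illustration: `diagnose_language` is a hypothetical helper, and the code pattern (three lowercase letters with an optional `_suffix`) is inferred only from the codes listed in this guide, such as `eng` and `chi_sim`.

```python
import re

# Assumed shape of a language code, inferred from codes like eng and chi_sim.
CODE_PATTERN = re.compile(r"^[a-z]{3}(_[a-z]+)?$")


def diagnose_language(code, available):
    """Explain why a language code might be rejected.

    `available` is a list of dicts shaped like the entries of
    `available_languages` in the API response.
    """
    if not CODE_PATTERN.match(code):
        return "malformed code"
    entry = next((lang for lang in available if lang["code"] == code), None)
    if entry is None:
        return "unknown code"
    if not entry["installed"]:
        return "not installed on server"
    return "ok"


available = [
    {"code": "eng", "name": "English", "installed": True},
    {"code": "deu", "name": "German", "installed": False},
]
print(diagnose_language("english", available))  # prints: malformed code
```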
**Problem:** Poor OCR results despite correct language
**Solutions:**
- Ensure document scan quality is sufficient (300+ DPI recommended)
- Try adding English as a fallback language
- Consider document preprocessing (contrast, rotation correction)
- Retry with fewer languages for better performance

**Problem:** Slow processing with multiple languages
**Solutions:**
- Reduce the number of selected languages to 2-3
- Use only languages present in your document
- Consider processing during off-peak hours

### Getting Help

If you're experiencing issues:

1. **Check the OCR Health page** - `GET /api/ocr/health`
2. **Review your language selection** - ensure languages match document content
3. **Try with English fallback** - adds reliability to processing
4. **Contact support** with document ID and language combination used

## 🔮 Advanced Features

### Planned Enhancements
- **Auto-language detection**: Automatic suggestion of optimal language combinations
- **Custom language models**: Upload your own specialized language data
- **Batch language updates**: Change languages for multiple documents at once
- **Language-specific confidence thresholds**: Fine-tune accuracy requirements per language

### Integration Options
The multi-language OCR system integrates with:
- **Document management workflows**
- **Automated processing pipelines**
- **Third-party applications via REST API**
- **Webhook notifications for completion**

## 📚 Additional Resources

- **API Documentation**: Complete endpoint reference
- **Language Codes Reference**: Full list of supported language codes
- **Performance Guidelines**: Optimization recommendations
- **Migration Guide**: Upgrading from single-language setup

---

**Need Help?** Contact support or check the system health dashboard for real-time OCR capability status.