Readur/docs/multi-language-ocr-guide.md

# Multi-Language OCR Guide

Readur supports powerful multi-language OCR capabilities that allow you to process documents in multiple languages simultaneously for optimal text extraction accuracy.

## 🌍 Overview

The multi-language OCR system allows you to:
- **Process documents in up to 4 languages simultaneously** for best results
- **Set preferred languages** that apply to all your document uploads
- **Retry failed OCR** with different language combinations
- **Automatically optimize** text extraction by using multiple language models

## 🚀 Getting Started

### Setting Your Language Preferences

1. **Navigate to Settings** in your account
2. **Select OCR Languages** section
3. **Choose up to 4 preferred languages** - these will be used for all new uploads
4. **Set a primary language** - this language gets processing priority
5. **Save your preferences**

**Example preferred language setup:**
- Primary: English (`eng`)
- Additional: Spanish (`spa`), French (`fra`)
- Result: Documents processed with English priority, plus Spanish and French recognition

### Language Selection During Upload

When uploading documents, you can:

1. **Use your default preferences** - no action needed
2. **Override for specific documents:**
   - Click the language selector in the upload area
   - Choose different languages for this upload session
   - These languages will be applied to all files in the current upload

## 📋 Available Languages

Readur supports 67+ languages including:

### Major World Languages
- **English** (`eng`) - Default and most reliable
- **Spanish** (`spa`) - Excellent accuracy
- **French** (`fra`) - High quality results
- **German** (`deu`) - Strong performance
- **Italian** (`ita`) - Good accuracy
- **Portuguese** (`por`) - Reliable processing
- **Russian** (`rus`) - Solid results

### Asian Languages
- **Chinese Simplified** (`chi_sim`)
- **Chinese Traditional** (`chi_tra`)
- **Japanese** (`jpn`)
- **Korean** (`kor`)
- **Hindi** (`hin`)
- **Thai** (`tha`)
- **Vietnamese** (`vie`)

### European Languages
- **Dutch** (`nld`)
- **Swedish** (`swe`)
- **Norwegian** (`nor`)
- **Danish** (`dan`)
- **Finnish** (`fin`)
- **Polish** (`pol`)
- **Czech** (`ces`)

### And Many More
Including Arabic (`ara`), Hebrew (`heb`), Turkish (`tur`), and dozens of other languages.

> **Tip:** For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: `GET /api/ocr/languages`

## 🛠️ Using the API

### Get Available Languages
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://your-readur-instance.com/api/ocr/languages
```

**Response:**
```json
{
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa",
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
}
```

### Update Language Preferences
```bash
curl -X PUT \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "preferred_languages": ["eng", "spa", "fra"],
       "primary_language": "eng"
     }' \
     https://your-readur-instance.com/api/settings
```

### Retry OCR with Different Languages
```bash
curl -X POST \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "languages": ["eng", "deu"]
     }' \
     https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
```

## 🎯 Best Practices

### Language Selection Strategy

**For Mixed-Language Documents:**
- Choose 2-3 languages that appear in your document
- Always include English as a fallback (most reliable)
- Put the dominant language first as your primary language

**Examples:**
- **Business document with English/Spanish:** `["eng", "spa"]`
- **European legal document:** `["eng", "fra", "deu"]`
- **Academic paper with multiple references:** `["eng", "spa", "ita"]`

### Performance Optimization

**Do:**
- ✅ Limit to 2-4 languages for best performance
- ✅ Include English when processing mixed content
- ✅ Use specific language combinations for consistent document types
- ✅ Set realistic expectations for complex multilingual documents

**Don't:**
- ❌ Select languages not present in your documents
- ❌ Use more than 4 languages simultaneously
- ❌ Expect perfect results with very low-quality scans
- ❌ Mix completely unrelated language families unnecessarily

## 🔄 Retrying OCR Processing

If OCR results are poor, you can retry with different languages:

### Via Web Interface
1. **Navigate to the document** with poor OCR results
2. **Click "Retry OCR"** button
3. **Select different languages** that better match your document
4. **Start retry process**

### Common Retry Scenarios

**Scenario 1: Wrong Language Detected**
- Original: English-only processing of Spanish document
- Solution: Retry with `["spa", "eng"]`

**Scenario 2: Mixed Language Document**
- Original: Single language processing
- Solution: Add 2-3 relevant languages

**Scenario 3: Poor Quality Scan**
- Original: Fast processing with limited languages
- Solution: Try with primary language + English fallback

## 📊 Monitoring OCR Results

### Understanding OCR Confidence
- **90%+** - Excellent results, high accuracy
- **70-89%** - Good results, minor errors possible
- **50-69%** - Moderate results, review recommended
- **Below 50%** - Poor results, consider retry with different languages

### Language-Specific Performance
Different languages have varying accuracy rates:
- **Latin-based scripts** (English, Spanish, French): Highest accuracy
- **Germanic languages** (German, Dutch): Very good accuracy
- **Asian languages** (Chinese, Japanese): Good accuracy with proper font recognition
- **Arabic/Hebrew scripts**: Moderate accuracy, depends on text quality

## 🐛 Troubleshooting

### Common Issues

**Problem:** "Language not available" error
**Solution:**
- Check language code spelling (e.g., `eng` not `english`)
- Verify language is installed on the server
- Contact administrator if language should be available

**Problem:** Poor OCR results despite correct language
**Solutions:**
- Ensure document scan quality is sufficient (300+ DPI recommended)
- Try adding English as a fallback language
- Consider document preprocessing (contrast, rotation correction)
- Retry with fewer languages for better performance

**Problem:** Slow processing with multiple languages
**Solutions:**
- Reduce number of selected languages to 2-3
- Use languages only present in your document
- Consider processing during off-peak hours

### Getting Help

If you're experiencing issues:

1. **Check the OCR Health page** - `GET /api/ocr/health`
2. **Review your language selection** - ensure languages match document content
3. **Try with English fallback** - adds reliability to processing
4. **Contact support** with document ID and language combination used

## 🔮 Advanced Features

### Planned Enhancements
- **Auto-language detection**: Automatic suggestion of optimal language combinations
- **Custom language models**: Upload your own specialized language data
- **Batch language updates**: Change languages for multiple documents at once
- **Language-specific confidence thresholds**: Fine-tune accuracy requirements per language

### Integration Options
The multi-language OCR system integrates with:
- **Document management workflows**
- **Automated processing pipelines**
- **Third-party applications via REST API**
- **Webhook notifications for completion**

## 📚 Additional Resources

- **API Documentation**: Complete endpoint reference
- **Language Codes Reference**: Full list of supported language codes
- **Performance Guidelines**: Optimization recommendations
- **Migration Guide**: Upgrading from single-language setup

---

**Need Help?** Contact support or check the system health dashboard for real-time OCR capability status.