feat(docs): add docs about multiple ocr languages

2025-07-21 23:34:57 +00:00 · 2e6c1ef238
parent 98c3bd50ef
commit 2e6c1ef238
2 changed files with 249 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -54,6 +54,7 @@ open http://localhost:8000
 - [👥 User Management](docs/user-management-guide.md) - Authentication, roles, and administration
 - [🏷️ Labels & Organization](docs/labels-and-organization.md) - Document tagging and categorization
 - [🔎 Advanced Search](docs/advanced-search.md) - Search modes, syntax, and optimization
 - [🌍 Multi-Language OCR Guide](docs/multi-language-ocr-guide.md) - Process documents in multiple languages simultaneously
 - [🔐 OIDC Setup](docs/oidc-setup.md) - Single Sign-On integration
 ### Deployment & Operations
@@ -69,6 +70,7 @@ open http://localhost:8000
 - [🔍 OCR Optimization](docs/dev/OCR_OPTIMIZATION_GUIDE.md) - Improve OCR performance
 - [🗄️ Database Best Practices](docs/dev/DATABASE_GUARDRAILS.md) - Concurrency and safety
 - [📊 Queue Architecture](docs/dev/QUEUE_IMPROVEMENTS.md) - Background job processing
 - [⚠️ Error System Guide](docs/dev/ERROR_SYSTEM.md) - Comprehensive error handling architecture
 ## 🏗️ Architecture
--- a/docs/multi-language-ocr-guide.md
+++ b/docs/multi-language-ocr-guide.md
@@ -0,0 +1,247 @@
 # Multi-Language OCR Guide
 Readur can run OCR with several languages at once, which improves text extraction accuracy for documents that mix languages.
 ## 🌍 Overview
 The multi-language OCR system allows you to:
 - **Process documents in up to 4 languages simultaneously** for best results
 - **Set preferred languages** that apply to all your document uploads
 - **Retry failed OCR** with different language combinations
 - **Automatically optimize** text extraction by using multiple language models
 ## 🚀 Getting Started
 ### Setting Your Language Preferences
 1. **Navigate to Settings** in your account
 2. **Select OCR Languages** section
 3. **Choose up to 4 preferred languages** - these will be used for all new uploads
 4. **Set a primary language** - this language gets processing priority
 5. **Save your preferences**
 **Example preferred language setup:**
 - Primary: English (`eng`)
 - Additional: Spanish (`spa`), French (`fra`)
 - Result: Documents are processed with English as the priority language, plus Spanish and French recognition
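A selection like the one above can be checked on the client before it is saved. The following is a minimal sketch, not Readur's own validation: the function name `validate_preferences`, the code pattern, and the rule that the primary language must be one of the preferred languages are assumptions made for illustration. Only the limit of 4 languages comes from this guide.

```python
import re

MAX_LANGUAGES = 4  # limit stated in this guide

# Assumption: codes follow the shape of the examples in this guide,
# such as "eng" or "chi_sim" (three letters plus optional suffixes).
CODE_PATTERN = re.compile(r"^[a-z]{3}(_[a-z]+)*$")


def validate_preferences(preferred, primary):
    """Return a list of problems; an empty list means the selection looks valid."""
    problems = []
    if not preferred:
        problems.append("select at least one language")
    if len(preferred) > MAX_LANGUAGES:
        problems.append(f"at most {MAX_LANGUAGES} languages are allowed")
    if len(set(preferred)) != len(preferred):
        problems.append("duplicate language codes")
    for code in preferred:
        if not CODE_PATTERN.match(code):
            problems.append(f"unexpected code format: {code!r}")
    # Assumption: the primary language has to be part of the preferred list.
    if primary not in preferred:
        problems.append("primary language must be one of the preferred languages")
    return problems


print(validate_preferences(["eng", "spa", "fra"], "eng"))  # []
```

The function collects every problem instead of stopping at the first, so a settings form could show them all at once.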
 ### Language Selection During Upload
 When uploading documents, you can:
 1. **Use your default preferences** - no action needed
 2. **Override for specific documents:**
   - Click the language selector in the upload area
   - Choose different languages for this upload session
   - These languages will be applied to all files in the current upload
 ## 📋 Available Languages
 Readur supports 67+ languages including:
 ### Major World Languages
 - **English** (`eng`) - Default and most reliable
 - **Spanish** (`spa`) - Excellent accuracy
 - **French** (`fra`) - High quality results
 - **German** (`deu`) - Strong performance
 - **Italian** (`ita`) - Good accuracy
 - **Portuguese** (`por`) - Reliable processing
 - **Russian** (`rus`) - Solid results
 ### Asian Languages  
 - **Chinese Simplified** (`chi_sim`)
 - **Chinese Traditional** (`chi_tra`)
 - **Japanese** (`jpn`)
 - **Korean** (`kor`)
 - **Hindi** (`hin`)
 - **Thai** (`tha`)
 - **Vietnamese** (`vie`)
 ### European Languages
 - **Dutch** (`nld`)
 - **Swedish** (`swe`)
 - **Norwegian** (`nor`)
 - **Danish** (`dan`)
 - **Finnish** (`fin`)
 - **Polish** (`pol`)
 - **Czech** (`ces`)
 ### And Many More
 Including Arabic (`ara`), Hebrew (`heb`), Turkish (`tur`), and dozens of other languages.
 > **Tip:** For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: `GET /api/ocr/languages`
 ## 🛠️ Using the API
 ### Get Available Languages
 ```bash
 curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://your-readur-instance.com/api/ocr/languages
 ```
 **Response:**
 ```json
 {
  "available_languages": [
    {
      "code": "eng",
      "name": "English",
      "installed": true
    },
    {
      "code": "spa", 
      "name": "Spanish",
      "installed": true
    }
  ],
  "current_user_language": "eng"
 }
 ```
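A client will usually want only the codes that are installed on the server. The sketch below parses a body shaped like the response above with Python's standard `json` module. The helper name `installed_codes` is hypothetical, and the field names are taken from the sample response rather than from a formal schema.

```python
import json

# Sample body copied from the response shown above.
response_body = """
{
  "available_languages": [
    {"code": "eng", "name": "English", "installed": true},
    {"code": "spa", "name": "Spanish", "installed": true}
  ],
  "current_user_language": "eng"
}
"""


def installed_codes(body):
    """Return the codes of all languages the server reports as installed."""
    data = json.loads(body)
    return [
        lang["code"]
        for lang in data.get("available_languages", [])
        if lang.get("installed")
    ]


print(installed_codes(response_body))  # ['eng', 'spa']
```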
 ### Update Language Preferences
 ```bash
 curl -X PUT \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "preferred_languages": ["eng", "spa", "fra"],
       "primary_language": "eng"
     }' \
     https://your-readur-instance.com/api/settings
 ```
 ### Retry OCR with Different Languages
 ```bash
 curl -X POST \
     -H "Authorization: Bearer YOUR_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "languages": ["eng", "deu"]
     }' \
     https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
 ```
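The same retry call can be prepared from a script. This sketch builds the request with Python's standard `urllib.request` but does not send it. The helper name `build_retry_request` is hypothetical, and the URL, token, and document ID are the same placeholders used in the curl example.

```python
import json
import urllib.request


def build_retry_request(base_url, token, document_id, languages):
    """Build (but do not send) the OCR retry request for one document."""
    url = f"{base_url.rstrip('/')}/api/documents/{document_id}/ocr/retry"
    body = json.dumps({"languages": languages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_retry_request(
    "https://your-readur-instance.com", "YOUR_TOKEN", "DOCUMENT_ID", ["eng", "deu"]
)
print(request.get_method(), request.full_url)
# Sending is left to the caller, for example: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes it easy to inspect or log the request before it reaches the server.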
 ## 🎯 Best Practices
 ### Language Selection Strategy
 **For Mixed-Language Documents:**
 - Choose 2-3 languages that appear in your document
 - Always include English as a fallback (most reliable)
 - Put the dominant language first as your primary language
 **Examples:**
 - **Business document with English/Spanish:** `["eng", "spa"]`
 - **European legal document:** `["eng", "fra", "deu"]`
 - **Academic paper with multiple references:** `["eng", "spa", "ita"]`
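The selection strategy above (dominant language first, English as fallback, a short list) can be expressed as a small helper. This is a sketch under stated assumptions: the function name `choose_languages` and the default limit of 3 are illustrative, and reserving the last slot for the fallback is one possible reading of the advice.

```python
def choose_languages(dominant, others=(), fallback="eng", limit=3):
    """Order languages with the dominant one first and cap the list at `limit`."""
    ordered = [dominant]
    for code in others:
        if code not in ordered:
            ordered.append(code)
    if fallback not in ordered:
        # Reserve the last slot for the fallback language.
        ordered = ordered[: limit - 1] + [fallback]
    return ordered[:limit]


print(choose_languages("spa"))                  # ['spa', 'eng']
print(choose_languages("eng", ["fra", "deu"]))  # ['eng', 'fra', 'deu']
print(choose_languages("spa", ["fra", "deu"]))  # ['spa', 'fra', 'eng']
```

In the last call the fallback displaces `deu`. A caller who prefers to keep all document languages could pass `limit=4`, the maximum this guide mentions.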
 ### Performance Optimization
 **Do:**
 - ✅ Limit to 2-4 languages for best performance
 - ✅ Include English when processing mixed content
 - ✅ Use specific language combinations for consistent document types
 - ✅ Set realistic expectations for complex multilingual documents
 **Don't:**
 - ❌ Select languages that do not appear in your documents (an English fallback is the exception)
 - ❌ Use more than 4 languages simultaneously
 - ❌ Expect perfect results with very low-quality scans
 - ❌ Mix completely unrelated language families unnecessarily
 ## 🔄 Retrying OCR Processing
 If OCR results are poor, you can retry with different languages:
 ### Via Web Interface
 1. **Navigate to the document** with poor OCR results
 2. **Click "Retry OCR"** button
 3. **Select different languages** that better match your document
 4. **Start retry process**
 ### Common Retry Scenarios
 **Scenario 1: Wrong Language Selected**
 - Original: English-only processing of Spanish document
 - Solution: Retry with `["spa", "eng"]`
 **Scenario 2: Mixed Language Document**
 - Original: Single language processing
 - Solution: Add 2-3 relevant languages
 **Scenario 3: Poor Quality Scan**
 - Original: Fast processing with limited languages
 - Solution: Try with primary language + English fallback
 ## 📊 Monitoring OCR Results
 ### Understanding OCR Confidence
 - **90%+** - Excellent results, high accuracy
 - **70-89%** - Good results, minor errors possible  
 - **50-69%** - Moderate results, review recommended
 - **Below 50%** - Poor results, consider retry with different languages
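The bands above translate directly into a threshold check. The sketch assumes the confidence value is a percentage between 0 and 100. The band names and the functions `classify_confidence` and `should_retry` are illustrative and not part of Readur's API.

```python
def classify_confidence(confidence):
    """Map an OCR confidence percentage (0-100) to the bands listed above."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence >= 90:
        return "excellent"
    if confidence >= 70:
        return "good"
    if confidence >= 50:
        return "moderate"
    return "poor"


def should_retry(confidence):
    """Suggest a retry with different languages only for the lowest band."""
    return classify_confidence(confidence) == "poor"


print(classify_confidence(72.5), should_retry(42))  # good True
```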
 ### Language-Specific Performance
 Different languages have varying accuracy rates:
 - **English and Romance languages** (English, Spanish, French): Highest accuracy
 - **Other Germanic languages** (German, Dutch): Very good accuracy
 - **Asian languages** (Chinese, Japanese): Good accuracy with proper font recognition
 - **Arabic/Hebrew scripts**: Moderate accuracy, depends on text quality
 ## 🐛 Troubleshooting
 ### Common Issues
 **Problem:** "Language not available" error
 **Solutions:**
 - Check language code spelling (e.g., `eng` not `english`)
 - Verify language is installed on the server
 - Contact administrator if language should be available
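The first two checks can be automated before a request is sent. This sketch compares requested codes against a list of installed codes (for example, one obtained from `GET /api/ocr/languages`) and suggests a likely correction for inputs such as `english`. The helper name `check_requested` and the prefix-based suggestion rule are assumptions made for illustration.

```python
def check_requested(requested, installed):
    """Split requested codes into usable ones and ones the server lacks.

    For each missing code, suggest installed codes that it starts with,
    which catches inputs such as "english" instead of "eng".
    """
    usable, missing = [], {}
    for code in requested:
        normalized = code.strip().lower()
        if normalized in installed:
            usable.append(normalized)
        else:
            missing[code] = [c for c in installed if normalized.startswith(c)]
    return usable, missing


usable, missing = check_requested(["eng", "english", "deu"], ["eng", "spa"])
print(usable)   # ['eng']
print(missing)  # {'english': ['eng'], 'deu': []}
```

An empty suggestion list, as for `deu` here, points to a language that is not installed rather than a spelling mistake.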
 **Problem:** Poor OCR results despite correct language
 **Solutions:**
 - Ensure document scan quality is sufficient (300+ DPI recommended)
 - Try adding English as a fallback language
 - Consider document preprocessing (contrast, rotation correction)
 - Retry with fewer languages for better performance
 **Problem:** Slow processing with multiple languages  
 **Solutions:**
 - Reduce number of selected languages to 2-3
 - Use only languages that are present in your document
 - Consider processing during off-peak hours
 ### Getting Help
 If you're experiencing issues:
 1. **Check the OCR health endpoint** - `GET /api/ocr/health`
 2. **Review your language selection** - ensure languages match document content
 3. **Try with English fallback** - adds reliability to processing
 4. **Contact support** with document ID and language combination used
 ## 🔮 Advanced Features
 ### Planned Enhancements
 - **Auto-language detection**: Automatic suggestion of optimal language combinations
 - **Custom language models**: Upload your own specialized language data
 - **Batch language updates**: Change languages for multiple documents at once
 - **Language-specific confidence thresholds**: Fine-tune accuracy requirements per language
 ### Integration Options
 The multi-language OCR system integrates with:
 - **Document management workflows**
 - **Automated processing pipelines**  
 - **Third-party applications via REST API**
 - **Webhook notifications for completion**
 ## 📚 Additional Resources
 - **API Documentation**: Complete endpoint reference
 - **Language Codes Reference**: Full list of supported language codes
 - **Performance Guidelines**: Optimization recommendations
 - **Migration Guide**: Upgrading from single-language setup
 ---
 **Need Help?** Contact support or check the system health dashboard for real-time OCR capability status.