From 2e6c1ef2381ace30af72d235e7e6ba841faad1c3 Mon Sep 17 00:00:00 2001
From: perf3ct
Date: Mon, 21 Jul 2025 23:34:57 +0000
Subject: [PATCH] feat(docs): add docs about multiple ocr languages

---
 README.md                        |   2 +
 docs/multi-language-ocr-guide.md | 247 +++++++++++++++++++++++++++++++
 2 files changed, 249 insertions(+)
 create mode 100644 docs/multi-language-ocr-guide.md

diff --git a/README.md b/README.md
index a9c5e73..d01cb28 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,7 @@ open http://localhost:8000
 - [👥 User Management](docs/user-management-guide.md) - Authentication, roles, and administration
 - [🏷️ Labels & Organization](docs/labels-and-organization.md) - Document tagging and categorization
 - [🔎 Advanced Search](docs/advanced-search.md) - Search modes, syntax, and optimization
+- [🌍 Multi-Language OCR Guide](docs/multi-language-ocr-guide.md) - Process documents in multiple languages simultaneously
 - [🔐 OIDC Setup](docs/oidc-setup.md) - Single Sign-On integration
 
 ### Deployment & Operations
@@ -69,6 +70,7 @@ open http://localhost:8000
 - [🔍 OCR Optimization](docs/dev/OCR_OPTIMIZATION_GUIDE.md) - Improve OCR performance
 - [🗄️ Database Best Practices](docs/dev/DATABASE_GUARDRAILS.md) - Concurrency and safety
 - [📊 Queue Architecture](docs/dev/QUEUE_IMPROVEMENTS.md) - Background job processing
+- [⚠️ Error System Guide](docs/dev/ERROR_SYSTEM.md) - Comprehensive error handling architecture
 
 ## 🏗️ Architecture
 
diff --git a/docs/multi-language-ocr-guide.md b/docs/multi-language-ocr-guide.md
new file mode 100644
index 0000000..b76f4cc
--- /dev/null
+++ b/docs/multi-language-ocr-guide.md
@@ -0,0 +1,247 @@
+# Multi-Language OCR Guide
+
+Readur supports powerful multi-language OCR capabilities that allow you to process documents in multiple languages simultaneously for optimal text extraction accuracy.
+
+## 🌍 Overview
+
+The multi-language OCR system allows you to:
+- **Process documents in up to 4 languages simultaneously** for best results
+- **Set preferred languages** that apply to all your document uploads
+- **Retry failed OCR** with different language combinations
+- **Automatically optimize** text extraction by using multiple language models
+
+## 🚀 Getting Started
+
+### Setting Your Language Preferences
+
+1. **Navigate to Settings** in your account
+2. **Select the OCR Languages** section
+3. **Choose up to 4 preferred languages** - these will be used for all new uploads
+4. **Set a primary language** - this language gets processing priority
+5. **Save your preferences**
+
+**Example preferred language setup:**
+- Primary: English (`eng`)
+- Additional: Spanish (`spa`), French (`fra`)
+- Result: Documents processed with English priority, plus Spanish and French recognition
+
+### Language Selection During Upload
+
+When uploading documents, you can:
+
+1. **Use your default preferences** - no action needed
+2. **Override for a specific upload:**
+   - Click the language selector in the upload area
+   - Choose different languages for this upload session
+   - These languages will be applied to all files in the current upload
+
+## 📋 Available Languages
+
+Readur supports 67+ languages, including:
+
+### Major World Languages
+- **English** (`eng`) - Default and most reliable
+- **Spanish** (`spa`) - Excellent accuracy
+- **French** (`fra`) - High quality results
+- **German** (`deu`) - Strong performance
+- **Italian** (`ita`) - Good accuracy
+- **Portuguese** (`por`) - Reliable processing
+- **Russian** (`rus`) - Solid results
+
+### Asian Languages
+- **Chinese Simplified** (`chi_sim`)
+- **Chinese Traditional** (`chi_tra`)
+- **Japanese** (`jpn`)
+- **Korean** (`kor`)
+- **Hindi** (`hin`)
+- **Thai** (`tha`)
+- **Vietnamese** (`vie`)
+
+### European Languages
+- **Dutch** (`nld`)
+- **Swedish** (`swe`)
+- **Norwegian** (`nor`)
+- **Danish** (`dan`)
+- **Finnish** (`fin`)
+- **Polish** (`pol`)
+- **Czech** (`ces`)
+
+### And Many More
+Including Arabic (`ara`), Hebrew (`heb`), Turkish (`tur`), and dozens of other languages.
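The limits described under Getting Started (at most 4 languages, one of them primary) can be checked locally before anything is sent to the server. The following is a minimal sketch and not part of Readur: the function name `build_language_payload` is hypothetical, and treating the first argument as the primary language is an assumption. The output mirrors the `preferred_languages` / `primary_language` fields used in this guide's settings example.

```bash
# Sketch only: validate a language selection and print a JSON payload.
# Assumptions: the helper name and "first argument is the primary language"
# are illustrative and do not come from Readur itself.
build_language_payload() {
  # The guide allows between 1 and 4 languages per selection.
  if [ "$#" -lt 1 ] || [ "$#" -gt 4 ]; then
    echo "error: choose between 1 and 4 languages" >&2
    return 1
  fi

  local primary="$1" list="" code
  for code in "$@"; do
    # Reject empty codes and anything other than lowercase ASCII letters
    # or underscores (shapes like eng or chi_sim). This also keeps the
    # hand-built JSON below free of characters that would need escaping.
    case "$code" in
      ""|*[!abcdefghijklmnopqrstuvwxyz_]*)
        echo "error: invalid language code: $code" >&2
        return 1
        ;;
    esac
    # Append the quoted code, comma-separated after the first entry.
    list="${list:+$list, }\"$code\""
  done

  printf '{"preferred_languages": [%s], "primary_language": "%s"}\n' "$list" "$primary"
}
```

For example, `build_language_payload eng spa fra` prints `{"preferred_languages": ["eng", "spa", "fra"], "primary_language": "eng"}`. The check only covers the shape of the selection; whether a code is actually installed is still decided by the server.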
+
+> **Tip:** For the complete list of available languages, visit the OCR Languages page in your settings or call the API endpoint: `GET /api/ocr/languages`
+
+## 🛠️ Using the API
+
+### Get Available Languages
+```bash
+curl -H "Authorization: Bearer YOUR_TOKEN" \
+  https://your-readur-instance.com/api/ocr/languages
+```
+
+**Response:**
+```json
+{
+  "available_languages": [
+    {
+      "code": "eng",
+      "name": "English",
+      "installed": true
+    },
+    {
+      "code": "spa",
+      "name": "Spanish",
+      "installed": true
+    }
+  ],
+  "current_user_language": "eng"
+}
+```
+
+### Update Language Preferences
+```bash
+curl -X PUT \
+  -H "Authorization: Bearer YOUR_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "preferred_languages": ["eng", "spa", "fra"],
+    "primary_language": "eng"
+  }' \
+  https://your-readur-instance.com/api/settings
+```
+
+### Retry OCR with Different Languages
+```bash
+curl -X POST \
+  -H "Authorization: Bearer YOUR_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "languages": ["eng", "deu"]
+  }' \
+  https://your-readur-instance.com/api/documents/DOCUMENT_ID/ocr/retry
+```
+
+## 🎯 Best Practices
+
+### Language Selection Strategy
+
+**For Mixed-Language Documents:**
+- Choose 2-3 languages that appear in your document
+- Always include English as a fallback (most reliable)
+- Put the dominant language first as your primary language
+
+**Examples:**
+- **Business document with English/Spanish:** `["eng", "spa"]`
+- **European legal document:** `["eng", "fra", "deu"]`
+- **Academic paper with multiple references:** `["eng", "spa", "ita"]`
+
+### Performance Optimization
+
+**Do:**
+- ✅ Limit to 2-4 languages for best performance
+- ✅ Include English when processing mixed content
+- ✅ Use specific language combinations for consistent document types
+- ✅ Set realistic expectations for complex multilingual documents
+
+**Don't:**
+- ❌ Select languages not present in your documents
+- ❌ Use more than 4 languages simultaneously
+- ❌ Expect perfect results with very low-quality scans
+- ❌ Mix completely unrelated language families unnecessarily
+
+## 🔄 Retrying OCR Processing
+
+If OCR results are poor, you can retry with different languages:
+
+### Via Web Interface
+1. **Navigate to the document** with poor OCR results
+2. **Click the "Retry OCR"** button
+3. **Select different languages** that better match your document
+4. **Start the retry process**
+
+### Common Retry Scenarios
+
+**Scenario 1: Wrong Language Selected**
+- Original: English-only processing of a Spanish document
+- Solution: Retry with `["spa", "eng"]`
+
+**Scenario 2: Mixed Language Document**
+- Original: Single language processing
+- Solution: Add 2-3 relevant languages
+
+**Scenario 3: Poor Quality Scan**
+- Original: Fast processing with limited languages
+- Solution: Try with primary language + English fallback
+
+## 📊 Monitoring OCR Results
+
+### Understanding OCR Confidence
+- **90%+** - Excellent results, high accuracy
+- **70-89%** - Good results, minor errors possible
+- **50-69%** - Moderate results, review recommended
+- **Below 50%** - Poor results, consider retry with different languages
+
+### Language-Specific Performance
+Different languages have varying accuracy rates:
+- **Latin-based scripts** (English, Spanish, French): Highest accuracy
+- **Germanic languages** (German, Dutch): Very good accuracy
+- **Asian languages** (Chinese, Japanese): Good accuracy with proper font recognition
+- **Arabic/Hebrew scripts**: Moderate accuracy, depends on text quality
+
+## 🐛 Troubleshooting
+
+### Common Issues
+
+**Problem:** "Language not available" error
+**Solution:**
+- Check language code spelling (e.g., `eng` not `english`)
+- Verify language is installed on the server
+- Contact administrator if language should be available
+
+**Problem:** Poor OCR results despite correct language
+**Solutions:**
+- Ensure document scan quality is sufficient (300+ DPI recommended)
+- Try adding English as a fallback language
+- Consider document preprocessing (contrast, rotation correction)
+- Retry with fewer languages for better performance
+
+**Problem:** Slow processing with multiple languages
+**Solutions:**
+- Reduce number of selected languages to 2-3
+- Use only languages present in your document
+- Consider processing during off-peak hours
+
+### Getting Help
+
+If you're experiencing issues:
+
+1. **Check the OCR Health page** - `GET /api/ocr/health`
+2. **Review your language selection** - ensure languages match document content
+3. **Try with English fallback** - adds reliability to processing
+4. **Contact support** with document ID and language combination used
+
+## 🔮 Advanced Features
+
+### Planned Enhancements
+- **Auto-language detection**: Automatic suggestion of optimal language combinations
+- **Custom language models**: Upload your own specialized language data
+- **Batch language updates**: Change languages for multiple documents at once
+- **Language-specific confidence thresholds**: Fine-tune accuracy requirements per language
+
+### Integration Options
+The multi-language OCR system integrates with:
+- **Document management workflows**
+- **Automated processing pipelines**
+- **Third-party applications via REST API**
+- **Webhook notifications for completion**
+
+## 📚 Additional Resources
+
+- **API Documentation**: Complete endpoint reference
+- **Language Codes Reference**: Full list of supported language codes
+- **Performance Guidelines**: Optimization recommendations
+- **Migration Guide**: Upgrading from single-language setup
+
+---
+
+**Need Help?** Contact support or check the system health dashboard for real-time OCR capability status.
\ No newline at end of file