Readur/frontend/test_data/multilingual/README.md

99 lines
3.5 KiB
Markdown

# Multilingual OCR Test Files
This directory contains test files for validating the multiple OCR language capabilities of Readur.
## Test Files
### Spanish Test Files
- **`spanish_test.pdf`** - Basic Spanish document with common words, accents, and phrases
- **`spanish_complex.pdf`** - Complex Spanish document with special characters (ñ, ü, ¿, ¡)
### English Test Files
- **`english_test.pdf`** - Basic English document with common words and technical terms
- **`english_complex.pdf`** - Complex English document with contractions, hyphens, and abbreviations
### Mixed Language Test Files
- **`mixed_language_test.pdf`** - Document containing both Spanish and English text sections
## Expected OCR Content
### Spanish Content Keywords
- español, documento, reconocimiento
- café, niño, comunicación, corazón
- también, habitación, compañía
- informática, educación, investigación
### English Content Keywords
- English, document, recognition
- technology, computer, software, hardware
- testing, validation, verification, quality
### Mixed Content
Both Spanish and English keywords should be recognized in the mixed language document.
## Usage in E2E Tests
These files are used by the `ocr-multiple-languages.spec.ts` test suite to validate:
1. **Language Selection**: Testing the OCR language selector component
2. **Document Upload**: Uploading documents with specific language preferences
3. **OCR Processing**: Validating OCR results contain expected language-specific content
4. **Language Persistence**: Ensuring language preferences are saved across sessions
5. **Retry Functionality**: Testing OCR retry with different languages
6. **Error Handling**: Testing graceful fallback behavior
## Test Languages
- **Spanish (spa)**: Primary test language with accents and special characters
- **English (eng)**: Secondary test language with technical terminology
- **Auto-detect**: Testing automatic language detection
## File Creation
These files were created using the `create_multilingual_test_pdfs.py` script in the repository root.
To regenerate the test files:
```bash
python3 create_multilingual_test_pdfs.py
```
## OCR Language Testing Workflow
1. Set language preference in Settings page
2. Upload test document with specific language content
3. Wait for OCR processing to complete
4. Validate OCR results contain expected keywords
5. Test retry functionality with different languages
6. Verify bulk operations work with multiple languages
## Expected Test Results
When OCR is configured correctly for Spanish (`spa`):
- Spanish documents should have high recognition accuracy for accented characters
- Phrases like "Hola mundo", "este es un documento", "en español" should be recognized
- Special characters (ñ, ü, ¿, ¡) should be preserved
When OCR is configured correctly for English (`eng`):
- English documents should have high recognition accuracy
- Technical terms and abbreviations should be recognized
- Phrases like "Hello world", "this is an English", "document" should be recognized
## Troubleshooting
If tests fail:
1. **Check Tesseract Installation**: Ensure Spanish language pack is installed
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr-spa
# macOS
brew install tesseract-lang
```
2. **Verify Language Availability**: Check `/api/ocr/languages` endpoint returns Spanish and English
3. **File Paths**: Ensure test files exist in the correct directory structure
4. **OCR Processing Time**: Allow sufficient timeout (120s) for OCR processing to complete