Commit Graph

3 Commits

Author SHA1 Message Date
perf3ct 036941b3dc
feat(dev): trigger re-ocr on doc and docx 2025-09-02 23:04:31 +00:00
perf3ct 483d89132f
feat(office): add documentation around using antiword/catdoc for `doc` functionality 2025-09-02 20:29:17 +00:00
perf3ct 774efd1140
refactor(server): remove XML vs library comparison functionality
Remove all comparison-related code used to evaluate XML vs library-based
Office document extraction. The XML approach has proven superior, so the
comparison functionality is no longer needed.

Changes:
- Remove extraction_comparator.rs (entire comparison engine)
- Remove test_extraction_comparison.rs binary
- Remove comparison mode logic from enhanced.rs
- Simplify fallback_strategy.rs to use XML extraction only
- Update OCR service to use XML extraction as primary method
- Clean up database migration to remove comparison-specific settings
- Remove test_extraction binary from Cargo.toml
- Update integration tests to work with simplified extraction

The Office document extraction now flows directly to XML-based extraction
without any comparison checks, maintaining the superior extraction quality
while removing unnecessary complexity.
2025-09-02 01:22:19 +00:00