Commit Graph

22 Commits

Author SHA1 Message Date
perf3ct 65c49ef4f2
feat(ocr): implement new dev stack and allow for more numbers in ocr documents 2025-10-28 14:34:34 -07:00
perf3ct d5963585fd
feat(ocr): soften the requirements around OCR, and update the UI to better handle issues in word count 2025-10-18 14:31:10 -07:00
perf3ct e7574cb0da
feat(ui): handle strange responses that the UI could receive 2025-10-05 13:45:10 -07:00
perf3ct d5d6d2edb4
feat(office): xml extraction seems to work now 2025-09-02 01:22:19 +00:00
perf3ct 774efd1140
refactor(server): remove XML vs library comparison functionality
Remove all comparison-related code used to evaluate XML vs library-based
Office document extraction. The XML approach has proven superior, so the
comparison functionality is no longer needed.

Changes:
- Remove extraction_comparator.rs (entire comparison engine)
- Remove test_extraction_comparison.rs binary
- Remove comparison mode logic from enhanced.rs
- Simplify fallback_strategy.rs to use XML extraction only
- Update OCR service to use XML extraction as primary method
- Clean up database migration to remove comparison-specific settings
- Remove test_extraction binary from Cargo.toml
- Update integration tests to work with simplified extraction

The Office document extraction now flows directly to XML-based extraction
without any comparison checks, maintaining the superior extraction quality
while removing unnecessary complexity.
2025-09-02 01:22:19 +00:00
perf3ct 73525eca02
feat(office): add library-based and xml-based parsing 2025-09-02 00:25:06 +00:00
perf3ct 57a5d2ab15
feat(office): add xml parsing 2025-09-01 22:32:42 +00:00
perf3ct b8bf7c9585
feat(office): use catdoc and antiword to convert doc 2025-09-01 21:49:30 +00:00
perf3ct 78af7e7861
feat(office): use actual packages for extraction 2025-09-01 21:21:22 +00:00
perf3ct 546b41b462
feat(office): try to resolve docx/doc not working 2025-09-01 19:58:06 +00:00
perf3ct 67ae68745c
fix(dev): remove unneeded docs 2025-08-13 20:51:13 +00:00
perf3ct 862c36aa72
feat(storage): further support the s3 storage backend 2025-08-01 17:57:09 +00:00
perf3ct abd55ef419
feat(storage): abstract storage to also support s3, along with local filesystem still 2025-08-01 04:33:08 +00:00
perf3ct ccc3bc2ce4
feat(ocr): use ocrmypdf and pdftotext to get OCR layer if it already exists 2025-07-15 15:59:29 +00:00
perf3ct a3f33140ee
feat(dev): drop pdf_extract in favor of ocrmypdf 2025-07-15 14:50:17 +00:00
perf3ct 7317fd5ebb
Merge branch 'feat/multiple-ocr-languages' of https://github.com/readur/readur into feat/multiple-ocr-languages 2025-07-14 19:33:51 +00:00
perf3ct 849c9f91c7
feat(lang): update backend to support multiple languages at the same time during OCR 2025-07-14 19:33:43 +00:00
perf3ct 6165148e4d
feat(ocr): gracefully handle problematic PDFs in all the ways, create tests so that it doesn't happen again 2025-07-14 16:36:32 +00:00
perf3ct df281f3b26
feat(pdf): implement ocrmypdf to extract text from PDFs 2025-07-01 00:56:48 +00:00
perf3ct 0052032772
fix(pdf): resolve PDF wordcount error 2025-07-01 00:10:49 +00:00
perfectra1n 582617ab88
fix(server/client): fix incorrect OCR measurements 2025-06-27 20:23:59 -07:00
perf3ct 9a8bf72ff7
feat(server): reorganize components into their own modules and fix imports 2025-06-27 18:27:42 +00:00