perf3ct
65c49ef4f2
feat(ocr): implement new dev stack and allow for more numbers in ocr documents
2025-10-28 14:34:34 -07:00
perf3ct
d5963585fd
feat(ocr): soften the requirements around OCR, and update the UI to better handle issues in word count
2025-10-18 14:31:10 -07:00
perf3ct
e7574cb0da
feat(ui): handle strange responses that the UI could receive
2025-10-05 13:45:10 -07:00
perf3ct
7863b9100f
feat(ocr): no longer add explicit section / page break
2025-09-05 00:06:09 +00:00
perf3ct
483d89132f
feat(office): add documentation around using antiword/catdoc for `doc` functionality
2025-09-02 20:29:17 +00:00
perf3ct
149c3b9a3f
feat(office): yeet unused fallback strategy
2025-09-02 03:47:20 +00:00
perf3ct
d5d6d2edb4
feat(office): xml extraction seems to work now
2025-09-02 01:22:19 +00:00
perf3ct
774efd1140
refactor(server): remove XML vs library comparison functionality
Remove all comparison-related code used to evaluate XML vs library-based
Office document extraction. The XML approach has proven superior, so the
comparison functionality is no longer needed.
Changes:
- Remove extraction_comparator.rs (entire comparison engine)
- Remove test_extraction_comparison.rs binary
- Remove comparison mode logic from enhanced.rs
- Simplify fallback_strategy.rs to use XML extraction only
- Update OCR service to use XML extraction as primary method
- Clean up database migration to remove comparison-specific settings
- Remove test_extraction binary from Cargo.toml
- Update integration tests to work with simplified extraction
The Office document extraction now flows directly to XML-based extraction
without any comparison checks, maintaining the superior extraction quality
while removing unnecessary complexity.
2025-09-02 01:22:19 +00:00
perf3ct
73525eca02
feat(office): add library-based and xml-based parsing
2025-09-02 00:25:06 +00:00
perf3ct
57a5d2ab15
feat(office): add xml parsing
2025-09-01 22:32:42 +00:00
perf3ct
b8bf7c9585
feat(office): use catdoc and antiword to convert doc
2025-09-01 21:49:30 +00:00
perf3ct
78af7e7861
feat(office): use actual packages for extraction
2025-09-01 21:21:22 +00:00
perf3ct
546b41b462
feat(office): try to resolve docx/doc not working
2025-09-01 19:58:06 +00:00
perf3ct
67ae68745c
fix(dev): remove unneeded docs
2025-08-13 20:51:13 +00:00
perf3ct
862c36aa72
feat(storage): further support the s3 storage backend
2025-08-01 17:57:09 +00:00
perf3ct
abd55ef419
feat(storage): abstract storage to also support s3, along with local filesystem still
2025-08-01 04:33:08 +00:00
perf3ct
65f42c2cd7
fix(ocr): use proper failure reasons to avoid constraint violations in failed_documents table
2025-07-21 20:43:37 +00:00
perf3ct
45ec99a031
feat(ocr): get rid of managing TESSDATA_PREFIX
2025-07-20 02:23:06 +00:00
perf3ct
ccc3bc2ce4
feat(ocr): use ocrmypdf and pdftotext to get OCR layer if it already exists
2025-07-15 15:59:29 +00:00
perf3ct
a3f33140ee
feat(dev): drop pdf_extract in favor of ocrmypdf
2025-07-15 14:50:17 +00:00
perf3ct
862eb3217a
fix(tests): resolve issues in integration tests for the new multiple ocr languages
2025-07-14 21:28:55 +00:00
perf3ct
7317fd5ebb
Merge branch 'feat/multiple-ocr-languages' of https://github.com/readur/readur into feat/multiple-ocr-languages
2025-07-14 19:33:51 +00:00
perf3ct
849c9f91c7
feat(lang): update backend to support multiple languages at the same time during OCR
2025-07-14 19:33:43 +00:00
Jon Fuller
f0e39d155e
Merge branch 'main' into feat/multiple-ocr-languages
2025-07-14 11:29:46 -07:00
perf3ct
6165148e4d
feat(ocr): gracefully handle problematic PDFs in all the ways, create tests so that it doesn't happen again
2025-07-14 16:36:32 +00:00
perf3ct
e6fd8424d2
fix(dev): merge main into feature
2025-07-13 17:15:59 +00:00
perf3ct
b31e1a672d
feat(server): gracefully manage requeue requests for the same document
2025-07-11 21:27:12 +00:00
perf3ct
f2a050458b
fix(stats): create new get_queue_statistics function to avoid conflicts
2025-07-09 00:27:43 +00:00
perf3ct
a6f2b6df09
fix(stats): try to fix the stats extraction, again
2025-07-08 21:18:21 +00:00
perf3ct
e628b0d4d5
fix(server): resolve incorrect document failure titles
2025-07-08 20:24:52 +00:00
perf3ct
a7e9f75eab
fix(stats): try to fix stats export, again
2025-07-08 20:03:55 +00:00
perf3ct
03555ed756
fix(tests): fix the crazy metrics collection issue
2025-07-08 16:52:23 +00:00
perf3ct
58b8a71404
fix(tests): and resolve missing endpoint
2025-07-08 04:37:33 +00:00
perf3ct
a4b9626616
fix(web_upload): resolve issue that caused files that were uploaded via the web, to not be added to the queue
2025-07-07 19:28:08 +00:00
perf3ct
497b34ce0a
fix(server): resolve type issues and functions for compilation issues
2025-07-04 00:53:32 +00:00
perf3ct
44aaaca5c5
feat(ocr): add even more about the multiple ocr languages
2025-07-03 19:20:19 +00:00
perf3ct
6bdd6f4a56
feat(server): implement DEBUG environment variable
2025-07-02 17:57:57 +00:00
Jon Fuller
2e1a05fc8d
Merge branch 'main' into feat/multiple-ocr-languages
2025-07-01 11:53:42 -07:00
perf3ct
df281f3b26
feat(pdf): implement ocrmypdf to extract text from PDFs
2025-07-01 00:56:48 +00:00
perf3ct
0052032772
fix(pdf): resolve PDF wordcount error
2025-07-01 00:10:49 +00:00
perf3ct
830f9d0b38
feat(server): mark documents with 0 words as failed, and fix webdav unit tests
2025-06-30 22:43:25 +00:00
perf3ct
fef28a33c6
feat(server): continue to try to wrangle the failed and ignored documents
2025-06-29 23:27:51 +00:00
perf3ct
87cfab9ff8
fix(tests): resolve compilation error in the multiple OCR functionality
2025-06-29 23:21:42 +00:00
perf3ct
197afc19f4
feat(tests): implement and update tests for multiple OCR languages
2025-06-29 23:03:37 +00:00
perf3ct
6b6890d529
feat(server/client): support multiple OCR languages
2025-06-29 22:51:06 +00:00
perf3ct
84577806ef
feat(server/client): add failed_documents table to handle failures, and move logic of failures
2025-06-28 20:52:58 +00:00
perfectra1n
582617ab88
fix(server/client): fix incorrect OCR measurements
2025-06-27 20:23:59 -07:00
perf3ct
9a8bf72ff7
feat(server): reorganize components into their own modules and fix imports
2025-06-27 18:27:42 +00:00