Readur

Commit Graph

Author	SHA1	Message	Date
perf3ct	65c49ef4f2	feat(ocr): implement new dev stack and allow for more numbers in ocr documents	2025-10-28 14:34:34 -07:00
perf3ct	d5963585fd	feat(ocr): soften the requirements around OCR, and update the UI to better handle issues in word count	2025-10-18 14:31:10 -07:00
perf3ct	e7574cb0da	feat(ui): handle strange responses that the UI could receive	2025-10-05 13:45:10 -07:00
perf3ct	7863b9100f	feat(ocr): no longer add explicit section / page break	2025-09-05 00:06:09 +00:00
perf3ct	483d89132f	feat(office): add documentation around using antiword/catdoc for `doc` functionality	2025-09-02 20:29:17 +00:00
perf3ct	149c3b9a3f	feat(office): yeet unused fallback strategy	2025-09-02 03:47:20 +00:00
perf3ct	d5d6d2edb4	feat(office): xml extraction seems to work now	2025-09-02 01:22:19 +00:00
perf3ct	774efd1140	refactor(server): remove XML vs library comparison functionality Remove all comparison-related code used to evaluate XML vs library-based Office document extraction. The XML approach has proven superior, so the comparison functionality is no longer needed. Changes: - Remove extraction_comparator.rs (entire comparison engine) - Remove test_extraction_comparison.rs binary - Remove comparison mode logic from enhanced.rs - Simplify fallback_strategy.rs to use XML extraction only - Update OCR service to use XML extraction as primary method - Clean up database migration to remove comparison-specific settings - Remove test_extraction binary from Cargo.toml - Update integration tests to work with simplified extraction The Office document extraction now flows directly to XML-based extraction without any comparison checks, maintaining the superior extraction quality while removing unnecessary complexity.	2025-09-02 01:22:19 +00:00
perf3ct	73525eca02	feat(office): add library-based and xml-based parsing	2025-09-02 00:25:06 +00:00
perf3ct	57a5d2ab15	feat(office): add xml parsing	2025-09-01 22:32:42 +00:00
perf3ct	b8bf7c9585	feat(office): use catdoc and antiword to convert doc	2025-09-01 21:49:30 +00:00
perf3ct	78af7e7861	feat(office): use actual packages for extraction	2025-09-01 21:21:22 +00:00
perf3ct	546b41b462	feat(office): try to resolve docx/doc not working	2025-09-01 19:58:06 +00:00
perf3ct	67ae68745c	fix(dev): remove unneeded docs	2025-08-13 20:51:13 +00:00
perf3ct	862c36aa72	feat(storage): further support the s3 storage backend	2025-08-01 17:57:09 +00:00
perf3ct	abd55ef419	feat(storage): abstract storage to also support s3, along with local filesystem still	2025-08-01 04:33:08 +00:00
perf3ct	65f42c2cd7	fix(ocr): use proper failure reasons to avoid constraint violations in failed_documents table	2025-07-21 20:43:37 +00:00
perf3ct	45ec99a031	feat(ocr): get rid of managing TESSDATA_PREFIX	2025-07-20 02:23:06 +00:00
perf3ct	ccc3bc2ce4	feat(ocr): use ocrmypdf and pdftotext to get OCR layer if it already exists	2025-07-15 15:59:29 +00:00
perf3ct	a3f33140ee	feat(dev): drop pdf_extract in favor of ocrmypdf	2025-07-15 14:50:17 +00:00
perf3ct	862eb3217a	fix(tests): resolve issues in integration tests for the new multiple ocr languages	2025-07-14 21:28:55 +00:00
perf3ct	7317fd5ebb	Merge branch 'feat/multiple-ocr-languages' of https://github.com/readur/readur into feat/multiple-ocr-languages	2025-07-14 19:33:51 +00:00
perf3ct	849c9f91c7	feat(lang): update backend to support multiple languages at the same time during OCR	2025-07-14 19:33:43 +00:00
Jon Fuller	f0e39d155e	Merge branch 'main' into feat/multiple-ocr-languages	2025-07-14 11:29:46 -07:00
perf3ct	6165148e4d	feat(ocr): gracefully handle problematic PDFs in all the ways, create tests so that it doesn't happen again	2025-07-14 16:36:32 +00:00
perf3ct	e6fd8424d2	fix(dev): merge main into feature	2025-07-13 17:15:59 +00:00
perf3ct	b31e1a672d	feat(server): gracefully manage requeue requests for the same document	2025-07-11 21:27:12 +00:00
perf3ct	f2a050458b	fix(stats): create new get_queue_statistics function to avoid conflicts	2025-07-09 00:27:43 +00:00
perf3ct	a6f2b6df09	fix(stats): try to fix the stats extraction, again	2025-07-08 21:18:21 +00:00
perf3ct	e628b0d4d5	fix(server): resolve incorrect document failure titles	2025-07-08 20:24:52 +00:00
perf3ct	a7e9f75eab	fix(stats): try to fix stats export, again	2025-07-08 20:03:55 +00:00
perf3ct	03555ed756	fix(tests): fix the crazy metrics collection issue	2025-07-08 16:52:23 +00:00
perf3ct	58b8a71404	fix(tests): and resolve missing endpoint	2025-07-08 04:37:33 +00:00
perf3ct	a4b9626616	fix(web_upload): resolve issue that caused files that were uploaded via the web, to not be added to the queue	2025-07-07 19:28:08 +00:00
perf3ct	497b34ce0a	fix(server): resolve type issues and functions for compilation issues	2025-07-04 00:53:32 +00:00
perf3ct	44aaaca5c5	feat(ocr): add even more about the multiple ocr languages	2025-07-03 19:20:19 +00:00
perf3ct	6bdd6f4a56	feat(server): implement DEBUG environment variable	2025-07-02 17:57:57 +00:00
Jon Fuller	2e1a05fc8d	Merge branch 'main' into feat/multiple-ocr-languages	2025-07-01 11:53:42 -07:00
perf3ct	df281f3b26	feat(pdf): implement ocrmypdf to extract text from PDFs	2025-07-01 00:56:48 +00:00
perf3ct	0052032772	fix(pdf): resolve PDF wordcount error	2025-07-01 00:10:49 +00:00
perf3ct	830f9d0b38	feat(server): mark documents with 0 words as failed, and fix webdav unit tests	2025-06-30 22:43:25 +00:00
perf3ct	fef28a33c6	feat(server): continue to try to wrangle the failed and ignored documents	2025-06-29 23:27:51 +00:00
perf3ct	87cfab9ff8	fix(tests): resolve compilation error in the multiple OCR functionality	2025-06-29 23:21:42 +00:00
perf3ct	197afc19f4	feat(tests): implement and update tests for multiple OCR languages	2025-06-29 23:03:37 +00:00
perf3ct	6b6890d529	feat(server/client): support multiple OCR languages	2025-06-29 22:51:06 +00:00
perf3ct	84577806ef	feat(server/client): add failed_documents table to handle failures, and move logic of failures	2025-06-28 20:52:58 +00:00
perfectra1n	582617ab88	fix(server/client): fix incorrect OCR measurements	2025-06-27 20:23:59 -07:00
perf3ct	9a8bf72ff7	feat(server): reorganize components into their own modules and fix imports	2025-06-27 18:27:42 +00:00

48 Commits