Readur

Commit Graph

Author	SHA1	Message	Date
perf3ct	65c49ef4f2	feat(ocr): implement new dev stack and allow for more numbers in ocr documents	2025-10-28 14:34:34 -07:00
perf3ct	d5963585fd	feat(ocr): soften the requirements around OCR, and update the UI to better handle issues in word count	2025-10-18 14:31:10 -07:00
perf3ct	e7574cb0da	feat(ui): handle strange responses that the UI could receive	2025-10-05 13:45:10 -07:00
perf3ct	d5d6d2edb4	feat(office): xml extraction seems to work now	2025-09-02 01:22:19 +00:00
perf3ct	774efd1140	refactor(server): remove XML vs library comparison functionality Remove all comparison-related code used to evaluate XML vs library-based Office document extraction. The XML approach has proven superior, so the comparison functionality is no longer needed. Changes: - Remove extraction_comparator.rs (entire comparison engine) - Remove test_extraction_comparison.rs binary - Remove comparison mode logic from enhanced.rs - Simplify fallback_strategy.rs to use XML extraction only - Update OCR service to use XML extraction as primary method - Clean up database migration to remove comparison-specific settings - Remove test_extraction binary from Cargo.toml - Update integration tests to work with simplified extraction The Office document extraction now flows directly to XML-based extraction without any comparison checks, maintaining the superior extraction quality while removing unnecessary complexity.	2025-09-02 01:22:19 +00:00
perf3ct	73525eca02	feat(office): add library-based and xml-based parsing	2025-09-02 00:25:06 +00:00
perf3ct	57a5d2ab15	feat(office): add xml parsing	2025-09-01 22:32:42 +00:00
perf3ct	b8bf7c9585	feat(office): use catdoc and antiword to convert doc	2025-09-01 21:49:30 +00:00
perf3ct	78af7e7861	feat(office): use actual packages for extraction	2025-09-01 21:21:22 +00:00
perf3ct	546b41b462	feat(office): try to resolve docx/doc not working	2025-09-01 19:58:06 +00:00
perf3ct	67ae68745c	fix(dev): remove unneeded docs	2025-08-13 20:51:13 +00:00
perf3ct	862c36aa72	feat(storage): further support the s3 storage backend	2025-08-01 17:57:09 +00:00
perf3ct	abd55ef419	feat(storage): abstract storage to also support s3, along with local filesystem still	2025-08-01 04:33:08 +00:00
perf3ct	ccc3bc2ce4	feat(ocr): use ocrmypdf and pdftotext to get OCR layer if it already exists	2025-07-15 15:59:29 +00:00
perf3ct	a3f33140ee	feat(dev): drop pdf_extract in favor of ocrmypdf	2025-07-15 14:50:17 +00:00
perf3ct	7317fd5ebb	Merge branch 'feat/multiple-ocr-languages' of https://github.com/readur/readur into feat/multiple-ocr-languages	2025-07-14 19:33:51 +00:00
perf3ct	849c9f91c7	feat(lang): update backend to support multiple languages at the same time during OCR	2025-07-14 19:33:43 +00:00
perf3ct	6165148e4d	feat(ocr): gracefully handle problematic PDFs in all the ways, create tests so that it doesn't happen again	2025-07-14 16:36:32 +00:00
perf3ct	df281f3b26	feat(pdf): implement ocrmypdf to extract text from PDFs	2025-07-01 00:56:48 +00:00
perf3ct	0052032772	fix(pdf): resolve PDF wordcount error	2025-07-01 00:10:49 +00:00
perfectra1n	582617ab88	fix(server/client): fix incorrect OCR measurements	2025-06-27 20:23:59 -07:00
perf3ct	9a8bf72ff7	feat(server): reorganize components into their own modules and fix imports	2025-06-27 18:27:42 +00:00

22 Commits