Commit Graph

65 Commits

Author SHA1 Message Date
perf3ct 483d89132f
feat(office): add documentation around using antiword/catdoc for `doc` functionality 2025-09-02 20:29:17 +00:00
perf3ct 774efd1140
refactor(server): remove XML vs library comparison functionality
Remove all comparison-related code used to evaluate XML vs library-based
Office document extraction. The XML approach has proven superior, so the
comparison functionality is no longer needed.

Changes:
- Remove extraction_comparator.rs (entire comparison engine)
- Remove test_extraction_comparison.rs binary
- Remove comparison mode logic from enhanced.rs
- Simplify fallback_strategy.rs to use XML extraction only
- Update OCR service to use XML extraction as primary method
- Clean up database migration to remove comparison-specific settings
- Remove test_extraction binary from Cargo.toml
- Update integration tests to work with simplified extraction

The Office document extraction now flows directly to XML-based extraction
without any comparison checks, maintaining the superior extraction quality
while removing unnecessary complexity.
2025-09-02 01:22:19 +00:00
perf3ct f6eb7ba49f
feat(metrics): try to simplify webdav metrics some 2025-08-23 22:17:40 +00:00
perf3ct 1b4573f658
feat(webdav): resolve failing migration tests, and implement better error handling 2025-08-23 18:52:52 +00:00
perf3ct b7dd64c8f6
feat(webdav): try to do better webdav errors to not slam webdav endpoints 2025-08-20 21:59:14 +00:00
perf3ct d323aa53c3
fix(migrations): also resolve issue in the new generic source scan failure migration 2025-08-18 15:59:22 +00:00
perf3ct 6a64d9e6ed
feat(source): implement generic "SourceError" and then have it be propagated as "WebDAVerror", etc. 2025-08-17 22:05:58 +00:00
perf3ct 93c2863d01
feat(webdav): support capturing individual directory errors in webdav 2025-08-14 16:24:05 +00:00
perf3ct aff7b907c7 fix(db): backfill data for sources given missing counts 2025-07-29 21:27:54 +00:00
perf3ct 0d65cab4aa fix(migrations): resolve issue with latest migration and multi-language support 2025-07-18 19:30:51 +00:00
perf3ct 686596481c fix(migrations): resolve new broken migration for multiple ocr languages 2025-07-14 20:52:42 +00:00
perf3ct 849c9f91c7 feat(lang): update backend to support multiple languages at the same time during OCR 2025-07-14 19:33:43 +00:00
perf3ct 0465777890 feat(client): show more fields for Documents 2025-07-10 21:02:15 +00:00
perf3ct f2a050458b fix(stats): create new get_queue_statistics function to avoid conflicts 2025-07-09 00:27:43 +00:00
perf3ct f0dc0669bd debug(tests): add some debug lines to see why CI is upset 2025-07-08 22:32:32 +00:00
perf3ct 7fa95e5e17 fix(migrations): resolve PostgreSQL function type mismatch in get_ocr_queue_stats 2025-07-08 22:30:21 +00:00
perf3ct 8153f9a4cb fix(migrations): resolve PostgreSQL function type mismatch in get_ocr_queue_stats 2025-07-08 22:04:24 +00:00
perf3ct 2a59651fb9 fix(stats): try to fix stats export, again again 2025-07-08 20:16:33 +00:00
perf3ct 03555ed756 fix(tests): fix the crazy metrics collection issue 2025-07-08 16:52:23 +00:00
perf3ct 459b8622bb feat(webdav): also add some crazy source automatic validation 2025-07-03 05:26:36 +00:00
perf3ct 69c40c10fa feat(webdav): gracefully recover webdav from stops/crashes 2025-07-03 04:45:25 +00:00
perf3ct 6d40feadb3 fix(server): resolve issues with the retry ocr tests 2025-07-02 22:47:51 +00:00
perf3ct ab03b8d73d fix(server): resolve ocr test functionality failing due to db trigger 2025-07-02 22:38:13 +00:00
perf3ct ffad8c4561 feat(tests): fix ocr_retry issues in tests 2025-07-02 21:30:36 +00:00
perf3ct d4b57d2ae0 feat(server/client): implement retry functionality for both successful and failed documents 2025-07-02 00:06:47 +00:00
perf3ct 92b21350db feat(webdav): track directory etags
✅ Core Optimizations Implemented

  1. 📊 New Database Schema: Added webdav_directories table to track
directory ETags, file counts, and metadata
  2. 🔍 Smart Directory Checking: Before deep scans, check directory
ETags with lightweight Depth: 0 PROPFIND requests
  3. ⚡ Skip Unchanged Directories: If directory ETag matches, skip the
entire deep scan
  4. 🗂️ N-Depth Subdirectory Tracking: Recursively track all
subdirectories found during scans
  5. 🎯 Individual Subdirectory Checks: When parent unchanged, check
each known subdirectory individually

  🚀 Performance Benefits

  Before: Every sync = Full Depth: infinity scan of entire directory
tree

  After:
  - First sync: Full scan + directory tracking setup
  - Subsequent syncs: Quick ETag checks → skip unchanged directories
entirely
  - Changed directories: Only scan the specific changed subdirectories

  📁 How It Works

  1. Initial Request: PROPFIND Depth: 0 on /Documents → get directory
ETag
  2. Database Check: Compare with stored ETag for /Documents
  3. If Unchanged: Check each known subdirectory (/Documents/2024,
/Documents/Archive) individually
  4. If Changed: Full recursive scan + update all directory tracking
data
2025-07-01 21:22:16 +00:00
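
The decision flow described in the commit message above (92b21350db) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `ScanPlan`, `plan_scan`, and the `fetch_etag` callback are hypothetical names, the stored ETags are modeled as an in-memory map rather than the webdav_directories table, and `fetch_etag` stands in for a Depth: 0 PROPFIND request.

```rust
use std::collections::HashMap;

/// Hypothetical result type: what a sync should do for one tracked root.
#[derive(Debug, PartialEq)]
enum ScanPlan {
    /// Root is unknown or its ETag changed: do a full recursive scan.
    FullScan,
    /// Root is unchanged: rescan only these known subdirectories
    /// whose ETag differs from the stored one (may be empty).
    Targeted(Vec<String>),
}

/// Decide how much to scan. `stored` maps directory path -> last seen ETag
/// (an assumption standing in for the database table). `fetch_etag`
/// stands in for a lightweight Depth: 0 PROPFIND and returns the
/// directory's current ETag, or None if it could not be read.
fn plan_scan(
    root: &str,
    stored: &HashMap<String, String>,
    fetch_etag: impl Fn(&str) -> Option<String>,
) -> ScanPlan {
    // Steps 1 and 2: fetch the root ETag and compare with the stored value.
    let root_unchanged = match (stored.get(root), fetch_etag(root)) {
        (Some(old), Some(new)) => *old == new,
        _ => false,
    };

    // Step 4: first sync or changed root means a full recursive scan.
    if !root_unchanged {
        return ScanPlan::FullScan;
    }

    // Step 3: root unchanged, so check each known subdirectory individually.
    let prefix = format!("{}/", root.trim_end_matches('/'));
    let mut changed: Vec<String> = stored
        .iter()
        .filter(|(path, _)| path.starts_with(prefix.as_str()))
        .filter(|(path, etag)| fetch_etag(path.as_str()).as_ref() != Some(*etag))
        .map(|(path, _)| path.clone())
        .collect();
    changed.sort(); // deterministic order for callers and tests
    ScanPlan::Targeted(changed)
}
```

With stored ETags for /Documents, /Documents/2024, and /Documents/Archive, a call such as `plan_scan("/Documents", &stored, fetch)` would yield a targeted plan naming only the subdirectories whose ETag no longer matches, and a full scan when /Documents has no stored ETag yet. A subdirectory whose ETag cannot be fetched is treated as changed here, which is one possible policy rather than necessarily the one the commit uses.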
perf3ct 9e43df2fbe feat(server/client): add metadata to file view 2025-06-30 19:13:16 +00:00
perf3ct 97fa50c1b5 feat(server/client): resolve failing tests 2025-06-28 21:21:05 +00:00
perf3ct 84577806ef feat(server/client): add failed_documents table to handle failures, and move logic of failures 2025-06-28 20:52:58 +00:00
perf3ct 2d04f0094a fix(ocr_status): populate the ocr queue with pending jobs and add easy 'retry' button 2025-06-28 18:08:00 +00:00
perf3ct 9f3371e4f3 feat(migration): disable OCR consistency trigger for OCR confidence backfill 2025-06-28 17:23:35 +00:00
perf3ct 69425b2201 feat(migration): instead of hardcoded guessing, re-enter those documents into the queue 2025-06-28 14:53:45 +00:00
perf3ct e995653d69 fix(migrations): resolve issue in migration for ocr confidence 2025-06-28 14:51:06 +00:00
perfectra1n 582617ab88 fix(server/client): fix incorrect OCR measurements 2025-06-27 20:23:59 -07:00
perf3ct e9496b921e feat(server): set up oidc system and migrations 2025-06-26 18:52:57 +00:00
perf3ct a0e75d4619 feat(server/client): implement feature of ignoring already deleted files, and add failed OCR queue tests 2025-06-24 17:20:33 +00:00
perf3ct a6121c2849 fix(migrations): fix comment referencing old migration name 2025-06-23 21:10:44 +00:00
perf3ct 5510765035 feat(migrations): resolve migrations names and remove legacy migrations code 2025-06-23 21:08:43 +00:00
perf3ct 67d1e0ee2f feat(webdav): move etag parser to own function, create required migration 2025-06-23 19:39:39 +00:00
perf3ct 5dae03635a feat(ocr_queue): fix completed_today count 2025-06-22 16:04:17 +00:00
aaldebs99 2058e5db8d fix(db): more labels migrations 2025-06-19 21:28:13 +00:00
aaldebs99 889f00bc71 fix(migrations): de-dupe migrations and fix labels migrations 2025-06-19 19:47:29 +00:00
aaldebs99 95d52f477e fix(db): add labels sql table 2025-06-19 18:58:00 +00:00
perf3ct d055e9f350 feat(server/client): implement labels for documents 2025-06-18 16:12:42 +00:00
perf3ct 58aaedf4a6 feat(server): add hash for documents 2025-06-17 15:41:42 +00:00
perf3ct fad6756c8c feat(server): stop image preprocessing in OCR 2025-06-17 00:35:03 +00:00
perf3ct 801038a26e feat(server): break up large db.rs file into multiple files, and add more PDF guardrails 2025-06-17 00:25:21 +00:00
perf3ct c656a96d91 feat(server): create folders within 'upload' path to manage thumbnails, processed images, etc. 2025-06-16 21:24:46 +00:00
perf3ct bf7ec25dc1 feat(server): create more DB guardrails, and lots of missing tests 2025-06-15 22:14:02 +00:00
perf3ct b21f2684bc feat(server/client): add lots of OCR tweaks 2025-06-15 21:24:06 +00:00