Commit Graph

295 Commits

Author SHA1 Message Date
Jon Fuller ac6b4a522f Merge pull request #96 from readur/feat/deduplicate-test-utils-1
feat(tests): deduplicate test functionalities
2025-07-04 09:12:40 -07:00
perf3ct 545046509f fix(server): fix axum groups 2025-07-04 03:07:28 +00:00
perf3ct 922478d995 fix(server): resolve compilation errors due to splitting up the large files 2025-07-04 03:06:29 +00:00
perf3ct 497b34ce0a fix(server): resolve type issues and functions for compilation issues 2025-07-04 00:53:32 +00:00
perf3ct 51fb3a7e48 fix(tests): resolve broken test utils 2025-07-04 00:31:53 +00:00
perf3ct 0e84993afa fix(server): resolve import issues 2025-07-03 23:58:11 +00:00
perf3ct f862df9a90 feat(dev): break up the large sources.rs file into smaller ones 2025-07-03 23:44:49 +00:00
perf3ct 3a7c8e8bda feat(dev): break up the large documents.rs file, again 2025-07-03 23:33:53 +00:00
perf3ct b9e0e5b905 feat(dev): also break up the large webdav_service.rs file into smaller ones 2025-07-03 19:57:31 +00:00
perf3ct ed942d02c7 feat(dev): break up the large documents.rs file 2025-07-03 19:47:31 +00:00
perf3ct 86bcd613e4 feat(dev): split up large models.rs file to smaller ones 2025-07-03 19:35:36 +00:00
perf3ct 44aaaca5c5 feat(ocr): add even more about the multiple ocr languages 2025-07-03 19:20:19 +00:00
perf3ct a3f49f9bd7 feat(tests): try to deduplicate test code even more 2025-07-03 19:17:33 +00:00
perf3ct 7993786e18 feat(tests): deduplicate tests too 2025-07-03 17:21:39 +00:00
perf3ct 7074a8d868 feat(webdav): add validation statuses to sources 2025-07-03 14:03:26 +00:00
perf3ct 459b8622bb feat(webdav): also add some crazy source automatic validation 2025-07-03 05:26:36 +00:00
perf3ct 15b1f40cc1 feat(webdav): make sure to have scanned all subdirectories 2025-07-03 05:02:17 +00:00
perf3ct 69c40c10fa feat(webdav): gracefully recover webdav from stops/crashes 2025-07-03 04:45:25 +00:00
perf3ct 2297eb8261 feat(webdav): also set up deep scanning button and fix unit tests 2025-07-03 04:24:26 +00:00
perf3ct b8dd23655d feat(webdav): directory etag smart checking and all that 2025-07-03 00:26:56 +00:00
perf3ct be29316ff4 fix(tests): resolve compilation error in tests and source scheduler 2025-07-02 23:49:46 +00:00
perf3ct d26b9e386b fix(webdav): resolve issue with webdav subdirectories not being discovered 2025-07-02 23:37:39 +00:00
perf3ct f7414af15c fix(tests): resolve silly new ocr retry tests 2025-07-02 22:51:09 +00:00
perf3ct 6d40feadb3 fix(server): resolve issues with the retry ocr tests 2025-07-02 22:47:51 +00:00
perf3ct ab03b8d73d fix(server): resolve ocr test functionality failing due to db trigger 2025-07-02 22:38:13 +00:00
perf3ct dd4bd03af6 feat(tests): fix ocr_retry issues in tests 2025-07-02 21:48:01 +00:00
perf3ct ffad8c4561 feat(tests): fix ocr_retry issues in tests 2025-07-02 21:30:36 +00:00
perf3ct 3c4e06fa77 feat(tests): fix ocr_retry issues in tests 2025-07-02 18:48:26 +00:00
perf3ct 8cea916abf feat(server): allow also completed documents to be retried 2025-07-02 18:15:41 +00:00
perf3ct 6bdd6f4a56 feat(server): implement DEBUG environment variable 2025-07-02 17:57:57 +00:00
perf3ct 68aa492a96 fix(server): resolve NUMERIC db type and f64 rust type 2025-07-02 02:26:11 +00:00
perf3ct d4b57d2ae0 feat(server/client): implement retry functionality for both successful and failed documents 2025-07-02 00:06:47 +00:00
perf3ct a381cdd12c feat(webdav): also fix the parser to include directories, and add tests 2025-07-01 22:03:06 +00:00
perf3ct c1dbd06df2 feat(tests): add unit tests for new webdav functionality 2025-07-01 21:39:31 +00:00
perf3ct 92b21350db feat(webdav): track directory etags
✅ Core Optimizations Implemented

  1. 📊 New Database Schema: Added webdav_directories table to track
directory ETags, file counts, and metadata
  2. 🔍 Smart Directory Checking: Before deep scans, check directory
ETags with lightweight Depth: 0 PROPFIND requests
  3. ⚡ Skip Unchanged Directories: If directory ETag matches, skip the
entire deep scan
  4. 🗂️ N-Depth Subdirectory Tracking: Recursively track all
subdirectories found during scans
  5. 🎯 Individual Subdirectory Checks: When parent unchanged, check
each known subdirectory individually

  🚀 Performance Benefits

  Before: Every sync = Full Depth: infinity scan of entire directory
tree
  After:
  - First sync: Full scan + directory tracking setup
  - Subsequent syncs: Quick ETag checks → skip unchanged directories
entirely
  - Changed directories: Only scan the specific changed subdirectories

  📁 How It Works

  1. Initial Request: PROPFIND Depth: 0 on /Documents → get directory
ETag
  2. Database Check: Compare with stored ETag for /Documents
  3. If Unchanged: Check each known subdirectory (/Documents/2024,
/Documents/Archive) individually
  4. If Changed: Full recursive scan + update all directory tracking
data
2025-07-01 21:22:16 +00:00
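The decision flow described in the commit message above (compare the directory ETag, then either do a full scan or check the known subdirectories one by one) can be sketched as follows. This is a minimal illustration, not the code from the commit: `DirectoryTracker`, `ScanPlan`, `record`, and `plan` are hypothetical names, an in-memory map stands in for the webdav_directories table, and the current ETag is assumed to have been fetched already with a Depth: 0 PROPFIND.

```rust
use std::collections::HashMap;

/// What a sync should do for one directory (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ScanPlan {
    /// Directory is unknown or its ETag changed: full recursive scan.
    FullScan,
    /// ETag unchanged: check only these tracked subdirectories individually.
    CheckSubdirs(Vec<String>),
}

/// In-memory stand-in for the webdav_directories table (assumed shape:
/// directory path -> last seen ETag).
#[derive(Default)]
pub struct DirectoryTracker {
    etags: HashMap<String, String>,
}

impl DirectoryTracker {
    /// Store or update the ETag seen for a directory during a scan.
    pub fn record(&mut self, path: &str, etag: &str) {
        self.etags.insert(path.to_string(), etag.to_string());
    }

    /// Decide how to sync `path`, given the ETag the server reports now.
    pub fn plan(&self, path: &str, current_etag: &str) -> ScanPlan {
        match self.etags.get(path) {
            Some(stored) if stored.as_str() == current_etag => {
                // Parent unchanged: collect every tracked directory below it.
                // The trailing slash keeps "/DocumentsOld" out of "/Documents".
                let prefix = format!("{}/", path.trim_end_matches('/'));
                let mut subdirs: Vec<String> = self
                    .etags
                    .keys()
                    .filter(|p| p.starts_with(&prefix))
                    .cloned()
                    .collect();
                subdirs.sort();
                ScanPlan::CheckSubdirs(subdirs)
            }
            // First sync, or the ETag differs from the stored one.
            _ => ScanPlan::FullScan,
        }
    }
}
```

With /Documents, /Documents/2024, and /Documents/Archive recorded, calling `plan("/Documents", …)` with the stored ETag returns the two subdirectories for individual checks, while any other ETag returns `FullScan`. Returning all tracked descendants rather than only direct children is a choice made for this sketch.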
perf3ct 6a23a407bf feat(client): update swagger ui endpoints 2025-07-01 20:54:45 +00:00
Jon Fuller 2e1a05fc8d Merge branch 'main' into feat/multiple-ocr-languages 2025-07-01 11:53:42 -07:00
perf3ct df281f3b26 feat(pdf): implement ocrmypdf to extract text from PDFs 2025-07-01 00:56:48 +00:00
Jon Fuller 706e20f35c Merge branch 'main' into feat/debug-page 2025-06-30 17:19:31 -07:00
perf3ct 231f88f038 feat(debug): debug page actually works and does something 2025-07-01 00:15:48 +00:00
perf3ct 0052032772 fix(pdf): resolve PDF wordcount error 2025-07-01 00:10:49 +00:00
perf3ct 830f9d0b38 feat(server): mark documents with 0 words as failed, and fix webdav unit tests 2025-06-30 22:43:25 +00:00
perf3ct 69279344cb fix(tests): fix documents tests 2025-06-30 21:56:21 +00:00
perf3ct b38c1fca07 feat(server): fix serialization issues 2025-06-30 19:40:05 +00:00
perf3ct 9e43df2fbe feat(server/client): add metadata to file view 2025-06-30 19:13:16 +00:00
perf3ct fef28a33c6 feat(server): continue to try to wrangle the failed and ignored documents 2025-06-29 23:27:51 +00:00
perf3ct 87cfab9ff8 fix(tests): resolve compilation error in the multiple OCR functionality 2025-06-29 23:21:42 +00:00
perf3ct 197afc19f4 feat(tests): implement and update tests for multiple OCR languages 2025-06-29 23:03:37 +00:00
perf3ct 6b6890d529 feat(server/client): support multiple OCR languages 2025-06-29 22:51:06 +00:00
perf3ct fbf89c213d fix(tests): resolve a whole lot of test issues 2025-06-28 22:50:40 +00:00
perf3ct edd0c7514f fix(server): resolve compilation errors in constraint_validation.rs 2025-06-28 22:04:01 +00:00
perf3ct 97fa50c1b5 feat(server/client): resolve failing tests 2025-06-28 21:21:05 +00:00
perf3ct 84577806ef feat(server/client): add failed_documents table to handle failures, and move logic of failures 2025-06-28 20:52:58 +00:00
Jon Fuller fce56b660b Merge pull request #72 from readur/feat/better-db-tests
feat(tests): add regression tests and better sql type safety tests
2025-06-28 12:43:52 -07:00
perf3ct f4adafe2bd feat(tests): add regression tests and better sql type safety tests 2025-06-28 19:25:15 +00:00
perf3ct 9be70dc245 feat(swagger): add missing oidc endpoints into swagger ui 2025-06-28 19:19:48 +00:00
perf3ct 099f4853a7 fix(server): resolve incorrect db type 2025-06-28 18:41:48 +00:00
perf3ct fe1deb1e9d fix(server): resolve compilation issues from queue.rs 2025-06-28 18:15:55 +00:00
perf3ct 2d04f0094a fix(ocr_status): populate the ocr queue with pending jobs and add easy 'retry' button 2025-06-28 18:08:00 +00:00
Jon Fuller 5aae560d7e Merge pull request #69 from readur/fix/ocr-confidence-1
fix(server/client): fix incorrect OCR measurements
2025-06-28 09:53:56 -07:00
perf3ct 9079529eb5 feat(tests): create generic migration tests 2025-06-28 16:38:12 +00:00
perfectra1n 582617ab88 fix(server/client): fix incorrect OCR measurements 2025-06-27 20:23:59 -07:00
perf3ct cc0d647590 fix(server): resolve compilation issue in IgnoredFilesQuery 2025-06-28 01:01:51 +00:00
perf3ct 0b8dbfb8d9 feat(server/client): easily undelete ignored files, if the user wishes to do so 2025-06-28 00:37:49 +00:00
perf3ct 2c6bd92bf4 fix(server): fix unclosed delimiter 2025-06-27 22:51:02 +00:00
Jon Fuller 929f27eaa9 Merge branch 'main' into feat/delete-low-confidence-documents 2025-06-27 15:17:50 -07:00
perf3ct aacfc96825 feat(server/client): implement button deleting low confidence documents (e.g. documents that have no text) 2025-06-27 22:16:38 +00:00
perf3ct a75fca0c28 feat(client/server): add a new badge for each source that shows the number of documents stored from each source 2025-06-27 21:32:50 +00:00
perf3ct 341a91e1a7 fix(tests): move oidc tests to correct folder 2025-06-27 19:33:58 +00:00
perf3ct 57bb0ccd2c fix(server): resolve broken imports on tests and test helpers 2025-06-27 18:46:41 +00:00
perf3ct 9a8bf72ff7 feat(server): reorganize components into their own modules and fix imports 2025-06-27 18:27:42 +00:00
Jon Fuller b095cb951f Merge pull request #55 from readur/feat/oidc-setup
feat(server): set up oidc system and migrations
2025-06-27 10:48:28 -07:00
perf3ct 0b6d96df03 fix(tests): resolve last OIDC test issues 2025-06-27 17:32:33 +00:00
perf3ct 12cdd0ffd6 fix(tests): resolve some difficult race conditions in test 2025-06-27 05:08:12 +00:00
perf3ct 3c5b7c7dfb feat(oidc): fix oidc, tests, and everything in between 2025-06-27 05:03:27 +00:00
perf3ct 51907f81f2 fix(metrics): fix broken prometheus metrics 2025-06-26 22:14:42 +00:00
Jon Fuller 269ba4d46a Merge pull request #56 from readur/fix/pdf-thumbnail-generation
feat(server): actually render PDF thumbnails
2025-06-26 14:14:25 -07:00
perf3ct e626f3a131 feat(metrics): add more prometheus metrics, and create grafana dashboard 2025-06-26 21:14:00 +00:00
perf3ct 075657899f feat(server): use poppler for pdf image generation 2025-06-26 20:39:42 +00:00
perf3ct a94acd7ffe feat(server): actually render PDF thumbnails? 2025-06-26 20:25:52 +00:00
perf3ct e9496b921e feat(server): set up oidc system and migrations 2025-06-26 18:52:57 +00:00
Jon Fuller 70451d728f Merge pull request #46 from readur/fix/catch-pdf-extract-errors
fix(server): catch pdf-extract spammy logs
2025-06-25 21:35:09 -07:00
perf3ct 715b94ec66 feat(swagger): add a ton of docstrings to functions 2025-06-25 23:58:37 +00:00
perf3ct 20b90e92d3 feat(swagger): add missing endpoints to swagger-ui 2025-06-25 23:47:27 +00:00
perf3ct 40afb5ade5 fix(server): catch pdf-extract spammy logs 2025-06-25 23:26:11 +00:00
perf3ct a5ca6e33f2 feat(server): decrease logging verbosity for ingestion 2025-06-25 21:41:46 +00:00
perf3ct 00d771c15f fix(server): resolve compilation issues due to increased logging 2025-06-25 20:00:09 +00:00
perf3ct bcd03bf0d4 fix(server): don't log postgres passwords 2025-06-25 19:44:58 +00:00
perf3ct 04bf3500fa feat(server): implement better error for configuration issues 2025-06-25 19:37:16 +00:00
perf3ct 05a1a07494 fix(server): also fix these broken user isolation SQL statements 2025-06-24 17:43:58 +00:00
perf3ct 363bc2b9ef fix(server): better error responses when creating users 2025-06-24 17:33:59 +00:00
perf3ct 3f3654c3cb fix(server): resolve lack of user isolation 2025-06-24 17:28:28 +00:00
perf3ct a0e75d4619 feat(server/client): implement feature of ignoring already deleted files, and add failed OCR queue tests 2025-06-24 17:20:33 +00:00
perf3ct 5510765035 feat(migrations): resolve migrations names and remove legacy migrations code 2025-06-23 21:08:43 +00:00
perf3ct 67d1e0ee2f feat(webdav): move etag parser to own function, create required migration 2025-06-23 19:39:39 +00:00
perf3ct 113f1d8315 fix(tests): fix broken parser, thanks for finding that, unit tests! 2025-06-23 19:14:31 +00:00
perf3ct b9847b8b6b feat(server): normalize etags from webdav to properly check for file changes 2025-06-23 19:03:24 +00:00
perf3ct 33ae814a43 fix(tests): also fix unit tests 2025-06-22 21:31:11 +00:00
perf3ct 1555b8bd4d feat(tests): resolve admin integration test issues 2025-06-22 17:28:45 +00:00
perf3ct 4ec4ecaa8d feat(ci): fix other tests, part 9000 2025-06-21 18:08:34 +00:00
perf3ct 679ad04274 fix(deletion): properly handle concurrent deletion requests 2025-06-20 18:40:24 +00:00
perf3ct 8ae976eda8 feat(tests): resolve failing and ignored tests 2025-06-20 18:37:52 +00:00
perf3ct 09b338685d fix(tests): repair the label tests 2025-06-20 18:10:27 +00:00
perf3ct 2c2d948aa2 fix(documents): remove old code in favor of document ingestion engine 2025-06-20 17:18:00 +00:00
perf3ct eec1072677 Merge branch 'main' into feat/document-deletion 2025-06-20 17:11:26 +00:00
perf3ct c4a9c51b98 feat(ingestion): have everything use the document ingestion engine 2025-06-20 16:53:06 +00:00
perf3ct ac069de5bc feat(ingestion): create ingestion engine to handle document creation, and centralize deduplication logic 2025-06-20 16:24:26 +00:00
aaldebs99 e3c276226a feat(tests): add deletion unit tests 2025-06-20 16:09:27 +00:00
aaldebs99 1507532083 feat(everything): Add document deletion 2025-06-20 03:49:16 +00:00
aaldebs99 b24bf2c7d9 Merge branch 'main' into feat/document-labels 2025-06-19 18:40:50 -07:00
aaldebs99 4dd9162415 fix(frontend): label writing and fetching logic 2025-06-20 01:32:32 +00:00
aaldebs99 aeb98acea8 fi(backend): migrate python code to rust lol 2025-06-20 01:32:05 +00:00
perf3ct 7f20e59aa6 feat(tests): resolve issue with 'source' tests 2025-06-19 20:29:35 +00:00
aaldebs99 bfb971adce fix(backend): lables handling 2025-06-19 19:47:49 +00:00
aaldebs99 2d518b40df fix(backend): labels 2025-06-19 18:58:00 +00:00
Jon Fuller 7873913759 Merge branch 'main' into feat/document-labels 2025-06-19 11:32:03 -07:00
aaldebs99 215704f881 fix(server): static file routes 2025-06-19 18:29:52 +00:00
Jon Fuller 4d0d9d16b6 Merge branch 'main' into feat/document-labels 2025-06-18 19:07:54 -07:00
aaldebs99 865c91db67 chore(server): remove unused system user 2025-06-19 00:41:01 +00:00
perf3ct d055e9f350 feat(server/client): implement labels for documents 2025-06-18 16:12:42 +00:00
perf3ct 4f36e40e38 feat(tests): resolve last test issues 2025-06-17 22:14:38 +00:00
perf3ct 14af90c657 feat(tests): fix the vast majority of both server and client tests 2025-06-17 22:06:12 +00:00
perf3ct f905c220e0 feat(tests): add actual images as part of e2e and testing 2025-06-17 21:26:39 +00:00
perf3ct 24e7dff9a5 feat(client): update failedOcr page for duplicates 2025-06-17 16:52:45 +00:00
perf3ct 80d58b0f28 feat(server/client): implement updated FailedOcrPage, duplicate management, and file hashing 2025-06-17 16:17:23 +00:00
perf3ct 58aaedf4a6 feat(server): add hash for documents 2025-06-17 15:41:42 +00:00
perf3ct babe5a6e46 fix(ocr_queue): don't slam the DB while we wait 2025-06-17 14:45:44 +00:00
perf3ct b2a7faaddb feat(server): create specific endpoint for fetching documents, fix client being served again 2025-06-17 04:05:57 +00:00
perf3ct 7eb036b153 feat(client/server): create endpoint for fetching individual files, and fix client not serving files 2025-06-17 03:38:16 +00:00
perf3ct 479c62a4f1 feat(client/server): advanced search, along with fixing build errors 2025-06-17 02:56:59 +00:00
perf3ct 4dda4d143d feat(client/server): implement a much better search 2025-06-17 02:41:16 +00:00
perf3ct bcd756ed20 feat(server/client): remove webdav feature from user's settings as it's in sources now 2025-06-17 01:57:56 +00:00
perf3ct fad6756c8c feat(server): stop image preprocessing in OCR 2025-06-17 00:35:03 +00:00
perf3ct 801038a26e feat(server): break up large db.rs file into multiple files, and add more PDF guardrails 2025-06-17 00:25:21 +00:00
perf3ct 54868cdc57 feat(server): try to resume syncs after server restart 2025-06-16 23:21:43 +00:00
perf3ct 0d3fe26074 feat(server): also generate thumbnails for non-images, and resolve failing unit/integration tests 2025-06-16 22:51:29 +00:00
perf3ct c43994e63c feat(server): put more guardrails around PDF OCR size, and image size OCR 2025-06-16 22:39:00 +00:00
perf3ct 13e60fa655 feat(server): if there's no sync even running, allow sync to be cancelled 2025-06-16 21:39:41 +00:00
perf3ct c656a96d91 feat(server): create folders within 'upload' path to manage thumbnails, processed images, etc. 2025-06-16 21:24:46 +00:00
perf3ct af7129da0a feat(server/client): fix thumbnails and quick search 2025-06-16 17:40:53 +00:00
perf3ct 4aa4359064 feat(server/client): update function used to display singular documents 2025-06-16 17:10:55 +00:00
perf3ct fe56ecdb00 feat(server): implement queue system for ocr process as well, to fight resource exhaustion 2025-06-16 01:20:13 +00:00
perf3ct bf7ec25dc1 feat(server): create more DB guardrails, and lots of missing tests 2025-06-15 22:14:02 +00:00
perf3ct 5b88c92937 feat(server/client): add pagination in client, resolve race condition in server 2025-06-15 21:48:59 +00:00
perf3ct b21f2684bc feat(server/client): add lots of OCR tweaks 2025-06-15 21:24:06 +00:00
perf3ct a39fc807fa feat(server): fix the sync scheduler for sources 2025-06-15 18:05:56 +00:00
perf3ct cebae12363 feat(client): also show settings for s3 and local sources in the client 2025-06-15 18:00:35 +00:00
perf3ct e5aaf31fdd feat(server/client): working s3 and local source types 2025-06-15 17:51:04 +00:00
perf3ct 11c68c3d9f feat(client): also update sources page and the various buttons 2025-06-15 17:06:38 +00:00
perf3ct 5dfc6e29f7 feat(async): create dedicated pools + runtime isolation for OCR 2025-06-15 16:47:55 +00:00