Introduction
Docfd is a TUI program that allows you to fuzzy search for a phrase across multiple lines, in text files, PDFs, DOCX files, etc.
While the README does a reasonable job of briefly covering what Docfd is, I still want to talk about Docfd in greater detail, in part to document the engineering effort that went into it, and in part as a portfolio-building exercise.
Motivation
Docfd was born out of my personal frustrations with existing search
tools in the context of human text search.
This is not to say I think these tools are poorly implemented or poorly designed; rather, I simply don’t fit into the usage patterns they are designed for. In fact, given the specific
intended usage patterns and constraints, I think most of these tools are
likely implemented as well as one could.
So the following are not necessarily criticisms, but more an explanation of why these tools didn’t fit my (admittedly niche) criteria of running locally on a not very powerful machine while providing “good” search results:
- grep, ripgrep, etc
- Great if I am searching for a single word
- Bad if I want to search for a phrase across potentially more than 1
line
- Technically there is a way to encode the problem into regexes, e.g. “hello world” becomes hello.*world, hello.*\n.*world, hello.*\n.*\n.*world, and so on, up to a limit (see the sketch after this list).
- But then the number of regexes grows quickly with the number of
words, especially if we allow reordering of words and desire typo
tolerance.
- fzf, skim, television
- Great for fuzzy finding within a single line
- There are workarounds for multiline search by replacing the newline character and then using the --read0 flag, e.g. vgc, but this does not allow searching across blank lines
- Technically not a problem if you are only interested in searching
within paragraphs, but this assumes text is always well formed, which is
not necessarily the case with text extracted from PDFs.
- Paperless-ngx,
Papra, sist2
- These are optimised for central storage of documents, where the storage is not necessarily designed to be accessible to external programs (which is fair enough, since you want storage consistency).
- But I still want to use the file system as the main organisation
mechanism to avoid vendor lock-in, and to allow usage of other tools. In
principle I can use both the central storage and file system, but then I
am using twice the storage space and also need to keep things in sync
somehow.
- That said, if I ever need to deploy for a multi-user or concurrent-access use case, then software with DB-backed storage is still likely my top pick. But here I am just looking to use it locally on a single machine myself.
- Recoll, Baloo
- These (from my point of view) expect either a stable main collection of folders to scan, or scan the whole home directory barring folders like Downloads. They also use a background daemon to index periodically.
- However, I usually jump to a specific project folder to search
rather than the whole home directory, and I want the index to be fully
up-to-date with respect to the underlying files. I don’t want to wait
for a background update to occur.
- While I can trigger the reindexing manually, I need to update the settings in order to reindex just the specific folder I am in.
- On a similar note, I also don’t want the periodic CPU and/or memory
spike from reindexing of files I am not looking at.
- Elasticsearch, Lucene, meilisearch,
etc
- Nothing beats the search performance of actual search engines. But the setup and running costs of these are not trivial.
- Which is all fair enough, because ultimately you need in-memory indices for hot data to serve at the scale needed, and these take time to build and occupy memory space, etc.
- But I don’t want anything long running. I just want to jump into a
folder, do my search, and be done. I don’t want to wait for more than a
few seconds just to do a quick search, so starting up a fresh search
engine instance per session is a no-go.
- I also don’t want the tool to disrupt other desktop applications I am
running, so the typical resource usage requirements of search engines
also render them not viable for me.
- But if I am deploying onto dedicated hardware or a dedicated host, then these obviously remain the natural first choices.
- Local LLMs, whether for chat or semantic search
- I personally believe these could be the gold standard for information retrieval, but they are still very computationally heavy, and there is a lot of fine-tuning needed in terms of chunking, etc., to yield optimal results. And even after all that, we still face the issue of hallucinations.
- Arguments about energy efficiency aside, I simply don’t want to build my core workflow around needing very powerful hardware. Even CPU-optimised LLMs ran quite sluggishly on my laptop the last time I tried.
- If one day very powerful hardware becomes so prevalent and cheap that computational inefficiency is simply strong-armed into irrelevance, then sure, why not use a local LLM. Alternatively, if LLMs become so efficient that we can run them on the most modest of machines while still getting good results, that would suffice as well.
- But I am not confident we are close to that stage yet, so there is still a very real need for the traditional approaches to searching, especially since LLMs still hallucinate wildly, and even when they don’t, they cannot guarantee exhaustiveness. For instance, if grep does not give any results for a search word, then I am confident the word does not exist in the documents, but I cannot say the same for LLMs.
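To make the regex-encoding point under grep concrete, here is a rough sketch in OCaml (the language Docfd itself is written in). The function name and regex shapes are purely illustrative and not taken from any of the tools above; for simplicity, every gap between words gets the same newline budget, and enumerating per-gap budgets (plus word reordering and typo tolerance) is exactly what makes the number of regexes blow up.

```ocaml
(* Illustrative only: turn a phrase into the family of regexes described
   above, allowing up to [max_newlines] line breaks between words. *)
let phrase_to_regexes ~max_newlines words =
  List.init (max_newlines + 1) (fun n ->
      let gap = ".*" ^ String.concat "" (List.init n (fun _ -> "\\n.*")) in
      String.concat gap words)

let () =
  phrase_to_regexes ~max_newlines:2 [ "hello"; "world" ]
  |> List.iter print_endline
(* Prints:
   hello.*world
   hello.*\n.*world
   hello.*\n.*\n.*world *)
```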
How does Docfd address my complaints?
- Docfd accomplishes multiline search through a straightforward combination of an inverted index and proximity search between words (a simplified sketch follows this list).
- The inverted index (i.e. a mapping from a word to all of the positions where it appears) is first searched through once to find the matches for the first word in the search phrase.
- DFS is then used for the remaining words, where each word is searched for within a specified distance of the previous word’s match. In other words, a path in this DFS is the sequence of words in the document matching the search phrase.
- Fuzzy matching is handled by using the Levenshtein distance as part
of the matching criteria for each word. An automaton is computed for
each word of the search phrase for optimised repeated matching.
- Docfd only processes the current directory or the specified directories and files upon startup. Hashing is used to check if a file has been previously indexed. This means there is no central storage requirement, and no background indexing.
- In principle this causes slower start-up time in the general case
compared to programs with background indexing. But since the set of
documents of interest is usually small (<1k documents), the start-up
is often instantaneous.
- And if the set of documents is really large, then one can always
fall back to having a long running session of docfd.
- A common speed-up tactic for search engines is to load the inverted indices into memory, as they are on the hottest paths. Docfd instead keeps them only in the on-disk SQLite DB to minimise memory usage and avoid impacting the performance of other desktop applications. The tradeoff is that at larger scale (say a few thousand documents, depending on their sizes), Docfd will noticeably struggle: results will take seconds instead of less than a second to show up.
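To make the first point above concrete, below is a minimal sketch of that search strategy. It is not Docfd’s actual code: the names are made up, it uses a plain Levenshtein distance computation instead of precomputed automata, and it only looks forward from the previous match.

```ocaml
module String_map = Map.Make (String)

(* Inverted index: word -> positions at which it appears in the document. *)
type index = int list String_map.t

let build_index (words : string array) : index =
  Array.to_list words
  |> List.mapi (fun pos w -> (w, pos))
  |> List.fold_left
       (fun acc (w, pos) ->
         String_map.update w
           (function None -> Some [ pos ] | Some l -> Some (pos :: l))
           acc)
       String_map.empty

(* Plain Levenshtein (edit) distance; Docfd uses precomputed automata per
   search word instead, which this sketch does not attempt. *)
let edit_distance a b =
  let la = String.length a and lb = String.length b in
  let d = Array.make_matrix (la + 1) (lb + 1) 0 in
  for i = 0 to la do d.(i).(0) <- i done;
  for j = 0 to lb do d.(0).(j) <- j done;
  for i = 1 to la do
    for j = 1 to lb do
      let cost = if a.[i - 1] = b.[j - 1] then 0 else 1 in
      d.(i).(j) <-
        min
          (min (d.(i - 1).(j) + 1) (d.(i).(j - 1) + 1))
          (d.(i - 1).(j - 1) + cost)
    done
  done;
  d.(la).(lb)

(* DFS over the remaining search words: each word must fuzzily match an
   indexed word within [max_gap] positions after the previous match.
   Each returned path is a sequence of positions matching the phrase. *)
let search ~max_gap ~max_dist (index : index) phrase =
  let candidates word =
    String_map.fold
      (fun w positions acc ->
        if edit_distance w word <= max_dist then positions @ acc else acc)
      index []
  in
  let rec dfs path pos = function
    | [] -> [ List.rev path ]
    | word :: rest ->
        candidates word
        |> List.filter (fun p -> p > pos && p - pos <= max_gap)
        |> List.concat_map (fun p -> dfs (p :: path) p rest)
  in
  match phrase with
  | [] -> []
  | first :: rest ->
      candidates first |> List.concat_map (fun p -> dfs [ p ] p rest)

let () =
  let doc = [| "hello"; "across"; "many"; "worlds" |] in
  search ~max_gap:4 ~max_dist:1 (build_index doc) [ "hello"; "world" ]
  |> List.iter (fun path ->
         List.map string_of_int path |> String.concat " " |> print_endline)
(* Prints "0 3": "hello" at position 0, then "worlds" (within edit
   distance 1 of "world") at position 3. *)
```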
Structure of Remaining Text
The rest of the Docfd text will revolve around showcasing the different vertical slices (the specific scenarios or workflows) of Docfd in the form of recordings, each followed by a technical write-up of how the implementation works together to deliver said vertical slice.
I believe this both delivers a nice feature exploration experience
for new and existing users, and splits the technical writing naturally
by tangible, focused scenarios.
Showcase