Search index implementations
2022/06/26 (539 words)

A living document where I will be putting information about search index implementations as I learn about them.

So there are a few talked about ways to search across content or build an index that I am aware of. Lets discuss each in term with its advantages and disadvantages

Brute Force

Brute Force, note that this isn’t an index strategy per say, but worth discussing anyway. Assuming you can get the entire corpus you are search into RAM it is possible to brute force search. I’m including it here so I have the advantages and disadvantages. While brute force isnt a solution to every problem, a lot of problems are pretty brute forcible given enough CPU and RAM.

Advantages

Disadvantages

Inverted Index

Inverted index. This is pretty much building a map of terms to documents and then intersecting them and ranking. One issue with this technique is that you end up with enormous term lists, known as posting lists for common terms.

Advantages

Disadvantages

Trie

Trie for example https://github.com/typesense/typesense which uses Adaptive Radix Tree https://stackoverflow.com/questions/50127290/data-structure-for-fast-full-text-search

Advantages

Disadvantages

Bit Signatures

This is something I remember reading about years ago, and found this link to prove I had not lost my mind https://www.stavros.io/posts/bloom-filter-search-engine/ At the time I thought it was neat but not very practical… However then it turns out that Bing has been using this technique over its entire web corpus http://bitfunnel.org/ https://www.youtube.com/watch?v=1-Xoy5w5ydM

Advantages

Disadvantages