Search index implementations | Ben E. C. Boyter

A living document where I will be putting information about search index implementations as I learn about them.

So there are a few talked about ways to search across content or build an index that I am aware of. Lets discuss each in term with its advantages and disadvantages

Brute Force

Brute Force, note that this isn’t an index strategy per say, but worth discussing anyway. Assuming you can get the entire corpus you are search into RAM it is possible to brute force search. I’m including it here so I have the advantages and disadvantages. While brute force isnt a solution to every problem, a lot of problems are pretty brute forcible given enough CPU and RAM.

Advantages

100% space efficient (for the index anyway) since there isn’t one
Easy to query/write
Can determine TF easily as its a positional “index”
Scales (assuming you are prepared to pay)

Disadvantages

Slow… at scale if you cannot fit your corpus in RAM
Harder to scale because gotta duplicate all the data
Writing high performance string searches is a hard problem

Inverted Index

Inverted index. This is pretty much building a map of terms to documents and then intersecting them and ranking. One issue with this technique is that you end up with enormous term lists, known as posting lists for common terms.

Advantages

Easy to query
Does not miss terms
Can store TF alongside terms creating a positional index
Fast and easy; intersecting posting lists is something you can hand off to your most junior developer (at least initially)
Query execution time is related to the number of returned results

Disadvantages

Harder to do wildcard queries OR use more space for them
Hard to update documents and avoid false positives, usually have to rebuild the index or do delta + merge
Need to implement skip lists or compression on the posting lists at scale
Queries can be slow due to the complexity of the list structures

Trie

Trie for example https://github.com/typesense/typesense which uses Adaptive Radix Tree https://stackoverflow.com/questions/50127290/data-structure-for-fast-full-text-search

Advantages

Constant lookup time
Potential to be space efficient if key terms share the same prefixes
Easy to do wildcard queries
Can store TF alongside terms as a positional index

Disadvantages

Can take up more space if lots of prefixes
Not friendly to GC due to the use of pointers (problem for Go/Java and other GC languages)
Need to implement skip lists or compression on the posting lists at scale

Bit Signatures

This is something I remember reading about years ago, and found this link to prove I had not lost my mind https://www.stavros.io/posts/bloom-filter-search-engine/ At the time I thought it was neat but not very practical… However then it turns out that Bing has been using this technique over its entire web corpus http://bitfunnel.org/ https://www.youtube.com/watch?v=1-Xoy5w5ydM

Advantages

Stupidly fast if careful (bitwise operations are almost free from a CPU point of view)
Space/Memory efficient as it’s compressed
Easy to update/delete modify the document in memory

Disadvantages

Cannot store TF information along terms as its a non-positional index
Produces false positive matches
Query execution time is linear in the collection size (bitfunnel is about reducing this to an extent)
Memory bandwidth of the machine limits how large the index can grow on a single machine

Ben E. C. Boyter's Blog

Search index implementations
2022/06/26 (539 words)

Brute Force

Inverted Index

Trie

Bit Signatures

Search index implementations 2022/06/26 (539 words)

Brute Force

Inverted Index

Trie

Bit Signatures

Search index implementations
2022/06/26 (539 words)