Abusing Go, AWS Lambda and bloom filters to make a true Australian serverless search engine

Who are you?

"Officially" technical lead/principal at Kablamo but a "code monkey" at heart.

I blog boyter.org code free software github/boyter/ I run searchcode.com also on the x.com @boyter activitypub @boyter@honk.boyter.org

https://boyter.org/

Everything is there, so feel free to go to sleep

My hobby; search engines

early termination, syntax highlighting, matching, snippet extraction, rate limiting, index, bot detection, ranking, distributed algorithms, caching, tokenization, string searching, regular expressions, data structures, line detection, duplicate detection, literal extraction, unicode, case insensitive matching etc...

searchcode.com

Lambda/Serverless Weaknesses

  • Persistance
  • Performance
Never do at runtime what you can do at compile time.

The idea!

Compile the index INTO the lambda and deploy that

  • 50 MB zipped binary
  • 75 GB of space to store all your lambda's
  • AWS Lambda 1,000,000 free requests per month
  • AWS Lambda 400,000 seconds of compute time per month

Proving the theory

Brute Force

package main

func main() {
	index := []string{
		"Memcached vs Redis - More Different Than You Would Expect",
		"You Don't Need a Library for File Walking in Go",
		// ... 99,997 more ...
		"Lessons Learnt Building for the Atlassian Marketplace",
	}

	for _, x := range index {
		strings.Contains(x, "searchterm")
	}
}

This does not work...

Several seconds per search...

Add an index

Structured list. Keys point at data so searches are faster.

Given a term, return id's which contain it.

Textbook 101 index


index = map[string][]int{
	"serverless": []int{3,1337},
	"days": []int{1,2,3,4,5,6,1337},
	"brimful": []int{7,45},
	"asha": []int{45},
}

not compressed, list walking can be slow

Add Complexity

My attempt

Two turkeys taped together does not make an eagle.

How about bloom filters?

A.K.A Bitsliced signatures

Bloom filter: Empty

16 boolean's in an array

Add

Hash the term 3 times and set the bits

Add second

3 more bits set

Overlapping bits

"big" and "dog" share 2 bits

Hit

hash "big" and check bits

Miss

one bit position is 0 so miss

False Positive

"big" and "yellow" supplied bits

Bloom filter: search

8 bit bloom filter

Check bit positions 1 and 7. Document 4 matches.

for each bloomfilter
	for each bit
		check if bit location in filter is set
	if all matching bits are set
		record possible match

Advantages

  • Compressed. Only using several bits per term!
  • Just arrays of bools, easy to embed in code

Problem

8x overhead per bit because of how languages work...

Illustration

Can we do better?

In theory; theory and practice are the same. In practice they arent.

Bitfunnel

Good enough for Dan Luu? Good enough for you.

Fixes

Rotate the filter. Documents now on columns not rows.

Fetch row 1 and 7 same as previous example

Logically & all rows

Pos 1 is true, so document 4 matches

Results?

This reduces the amount of RAM we need to access by a huge factor for larger bloom filters.

64 document 16 bit bloom filter

0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000

Example

16 bit bloom filter, with 32 documents added.

Filled right to left

0000000000000000000000000000000010100011111111111111111111111101
0000000000000000000000000000000001110000000000010000100100000000
0000000000000000000000000000000000000000000000000001000100010000
0000000000000000000000000000000000100011111100001010000000000000
0000000000000000000000000000000000100000000000111000000000011000
0000000000000000000000000000000011000000000100000100000100010000
0000000000000000000000000000000000000000011000100000000100000010
0000000000000000000000000000000000001000000000000001000000010010
0000000000000000000000000000000000111100000000000000000100010100
0000000000000000000000000000000000100001111100000001010101011110
0000000000000000000000000000000010000010000010001010101010100011
0000000000000000000000000000000001011000000000000000000000000000
0000000000000000000000000000000001101011111111111111111111111111
0000000000000000000000000000000000110000001000000001100000001000
0000000000000000000000000000000010100010000111111110101010100001
0000000000000000000000000000000001100011111100000000000000000000

Embed Code Example

Filter set to 2048 bits

50,000 * 2,048 = 102,400,000 ints


var bloomFilter = []uint64{1942, 1696, 1762, 496, 1776, 1954, 1970, 1536, 494, 134, 128, 1680, 0, 1536, 
2016, 1952, 2047, 296, 1600, 1536, 0, 64, 1664, 1985, 2046, 2032, 1760, 1536, 416, 1536, 360, 1568, 256, 
1920, 384, 0, 0, 1780, 1920, 1536, 256, 2032, 0, 1792, 1536, 1540, 1988, 0, 146, 1664, 2047, 288, 256, 1888, 
256, 2011, 128, 1778, 1904, 354, 0, 200, 1952, 496, 1920, 403, 1687, 384, 128, 1600, 1664, 0, 1600, 1990, 1760, 
1536, 256, 0, 0, 1664, 418, 1860, 1952, 256, 128, 162, 1736, 266, 64, 1922, 64, 1800, 0, 2003, 1920, 2016, 384, 
// snip lots of integers here
1644, 1864, 1920, 64}

Searching

The core loop.

func Search(queryBits []uint64) []int {
	var results []int
	var res uint64

	for i := 0; i < len(bloomFilter); i += 2048 {
		res = bloomFilter[queryBits[0]+uint64(i)]

		for j := 1; j < len(queryBits); j++ {
			res = res & bloomFilter[queryBits[j]+uint64(i)]

			if res == 0 { // important! skip shard if nothing!
				break
			}
		}

		if res != 0 {
			for j := 0; j < 64; j++ {
				if res&(1<<j) > 0 {
					results = append(results, 64*(i/2048)+j)
				}
			}
		}

	}

	return results
}

Searching: Visually

Perform & between each row, and if we see 0 skip to next block

~3 ms to run in lambda


Crawling

Lists of top domains exist, just download 3 or 4 and merge.

for y in listOfUrls:
	get y

Output


[{
    "url": "https://engineering.kablamo.com.au/",
    "title": [
        "Kablamo Engineering Blog"
    ],
    "h1": [
        "Lessons Learnt Building for the Atlassian Marketplace",
        "What I Wish I Knew About CSS When Starting Out As A Frontender",
        "How to model application flows in React with finite state machines and XState"
    ],
    "h2h3": [
        "Our Partners"
    ],
    "h4h5h6": null,
    "content": [
        "THE BLOG",
        "Insights from the Kablamo Team.",
        "28.7.2021 - By Ben Boyter"
    ],
    "rank": 0.65
}]

Query Ranking

Post query ranking + Pre ranking = happy user

// defaults for BM25 which provide a good level of damping
k1 := 1.2
b := 0.75

for word, wordCount := range res.matchWords {
    freq := documentFrequencies[word]

    tf := float64(wordCount) / words
    idf := math.Log10(float64(corpusCount) / float64(freq))

    step1 := idf * tf * (k1 + 1)
    step2 := tf + k1*(1-b+(b*words/averageDocumentWords))

    weight += step1 / step2
}

Snippet Extraction

https://github.com/boyter/cs/blob/master/snippet.go

features, noble mien, and the report which was in general circulation within five minutes after his entrance, of his having ten thousand a year. The gentlemen pronounced him to be a fine
it. Dear, dear Lizzy. A house in town! Every thing that is charming! Three daughters married! Ten thousand a year! Oh, Lord! What will become of me. I shall go distracted.”

Searching


Time to search

Problem solved.

2021-09-13T14:33:34.114+10:00	Duration: 142.89 ms Billed Duration: 143 ms
2021-09-13T14:34:26.427+10:00	Duration: 6.44 ms Billed Duration: 7 ms 
2021-09-13T14:35:15.851+10:00	Duration: 3.40 ms Billed Duration: 4 ms 
2021-09-13T14:35:28.738+10:00	Duration: 1.10 ms Billed Duration: 2 ms 
2021-09-13T14:35:44.979+10:00	Duration: 6.11 ms Billed Duration: 7 ms 
2021-09-13T14:36:15.089+10:00	Duration: 70.31 ms Billed Duration: 71 ms 

Design


Building Index

Iterate the crawled documents


sb.WriteString(fmt.Sprintf(`var averageDocumentLength float64 = %d`, averageDocumentLength))

sb.WriteString(`var documentFrequencies = map[string]uint32{`)
for k, v := range newFreq {
    sb.WriteString(fmt.Sprintf("\"%s\": %d,", k, v))
}
sb.WriteString("}")

sb.WriteString("var bloomFilter = []uint64{")
for _, v := range bloom {
    sb.WriteString(fmt.Sprintf("%d,", v))
}
sb.WriteString("}")

_, _ = file.WriteString(fmt.Sprintf(`{Url:"%s",Title:"%s",Content:"%s",Score:%.4f},`,
			res.Url,
			res.Title,
			res.Content,
			res.Score))

Write out to content.go.1 content.go.2 etc...

Deployment

		
for i in {1..1000}
do
  cp ./content.go."$i" ./content.go
  GOOS=linux GOARCH=amd64 go build -ldflags="-s -w" -o main && 
  	zip main.zip main && 
	aws lambda create-function --function-name t"$i" --runtime go1.x --zip-file fileb://main.zip \
		--handler main --timeout 10 --memory-size 1024 \
		--role arn:aws:iam::000000000000:role/aws-lambda-search-LambdaServiceRole-1NOLIFEIFREAD
done

aws lambda update-function-configuration --function-name aws-lambda-search-controller \
	--environment '{"Variables": {"WORKERS": "t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14,t15,t16,t17,t18,t19,t20,t21,t22,t23,t24,t25,t26,t27,t28,t29,t30,t31,t32,t33,t34,t35,t36,t37,t38,t39,t40,t41,t42,t43,t44,t45,t46,t47,t48,t49,t50,t51,t52,t53,t54,t55,t56,t57,t58,t59,t60,t61,t62,t63,t64,t65,t66,t67,t68,t69,t70,t71,t72,t73,t74,t75,t76,t77,t78,t79,t80,t81,t82,t83,t84,t85,t86,t87,t88,t89,t90,t91,t92,t93,t94,t95,t96,t97,t98,t99,t100,t101,t102,t103,t104,t105,t106,t107,t108,t109,t110,t111,t112,t113,t114,t115,t116,t117,t118,t119,t120,t121,t122,t123,t124,t125,t126,t127,t128,t129,t130,t131,t132,t133,t134,t135,t136,t137,t138,t139,t140,t141,t142,t143,t144,t145,t146,t147,t148,t149,t150,t151,t152,t153,t154,t155,t156,t157,t158,t159,t160,t161,t162,t163,t164,t165,t166,t167,t168,t169,t170,t171,t172,t173,t174,t175,t176,t177,t178,t179,t180,t181,t182,t183,t184,t185,t186,t187,t188,t189,t190,t191,t192,t193,t194,t195,t196,t197,t198,t199,t200,t201,t202,t203,t204,t205,t206,t207,t208,t209,t210,t211,t212,t213,t214,t215,t216,t217,t218,t219,t220,t221,t222,t223,t224,t225,t226,t227,t228,t229,t230,t231,t232,t233,t234,t235,t236,t237,t238,t239,t240,t241,t242,t243,t244,t245,t246,t247,t248,t249,t250,t251,t252,t253,t254,t255,t256,t257,t258,t259,t260,t261,t262,t263,t264,t265,t266,t267,t268,t269,t270,t271,t272,t273,t274,t275,t276,t277,t278,t279,t280,t281,t282,t283,t284,t285,t286,t287,t288,t289,t290,t291,t292,t293,t294,t295,t296,t297,t298,t299,t300,t301,t302,t303,t304,t305,t306,t307,t308,t309,t310,t311,t312,t313,t314,t315,t316,t317,t318,t319,t320,t321,t322,t323,t324,t325,t326,t327,t328,t329,t330,t331,t332,t333,t334,t335,t336,t337,t338,t339,t340,t341,t342,t343,t344,t345,t346,t347,t348,t349,t350,t351,t352,t353,t354,t355,t356,t357,t358,t359,t360,t361,t362,t363,t364,t365,t366,t367,t368,t369,t370,t371,t372,t373,t374,t375,t376,t377,t378,t379,t380,t381,t382,t383,t384,t385,t386,t387,t388,t389,t390,t391,t392,t393,t394,t395,t396,t397,t398,t399,t400,t401,t402,t403,t404,t405,t406,t407,t408,t409,t410,t411,t412,t413,t414,t415,t416,t417,t418,t419,t420,t421,t422,t423,t424,t425,t426,t427,t428,t429,t430,t431,t432,t433,t434,t435,t436,t437,t438,t439,t440,t441,t442,t443,t444,t445,t446,t447,t448,t449,t450,t451,t452,t453,t454,t455,t456,t457,t458,t459,t460,t461,t462,t463,t464,t465,t466,t467,t468,t469,t470,t471,t472,t473,t474,t475,t476,t477,t478,t479,t480,t481,t482,t483,t484,t485,t486,t487,t488,t489,t490,t491,t492,t493,t494,t495,t496,t497,t498,t499,t500,t501,t502,t503,t504,t505,t506,t507,t508,t509,t510,t511,t512,t513,t514,t515,t516,t517,t518,t519,t520,t521,t522,t523,t524,t525,t526,t527,t528,t529,t530,t531,t532,t533,t534,t535,t536,t537,t538,t539,t540,t541,t542,t543,t544,t545,t546,t547,t548,t549,t550,t551,t552,t553,t554,t555,t556,t557,t558,t559,t560,t561,t562,t563,t564,t565,t566,t567,t568,t569,t570,t571,t572,t573,t574,t575,t576,t577,t578,t579,t580,t581,t582,t583,t584,t585,t586,t587,t588,t589,t590,t591,t592,t593,t594,t595,t596,t597,t598,t599,t600,t601,t602,t603,t604,t605,t606,t607,t608,t609,t610,t611,t612,t613,t614,t615,t616,t617,t618,t619,t620,t621,t622,t623,t624,t625,t626,t627,t628,t629,t630,t631,t632,t633,t634,t635,t636,t637,t638,t639,t640,t641,t642,t643,t644,t645,t646,t647,t648,t649,t650,t651,t652,t653,t654,t655,t656,t657,t658,t659,t660,t661,t662,t663,t664,t665,t666,t667,t668,t669,t670,t671,t672,t673,t674,t675,t676,t677,t678,t679,t680,t681,t682,t683,t684,t685,t686,t687,t688,t689,t690,t691,t692,t693,t694,t695,t696,t697,t698,t699,t700,t701,t702,t703,t704,t705,t706,t707,t708,t709,t710,t711,t712,t713,t714,t715,t716,t717,t718,t719,t720,t721,t722,t723,t724,t725,t726,t727,t728,t729,t730,t731,t732,t733,t734,t735,t736,t737,t738,t739,t740,t741,t742,t743,t744,t745,t746,t747,t748,t749,t750,t751,t752,t753,t754,t755,t756,t757,t758,t759,t760,t761,t762,t763,t764,t765,t766,t767,t768,t769,t770,t771,t772,t773,t774,t775,t776,t777,t778,t779,t780,t781,t782,t783,t784,t785,t786,t787,t788,t789,t790,t791,t792,t793,t794,t795,t796,t797,t798,t799,t800"}}'

Results...

  • ~50 million documents indexed
  • ~1000 lambda functions
  • 23.6 GB (31% of 75 GB) FREE STORAGE!!!!
  • In fact the whole thing is FREE if not being called!
  • TRUE SERVERLESS SEARCH!

bonzamate.com.au





What is the sound of 1000 lambdas searching themselves?


Conclusions

Go. I doubt I could have done it in another language.

Bloom filters are great!

AWS is ripe for abuse like this, DO IT!

AMD64 vs ARM64

Conclusions

  • Lambda layers... pointless for this, CPU weak
  • Its... probably not a good idea at this scale
  • Probably a really good idea for static websites?

A minimal version of the bloom filter index is available for you to play with https://github.com/boyter/indexer

Why?

  • Australian Greens Party called for a publicly owned search engine to be created similar to ABC.
  • Previously this approach has failed in Europe and Russia with Quaero and Sputnik
  • Australia is an exporter of research talent
  • I would love there to be more competition in the search space, and I would love to see an Australian search engine be part of it.
  • If a single person, working in their spare time with zero budget can make a reasonable POC...