to search OR NOT to search

McPotato
7 min read · Feb 26, 2021

Could you imagine what the internet was like before search engines?

I wouldn’t dare give that a thought. In this article I’m going to go over the basics of Information Retrieval (IR), an important & blossoming field of Computer Science.

Specifically, I will be talking about the main data structure behind IR and the Boolean Retrieval Model.

A bit of context

Recently at work, I started working with ElasticSearch.

I had multiple objectives:

  • Insert valuable business data (thanks Kafka Connect)
  • Monitor the behaviour of Elastic under load
  • Prepare some groundwork for an analytical usage

That introduced me to ElasticSearch’s indexes, shards and all that technical mumbo jumbo. I got curious and wanted to understand how queries could be so efficient.

I’d stumbled on an article about Full Text Search by Artem Krylysov a while back, and this was my Eureka moment: to better understand ES’s internals, I should build my own search engine!

And since I’m a Rust enthusiast, I’m going to do that in Rust 😎

The fundamentals

First, let’s go through the basics.

When you search for information, you usually do it on documents. A document is a unit that can be a lot of different things: a book page, a wiki page, a web page, etc. You build your IR system on a corpus, or collection, of documents.

To make the search efficient, we must split our document collection into smaller blocks called terms. A term is usually a word, but can also be a small group of words that acts as a unit of meaning.

This splitting operation produces a dictionary: the data structure containing the set of terms, also called the vocabulary or lexicon.

a naive dictionary built from a document’s abstract

Before I introduce the fundamental data structure behind IR, I’d like to go over one last definition: the postings list. A postings list is the list of document IDs associated with a term. We can also store additional data in a postings list, like the positions of the term in the document.

the postings list for “in”

Here, I introduced only one document. In practice, postings lists are much longer!

You were waiting for it, here it is:

Dictionary + postings list = Inverted Index

The Inverted Index is the main data structure behind IR — it’s what makes search so efficient.

One way to search for documents that contain a term would be to go through the corpus and check each document for the term. That makes the search O(n), where n is the number of documents: not very efficient.

With an Inverted Index, the search process becomes:

  • user wants documents containing British
  • we look up British in our dictionary and get the associated postings list
  • we return all the documents containing British to the user

With a data structure like a Hash Map (or Hash Table) representing the Inverted Index, that lookup is average-case constant time, O(1)!
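To make that concrete, here’s a toy sketch (not the article’s actual code) of the constant-time lookup, using the standard library’s HashMap:

```rust
use std::collections::HashMap;

fn main() {
    // A toy inverted index: term -> postings list of document IDs.
    let mut index: HashMap<&str, Vec<u32>> = HashMap::new();
    index.insert("british", vec![0, 2]);
    index.insert("gerbil", vec![1, 2]);

    // Searching for "british" is a single average-case O(1) lookup,
    // no matter how many documents are in the corpus.
    let hits = index.get("british").cloned().unwrap_or_default();
    assert_eq!(hits, vec![0, 2]);
}
```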

0.1-alpha

Scope 🎯

For the first version of my search engine, I wanted to implement a Boolean retrieval model.
The Boolean retrieval model is based on the boolean operators AND, OR and NOT: it lets you combine search terms with boolean logic.

My goals were to :

  • be able to perform boolean queries
  • build a basic inverted index
  • persist the index

Result

funny that Burton’s Gerbil wiki page contains wild but not small :’)

And for comparison, here’s a naive search:

It definitely feels slower, but we’d have to time it to measure how much of a difference there is.

Issues I faced during development

It took me so long to get the project & this article done because I got sidetracked into wanting to persist the data on disk.
For that, I decided to build a persistent KV store backed by SSTables. I didn’t realise how tough this was and thought I’d have something running pretty quickly. It took me about three weeks just to implement a SkipList.

I learned a lot trying to implement the KV store, but it was completely out of scope for a v0.1-alpha eheh.
I’ll come to it eventually, it feels good to have given it a thought though. I’ll probably follow up on what I learned in one of the next articles of this series !

I also thought I wasn’t going to implement the query parsing part; I almost gave up. But I came through in the end, and I’m happy about it. More on how I did it below!

The meat 🍖

Beware, Rust code ahead !

Corpus

I got the idea from Artem to use the Wiki Corpus to exercise my search engine’s capabilities. Get the latest abstract here.

First, I needed to parse the documents. The corpus being in XML format, I found quick-xml to help me out. A cool feature is that the crate supports serde.
All you have to do to deserialise your documents is to create a struct to contain the data.

Then you read the file’s contents into a string, and deserialising is as easy as this:

the xml variable contains the file’s content as a String

You can find the full source in src/wiki.rs.

Building the index

Once I had my corpus, I was ready to start working on the Inverted Index. I chose to implement the Inverted Index with a HashMap where the terms are the keys and the values are the postings lists.

The index struct carries serde’s derive attribute too. I have to say I didn’t save the index in the most optimised way: I just used serde-json. I wanted to persist my index because building it on a ~900 MiB corpus takes some time, and as you can see in the result gif, startup is fairly quick when loading from the JSON file. Note that on startup the documents have to be loaded into memory too.

Next up, building the index. For that, I created an add_wiki_doc function that takes a document and its id as parameters. For simplicity, the doc_id is the position of the document in the Vec<WikiDoc>.
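A hedged, std-only sketch of what such a function can look like (the names here are mine; the real add_wiki_doc works on the deserialised document struct):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Index {
    postings: HashMap<String, Vec<usize>>,
}

impl Index {
    // Tokenise the abstract, apply the lowercase filter, and append
    // doc_id to each term's postings list.
    fn add_doc(&mut self, doc_id: usize, abstract_text: &str) {
        for token in abstract_text.split_whitespace() {
            let term = token.to_lowercase(); // the only filter so far
            let list = self.postings.entry(term).or_insert_with(Vec::new);
            // keep the list free of duplicate IDs when a term repeats
            if list.last() != Some(&doc_id) {
                list.push(doc_id);
            }
        }
    }
}

fn main() {
    let mut index = Index::default();
    index.add_doc(0, "The British gerbil is wild");
    index.add_doc(1, "A small gerbil");
    assert_eq!(index.postings["gerbil"], vec![0, 1]);
}
```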

We base our search only on the abstracts of documents. We could also index the title, but I feel that in our context it’s pretty redundant.

An important step in making the search index as efficient as possible is to filter the terms (named tokens in the fn above). By filtering, I mean applying transformations that reduce the size of your dictionary and make the search easier and faster.
Since I decided to implement everything myself for educational purposes, I’ve only implemented a lowercase filter.
Stemming and lemmatisation are two things you could look into to make your dictionary more concise.

Querying your data

Once you’ve built your index, it’s time to search it!

For that, I started by implementing Dijkstra’s Shunting-Yard algorithm to parse my boolean query into postfix notation. I won’t show the code here because it’s too long, but I’ll go over the steps of how it works:

  • trim spaces, tabulations and carriage returns from the query string on both sides
  • split the string into tokens
  • check tokens for parentheses and add each parenthesis as its own token in the tokens vector
  • Apply Shunting-Yard :)

What I can do, though, is show you a passing test case:
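Here is what such a test can look like, against a compact, hypothetical re-implementation of the same Shunting-Yard idea (parse_query and the operator precedences are my own sketch, not the article’s code):

```rust
// Operator precedence: NOT binds tightest, then AND, then OR.
fn precedence(op: &str) -> u8 {
    match op {
        "NOT" => 3,
        "AND" => 2,
        "OR" => 1,
        _ => 0,
    }
}

// Turn an infix boolean query into postfix (RPN) notation.
fn parse_query(query: &str) -> Vec<String> {
    // tokenise, treating parentheses as their own tokens
    let spaced = query.trim().replace('(', " ( ").replace(')', " ) ");
    let tokens: Vec<&str> = spaced.split_whitespace().collect();

    let mut output: Vec<String> = Vec::new();
    let mut ops: Vec<&str> = Vec::new();
    for tok in tokens {
        match tok {
            "(" => ops.push(tok),
            ")" => {
                // pop operators until the matching open parenthesis
                while let Some(op) = ops.pop() {
                    if op == "(" { break; }
                    output.push(op.to_string());
                }
            }
            "AND" | "OR" | "NOT" => {
                while let Some(&top) = ops.last() {
                    // NOT is right-associative: don't pop equal precedence for it
                    let should_pop = top != "("
                        && (precedence(top) > precedence(tok)
                            || (precedence(top) == precedence(tok) && tok != "NOT"));
                    if should_pop {
                        output.push(ops.pop().unwrap().to_string());
                    } else {
                        break;
                    }
                }
                ops.push(tok);
            }
            term => output.push(term.to_string()),
        }
    }
    while let Some(op) = ops.pop() {
        output.push(op.to_string());
    }
    output
}

fn main() {
    // a passing test case
    assert_eq!(
        parse_query("wild AND NOT small"),
        vec!["wild", "small", "NOT", "AND"]
    );
}
```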

Our final stop in this article is the evaluation of the postfix notation. To do so, we iterate through the postfix vector and do the following:

  • if we find a term, we get its postings list from the HashMap and push it onto a stack
  • if we find a NOT operator, we run a NOT intersection between the list of all postings and the top postings list on the stack, then push the result onto the stack
  • if we find an AND operator, we pop the first two elements off the stack and intersect their postings lists, then push the result onto the stack
  • if we find an OR operator, we pop the first two elements off the stack and merge their postings lists, then push the result onto the stack
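The steps above can be sketched like this (a hedged, std-only version with my own function names; postings lists are kept as sorted Vec<usize>, and all_docs plays the role of the list of all postings used by NOT):

```rust
use std::collections::HashMap;

// Intersect two sorted postings lists (AND).
fn intersect(a: &[usize], b: &[usize]) -> Vec<usize> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        if a[i] == b[j] { out.push(a[i]); i += 1; j += 1; }
        else if a[i] < b[j] { i += 1; }
        else { j += 1; }
    }
    out
}

// Merge two postings lists, deduplicated (OR).
fn merge(a: &[usize], b: &[usize]) -> Vec<usize> {
    let mut out: Vec<usize> = a.iter().chain(b.iter()).copied().collect();
    out.sort_unstable();
    out.dedup();
    out
}

// Complement against the list of all postings (NOT).
fn not(all_docs: &[usize], a: &[usize]) -> Vec<usize> {
    all_docs.iter().copied().filter(|d| !a.contains(d)).collect()
}

fn evaluate(
    postfix: &[&str],
    index: &HashMap<String, Vec<usize>>,
    all_docs: &[usize],
) -> Vec<usize> {
    let mut stack: Vec<Vec<usize>> = Vec::new();
    for tok in postfix {
        match *tok {
            "AND" => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(intersect(&a, &b));
            }
            "OR" => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(merge(&a, &b));
            }
            "NOT" => {
                let a = stack.pop().unwrap();
                stack.push(not(all_docs, &a));
            }
            term => stack.push(index.get(term).cloned().unwrap_or_default()),
        }
    }
    stack.pop().unwrap_or_default()
}

fn main() {
    let mut index = HashMap::new();
    index.insert("wild".to_string(), vec![0, 2]);
    index.insert("small".to_string(), vec![1, 2]);
    let all = vec![0, 1, 2];
    // wild AND NOT small  ->  postfix: wild small NOT AND
    assert_eq!(evaluate(&["wild", "small", "NOT", "AND"], &index, &all), vec![0]);
}
```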

There you have it! Thanks for reading if you got this far, see you in part 2 👋

Potential improvements

  • Our NOT operator costs a lot because we intersect with the full list of postings. There must be a way to make this more efficient!
  • An easy improvement is to remove stopwords from the dictionary. I’d simply have to get a list, copy-paste it into a constant array, and voilà.
  • Find a format to store the index without JSON’s fluff.
  • There may be space for improvement on the document indexing part.

