In the last semester of my 1st year in uni, the final project on Data Structures required a team of 4 to create a search engine that could execute common search queries on a give data set.
You can clone the repository, build it with your C++ compiler, and run it. However remember to put the executable file inside the same folder as the data set. Link to source code on GitHub
At first, I was a little shocked at the request because I thought the way to do it was string and sub-string comparison on each items. Turned out, there was a way to store data such that we can find the location of a string and the article containing it in relatively fast speed. It was called a Trie tree. In this structure, each node contains information for a letter such as lists of positions in an article,a list of article’s IDs and links to the next letter. Traversing the tree will give us the word we want. After storing all the text files into trees, we can quickly look up the id of files containing the keywords we search for after traversing the word string only once. By doing it this way, we are essentially trading memory and initializing speed for processing speed. Since we are more likely to start the program once, but use it many times, and memory is cheaper than CPU, we are happy to take that trade-off. Below is an image to illustrate, courtesy of Wikipedia
Parsing data and queries
Before we can store these text files into the Trie tree we have to parse them for “words”. To do that, first we have to define what is word. We decided that word is a combination of letters wrapped by a set of tokens we defined. These tokens include blank spaces, line delimiter, braces, commas, … basically everything on the keyboard that isn’t a number or a exist in the alphabet. We parsed these files line by line, each line results in a vector of words and we store each of these words into the tree. Now here is what caused lots of the bugs: names of text files weren’t properly formatted and text files might be written using different encodings. This resulted in failed data storage of data and special characters that lead to correct results not showing. We were not allowed to make any changes to data files so we had to create a work around in our code using a lot of clunky IF statements that made the import function look like an abomination. You can view it in the source code, the function is called “Load_data”. As for parsing keywords and operators in the queries, it was the same as parsing for normal words, we stored everything in a vector and later use IF and switch-case statements to work on the operators.
For simple queries type such as AND, OR. We simple search each of the keywords then perform set operations on the set of results. For AND queries, we would take the intersection of results and take the union for OR queries. If there are no operators, we will assume it is a query of type OR. As for the more tricky query type, the incomplete match, we would use BFS to search for all results, starting at the last node of the operand. Individually, these operators didn’t pose much problems, but when combined into one complex query, we have to take into careful consideration the order of which we executes each operator and how we combine the results. Search history and autocomplete suggestions As a bonus feature, we managed to save the search history, show them and also use them to give autocomplete suggestions. The way we did was using a Trie tree to store the search history as well. Then use BFS again to give autocomplete suggestions every time the user inputs a character. This guaranteed the maximum score for our projects. Better UI While I was frantically testing with different combinations of queries and fixing bugs to ensure correct results as I was the team’s QC/QA specialist. My friends worked on enhancing the fluidity of our UI. Since it’s only a console application, most of the times, user interact by typing in and pressing enter. But with some research and determination, our application managed to interact directly with keyboard inputs in real time, which was not as easy as we thought. After typing in the query and pressing enter, users can navigate through the result pages by pressing “1” and “2” on the keyboard to move forward and backward between pages and the page numbers are shown at the bottom of the page as well.
This project was one of the best I ever had in college. The work was challenging and eye-opening. It felt extremely satisfying to pull off personally for me because I thought it was out of my league. I have always thought that you need to learn some advanced string comparison, and store data in a complex data structure using OOP to actually make a search engine that could execute queries in an amount time short enough for it to be useful. Turns out, I only lacked that one bit of knowledge about the existence of Trie tree to push through. Data structure was truly the key to this project.