• By Achin Gupta
  • In Non Tech
  • Posted September 1, 2017

 

Hey peeps, I have a firm belief that we have been ignoring the importance of the glossary: the list of words at the end of many books, each pointing to the pages where it appears.

A glossary is a tool that makes extensive searching easier, faster and smoother. Also known as a vocabulary or clavis, it is an alphabetical list of terms in a particular domain of knowledge, with definitions for those terms. Traditionally, a glossary appears at the end of a book and includes terms within that book that are newly introduced, uncommon, or specialized. But who knew that someone would use the idea to create one of the most valuable technologies on the market today, capable of breaking barriers in the realm of asking questions and searching?

That technology is Apache Solr, which powers the search and navigation features of many of the world’s largest websites and companies, handling billions of documents. All of that capability comes from the smart use of a glossary or, as Solr calls it, an INVERTED INDEX (sometimes loosely called reverse indexing).

For someone like me, whose technical knowledge is limited to the for loop and the if statement, here is the short version: an inverted index is designed to allow very fast full-text searches. It consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears.

For example, let’s say we have two documents, each with a content field containing the following:

  1. Custom searching solution by SolrExperts is a must-have instrument for your enterprise.
  2. Many Enterprises have deployed SolrExperts searching solutions via the custom tool developed.

To create an inverted index, we first split the content field of each document into separate words (which we call terms or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this (the words in BOLD are stop words):

 

[Figure: Inverted Index]
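For the curious, the steps above can be sketched in a few lines of Python. This is only a toy built on two documents modeled on the ones above, not how Solr does it internally; the function names `tokenize` and `build_inverted_index` are made up for this example, and no stop-word handling is attempted.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sort the terms, glossary style, and turn each set of ids into a list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "Custom searching solution by SolrExperts is a must-have instrument for your enterprise.",
    2: "Many Enterprises have deployed SolrExperts searching solutions via the custom tool developed.",
}

index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(f"{term:12} {doc_ids}")
```

Notice that, with no normalization at all, Custom and custom land in the index as two separate terms, and so do enterprise and Enterprises.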

 

 

There are a lot of challenges in implementing reverse indexing and deciding its rules. Every user wants a query to analyse words in a different way. Some might want upper- and lowercase forms treated as the same word but, on the other hand, have no need for stemming. These aspects of an inverted index depend on your use case.

The following aspects were kept in mind while creating the inverted index for this particular use case:

  • Custom and custom do not appear as separate terms because the user probably thinks of them as the same word.
  • enterprise and enterprises are pretty similar; they share the same root word, so they are treated as the same word.
  • instrument and tool, while not from the same root word, are similar in meaning. They are synonyms.

If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:

  • Custom can be lowercased to become custom.
  • enterprises can be stemmed (reduced to its root form) to become enterprise.
  • instrument and tool are synonyms and can be indexed as just the single term tool.
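Here is a rough sketch of those three rules in Python. Everything in it is an assumption made for illustration: the `SYNONYMS` table holds just the one pair from this post, and `stem` is a toy that only strips a plural "s", nothing like the real stemming algorithms a search engine would use.

```python
import re

# Hypothetical synonym table, containing only the pair used in this post.
SYNONYMS = {"instrument": "tool"}


def stem(term):
    """Toy stemmer: strip a trailing plural 's' from longer words."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def normalize(term):
    """Lowercase, stem, then map synonyms onto a single term."""
    term = term.lower()
    term = stem(term)
    return SYNONYMS.get(term, term)


def analyze(text):
    """Tokenize a piece of text and normalize every token."""
    return [normalize(token) for token in re.findall(r"[A-Za-z0-9]+", text)]


for word in ["Custom", "enterprises", "instrument"]:
    print(word, "becomes", normalize(word))
```

The order matters here: lowercasing comes first so that the stemmer and the synonym table only ever have to deal with lowercase words.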

Even so, querying your data now gives roughly the results you want, but not precise ones. This is because we haven’t yet applied to the query string the same normalization rules that we applied to the content field.

Note

You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form. Once that is done, the following terms match in both documents:

  • Custom – custom
  • enterprises – enterprise
  • instrument – tool

So when we query for custom tool, we get both of the documents above.
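To see the whole thing end to end, here is a small Python sketch that runs the same normalization over both the documents and the query string. The helper names (`analyze`, `build_index`, `search`) are invented for this example, the stemming is a toy plural-stripper, and the search requires every query term to be present, which is a choice made for this sketch rather than a statement about Solr's defaults.

```python
import re
from collections import defaultdict

# Hypothetical synonym table, containing only the pair used in this post.
SYNONYMS = {"instrument": "tool"}


def analyze(text):
    """Tokenize, lowercase, toy-stem (strip plural 's') and apply synonyms."""
    terms = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        term = token.lower()
        if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
            term = term[:-1]
        terms.append(SYNONYMS.get(term, term))
    return terms


def build_index(docs):
    """Build an inverted index of normalized terms to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every normalized query term."""
    terms = analyze(query)  # same rules as the indexed text
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


docs = {
    1: "Custom searching solution by SolrExperts is a must-have instrument for your enterprise.",
    2: "Many Enterprises have deployed SolrExperts searching solutions via the custom tool developed.",
}

index = build_index(docs)
print(search(index, "custom tool"))
print(search(index, "Custom Instrument"))
```

Both queries return documents 1 and 2, because by the time the query reaches the index it has been reduced to the very same terms, custom and tool.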

Apache Solr made use of a book’s technical aspect to create a world that is intuitive and responsive. Use the full extent of this technology to look for the unexpected. You may never get tired of searching, because each question leads to a different pathway.

Get lost. It’s too deep a forest, and I challenge you to go inside. Be careful.

YOU MAY NEVER BE ABLE TO ESCAPE THE VAST VAST WORLD OF WORDS.

Achin Gupta

