Ranking: The Secret Sauce for Searching the Deep Web

How does AMPLYFI’s deep-web search engine rank results?

Every search engine uses its own algorithms to rank results. The ranking may favour the most recent page, the page with the highest advertising bid, the most relevant page, the most visited page, or, often, a combination of these factors. For example, Google’s PageRank counts the number and quality of links to a page to produce a rough estimate of how important the website is. It assumes that important websites are more likely to receive links from other websites.

AMPLYFI’s search engine DeepResearch searches both the surface and the deep web. But how does it rank results and quantify relevance? And why is ranking important?

One of the most powerful features of DeepResearch is its ability to rank the results from the multitude of collections that might be included in a federated search (also known as deep-web search). This is useful for two reasons. First, it ranks results from sources that do not rank results themselves. As a consequence, any search service that provides results from these sources and ranks them, such as MedNar.com (a product of Deep Web Technologies, which AMPLYFI acquired in 2020), adds tremendous value. Second, ranking federated search results distills hundreds, or perhaps thousands, of results into a prioritised list that makes research easier and more efficient.

So, what is the secret sauce for ranking? And can one ranking method be better than another?

This post outlines what goes into the secret sauce behind AMPLYFI’s federated search. Ranking is a very important element when comparing deep-web search tools, although whether one ranking method is better than another can come down to personal opinion.

AMPLYFI’s Ranking Algorithm

Ranking is based on relevance, with the most relevant results being ranked highest. We compute relevance by (1) creating root-words for the query terms and results, (2) conducting relevance weighting for a number of factors, and (3) using our proprietary algorithms to rank all the results from any given search.

(1) Creating Root-Words: Stemming

Stemming is the process of converting words to their base or root words. In the simplest case, it makes sure that a pluralised search term will find singular terms in the results, and vice versa. In English this can be as simple as dropping “s” or “es” from words, but it can become more complex, for example with irregular pairs such as “mouse/mice” and “person/people”. DeepResearch uses language-specific stemming implementations based on the Snowball stemming framework.

For the most part, we do not need to stem search terms before submitting them to the collections we search. On occasion, we may need to explicitly indicate to a collection that we want to perform a stemmed search or an exact search.
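To make the plural and singular matching described above concrete, here is a minimal sketch in Python. It is a deliberately naive suffix stripper, not the Snowball algorithm that DeepResearch uses, and the `IRREGULAR` lookup table and function names are hypothetical, introduced only for illustration.

```python
# Naive English suffix stripping, for illustration only.
# This is NOT Snowball: a real stemmer applies many more rules
# and this sketch will get plenty of words wrong.

# Hypothetical lookup table for irregular forms.
IRREGULAR = {"mice": "mouse", "people": "person"}


def naive_stem(word: str) -> str:
    """Return a rough root form of an English word."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"          # queries -> query
    if len(w) > 3 and w.endswith(("xes", "ches", "shes")):
        return w[:-2]                # boxes -> box
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # engines -> engine
    return w


def stems_match(query_term: str, result_term: str) -> bool:
    """True when both terms reduce to the same root."""
    return naive_stem(query_term) == naive_stem(result_term)
```

With this sketch, a search for “engines” would match a result containing “engine”, because both reduce to the same root before comparison.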

(2) Conducting Relevance Weighting

We analyse search-term occurrence within a search result and assign weights for different factors. We look for the occurrence of exact terms and stem terms, and then assign relative weights to different results fields. We can assign higher weights to results from a specific collection based on importance or result dates. We also consider:

Search-Term Position – We examine where search terms appear within particular fields (i.e. title, author, precis) and afford special consideration for whether a search term occupies the first word position, last word position, or relative position to either.
Search-Term Density – We quantify how often search terms appear within fields (i.e. individual fields and the full record). Aside from counting the number of occurrences of search terms within fields, we consider the ratio of search-term length to result-field length. For example, a one-word title that is the same as the search term would be highly relevant.
Search-Term Proximity – We consider how close search terms are to one another. When evaluating this, we look at the number of search terms within the query expression and the distance between recurring search terms. In returned results, this ratio, in conjunction with the length of the fields, can be significant.
Search-Term Ordinality – If search terms are in the same order as specified in the search expression, this can be significant and is afforded greater weight than if the order does not match the search expression. Likewise, multiple occurrences of ordinality are important.
Result Recency – Results that are more recent are given greater weighting than older results.
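
The factors above can be pictured as small scoring functions combined through a weighted sum. The sketch below is an assumption-laden illustration, not AMPLYFI’s proprietary algorithm: every formula, the `WEIGHTS` and `FIELD_WEIGHTS` values, the one-year half-life, and the result field names are hypothetical, and stemming is left out to keep it short.

```python
from datetime import date

# Hypothetical weights; real values would be tuned per deployment.
WEIGHTS = {"position": 1.0, "density": 2.0, "proximity": 1.5,
           "ordinality": 1.0, "recency": 0.5}
FIELD_WEIGHTS = {"title": 3.0, "precis": 1.0}


def tokenize(text):
    return text.lower().split()


def position(terms, tokens):
    """1.0 if a term is the first word, 0.5 if it is the last, else 0.0."""
    if not tokens:
        return 0.0
    if tokens[0] in terms:
        return 1.0
    return 0.5 if tokens[-1] in terms else 0.0


def density(terms, tokens):
    """Share of field words that are search terms."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in terms) / len(tokens)


def proximity(terms, tokens):
    """Distinct terms divided by the smallest window containing them all."""
    wanted = set(terms)
    if not wanted or not wanted.issubset(tokens):
        return 0.0
    best = len(tokens)
    for start in range(len(tokens)):
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in wanted:
                seen.add(tokens[end])
            if seen == wanted:
                best = min(best, end - start + 1)
                break
    return len(wanted) / best


def ordinality(terms, tokens):
    """Number of times the terms appear consecutively in query order."""
    n = len(terms)
    if n == 0:
        return 0
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == list(terms))


def recency(result_date, today, half_life_days=365):
    """Exponential decay: the score halves every half_life_days."""
    age = max((today - result_date).days, 0)
    return 0.5 ** (age / half_life_days)


def score(query, result, today):
    """Weighted sum of all factors over the weighted fields."""
    terms = tokenize(query)
    total = 0.0
    for field, field_weight in FIELD_WEIGHTS.items():
        tokens = tokenize(result.get(field, ""))
        total += field_weight * (
            WEIGHTS["position"] * position(terms, tokens)
            + WEIGHTS["density"] * density(terms, tokens)
            + WEIGHTS["proximity"] * proximity(terms, tokens)
            + WEIGHTS["ordinality"] * ordinality(terms, tokens))
    if result.get("date") is not None:
        total += WEIGHTS["recency"] * recency(result["date"], today)
    return total


# Usage: rank two made-up results for the query "deep web".
results = [
    {"title": "Web crawling in the deep ocean of data",
     "precis": "Crawling notes", "date": date(2021, 1, 1)},
    {"title": "Deep web", "precis": "An overview of the deep web",
     "date": date(2021, 1, 1)},
]
ranked = sorted(results,
                key=lambda r: score("deep web", r, date(2022, 1, 1)),
                reverse=True)
```

In this example the result titled “Deep web” ranks first: its title consists entirely of the search terms, in query order, which drives up its density, proximity and ordinality scores.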

(3) Proprietary Algorithms

Having analysed the exact search terms and stemmed search terms against the factors above and assigned weights, we use our proprietary algorithms to assign an actual rank. These algorithms operate on the Boolean operators AND, OR, and NOT, with the search-query expression being evaluated from left to right. Exact phrases (contained within quotation marks) are not stemmed. If a date range is specified, the date is used as a constraining term, provided that a date is associated with a result. If there is no such date, the relevance for that result is assumed to be zero (i.e. it is not ranked). Note that such results may still appear in the results list.
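
As a rough illustration of left-to-right evaluation and the date-range rule, consider the sketch below. It is not the proprietary algorithm: the function names are hypothetical, NOT is treated as “and not” (an assumption), quoted phrases are not handled, and returning `None` for a dated result outside the range is also an assumption.

```python
from datetime import date


def evaluate(expression, text):
    """Evaluate e.g. 'solar OR wind AND coal' strictly left to right,
    with no operator precedence."""
    tokens = set(text.lower().split())
    parts = expression.split()
    if not parts:
        return False
    value = parts[0].lower() in tokens
    i = 1
    while i + 1 < len(parts):
        op, term = parts[i].upper(), parts[i + 1].lower()
        hit = term in tokens
        if op == "AND":
            value = value and hit
        elif op == "OR":
            value = value or hit
        elif op == "NOT":          # assumption: NOT means "and not"
            value = value and not hit
        else:
            raise ValueError(f"unknown operator: {parts[i]}")
        i += 2
    return value


def apply_date_range(relevance, result_date, start, end):
    """Undated results get relevance 0.0 but stay in the list.
    Assumption: dated results outside the range are excluded (None)."""
    if result_date is None:
        return 0.0
    if start <= result_date <= end:
        return relevance
    return None
```

Left-to-right evaluation matters: for the text “solar power”, the expression “solar OR wind AND coal” is read as (solar OR wind) AND coal and does not match, whereas conventional precedence would read it as solar OR (wind AND coal) and match.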

Finally, stop words are words considered irrelevant for searching purposes and are not evaluated. Stop words include terms such as: a, about, also, because, between, but, do, each, however, I, often, the, they, to, what, when and would.
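
Stop-word removal can be sketched in a few lines. The set below contains only the examples listed above, not the full list DeepResearch uses, and the function name is hypothetical.

```python
# Only the example stop words named in this post; the real list is longer.
STOP_WORDS = {"a", "about", "also", "because", "between", "but", "do",
              "each", "however", "i", "often", "the", "they", "to",
              "what", "when", "would"}


def remove_stop_words(query):
    """Drop stop words from a query, keeping the remaining word order."""
    return [w for w in query.split() if w.lower() not in STOP_WORDS]
```

For example, `remove_stop_words("what is the deep web")` keeps only “is”, “deep” and “web” for evaluation.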

In Summary

AMPLYFI’s proprietary ranking algorithm considers a number of factors and assigns relative weights to the relationship between the search terms and the results. For example, historic searches benefit from putting more weight on term relevance than on result recency, and the reverse holds for a current news search. In addition, the weighting of results can be modified to suit client needs. If a customer would like, for example, to boost the weighting of their internal documents, we can adapt the ranking algorithm as necessary. In all cases, the final ranking itself adds significant value to any deep-web search.
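
One way to picture this kind of tuning is as a set of weight profiles plus a per-collection boost. The profile names, weights, boost value and collection name below are all hypothetical and only illustrate the idea.

```python
# Hypothetical weight profiles: how much relevance and recency each count.
PROFILES = {
    "historic": {"relevance": 0.9, "recency": 0.1},
    "news": {"relevance": 0.4, "recency": 0.6},
}

# Hypothetical boost for a client's own collection.
COLLECTION_BOOST = {"internal-documents": 1.5}


def final_score(relevance, recency, collection, profile):
    """Blend relevance and recency (both 0..1), then apply any boost."""
    w = PROFILES[profile]
    base = w["relevance"] * relevance + w["recency"] * recency
    return base * COLLECTION_BOOST.get(collection, 1.0)


# An older, highly relevant result versus a fresh, weaker one.
old_strong = (0.9, 0.1)
new_weak = (0.5, 0.9)
```

Under the “historic” profile the older, more relevant result scores higher, while under the “news” profile the fresher result wins, which mirrors the trade-off described above.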

Updated by Louise O’Reilly from original publication at http://deepwebtechblog.com/ranking-the-secret-sauce-for-searching-the-deep-web/