SEO · 19 min read · November 4, 2020

Search Quality Metrics: What They Mean And How They Work

Stacy Mine
Editorial Head at Serpstat
Learning to Rank is used in information retrieval, natural language processing, and data mining. Since the inception of search engines, this area has seen significant progress: from naive keyword matching to complex algorithms and machine learning.
This article covers what we know about the metrics search engines use, the fundamental problems, and the existing learning approaches.
What is Learning to Rank
Let's start with the core concept: ranking. Ranking is a way of sorting items (sites, videos, images, news posts, etc.) based on their relevance.

By relevance, we mean the degree to which an object is related to a specific query. Suppose we have a query and several objects that correspond to it in one way or another. The better an object matches the query, the higher its relevance. The task of ranking is to return the most relevant objects in response to a query. The higher the relevance, the higher the likelihood that the user will take the targeted action (go to the page, buy a product, watch a video, etc.).

With the development of information retrieval systems, ranking becomes more and more important. The problem arises everywhere: when ordering search results pages, recommending videos, news, music, goods, and more. Learning to Rank exists for this purpose.

Learning to Rank, or machine-learned ranking (MLR), is a branch of machine learning that studies and develops self-learning ranking algorithms. Its main task is to determine the most effective algorithms and approaches based on their qualitative and quantitative assessment. Why did the problem of learning to rank arise in the first place?

For example, let's take a page of an information resource, such as an article. The user enters a query into a search engine that already contains a collection of documents. In response to the query, the system retrieves the matching documents from the collection, ranks them, and returns the most relevant ones first.

Ranking is performed with a scoring model f(q, d), where q is the user's query and d is the document. The classical f(q, d) model works without self-learning and doesn't consider the connections between words (for example, Okapi BM25, the vector space model, BIR models). It calculates a document's relevance to the query based on the occurrence of the query words in the document. Obviously, with the current volume of documents on the Internet, search results based on such simple models may not be accurate enough.
Under ideal conditions, when the documents being analyzed were written to solve the task at hand, such algorithms do their job well, and this approach worked successfully for a long time.
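To make the idea concrete, here is a minimal sketch of the kind of term-occurrence scoring that Okapi BM25 performs. The corpus, query, and parameter values (k1, b) are purely illustrative; real search engines use far more elaborate implementations.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query using only term occurrence (Okapi BM25)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)               # documents containing the term
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)  # rarer terms weigh more
        freq = tf[term]
        denom = freq + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * freq * (k1 + 1) / denom
    return score

# Toy corpus: each "document" is just a list of tokens.
corpus = [["learning", "to", "rank"], ["rank", "pages", "by", "relevance"], ["cat", "videos"]]
query = ["rank", "relevance"]
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
print(ranked[0])  # ['rank', 'pages', 'by', 'relevance']
```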

However, such algorithms have one significant drawback: the input data must strictly follow the rules, and the rules must strictly reflect the author's task.

If we set ourselves the task of manipulating the results of such an algorithm, we can solve it easily, since the algorithm was designed on the assumption that no one would try to manipulate it.

So the problem arose when search began to be monetized. Monetization encouraged people not just to submit documents for analysis but to present them in a way that gains an advantage over competitors. That is why, today, search results based on simple models cannot be accurate enough.
Demi Murych, Reverse Engineering and Technical SEO Specialist
Therefore, the trend changed. Machine learning replaced the simple classical model to improve search quality. Machine learning methods made it possible to build a ranking model automatically. It considers many relevance factors that previously could not be taken into account: for example, anchor texts, page authority, natural language analysis, competitors' keywords, and page user experience.
Learning to Rank is currently one of the critical tasks of modern web search. Over time, the most common metrics for assessing search quality have been established in this area.
How Learning to Rank is implemented
Learning to Rank is a full-fledged machine learning task that includes training and testing.

Training data includes queries, documents, and relevance judgments.
Each query is associated with several documents. It is not possible to judge the relevance of every document, so pooling is usually used: only the top documents retrieved by existing ranking models are judged. Training data can also be obtained automatically, for example from Google SearchWiki or by analyzing click logs and query chains.
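As an illustration (all queries, document IDs, and labels below are made up), such training data can be thought of as query-document pairs with graded relevance judgments, where only a pooled subset of documents per query is actually labeled:

```python
# Hypothetical training set: each row is (query_id, query_text, doc_id, relevance grade 0-5).
# Only the top documents retrieved by existing rankers are judged (pooling); the rest are skipped.
training_data = [
    (1, "learning to rank tutorial",  "doc_17", 5),
    (1, "learning to rank tutorial",  "doc_42", 3),
    (1, "learning to rank tutorial",  "doc_08", 0),
    (2, "bm25 vs vector space model", "doc_91", 4),
    (2, "bm25 vs vector space model", "doc_17", 1),
]
```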

The degree to which a document matches the query can be determined in several ways. The most common approach assumes that a document's relevance is based on several indicators. The better the document scores on an indicator, the higher its grade for that indicator. Relevance grades come from assessor labeling on a scale from 0 (irrelevant) to 5 (completely relevant). The grades for all indicators are then summed.

As a result, the most relevant document is the one with the highest total grade across all indicators. The training data is then used to build ranking algorithms that calculate the relevance of documents to real queries.

However, there is an important nuance here: user queries must be processed at high speed, and complex scoring schemes cannot be applied in full to every query. Therefore, ranking is carried out in two stages:
1
Using simpler algorithms, a small set of candidate documents is selected. This makes it possible to evaluate queries quickly.
2
This sample is then re-ranked using more complex (and more resource-intensive) machine learning models, as in the sketch below.
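A minimal sketch of this two-stage scheme might look like the following; the scoring functions and the cutoff k are stand-ins, not a description of any real engine.

```python
def cheap_score(query_tokens, doc):
    """Stage-1 stand-in: fast term-overlap score (in practice, something like BM25)."""
    return len(set(query_tokens) & set(doc["tokens"]))

def expensive_score(query_tokens, doc):
    """Stage-2 stand-in for a resource-intensive machine-learned ranking model."""
    overlap = len(set(query_tokens) & set(doc["tokens"]))
    return 0.6 * overlap + 0.4 * doc["authority"]  # toy mix of features

def two_stage_rank(query_tokens, index, k=100):
    # Stage 1: a cheap model trims the whole index down to a small candidate set.
    candidates = sorted(index, key=lambda d: cheap_score(query_tokens, d), reverse=True)[:k]
    # Stage 2: the expensive model re-ranks only those candidates.
    return sorted(candidates, key=lambda d: expensive_score(query_tokens, d), reverse=True)
```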
Ranking attributes
During the training and operation of MLR, each query-document pair is translated into a numerical vector of ranking features and other signals. They characterize the relationship between the query and the document, as well as their individual properties.

Features are divided into three groups:
1
Static features are independent of the query and refer to the document itself: for example, its length or the authority of the page (PageRank). These features are calculated during indexing and can be used for a static assessment of document quality. This helps speed up the evaluation of a search query.
2
Query features are those that depend only on the query: for example, its length or its topic.
3
Dynamic features are those that depend on both the document and the query: for example, how well the document matches the query.
This approach makes it possible to provide the user with the most accurate results on the SERP. Ranking features are collected in LETOR, a collection of benchmark datasets for research on learning to rank in information retrieval.
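Here is a sketch of a query-document feature vector, grouped the way described above; the feature names and values are invented for illustration.

```python
def make_feature_vector(query, doc):
    """Build an illustrative feature vector for one query-document pair."""
    q_tokens, d_tokens = set(query["tokens"]), set(doc["tokens"])
    overlap = len(q_tokens & d_tokens)
    return [
        # Static features: depend only on the document, precomputed at index time.
        doc["pagerank"],
        doc["length"],
        # Query features: depend only on the query.
        len(query["tokens"]),
        # Dynamic features: depend on the query-document pair.
        overlap,
        overlap / max(len(q_tokens), 1),  # share of query terms covered by the document
    ]

vector = make_feature_vector(
    {"tokens": ["learning", "to", "rank"]},
    {"tokens": ["rank", "pages"], "pagerank": 0.42, "length": 2},
)
print(vector)  # [0.42, 2, 3, 1, 0.333...]
```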
For a result to be considered relevant to a given search query, it must (1) provide a satisfactory amount of high-quality content, (2) in a straightforward and organized manner, (3) that addresses the correct or most likely intent(s) of the query.

To learn more about content quality, I suggest reading Google's search quality evaluator guidelines.

It's also important that the result has few or no relevance issues. An example of such an issue is when a page loses its helpfulness for the query with the passage of time. This is very common, for example, for queries with news intent, whose results can quickly become stale if they don't include the latest developments in the target story.

I believe that understanding the qualitative aspect of how search relevance works is much more important for marketers than trying to understand the actual metrics and science behind information retrieval systems. To achieve that objective, here is what I suggest:

Get into the habit of putting yourself in the search engine's shoes. Yes, I know that everybody talks about the user's shoes, but I'm trying to offer another perspective. If you were in charge of Search at Google, how would you assess the level of Expertise, Authority, and Trust of a given website? How would you determine the characteristics of the most relevant results for a given query? Does your content possess these characteristics for your target queries? Reading Google's guidelines and staying up to date with what's happening in SEO and Search can definitely help.
Danilo Godoy, founder of Search Evaluator
Search ranking metrics
Both binary (relevant/irrelevant) and multilevel (for example, relevance from 0 to 5) scales are used to evaluate each document returned in response to a query. In practice, queries can be ambiguous and have different shades of relevance. For example, the query "dog" is ambiguous: the machine doesn't know whether the user is looking for information about the animal, a music album, or the rapper Snoop Dogg.

In information retrieval theory, there are many metrics for assessing the performance of an algorithm on training data and for comparing different learning-to-rank algorithms. Open sources state that they are created in relevance scoring sessions where judges evaluate the quality of search results. However, common sense tells us that such an approach is hardly possible at scale, and here's why:
There are no such judges in machine learning, because even hundreds of millions of people could not provide enough answers to collect the data pool needed for a high-quality forecast.
That is why we have systems that can:

Recognize cats, because over the Internet's existence people have produced billions of ready-made images of cats.

Determine whether language is natural, because we have digitized many books in that language and know for sure that they are natural.

But we cannot assess the relevance of a query to a site, because we do not have such data. Even if you simply have assessors click through results, they will not cope with the task, since relevance is not just a match between text and query; it is hundreds of other factors that a person cannot yet assess in a reasonable time.
Demi Murych, Reverse Engineering and Technical SEO Specialist
Nevertheless, in information retrieval theory there are well-established evaluation metrics described by authoritative sources. Here are some of them.
MRR: Mean Reciprocal Rank
MRR is a statistical measure for evaluating any process that generates a list of possible responses to a sample of queries, ordered by probability of correctness. The mean reciprocal rank is defined as the average of the reciprocal ranks over all queries Q:
MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
where rank_i is the position of the first relevant document for query i.

This is the simplest metric of the three: it measures where the first relevant item appears. It is closely linked to the binary relevance family of metrics.

This method is simple to compute and easy to interpret; it focuses on the first relevant element of the list. It is best suited for targeted searches, such as a user asking for the "best item for me," and for known-item searches such as navigational queries or looking up a fact.

The MRR metric doesn't evaluate the rest of the list of recommended items. It focuses on a single item from the list.

It gives a list with a single relevant item just as much weight as a list with many relevant items. That is fine if that is the target of the evaluation.

This might not be a fair evaluation metric for users who want a list of related items to browse. The goal of such users might be to compare multiple associated items.
From the formula above, we know that in calculating MRR, we value each click with the reciprocal of its list position. For instance, if the searcher clicks on the first result in the list, we value it as 1; the second result would be valued at 0.5, and so on.

Thus, we take all of those values and average them over the total number of queries to get the mean reciprocal rank.

Therefore, an MRR of 1 is ideal. It means that your search engine puts the right answer at the top of the result list every time.
Andre Oentoro, Founder of Breadnbeyond
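Here is a short sketch of the MRR computation described above, using made-up binary relevance lists:

```python
def mean_reciprocal_rank(results_per_query):
    """results_per_query: for each query, a ranked list of 0/1 relevance judgments."""
    reciprocal_ranks = []
    for results in results_per_query:
        rr = 0.0
        for position, relevant in enumerate(results, start=1):
            if relevant:
                rr = 1.0 / position  # only the first relevant result counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# First relevant result at positions 1, 2 and 3 -> MRR = (1 + 1/2 + 1/3) / 3 ≈ 0.61
print(mean_reciprocal_rank([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
```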
MAP: Mean Average Precision
MAP, or mean average precision, evaluates the whole ranked list rather than just the first relevant result. For each query, average precision is computed by averaging the precision at every position where a relevant document appears; MAP is the mean of these values across the entire query set. So, rather than assuming there is only one winner, MAP rewards rankings that place many relevant results near the top.

MAP is well suited to evaluating rankings when you are looking at five or more results. That makes it ideal for evaluating related recommendations, such as on an eCommerce platform.
MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \mathrm{AveP}(q)

where AveP(q) is the average precision for query q.
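A minimal sketch of average precision and MAP on made-up binary judgments:

```python
def average_precision(relevances):
    """relevances: ranked list of 0/1 judgments for a single query."""
    hits, precisions = 0, []
    for position, relevant in enumerate(relevances, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / position)  # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(results_per_query):
    return sum(average_precision(r) for r in results_per_query) / len(results_per_query)

# Query 1: relevant docs at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ≈ 0.83
# Query 2: relevant doc at rank 2         -> AP = 1/2
print(mean_average_precision([[1, 0, 1], [0, 1, 0]]))  # ≈ 0.67
```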
DCG (Discounted cumulative gain) and NDCG (Normalized Discounted Cumulative Gain)
If we want to understand the NDCG metric, we must first understand CG (cumulative gain) and DCG (discounted cumulative gain), as well as the two assumptions we make when using DCG and related metrics:
Highly relevant documents are more useful when they appear earlier in the search results list (have higher rankings).
Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than irrelevant documents.
If each recommendation has a relevance score associated with it, the CG is the sum of the relevance scores of all results in the list:
CG_p = \sum_{i=1}^{p} rel_i
This is the cumulative gain at position p in the ranking, where rel_i is the graded relevance of the result at position i. Each relevance score is associated with a document.

The problem with CG is that it doesn't consider the position of results when determining the usefulness of a result set. In other words, if we change the order of the relevance scores, the CG stays the same, so it tells us nothing about whether one ordering is more useful than another.

For example:
Metric Set A: [3, 1, 2, 3, 2, 0], CG of Metric Set A: 11
Metric Set B: [3, 3, 2, 2, 1, 0], CG of Metric Set B: 11
Obviously, Metric Set B returns a much more useful result than Metric Set A, but the CG measure says they return equally good results.

To overcome this, DCG is introduced. DCG penalizes highly relevant documents that appear lower in the search results by discounting their relevance value logarithmically in proportion to the position of the result:
DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}
But with DCG, a problem arises when we want to compare a search engine's performance from one query to another, because the list of search results can vary in length depending on the query. Therefore, by normalizing the cumulative gain at each position by the ideal value for that query, we arrive at NDCG.

We accomplish this by sorting all the relevant documents in the corpus by their relative relevance, producing the largest possible DCG through position p (also known as the ideal DCG, or IDCG).
nDCG_p = \frac{DCG_p}{IDCG_p}
where the ideal discounted cumulative gain is:
IDCG_p = \sum_{i=1}^{|REL_p|} \frac{rel_i}{\log_2(i+1)}

Here REL_p is the list of relevant documents, ordered by relevance, up to position p.
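A minimal sketch that applies these formulas to the Metric Set A and B example above, using the log2(i + 1) discount shown here:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(i + 1) position discount."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))  # IDCG: the best possible ordering
    return dcg(relevances) / ideal if ideal else 0.0

set_a = [3, 1, 2, 3, 2, 0]
set_b = [3, 3, 2, 2, 1, 0]
print(sum(set_a), sum(set_b))                        # CG: 11 and 11, identical
print(round(dcg(set_a), 2), round(dcg(set_b), 2))    # DCG rewards B's better ordering
print(round(ndcg(set_a), 3), round(ndcg(set_b), 3))  # B achieves the ideal NDCG of 1.0
```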
Conclusion
The topic of metrics for assessing search quality is relevant and important today. Unfortunately, we can discuss it mostly from a theoretical standpoint. In our case, metrics for evaluating search quality remain largely speculation, because we cannot have all the data that Google has.

This means that choosing a methodology that would let us pin down the exact instrument is closer to fortune-telling, since we cannot verify anything here.

Yes, you can try to rely on patents or on statements published by one official or another. But 90% of all patents are rubbish that has nothing to do with what is actually programmed. And public statements are only part of the puzzle, made worse by the fact that all these people are bound by such strict NDAs that even the phrases that were supposedly dropped by accident are written into the contract.

All we can do is analyze and combine data from various sources, resulting in a document that describes the theoretical foundations of how search quality can be assessed.

Good luck and high positions to everyone!