
GRUEN's Outstanding Performance in LLM Quality Evaluation

Whether we’re building a simple regression model or a highly complex neural network, our success depends on its accuracy, so we need to be able to measure that accuracy. Luckily, there are many metrics to choose from. The right one to use depends on the data, the product, and the performance criteria. We’ve already explored different ways to evaluate model performance.

Metrics for numerical data are well established. However, evaluating natural language text presents a challenge. How do you assess an article’s linguistic quality? One option is to have people read and evaluate it. But in addition to the inevitable differences in opinion, this is a time-consuming and expensive process. It’s also not scalable, and therefore it’s not really a viable option.

With advancements in large language models (LLMs), we’re observing a huge increase in the number of applications that use LLMs for natural language generation (NLG) tasks. From writing emails to summarizing texts, and from chatbots to creating blog posts, LLMs have been adopted for a wide range of personal and commercial uses. So an automated way to assess the linguistic quality of AI-generated text — one that does not depend on human judgment or references — is necessary.

There are several metrics designed for evaluating AI-generated text. We’ve already explored two of them: ROUGE for summarization tasks and BLEU for machine translation tasks. The third one we’ll cover here is GRUEN, which outperformed other common metrics in an empirical evaluation across different types of NLP tasks, such as summarization and dialogue systems. In this article, we’ll learn what GRUEN is and how it measures linguistic quality. We’ll also examine its strengths, which make GRUEN a better choice than other available metrics.

What Is GRUEN?

Automatic evaluation of linguistic quality is not an easy task, and it’s even more challenging if you don’t rely on human references. GRUEN uses only the system output (that is, generated text) and does not require human references.

GRUEN is a combination of the following four metrics:

  1. Grammaticality
  2. Non-redundancy
  3. Focus
  4. Structure and coherence

The names of the metrics are self-explanatory, but let’s examine them in detail and provide some examples for each one.

Grammaticality

Grammaticality is simply a measure of whether sentences are grammatically correct; this is calculated by using the BERT language model. Incomplete sentences and grammar errors result in lower scores. 

Here are a couple of examples of sentences that would result in a low grammaticality score:

  • Jonathan given a ticket for exceeding the speed limit. (This sentence is grammatically incorrect: it’s missing the word “was” after “Jonathan.”)
  • The doctor told me the. (This is an incomplete sentence.)

The grammaticality score (yg) is calculated at the sentence level, so the system output is first tokenized into its component sentences. Then, for each sentence, a sentence likelihood score (li) and a grammar acceptance score (gi) are calculated using the BERT model.

BERT is well suited to predicting a masked word in a sentence, but it cannot directly compute a sentence’s likelihood. Therefore, the sentence likelihood is approximated by summing the masked (unigram) log probability of each word given the rest of the sentence:

li = ∑j log p(wi,j | wi,1, …, wi,j-1, wi,j+1, …, wi,k)

— where wi,j is the jth word in the ith sentence.

To find the grammar acceptance score, BERT is fine-tuned using the Corpus of Linguistic Acceptability (CoLA) dataset, which contains 10,657 sentences marked as grammatical or ungrammatical. The grammaticality score of a sentence is the linear combination of sentence likelihood and grammar acceptance scores. The overall grammaticality score of the system output is the average score of all the sentences.

yg = (1/n) ∑i (li + gi), where n is the number of sentences in the output
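
To make this concrete, here is a minimal Python sketch of the two ingredients: a BERT pseudo-likelihood for li and a stand-in for the CoLA-based acceptance score gi. It assumes the Hugging Face transformers and torch packages; the model name, the grammar_acceptance callable, and the omitted score rescaling are illustrative choices, not the paper’s exact implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative model choice; the official GRUEN code may use a different checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
mlm.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Pseudo-log-likelihood (li): mask each token in turn and sum the log
    probability BERT assigns to the original token given all the other tokens."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for pos in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone().unsqueeze(0)
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(input_ids=masked).logits
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        total += log_probs[input_ids[pos]].item()
    return total

def grammaticality(sentences, grammar_acceptance) -> float:
    """yg: average of (li + gi) over sentences. `grammar_acceptance` stands in for a
    BERT classifier fine-tuned on CoLA; the normalization GRUEN applies before
    combining the two terms is omitted here for brevity."""
    return sum(sentence_log_likelihood(s) + grammar_acceptance(s) for s in sentences) / len(sentences)
```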

Non-redundancy

Non-redundancy aims to detect repeated (that is, redundant) pieces of text in sentences in the system output. For instance, when a name can be represented with a pronoun (such as he or she), writing the name multiple times in succession is redundant. In some cases, AI-generated text repeats the exact same phrases. 

The non-redundancy score is composed of four measures:

  1. Length of the longest common part
  2. Number of words in the longest common part
  3. Edit distance
  4. Number of common words

For each pair of sentences (si , sj), these four measures are compared against predefined threshold values. The ones that exceed the threshold values are counted (mi,j) to get the final non-redundancy score (yr) based on the following formula:

yr = -0.1 × ∑i,j mi,j

Let’s go over an intuitive example. Consider the following two system outputs:

  • Max Verstappen, three-time F1 world champion, made a great start to the new season by winning the first two races. Max Verstappen, three-time F1 world champion, is the strongest championship candidate this season.
  • Max Verstappen, three-time F1 world champion, made a great start to the new season by winning the first two races. He is the strongest championship candidate this season.

In the first example, the first part of the second sentence is definitely redundant. “Max Verstappen, three-time F1 world champion,” should be replaced with “He” to eliminate this redundancy, as shown in the second example. The first example will be penalized for this and have a lower non-redundancy score than the second example.
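
Below is a rough Python sketch of the non-redundancy penalty: for every pair of sentences, count how many of the four similarity features exceed a threshold, then multiply the count by -0.1. The feature implementations and threshold values are placeholders (for example, SequenceMatcher’s ratio stands in for a true edit-distance feature), not the tuned values used by GRUEN.

```python
from difflib import SequenceMatcher
from itertools import combinations

THRESHOLDS = {
    "lcs_chars": 20,    # length of the longest common part, in characters (placeholder)
    "lcs_words": 4,     # number of words in the longest common part (placeholder)
    "edit_sim": 0.6,    # SequenceMatcher ratio, standing in for an edit-distance feature (placeholder)
    "common_words": 6,  # number of words the two sentences share (placeholder)
}

def pair_features(a: str, b: str) -> dict:
    """Compute the four redundancy features for one pair of sentences."""
    matcher = SequenceMatcher(None, a, b)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    common = a[match.a : match.a + match.size]
    return {
        "lcs_chars": len(common),
        "lcs_words": len(common.split()),
        "edit_sim": matcher.ratio(),
        "common_words": len(set(a.lower().split()) & set(b.lower().split())),
    }

def non_redundancy(sentences) -> float:
    """yr: -0.1 for every feature, in every sentence pair, that crosses its threshold."""
    count = 0
    for a, b in combinations(sentences, 2):
        feats = pair_features(a, b)
        count += sum(feats[k] >= THRESHOLDS[k] for k in feats)
    return -0.1 * count
```

Run on the two Max Verstappen outputs above, the first output should trigger more threshold crossings than the second, and therefore receive the lower non-redundancy score.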

Focus

Focus is a measure of semantic relatedness within adjacent sentences. When we read something, we prefer to stay focused on the subject and be guided by smooth transitions. Reading becomes annoying and difficult if the writer frequently changes the topic without a purposeful flow. So with AI-generated text, focus is another marker of high linguistic quality. For instance, in the following passage, the sentence in the middle can’t be considered closely related to the other two sentences, although they all mention health issues. This text will get a low focus score.

  • Chickenpox is a very contagious disease caused by varicella zoster virus. Vaccines play a crucial role in preventing the spread of viruses. Chickenpox is more severe in adults than in children. 

The focus score (yf) is based on the Word Mover Similarity calculation of each pair of adjacent sentences (si , si+1). If the similarity is less than the threshold of 0.05, then a penalty score of -0.1 is applied to the focus score. A focused output should have a focus score of 0. 
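
A small sketch of the focus penalty follows. The word_mover_similarity argument is a placeholder for the actual Word Mover Similarity computation, which relies on word embeddings; the crude lexical-overlap function in the usage example only illustrates the control flow.

```python
def focus(sentences, word_mover_similarity, threshold: float = 0.05) -> float:
    """yf: -0.1 for every adjacent sentence pair whose similarity falls below the
    threshold; a fully focused text keeps a score of 0."""
    penalty = 0.0
    for current, following in zip(sentences, sentences[1:]):
        if word_mover_similarity(current, following) < threshold:
            penalty -= 0.1
    return penalty

# Toy usage: a lexical-overlap similarity stands in for Word Mover Similarity.
def overlap(a: str, b: str) -> float:
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    return len(a_words & b_words) / max(len(a_words | b_words), 1)

print(focus([
    "Chickenpox is a very contagious disease caused by varicella zoster virus.",
    "Vaccines play a crucial role in preventing the spread of viruses.",
    "Chickenpox is more severe in adults than in children.",
], overlap))
```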

Structure and Coherence

The structure and coherence measure is related to the order of sentences within a text, which should be logical and easy to follow. Linguistic quality suffers when sentences are unorganized or in the wrong order, even if all the sentences include information about the same subject. The structure and coherence score is calculated by generating all possible consecutive pairs of sentences and then using a metric called sentence-order prediction (SOP) loss on these pairs.
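
As a sketch of the idea, consecutive sentence pairs can be scored with an SOP-style model and the results aggregated. The sop_score helper below is hypothetical (in practice it would come from an ALBERT-style sentence-order prediction head), and the simple averaging is an illustrative simplification of the SOP loss described in the paper.

```python
def structure_and_coherence(sentences, sop_score) -> float:
    """Average an SOP-style score over all consecutive sentence pairs.
    `sop_score(first, second)` should return how likely it is that `first`
    correctly precedes `second` (hypothetical helper, not a library function)."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 1.0  # a single sentence has no ordering to violate
    return sum(sop_score(first, second) for first, second in pairs) / len(pairs)
```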

GRUEN’s final linguistic quality score is the linear combination of the four metrics we discussed above. It evaluates a text from many different perspectives, creating reliable and consistent assessments.
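
As a toy illustration of that combination, the sketch below simply sums the four sub-scores and clips the result. The equal weighting and the clipping to [0, 1] are assumptions made for illustration, not the paper’s exact formulation.

```python
def linguistic_quality(y_g: float, y_r: float, y_f: float, y_c: float) -> float:
    """Combine the four sub-scores into one linguistic-quality score (illustrative)."""
    # y_r and y_f are penalties (<= 0); y_g and y_c are assumed to lie in [0, 1].
    raw = y_g + y_r + y_f + y_c
    return min(max(raw, 0.0), 1.0)
```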

What Makes GRUEN Stand Out from the Rest?

Plenty of metrics have been introduced to measure the linguistic quality of AI-generated text. Compared with others, GRUEN performs better on many types of tasks and datasets.

In text evaluation, the source of truth, at least for now, is human judgment. Therefore, automatic evaluation metrics should correlate highly with human judgments, and GRUEN achieved the highest correlation scores across different tasks. Correlation with human judgments is especially challenging on high-quality text: most existing metrics fail to assign high scores to good outputs, while GRUEN achieves decent correlation with human scores on such examples as well.

GRUEN needs only the system output (that is, the generated text) for evaluation, whereas most other metrics require human references. For instance, ROUGE, a commonly used metric for evaluating text-summarization algorithms, requires human references. This is costly, time-consuming, and a huge blocker in terms of scalability.

Another advantage of GRUEN over other metrics is its generalization capacity. Most metrics in this domain are task-specific. For instance, BLEU, a metric based on n-gram overlaps, is mainly used for evaluating machine-translated texts. As demonstrated on several datasets, GRUEN is applicable to a variety of tasks. Also, in contrast to most other metrics, GRUEN is deterministic, which is important for producing consistent evaluations.

Conclusion

For automatically measuring linguistic quality, GRUEN seems to achieve the best results, with correlation with human annotations as the main success criterion. Being applicable to different types of tasks, being unsupervised and reference-free, and being deterministic are the other factors that make GRUEN a better choice than the other options. If you’d like a more detailed overview of the empirical study and comparison results, we recommend the paper by the creators of GRUEN, listed in the references below.

References

  1. GRUEN for Evaluating Linguistic Quality of Generated Text (arXiv:2010.02498v1 [cs.CL])