
Lessons Learned from Using LLMs to Evaluate LLMs

These days, new LLMs (large language models) are being developed and released at a fast pace. Deploying an LLM, however, comes at a price — either monetary or computational — and that price can differ from model to model. We therefore need a way to evaluate how well LLMs perform various tasks so that we can choose the best model for each one.

Indeed, a recent survey of Traceloop customers indicates that quantifying the impact of changing models and/or prompts on the performance of standard tasks (such as text summarization) is a common requirement. In particular, two frequent questions are: “Does the improvement in performance gained by replacing model X with model Y justify the price of the replacement?” and “Does replacing prompt X with prompt Y improve the performance?”

To answer those questions, we first need to define an evaluation methodology. That is, what metrics or procedures will be used to compare different models?  

In the absence of methodological guidelines, several practices and habits have emerged, and they can sometimes lead to incorrect conclusions and poor decisions. 

In this article, we highlight some of the drawbacks and shortcomings of existing practices and discuss some trustworthy alternative approaches.

Using LLMs to Evaluate an LLM’s Performance — a Double-Edged Sword

As previously mentioned, evaluating the performance of a new prompt or model is a common task. A commonly used approach is asking an LLM to score an LLM’s generated text (whether the text was generated by the same LLM or a different one). And this approach may seem tempting, as LLMs can not only score the text but also explain the reasons for the score.

Nevertheless, while explanations might help increase our trust in the models, the question of comparing different LLMs, and quantifying gaps in performance, remains.

And this is clearly a quantitative rather than qualitative question. So for this approach to be reliable, it should meet a few criteria — for example, consistency of grading. That is, we expect the LLM to always score the same text the same way, even if we ask it to grade that text 100 times.  

You might argue that this requirement is unrealistic. Indeed, even when experimenting with a thermometer to measure a consistent temperature, we wouldn’t expect the exact same result if we repeated the experiment 100 times (because every measuring tool is prone to some error). But we do expect the measurement noise to be small, relative to the actual temperature. 

Clearly, the level of measurement noise can be quantified by the variance among repeated tests. Therefore, a relevant question is whether scores generated by LLMs have a small variance. Another (somewhat related) question is whether different LLMs have similar noise levels.

Next, we’ll explore the answers to these two questions.

GPT as a Scoring Tool — Performance Evaluation

We used GPT-4 to summarize one of the examples from the billsum dataset:

“The Federal Agency Protection of Privacy Act is a law that would require federal agencies to assess and anticipate the impacts of their proposed rules on individual privacy rights. The act mandates that agencies perform both an initial privacy impact analysis for proposed rules and a final privacy analysis for enacted rules. These analyses must examine things like the types of personal information to be collected, and how it'll be used and secured. The act would also allow for judicial review of agency compliance with these requirements and make clear that personally identifiable information includes things like a person's name, address, telephone number, social security number, and any other information that can be used to identify an individual. The bill includes a clause that will subject agency compliance with these provisions to judicial review.”

We then asked GPT-3.5 to score the level of redundancy in the generated text. We repeated the process 100 times and plotted the distribution of the resulting scores. As shown in the graph below, the variation in the scores is high, ranging from 22 to 100. Next, we repeated this process with GPT-4, using the same input text. As with GPT-3.5, the resulting score ranges from 20 to 90, and the probability of the score being less than or equal to 70 is the same as the probability of the score being greater than 70.

Note that while in this specific example we asked GPT to score the redundancy of the text, we observed qualitatively similar results when asking about relevance to given subjects and other textual properties of interest.
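For readers who want to run a similar check on their own texts, here is a minimal sketch of the repeated-scoring experiment, assuming the OpenAI Python SDK. The prompt wording, model name, and score parsing are illustrative rather than the exact setup we used.

```python
# Minimal sketch of the repeated-scoring experiment (illustrative, not the exact setup used here).
# Assumes the OpenAI Python SDK (openai>=1.0) and an OPENAI_API_KEY in the environment.
import re
import statistics

from openai import OpenAI

client = OpenAI()

SUMMARY = "The Federal Agency Protection of Privacy Act is a law that would ..."  # summary under test


def score_redundancy(text: str, model: str = "gpt-3.5-turbo") -> int | None:
    """Ask the model to rate redundancy on a 0-100 scale and parse out the number."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "On a scale of 0 to 100, how redundant is the following text? "
                "Answer with a single number only.\n\n" + text
            ),
        }],
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else None


scores = [s for s in (score_redundancy(SUMMARY) for _ in range(100)) if s is not None]
print(f"n={len(scores)}, min={min(scores)}, max={max(scores)}, "
      f"mean={statistics.mean(scores):.1f}, stdev={statistics.stdev(scores):.1f}")
```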

These results raise a question: What score should we believe? Or, more generally, can we trust GPT’s measurement ability? 

Given the fact that asking GPT to grade GPT-generated text (for example, text summaries) has become a common practice — both in order to assess the performance of a single model or prompt and in order to compare different models or prompts — addressing those questions is crucial. 

Possible Statistical Mitigations 

At a high level, there are (at least) two ways to bypass the problem we’ve just described. The first is to design statistical tests that take the variability of the resulting score into account. For example, assume that we aim to compare the performance of two models, and consider the standard two-sided (unpaired) t-test. In this case, the test statistic consists of the difference between the sample means normalized by the square root of the sum of the variances of the means.
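For concreteness, the statistic described above can be written as follows, where \bar{x}_X and \bar{x}_Y denote the mean scores obtained by models X and Y, and the hats denote estimates of the variances of those means:

```latex
t = \frac{\bar{x}_X - \bar{x}_Y}{\sqrt{\widehat{\mathrm{Var}}(\bar{x}_X) + \widehat{\mathrm{Var}}(\bar{x}_Y)}}
```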

In our scenario, the variance of the sample means consists of two components: the variance that can be attributed to the fact that we use a finite sample of input texts (also known as “between-text” variance), and a variance component that can be attributed to the variability of GPT as a scoring tool (also known as “within-text” variance).
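Under an idealized balanced design in which each of n input texts is scored m times (a simplifying assumption for illustration), the variance of a model’s mean score decomposes as:

```latex
\mathrm{Var}(\bar{x}) = \frac{\sigma^2_{\text{between}}}{n} + \frac{\sigma^2_{\text{within}}}{n\,m}
```

so adding repetitions (increasing m) shrinks only the within-text term, while adding input texts (increasing n) shrinks both.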

It’s instructive to note that this decomposition fits a 2-way ANOVA structure. Just as with ANOVA, to account for within-text variability, one needs to use the model to score each input text several times, calculate the “within-text” variance term per input text, and run (pairwise) statistical tests (for example, Tukey’s post-hoc tests).
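As a rough sketch of what that could look like in practice, the snippet below fits a two-way ANOVA on repeated scores and runs Tukey’s post-hoc comparison across models using statsmodels; the file name and the model/text_id/score column names are hypothetical.

```python
# Rough sketch: two-way ANOVA on repeated LLM scores, plus Tukey's post-hoc test across models.
# Assumes a long-format CSV with one row per (model, input text, repetition) score;
# the file name and column names are illustrative.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("scores.csv")  # columns: model, text_id, score

# Two-way ANOVA: a model effect and a text effect; the repeated scores per
# (model, text) cell supply the "within-text" variance estimate.
fit = ols("score ~ C(model) + C(text_id)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))

# Pairwise comparison of models, adjusted for multiple comparisons.
print(pairwise_tukeyhsd(endog=df["score"], groups=df["model"], alpha=0.05))
```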

But this approach has several drawbacks: First, the additional variability implies that more samples and more repetitions per input text are needed to reach a desired level of statistical significance. Second, repeated scoring per input might create a computational bottleneck and take a lot of time.

Also, note that unlike in the standard ANOVA setting, where repeated measures are assumed to be independent, in our scenario this assumption is violated due to the nature of the generation process (that is, using the same LLM to score the input text). We will delve deeper into the details of this approach in a later post, so stay tuned!

Standard Text Metrics    

An alternative approach is to resort to well-established text-quality metrics, of which there are plenty. These metrics are backed by years of research and, importantly, are deterministic for a given input text. In fact, designing metrics to capture desired textual properties has been the subject of ongoing research, and in light of recent LLM developments, they may come in handy. In recent blog posts, for example, we have introduced several such metrics, including GRUEN, BLEU, and ROUGE.
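As a quick illustration, here is a minimal sketch of computing two of these metrics (ROUGE and BLEU) with off-the-shelf packages; the reference and candidate texts are placeholders, and GRUEN is omitted for brevity. Because the same inputs always produce the same numbers, scores like these can be fed directly into the statistical tests discussed above.

```python
# Sketch: deterministic, reference-based metrics for a generated summary.
# Assumes the `rouge-score` and `nltk` packages; the texts below are placeholders.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "A human-written (gold) summary of the bill ..."
candidate = "The Federal Agency Protection of Privacy Act is a law that would ..."

# ROUGE: recall-oriented n-gram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

# BLEU: precision-oriented n-gram overlap (smoothed, since these are short texts).
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {bleu:.3f}")
```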

Clearly, using those metrics is less intuitive than asking GPT to score your text’s redundancy; being able to talk to GPT in plain, conversational language is appealing. Nevertheless, those metrics provide a reliable alternative that is guaranteed to measure the desired property, and to do so deterministically, which is a crucial element of running valid statistical procedures, especially when important decisions are based on their outcomes.

Final Thoughts 

Asking LLMs to score input text with respect to a variety of properties (for example, redundancy) has become common, so it’s crucial to ask whether LLMs are reliable for this task. Just like any other measuring tool, scale, or metric, an LLM must meet specific criteria for its results to be considered valid. One fundamental property we expect is consistency. In simple terms, consistency means that if we measure the same thing several times, the spread of the results should be small.

We tested the measuring ability of GPT-3.5 and GPT-4 using text summaries. Specifically, we asked GPT to score the text’s redundancy on a scale of 0 to 100. Unfortunately, as shown in many examples, the distribution of the resulting score spans the entire range and has high variance. We therefore conclude that LLMs are not reliable as a measuring tool. 

While they are outside the scope of this post, we pinpointed two ways to bypass the problem. The first consists of adjusting existing statistical tests. We illustrated this idea using the popular t-test, where repeated measures allow us to account for the “within-text” variation component.

A second approach to addressing the problem of inconsistency is to use well-defined and established metrics. These metrics have the advantage of being deterministic for a given input text, and therefore can be used in a straightforward manner. No less important is the fact that they rely on years of research, during which different metrics were designed to capture different properties. Those metrics’ theoretical properties are well studied, and their performance has been tested.

While there are caveats to using GPT for scoring, it’s worth mentioning that using GPT for binary classification (for example, assessing whether a given output text has or does not have a certain property) can provide satisfactory results. We will discuss this further in future posts.