DIY observability for LLMs with OpenTelemetry

With the rise of LLMs (large language models) in almost every area of technology, keeping an eye on how your LLMs are performing is increasingly important. With non-AI apps, monitoring is primarily focused on ensuring availability and low latencies. But the shaky nature of LLMs can result in inaccuracies, which can be as bad for your business as poor performance.

LLMs can be both self-hosted and managed. Providers like OpenAI offer a variety of managed models that you can choose from to get started with LLMs quickly. If you’re looking for a more customized LLM, you might want to start from scratch, build one that matches your needs, and host it yourself.

There are many reasons to host your own LLM; however, doing so involves a lot of manual work, and an LLM has a lot of internal aspects that you need to keep an eye on. For this reason, many users will want to go with a managed option such as OpenAI's GPTs. Be sure you understand your requirements before deciding on one or the other.

This article will show you how to set up LLM observability using OpenTelemetry for managed LLMs such as OpenAI's offerings. You’ll also learn about OpenLLMetry, an open-source, automated approach to this tedious manual process!

Why LLM Observability Is Important

As mentioned earlier, there are many reasons to monitor your LLMs. In addition to availability and latency, you should also track accuracy and costs.

For instance, one key way to use LLM observability is to improve your LLM’s responses by identifying prompt-response groups that have low evaluation scores. Then you can compare those groups with successful prompt-response groups, to learn how you can improve your prompts through prompt engineering and how to improve the quality of your responses, without having access to the LLM's internal characteristics (as is the case with managed LLMs).

And LLM observability has a host of other benefits:

Performance optimization: Managed LLMs communicate via APIs, so there may be API-related issues like latency, hitting rate limits, or reduced throughput in such integrations. Setting up LLM observability can help you keep an eye on these potential issues and optimize your app to better handle them.
Resource utilization: You can track resource consumption through token usage to gain insights into the resource requirements of your LLMs. Understanding resource requirements helps in cost management, so you can ensure optimal performance without unnecessary financial overhead.
Analyzing usage patterns: Observing LLM usage helps in monitoring usage-related concerns, such as attacks or unauthorized model access. By analyzing model usage, you can identify patterns that might indicate potential threats such as jailbreak attempts, prompt injection, and prompts triggering refusal of service responses.

While LLM observability is somewhat similar to traditional software observability, getting the most out of your monitoring setup requires that you keep some specific points in mind when implementing it. This includes choosing the right metrics and deciding between manual and automatic instrumentation. We’ll discuss these metrics in the following sections.

Important Metrics

While the performance metrics for LLMs are somewhat similar to those for traditional apps, you need to take special care with LLM-specific metrics such as token usage and error rate. Here’s a detailed list of metrics you should keep an eye on when starting with LLM observability:

Latency: This is the time taken for the LLM API to handle a request. Low latency is usually preferable, particularly in real-time use cases such as chatbots. You can track this metric across models if you use multiple models. This can help you evaluate the performance of various types of models.
Throughput: This is the number of inference requests processed by the LLM API per unit of time. This metric helps you understand the API's capacity to handle a certain volume of requests, influencing scalability and resource allocation. If your LLM API sets a low rate limit and you observe that you’re often getting close to or exceeding it, you might want to request a higher rate limit from the LLM provider or set up limits within your app to respect the rate limits.
Error rate: This is the percentage of inference requests that result in incorrect or incomplete responses. Monitoring error rates helps you assess the model's behavior and take measures to cover such cases and improve user experience.
Token usage: Keeping an eye on token usage can help you with cost planning and optimization.
Response-time-to-prompt-token ratio: This involves monitoring the number of tokens processed by the model and the time that it takes to process them. An anomalous value might indicate issues with the quality of your prompts.

You can always add more metrics to this list, depending on your needs. However, it’s important to identify the right metrics for your use case and focus on them instead of trying to capture everything around your models.

Additionally, you might also want to log prompt and response samples to better understand the quality of the LLM's responses. This needs to be handled with care, though, as the prompts might sometimes contain sensitive data from your users and might be subject to privacy laws.

Instrumenting OpenAI

As you probably already know, OpenAI is the leading provider of top LLMs, such as GPT-3 and the all-new GPT-4, with more than 100 million weekly active users as of November 2023. OpenAI provides its models through its API, which allows you to interact with the models of your choice.

While OpenAI is responsible for maintaining the quality and performance of the LLM itself, it provides users with a few usage-related statistics, such as token counts and model used. You can combine this with other information, such as request duration and error rate (based on user feedback), to understand, for instance, how the size of prompts or frequency of requests affects the performance and accuracy of the model, which directly affects your app’s user experience.

Using the collected data, you can play around with variables like temperature and response length to improve quality. If your app uses a lot of fixed prompts, you can improve your prompts by modifying them and observing the effect of the modifications on the quality of responses. Even if you don’t have absolute control over the LLM in this case, there’s a lot that you can do by implementing LLM observability.

When it comes to instrumenting OpenAI’s APIs, you can either choose a manual approach that involves instrumenting the API using OpenTelemetry, or use an open-source tool like OpenLLMetry that can help you auto-instrument the API and provide you with a host of features relevant to LLM monitoring and management.

Instrumenting OpenAI with OpenTelemetry

Instrumenting OpenAI with OpenTelemetry is similar to manually instrumenting your app with OpenTelemetry. To make things cleaner, you can consider taking a monkey patching approach. However, for simpler projects, you can safely go with a naive instrumentation approach, such as manually instrumenting your Python app using OpenTelemetry. To do that, you’ll first need to install the OpenTelemetry instrumentation package by running the following pip command:

pip install opentelemetry-sdk

Now you can use the following snippet to instrument a Chat Completions API call:

from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace import SpanKind
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

# Initialize the OpenTelemetry TracerProvider
provider = TracerProvider()

# Set up a ConsoleSpanExporter to view the created spans in your terminal (optional)
processor = BatchSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Set your OpenAI API key
client = OpenAI(
    api_key='your-api-key'
)

# Start a span manually
with trace.get_tracer(__name__).start_span("OpenAI_API_Request", kind=SpanKind.CLIENT) as span:
   
    # Define the conversation as a list of messages
    prompt = "What's the meaning of lorem ipsum?"
    model = "gpt-3.5-turbo"
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Make the API call
    response = client.chat.completions.create(
        model=model,
        messages=conversation
    )

    # Print the assistant's reply
    print(response.choices[0].message.content)

    # Add attributes to the span (optional)
    span.set_attribute("openai.model", model)
    span.set_attribute("openai.prompt", prompt)
    span.set_attribute("openai.response", response.choices[0].message.content)

This logs the model, the request, and the response in your OpenTelemetry data store. You can record more information, such as the usage object and the request duration, and build the metrics you need in your analytics tool. You can collect traces as well, but you will need to set them up manually.

‍

This approach is great for people looking to get started with a small set of metrics. It requires a lot of manual maintenance, though, so it might not be suitable for smaller teams looking to expand on observability later. In such cases, the next option might be more viable.

Automatic Instrumenting with OpenLLMetry

OpenLLMetry is an open-source project that helps auto-instrument LLM apps using OpenTelemetry. It works in a non-intrusive way, similar to manual instrumentation. It also gathers traces quite easily with minimal maintenance required from the user. The data that it collects can be sent to a wide range of destinations including Traceloop, Dynatrace, SigNoz, and OpenTelemetry Collector.

‍

To use OpenLLMetry to instrument the same script from before, you’ll first need to install the OpenLLMetry library, by installing the Traceloop SDK using the following command:

pip install traceloop-sdk

You can then use the following script:

from openai import OpenAI

# Import and initialize the traceloop tracer
from traceloop.sdk import Traceloop
Traceloop.init(disable_batch=True)

# Set your OpenAI API key
client = OpenAI(
    api_key='your-api-key'
)

# Define the conversation as a list of messages
prompt = "What's the meaning of lorem ipsum?"
model = "gpt-3.5-turbo"
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Make the API call
response = client.chat.completions.create(
    model=model,
    messages=conversation
)

# Print the assistant's reply
print(response.choices[0].message.content)

‍

Next, configure your OpenTelemetry collector by pointing the output of the Traceloop SDK to it. Do this by setting the TRACELOOP_BASE_URL to something similar to the following:

TRACELOOP_BASE_URL=https://<opentelemetry-collector-hostname>:4318

During development, you can skip setting the TRACELOOP_BASE_URL and use the temporary dashboard automatically created by Traceloop. Here’s how the traces collected from the auto-instrumented script would look:

‍

‍

You can try clicking on a trace to see more details about it.

‍

‍

You’ll notice that OpenLLMetry automatically collects data related to prompts and their responses. You can try clicking on the LLM Data and Details tabs to view more data collected as part of the trace. The LLM Data tab for the same trace looks like this:

‍

‍

Finally, you can also view the regular span attributes such as the Span Id, Span Kind, and Span Name, in the Details tab.

‍

‍

OpenLLMetry emits standard OTLP HTTP, so you’ll be able to access all these data points in any observability platform that you choose.

The auto-instrumentation option works well for teams who are serious about LLM observability and are looking to expand on it in the future. If your organization already has an observability setup that you use for other applications, you can easily integrate your LLMs into the same setup.

Final Thoughts

Setting up effective observability for your LLMs is indispensable for their sustained success. Monitoring the metrics discussed in this article, from accuracy and precision to resource utilization and concurrency, is vital for ensuring the reliability and efficiency of LLMs.

This article explained why you need observability for your LLMs, the metrics you should keep an eye on, and how to decide between manual instrumentation and a dedicated service that might also offer auto-instrumentation. Keeping these points in mind, you’re now ready to start instrumenting your LLMs!