Compute Trustworthiness Scores for any LLM
Before this tutorial, we recommend first completing the Quickstart Tutorial.
Most of the TLM examples shown throughout this documentation (including the quickstart tutorial) demonstrate how to use Cleanlab’s Trustworthy Language Model, which is powered by a predefined set of underlying LLMs you can choose from via the model key in TLMOptions.
TLM is actually a general-purpose wrapper technology that can make any LLM more trustworthy. This tutorial demonstrates how to produce TLM trustworthiness scores for your own LLM. Here we demonstrate this for the Llama-3 model, run locally on your own laptop via Ollama, an open-source tool for running LLMs. You can replace Ollama with any other LLM and still follow this tutorial.
Setup
This tutorial requires a TLM API key. Get one here.
The Python packages required for this tutorial can be installed using pip (we used langchain-community version 0.0.34):
%pip install --upgrade cleanlab-studio langchain-community
from langchain_community.llms import Ollama
from cleanlab_studio import Studio
import pandas as pd
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
Here’s how you would regularly use your Ollama LLM:
my_llm = Ollama(model="llama3")
# Example prompt, feel free to ask other questions.
prompt = "What is the 3rd month if we list all the months of the year in alphabetical order?"
response = my_llm.invoke(prompt)
print(response)
Here’s how to compute the corresponding TLM trustworthiness score for this Llama-3 LLM response:
from cleanlab_studio import Studio
studio = Studio("<API key>") # Get your free API key from: https://tlm.cleanlab.ai/
cleanlab_tlm = studio.TLM() # See Advanced Tutorial for optional TLM configurations to get better/faster results
trustworthiness_score = cleanlab_tlm.get_trustworthiness_score(prompt, response=response)
print(trustworthiness_score["trustworthiness_score"])
Let’s define an object we can prompt and get both the LLM response and associated trustworthiness score:
class TrustworthyLanguageModel:
    """Class that returns responses from your LLM and associated trustworthiness scores."""

    def __init__(self, response_llm, score_tlm):
        self.response_llm = response_llm
        self.score_tlm = score_tlm

    def prompt(self, prompt: str):
        """
        Returns a dict with keys: 'response' and 'trustworthiness_score',
        where the response is produced by your response_llm.
        This implementation assumes response_llm has an invoke(prompt) method.
        """
        output = {}
        output['response'] = self.response_llm.invoke(prompt)
        output['trustworthiness_score'] = self.score_tlm.get_trustworthiness_score(
            prompt, response=output['response']
        )["trustworthiness_score"]
        return output

    def prompt_batch(self, prompts: list[str]):
        """
        Version of prompt() where you can pass in a list of many prompts
        and get lists of responses and trustworthiness scores back.
        This implementation assumes response_llm has a batch(prompts) method.
        """
        outputs = {}
        outputs['response'] = self.response_llm.batch(prompts)
        trustworthiness_scores = self.score_tlm.get_trustworthiness_score(
            prompts, response=outputs['response']
        )
        outputs['trustworthiness_score'] = [score['trustworthiness_score'] for score in trustworthiness_scores]
        return outputs
my_tlm = TrustworthyLanguageModel(my_llm, cleanlab_tlm)
output = my_tlm.prompt(prompt)
print(f"Response: `{output['response']}` \n \n Trustworthiness: `{output['trustworthiness_score']}`")
This allows you to easily obtain responses and associated trustworthiness scores for any LLM!
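The same wrapper works for any other LangChain-compatible LLM that exposes invoke() and batch() methods. As a minimal sketch (assuming another model such as "mistral" has already been pulled into your local Ollama installation; that model name is just an illustrative assumption), you could score a small batch of prompts from a second model like this:

other_llm = Ollama(model="mistral")  # assumed to be available locally; swap in any model you have
other_tlm = TrustworthyLanguageModel(other_llm, cleanlab_tlm)

batch_outputs = other_tlm.prompt_batch([
    "Which planet has the most moons?",
    "How long is one day on Venus?",
])
for resp, score in zip(batch_outputs["response"], batch_outputs["trustworthiness_score"]):
    print(f"Trustworthiness: {score:.3f} | Response: {resp}")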
Running our custom TLM over a dataset of prompts
First, let’s load an example query dataset: a collection of space-related trivia questions. We can use each question as a prompt for our custom TLM and record its response and associated trustworthiness score.
wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_dataset.csv'
wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/solar_system_example_prompts.csv'
df = pd.read_csv("solar_system_example_prompts.csv")
df.head()
Let’s use TLM to get the trustworthiness score for every prompt/LLM-response pair.
results = df.copy(deep=True)
outputs = my_tlm.prompt_batch(results["prompt"].to_list())
results["response"] = outputs["response"]
results["trustworthiness_score"] = outputs["trustworthiness_score"]
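Before inspecting individual responses, it can help to quickly summarize the distribution of scores (a plain pandas sketch, nothing TLM-specific), which is also useful context when choosing a threshold later:

results["trustworthiness_score"].describe()  # summary statistics of the scores across the dataset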
To see which Llama-3 LLM responses are least trustworthy, let’s sort the data by the trustworthiness scores. Observe how model responses with the lowest trustworthiness scores convey unhelpful or factually inaccurate answers, sometimes referred to as hallucinations.
results.sort_values(by="trustworthiness_score").head()
On the other hand, the model responses with the highest scores are direct, accurate, and appropriately detailed.
results.sort_values(by="trustworthiness_score", ascending=False)
Now we can automatically estimate which LLM responses are confidently good vs. not. And we can do this for any LLM!
How to use these scores? If you have time/resources, your team can manually review low-trustworthiness responses and provide a better human response instead. If not, you can determine a trustworthiness threshold below which responses seem untrustworthy, and automatically append a warning statement to any response whose trustworthiness falls below the threshold.
threshold = 0.5  # chosen by inspecting responses across different trustworthiness ranges

if output["trustworthiness_score"] < threshold:
    output["response"] = output["response"] + "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"
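If you are scoring a whole dataset, a short sketch of the same idea applied over the results DataFrame from above (using the column names defined earlier) might look like:

warning = "\n CAUTION: THIS ANSWER HAS BEEN FLAGGED AS POTENTIALLY UNTRUSTWORTHY"
flagged = results["trustworthiness_score"] < threshold  # boolean mask of low-trust responses
results.loc[flagged, "response"] = results.loc[flagged, "response"] + warning

print(f"Flagged {flagged.sum()} of {len(results)} responses as potentially untrustworthy.")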
The overall magnitude/range of the trustworthiness scores may differ between datasets, so we recommend choosing thresholds in an application-specific manner: first consider the relative trustworthiness levels across different data points before interpreting the overall magnitude of any individual score.
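One way to pick such an application-specific threshold is to flag a fixed fraction of the least trustworthy responses rather than a fixed absolute cutoff. A minimal sketch (the 10% fraction below is an arbitrary illustrative choice, not a recommendation from this tutorial):

threshold = results["trustworthiness_score"].quantile(0.10)  # flag the bottom 10% of scores
print(f"Flagging responses with trustworthiness below {threshold:.3f}")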
Use your own LLM to compute the trustworthiness scores too
Note that in this tutorial, the trustworthiness scores for the Ollama model’s responses are computed internally using the LLMs powering Cleanlab’s TLM, not your Ollama model.
If you want to rely entirely on your own LLM to produce both the responses and the trustworthiness scores (and to improve the accuracy of your LLM’s responses), this is possible too!
Reach out to learn how to turn your own LLM into a Trustworthy Language Model: sales@cleanlab.ai