Scoring the trustworthiness of Tool Calls
This tutorial demonstrates how to score the trustworthiness of every tool call (i.e. function call) made by your LLM, in order to automatically detect/prevent incorrect tool calls in real time.
Here we focus on Tool Calls made using OpenAI’s Chat Completions API, but the same approach works for any LLM model or API. With minimal changes to your existing code, you can score the trustworthiness of every tool call.
We’ll consider a customer service AI Assistant to show how Cleanlab identifies potentially problematic tool calls before they execute. In cases where your LLM emits a Tool Call and Cleanlab’s trustworthiness score is low, your AI system might fall back to one of these actions:
- escalate the interaction to a human employee
- ask the user to confirm the tool call before you execute it
- direct your LLM to ask a follow-up question to get more information, then re-generate the tool call
- replace the tool call LLM output with a pre-written abstention response like “Sorry, I’m unsure how to help with that”.
Setup
This tutorial requires a TLM API key. Get one here.
The Python packages required for this tutorial can be installed using pip:
%pip install --upgrade cleanlab-tlm openai
import os
os.environ["CLEANLAB_TLM_API_KEY"] = "<Cleanlab TLM API key>" # Get your free API key from: https://tlm.cleanlab.ai/
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>" # for using OpenAI client library
Example Application: Customer Service AI
Here we build a customer support AI assistant, which has access to several tools to help customers with various requests.
CUSTOMER_SERVICE_PROMPT = """You are a helpful customer service AI assistant for TechCorp.
Help customers with their requests using the available tools.
If a request is unclear or cannot be handled by the available tools, transfer the customer to a human agent.
Always be polite and professional.
"""
Here are the tools our AI can call: check order status, search products, schedule callbacks, and transfer to a human agent.
customer_service_tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",
            "description": "Check the status of a customer's order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID to check"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_callback",
            "description": "Schedule a callback for the customer. Assume the customer's contact details were provided in the chat initiation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "preferred_time": {
                        "type": "string",
                        "description": "Preferred callback time. Available times are every 15 minutes from 9am to 5pm on weekdays."
                    },
                    "topic": {
                        "type": "string",
                        "description": "Topic for the callback if the customer provides one."
                    }
                },
                "required": ["preferred_time"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the customer to a human agent. Use this as the default option for any requests outside the assistant's defined capabilities, complex technical issues, billing disputes, or when the customer explicitly requests human assistance.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Reason for transferring to human agent"
                    },
                    "urgency": {
                        "type": "string",
                        "enum": ["low", "medium", "high"],
                        "description": "Urgency level of the issue"
                    }
                },
                "required": ["reason", "urgency"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search for products in the catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query for products"
                    },
                    "category": {
                        "type": "string",
                        "enum": ["electronics", "clothing", "home", "books", "all"],
                        "description": "Product category to search in. Default is 'all'."
                    }
                },
                "required": ["query"]
            }
        }
    }
]
Set up Tool-Calling LLM and Trust Scoring
We’ll use OpenAI’s API to generate tool calls, then score their trustworthiness with TLM. This follows the pattern from our Chat Completions tutorial.
Here we run TLM with default settings; you can achieve better/faster results via optional configurations outlined in the TLMChatCompletion documentation.
from openai import OpenAI
from cleanlab_tlm.utils.chat_completions import TLMChatCompletion
# Initialize OpenAI and TLM clients
openai_client = OpenAI()
# See Advanced Tutorial for optional TLM configurations to get better/faster results
tlm = TLMChatCompletion(quality_preset="medium", options={"log": ["explanation"]})
Let’s define a helper function that generates tool calls with OpenAI and then scores them with TLM. For simplicity, we’ll just consider single-turn prompts.
def get_customer_service_response_with_score(user_request, model="gpt-4.1-mini", verbose=True):
    """
    Get an AI response with trustworthiness score using TLM.

    Args:
        user_request (str): The customer's request
        model (str): The model to use
        verbose (bool): Whether to print results

    Returns:
        dict: Contains the ChatCompletion response and the TLMScore result
    """
    openai_kwargs = {
        "model": model,
        "messages": [
            {"role": "system", "content": CUSTOMER_SERVICE_PROMPT},
            {"role": "user", "content": user_request}
        ],
        "tools": customer_service_tools,
        "tool_choice": "auto"
    }

    response = openai_client.chat.completions.create(**openai_kwargs)
    score_result = tlm.score(response=response, **openai_kwargs)

    if verbose:
        from cleanlab_tlm.utils.chat import form_response_string_chat_completions_api
        print(f"Customer Request: {user_request}")
        print(f"TLM Score: {score_result['trustworthiness_score']:.3f}")
        print(f"Response Message:\n\n{form_response_string_chat_completions_api(response=response.choices[0].message)}\n")
        if "log" in score_result and "explanation" in score_result["log"]:
            print("-" * 100)
            print(f"TLM Explanation:\n\n{score_result['log']['explanation']}\n")

    return {
        "response": response,
        "score_result": score_result,
    }
Example Scenarios
Let’s see how TLM works across different LLM tool calling scenarios.
When a customer request clearly maps to a specific tool with all required parameters provided, TLM typically assigns high trustworthiness scores (as shown below). These scenarios represent ideal automation candidates where you can trust your LLM to take the right action.
result1 = get_customer_service_response_with_score("Has my order TC-12345 shipped yet?")
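Beyond the printed summary, you can also inspect the generated tool call and its score programmatically from the returned dict. Here is a minimal sketch using the `result1` dict from above (the message fields follow the standard OpenAI Chat Completions response format):

# Inspect the structured tool call and its trust score from result1
message = result1["response"].choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")  # JSON-encoded string
print(f"Trust score: {result1['score_result']['trustworthiness_score']:.3f}")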
Sometimes your LLM needs to infer parameters or make reasonable assumptions about tool usage. Below, we see how TLM scores cases where the model must interpret implicit requirements from user input.
result2 = get_customer_service_response_with_score("What are your cheapest headphones? I need wireless ones.")
When customer requests lack specific details, the AI must make assumptions to fill in missing parameters. TLM scores reflect the uncertainty inherent in these gap-filling scenarios. Ideally, these assumptions fall within the constraints of the available tools.
result3 = get_customer_service_response_with_score("I want to confirm something about my warranty. Whenever tomorrow morning is best to call.")
Some customer requests require human expertise or fall outside the AI’s capabilities. In these cases, many systems rely on tool-calling to hand off to human agents, and TLM helps identify when the model correctly recognizes the need for escalation versus attempting to handle requests inappropriately.
result4 = get_customer_service_response_with_score("I got a defective laptop from you and I need a refund immediately.")
Sometimes the AI chooses not to use any tools and provides a direct response instead. TLM can also score the trustworthiness of these non-tool responses.
result5 = get_customer_service_response_with_score(user_request="I want to buy a new phone")
Understanding lower TLM Scores for Tool Calls
TLM’s real value lies in detecting problematic tool calls that could lead to incorrect actions. Let’s explore scenarios where the AI makes questionable decisions and see how TLM scores reflect these issues.
Here the customer provides an email instead of an order ID, but the AI attempts to use the order status tool anyway with invalid input.
result6 = get_customer_service_response_with_score("Check status of order under my id: john_doe@email.com.")
When the customer’s request could reasonably map to multiple tools, the AI might choose the less appropriate one. This scenario often reveals tool selection confusion.
result7 = get_customer_service_response_with_score("I'm really frustrated with my order and need to speak with someone right away. Can you have someone call me back?")
When asked about information it does not have access to, such as the return policy in the example below, the AI might respond with potentially inaccurate information rather than transferring to a human or acknowledging its limitations.
result8 = get_customer_service_response_with_score("What is your return policy?")
Sometimes the AI calls a tool when a simple informational response would be more appropriate, leading to unnecessary complexity.
result9 = get_customer_service_response_with_score("Hi there! What's on offer today?")
Strategies to handle untrustworthy tool calls
After integrating Cleanlab into your tool-calling LLM application, you can automatically determine which tool calls are untrustworthy by comparing trustworthiness scores against a fixed threshold (say 0.75); see the sketch after the list below.
Here are fallback options you can consider when trust scores are low:
- Escalate untrustworthy tool calls for human (or user) approval before execution.
- Replace the untrustworthy tool call with an abstention response such as: “Sorry, I am unsure how to help with that.”
- Direct your LLM to ask a follow-up question to get more information, then re-generate the tool call.
- Along with your final LLM response, also show your user the raw tool call that was made, plus a disclaimer like: “CAUTION: This action was executed, but flagged as potentially untrustworthy.”
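To make this concrete, below is a minimal sketch of threshold-based gating over the result dict returned by our helper function. Note that `escalate_to_human` and `execute_tool_call` are hypothetical placeholders for your application’s own escalation and execution logic:

TRUST_THRESHOLD = 0.75  # tune this threshold on your own data

def handle_result(result, threshold=TRUST_THRESHOLD):
    """Gate tool-call execution on the TLM trustworthiness score.

    `escalate_to_human` and `execute_tool_call` are hypothetical placeholders
    for your application's own logic.
    """
    message = result["response"].choices[0].message
    score = result["score_result"]["trustworthiness_score"]
    if message.tool_calls:
        if score < threshold:
            # Fallback: escalate instead of executing. You could instead ask the
            # user to confirm, re-prompt the LLM with a follow-up question, or
            # return a pre-written abstention response.
            return escalate_to_human(message)
        return execute_tool_call(message)
    return message.content  # non-tool response (you can optionally gate this too)

A single fixed threshold is the simplest policy; you might instead vary the threshold per tool, for example requiring higher trust scores before executing irreversible actions.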
Next Steps
Learn more about using TLM with the Chat Completions API. Beyond scoring Tool Call outputs, Cleanlab can also score the trustworthiness of any other type of LLM output.