Scoring the trustworthiness of Tool Calls

This tutorial demonstrates how to score the trustworthiness of every tool call (i.e. function call) made by your LLM, in order to automatically detect/prevent incorrect tool calls in real-time.

Here we focus on Tool Calls made using OpenAI’s Chat Completions API, but the same approach works for any LLM model/API. With minimal changes to your existing code, you can score the trustworthiness of every tool call.

We’ll consider a customer service AI Assistant to show how Cleanlab identifies potentially problematic tool calls before they execute. In cases where your LLM emits a Tool Call and Cleanlab’s trustworthiness score is low, your AI system might fall back to one of these actions:

  • escalate the interaction to a human employee
  • ask user to confirm the tool call before you execute it
  • direct your LLM to ask a follow-up question to get more information, then re-generate the tool call
  • replace the tool-call LLM output with a pre-written abstention response like “Sorry, I’m unsure how to help with that”.

Setup

This tutorial requires a TLM API key. Get one here.

The Python packages required for this tutorial can be installed using pip:

%pip install --upgrade cleanlab-tlm openai
import os

os.environ["CLEANLAB_TLM_API_KEY"] = "<Cleanlab TLM API key>" # Get your free API key from: https://tlm.cleanlab.ai/
os.environ["OPENAI_API_KEY"] = "<OpenAI API key>" # for using OpenAI client library

Example Application: Customer Service AI

Here we build a customer support AI assistant, which has access to several tools to help customers with various requests.

CUSTOMER_SERVICE_PROMPT = """You are a helpful customer service AI assistant for TechCorp.
Help customers with their requests using the available tools.
If a request is unclear or cannot be handled by the available tools, transfer the customer to a human agent.
Always be polite and professional.
"""

Here are tools that our AI can call: check order status, search products, schedule callbacks, and transfer to human agent.

customer_service_tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",
            "description": "Check the status of a customer's order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID to check"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_callback",
            "description": "Schedule a callback for the customer. Assume the customer's contact details were provided in the chat initiation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "preferred_time": {
                        "type": "string",
                        "description": "Preferred callback time. Available times are every 15 minutes from 9am to 5pm on weekdays."
                    },
                    "topic": {
                        "type": "string",
                        "description": "Topic for the callback if the customer provides one."
                    }
                },
                "required": ["preferred_time"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the customer to a human agent. Use this as the default option for any requests outside the assistant's defined capabilities, complex technical issues, billing disputes, or when the customer explicitly requests human assistance.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Reason for transferring to human agent"
                    },
                    "urgency": {
                        "type": "string",
                        "enum": ["low", "medium", "high"],
                        "description": "Urgency level of the issue"
                    }
                },
                "required": ["reason", "urgency"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search for products in the catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query for products"
                    },
                    "category": {
                        "type": "string",
                        "enum": ["electronics", "clothing", "home", "books", "all"],
                        "description": "Product category to search in. Default is 'all'."
                    }
                },
                "required": ["query"]
            }
        }
    }
]

Set up Tool-Calling LLM and Trust Scoring

We’ll use OpenAI’s API to generate tool calls, then score their trustworthiness with TLM. This follows the pattern from our Chat Completions tutorial.

Here we run TLM with default settings; you can achieve better/faster results via the optional configurations outlined in the TLMChatCompletion documentation.

from openai import OpenAI
from cleanlab_tlm.utils.chat_completions import TLMChatCompletion

# Initialize OpenAI and TLM clients
openai_client = OpenAI()

# See Advanced Tutorial for optional TLM configurations to get better/faster results
tlm = TLMChatCompletion(quality_preset="medium", options={"log": ["explanation"]})
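
For instance, to trade scoring quality against latency or cost, you can pass a different quality preset when constructing the client. The presets below are a hedged sketch based on the presets documented for TLM; check the TLMChatCompletion documentation for the exact options your version supports.

# Illustrative alternatives (sketch): pick a preset depending on your latency/cost needs
tlm_fast = TLMChatCompletion(quality_preset="low")   # faster/cheaper scoring
tlm_best = TLMChatCompletion(quality_preset="high")  # slower but more reliable scoring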

Let’s define a helper function that generates tool calls with OpenAI and then scores them with TLM. For simplicity, we’ll just consider single-turn prompts.

def get_customer_service_response_with_score(user_request, model="gpt-4.1-mini", verbose=True):
    """
    Get an AI response with trustworthiness score using TLM.

    Args:
        user_request (str): The customer's request
        model (str): The model to use
        verbose (bool): Whether to print results

    Returns:
        dict: Contains the ChatCompletion response and the TLMScore result
    """
    openai_kwargs = {
        "model": model,
        "messages": [
            {"role": "system", "content": CUSTOMER_SERVICE_PROMPT},
            {"role": "user", "content": user_request}
        ],
        "tools": customer_service_tools,
        "tool_choice": "auto"
    }

    response = openai_client.chat.completions.create(**openai_kwargs)

    score_result = tlm.score(response=response, **openai_kwargs)

    if verbose:
        from cleanlab_tlm.utils.chat import form_response_string_chat_completions_api

        print(f"Customer Request: {user_request}")
        print(f"TLM Score: {score_result['trustworthiness_score']:.3f}")
        print(f"Response Message:\n\n{form_response_string_chat_completions_api(response=response.choices[0].message)}\n")
        if "log" in score_result and "explanation" in score_result["log"]:
            print("-"*100)
            print(f"TLM Explanation:\n\n{score_result['log']['explanation']}\n")

    return {
        "response": response,
        "score_result": score_result,
    }

Example Scenarios

Let’s see how TLM works across different LLM tool calling scenarios.

When a customer request clearly maps to a specific tool with all required parameters provided, TLM typically assigns high trustworthiness scores (as shown below). These scenarios represent ideal automation candidates where you can trust your LLM to take the right action.

result1 = get_customer_service_response_with_score("Has my order TC-12345 shipped yet?")
Customer Request: Has my order TC-12345 shipped yet?
TLM Score: 1.000
Response Message:

<tool_call>
{
  "name": "check_order_status",
  "arguments": {
    "order_id": "TC-12345"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

Did not find a reason to doubt trustworthiness.

Sometimes your LLM needs to infer parameters or make reasonable assumptions about tool usage. Below, we see how TLM scores cases where the model must interpret implicit requirements from user input.

result2 = get_customer_service_response_with_score("What are your cheapest headphones? I need wireless ones.")
Customer Request: What are your cheapest headphones? I need wireless ones.
TLM Score: 0.883
Response Message:

<tool_call>
{
  "name": "search_products",
  "arguments": {
    "query": "wireless headphones",
    "category": "electronics"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

Did not find a reason to doubt trustworthiness.

When customer requests lack specific details, the AI must make assumptions to fill in missing parameters. TLM scores reflect the uncertainty inherent in these gap-filling scenarios. Ideally, these assumptions fall within the constraints of the available tools.

result3 = get_customer_service_response_with_score("I want to confirm something about my warranty. Whenever tomorrow morning is best to call.")
Customer Request: I want to confirm something about my warranty. Whenever tomorrow morning is best to call.
TLM Score: 0.883
Response Message:

<tool_call>
{
  "name": "schedule_callback",
  "arguments": {
    "preferred_time": "9:00",
    "topic": "Warranty inquiry"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

Did not find a reason to doubt trustworthiness.

Some customer requests require human expertise or fall outside the AI’s capabilities. In these cases, many systems rely on tool-calling to hand off to human agents, and TLM helps identify when the model correctly recognizes the need for escalation versus attempting to handle requests inappropriately.

result4 = get_customer_service_response_with_score("I got a defective laptop from you and I need a refund immediately.")
Customer Request: I got a defective laptop from you and I need a refund immediately.
TLM Score: 0.983
Response Message:

<tool_call>
{
  "name": "transfer_to_human",
  "arguments": {
    "reason": "Customer received a defective laptop and is requesting an immediate refund",
    "urgency": "high"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

Did not find a reason to doubt trustworthiness.

Sometimes the AI chooses not to use any tools and provides a direct response instead. TLM can also score the trustworthiness of these non-tool responses.

result5 = get_customer_service_response_with_score(user_request="I want to buy a new phone")
Customer Request: I want to buy a new phone
TLM Score: 0.917
Response Message:

I can help you with that. Could you please specify any preferences you have for the phone? For example, any particular brand, features, or price range you are looking for?

----------------------------------------------------------------------------------------------------
TLM Explanation:

Did not find a reason to doubt trustworthiness.

Understanding lower TLM Scores for Tool Calls

TLM’s real value lies in detecting problematic tool calls that could lead to the wrong actions. Let’s explore scenarios where the AI makes questionable decisions and see how TLM scores reflect these issues.

Here the customer provides an email instead of an order ID, but the AI attempts to use the order status tool anyway with invalid input.

result6 = get_customer_service_response_with_score("Check status of order under my id: john_doe@email.com.")
Customer Request: Check status of order under my id: john_doe@email.com.
TLM Score: 0.133
Response Message:

<tool_call>
{
  "name": "check_order_status",
  "arguments": {
    "order_id": "john_doe@email.com"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

The user asked to check the status of an order using their ID, which they provided as an email address ("john_doe@email.com"). The assistant responded by calling the "check_order_status" function with the order_id parameter set to the email address.

However, the function "check_order_status" expects an order ID, which is typically a unique alphanumeric code or number assigned to a purchase, not an email address. The user's input is ambiguous because they said "Check status of order under my id: john_doe@email.com," which could mean they want to check an order associated with their email, but the function does not support searching by email, only by order ID.

Since the assistant cannot assume the email is the order ID, the appropriate approach would be to ask the user for the actual order ID or transfer to a human agent if the request is unclear or cannot be handled by the available tools. Using the email as the order ID is likely to fail or return no results.

Therefore, the assistant's answer is factually incorrect because it misuses the function parameter and does not follow the instructions to avoid assumptions about values to plug into functions.

A better answer would be to politely ask the user to provide the actual order ID or transfer to a human agent.

Hence, the confidence in the correctness of the answer is very low.

When the customer’s request could reasonably map to multiple tools, the AI might choose the less appropriate one. This scenario often reveals tool selection confusion.

result7 = get_customer_service_response_with_score("I'm really frustrated with my order and need to speak with someone right away. Can you have someone call me back?")
Customer Request: I'm really frustrated with my order and need to speak with someone right away. Can you have someone call me back?
TLM Score: 0.433
Response Message:

<tool_call>
{
  "name": "schedule_callback",
  "arguments": {
    "preferred_time": "09:00",
    "topic": "Customer frustrated with order and wants to speak with someone right away"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

The user expresses frustration with their order and requests to speak with someone "right away." The assistant responds by scheduling a callback at 9:00 AM with a topic describing the user's frustration. While scheduling a callback is a reasonable response, the user explicitly wants to speak with someone immediately, which implies urgency. The available tools specify that the assistant should transfer the customer to a human agent for urgent or complex issues or when the customer explicitly requests human assistance. The user’s request to "speak with someone right away" fits this criterion, indicating a high urgency and a direct request for human interaction.

Scheduling a callback at 9:00 AM may not meet the user's immediate need, especially if the current time is later or outside business hours. The assistant should have transferred the user to a human agent with a high urgency reason rather than scheduling a callback, which could delay resolution and increase user frustration.

Therefore, the assistant's answer is not fully aligned with the instructions and the user's expressed urgency. A better response would be to call the "transfer_to_human" function with a reason indicating the user's frustration and urgency.

Given this, the answer is partially correct in that it attempts to help, but it does not follow the guidelines for urgent or explicit human assistance requests.

Score: low to moderate confidence in correctness.

When answering questions related to unavailable information, such as the return policy in the example below, the AI might provide a response that could contain inaccurate information rather than transferring to a human or acknowledging its limitations.

result8 = get_customer_service_response_with_score("What is your return policy?")
Customer Request: What is your return policy?
TLM Score: 0.567
Response Message:

Our return policy at TechCorp typically allows customers to return products within 30 days of purchase for a full refund or exchange, provided the items are in their original condition and packaging. Certain items may have specific return guidelines or restrictions. For the most accurate information, could you please specify the product category or provide more details about your purchase?

----------------------------------------------------------------------------------------------------
TLM Explanation:

The assistant's response provides a general description of a return policy, stating a 30-day return window for full refund or exchange, with conditions about original condition and packaging, and notes that some items may have specific restrictions. This is a common and plausible return policy for many retailers, but the assistant does not cite any specific TechCorp policy details or official documentation. Since the prompt does not provide any explicit information about TechCorp’s return policy or a tool to retrieve it, the assistant is relying on a generic, reasonable assumption rather than verified company-specific data.

The assistant also appropriately asks for more details to provide a more accurate answer, which is a good practice given the lack of specific information. However, without access to TechCorp’s actual return policy, the assistant cannot guarantee the accuracy of the stated 30-day period or other conditions.

Given the lack of explicit data or tools to confirm the policy, the answer is a reasonable, polite, and professional approximation but not guaranteed factually correct for TechCorp. The assistant should ideally have transferred the user to a human agent or stated that it does not have the exact policy details if it cannot verify them.

Therefore, the answer is likely correct in a general sense but not confirmed for TechCorp specifically, so the confidence should be moderate but not high.

Sometimes the AI calls a tool when a simple informational response would be more appropriate, leading to unnecessary complexity.

result9 = get_customer_service_response_with_score("Hi there! What's on offer today?")
Customer Request: Hi there! What's on offer today?
TLM Score: 0.733
Response Message:

<tool_call>
{
  "name": "search_products",
  "arguments": {
    "query": "offer"
  }
}
</tool_call>

----------------------------------------------------------------------------------------------------
TLM Explanation:

The user asked, "What's on offer today?" which is a general inquiry about current promotions or deals. The assistant responded by calling the "search_products" function with the query "offer." This is a reasonable approach because the assistant does not have direct knowledge of current promotions and the "search_products" function is designed to search the product catalog based on a query. Using "offer" as the search term could potentially return products that are on sale or featured as special offers.

However, there are some limitations. The query "offer" might not be the best or most precise keyword to find current deals or promotions, as it depends on how the product catalog is tagged or described. The assistant could have also asked a clarifying question or mentioned that it is searching for current offers. Alternatively, if the system had a dedicated function for promotions or deals, that would be preferable, but no such function is listed.

Given the available tools, the assistant's choice to use "search_products" with the query "offer" is a reasonable and factually correct way to attempt to fulfill the user's request. It is not guaranteed to return perfect results, but it is the best available option.

Therefore, the answer is factually correct and appropriate given the constraints.

Strategies to handle untrustworthy tool calls

After integrating Cleanlab into your tool-calling LLM application, you can automatically determine which tool calls are untrustworthy by comparing trustworthiness scores against a fixed threshold (say 0.75).

Here are fallback options you can consider when trust scores are low:

  1. Escalate untrustworthy tool calls for human (or user) approval before execution.

  2. Replace the untrustworthy tool call with an abstention response such as: “Sorry, I am unsure how to help with that.”

  3. Direct your LLM to ask a follow-up question to get more information, then re-generate the tool call.

  4. Along with your final LLM response, also show your user the raw tool call that was made, plus a disclaimer like: “CAUTION: This action was executed, but flagged as potentially untrustworthy”.
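
Putting this together, below is a minimal sketch that applies a trustworthiness threshold and falls back to an abstention response (option 2 above) when the score is low. The 0.75 cutoff and the execute_tool dispatcher are illustrative placeholders (not part of the Cleanlab or OpenAI APIs); adapt them to your application.

TRUST_THRESHOLD = 0.75  # illustrative cutoff; tune for your application
FALLBACK_MESSAGE = "Sorry, I am unsure how to help with that."

def handle_request(user_request):
    """Generate a response, then only act on tool calls that TLM deems trustworthy."""
    result = get_customer_service_response_with_score(user_request, verbose=False)
    message = result["response"].choices[0].message
    score = result["score_result"]["trustworthiness_score"]

    if message.tool_calls:
        if score < TRUST_THRESHOLD:
            # Fallback option 2: abstain instead of executing an untrustworthy tool call
            return FALLBACK_MESSAGE
        # Trustworthy tool call: execute it (execute_tool is a hypothetical dispatcher you would implement)
        tool_call = message.tool_calls[0]
        return execute_tool(tool_call.function.name, tool_call.function.arguments)

    return message.content  # no tool call was made, so return the assistant's text reply

For high-impact tools, you could instead apply a different fallback, such as asking the user to confirm the tool call (option 1) before executing it.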

Next Steps

Learn more about using TLM with the Chat Completions API. Beyond scoring Tool Call outputs, Cleanlab can also score the trustworthiness of any other type of LLM output.