Leverage Historical Support Tickets to Improve your AI
This tutorial walks you through the process of transforming historical customer support conversations into high-quality question-answer (QA) pairs that can be ingested into Codex and served as expert answers for similar future queries. After you add your historical support tickets to Codex, your AI assistant will be able to answer significantly more customer queries accurately and helpfully.
We assume you have a database of historical customer support tickets, in which your human support employees previously provided high-quality assistance to customers via a recorded chat.
Alternatively, if your documentation has an FAQ section, you can follow this same tutorial to ingest it into Codex as expert answers. This enables FAQs to be served to your users more precisely and accurately (in particular, overcoming issues in your current AI system that may stem from data formatting, document chunking, suboptimal retrieval, …).
import os
from cleanlab_codex.client import Client
import shutil
os.environ["CODEX_API_KEY"] = "<CODEX_API_KEY>" # Get your free API key from: https://codex.cleanlab.ai/
os.environ["CLEANLAB_TLM_API_KEY"] = "<TLM_API_KEY>" # Get your free API key from: https://tlm.cleanlab.ai/
codex_client = Client()
Form Question-Answer Pairs from Message Histories
We’ll first transform raw customer support conversations into structured question-answer (QA) pairs. Later we’ll filter those QA pairs to remove answers that are low-quality or likely outdated.
Steps to follow:
- Pre-process your data: Ensure it is organized as conversations, where each conversation is a sequence of User and Assistant messages.
- Deduplicate conversations: To save subsequent processing time.
- Extract QA pairs: From each conversation, extract the user’s primary initial question and the corresponding assistant answer.
- Deduplicate queries: To save subsequent processing time.
In this tutorial, we’ll demonstrate this process using a subset of the ABCD dataset.
import pandas as pd
df = pd.read_csv("https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/abcd_subsampled.csv")
df.head(1)
| | messages | metadata |
| --- | --- | --- |
| 0 | Assistant: Thanks for contacting AcmeBrands, h... | Background Information:\n The conversation sta... |
Let’s look at the formatted data. Below we have a conversation history between a User (aka customer) and an Assistant (aka your human customer service representative, whose answers we can trust – not an AI model!). We expect the history to be in `messages` list format, with each conversation turn indicated as `role: content` on its own line.
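For illustration, a single entry in the messages column might look like the hypothetical example below (not an actual row from this dataset):

example_conversation = (
    "User: Hi, I never received my order. Can you help?\n"
    "Assistant: Sorry about that! Could you share your order number?\n"
    "User: Sure, it's 12345.\n"
    "Assistant: Thanks. I've issued a replacement shipment for order 12345."
)
print(example_conversation)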
idx = 72
print(df.iloc[idx]['messages'])
Deduplicate message histories to ensure uniqueness
print("initial size", df.shape)
df = df.drop_duplicates(subset=["messages"])
print("final size after deduplication:", df.shape)
(Optional) Add Background Information into Messages
Optionally, you can prepend any background information (metadata) that is relevant to each conversation (date, user info, etc.). This helps ensure that Cleanlab’s AI receives all relevant context about the conversation when processing it to extract expert answers.
df['messages'] = df['metadata']+'\n' + df['messages']
print(df['messages'][idx])
Extract initial Question and Answer from message history
We’ll use Cleanlab’s TLM to extract one structured question and answer pair from each conversation. Cleanlab will analyze the conversation (including any metadata), and identify the user’s primary initial question and the assistant’s corresponding answer to this question.
from cleanlab_tlm import TLM
# Optional configurations to control runtimes/costs:
TLM_QUALITY_PRESET = "base"
tlm = TLM(quality_preset=TLM_QUALITY_PRESET, options={"model": "gpt-4.1-nano"})
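Before running the full extraction, you can optionally sanity-check TLM on a single toy prompt. This minimal sketch assumes the same output format used by the batched helper below: a list of dicts with response and trustworthiness_score fields.

# Optional sanity check on a toy prompt (hypothetical example)
sample = tlm.prompt(["What is 2 + 2? Respond with just the number."])[0]
print(sample["response"], sample["trustworthiness_score"])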
SAVE_DIR = "scores_run" # Temp directory to save scores
if os.path.exists(SAVE_DIR):
print(f"Warning: Directory '{SAVE_DIR}' already exists. Existing filenames can cause issues if not removed.")
response = input(f"Do you want to remove the directory '{SAVE_DIR}'? (y/n): ").strip().lower()
if response == 'y':
print(f"Removing existing directory: {SAVE_DIR}")
shutil.rmtree(SAVE_DIR)
else:
print(f"Keeping existing directory: {SAVE_DIR}. Be aware that existing files may cause issues.")
Optional: Helper functions for extracting query-answer pairs from conversations
import re
from typing import Tuple, Optional, Dict, List, Any
import json
TLM prompt used for QA pair extraction:
PARSE_QUERY = """Analyze the following Conversation, and try to identify: overall what is the primary initial Query asked by the User (based on messages from the User) and what was a complete Answer to this Query provided by the assistant (based on messages from the assistant).
<conversation>
{message_history}
</conversation>
# Instructions
First, determine what overall is the primary initial Query asked by the User in this Conversation.
Next, determine a complete and accurate Answer to this self-contained Query, based on the information provided by the assistant in the Conversation.
If a Topic is specified in the Conversation, make sure it is mentioned/obvious in the Query you write.
When writing the Query:
- Ensure that it is self-contained and answerable without seeing the full Conversation.
- If the User raised multiple concerns in their messages, focus the Query on summarizing their initial concern into a self-contained question.
When writing the Answer:
- Ensure that it is accurate, and would be helpful to other Users who had the same Query.
- Only rely on information from the assistant's responses that directly address the Query!
- Do not include any non-core information that is not necessary to answer the Query.
- Omit any conversational/phatic statements and apologies from the assistant.
- Anticipate follow-up questions that the User raised and the assistant answered and try to include them in your Answer if other Users will likely have the same follow-ups.
- If the assistant did not provide a complete and helpful answer to the Query, write the Answer as an empty string ("").
- Do not output an empty string Answer if the assistant's primary helpful response was a URL, instead your Answer should include the URL.
- If the person is asking to speak with somebody, but the assistant said nobody is available, then your Answer should be the next alternative suggested by the assistant.
Your output must strictly follow this plain text format:
query: [self-contained Query that summarizes the User's main initial request]
answer: [complete self-contained Answer to this Query]"""
PARSE_QUERY_TRUSTWORTHINESS_THRESHOLD = 0.8  # Higher values extract fewer Question-Answer pairs from conversations; we only keep pairs whose trustworthiness score exceeds this threshold.
def batch_prompt(
tlm: TLM,
input_path: str,
output_path: str,
prompt_col_name: str,
answer_col_name: Optional[str] = None,
batch_size: int = 1000,
constrain_outputs: Optional[List[str]] = None,
):
if os.path.exists(output_path):
start_idx = len(pd.read_csv(output_path))
else:
start_idx = 0
df_batched = pd.read_csv(input_path, chunksize=batch_size)
curr_idx = 0
for curr_batch in df_batched:
if curr_idx + len(curr_batch) <= start_idx:
curr_idx += len(curr_batch)
continue
elif curr_idx < start_idx:
curr_batch = curr_batch[start_idx - curr_idx :]
curr_idx = start_idx
if answer_col_name:
prompts = curr_batch[prompt_col_name].to_list()
answers = curr_batch[answer_col_name].to_list()
results = tlm.get_trustworthiness_score(prompts, answers)
else:
prompts = curr_batch[prompt_col_name].to_list()
results = tlm.prompt(prompts, constrain_outputs=constrain_outputs)
results_df = pd.DataFrame(results)
if "log" in results_df.columns:
results_df["log"] = results_df["log"].apply(json.dumps)
results_df.to_csv(
output_path, mode="a", index=False, header=not os.path.exists(output_path)
)
curr_idx += len(curr_batch)
def extract_tags(text: str | None) -> Tuple[Optional[str], Optional[str]]:
"""
Extracts 'query:' and 'answer:' values from plain text format.
Returns a tuple of (query, answer) or (None, None) if not found.
"""
if not isinstance(text, str):
return None, None
query_match = re.search(r"query:\s*(.*)", text)
answer_match = re.search(r"answer:\s*(.*)", text)
query = query_match.group(1).strip() if query_match else None
answer = answer_match.group(1).strip() if answer_match else None
return query, answer
def get_parse_prompts(messages: list[str]) -> list[str]:
"""Formats message histories into prompts for parsing."""
return [
PARSE_QUERY.format(message_history=history)
for history in messages
]
def prompt_tlm_to_parse_message_history(
tlm,
prompts: list[str],
save_dir: str
) -> list[str | None]:
"""Uses batch_prompt and returns parsed responses if they meet the trustworthiness threshold.
If no response meets the threshold, returns None."""
os.makedirs(save_dir, exist_ok=True)
input_path = os.path.join(save_dir, "input_extract_qa.csv")
output_path = os.path.join(save_dir, "results.csv")
pd.DataFrame({"prompt": prompts}).to_csv(input_path, index=False)
batch_prompt(
tlm=tlm,
input_path=input_path,
output_path=output_path,
prompt_col_name="prompt",
batch_size=1000,
)
df = pd.read_csv(output_path)
parsed_messages_list = []
for _, row in df.iterrows():
score = row.get("trustworthiness_score")
if pd.notnull(score) and score >= PARSE_QUERY_TRUSTWORTHINESS_THRESHOLD:
parsed_messages_list.append(row.get("response", None))
else:
parsed_messages_list.append(None)
return parsed_messages_list
def parse_all_messages(tlm, messages: list[str], save_dir: str = '/tmp/') -> list[Tuple[Optional[str], Optional[str]]]:
"""Parses message history returning a list of (initial) queries and answers or None if parsing failed."""
prompts = get_parse_prompts(messages)
parsed_messages_list = prompt_tlm_to_parse_message_history(tlm, prompts, save_dir)
parsed_queries_and_answers = [extract_tags(r) for r in parsed_messages_list]
return parsed_queries_and_answers
df[["query", "answer"]] = parse_all_messages(tlm, df['messages'].tolist(), save_dir=SAVE_DIR)
Deduplicate QA pairs so all queries are unique and have answers
print("initial size", df.shape)
df = df[
df["query"].notna() & df["answer"].notna() & # remove NaN
df["query"].astype(str).str.strip().ne("") & # remove empty/blank strings
df["answer"].astype(str).str.strip().ne("")
].reset_index(drop=True)
print("size after removing rows with NaN queries or answers:", df.shape)
df = df.drop_duplicates(subset=["query"])
print("final size after deduplication:", df.shape)
df.to_csv("question_answer_pairs.csv", index=False)
Filter out bad Question-Answer pairs
Let’s load the Question-Answer pairs extracted above. The data file should include the following components:
- Query: The query made by the user.
- Answer: The expert answer provided in response to the user’s query.
- Message History (Optional): A record of the conversation history, which may provide additional context.
- Custom Metadata (Optional): Any additional information that may be relevant to the data, such as timestamps or user info.
Note: If your original customer support tickets are single-turn Q&A rather than multi-turn chats, or you are instead ingesting a documentation FAQ, then you can ignore the conversation-processing code above and start at this part of the tutorial.
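For instance, if you are starting from an FAQ, a minimal sketch like the one below (with hypothetical FAQ entries) produces a question_answer_pairs.csv in the format expected by the rest of this tutorial; the messages and metadata columns can simply mirror the QA text or hold whatever context you have.

import pandas as pd
# Hypothetical FAQ entries; replace with your own documentation content
faq_entries = [
    {"query": "How do I reset my password?",
     "answer": "Go to Settings > Security and click 'Reset password'."},
    {"query": "What is your refund policy?",
     "answer": "Refunds are available within 30 days of purchase."},
]
faq_df = pd.DataFrame(faq_entries)
faq_df["messages"] = "User: " + faq_df["query"] + "\nAssistant: " + faq_df["answer"]
faq_df["metadata"] = ""  # optionally add timestamps or other context here
faq_df.to_csv("question_answer_pairs.csv", index=False)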
import pandas as pd
df = pd.read_csv("question_answer_pairs.csv")
print(len(df))
df.head(1)
| | messages | metadata | query | answer |
| --- | --- | --- | --- | --- |
| 0 | Background Information:\n The conversation sta... | Background Information:\n The conversation sta... | How can I regain access to my account after lo... | To regain access to your account after losing ... |
Define Filters
We apply a sequence of filters to remove low-quality or outdated question-answer pairs before we ingest this data into Codex. Different filters target specific issues like PII exposure, non-questions, and outdated or non-informative content. You can easily add your own filters here too!
Note: All TLM-based filters have a configurable `trustworthiness_threshold`, which determines how stringent the filter is (and thus how many examples meet the filter criteria). All examples where the filter’s TLM trustworthiness score falls below the `trustworthiness_threshold` are removed (the outdated-answer filter inverts this, removing examples whose score exceeds its threshold).
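As a toy illustration of how this thresholding works (hypothetical scores, not real TLM output):

import numpy as np
toy_scores = np.array([0.95, 0.40, 0.72])  # hypothetical per-row filter scores
keep_mask = toy_scores >= 0.5              # rows scoring below the threshold get dropped
print(keep_mask)                           # [ True False  True]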
Optional: Helper methods for defining and running filters
import numpy as np
import string
import copy
from datetime import datetime
def print_idx(row, columns):
"""
Print the index and values of the specified columns from a pandas Series or namedtuple row.
Args:
row: A pandas Series or namedtuple representing a row from a DataFrame.
columns: List of column names (for Series) or attribute names (for namedtuple) to print.
"""
for col in columns:
value = (
row[col]
if isinstance(row, dict) or isinstance(row, pd.Series)
else getattr(row, col, None)
)
print(f"{col}: {value}")
def display_filter_results(df, concern, n=20, columns=None, ascending=True):
"""
Print the top n rows sorted by the given concern column, showing specified columns and the concern.
Args:
df: DataFrame to print from.
concern: The column name to sort by and display.
n: Number of rows to print (default 20).
columns: List of columns to display (default ['query', 'answer']).
ascending: Whether to sort in ascending order (default True).
"""
if columns is None:
columns = ["query", "answer"]
display_columns = columns + [concern] # Make a new list every call
for i, row in df.sort_values(by=concern, ascending=ascending).head(n).iterrows():
print_idx(row, display_columns)
print("-" * 100)
class BaseFilter:
def __init__(
self, name: str, cost: str = "low", hyperparameters: Optional[Dict] = None
):
self.name = name
self.cost = cost # 'low', 'med', or 'high'
self.hyperparameters = hyperparameters or {}
def apply(self, df: pd.DataFrame) -> Tuple[Optional[np.ndarray], np.ndarray]:
"""
Apply filter to dataframe.
Returns:
scores (np.ndarray or None): A float score per row (or None if not applicable)
keep_mask (np.ndarray): A boolean array indicating which rows to keep
"""
raise NotImplementedError
class KeywordFilter(BaseFilter):
def __init__(self, keywords: list[str], strip_punctuation: bool = False, **kwargs):
super().__init__(name="KeywordFilter", cost="low", **kwargs)
self.keywords = set(k.lower() for k in keywords)
self.strip_punctuation = strip_punctuation
def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
def tokenize(text: str) -> set[str]:
if self.strip_punctuation:
text = text.translate(
str.maketrans({p: " " for p in string.punctuation})
)
return set(word.lower() for word in text.split())
keep_mask = (
df["answer"]
.fillna("")
.apply(lambda x: not any(k in tokenize(x) for k in self.keywords))
.to_numpy()
)
return None, keep_mask
class ExactMatchFilter(BaseFilter):
def __init__(self, exact_answers: list[str], ignore_case: bool = False, **kwargs):
super().__init__(name="ExactMatchFilter", cost="low", **kwargs)
self.ignore_case = ignore_case
if self.ignore_case:
self.exact_answers = set(ans.lower() for ans in exact_answers)
else:
self.exact_answers = set(exact_answers)
def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
if self.ignore_case:
answers = df["answer"].fillna("").str.lower()
keep_mask = ~answers.isin(self.exact_answers)
else:
keep_mask = ~df["answer"].isin(self.exact_answers)
return None, keep_mask.to_numpy()
class LengthFilter(BaseFilter):
def __init__(self, min_word_count: int = 3, **kwargs):
super().__init__(name="LengthFilter", cost="low", **kwargs)
self.min_word_count = min_word_count
def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
word_count = df["answer"].fillna("").str.split().str.len()
keep_mask = word_count >= self.min_word_count
return None, keep_mask.to_numpy()
class TLMBinaryFilter(BaseFilter):
def __init__(
self,
prompt_template: str,
name: str = "TLMBinaryFilter",
tlm_api_key: Optional[str] = None,
trustworthiness_threshold: float = 0.5,
tlm_kwargs: dict[str, Any] = None,
batch_size: int = 1000,
save_dir: str = "/tmp",
):
if tlm_api_key is None:
tlm_api_key = os.getenv("CLEANLAB_TLM_API_KEY")
if not tlm_api_key:
raise ValueError("TLM API key must be provided.")
super().__init__(name=name, cost="high")
self.prompt_template = prompt_template
self.tlm_client = TLM(api_key=tlm_api_key, **(tlm_kwargs or {}))
self.trustworthiness_threshold = trustworthiness_threshold
self.batch_size = batch_size
self.name = name
self.save_dir = save_dir
formatter = string.Formatter()
self.template_fields = {
fname for _, fname, _, _ in formatter.parse(prompt_template) if fname
}
def get_responses(self, df: pd.DataFrame) -> list:
missing_fields = self.template_fields - set(df.columns)
if missing_fields:
raise ValueError(f"Missing fields in DataFrame: {missing_fields}")
df_copy = df.copy()
df_copy["__prompt__"] = df_copy.apply(
lambda row: self.prompt_template.format(
**{f: row[f] for f in self.template_fields}
),
axis=1,
)
temp_input_path = os.path.join(self.save_dir, f"tlm_binary_{self.name}_input.csv")
temp_output_path = os.path.join(self.save_dir, f"tlm_binary_{self.name}_output.csv")
df_copy[["__prompt__"]].to_csv(temp_input_path, index=False)
batch_prompt(
tlm=self.tlm_client,
input_path=temp_input_path,
output_path=temp_output_path,
prompt_col_name="__prompt__",
batch_size=self.batch_size,
constrain_outputs=["Yes", "No"],
)
return pd.read_csv(temp_output_path).to_dict(orient="records")
def apply(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
responses = self.get_responses(df)
scores = []
for r in responses:
if isinstance(r, dict) and r.get("response", "").strip().lower() == "yes":
score = r.get("trustworthiness_score", 0.0)
else:
score = 0.0
scores.append(score)
scores = np.array(scores)
keep_mask = scores >= self.trustworthiness_threshold
return scores, keep_mask
class TLMCustomEvalFilter(BaseFilter):
def __init__(
self,
criteria_name: str,
criteria_instruction: str,
tlm_api_key: Optional[str] = None,
trustworthiness_threshold: float = 0.5,
tlm_kwargs: dict[str, Any] = None,
batch_size: int = 1000,
save_dir: str = "/tmp",
**kwargs,
):
if tlm_api_key is None:
tlm_api_key = os.getenv("CLEANLAB_TLM_API_KEY")
if not tlm_api_key:
raise ValueError(
"TLM API key must be provided either as an argument or through the environment variable CLEANLAB_TLM_API_KEY."
)
if tlm_kwargs is None:
tlm_kwargs = {}
        if "custom_eval_criteria" in tlm_kwargs.get("options", {}):
            raise ValueError(
                "The 'custom_eval_criteria' option is already set in the TLM options. Please define the eval through the *criteria_name* and *criteria_instruction* parameters of TLMCustomEvalFilter."
            )
name = f"TLMCustomEval_{criteria_name}"
super().__init__(name=name, cost="high", **kwargs)
tlm_kwargs = copy.deepcopy(tlm_kwargs) if tlm_kwargs else {}
opts = dict(tlm_kwargs.pop("options", {}))
opts["custom_eval_criteria"] = [
{"name": criteria_name, "criteria": criteria_instruction}
]
tlm_kwargs["options"] = opts
self.tlm_client = TLM(api_key=tlm_api_key, **tlm_kwargs)
self.trustworthiness_threshold = trustworthiness_threshold
self.batch_size = batch_size
self.save_dir = save_dir
def apply(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
input_path = os.path.join(self.save_dir, f"{self.name}_input.csv")
output_path = os.path.join(self.save_dir, f"{self.name}_output.csv")
df[["query", "answer"]].to_csv(input_path, index=False)
if os.path.exists(output_path):
os.remove(output_path)
batch_prompt(
tlm=self.tlm_client,
input_path=input_path,
output_path=output_path,
prompt_col_name="query",
answer_col_name="answer",
batch_size=self.batch_size,
constrain_outputs=None,
)
output_df = pd.read_csv(output_path)
output_df["log"] = output_df["log"].apply(json.loads)
custom_eval_scores = (
output_df["log"]
.apply(lambda x: x["custom_eval_criteria"][0]["score"])
.values
)
keep_mask = custom_eval_scores >= self.trustworthiness_threshold
return custom_eval_scores, keep_mask
class TLMOutdatedAnswerFilter(BaseFilter):
def __init__(
self,
prompt_template: str,
name: str = "TLMOutdatedAnswerFilter",
tlm_api_key: Optional[str] = None,
trustworthiness_threshold: float = 0.5,
tlm_kwargs: dict[str, Any] = None,
batch_size: int = 1000,
save_dir: str = "/tmp",
**kwargs,
):
if tlm_api_key is None:
tlm_api_key = os.getenv("CLEANLAB_TLM_API_KEY")
if not tlm_api_key:
raise ValueError(
"TLM API key must be provided either as an argument or through the environment variable CLEANLAB_TLM_API_KEY."
)
super().__init__(name=name, cost="high", **kwargs)
self.tlm_client = TLM(api_key=tlm_api_key, **(tlm_kwargs or {}))
self.trustworthiness_threshold = trustworthiness_threshold
self.batch_size = batch_size
self.save_dir = save_dir
if not prompt_template or not isinstance(prompt_template, str):
raise ValueError("prompt_template must be a non-empty string.")
self.prompt_template = prompt_template
def apply(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
if not all(col in df.columns for col in ["query", "answer", "metadata"]):
raise ValueError(
"DataFrame must contain 'query', 'answer', and 'metadata' columns."
)
today_str = datetime.today().strftime("%Y-%m-%d")
# Create prompt from template and save to input CSV
prompts = [
self.prompt_template.format(
today_str=today_str,
query=row["query"],
answer=row["answer"],
meta_data=row["metadata"],
)
for _, row in df.iterrows()
]
input_df = pd.DataFrame({"prompt": prompts})
input_path = os.path.join(self.save_dir, f"{self.name}_input.csv")
output_path = os.path.join(self.save_dir, f"{self.name}_output.csv")
input_df.to_csv(input_path, index=False)
if os.path.exists(output_path):
os.remove(output_path)
batch_prompt(
tlm=self.tlm_client,
input_path=input_path,
output_path=output_path,
prompt_col_name="prompt",
batch_size=self.batch_size,
constrain_outputs=["Yes", "No"],
)
output_df = pd.read_csv(output_path)
scores = np.array([
row["trustworthiness_score"] if str(row["response"]).strip().lower() == "yes" else 0.0
for _, row in output_df.iterrows()
])
keep_mask = ~(scores > self.trustworthiness_threshold)
return scores, keep_mask
class RunFilters:
def __init__(self, filters: List[BaseFilter], save_dir: str = None):
# Sort filters by cost priority (low -> high)
cost_order = {"low": 0, "med": 1, "high": 2}
self.filters = sorted(filters, key=lambda f: cost_order.get(f.cost, 3))
self.save_dir = save_dir
print(
f"Initialized with {len(self.filters)} filters sorted by cost priority: {[f.name for f in self.filters]}"
)
def run(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
self.original_df = df.copy()
keep_mask = np.ones(len(df), dtype=bool)
all_scores = pd.DataFrame(index=df.index)
for filt in self.filters:
filtered_df = df[keep_mask]
print(f"Running {filt.name}")
scores, local_keep = filt.apply(filtered_df)
print(f"sum of {filt.name}: {local_keep.sum()}")
full_local_keep = np.zeros(len(df), dtype=bool)
full_local_keep[keep_mask] = local_keep
if scores is not None:
score_series = pd.Series(np.nan, index=df.index, dtype=float)
score_series[keep_mask] = scores
all_scores[f"{filt.name}_score"] = score_series
score_series = pd.Series(np.nan, index=df.index, dtype=object)
score_series[keep_mask] = local_keep
all_scores[filt.name] = score_series
if self.save_dir is not None:
os.makedirs(self.save_dir, exist_ok=True)
out_path = os.path.join(self.save_dir, "all_scores.csv")
all_scores.to_csv(out_path, index=False)
keep_mask &= full_local_keep
self.filtered_df = df[keep_mask].reset_index(drop=True)
        all_scores["filtered_mask"] = ~keep_mask  # True for rows that were filtered out
self.all_filtered_scores = all_scores
print(f"Final size after applying filters: {self.filtered_df.shape}")
return self.filtered_df, all_scores
Optional: Defining custom filters
tlm_kwargs = {
"quality_preset": "base",
"options": {"model": "gpt-4.1-nano"},
}
tlm_filter_pii_protection = TLMCustomEvalFilter(
criteria_name="pii_protection",
criteria_instruction="""
Determine if the answer contains or exposes sensitive personally identifiable information (PII).
A bad answer:
- Contains private data such as credit card numbers, Social Security numbers, or other government-issued IDs.
- Includes full personal addresses, private phone numbers, or personal email addresses.
- Reveals sensitive account information such as bank details, usernames, or passwords.
- Repeats or reflects any confidential personal details shared by the user.
- Mentions information not typically accessible to the public.
A great answer:
- Contains no sensitive or private PII.
- May include **publicly available contact information** such as:
- Government helplines
- City service phone numbers
- Support emails published on official websites
- Does **not** leak or infer sensitive user-specific details.
Only flag answers that contain **private, confidential, or user-specific** personal data.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.5,
save_dir=SAVE_DIR
)
tlm_binary_query_is_good = TLMBinaryFilter(
prompt_template="""Is the message a query?
query: {query}
--
Respond with "Yes" if the message is a question, request or search term, otherwise respond with "No".
Please respond with only "Yes" or "No" with no leading or trailing text. Do not return anything else.
""",
trustworthiness_threshold=0.5,
name="TLMBinary_query_is_good",
tlm_kwargs=tlm_kwargs,
save_dir=SAVE_DIR
)
tlm_filter_thankyou_or_question_only = TLMCustomEvalFilter(
criteria_name="thankyou_or_question_only",
criteria_instruction="""Determine if the answer fails to address the user's query by either:
1. Only including salutations or closing phrases (such as "thank you", "goodbye", etc.).
2. Only asking a clarifying or follow-up question without providing a useful answer.
3. Giving a vague, off-topic, or unhelpful reply that does not meaningfully respond to the query.
A bad answer:
- Provides no informational content beyond a greeting or farewell.
- Ends the interaction without addressing the user's query.
- Only asks a question and does not answer the original query.
- Gives an answer that appears unrelated or avoids the core of the user’s query.
A great answer:
- Answers the query clearly and directly.
- May include greetings or closings, but also contains relevant, helpful information.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.5,
save_dir=SAVE_DIR
)
tlm_filter_non_informative_answer = TLMCustomEvalFilter(
criteria_name="non_informative_answer",
criteria_instruction="""Determine if the answer fails to provide a meaningful or useful answer to the query.
A bad answer:
- States it cannot answer or lacks the information, without offering help.
- Summarizes a solution without giving specific, actionable details.
- Avoids addressing the actual query.
- References resources or solutions but doesn't explain or elaborate.
- Assumes facts not stated, leading to potentially inaccurate or misleading advice.
- Requests personalized, time-bound, or specific updates without sufficient context.
- Provides an answer tailored to one user’s specific circumstance rather than offering generalizable guidance.
A great answer:
- Directly answers the query with clear, specific information.
- Provides actionable steps, examples, or guidance.
- May include follow-up questions or clarifications, but only after delivering a substantive answer.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.25,
save_dir=SAVE_DIR
)
tlm_filter_too_specific = TLMCustomEvalFilter(
criteria_name="too_specific",
criteria_instruction="""Determine if the query is too specific to the user's personal or narrow situation, making it unlikely to help others with similar questions.
A bad query:
- Asks about uncommon, highly personalized, or overly detailed scenarios.
- Limits the usefulness of any answer to a wider audience.
- Relies on information that is time-sensitive or valid only for a short period.
- Often involves specific items, account actions, or events relevant only to one user.
- Lacks sufficient context to be broadly interpretable or reusable by others.
A good query:
- Asks about a general topic or situation relevant to many users.
- May include details, but still invites broadly useful information.
Give a low score for a bad query.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.5,
save_dir=SAVE_DIR
)
tlm_outdated_answer_filter = TLMOutdatedAnswerFilter(prompt_template= """
Today's date is {today_str}.
{meta_data}
<query>
{query}
</query>
<answer>
{answer}
</answer>
Your task is to determine:
Is this answer likely outdated today? Yes or No.
Only say “Yes” if the information is likely no longer useful, accurate, or valid due to changes over time, such as:
- Obsolete procedures
- Outdated rates or policies
- Time-sensitive answers (e.g., booked dates or one-time confirmations)
- Temporary time-bound information (e.g., restrictions in effect only during specific past dates)
- Mentions a specific date or event (e.g., a sale) that has already passed relative to today
- Contains vague or relative status updates that may no longer be accurate (e.g., “soon”, “a few days”, “recently processed”, “in progress”)
Answer "No" if:
- Provides general information, definitions, or explanations that are likely still valid (e.g., what "manual watering" means)
- Describes standard processes or requirements for making a request (e.g., setting up inspection)
Respond with Yes or No only.
""",
trustworthiness_threshold=0.9,
save_dir=SAVE_DIR
)
# Optionally improve outdated filter via days-since-conversation calculation:
def extract_date_from_metadata(metadata):
date_match = re.search(r'\d{4}-\d{2}-\d{2}', metadata)
return date_match.group(0) if date_match else None
df['conv_date'] = pd.to_datetime(df['metadata'].apply(extract_date_from_metadata))
from datetime import datetime
# Calculate the number of days passed since conv_date
df['days_passed'] = (datetime.now() - df['conv_date']).dt.days
df['metadata'] = df.apply(
lambda row: row['metadata'].replace(
'The conversation started',
f'The conversation started {row["days_passed"]} days ago'
),
axis=1
)
Now that they’re defined, let’s apply our filters to the QA pairs data. You can either use the filters provided here directly and/or define your own filters using the above classes. Here we filter out bad examples based on: keywords, exact string matching, response-length, bad queries, potentially outdated answers, answers that contain PII, non-informative answers, and overly-specific answers (that won’t help other customers besides the one who originally received this answer). Note that you can save runtime/costs by running the filters in a specific order, prioritizing cheaper/faster filters as well as filters expected to remove lots of data.
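For instance, here is a minimal sketch of one extra custom filter you could define on top of the BaseFilter interface above: a hypothetical RegexFilter that drops answers containing raw email addresses. It is purely illustrative and is not included in the filter list below.

class RegexFilter(BaseFilter):
    """Hypothetical example: drop rows whose answer matches a regex (e.g., raw email addresses)."""
    def __init__(self, pattern: str, **kwargs):
        super().__init__(name="RegexFilter", cost="low", **kwargs)
        self.pattern = re.compile(pattern)

    def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
        # Keep rows whose answer does NOT match the pattern
        keep_mask = (
            df["answer"]
            .fillna("")
            .apply(lambda x: self.pattern.search(x) is None)
            .to_numpy()
        )
        return None, keep_mask

# Example usage: email_filter = RegexFilter(pattern=r"[\w.+-]+@[\w-]+\.[\w.]+")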
# Apply the simplest non-TLM filters first (such as removing exact unhelpful answers, and keyword matches), then proceed to the more advanced TLM-based filters in order of strictness.
exact_answers = ['If you have another question, please go ahead.' ,
'Great talking to you! If you have more questions later, I’m here to help. Bye for now!',
'Thank you for the feedback'
]
length_filter = LengthFilter(min_word_count=3)
exact_match_filter = ExactMatchFilter(exact_answers=exact_answers,ignore_case=True)
keyword_filter = KeywordFilter(keywords=["<person>", "<redacted>"])
keyword_filter_no_punctuation = KeywordFilter(keywords=["redacted", 'assistant'], strip_punctuation=True)
filters = [
length_filter,
exact_match_filter,
keyword_filter,
keyword_filter_no_punctuation,
tlm_binary_query_is_good,
tlm_filter_non_informative_answer,
tlm_filter_thankyou_or_question_only,
tlm_filter_pii_protection,
tlm_filter_too_specific,
tlm_outdated_answer_filter,
]
run_filters = RunFilters(filters = filters)
filtered_df, scores = run_filters.run(df)
Apply a Specific Filter to the Entire DataFrame
To apply a specific filter across the whole DataFrame, simply run the filter on your DataFrame. For example, you can use `tlm_filter_too_specific` to obtain a score and mask for each row.
score, mask = tlm_filter_too_specific.apply(df)
df['tlm_filter_too_specific_score'] = score
Review Filter Scores
Let’s take a closer look at some of the filter scores. By exploring these values, we can better understand how our filters are performing and make informed decisions about setting appropriate thresholds for our data.
Let’s review the following data for examples that may be too specific or personalized. Information tailored to individual cases might not be useful for serving as expert answers from Codex, so consider removing or revising such entries.
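One quick way to pick a threshold is to inspect the distribution of a filter’s scores before deciding on a cutoff; a brief sketch using the score column added above:

# Inspect the spread of too-specific scores to choose a sensible cutoff
print(df['tlm_filter_too_specific_score'].describe())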
df = pd.concat([df, scores], axis=1) # add all computed scores
display_filter_results(df,'tlm_filter_too_specific_score', n=5)
It’s important to remove outdated responses to ensure users receive accurate and relevant information.
display_filter_results(df,'TLMOutdatedAnswerFilter_score',columns=['query', 'answer'], n=5,ascending=False)
display_filter_results(df,'TLMCustomEval_non_informative_answer_score', n=5)
filtered_df.to_csv("filtered_question_answer_pairs.csv", index=False)
Here is the final set of high-quality question-answer pairs ready to be added to Codex.
filtered_df[['query','answer']].head(3)
| | query | answer |
| --- | --- | --- |
| 0 | How can I regain access to my account when I n... | To regain access to your account without the t... |
| 1 | How can I regain access to my account after lo... | To regain access to your account after losing ... |
| 2 | How can I return an item I received in the wro... | To return an item received in the wrong color,... |
Add Question-Answer Pairs into Codex Project
Finally, we initialize a Codex Project and load our high-quality QA pairs into it, so they can be served as expert answers. This is a great way to hot-start any Codex Project with a large set of expert answers, which didn’t require any human work to obtain!
Over time, as you collect more customer support tickets with good human answers in them, you can repeat this process, except now adding the new QA pairs to your existing Codex Project.
# Create a project
project = codex_client.create_project(
name="Filtered: ABCD",
description="QA pairs for ABCD",
)
access_key = project.create_access_key("test access key")
import pandas as pd
filtered_df = pd.read_csv("filtered_question_answer_pairs.csv")
print("Filtered DataFrame shape:", filtered_df.shape)
from tqdm import tqdm
for row in tqdm(filtered_df.itertuples(index=False)):
project.add_remediation(
question=row.query,
answer=row.answer,
)
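As noted above, you can repeat this process as new tickets arrive. The minimal sketch below (hypothetical filenames, reusing the same project handle) avoids re-adding questions that were already ingested by comparing against the previously loaded CSV.

# Hypothetical incremental update: only add QA pairs whose query was not ingested before
previous = pd.read_csv("filtered_question_answer_pairs.csv")       # pairs already added to Codex
new_pairs = pd.read_csv("new_filtered_question_answer_pairs.csv")  # output of re-running this tutorial on new tickets
new_pairs = new_pairs[~new_pairs["query"].isin(previous["query"])]
for row in tqdm(new_pairs.itertuples(index=False)):
    project.add_remediation(question=row.query, answer=row.answer)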