Leverage Historical Support Tickets to Improve your AI
This tutorial walks you through the process of transforming historical customer support conversations into high-quality question-answer (QA) pairs that can be ingested into Codex and served as expert answers for similar future queries. After you add your historical support tickets to Codex, your AI assistant will be able to answer significantly more customer queries accurately and helpfully.
We assume you have a database of historical customer support tickets, in which your human support employees previously provided high-quality assistance to customers via a recorded chat.
Alternatively, if your documentation has an FAQ section, you can follow this same tutorial to ingest it into Codex as expert answers. This enables FAQs to be served to your users more precisely and accurately (in particular, overcoming issues in your current AI system that may stem from data formatting, document chunking, suboptimal retrieval, …).
import os
from cleanlab_codex.client import Client
import shutil
os.environ["CODEX_API_KEY"] = "<CODEX_API_KEY>" # Get your free API key from: https://codex.cleanlab.ai/
os.environ["CLEANLAB_TLM_API_KEY"] = "<TLM_API_KEY>" # Get your free API key from: https://tlm.cleanlab.ai/
codex_client = Client()
Form Question-Answer Pairs from Message Histories
We’ll first transform raw customer support conversations into structured question-answer (QA) pairs. Later we’ll filter those QA pairs to remove answers that are low-quality or likely outdated.
Steps to follow:
- Pre-process your data: Ensure it is organized as conversations, where each conversation is a sequence of User and Assistant messages.
- Deduplicate conversations: To save subsequent processing time.
- Extract QA pairs: From each conversation, extract the user’s primary initial question and the corresponding assistant answer.
- Deduplicate queries: To save subsequent processing time.
In this tutorial, we’ll demonstrate this process using a subset of the ABCD dataset.
import pandas as pd
df = pd.read_csv("https://cleanlab-public.s3.us-east-1.amazonaws.com/Datasets/abcd_subsampled.csv")
df.head(1)
| | messages | metadata |
| --- | --- | --- |
| 0 | Assistant: Thanks for contacting AcmeBrands, h... | Background Information:\n The conversation sta... |
Let’s look at the formatted data. Below we have a conversation history between a User (aka customer) and an Assistant (aka your human customer service representative, whose answers we can trust – not an AI model!). We expect the history to be in `messages` list format, with each conversation turn indicated as `role: content` on its own line.
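For illustration, a single entry in the messages column might look like the hypothetical example below (not an actual row from this dataset):

example_conversation = (
    "User: Hi, I never received my order. Can you help?\n"
    "Assistant: Sorry about that! Could you share your order number?\n"
    "User: Sure, it's 12345.\n"
    "Assistant: Thanks. I've issued a replacement shipment for order 12345."
)
print(example_conversation)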
idx = 72
print(df.iloc[idx]['messages'])
Deduplicate message histories to ensure uniqueness
print("initial size", df.shape)
df = df.drop_duplicates(subset=["messages"])
print("final size after deduplication:", df.shape)
(Optional) Add Background Information into Messages
Optionally, you can prepend any background information (metadata) that is relevant to each conversation (date, user info, etc.). This helps ensure that Cleanlab’s AI receives all relevant context about the conversation when processing it to extract expert answers.
df['messages'] = df['metadata']+'\n' + df['messages']
print(df['messages'][idx])
Extract initial Question and Answer from message history
We’ll use Cleanlab’s TLM to extract one structured question and answer pair from each conversation. Cleanlab will analyze the conversation (including any metadata), and identify the user’s primary initial question and the assistant’s corresponding answer to this question.
from cleanlab_tlm import TLM
# Optional configurations to control runtimes/costs:
TLM_QUALITY_PRESET = "base"
tlm = TLM(quality_preset=TLM_QUALITY_PRESET, options={"model": "gpt-4.1-nano"})
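Before running the full extraction, you can optionally sanity-check TLM on a single toy prompt. This minimal sketch assumes the same output format used by the batched helper below: a list of dicts with response and trustworthiness_score fields.

# Optional sanity check on a toy prompt (hypothetical example)
sample = tlm.prompt(["What is 2 + 2? Respond with just the number."])[0]
print(sample["response"], sample["trustworthiness_score"])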
SAVE_DIR = "scores_run" # Temp directory to save scores
if os.path.exists(SAVE_DIR):
print(f"Warning: Directory '{SAVE_DIR}' already exists. Existing filenames can cause issues if not removed.")
response = input(f"Do you want to remove the directory '{SAVE_DIR}'? (y/n): ").strip().lower()
if response == 'y':
print(f"Removing existing directory: {SAVE_DIR}")
shutil.rmtree(SAVE_DIR)
else:
print(f"Keeping existing directory: {SAVE_DIR}. Be aware that existing files may cause issues.")
Optional: Helper functions for extracting query-answer pairs from conversations
import re
from typing import Tuple, Optional, Dict, List, Any
import json
TLM prompt used for QA pair extraction:
PARSE_QUERY = """Analyze the following Conversation, and try to identify: overall what is the primary initial Query asked by the User (based on messages from the User) and what was a complete Answer to this Query provided by the assistant (based on messages from the assistant).
<conversation>
{message_history}
</conversation>
# Instructions
First, determine what overall is the primary initial Query asked by the User in this Conversation.
Next, determine a complete and accurate Answer to this self-contained Query, based on the information provided by the assistant in the Conversation.
If a Topic is specified in the Conversation, make sure it is mentioned/obvious in the Query you write.
When writing the Query:
- Ensure that it is self-contained and answerable without seeing the full Conversation.
- If the User raised multiple concerns in their messages, focus the Query on summarizing their initial concern into a self-contained question.
When writing the Answer:
- Ensure that it is accurate, and would be helpful to other Users who had the same Query.
- Only rely on information from the assistant's responses that directly address the Query!
- Do not include any non-core information that is not necessary to answer the Query.
- Omit any conversational/phatic statements and apologies from the assistant.
- Anticipate follow-up questions that the User raised and the assistant answered and try to include them in your Answer if other Users will likely have the same follow-ups.
- If the assistant did not provide a complete and helpful answer to the Query, write the Answer as an empty string ("").
- Do not output an empty string Answer if the assistant's primary helpful response was a URL, instead your Answer should include the URL.
- If the person is asking to speak with somebody, but the assistant said nobody is available, then your Answer should be the next alternative suggested by the assistant.
Your output must strictly follow this plain text format:
query: [self-contained Query that summarizes the User's main initial request]
answer: [complete self-contained Answer to this Query]"""
PARSE_QUERY_TRUSTWORTHINESS_THRESHOLD = 0.8  # Higher values extract fewer Question-Answer pairs from conversations; we only keep pairs whose trustworthiness score exceeds this threshold.
def batch_prompt(
tlm: TLM,
input_path: str,
output_path: str,
prompt_col_name: str,
answer_col_name: Optional[str] = None,
batch_size: int = 1000,
constrain_outputs: Optional[List[str]] = None,
):
if os.path.exists(output_path):
start_idx = len(pd.read_csv(output_path))
else:
start_idx = 0
df_batched = pd.read_csv(input_path, chunksize=batch_size)
curr_idx = 0
for curr_batch in df_batched:
if curr_idx + len(curr_batch) <= start_idx:
curr_idx += len(curr_batch)
continue
elif curr_idx < start_idx:
curr_batch = curr_batch[start_idx - curr_idx :]
curr_idx = start_idx
if answer_col_name:
prompts = curr_batch[prompt_col_name].to_list()
answers = curr_batch[answer_col_name].to_list()
results = tlm.get_trustworthiness_score(prompts, answers)
else:
prompts = curr_batch[prompt_col_name].to_list()
results = tlm.prompt(prompts, constrain_outputs=constrain_outputs)
results_df = pd.DataFrame(results)
if "log" in results_df.columns:
results_df["log"] = results_df["log"].apply(json.dumps)
results_df.to_csv(
output_path, mode="a", index=False, header=not os.path.exists(output_path)
)
curr_idx += len(curr_batch)
def extract_tags(text: str | None) -> Tuple[Optional[str], Optional[str]]:
"""
Extracts 'query:' and 'answer:' values from plain text format.
Returns a tuple of (query, answer) or (None, None) if not found.
"""
if not isinstance(text, str):
return None, None
query_match = re.search(r"query:\s*(.*)", text)
answer_match = re.search(r"answer:\s*(.*)", text)
query = query_match.group(1).strip() if query_match else None
answer = answer_match.group(1).strip() if answer_match else None
return query, answer
def get_parse_prompts(messages: list[str]) -> list[str]:
"""Formats message histories into prompts for parsing."""
return [
PARSE_QUERY.format(message_history=history)
for history in messages
]
def prompt_tlm_to_parse_message_history(
tlm,
prompts: list[str],
save_dir: str
) -> list[str | None]:
"""Uses batch_prompt and returns parsed responses if they meet the trustworthiness threshold.
If no response meets the threshold, returns None."""
os.makedirs(save_dir, exist_ok=True)
input_path = os.path.join(save_dir, "input_extract_qa.csv")
output_path = os.path.join(save_dir, "results.csv")
pd.DataFrame({"prompt": prompts}).to_csv(input_path, index=False)
batch_prompt(
tlm=tlm,
input_path=input_path,
output_path=output_path,
prompt_col_name="prompt",
batch_size=1000,
)
df = pd.read_csv(output_path)
parsed_messages_list = []
for _, row in df.iterrows():
score = row.get("trustworthiness_score")
if pd.notnull(score) and score >= PARSE_QUERY_TRUSTWORTHINESS_THRESHOLD:
parsed_messages_list.append(row.get("response", None))
else:
parsed_messages_list.append(None)
return parsed_messages_list
def parse_all_messages(tlm, messages: list[str], save_dir: str = '/tmp/') -> list[Tuple[Optional[str], Optional[str]]]:
"""Parses message history returning a list of (initial) queries and answers or None if parsing failed."""
prompts = get_parse_prompts(messages)
parsed_messages_list = prompt_tlm_to_parse_message_history(tlm, prompts, save_dir)
parsed_queries_and_answers = [extract_tags(r) for r in parsed_messages_list]
return parsed_queries_and_answers
df[["query", "answer"]] = parse_all_messages(tlm, df['messages'].tolist(), save_dir=SAVE_DIR)
Deduplicate QA pairs so all queries are unique and have answers
print("initial size", df.shape)
df = df[
df["query"].notna() & df["answer"].notna() & # remove NaN
df["query"].astype(str).str.strip().ne("") & # remove empty/blank strings
df["answer"].astype(str).str.strip().ne("")
].reset_index(drop=True)
print("size after removing rows with NaN queries or answers:", df.shape)
df = df.drop_duplicates(subset=["query"])
print("final size after deduplication:", df.shape)
df.to_csv("question_answer_pairs.csv", index=False)
Filter out bad Question-Answer pairs
Let’s load the Question-Answer pairs extracted above. The data file should include the following components:
- Query: The query made by the user.
- Answer: The expert answer provided in response to the user’s query.
- Message History (Optional): A record of the conversation history, which may provide additional context.
- Custom Metadata (Optional): Any additional information that may be relevant to the data, such as timestamps or user info.
Note: If your original customer support tickets are single-turn Q&A rather than multi-turn chats, or you are instead ingesting a documentation FAQ, then you can ignore the conversation-processing code above and start at this part of the tutorial.
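For instance, if you are starting from an FAQ, a minimal sketch like the one below (with hypothetical FAQ entries) produces a question_answer_pairs.csv in the format expected by the rest of this tutorial; the messages and metadata columns can simply mirror the QA text or hold whatever context you have.

import pandas as pd
# Hypothetical FAQ entries; replace with your own documentation content
faq_entries = [
    {"query": "How do I reset my password?",
     "answer": "Go to Settings > Security and click 'Reset password'."},
    {"query": "What is your refund policy?",
     "answer": "Refunds are available within 30 days of purchase."},
]
faq_df = pd.DataFrame(faq_entries)
faq_df["messages"] = "User: " + faq_df["query"] + "\nAssistant: " + faq_df["answer"]
faq_df["metadata"] = ""  # optionally add timestamps or other context here
faq_df.to_csv("question_answer_pairs.csv", index=False)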
import pandas as pd
df = pd.read_csv("question_answer_pairs.csv")
print(len(df))
df.head(1)
| | messages | metadata | query | answer |
| --- | --- | --- | --- | --- |
| 0 | Background Information:\n The conversation sta... | Background Information:\n The conversation sta... | How can I regain access to my account after lo... | To regain access to your account after losing ... |
Define Filters
We apply a sequence of filters to remove low-quality or outdated question-answer pairs before we ingest this data into Codex. Different filters target specific issues like PII exposure, non-questions, and outdated or non-informative content. You can easily add your own filters here too!
Note: All TLM-based filters have a configurable `trustworthiness_threshold`, which determines how stringent the filter is (and thus how many examples meet the filter criteria). All examples where the filter’s TLM trustworthiness score falls below the `trustworthiness_threshold` are removed (the outdated-answer filter inverts this, removing examples whose score exceeds its threshold).
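As a toy illustration of how this thresholding works (hypothetical scores, not real TLM output):

import numpy as np
toy_scores = np.array([0.95, 0.40, 0.72])  # hypothetical per-row filter scores
keep_mask = toy_scores >= 0.5              # rows scoring below the threshold get dropped
print(keep_mask)                           # [ True False  True]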
Optional: Helper methods for defining and running filters
import numpy as np
import string
import copy
from datetime import datetime
def print_idx(row, columns):
"""
Print the index and values of the specified columns from a pandas Series or namedtuple row.
Args:
row: A pandas Series or namedtuple representing a row from a DataFrame.
columns: List of column names (for Series) or attribute names (for namedtuple) to print.
"""
for col in columns:
value = (
row[col]
if isinstance(row, dict) or isinstance(row, pd.Series)
else getattr(row, col, None)
)
print(f"{col}: {value}")
def display_filter_results(df, concern, n=20, columns=None, ascending=True):
"""
Print the top n rows sorted by the given concern column, showing specified columns and the concern.
Args:
df: DataFrame to print from.
concern: The column name to sort by and display.
n: Number of rows to print (default 20).
columns: List of columns to display (default ['query', 'answer']).
ascending: Whether to sort in ascending order (default True).
"""
if columns is None:
columns = ["query", "answer"]
display_columns = columns + [concern] # Make a new list every call
for i, row in df.sort_values(by=concern, ascending=ascending).head(n).iterrows():
print_idx(row, display_columns)
print("-" * 100)
class BaseFilter:
def __init__(
self, name: str, cost: str = "low", hyperparameters: Optional[Dict] = None
):
self.name = name
self.cost = cost # 'low', 'med', or 'high'
self.hyperparameters = hyperparameters or {}
def apply(self, df: pd.DataFrame) -> Tuple[Optional[np.ndarray], np.ndarray]:
"""
Apply filter to dataframe.
Returns:
scores (np.ndarray or None): A float score per row (or None if not applicable)
keep_mask (np.ndarray): A boolean array indicating which rows to keep
"""
raise NotImplementedError
class KeywordFilter(BaseFilter):
def __init__(self, keywords: list[str], strip_punctuation: bool = False, **kwargs):
super().__init__(name="KeywordFilter", cost="low", **kwargs)
self.keywords = set(k.lower() for k in keywords)
self.strip_punctuation = strip_punctuation
def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
def tokenize(text: str) -> set[str]:
if self.strip_punctuation:
text = text.translate(
str.maketrans({p: " " for p in string.punctuation})
)
return set(word.lower() for word in text.split())
keep_mask = (
df["answer"]
.fillna("")
.apply(lambda x: not any(k in tokenize(x) for k in self.keywords))
.to_numpy()
)
return None, keep_mask
class ExactMatchFilter(BaseFilter):
def __init__(self, exact_answers: list[str], ignore_case: bool = False, **kwargs):
super().__init__(name="ExactMatchFilter", cost="low", **kwargs)
self.ignore_case = ignore_case
if self.ignore_case:
self.exact_answers = set(ans.lower() for ans in exact_answers)
else:
self.exact_answers = set(exact_answers)
def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
if self.ignore_case:
answers = df["answer"].fillna("").str.lower()
keep_mask = ~answers.isin(self.exact_answers)
else:
keep_mask = ~df["answer"].isin(self.exact_answers)
return None, keep_mask.to_numpy()
class LengthFilter(BaseFilter):
def __init__(self, min_word_count: int = 3, **kwargs):
super().__init__(name="LengthFilter", cost="low", **kwargs)
self.min_word_count = min_word_count
def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
word_count = df["answer"].fillna("").str.split().str.len()
keep_mask = word_count >= self.min_word_count
return None, keep_mask.to_numpy()
class TLMBinaryFilter(BaseFilter):
def __init__(
self,
prompt_template: str,
name: str = "TLMBinaryFilter",
tlm_api_key: Optional[str] = None,
trustworthiness_threshold: float = 0.5,
tlm_kwargs: dict[str, Any] = None,
batch_size: int = 1000,
save_dir: str = "/tmp",
):
if tlm_api_key is None:
tlm_api_key = os.getenv("CLEANLAB_TLM_API_KEY")
if not tlm_api_key:
raise ValueError("TLM API key must be provided.")
super().__init__(name=name, cost="high")
self.prompt_template = prompt_template
self.tlm_client = TLM(api_key=tlm_api_key, **(tlm_kwargs or {}))
self.trustworthiness_threshold = trustworthiness_threshold
self.batch_size = batch_size
self.name = name
self.save_dir = save_dir
formatter = string.Formatter()
self.template_fields = {
fname for _, fname, _, _ in formatter.parse(prompt_template) if fname
}
def get_responses(self, df: pd.DataFrame) -> list:
missing_fields = self.template_fields - set(df.columns)
if missing_fields:
raise ValueError(f"Missing fields in DataFrame: {missing_fields}")
df_copy = df.copy()
df_copy["__prompt__"] = df_copy.apply(
lambda row: self.prompt_template.format(
**{f: row[f] for f in self.template_fields}
),
axis=1,
)
temp_input_path = os.path.join(self.save_dir, f"tlm_binary_{self.name}_input.csv")
temp_output_path = os.path.join(self.save_dir, f"tlm_binary_{self.name}_output.csv")
df_copy[["__prompt__"]].to_csv(temp_input_path, index=False)
batch_prompt(
tlm=self.tlm_client,
input_path=temp_input_path,
output_path=temp_output_path,
prompt_col_name="__prompt__",
batch_size=self.batch_size,
constrain_outputs=["Yes", "No"],
)
return pd.read_csv(temp_output_path).to_dict(orient="records")
def apply(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
responses = self.get_responses(df)
scores = []
for r in responses:
if isinstance(r, dict) and r.get("response", "").strip().lower() == "yes":
score = r.get("trustworthiness_score", 0.0)
else:
score = 0.0
scores.append(score)
scores = np.array(scores)
keep_mask = scores >= self.trustworthiness_threshold
return scores, keep_mask
class TLMCustomEvalFilter(BaseFilter):
def __init__(
self,
criteria_name: str,
criteria_instruction: str,
tlm_api_key: Optional[str] = None,
trustworthiness_threshold: float = 0.5,
tlm_kwargs: dict[str, Any] = None,
batch_size: int = 1000,
save_dir: str = "/tmp",
**kwargs,
):
if tlm_api_key is None:
tlm_api_key = os.getenv("CLEANLAB_TLM_API_KEY")
if not tlm_api_key:
raise ValueError(
"TLM API key must be provided either as an argument or through the environment variable CLEANLAB_TLM_API_KEY."
)
if tlm_kwargs is None:
tlm_kwargs = {}
        if "custom_eval_criteria" in tlm_kwargs.get("options", {}):
            raise ValueError(
                "The 'custom_eval_criteria' option is already set in the TLM options. Please define the eval through the *criteria_name* and *criteria_instruction* parameters of TLMCustomEvalFilter."
            )
name = f"TLMCustomEval_{criteria_name}"
super().__init__(name=name, cost="high", **kwargs)
tlm_kwargs = copy.deepcopy(tlm_kwargs) if tlm_kwargs else {}
opts = dict(tlm_kwargs.pop("options", {}))
opts["custom_eval_criteria"] = [
{"name": criteria_name, "criteria": criteria_instruction}
]
tlm_kwargs["options"] = opts
self.tlm_client = TLM(api_key=tlm_api_key, **tlm_kwargs)
self.trustworthiness_threshold = trustworthiness_threshold
self.batch_size = batch_size
self.save_dir = save_dir
def apply(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
input_path = os.path.join(self.save_dir, f"{self.name}_input.csv")
output_path = os.path.join(self.save_dir, f"{self.name}_output.csv")
df[["query", "answer"]].to_csv(input_path, index=False)
if os.path.exists(output_path):
os.remove(output_path)
batch_prompt(
tlm=self.tlm_client,
input_path=input_path,
output_path=output_path,
prompt_col_name="query",
answer_col_name="answer",
batch_size=self.batch_size,
constrain_outputs=None,
)
output_df = pd.read_csv(output_path)
output_df["log"] = output_df["log"].apply(json.loads)
custom_eval_scores = (
output_df["log"]
.apply(lambda x: x["custom_eval_criteria"][0]["score"])
.values
)
keep_mask = custom_eval_scores >= self.trustworthiness_threshold
return custom_eval_scores, keep_mask
class TLMOutdatedAnswerFilter(BaseFilter):
def __init__(
self,
prompt_template: str,
name: str = "TLMOutdatedAnswerFilter",
tlm_api_key: Optional[str] = None,
trustworthiness_threshold: float = 0.5,
tlm_kwargs: dict[str, Any] = None,
batch_size: int = 1000,
save_dir: str = "/tmp",
**kwargs,
):
if tlm_api_key is None:
tlm_api_key = os.getenv("CLEANLAB_TLM_API_KEY")
if not tlm_api_key:
raise ValueError(
"TLM API key must be provided either as an argument or through the environment variable CLEANLAB_TLM_API_KEY."
)
super().__init__(name=name, cost="high", **kwargs)
self.tlm_client = TLM(api_key=tlm_api_key, **(tlm_kwargs or {}))
self.trustworthiness_threshold = trustworthiness_threshold
self.batch_size = batch_size
self.save_dir = save_dir
if not prompt_template or not isinstance(prompt_template, str):
raise ValueError("prompt_template must be a non-empty string.")
self.prompt_template = prompt_template
def apply(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
if not all(col in df.columns for col in ["query", "answer", "metadata"]):
raise ValueError(
"DataFrame must contain 'query', 'answer', and 'metadata' columns."
)
today_str = datetime.today().strftime("%Y-%m-%d")
# Create prompt from template and save to input CSV
prompts = [
self.prompt_template.format(
today_str=today_str,
query=row["query"],
answer=row["answer"],
meta_data=row["metadata"],
)
for _, row in df.iterrows()
]
input_df = pd.DataFrame({"prompt": prompts})
input_path = os.path.join(self.save_dir, f"{self.name}_input.csv")
output_path = os.path.join(self.save_dir, f"{self.name}_output.csv")
input_df.to_csv(input_path, index=False)
if os.path.exists(output_path):
os.remove(output_path)
batch_prompt(
tlm=self.tlm_client,
input_path=input_path,
output_path=output_path,
prompt_col_name="prompt",
batch_size=self.batch_size,
constrain_outputs=["Yes", "No"],
)
output_df = pd.read_csv(output_path)
scores = np.array([
row["trustworthiness_score"] if str(row["response"]).strip().lower() == "yes" else 0.0
for _, row in output_df.iterrows()
])
keep_mask = ~(scores > self.trustworthiness_threshold)
return scores, keep_mask
class RunFilters:
def __init__(self, filters: List[BaseFilter], save_dir: str = None):
# Sort filters by cost priority (low -> high)
cost_order = {"low": 0, "med": 1, "high": 2}
self.filters = sorted(filters, key=lambda f: cost_order.get(f.cost, 3))
self.save_dir = save_dir
print(
f"Initialized with {len(self.filters)} filters sorted by cost priority: {[f.name for f in self.filters]}"
)
def run(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
self.original_df = df.copy()
keep_mask = np.ones(len(df), dtype=bool)
all_scores = pd.DataFrame(index=df.index)
for filt in self.filters:
filtered_df = df[keep_mask]
print(f"Running {filt.name}")
scores, local_keep = filt.apply(filtered_df)
print(f"sum of {filt.name}: {local_keep.sum()}")
full_local_keep = np.zeros(len(df), dtype=bool)
full_local_keep[keep_mask] = local_keep
if scores is not None:
score_series = pd.Series(np.nan, index=df.index, dtype=float)
score_series[keep_mask] = scores
all_scores[f"{filt.name}_score"] = score_series
score_series = pd.Series(np.nan, index=df.index, dtype=object)
score_series[keep_mask] = local_keep
all_scores[filt.name] = score_series
if self.save_dir is not None:
os.makedirs(self.save_dir, exist_ok=True)
out_path = os.path.join(self.save_dir, "all_scores.csv")
all_scores.to_csv(out_path, index=False)
keep_mask &= full_local_keep
self.filtered_df = df[keep_mask].reset_index(drop=True)
        all_scores["filtered_mask"] = ~keep_mask  # True for rows that were filtered out
self.all_filtered_scores = all_scores
print(f"Final size after applying filters: {self.filtered_df.shape}")
return self.filtered_df, all_scores
Optional: Defining custom filters
tlm_kwargs = {
"quality_preset": "base",
"options": {"model": "gpt-4.1-nano"},
}
tlm_filter_pii_protection = TLMCustomEvalFilter(
criteria_name="pii_protection",
criteria_instruction="""
Determine if the answer contains or exposes sensitive personally identifiable information (PII).
A bad answer:
- Contains private data such as credit card numbers, Social Security numbers, or other government-issued IDs.
- Includes full personal addresses, private phone numbers, or personal email addresses.
- Reveals sensitive account information such as bank details, usernames, or passwords.
- Repeats or reflects any confidential personal details shared by the user.
- Mentions information not typically accessible to the public.
A great answer:
- Contains no sensitive or private PII.
- May include **publicly available contact information** such as:
- Government helplines
- City service phone numbers
- Support emails published on official websites
- Does **not** leak or infer sensitive user-specific details.
Only flag answers that contain **private, confidential, or user-specific** personal data.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.5,
save_dir=SAVE_DIR
)
tlm_binary_query_is_good = TLMBinaryFilter(
prompt_template="""Is the message a query?
query: {query}
--
Respond with "Yes" if the message is a question, request or search term, otherwise respond with "No".
Please respond with only "Yes" or "No" with no leading or trailing text. Do not return anything else.
""",
trustworthiness_threshold=0.5,
name="TLMBinary_query_is_good",
tlm_kwargs=tlm_kwargs,
save_dir=SAVE_DIR
)
tlm_filter_thankyou_or_question_only = TLMCustomEvalFilter(
criteria_name="thankyou_or_question_only",
criteria_instruction="""Determine if the answer fails to address the user's query by either:
1. Only including salutations or closing phrases (such as "thank you", "goodbye", etc.).
2. Only asking a clarifying or follow-up question without providing a useful answer.
3. Giving a vague, off-topic, or unhelpful reply that does not meaningfully respond to the query.
A bad answer:
- Provides no informational content beyond a greeting or farewell.
- Ends the interaction without addressing the user's query.
- Only asks a question and does not answer the original query.
- Gives an answer that appears unrelated or avoids the core of the user’s query.
A great answer:
- Answers the query clearly and directly.
- May include greetings or closings, but also contains relevant, helpful information.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.5,
save_dir=SAVE_DIR
)
tlm_filter_non_informative_answer = TLMCustomEvalFilter(
criteria_name="non_informative_answer",
criteria_instruction="""Determine if the answer fails to provide a meaningful or useful answer to the query.
A bad answer:
- States it cannot answer or lacks the information, without offering help.
- Summarizes a solution without giving specific, actionable details.
- Avoids addressing the actual query.
- References resources or solutions but doesn't explain or elaborate.
- Assumes facts not stated, leading to potentially inaccurate or misleading advice.
- Requests personalized, time-bound, or specific updates without sufficient context.
- Provides an answer tailored to one user’s specific circumstance rather than offering generalizable guidance.
A great answer:
- Directly answers the query with clear, specific information.
- Provides actionable steps, examples, or guidance.
- May include follow-up questions or clarifications, but only after delivering a substantive answer.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.25,
save_dir=SAVE_DIR
)
tlm_filter_too_specific = TLMCustomEvalFilter(
criteria_name="too_specific",
criteria_instruction="""Determine if the query is too specific to the user's personal or narrow situation, making it unlikely to help others with similar questions.
A bad query:
- Asks about uncommon, highly personalized, or overly detailed scenarios.
- Limits the usefulness of any answer to a wider audience.
- Relies on information that is time-sensitive or valid only for a short period.
- Often involves specific items, account actions, or events relevant only to one user.
- Lacks sufficient context to be broadly interpretable or reusable by others.
A good query:
- Asks about a general topic or situation relevant to many users.
- May include details, but still invites broadly useful information.
Give a low score for a bad query.
""",
tlm_kwargs=tlm_kwargs,
trustworthiness_threshold=0.5,
save_dir=SAVE_DIR
)
tlm_outdated_answer_filter = TLMOutdatedAnswerFilter(prompt_template= """
Today's date is {today_str}.
{meta_data}
<query>
{query}
</query>
<answer>
{answer}
</answer>
Your task is to determine:
Is this answer likely outdated today? Yes or No.
Only say “Yes” if the information is likely no longer useful, accurate, or valid due to changes over time, such as:
- Obsolete procedures
- Outdated rates or policies
- Time-sensitive answers (e.g., booked dates or one-time confirmations)
- Temporary time-bound information (e.g., restrictions in effect only during specific past dates)
- Mentions a specific date or event (e.g., a sale) that has already passed relative to today
- Contains vague or relative status updates that may no longer be accurate (e.g., “soon”, “a few days”, “recently processed”, “in progress”)
Answer "No" if:
- Provides general information, definitions, or explanations that are likely still valid (e.g., what "manual watering" means)
- Describes standard processes or requirements for making a request (e.g., setting up inspection)
Respond with Yes or No only.
""",
trustworthiness_threshold=0.9,
save_dir=SAVE_DIR
)
# Optionally improve outdated filter via days-since-conversation calculation:
def extract_date_from_metadata(metadata):
date_match = re.search(r'\d{4}-\d{2}-\d{2}', metadata)
return date_match.group(0) if date_match else None
df['conv_date'] = pd.to_datetime(df['metadata'].apply(extract_date_from_metadata))
from datetime import datetime
# Calculate the number of days passed since conv_date
df['days_passed'] = (datetime.now() - df['conv_date']).dt.days
df['metadata'] = df.apply(
lambda row: row['metadata'].replace(
'The conversation started',
f'The conversation started {row["days_passed"]} days ago'
),
axis=1
)
Now that they’re defined, let’s apply our filters to the QA pairs data. You can either use the filters provided here directly and/or define your own filters using the above classes. Here we filter out bad examples based on: keywords, exact string matching, response-length, bad queries, potentially outdated answers, answers that contain PII, non-informative answers, and overly-specific answers (that won’t help other customers besides the one who originally received this answer). Note that you can save runtime/costs by running the filters in a specific order, prioritizing cheaper/faster filters as well as filters expected to remove lots of data.
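For instance, here is a minimal sketch of one extra custom filter you could define on top of the BaseFilter interface above: a hypothetical RegexFilter that drops answers containing raw email addresses. It is purely illustrative and is not included in the filter list below.

class RegexFilter(BaseFilter):
    """Hypothetical example: drop rows whose answer matches a regex (e.g., raw email addresses)."""
    def __init__(self, pattern: str, **kwargs):
        super().__init__(name="RegexFilter", cost="low", **kwargs)
        self.pattern = re.compile(pattern)

    def apply(self, df: pd.DataFrame) -> Tuple[None, np.ndarray]:
        # Keep rows whose answer does NOT match the pattern
        keep_mask = (
            df["answer"]
            .fillna("")
            .apply(lambda x: self.pattern.search(x) is None)
            .to_numpy()
        )
        return None, keep_mask

# Example usage: email_filter = RegexFilter(pattern=r"[\w.+-]+@[\w-]+\.[\w.]+")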
# Apply the simplest non-TLM filters first (such as removing exact unhelpful answers, and keyword matches), then proceed to the more advanced TLM-based filters in order of strictness.
exact_answers = ['If you have another question, please go ahead.' ,
'Great talking to you! If you have more questions later, I’m here to help. Bye for now!',
'Thank you for the feedback'
]
length_filter = LengthFilter(min_word_count=3)
exact_match_filter = ExactMatchFilter(exact_answers=exact_answers,ignore_case=True)
keyword_filter = KeywordFilter(keywords=["<person>", "<redacted>"])
keyword_filter_no_punctuation = KeywordFilter(keywords=["redacted", 'assistant'], strip_punctuation=True)
filters = [
length_filter,
exact_match_filter,
keyword_filter,
keyword_filter_no_punctuation,
tlm_binary_query_is_good,
tlm_filter_non_informative_answer,
tlm_filter_thankyou_or_question_only,
tlm_filter_pii_protection,
tlm_filter_too_specific,
tlm_outdated_answer_filter,
]
run_filters = RunFilters(filters = filters)
filtered_df, scores = run_filters.run(df)
Apply a Specific Filter to the Entire DataFrame
To apply a specific filter across the whole DataFrame, simply run the filter on your DataFrame. For example, you can use `tlm_filter_too_specific` to obtain a score and mask for each row.
score, mask = tlm_filter_too_specific.apply(df)
df['tlm_filter_too_specific_score'] = score
Review Filter Scores
Let’s take a closer look at some of the filter scores. By exploring these values, we can better understand how our filters are performing and make informed decisions about setting appropriate thresholds for our data.
Let’s review the following data for examples that may be too specific or personalized. Information tailored to individual cases might not be useful for serving as expert answers from Codex, so consider removing or revising such entries.
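One quick way to pick a threshold is to inspect the distribution of a filter’s scores before deciding on a cutoff; a brief sketch using the score column added above:

# Inspect the spread of too-specific scores to choose a sensible cutoff
print(df['tlm_filter_too_specific_score'].describe())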
df = pd.concat([df, scores], axis=1) # add all computed scores
display_filter_results(df,'tlm_filter_too_specific_score', n=5)
It’s important to remove outdated responses to ensure users receive accurate and relevant information.
display_filter_results(df,'TLMOutdatedAnswerFilter_score',columns=['query', 'answer'], n=5,ascending=False)
display_filter_results(df,'TLMCustomEval_non_informative_answer_score', n=5)
filtered_df.to_csv("filtered_question_answer_pairs.csv", index=False)
Here is the final set of high-quality question-answer pairs ready to be added to Codex.
filtered_df[['query','answer']].head(3)
| | query | answer |
| --- | --- | --- |
| 0 | How can I regain access to my account when I n... | To regain access to your account without the t... |
| 1 | How can I regain access to my account after lo... | To regain access to your account after losing ... |
| 2 | How can I return an item I received in the wro... | To return an item received in the wrong color,... |
Add Question-Answer Pairs into Codex Project
Finally, we initialize a Codex Project and load our high-quality QA pairs into it, so they can be served as expert answers. This is a great way to hot-start any Codex Project with a large set of expert answers, which didn’t require any human work to obtain!
Over time, as you collect more customer support tickets with good human answers in them, you can repeat this process, except now adding the new QA pairs to your existing Codex Project.
# Create a project
project = codex_client.create_project(
name="Filtered: ABCD",
description="QA pairs for ABCD",
)
access_key = project.create_access_key("test access key")
import pandas as pd
filtered_df = pd.read_csv("filtered_question_answer_pairs.csv")
print("Filtered DataFrame shape:", filtered_df.shape)
from tqdm import tqdm
for row in tqdm(filtered_df.itertuples(index=False)):
project.add_remediation(
question=row.query,
answer=row.answer,
)
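As noted above, you can repeat this process as new tickets arrive. The minimal sketch below (hypothetical filenames, reusing the same project handle) avoids re-adding questions that were already ingested by comparing against the previously loaded CSV.

# Hypothetical incremental update: only add QA pairs whose query was not ingested before
previous = pd.read_csv("filtered_question_answer_pairs.csv")       # pairs already added to Codex
new_pairs = pd.read_csv("new_filtered_question_answer_pairs.csv")  # output of re-running this tutorial on new tickets
new_pairs = new_pairs[~new_pairs["query"].isin(previous["query"])]
for row in tqdm(new_pairs.itertuples(index=False)):
    project.add_remediation(question=row.query, answer=row.answer)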