Codex Analytics

Codex Analytics provides a real-time dashboard to monitor the health of your AI app, the impact of Codex in preventing bad responses, and the engagement of your SMEs in Codex as they improve your AI app's performance.

Codex Analytics seamlessly integrates with Cleanlab AI Platform’s real-time evaluations, allowing you to answer questions such as:

  • How many bad responses did my bot serve to users this week?
  • In what countries are users getting the worst responses?
  • Why is my app giving bad responses (failures in the underlying search? ungrounded/made-up responses)?
  • How many bad responses were prevented entirely by Codex?

Codex Analytics Dashboard

Who’s it for?

  • Product Owners and Stakeholders tracking the health of their AI app, volume of bad responses being served to users, and Codex’s remediation impact over time
  • Developers to understand the performance of their app at a high level, and as an entry point to drill into specific examples of failures
  • SMEs to observe the overall app performance and trends
  • SME Managers & Team Leads to track SME contributions

What does Codex Analytics report?

  • Total queries: See the total volume of queries observed by Codex.
  • # of Good/Bad Responses Detected: See the overall health of your RAG app, based on the good/bad responses detected by Cleanlab's real-time evaluations.
  • # of Bad Responses Prevented: Of the bad responses detected by Codex, the number that were answered by Codex instead, showing Codex's remediation impact. This number should grow as Codex becomes more capable of remediating your AI application's bad responses.
  • # of Bad Responses Served to Users: Of the bad responses detected by Codex, the remainder that were not answered by Codex. This number should decrease as Codex becomes more capable of remediating your application's bad responses.
  • Primary RAG Issues: For your RAG application's bad responses, we root-cause the primary issue and show you the number of failures attributed to:
      • Search Failures
      • Hallucinations
      • Unhelpful Responses
      • Difficult Queries
  • Instruction-Adherence Failures: Measure the health of your AI app in adhering to each individual system instruction.
  • SME Engagement: Measure the engagement from each SME, based on the number of Answers they've provided in Codex.
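
To make the relationships between these counts concrete, here is a minimal sketch that reproduces the top-line numbers from a list of logged queries. The LoggedQuery record and its fields are hypothetical stand-ins for illustration, not the actual Codex data model:

```python
from dataclasses import dataclass

@dataclass
class LoggedQuery:
    """Hypothetical record of one query observed by Codex (not the real SDK schema)."""
    is_bad_response: bool    # flagged bad by Cleanlab's real-time evaluations
    answered_by_codex: bool  # an SME Answer from Codex was served instead

def summarize(queries: list[LoggedQuery]) -> dict:
    """Recompute the dashboard's top-line counts from raw records."""
    bad = [q for q in queries if q.is_bad_response]
    return {
        "total_queries": len(queries),
        "good_responses_detected": len(queries) - len(bad),
        "bad_responses_detected": len(bad),
        "bad_responses_prevented": sum(q.answered_by_codex for q in bad),
        "bad_responses_served_to_users": sum(not q.answered_by_codex for q in bad),
    }

# Example: three queries, one bad response remediated by Codex, one served to the user.
logs = [
    LoggedQuery(is_bad_response=False, answered_by_codex=False),
    LoggedQuery(is_bad_response=True, answered_by_codex=True),
    LoggedQuery(is_bad_response=True, answered_by_codex=False),
]
print(summarize(logs))
```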

How do I measure the impact of Codex in improving my application?

One of the key metrics in Codex is the # of Bad Responses Prevented by Codex, highlighted in green within your Codex Analytics dashboard. This metric tracks the number of times an Answer entered by an SME in Codex was provided back to the user in place of an otherwise bad response.

As you add more Answers in Codex and those Answers are served in real-time traffic, this count should increase over time, signaling a key part of Codex's ROI and impact.
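
If you export or log this data yourself, one way to track that trend is to bucket prevented responses by week. The sketch below assumes a hypothetical per-query log rather than the real Codex schema:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical per-query log: (date served, whether an SME Answer from Codex
# replaced an otherwise bad response). Illustrative only, not the Codex schema.
served_log = [
    (date(2024, 6, 3), True),
    (date(2024, 6, 4), False),
    (date(2024, 6, 12), True),
    (date(2024, 6, 13), True),
]

def week_start(d: date) -> date:
    """Bucket a date into the Monday of its week."""
    return d - timedelta(days=d.weekday())

prevented_per_week = Counter(week_start(d) for d, prevented in served_log if prevented)
for week, count in sorted(prevented_per_week.items()):
    print(f"week of {week}: {count} bad responses prevented by Codex")
```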

How are the Primary Issues metrics defined?

The Primary Issues section in Codex Analytics shows the count of issues, labeled based on evaluations from Cleanlab's TrustworthyRAG module.

Each Bad Response is assigned a Primary Issue according to the following hierarchy, which is what Analytics reports:

  1. If the query_ease eval score is below threshold: Difficult Query
  2. If the helpfulness eval score is below threshold: Unhelpful (Other Issues)
  3. If the context_sufficiency eval score is below threshold: Search Failure
  4. Finally, if either the trustworthiness or groundedness eval score is below threshold: Hallucination
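
A minimal sketch of that hierarchy as code, assuming a dictionary of eval scores and a single placeholder threshold (the eval names mirror TrustworthyRAG's scores; the 0.5 threshold here is illustrative, not the value Codex actually uses):

```python
from typing import Optional

THRESHOLD = 0.5  # placeholder cutoff for illustration only

def primary_issue(scores: dict, threshold: float = THRESHOLD) -> Optional[str]:
    """Return the Primary Issue label for a bad response, or None if no eval is below threshold."""
    if scores.get("query_ease", 1.0) < threshold:
        return "Difficult Query"
    if scores.get("helpfulness", 1.0) < threshold:
        return "Unhelpful (Other Issues)"
    if scores.get("context_sufficiency", 1.0) < threshold:
        return "Search Failure"
    if min(scores.get("trustworthiness", 1.0), scores.get("groundedness", 1.0)) < threshold:
        return "Hallucination"
    return None

# Example: easy query and sufficient context, but an untrustworthy answer.
print(primary_issue({"query_ease": 0.9, "helpfulness": 0.8,
                     "context_sufficiency": 0.7, "trustworthiness": 0.2}))
# -> Hallucination
```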

What are the instruction-adherence metrics in Analytics?

At a glance, Cleanlab allows you to evaluate the health of your app in following the instructions you’ve provided.

Instruction-adherence evaluations from Cleanlab measure, for every AI app response, whether it complies with each individual instruction in your system prompt.

In Analytics, there will be a section that captures:

  1. The total number of instruction-adherence failures detected by Codex
  2. For each individual instruction, the number of instruction-adherence failures detected by Codex
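
As an illustration of those two numbers, the sketch below aggregates per-instruction failures from hypothetical adherence results (the log format shown is not the real Codex schema):

```python
from collections import Counter

# Hypothetical per-response adherence results: for each logged response, a map of
# system-prompt instruction -> whether the response adhered to it.
adherence_logs = [
    {"Respond only in English": True,  "Cite a source document": False},
    {"Respond only in English": True,  "Cite a source document": True},
    {"Respond only in English": False, "Cite a source document": False},
]

failures_per_instruction = Counter()
for result in adherence_logs:
    for instruction, adhered in result.items():
        if not adhered:
            failures_per_instruction[instruction] += 1

print(f"Total instruction-adherence failures: {sum(failures_per_instruction.values())}")
for instruction, count in failures_per_instruction.items():
    print(f"  {instruction!r}: {count} failures")
```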

What can I filter by?

Date Filters

Date Filters are available for drilling into performance in a specific period, or comparing performance over time:

  • Preset Ranges: Last 7 days, 30 days, quarter, year
  • Custom Range: Pick start/end dates (MM/DD/YYYY)

Other Metadata Filters

Users of Codex Analytics can drill down into specific metadata categories they're interested in, depending on what other metadata is logged to Codex.

For example, if the user's geo is logged, Codex lets you filter performance by region to visualize in which regions Codex is detecting the most bad responses from your AI application.
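
As a sketch of what this slicing looks like on exported data, the snippet below counts bad responses per geo value; the log format and the "geo" field name are illustrative assumptions, not the actual Codex schema:

```python
from collections import Counter

# Hypothetical metadata logged alongside each query. If a field like "geo" is
# logged, Analytics can slice bad-response counts by it.
query_logs = [
    {"geo": "US", "is_bad_response": False},
    {"geo": "US", "is_bad_response": True},
    {"geo": "DE", "is_bad_response": True},
    {"geo": "DE", "is_bad_response": True},
]

bad_by_geo = Counter(q["geo"] for q in query_logs if q["is_bad_response"])
for geo, count in bad_by_geo.most_common():
    print(f"{geo}: {count} bad responses detected")
```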