Real-Time Evaluations in Codex
Powered by Cleanlab's Trustworthy Language Model (TLM), all of Cleanlab's real-time evaluations are automatically integrated into Codex, which surfaces them through analytics dashboards, trend analysis, and detailed logging.
This tutorial describes the real-time evaluations that Cleanlab runs on your AI application and surfaces in Codex.
Overview
Codex provides four main types of real-time evaluations to monitor your AI application:
- Trustworthiness Evaluations: Score the general reliability and accuracy of each LLM response
- TrustworthyRAG Evaluations: Detect RAG-specific failures and quality issues
- Instruction-Adherence Evaluations: Monitor compliance with system prompt instructions
- Custom Evaluations: Create additional, tailored evaluations for specific criteria
Trustworthiness Evaluations - Score and monitor the reliability of your LLM responses in real time
What does it detect?
Trustworthiness Evaluations provide real-time scoring of LLM responses along dimensions such as:
- Factual Accuracy: Detects incorrect or misleading information
- Confidence Level: Measures the model’s certainty in its responses
How it works
Trustworthiness Evaluations are powered by Cleanlab’s Trustworthy Language Model (TLM), which:
- Uses state-of-the-art uncertainty estimation techniques
- Can evaluate responses from any LLM (not just Cleanlab’s)
Figure 1: How Trustworthiness Evaluations work. The evaluation takes the user’s query and the LLM’s response as input, processes them through Cleanlab’s Trustworthy Language Model, and outputs a trustworthiness score between 0 and 1.
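As a rough sketch of what this looks like programmatically, the snippet below scores a single query/response pair using the cleanlab_tlm Python package (assuming a TLM API key is configured; the query and response shown are hypothetical). In a Codex deployment, these scores are computed automatically for every logged response.

```python
from cleanlab_tlm import TLM  # pip install cleanlab-tlm

tlm = TLM()  # reads your API key from the CLEANLAB_TLM_API_KEY environment variable

# Score how trustworthy an existing LLM response is for a given query.
# The response can come from any LLM, not just Cleanlab's.
result = tlm.get_trustworthiness_score(
    "How many days of free trial does your Pro plan include?",
    response="The Pro plan includes a 14-day free trial.",
)

# A score near 1 indicates a reliable response; a low score flags a likely error.
print(result["trustworthiness_score"])
```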
TrustworthyRAG Evaluations - Detect and fix RAG-specific issues before they impact users
What does it detect?
TrustworthyRAG provides real-time evaluation of RAG-specific issues:
- Search Failures: Issues with context retrieval, where your RAG’s underlying search returned irrelevant or incomplete results that cannot answer the user’s question
- Hallucinations: Ungrounded or untrustworthy responses, where the LLM generates answers from its own opinions or world knowledge rather than your product documentation
- Unhelpful Responses: Other quality issues, where the information may be accurate and grounded in your internal knowledge, yet the response does not address the user's intent
- Difficult Queries: Complex or ambiguous user questions that are often the root cause of the issues above
How does it work?
Figure 2: TrustworthyRAG Evaluation process. The system evaluates the query, retrieved context, and LLM response to identify specific RAG-related issues and assign appropriate scores.
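The sketch below illustrates this evaluation using the TrustworthyRAG interface from the cleanlab_tlm package (the exact eval names and output format may vary by version; the query, context, and response are hypothetical). Within Codex, these evaluations run automatically on each logged RAG interaction.

```python
from cleanlab_tlm import TrustworthyRAG

trustworthy_rag = TrustworthyRAG()  # uses Cleanlab's default RAG evals

scores = trustworthy_rag.score(
    query="What is the refund window for annual subscriptions?",
    context="Refunds are available within 30 days of purchase for all subscription tiers.",
    response="Annual subscriptions can be refunded within 30 days of purchase.",
)

# `scores` maps each eval (e.g. context sufficiency, response groundedness,
# response helpfulness) to a score between 0 and 1; a low score flags the
# corresponding RAG failure mode.
for eval_name, eval_result in scores.items():
    print(eval_name, eval_result["score"])
```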
Instruction-Adherence Evaluations - Ensure your AI is following the instructions in your system prompt
What does it detect?
Instruction-Adherence Evaluations automatically measure the failure rate of each individual instruction in your system prompt, such as:
- Policy compliance (e.g., “Do not expose account details without identity verification”)
- User experience guidelines (e.g., “Always output responses in HTML”)
- Customer-specific requirements
Figure 3: Instruction-Adherence Evaluation workflow. The system parses the system prompt into individual instructions, evaluates the LLM’s response against each instruction, and identifies any compliance failures.
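To illustrate the underlying idea (this is a simplified sketch, not the Codex implementation itself), the snippet below uses TLM as a judge to check one extracted instruction against a hypothetical response. Codex performs this parsing and per-instruction evaluation automatically for every instruction in your system prompt.

```python
from cleanlab_tlm import TLM

tlm = TLM()

# One instruction parsed out of the system prompt, plus a hypothetical exchange.
instruction = "Always output responses in HTML."
user_query = "What are your support hours?"
ai_response = "<p>Support is available 9am-5pm ET, Monday through Friday.</p>"

# Ask whether the response follows this specific instruction; TLM's
# trustworthiness score reflects how reliable its verdict is.
result = tlm.prompt(
    f"Instruction: {instruction}\n"
    f"User query: {user_query}\n"
    f"AI response: {ai_response}\n"
    "Does the AI response follow the instruction? Answer Yes or No."
)
print(result["response"], result["trustworthiness_score"])
```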
Key Benefits
- Real-time Monitoring: Detect spikes in policy failures immediately
- Quantitative Health Metrics: Measure instruction compliance rates in analytics
- Automatic Evaluation: No need to maintain custom evals for new instructions
- Detailed Logging: Drill into specific instruction-failure cases
Custom Evaluations - Create tailored evaluations for your specific needs
What are Custom Evaluations?
Custom evaluations allow you to define specific criteria for assessing your AI application’s performance. You can evaluate aspects such as:
- Response conciseness and clarity
- Formatting and structure
- Tone and style
- Domain-specific requirements
- Internal compliance guidelines
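As a minimal sketch (assuming the Eval and TrustworthyRAG interfaces from the cleanlab_tlm package; the criteria and inputs shown are hypothetical), a custom evaluation is defined by a name and natural-language criteria, and can then run alongside the default evals:

```python
from cleanlab_tlm import TrustworthyRAG, Eval, get_default_evals

# Define a tailored evaluation in natural language.
conciseness_eval = Eval(
    name="conciseness",
    criteria="Determine whether the response is concise and avoids unnecessary detail.",
    response_identifier="Response",  # this eval only needs to inspect the response
)

# Run the custom eval alongside Cleanlab's default RAG evals.
trustworthy_rag = TrustworthyRAG(evals=get_default_evals() + [conciseness_eval])

scores = trustworthy_rag.score(
    query="How do I reset my password?",
    context="Users can reset passwords from the account settings page.",
    response="Go to account settings and click 'Reset password'.",
)
print(scores["conciseness"]["score"])
```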