
Guardrails/Evaluations in Cleanlab

Cleanlab offers powerful detectors to diagnose and prevent bad responses from your AI: Evaluations and Guardrails. You can configure these as needed for your use case: instantiate new detectors, or use and modify the pre-configured detectors that cover generally useful criteria.

Powered by Cleanlab’s Trustworthy Language Model, Evaluations score the AI Response (or other fields like the User Query or Retrieved Context) according to criteria that you specify. Evaluations run offline and are useful for diagnosing and root-causing why a particular AI Response was bad, as well as for monitoring and analytics.

Guardrails are a special type of Evaluation: they are constructed the same way, but run in real time. When a Guardrail flags your AI response, you can prevent that response from being served to your user.

Overview

Cleanlab provides four main types of evaluators that can be configured as either Guardrails or Evaluations:

  1. Trustworthiness Evaluations: Score the general reliability and accuracy of each LLM response
  2. TrustworthyRAG Evaluations: Root-cause RAG-specific failures and quality issues
  3. Instruction-Adherence Evaluations: Monitor compliance with system prompt instructions
  4. Custom Evaluations: Create additional, tailored evaluations for specific criteria

Note: These evaluation types can be configured as either:

  • Guardrails: When used to block unsafe AI outputs
  • Evaluations: When used for monitoring and analytics (no blocking)

For detailed configuration of guardrails and evaluations in the Cleanlab AI Platform interface, see the Guardrails & Evaluations documentation.

Trustworthiness Evaluations - Score and monitor the reliability of your LLM responses in real-time

What does it detect?

Trustworthiness Evaluations provide real-time scoring of LLM responses along two key dimensions:

  1. Factual Accuracy: Detects incorrect or misleading information
  2. Confidence Level: Measures the model’s certainty in its responses

How it works

Trustworthiness Evaluations are powered by Cleanlab’s Trustworthy Language Model (TLM), which:

  • Uses state-of-the-art uncertainty estimation techniques
  • Can evaluate responses from any LLM (not just Cleanlab’s)

Trustworthiness Evaluation Flow

Figure 1: How Trustworthiness Evaluations work. The evaluation takes the user’s query and the LLM’s response as input, processes them through Cleanlab’s Trustworthy Language Model, and outputs a trustworthiness score between 0 and 1.
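
The sketch below shows one way to obtain such a score in code. It assumes the `cleanlab-tlm` Python package and its `TLM.get_trustworthiness_score()` method, with an API key already configured; package and method names may differ in your deployment, so treat this as a sketch rather than a definitive integration.

```python
# Hedged sketch: scoring an existing LLM response with Cleanlab's TLM.
# Assumes the `cleanlab-tlm` Python package is installed and an API key is configured.
from cleanlab_tlm import TLM

tlm = TLM()  # client for the Trustworthy Language Model

# Score a response produced by any LLM (not necessarily one served by Cleanlab).
result = tlm.get_trustworthiness_score(
    prompt="In what year was the Eiffel Tower completed?",
    response="The Eiffel Tower was completed in 1889.",
)

print(result["trustworthiness_score"])  # a float between 0 and 1; higher means more trustworthy
```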

Usage as Guardrails vs Evaluations

  • As a Guardrail: Block responses with low trustworthiness scores to prevent unreliable information from reaching users.
  • As an Evaluation: Monitor AI trustworthiness trends over time, and quickly surface incorrect responses when reviewing AI performance later.
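
The same score supports both usage modes. Below is a minimal sketch of the difference; the threshold, fallback message, and logging approach are hypothetical choices for illustration, not platform defaults.

```python
# Minimal sketch contrasting guardrail vs. evaluation usage of a trustworthiness score.
# The threshold and fallback message are hypothetical values to tune for your use case.

TRUST_THRESHOLD = 0.7
FALLBACK_MESSAGE = "I'm not confident in this answer; let me connect you with a human agent."

def guardrail(draft_response: str, trustworthiness_score: float) -> str:
    """Guardrail usage: block low-scoring drafts before they reach the user."""
    if trustworthiness_score < TRUST_THRESHOLD:
        return FALLBACK_MESSAGE
    return draft_response

def evaluate(query: str, draft_response: str, trustworthiness_score: float) -> None:
    """Evaluation usage: record the score for later monitoring and review (no blocking)."""
    print(f"trustworthiness={trustworthiness_score:.2f} | query={query!r} | response={draft_response!r}")
```
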
TrustworthyRAG Evaluations - Root-cause RAG-specific issues that are causing your AI to serve bad responses

What does it detect?

TrustworthyRAG provides real-time evaluation of RAG-specific issues:

  1. Search Failures: Issues with context retrieval, where your RAG’s underlying search returned irrelevant or incomplete results that cannot answer the user’s question
  2. Hallucinations: Ungrounded or untrustworthy responses, where the LLM generates answers from its own opinions or world knowledge rather than from your knowledge base
  3. Unhelpful Responses: Other quality issues, where the information may be accurate and grounded in your internal knowledge but does not address the intent of the user
  4. Difficult Queries: Complex or ambiguous user questions that are themselves the root cause of bad responses

How does it work?

TrustworthyRAG Evaluation Flow

Figure 2: TrustworthyRAG Evaluation process. The system evaluates the query, retrieved context, and LLM response to identify specific RAG-related issues and assign appropriate scores.
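
A rough sketch of running these checks in code appears below. It assumes the `cleanlab-tlm` package’s `TrustworthyRAG` interface and that the default evals carry names like `context_sufficiency`, `response_groundedness`, `response_helpfulness`, and `query_ease`; both the interface and the eval names are assumptions that may differ from your version.

```python
# Hedged sketch: evaluating one RAG interaction with TrustworthyRAG.
# Assumes the `cleanlab-tlm` Python package; eval names below are assumptions about the defaults.
from cleanlab_tlm import TrustworthyRAG

trustworthy_rag = TrustworthyRAG()

results = trustworthy_rag.score(
    query="How do I reset my password?",
    context="To reset your password, open Settings > Security and click 'Reset password'.",
    response="Open Settings > Security and click 'Reset password'.",
)

# Each eval maps to a score between 0 and 1; low scores flag the corresponding issue
# (e.g. low context_sufficiency suggests a search failure,
#  low response_groundedness suggests a hallucination).
for eval_name, eval_result in results.items():
    print(eval_name, eval_result["score"])
```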

Usage as Guardrails vs Evaluations

  • As Guardrails: Block responses with search failures, hallucinations, or unhelpful content
  • As Evaluations: Monitor RAG performance patterns and identify systemic issues in your knowledge base or search system
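
One way to act on these scores as Guardrails is sketched below; the eval names, thresholds, and fallback messages are all hypothetical and should be tuned for your deployment.

```python
# Hedged sketch: routing TrustworthyRAG-style scores to guardrail actions.
# Eval names, thresholds, and fallback messages are hypothetical.

THRESHOLDS = {
    "context_sufficiency": 0.5,    # low score -> likely search failure
    "response_groundedness": 0.6,  # low score -> likely hallucination
    "response_helpfulness": 0.5,   # low score -> likely unhelpful response
}

FALLBACKS = {
    "context_sufficiency": "I couldn't find documentation that answers this. Could you rephrase your question?",
    "response_groundedness": "I'm not confident this answer is supported by our documentation.",
    "response_helpfulness": "I may not have addressed your question. Could you clarify what you need?",
}

def apply_guardrails(scores: dict[str, float], draft_response: str) -> str:
    """Return the draft response, or a fallback message if any guardrail fires."""
    for eval_name, threshold in THRESHOLDS.items():
        if scores.get(eval_name, 1.0) < threshold:
            return FALLBACKS[eval_name]
    return draft_response
```
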
Instruction-Adherence Evaluations - Ensure your AI is following the instructions in your system prompt

What does it detect?

Instruction-Adherence Evaluations automatically measure failure rates for each individual instruction in your system prompt, such as:

  • Policy compliance (e.g., “Do not expose account details without identity verification”)
  • User experience guidelines (e.g., “Always output responses in HTML”)
  • Customer-specific requirements

Instruction-Adherence Evaluation Flow

Figure 3: Instruction-Adherence Evaluation workflow. The system parses the system prompt into individual instructions, evaluates the LLM’s response against each instruction, and identifies any compliance failures.
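
The sketch below mirrors this workflow purely for illustration; it is not how the pre-built Instruction-Adherence Evaluation is implemented. It assumes the `cleanlab-tlm` package with an API key configured, and uses TLM’s trustworthiness score of a “Yes” compliance answer as one possible way to judge adherence to each instruction.

```python
# Purely illustrative sketch of the Figure 3 workflow (not Cleanlab's internal implementation).
# Assumes the `cleanlab-tlm` Python package is installed and an API key is configured.
from cleanlab_tlm import TLM

tlm = TLM()

SYSTEM_PROMPT = """\
Do not expose account details without identity verification.
Always output responses in HTML.
"""

def parse_instructions(system_prompt: str) -> list[str]:
    """Split the system prompt into individual instructions (here: one per line)."""
    return [line.strip() for line in system_prompt.splitlines() if line.strip()]

def adherence_score(instruction: str, response: str) -> float:
    """Illustrative: score compliance with one instruction via TLM's trustworthiness of a 'Yes' answer."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"AI response: {response}\n"
        "Does the AI response comply with the instruction? Answer Yes or No."
    )
    result = tlm.get_trustworthiness_score(prompt=prompt, response="Yes")
    return result["trustworthiness_score"]

def find_violations(system_prompt: str, response: str, threshold: float = 0.5) -> list[str]:
    """Return instructions the response likely violates (threshold is a hypothetical cutoff)."""
    return [
        instruction
        for instruction in parse_instructions(system_prompt)
        if adherence_score(instruction, response) < threshold
    ]
```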

Key Benefits

  1. Real-time Monitoring: Detect spikes in policy failures immediately
  2. Quantitative Health Metrics: Measure instruction compliance rates in analytics
  3. Automatic Evaluation: No need to maintain custom evals for new instructions
  4. Detailed Logging: Drill into specific instruction-failure cases

Usage as Guardrails vs Evaluations

  • As Guardrails: Block responses that violate critical system instructions or policies
  • As Evaluations: Monitor compliance rates and identify which instructions are most frequently violated

Custom Evaluations - Create tailored evaluations for your specific needs

What are Custom Evaluations?

Custom evaluations allow you to define specific criteria for assessing your AI application’s performance. You can evaluate aspects such as:

  • Response conciseness and clarity
  • Formatting and structure
  • Tone and style
  • Domain-specific requirements
  • Internal compliance guidelines
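
As a rough sketch of what a custom evaluation can look like in code, the example below defines a “conciseness” criterion. It assumes the `cleanlab-tlm` package; the `Eval` class, its argument names, its import path, and the result format are all assumptions that may differ in your version.

```python
# Hedged sketch: defining a custom "conciseness" evaluation.
# Assumes the `cleanlab-tlm` package; class, argument, and import names may differ in your version.
from cleanlab_tlm import TrustworthyRAG
from cleanlab_tlm.utils.rag import Eval

conciseness_eval = Eval(
    name="conciseness",
    criteria="Determine whether the AI Response is concise and avoids unnecessary filler.",
    response_identifier="AI Response",  # assumed way to tell the evaluator which field the criteria refer to
)

trustworthy_rag = TrustworthyRAG(evals=[conciseness_eval])

results = trustworthy_rag.score(
    query="What is your refund policy?",
    context="Refunds are available within 30 days of purchase with proof of receipt.",
    response="Refunds are available within 30 days of purchase, with proof of receipt.",
)

print(results["conciseness"]["score"])  # assumed result format: a score between 0 and 1
```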

Usage as Guardrails vs Evaluations

  • As Guardrails: Block responses that don’t meet your specific quality or safety criteria
  • As Evaluations: Monitor performance against your custom metrics and identify improvement opportunities