AI Evaluation with DeepEval

Learn how to evaluate RAG pipelines and LLM applications using DeepEval, Confident AI, and the LLM-as-judge pattern, from building a RAG pipeline to running custom evaluation metrics.

Introduction

As AI applications become part of real products, testing them is no longer optional. Unlike traditional software where outputs are deterministic, LLM outputs are probabilistic — the same input can produce different outputs. This makes classical assertion-based testing insufficient.
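
A two-line sketch makes the problem concrete. The answer strings below are hypothetical examples: both carry the same meaning, yet the exact-match assertion at the heart of classical testing calls one of them wrong.

```python
# Two answers with identical meaning but different wording.
expected = "Full-time employees receive 20 paid vacation days per year."
actual = "Full-time employees get 20 paid vacation days annually."

# A classical exact-match assertion treats this as a failure,
# even though any human reviewer would accept the answer.
exact_match = (expected == actual)
print(exact_match)  # -> False
```

This is why LLM evaluation needs semantic metrics rather than string comparison.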

DeepEval is an open-source evaluation framework for LLM applications. It provides both built-in and custom metrics that use an LLM-as-judge approach — a powerful LLM (like GPT-4o) evaluates the quality of another LLM's responses.

Confident AI is the companion platform where you can store test datasets (called "goldens"), track evaluation results over time, and collaborate with your team.

What we'll build: In this guide, we'll evaluate a RAG (Retrieval-Augmented Generation) pipeline that reads a business document and answers questions about it. We'll create a golden dataset, push it to Confident AI, pull it back, run the RAG, and evaluate the results with both built-in and custom metrics.

The Evaluation Flow

Step | Action | Tool
1 | Build a RAG pipeline from a business document | LangChain + OpenAI
2 | Create golden test cases (input + expected output) | DeepEval
3 | Push goldens to Confident AI for storage | Confident AI
4 | Pull goldens back and run the RAG for each one | DeepEval + RAG
5 | Evaluate with metrics (relevancy, hallucination, custom) | DeepEval (LLM-as-judge)
6 | View results in the dashboard | Confident AI

Building the RAG Pipeline

RAG (Retrieval-Augmented Generation) is a pattern where instead of relying solely on the LLM's training data, you first retrieve relevant information from your own documents and then pass it to the LLM as context.

How RAG Works

The flow is straightforward:

  1. Load — Read the business document (e.g. company policy, FAQ, knowledge base)
  2. Chunk — Split the document into smaller pieces (500 characters each)
  3. Embed — Convert each chunk into a numerical vector using an embedding model
  4. Store — Save vectors in a vector store (FAISS) for fast similarity search
  5. Retrieve — When a question comes in, find the 3 most relevant chunks
  6. Generate — Pass the question + relevant chunks to the LLM to generate an answer
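
At their core, steps 3-5 are nearest-neighbor search over embedding vectors. Here's a toy sketch with made-up 3-dimensional "embeddings" and an in-memory dict as the store; a real pipeline delegates this to an embedding model and FAISS, which we'll set up below.

```python
import math

# Toy "vector store": chunk text -> made-up 3-dimensional embedding.
store = {
    "Refunds within 30 days.":    [0.9, 0.1, 0.0],
    "20 vacation days per year.": [0.1, 0.9, 0.0],
    "2FA is mandatory.":          [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend this is the embedded question about vacation days.
query_vec = [0.2, 0.8, 0.1]

# Retrieval = pick the chunk whose embedding is most similar to the query.
best = max(store, key=lambda chunk: cosine(store[chunk], query_vec))
print(best)  # -> "20 vacation days per year."
```

Real embeddings have hundreds or thousands of dimensions, and FAISS makes this search fast over large document sets, but the principle is the same.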

First, let's create the business document we'll use as our knowledge base:

company_policy.txt

# company_policy.txt (example business document)

## Refund Policy
Customers can request a full refund within 30 days of purchase.
After 30 days, a partial refund of 50% is available up to 90 days.
Digital products are non-refundable once downloaded.

## Employee Benefits
Full-time employees receive 20 paid vacation days per year.
Part-time employees receive 10 paid vacation days.
Unused vacation days can be carried over to the next year,
up to a maximum of 5 days.

## Remote Work Policy
Employees may work remotely up to 3 days per week.
A minimum of 2 days per week must be spent in the office.
Remote work requests must be approved by the direct manager.

## Data Security
All employees must use two-factor authentication (2FA) for
company accounts. Sensitive data must be encrypted at rest
and in transit. Security training is mandatory every quarter.

Now let's build the RAG pipeline:

rag_pipeline.py

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# 1. Load the business document
loader = TextLoader("company_policy.txt", encoding="utf-8")
documents = loader.load()

# 2. Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

# 3. Create embeddings and store in FAISS vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. Create a retriever (returns top 3 most relevant chunks)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 5. Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def ask_rag(question: str) -> tuple[str, str]:
    """Ask a question to the RAG pipeline.

    Returns:
        tuple: (actual_output, retrieval_context)
    """
    # Retrieve relevant chunks
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Generate answer using the LLM
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""

    response = llm.invoke(prompt)
    return response.content, context
Why GPT-4o-mini for the RAG? We use the smaller, cheaper model (gpt-4o-mini) for the RAG pipeline since it only needs to answer based on provided context. The more powerful gpt-4o is reserved for the evaluation step, where it acts as a judge.

DeepEval Setup

Install DeepEval and the required dependencies, then authenticate with Confident AI:

terminal

# Install all required packages
pip install deepeval langchain langchain-openai langchain-community faiss-cpu

# Login to Confident AI (one-time setup)
deepeval login

Set your OpenAI API key as an environment variable:

.env

# .env file
OPENAI_API_KEY=sk-your-openai-api-key
API Key Required: You need an OpenAI API key for both the RAG pipeline (embeddings + LLM) and the evaluation metrics (LLM-as-judge). The deepeval login command will give you a Confident AI API key for storing and retrieving datasets.

Creating the Golden Dataset

A golden dataset is a collection of test cases that define what "correct" behavior looks like. Each golden has two fields:

  • input — the question to ask the RAG
  • expected_output — the ideal answer the RAG should produce

Goldens are the foundation of your evaluation. You write them by hand (or with domain experts) based on what the document actually says. Then you push them to Confident AI so they're versioned, shareable, and can be pulled into any evaluation run.

create_goldens.py

from deepeval.dataset import EvaluationDataset, Golden

# Create golden test cases — these define "correct" behavior.
# Each golden has an INPUT (the question) and an EXPECTED OUTPUT
# (the ideal answer the RAG should produce).

goldens = [
    Golden(
        input="What is the refund policy for digital products?",
        expected_output="Digital products are non-refundable once downloaded."
    ),
    Golden(
        input="How many vacation days do full-time employees get?",
        expected_output="Full-time employees receive 20 paid vacation days per year."
    ),
    Golden(
        input="Can unused vacation days be carried over?",
        expected_output="Unused vacation days can be carried over to the next year, up to a maximum of 5 days."
    ),
    Golden(
        input="How many days per week can employees work remotely?",
        expected_output="Employees may work remotely up to 3 days per week."
    ),
    Golden(
        input="What security training is required for employees?",
        expected_output="Security training is mandatory every quarter."
    ),
]

# Push goldens to Confident AI for storage and versioning
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="company-policy-goldens")
print("Goldens pushed to Confident AI!")
Why Confident AI? Storing goldens in Confident AI means your test data is centralized and versioned. Team members can view, edit, and extend the dataset from the web UI. When you run evaluations, results are tracked over time so you can see if your RAG improves or regresses.

Pulling Goldens & Running the RAG

Now comes the key step: we pull the goldens from Confident AI, and for each one, we run the RAG pipeline to get the actual output and the retrieval context. This gives us the 4 fields needed for evaluation:

test_case_structure.py

from deepeval.test_case import LLMTestCase

# What each field in LLMTestCase represents:

LLMTestCase(
    # FROM THE GOLDEN (what we expect):
    input="How many vacation days do full-time employees get?",
    expected_output="Full-time employees receive 20 paid vacation days per year.",

    # FROM THE RAG (what we actually got):
    actual_output="Full-time employees get 20 paid vacation days annually.",
    retrieval_context=["Full-time employees receive 20 paid vacation days per year. "
                       "Part-time employees receive 10 paid vacation days."],
)
Field | Source | Description
input | Golden | The question being asked
expected_output | Golden | The ideal answer (human-written)
actual_output | RAG | What the RAG pipeline actually returned
retrieval_context | RAG | The document chunks the retriever found

Here's the code to pull goldens and build test cases:

build_test_cases.py

from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from rag_pipeline import ask_rag  # the RAG pipeline built earlier

# 1. Pull the golden dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="company-policy-goldens")
print(f"Pulled {len(dataset.goldens)} goldens")

# 2. For each golden, run the RAG to get actual_output and context
test_cases = []

for golden in dataset.goldens:
    # The golden gives us: input + expected_output
    # The RAG gives us:    actual_output + retrieval_context
    actual_output, retrieval_context = ask_rag(golden.input)

    test_case = LLMTestCase(
        input=golden.input,                       # from golden
        expected_output=golden.expected_output,   # from golden
        actual_output=actual_output,              # from RAG
        retrieval_context=[retrieval_context],    # from RAG
        context=[retrieval_context],              # HallucinationMetric reads `context`
    )
    test_cases.append(test_case)
    print(f"✓ Processed: {golden.input[:50]}...")

print(f"\nBuilt {len(test_cases)} test cases ready for evaluation")
The 2+2 Pattern: Think of it as 2 fields from the golden (input, expected_output) + 2 fields from the RAG (actual_output, retrieval_context). The metrics then compare these fields to score quality.

Evaluation Metrics

DeepEval uses LLM-as-judge: a powerful LLM (GPT-4o) evaluates the quality of the RAG's responses. Each metric focuses on a different quality dimension:

Built-in Metrics

Metric | Question It Answers | Uses
Answer Relevancy | Is the answer relevant to the question asked? | input, actual_output
Hallucination | Does the answer contain information NOT in the context? | actual_output, context
Faithfulness | Is every claim in the answer supported by the context? | actual_output, retrieval_context

Custom Metric with GEval

GEval lets you define your own evaluation criteria in plain English, and the LLM judge scores responses against those instructions. This is especially useful for domain-specific evaluation:

  • Completeness — does the answer cover all aspects of the question?
  • Tone — is the answer professional and appropriate?
  • Conciseness — is the answer brief yet informative?

In our example, we create a "Completeness" metric that checks if the RAG's answer fully addresses the question using the retrieved context:

evaluate_rag.py

from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    FaithfulnessMetric,
    GEval,
)
from deepeval.test_case import LLMTestCaseParams
from build_test_cases import test_cases  # built in the previous step

# --- Built-in Metrics ---

# Does the answer actually address the question?
answer_relevancy = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
)

# Does the answer contain information NOT in the context?
# (Scores each test case against its `context` field.)
hallucination = HallucinationMetric(
    model="gpt-4o",
    threshold=0.5,
)

# Is the answer faithful to (supported by) the retrieved context?
faithfulness = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.7,
)

# --- Custom Metric with GEval ---

# GEval lets you define your own evaluation criteria.
# The LLM judge will score based on YOUR custom instructions.
completeness = GEval(
    name="Completeness",
    criteria="Determine whether the actual output fully and "
             "accurately addresses every aspect of the input "
             "question using information from the retrieval context. "
             "Penalize if key details are missing or if the answer "
             "is too vague.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    model="gpt-4o",
    threshold=0.7,
)

# --- Run Evaluation ---

results = evaluate(
    test_cases=test_cases,
    metrics=[answer_relevancy, hallucination, faithfulness, completeness],
)

# Results are automatically uploaded to Confident AI dashboard
print("Evaluation complete! Check your Confident AI dashboard for details.")
Threshold Values: Each metric has a threshold between 0.0 and 1.0. For most metrics, a test case passes when the score is at or above the threshold; start with 0.7 and adjust for your use case. Hallucination is an inverse metric: its score measures how much of the answer contradicts the context, so a test case passes when the score is at or below the threshold (0.5 here).
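
The pass/fail gating can be sketched as follows. This is a simplification, not DeepEval's internals, and `lower_is_better` is our own label for inverse metrics such as hallucination:

```python
def passes(score: float, threshold: float, lower_is_better: bool = False) -> bool:
    """Simplified threshold gate: most metrics pass at-or-above the
    threshold; inverse metrics (like hallucination) pass at-or-below."""
    return score <= threshold if lower_is_better else score >= threshold

print(passes(0.82, 0.7))                        # relevancy-style metric -> True
print(passes(0.60, 0.5, lower_is_better=True))  # hallucination-style -> False
```

Raising a threshold makes the suite stricter for normal metrics; for inverse metrics, lowering it is what tightens the bar.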

Full Workflow

Here's everything combined into a single script. This is what a real evaluation run looks like end-to-end:

full_evaluation.py

"""
Full RAG Evaluation Workflow
============================
1. Load business document → build RAG pipeline
2. Pull golden dataset from Confident AI
3. Run RAG for each golden → get actual_output + context
4. Evaluate with built-in + custom metrics (LLM-as-judge)
5. View results in Confident AI dashboard
"""

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    FaithfulnessMetric,
    GEval,
)

# ── Step 1: Build the RAG Pipeline ──────────────────────────

loader = TextLoader("company_policy.txt", encoding="utf-8")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def ask_rag(question: str) -> tuple[str, str]:
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"""Answer based ONLY on the context below.
If unsure, say "I don't have enough information."

Context:
{context}

Question: {question}
Answer:"""
    response = llm.invoke(prompt)
    return response.content, context


# ── Step 2: Pull Goldens from Confident AI ──────────────────

dataset = EvaluationDataset()
dataset.pull(alias="company-policy-goldens")
print(f"Pulled {len(dataset.goldens)} goldens from Confident AI")

# ── Step 3: Run RAG → Build Test Cases ──────────────────────

test_cases = []
for golden in dataset.goldens:
    actual_output, retrieval_context = ask_rag(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            expected_output=golden.expected_output,
            actual_output=actual_output,
            retrieval_context=[retrieval_context],
            context=[retrieval_context],  # HallucinationMetric reads `context`
        )
    )
print(f"Built {len(test_cases)} test cases")

# ── Step 4: Define Metrics ──────────────────────────────────

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    HallucinationMetric(model="gpt-4o", threshold=0.5),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
    GEval(
        name="Completeness",
        criteria="Does the actual output fully address every aspect "
                 "of the question using the retrieval context?",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT,
        ],
        model="gpt-4o",
        threshold=0.7,
    ),
]

# ── Step 5: Evaluate ────────────────────────────────────────

results = evaluate(test_cases=test_cases, metrics=metrics)
print("Done! View results at https://app.confident-ai.com")
What happens after evaluation? Results are automatically uploaded to your Confident AI dashboard, where you can view pass/fail for each test case, see individual metric scores, compare results across runs, and identify which questions your RAG struggles with most.

Key Takeaways

  • Golden datasets define what "correct" looks like — store them in Confident AI
  • 2 + 2 pattern: goldens provide input + expected_output; RAG provides actual_output + context
  • Built-in metrics cover common dimensions: relevancy, hallucination, faithfulness
  • GEval lets you create custom metrics with plain English criteria
  • LLM-as-judge (GPT-4o) evaluates quality — no manual scoring needed
  • Track over time in Confident AI to catch regressions early