AI Evaluation with DeepEval
Learn how to evaluate RAG pipelines and LLM applications using DeepEval, Confident AI, and the LLM-as-judge pattern, from building a RAG pipeline to running custom evaluation metrics.
Introduction
As AI applications become part of real products, testing them is no longer optional. Unlike traditional software where outputs are deterministic, LLM outputs are probabilistic — the same input can produce different outputs. This makes classical assertion-based testing insufficient.
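To see the problem concretely, here is a tiny illustration (the two answers below are made-up examples): a naive exact-match assertion fails as soon as the model rephrases a correct answer, and even crude lexical overlap cannot tell you the answers mean the same thing.

```python
# Two semantically equivalent answers the same prompt might produce.
answer_a = "Full-time employees receive 20 paid vacation days per year."
answer_b = "Full-time staff get 20 days of paid vacation annually."

# A classical assertion fails even though both answers are correct...
print(answer_a == answer_b)  # False

# ...and crude lexical overlap stays well below 1.0, which is why
# semantic evaluation (LLM-as-judge) is needed instead.
tokens_a = set(answer_a.lower().split())
tokens_b = set(answer_b.lower().split())
overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
print(f"{overlap:.2f}")
```

This is exactly the gap LLM-as-judge evaluation fills: a judge model can recognize that both answers state the same fact.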
DeepEval is an open-source evaluation framework for LLM applications. It provides both built-in and custom metrics that use an LLM-as-judge approach — a powerful LLM (like GPT-4o) evaluates the quality of another LLM's responses.
Confident AI is the companion platform where you can store test datasets (called "goldens"), track evaluation results over time, and collaborate with your team.
The Evaluation Flow
| Step | Action | Tool |
|---|---|---|
| 1 | Build a RAG pipeline from a business document | LangChain + OpenAI |
| 2 | Create golden test cases (input + expected output) | DeepEval |
| 3 | Push goldens to Confident AI for storage | Confident AI |
| 4 | Pull goldens back and run the RAG for each one | DeepEval + RAG |
| 5 | Evaluate with metrics (relevancy, hallucination, custom) | DeepEval (LLM-as-judge) |
| 6 | View results in the dashboard | Confident AI |
Building the RAG Pipeline
RAG (Retrieval-Augmented Generation) is a pattern where instead of relying solely on the LLM's training data, you first retrieve relevant information from your own documents and then pass it to the LLM as context.
How RAG Works
The flow is straightforward:
- Load — Read the business document (e.g. company policy, FAQ, knowledge base)
- Chunk — Split the document into smaller pieces (500 characters each)
- Embed — Convert each chunk into a numerical vector using an embedding model
- Store — Save vectors in a vector store (FAISS) for fast similarity search
- Retrieve — When a question comes in, find the 3 most relevant chunks
- Generate — Pass the question + relevant chunks to the LLM to generate an answer
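Before wiring up real models, the Embed → Store → Retrieve steps can be sketched with toy vectors. The 3-dimensional embeddings below are invented purely for illustration (real embeddings have ~1,536 dimensions, and the actual pipeline uses OpenAI embeddings with FAISS), but the mechanics are the same: rank chunks by cosine similarity to the question vector and keep the top k.

```python
import math

# Hypothetical chunk embeddings, invented for illustration.
chunk_vectors = {
    "Refund policy: full refund within 30 days.":      [0.9, 0.1, 0.0],
    "Vacation: 20 paid days for full-time employees.": [0.1, 0.9, 0.1],
    "Remote work: up to 3 days per week.":             [0.0, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the distance measure behind vector search."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend this is the embedded question about vacation days.
question_vec = [0.2, 0.8, 0.1]

# Retrieve: rank all chunks by similarity and keep the top k.
top_k = sorted(
    chunk_vectors,
    key=lambda c: cosine(question_vec, chunk_vectors[c]),
    reverse=True,
)[:2]
print(top_k[0])  # the vacation chunk is the closest match
```

FAISS does the same ranking, just with optimized index structures instead of a full scan.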
First, let's create the business document we'll use as our knowledge base:
```text
# company_policy.txt (example business document)

## Refund Policy
Customers can request a full refund within 30 days of purchase.
After 30 days, a partial refund of 50% is available up to 90 days.
Digital products are non-refundable once downloaded.

## Employee Benefits
Full-time employees receive 20 paid vacation days per year.
Part-time employees receive 10 paid vacation days.
Unused vacation days can be carried over to the next year,
up to a maximum of 5 days.

## Remote Work Policy
Employees may work remotely up to 3 days per week.
A minimum of 2 days per week must be spent in the office.
Remote work requests must be approved by the direct manager.

## Data Security
All employees must use two-factor authentication (2FA) for
company accounts. Sensitive data must be encrypted at rest
and in transit. Security training is mandatory every quarter.
```

Now let's build the RAG pipeline:
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# 1. Load the business document
loader = TextLoader("company_policy.txt", encoding="utf-8")
documents = loader.load()

# 2. Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

# 3. Create embeddings and store in FAISS vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. Create a retriever (returns top 3 most relevant chunks)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 5. Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def ask_rag(question: str) -> tuple[str, str]:
    """Ask a question to the RAG pipeline.

    Returns:
        tuple: (actual_output, retrieval_context)
    """
    # Retrieve relevant chunks
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Generate answer using the LLM
    prompt = f"""Answer the question based ONLY on the following context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""

    response = llm.invoke(prompt)
    return response.content, context
```

We use the cheaper model (gpt-4o-mini) for the RAG pipeline since it only needs to answer based on provided context. The more powerful gpt-4o is reserved for the evaluation step, where it acts as a judge.

DeepEval Setup
Install DeepEval and the required dependencies, then authenticate with Confident AI:
```shell
# Install all required packages
pip install deepeval langchain langchain-openai langchain-community faiss-cpu

# Login to Confident AI (one-time setup)
deepeval login
```

Set your OpenAI API key as an environment variable:
```text
# .env file
OPENAI_API_KEY=sk-your-openai-api-key
```

The deepeval login command will give you a Confident AI API key for storing and retrieving datasets.

Creating the Golden Dataset
A golden dataset is a collection of test cases that define what "correct" behavior looks like. Each golden has two fields:
- input — the question to ask the RAG
- expected_output — the ideal answer the RAG should produce
Goldens are the foundation of your evaluation. You write them by hand (or with domain experts) based on what the document actually says. Then you push them to Confident AI so they're versioned, shareable, and can be pulled into any evaluation run.
```python
from deepeval.dataset import EvaluationDataset, Golden

# Create golden test cases — these define "correct" behavior.
# Each golden has an INPUT (the question) and an EXPECTED OUTPUT
# (the ideal answer the RAG should produce).

goldens = [
    Golden(
        input="What is the refund policy for digital products?",
        expected_output="Digital products are non-refundable once downloaded."
    ),
    Golden(
        input="How many vacation days do full-time employees get?",
        expected_output="Full-time employees receive 20 paid vacation days per year."
    ),
    Golden(
        input="Can unused vacation days be carried over?",
        expected_output="Unused vacation days can be carried over to the next year, up to a maximum of 5 days."
    ),
    Golden(
        input="How many days per week can employees work remotely?",
        expected_output="Employees may work remotely up to 3 days per week."
    ),
    Golden(
        input="What security training is required for employees?",
        expected_output="Security training is mandatory every quarter."
    ),
]

# Push goldens to Confident AI for storage and versioning
dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="company-policy-goldens")
print("Goldens pushed to Confident AI!")
```

Pulling Goldens & Running the RAG
Now comes the key step: we pull the goldens from Confident AI, and for each one, we run the RAG pipeline to get the actual output and the retrieval context. This gives us the 4 fields needed for evaluation:
```python
# What each field in LLMTestCase represents:

LLMTestCase(
    # FROM THE GOLDEN (what we expect):
    input="How many vacation days do full-time employees get?",
    expected_output="Full-time employees receive 20 paid vacation days per year.",

    # FROM THE RAG (what we actually got):
    actual_output="Full-time employees get 20 paid vacation days annually.",
    retrieval_context=["Full-time employees receive 20 paid vacation days per year. "
                       "Part-time employees receive 10 paid vacation days."],
)
```

| Field | Source | Description |
|---|---|---|
| `input` | Golden | The question being asked |
| `expected_output` | Golden | The ideal answer (human-written) |
| `actual_output` | RAG | What the RAG pipeline actually returned |
| `retrieval_context` | RAG | The document chunks the retriever found |
Here's the code to pull goldens and build test cases:
```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# 1. Pull the golden dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="company-policy-goldens")
print(f"Pulled {len(dataset.goldens)} goldens")

# 2. For each golden, run the RAG to get actual_output and context
test_cases = []

for golden in dataset.goldens:
    # The golden gives us: input + expected_output
    # The RAG gives us: actual_output + retrieval_context
    actual_output, retrieval_context = ask_rag(golden.input)

    test_case = LLMTestCase(
        input=golden.input,                      # from golden
        expected_output=golden.expected_output,  # from golden
        actual_output=actual_output,             # from RAG
        retrieval_context=[retrieval_context],   # from RAG
        context=[retrieval_context],             # HallucinationMetric evaluates against `context`
    )
    test_cases.append(test_case)
    print(f"✓ Processed: {golden.input[:50]}...")

print(f"\nBuilt {len(test_cases)} test cases ready for evaluation")
```

Evaluation Metrics
DeepEval uses LLM-as-judge: a powerful LLM (GPT-4o) evaluates the quality of the RAG's responses. Each metric focuses on a different quality dimension:
Built-in Metrics
| Metric | Question It Answers | Uses |
|---|---|---|
| Answer Relevancy | Is the answer relevant to the question asked? | input, actual_output |
| Hallucination | Does the answer contain information NOT in the context? | actual_output, context |
| Faithfulness | Is every claim in the answer supported by the context? | actual_output, retrieval_context |
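How a score turns into pass/fail can be sketched in plain Python. This is an illustration of threshold semantics, not DeepEval's internals: most metrics pass when the score reaches the threshold, while an inverse metric like Hallucination passes when the score stays at or below it.

```python
def passes(score: float, threshold: float, inverse: bool = False) -> bool:
    # Normal metrics: higher is better, pass when score >= threshold.
    # Inverse metrics (e.g. hallucination): lower is better,
    # pass when score <= threshold.
    return score <= threshold if inverse else score >= threshold

print(passes(0.85, 0.7))                # relevancy 0.85 passes a 0.7 threshold
print(passes(0.30, 0.5, inverse=True))  # hallucination 0.30 passes a 0.5 threshold
print(passes(0.60, 0.5, inverse=True))  # hallucination 0.60 fails
```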
Custom Metric with GEval
GEval lets you define your own evaluation criteria in plain English. The LLM judge will score responses based on YOUR custom instructions. This is incredibly powerful for domain-specific evaluation:
- Completeness — does the answer cover all aspects of the question?
- Tone — is the answer professional and appropriate?
- Conciseness — is the answer brief yet informative?
In our example, we create a "Completeness" metric that checks if the RAG's answer fully addresses the question using the retrieved context:
```python
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    FaithfulnessMetric,
    GEval,
)
from deepeval.test_case import LLMTestCaseParams

# --- Built-in Metrics ---

# Does the answer actually address the question?
answer_relevancy = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
)

# Does the answer contain information NOT in the context?
hallucination = HallucinationMetric(
    model="gpt-4o",
    threshold=0.5,
)

# Is the answer faithful to (supported by) the retrieved context?
faithfulness = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.7,
)

# --- Custom Metric with GEval ---

# GEval lets you define your own evaluation criteria.
# The LLM judge will score based on YOUR custom instructions.
completeness = GEval(
    name="Completeness",
    criteria="Determine whether the actual output fully and "
             "accurately addresses every aspect of the input "
             "question using information from the retrieval context. "
             "Penalize if key details are missing or if the answer "
             "is too vague.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    model="gpt-4o",
    threshold=0.7,
)

# --- Run Evaluation ---

results = evaluate(
    test_cases=test_cases,
    metrics=[answer_relevancy, hallucination, faithfulness, completeness],
)

# Results are automatically uploaded to Confident AI dashboard
print("Evaluation complete! Check your Confident AI dashboard for details.")
```

Every metric takes a threshold (0.0 to 1.0). A test case passes if the score meets or exceeds the threshold; start with 0.7 and adjust based on your use case. Hallucination uses a lower threshold (0.5) because it's an inverse metric: lower hallucination is better, so the test passes when the score stays at or below the threshold.

Full Workflow
Here's everything combined into a single script. This is what a real evaluation run looks like end-to-end:
```python
"""
Full RAG Evaluation Workflow
============================
1. Load business document → build RAG pipeline
2. Pull golden dataset from Confident AI
3. Run RAG for each golden → get actual_output + context
4. Evaluate with built-in + custom metrics (LLM-as-judge)
5. View results in Confident AI dashboard
"""

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    FaithfulnessMetric,
    GEval,
)

# ── Step 1: Build the RAG Pipeline ──────────────────────────

loader = TextLoader("company_policy.txt", encoding="utf-8")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def ask_rag(question: str) -> tuple[str, str]:
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"""Answer based ONLY on the context below.
If unsure, say "I don't have enough information."

Context:
{context}

Question: {question}
Answer:"""
    response = llm.invoke(prompt)
    return response.content, context


# ── Step 2: Pull Goldens from Confident AI ──────────────────

dataset = EvaluationDataset()
dataset.pull(alias="company-policy-goldens")
print(f"Pulled {len(dataset.goldens)} goldens from Confident AI")

# ── Step 3: Run RAG → Build Test Cases ──────────────────────

test_cases = []
for golden in dataset.goldens:
    actual_output, retrieval_context = ask_rag(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            expected_output=golden.expected_output,
            actual_output=actual_output,
            retrieval_context=[retrieval_context],
            context=[retrieval_context],  # HallucinationMetric evaluates against `context`
        )
    )
print(f"Built {len(test_cases)} test cases")

# ── Step 4: Define Metrics ──────────────────────────────────

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    HallucinationMetric(model="gpt-4o", threshold=0.5),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
    GEval(
        name="Completeness",
        criteria="Does the actual output fully address every aspect "
                 "of the question using the retrieval context?",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.RETRIEVAL_CONTEXT,
        ],
        model="gpt-4o",
        threshold=0.7,
    ),
]

# ── Step 5: Evaluate ────────────────────────────────────────

results = evaluate(test_cases=test_cases, metrics=metrics)
print("Done! View results at https://app.confident-ai.com")
```

Key Takeaways
- Golden datasets define what "correct" looks like — store them in Confident AI
- 2 + 2 pattern: goldens provide input + expected_output; RAG provides actual_output + context
- Built-in metrics cover common dimensions: relevancy, hallucination, faithfulness
- GEval lets you create custom metrics with plain English criteria
- LLM-as-judge (GPT-4o) evaluates quality — no manual scoring needed
- Track over time in Confident AI to catch regressions early
Recommended Courses
Go deeper into AI testing with this hand-picked course.
AI Testing: DeepEval, RAGAS & Ollama
Master testing and evaluating AI applications and LLMs. Covers DeepEval, RAGAS, Confident AI, local LLMs with Ollama, RAG testing, AI agent evaluation, and hands-on projects with real-world scenarios.
View Course on Udemy