OpenAI Evals
Evals provide a framework for evaluating large language models (LLMs) or systems built on top of LLMs. The framework offers an existing registry of evals to test different dimensions of OpenAI models, plus the ability to write custom evals for the use cases you care about. You can also use your own data to build private evals that represent the common LLM patterns in your workflow without exposing any of that data publicly.
If you are building with LLMs, creating high-quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time-intensive to understand how different model versions might affect your use case.
Installation
There are two ways to install 'evals':
- Cloning the 'evals' repository and installing from the source code
- Using pip install
In theory, option 2 should be sufficient to use the evals framework and run your evaluations, but I recommend option 1 for several reasons:
- The version in the Python Package Index is out of date; as of Feb 2024 it is not compatible with the latest OpenAI Python library. This might still be the case when you try.
- Having the source code makes it convenient to browse the code, check the examples, read the documents, and inspect the JSONL data files.
- 'evals' is open to contributions. You may want to contribute as a next step.
Option 1: Use the source code
The evals registry is stored using Git-LFS, so install LFS first.
brew install git-lfs
git lfs install
Go to the parent directory (assuming this notebook is in openai-cheat-sheet/) and clone the evals repository.
cd ..
git clone https://github.com/openai/evals.git
Fetch all the files (from within your local copy of the evals repo). This will populate all the pointer files under evals/registry/data.
cd ../evals
git lfs fetch --all
git lfs pull
Install evals from the source code along with its dependencies.
cd ../evals
pip install -e .
Option 2: Install from the Python Package Index
pip install evals
Scenario: Evaluating retrieval with different options
Let's assume that we have a chat completion that uses a CSV file for retrieval. The file has only two columns: text and embedding. For each user query, the completion function finds the top k embeddings in the CSV that are closest to the query, adds the corresponding 'text' of those embeddings to the system message to enrich the context, and the completion replies accordingly.
This is already implemented in the evals.completion_fns.retrieval:RetrievalCompletionFn class in evals, but in my tests it was not working correctly, so here we will create our own custom completion function and register it.
Our aim is to try out three different options with the same sample data set and compare their accuracy scores:
- Using gpt-4o-mini directly without feeding any extra context with retrieval.
- Using gpt-4o-mini with retrieval.
- Using gpt-4o-mini with retrieval and instructing it to use chain-of-thought logic.
For this, we will use a file with the birthdays of all presidents. Our completion function will retrieve the data from this file. Then we will test it with some sample user questions like "Was Franklin Pierce born before Abraham Lincoln? Answer Y or N."
Step 1: Set up the retrieval data
As the source for our retrieval data, we will use president_birthdays.csv:
- We generate the 'text' column from the data in the file.
- We get the 'embedding' for each row from its text column.
- We write the 'output/presidents_embeddings.csv' file with only the 'text' and 'embedding' columns. This file will be the input for the retrieval completion function.
import pandas as pd

input_datapath = "data/president_birthdays.csv"

# The CSV headers carry leading spaces and quote characters; normalize them.
df = (
    pd.read_csv(input_datapath)
    .rename(columns={' "Name"': "Name", ' "Month"': "Month", ' "Day"': "Day", ' "Year"': "Year"})
    .set_index("Index")
)
df["text"] = df.apply(lambda r: f"{r['Name']} was born on {r['Month']}/{r['Day']}/{r['Year']}", axis=1)
display(df.head())
| Index | Name | Day | Month | Year | text |
|---|---|---|---|---|---|
| 1 | "George Washington" | 22 | 2 | 1732.0 | "George Washington" was born on 2/22/1732.0 |
| 2 | "John Adams" | 30 | 10 | 1735.0 | "John Adams" was born on 10/30/1735.0 |
| 3 | "Thomas Jefferson" | 13 | 4 | 1743.0 | "Thomas Jefferson" was born on 4/13/1743.0 |
| 4 | "James Madison" | 16 | 3 | 1751.0 | "James Madison" was born on 3/16/1751.0 |
| 5 | "James Monroe" | 28 | 4 | 1758.0 | "James Monroe" was born on 4/28/1758.0 |
from openai import OpenAI

client = OpenAI()

def embed(text):
    # Request a single embedding vector for the given text.
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

df["embedding"] = df["text"].apply(embed)
df[["text", "embedding"]].to_csv("output/presidents_embeddings.csv")
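One detail worth knowing: when the DataFrame is written to CSV, each embedding list is serialized as a plain string, so whatever reads the file back must parse that string into a list again. A minimal round-trip sketch with a toy 3-dimensional vector (real embeddings from text-embedding-3-small have 1536 dimensions):

```python
import ast
import io

import pandas as pd

# Toy stand-in for a real embedding vector.
df = pd.DataFrame({
    "text": ["George Washington was born on 2/22/1732"],
    "embedding": [[0.1, 0.2, 0.3]],
})

# Round-trip through CSV: the list is serialized to the string "[0.1, 0.2, 0.3]".
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

loaded = pd.read_csv(buf)
assert isinstance(loaded["embedding"][0], str)  # comes back as a string
loaded["embedding"] = loaded["embedding"].apply(ast.literal_eval)  # parse back to a list
print(loaded["embedding"][0])
```

Any retrieval code consuming presidents_embeddings.csv has to perform an equivalent parsing step before computing similarities.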
Step 2: Implement your completion function
You can check the MLRetrievalCompletionFn class. It accepts three arguments:
- completion_fn: we can simply pass a model name, e.g. 'gpt-4-turbo', or pass another completion function. Completion functions can be chained, with one taking another as input.
- embeddings_and_text_path: a CSV file with 'text' and 'embedding' columns, used for retrieval.
- k: the top k closest embeddings are passed to the prompt. In our case k is 2, because the user always asks questions like "Was Andrew Jackson born before William Harrison?", so we only ever need two presidents' data.
If you want to read more about completion functions, you can read this document.
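The core retrieval step can be sketched independently of the evals machinery. Below is a rough illustration of "find the k stored texts closest to the query embedding" using cosine similarity; cosine_similarity and top_k_texts are hypothetical helpers written for this notebook, not part of evals or MLRetrievalCompletionFn:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_texts(rows, query_embedding, k=2):
    """Return the texts of the k rows whose embeddings are closest to the query.

    `rows` is a list of (text, embedding) pairs, as could be loaded from
    presidents_embeddings.csv.
    """
    scored = sorted(rows, key=lambda r: cosine_similarity(r[1], query_embedding), reverse=True)
    return [text for text, _ in scored[:k]]

# Toy 2-D embeddings standing in for real 1536-D ones.
rows = [
    ("Lincoln fact", [1.0, 0.0]),
    ("Pierce fact", [0.9, 0.1]),
    ("Monroe fact", [0.0, 1.0]),
]
print(top_k_texts(rows, [1.0, 0.05], k=2))  # → ['Lincoln fact', 'Pierce fact']
```

In the actual completion function, the retrieved texts are appended to the system message before the underlying completion function is called.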
Step 3: Register your completion function
To register our completion functions, we need to write a YAML file. You can check presidents.yaml.
Here we define three completion functions:
- cot/gpt-4o-mini: not used directly, but used as an input to the third one. It is based on the ChainOfThoughtCompletionFn class and adds chain-of-thought logic on top of gpt-4o-mini.
- retrieval/presidents/gpt-4o-mini: uses 'gpt-4o-mini' with retrieval.
- retrieval/presidents/cot/gpt-4o-mini: uses 'gpt-4o-mini' with chain-of-thought logic and retrieval.
# Open the file in read mode ('r')
with open('evals_registry/completion_fns/presidents.yaml', 'r') as file:
# Read the file's content
file_content = file.read()
# Print the content
print(file_content)
cot/gpt-4o-mini:
  class: evals.completion_fns.cot:ChainOfThoughtCompletionFn
  args:
    cot_completion_fn: gpt-4o-mini
retrieval/presidents/gpt-4o-mini:
  class: evals_registry.completion_fns.retrieval:MLRetrievalCompletionFn
  args:
    completion_fn: gpt-4o-mini
    embeddings_and_text_path: output/presidents_embeddings.csv
    k: 2
retrieval/presidents/cot/gpt-4o-mini:
  class: evals_registry.completion_fns.retrieval:MLRetrievalCompletionFn
  args:
    completion_fn: cot/gpt-4o-mini
    embeddings_and_text_path: output/presidents_embeddings.csv
Step 4: Build your eval
An eval is simply a dataset plus a choice of eval class. You have three options for building an eval:
- Use one of the basic eval classes. The most common ones are 'Match', 'Includes' and 'FuzzyMatch'. For a full list, you can check here.
- Use a model-graded eval class, i.e. use an LLM to evaluate the outputs of another LLM.
- Create your own custom eval class. In most cases this is not necessary, so it is not covered in this notebook; you can check here if you want to learn more.
Register the eval by adding a file at /evals/<eval_name>.yaml under the registry folder (in our case 'evals_registry'), using the elsuite registry format. For example, for a Match eval, it would be:
<eval_name>:
id: <eval_name>.dev.v0
description: <description>
metrics: [accuracy]
<eval_name>.dev.v0:
class: evals.elsuite.basic.match:Match
args:
samples_jsonl: <eval_name>/samples.jsonl
Upon running the eval, the data is looked up in the 'data' folder under the registry. For example, if older/samples.jsonl is the provided file, the data is expected to be at evals_registry/data/older/samples.jsonl.
The naming convention for evals is <eval_name>.<split>.<version>:
- eval_name is the eval name, used to group evals whose scores are comparable.
- split is the data split, used to further group evals under the same base eval. E.g., "val", "test", or "dev" for testing.
- version is the version of the eval, which can be any descriptive text you'd like to use (though it's best if it does not contain '.').
In general, running the same eval name against the same model should always give similar results, so that others can reproduce it. Therefore, whenever you change your eval, you should bump the version.
In our sample scenario, we will go with the first option and use 'Match'. You can check older.yaml.
# Open the file in read mode ('r')
with open('evals_registry/evals/older.yaml', 'r') as file:
# Read the file's content
file_content = file.read()
# Print the content
print(file_content)
older:
  id: older.dev.v0
  description: Test the model's ability to determine who is older.
  metrics: [accuracy]
older.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: older/samples.jsonl
Step 5: Set up your sample data for evaluation
You will need to convert your samples into the JSON Lines (JSONL) format: a file with one JSON object per line.
You can use the openai CLI (shipped with the OpenAI Python package) to transform data from some common file types into JSONL:
openai tools fine_tunes.prepare_data -f data[.csv, .json, .txt, .xlsx or .tsv]
You can find some examples of JSONL eval files here
Each JSON object will represent one data point in your eval. The keys you need in the JSON object depend on the eval template. All templates expect an "input" key, which is the prompt.
For the basic evals Match, Includes, and FuzzyMatch, the other required key is "ideal", which is a string (or a list of strings) specifying the correct reference answer(s).
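If your samples live in a spreadsheet or a Python list, a few lines of standard-library code are enough to produce a valid eval file. A minimal sketch (the filename samples_demo.jsonl and the question are illustrative):

```python
import json

# One data point per dict: a chat-style "input" plus the "ideal" reference answer.
samples = [
    {
        "input": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Was Abraham Lincoln born before Franklin Pierce? Answer Y or N."},
        ],
        "ideal": "N",
    },
]

# JSONL: one JSON object serialized per line.
with open("samples_demo.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Read the file back to confirm the format round-trips.
with open("samples_demo.jsonl") as f:
    first = json.loads(f.readline())
print(first["ideal"])  # → N
```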
In our sample scenario, we will use samples.jsonl. Check its content to get a feel for the format.
# Open the file in read mode ('r')
with open('evals_registry/data/older/samples.jsonl', 'r') as file:
# Loop over the first 3 lines and print each
for i, line in enumerate(file):
if i < 3:
print(line, end='') # Use end='' to avoid adding extra newlines
else:
break
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Was Abraham Lincoln born before Franklin Pierce? Answer Y or N."}], "ideal": "N"}
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Was Abraham Lincoln born before Andrew Johnson? Answer Y or N."}], "ideal": "N"}
{"input": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Was Andrew Jackson born before John Quincy Adams? Answer Y or N."}], "ideal": "Y"}
Step 6: Run the eval
After installing the evals tool, we can run an evaluation from the command line by specifying a completion function and an evaluation task:
- completion_fn: if you are using an OpenAI model directly, simply put its name. If you have your own custom completion function (as in presidents.yaml), put its name instead.
- eval_task: refers to a YAML file in the registry directory. The file defines the parameters of a specific evaluation task, e.g. the evaluation data, evaluation metrics, and prompting strategies.
oaieval <completion_fn> <eval_task>
Each run creates a JSONL log file under the tmp folder, such as '/tmp/evallogs/2402140749577WDOLKUD_gpt-4o-mini_older.jsonl'. You can check the corresponding file to see how the run went.
For more information on running evals, you can read this.
NOTE: The default registry path for evals is 'evals/registry'; here we use the custom path './evals_registry', so we need to pass the --registry_path argument to point at our registry folder.
NOTE: For evals to resolve the evals_registry.completion_fns.retrieval:MLRetrievalCompletionFn class, we need to add the path of the evals_registry folder to the PYTHONPATH.
# Use gpt-4o-mini directly without any retrieval or extra prompt -> Accuracy: 0.8
!export PYTHONPATH=".:$PYTHONPATH"; oaieval gpt-4o-mini older --max_samples 10 --registry_path ./evals_registry
[2024-08-10 19:53:33,489] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/evals
[2024-08-10 19:53:34,152] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/evals
[2024-08-10 19:53:34,152] [registry.py:272] Loading registry from evals_registry/evals
[2024-08-10 19:53:34,155] [oaieval.py:215] Run started: 2408101653344UTPQZFY
[2024-08-10 19:53:34,252] [data.py:94] Fetching evals_registry/data/older/samples.jsonl
[2024-08-10 19:53:34,254] [eval.py:36] Evaluating 10 samples
[2024-08-10 19:53:34,267] [eval.py:144] Running in threaded mode with 10 threads!
100%|███████████████████████████████████████████| 10/10 [00:01<00:00, 5.75it/s]
[2024-08-10 19:53:36,024] [oaieval.py:275] Found 10/10 sampling events with usage data
[2024-08-10 19:53:36,024] [oaieval.py:283] Token usage from 10 sampling events:
completion_tokens: 10
prompt_tokens: 316
total_tokens: 326
[2024-08-10 19:53:36,025] [record.py:371] Final report: {'accuracy': 0.8, 'boostrap_std': 0.1336369709324482, 'usage_completion_tokens': 10, 'usage_prompt_tokens': 316, 'usage_total_tokens': 326}. Logged to /tmp/evallogs/2408101653344UTPQZFY_gpt-4o-mini_older.jsonl
[2024-08-10 19:53:36,025] [oaieval.py:233] Final report:
[2024-08-10 19:53:36,025] [oaieval.py:235] accuracy: 0.8
[2024-08-10 19:53:36,025] [oaieval.py:235] boostrap_std: 0.1336369709324482
[2024-08-10 19:53:36,025] [oaieval.py:235] usage_completion_tokens: 10
[2024-08-10 19:53:36,025] [oaieval.py:235] usage_prompt_tokens: 316
[2024-08-10 19:53:36,025] [oaieval.py:235] usage_total_tokens: 326
[2024-08-10 19:53:36,031] [record.py:360] Logged 20 rows of events to /tmp/evallogs/2408101653344UTPQZFY_gpt-4o-mini_older.jsonl: insert_time=4.472ms
# Use gpt-4o-mini with retrieval -> Accuracy: 1.0
!export PYTHONPATH=".:$PYTHONPATH"; oaieval retrieval/presidents/gpt-4o-mini older --max_samples 10 --registry_path ./evals_registry
[2024-08-10 19:53:55,250] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/evals
[2024-08-10 19:53:55,798] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/evals
[2024-08-10 19:53:55,798] [registry.py:272] Loading registry from evals_registry/evals
[2024-08-10 19:53:56,312] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/completion_fns
[2024-08-10 19:53:56,320] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/completion_fns
[2024-08-10 19:53:56,320] [registry.py:272] Loading registry from evals_registry/completion_fns
[2024-08-10 19:53:56,322] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/solvers
[2024-08-10 19:53:56,500] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/solvers
[2024-08-10 19:53:56,500] [registry.py:272] Loading registry from evals_registry/solvers
[2024-08-10 19:53:56,823] [utils.py:161] NumExpr defaulting to 12 threads.
[2024-08-10 19:53:57,493] [oaieval.py:215] Run started: 240810165357I36QF4LQ
[2024-08-10 19:53:57,494] [data.py:94] Fetching evals_registry/data/older/samples.jsonl
[2024-08-10 19:53:57,495] [eval.py:36] Evaluating 10 samples
[2024-08-10 19:53:57,508] [eval.py:144] Running in threaded mode with 10 threads!
100%|███████████████████████████████████████████| 10/10 [00:01<00:00, 6.34it/s]
[2024-08-10 19:53:59,105] [oaieval.py:275] Found 10/20 sampling events with usage data
[2024-08-10 19:53:59,105] [oaieval.py:283] Token usage from 10 sampling events:
completion_tokens: 10
prompt_tokens: 828
total_tokens: 838
[2024-08-10 19:53:59,106] [record.py:371] Final report: {'accuracy': 1.0, 'boostrap_std': 0.0, 'usage_completion_tokens': 10, 'usage_prompt_tokens': 828, 'usage_total_tokens': 838}. Logged to /tmp/evallogs/240810165357I36QF4LQ_retrieval/presidents/gpt-4o-mini_older.jsonl
[2024-08-10 19:53:59,106] [oaieval.py:233] Final report:
[2024-08-10 19:53:59,106] [oaieval.py:235] accuracy: 1.0
[2024-08-10 19:53:59,106] [oaieval.py:235] boostrap_std: 0.0
[2024-08-10 19:53:59,106] [oaieval.py:235] usage_completion_tokens: 10
[2024-08-10 19:53:59,106] [oaieval.py:235] usage_prompt_tokens: 828
[2024-08-10 19:53:59,106] [oaieval.py:235] usage_total_tokens: 838
[2024-08-10 19:53:59,114] [record.py:360] Logged 30 rows of events to /tmp/evallogs/240810165357I36QF4LQ_retrieval/presidents/gpt-4o-mini_older.jsonl: insert_time=5.668ms
# Use gpt-4o-mini with retrieval and chain-of-thought -> Accuracy: 1.0
!export PYTHONPATH=".:$PYTHONPATH"; oaieval retrieval/presidents/cot/gpt-4o-mini older --max_samples 10 --registry_path ./evals_registry
[2024-08-10 19:54:18,209] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/evals
[2024-08-10 19:54:18,790] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/evals
[2024-08-10 19:54:18,790] [registry.py:272] Loading registry from evals_registry/evals
[2024-08-10 19:54:19,351] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/completion_fns
[2024-08-10 19:54:19,356] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/completion_fns
[2024-08-10 19:54:19,356] [registry.py:272] Loading registry from evals_registry/completion_fns
[2024-08-10 19:54:19,357] [registry.py:272] Loading registry from /Users/meltemseyhan/M9/MLTeam/mlteam-openai-training/evals/evals/registry/solvers
[2024-08-10 19:54:19,516] [registry.py:272] Loading registry from /Users/meltemseyhan/.evals/solvers
[2024-08-10 19:54:19,516] [registry.py:272] Loading registry from evals_registry/solvers
[2024-08-10 19:54:19,876] [utils.py:161] NumExpr defaulting to 12 threads.
[2024-08-10 19:54:20,496] [oaieval.py:215] Run started: 240810165420X2QELTYH
[2024-08-10 19:54:20,497] [data.py:94] Fetching evals_registry/data/older/samples.jsonl
[2024-08-10 19:54:20,498] [eval.py:36] Evaluating 10 samples
[2024-08-10 19:54:20,513] [eval.py:144] Running in threaded mode with 10 threads!
100%|███████████████████████████████████████████| 10/10 [00:04<00:00, 2.33it/s]
[2024-08-10 19:54:24,816] [oaieval.py:275] Found 20/50 sampling events with usage data
[2024-08-10 19:54:24,816] [oaieval.py:283] Token usage from 20 sampling events:
completion_tokens: 1,063
prompt_tokens: 4,303
total_tokens: 5,366
[2024-08-10 19:54:24,817] [record.py:371] Final report: {'accuracy': 1.0, 'boostrap_std': 0.0, 'usage_completion_tokens': 1063, 'usage_prompt_tokens': 4303, 'usage_total_tokens': 5366}. Logged to /tmp/evallogs/240810165420X2QELTYH_retrieval/presidents/cot/gpt-4o-mini_older.jsonl
[2024-08-10 19:54:24,817] [oaieval.py:233] Final report:
[2024-08-10 19:54:24,817] [oaieval.py:235] accuracy: 1.0
[2024-08-10 19:54:24,817] [oaieval.py:235] boostrap_std: 0.0
[2024-08-10 19:54:24,817] [oaieval.py:235] usage_completion_tokens: 1063
[2024-08-10 19:54:24,817] [oaieval.py:235] usage_prompt_tokens: 4303
[2024-08-10 19:54:24,817] [oaieval.py:235] usage_total_tokens: 5366
[2024-08-10 19:54:24,830] [record.py:360] Logged 60 rows of events to /tmp/evallogs/240810165420X2QELTYH_retrieval/presidents/cot/gpt-4o-mini_older.jsonl: insert_time=10.769ms