Evaluation Quick Start

This guide walks you through the basics of evaluating your AI system with LangSmith so you can deploy the best-performing model for your needs.

1. Install LangSmith

pip install -U langsmith

2. Evaluate

Evaluation requires a system to test, data to serve as test cases, and, optionally, evaluators to grade the results.

from langsmith import Client
from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate
import openai
from langsmith.wrappers import wrap_openai

client = Client()

# Define dataset: these are your test cases
dataset_name = "Rap Battle Dataset"
dataset = client.create_dataset(dataset_name, description="Rap battle prompts.")
client.create_examples(
    inputs=[
        {"question": "a rap battle between Atticus Finch and Cicero"},
        {"question": "a rap battle between Barbie and Oppenheimer"},
    ],
    outputs=[
        {"must_mention": ["lawyer", "justice"]},
        {"must_mention": ["plastic", "nuclear"]},
    ],
    dataset_id=dataset.id,
)

# Define AI system
openai_client = wrap_openai(openai.Client())

def predict(inputs: dict) -> dict:
    messages = [{"role": "user", "content": inputs["question"]}]
    response = openai_client.chat.completions.create(messages=messages, model="gpt-3.5-turbo")
    # Return the generated text as a string so evaluators can inspect it directly
    return {"output": response.choices[0].message.content}

# Define evaluators
def must_mention(run: Run, example: Example) -> dict:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return {"key": "must_mention", "score": score}

experiment_results = evaluate(
    predict,  # Your AI system
    data=dataset_name,  # The data to predict and grade over
    evaluators=[must_mention],  # The evaluators to score the results
    experiment_prefix="rap-generator",  # A prefix for your experiment names to easily identify them
    metadata={
        "version": "1.0.0",
    },
)

Configure your API key, then run the script to evaluate your system.

export LANGCHAIN_API_KEY=<your api key>
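If you are working in a notebook or can't export environment variables, you can set the key in-process instead. A minimal sketch, equivalent to the export above; run it before constructing the Client:

import os

# Set the LangSmith API key for this process only; do this before Client() is created.
os.environ["LANGCHAIN_API_KEY"] = "<your api key>"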

3. Review Results

The evaluation results will be streamed to a new experiment linked to your "Rap Battle Dataset". You can view the results by clicking on the link printed by the evaluate function or by navigating to the Datasets & Testing page, clicking "Rap Battle Dataset", and viewing the latest test run.

There, you can inspect the traces and feedback generated from the evaluation configuration.

Eval test run screenshot

You can click "Open Run" to view the trace and feedback generated for that example.

Eval trace screenshot
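You can also work with the results programmatically. A minimal sketch, assuming a recent langsmith SDK in which the object returned by evaluate is iterable and each item bundles the run, the example, and the evaluator feedback; the exact item keys may differ between SDK versions:

# Print each example's inputs alongside the evaluator scores.
for result in experiment_results:
    run = result["run"]
    scores = result["evaluation_results"]["results"]
    print(run.inputs, [(s.key, s.score) for s in scores])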

To compare to another test on this dataset, you can click "Compare Tests".

Compare Tests

More on evaluation

Congratulations! You've now created a dataset and used it to evaluate your agent or LLM. To learn how to make your own custom evaluators, review the Custom Evaluator guide. To learn more about some pre-built evaluators available in the LangChain open-source library, check out the LangChain Evaluators guide.
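As a preview, a custom evaluator is simply a function that accepts the Run and Example and returns a dictionary with a key and a score, just like must_mention above. A minimal sketch of a second evaluator; the character cap is an arbitrary value chosen for illustration:

from langsmith.schemas import Run, Example

def is_concise(run: Run, example: Example) -> dict:
    # Score 1 if the generated text stays under an arbitrary character cap.
    prediction = run.outputs.get("output") or ""
    return {"key": "is_concise", "score": int(len(prediction) < 1500)}

You could pass it to evaluate alongside the existing evaluator, e.g. evaluators=[must_mention, is_concise].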

