aevaluate_existing

async langsmith.evaluation._arunner.aevaluate_existing(experiment: str | UUID | TracerSession, /, evaluators: Sequence[RunEvaluator | Callable[[Run, Example | None], EvaluationResult | EvaluationResults] | Callable[[...], dict | EvaluationResults | EvaluationResult] | Callable[[Run, Example | None], Awaitable[EvaluationResult | EvaluationResults]]] | None = None, summary_evaluators: Sequence[Callable[[Sequence[Run], Sequence[Example]], EvaluationResult | EvaluationResults] | Callable[[List[Run], List[Example]], EvaluationResult | EvaluationResults]] | None = None, metadata: dict | None = None, max_concurrency: int | None = 0, client: Client | None = None, load_nested: bool = False, blocking: bool = True) → AsyncExperimentResults

Evaluate existing experiment runs asynchronously.

Parameters:
  • experiment (Union[str, uuid.UUID, TracerSession]) – The experiment to evaluate. Can be the experiment name, the experiment ID, or the TracerSession object itself.

  • evaluators (Optional[Sequence[EVALUATOR_T]]) – Optional sequence of evaluators to use for individual run evaluation.

  • summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]) – Optional sequence of evaluators to apply over the entire dataset.

  • metadata (Optional[dict]) – Optional metadata to include in the evaluation results.

  • max_concurrency (int | None) – The maximum number of concurrent evaluations to run. If None, no limit is set; if 0, evaluations run sequentially (no concurrency). Defaults to 0. A call that tunes this is sketched after this list.

  • client (Optional[langsmith.Client]) – Optional LangSmith client to use for evaluation.

  • load_nested (bool) – Whether to load all child runs for the experiment. Default is to only load the top-level root runs.

  • blocking (bool) – Whether to block until evaluation is complete.
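
For example, a call exercising several of these options might look like the following (a minimal sketch, not from the original docstring; the experiment name is a placeholder and accuracy is an evaluator defined as in the Examples below):

>>> import asyncio
>>> from langsmith import Client, aevaluate_existing
>>> client = Client()  # assumes LANGSMITH_API_KEY is set in the environment
>>> results = asyncio.run(
...     aevaluate_existing(
...         "My Experiment:64e6e91",  # placeholder experiment name
...         evaluators=[accuracy],
...         max_concurrency=None,  # no concurrency limit
...         load_nested=True,  # also load child runs, not just the root runs
...         client=client,
...     )
... )  # doctest: +SKIP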

Returns:

An async iterator over the experiment results.

Return type:

AsyncIterator[ExperimentResultRow]
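
Because the returned object is an async iterator, the rows can be consumed with async for. A minimal sketch (assuming results was returned by aevaluate_existing; each row is a dict-like ExperimentResultRow):

>>> async def inspect(results) -> None:
...     async for row in results:
...         # Each row holds the evaluated run, its example, and the feedback produced.
...         print(row["run"].id, row["evaluation_results"])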

Examples

Define your evaluators

>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # yes and no are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}

Load the experiment and run the evaluation.

>>> import asyncio
>>> from langsmith import aevaluate, aevaluate_existing
>>> dataset_name = "Evaluate Examples"
>>> async def apredict(inputs: dict) -> dict:
...     # This can be any async function or just an API call to your app.
...     await asyncio.sleep(0.1)
...     return {"output": "Yes"}
>>> # First run inference on the dataset
... results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...     )
... )  # doctest: +ELLIPSIS
View the evaluation results for experiment:...

Then evaluate the results.

>>> experiment_name = "My Experiment:64e6e91"  # Or manually specify
>>> results = asyncio.run(
...     aevaluate_existing(
...         experiment_name,
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...     )
... )  # doctest: +ELLIPSIS
View the evaluation results for experiment:...