aevaluate_existing#
- async langsmith.evaluation._arunner.aevaluate_existing(experiment: str | UUID | TracerSession, /, evaluators: Sequence[RunEvaluator | Callable[[Run, Example | None], EvaluationResult | EvaluationResults] | Callable[[...], dict | EvaluationResults | EvaluationResult] | Callable[[Run, Example | None], Awaitable[EvaluationResult | EvaluationResults]]] | None = None, summary_evaluators: Sequence[Callable[[Sequence[Run], Sequence[Example]], EvaluationResult | EvaluationResults] | Callable[[List[Run], List[Example]], EvaluationResult | EvaluationResults]] | None = None, metadata: dict | None = None, max_concurrency: int | None = 0, client: Client | None = None, load_nested: bool = False, blocking: bool = True) → AsyncExperimentResults [source]#
Evaluate existing experiment runs asynchronously.
- Parameters:
experiment (Union[str, uuid.UUID, TracerSession]) – The experiment to evaluate: its name, its ID, or the TracerSession object itself.
evaluators (Optional[Sequence[EVALUATOR_T]]) – Optional sequence of evaluators to use for individual run evaluation.
summary_evaluators (Optional[Sequence[SUMMARY_EVALUATOR_T]]) – Optional sequence of evaluators to apply over the entire dataset.
metadata (Optional[dict]) – Optional metadata to include in the evaluation results.
max_concurrency (int | None) – The maximum number of concurrent evaluations to run. If None, there is no limit; if 0, evaluations run sequentially (no concurrency). Defaults to 0 (see the usage sketch below).
client (Optional[langsmith.Client]) – Optional LangSmith client to use for evaluation.
load_nested (bool) – Whether to load all child runs for the experiment. Default is to only load the top-level root runs.
blocking (bool) – Whether to block until evaluation is complete.
- Returns:
An async iterator over the experiment results.
- Return type:
AsyncIterator[ExperimentResultRow]
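For illustration only (this snippet is not part of the upstream docstring), the sketch below evaluates an existing experiment without blocking and streams each ExperimentResultRow as it completes. The experiment name and the exact_match evaluator are placeholders.

import asyncio

from langsmith import aevaluate_existing
from langsmith.schemas import Example, Run


def exact_match(run: Run, example: Example) -> dict:
    # Placeholder row-level evaluator: compare the run output to the reference answer.
    return {"score": run.outputs["output"] == example.outputs["answer"]}


async def main() -> None:
    results = await aevaluate_existing(
        "My Experiment:64e6e91",   # placeholder experiment name
        evaluators=[exact_match],
        max_concurrency=4,         # cap concurrent evaluator calls
        blocking=False,            # return before every evaluation has finished
    )
    # AsyncExperimentResults yields ExperimentResultRow dicts with
    # "run", "example", and "evaluation_results" keys.
    async for row in results:
        print(row["run"].id, row["evaluation_results"])


asyncio.run(main())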
Examples
Define your evaluators.
>>> from typing import Sequence
>>> from langsmith.schemas import Example, Run
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # "yes" and "no" are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
Load the experiment and run the evaluation.
>>> import asyncio
>>> from langsmith import aevaluate, aevaluate_existing
>>> dataset_name = "Evaluate Examples"
>>> async def apredict(inputs: dict) -> dict:
...     # This can be any async function or just an API call to your app.
...     await asyncio.sleep(0.1)
...     return {"output": "Yes"}
>>> # First run inference on the dataset
... results = asyncio.run(
...     aevaluate(
...         apredict,
...         data=dataset_name,
...     )
... )
View the evaluation results for experiment:...
Then evaluate the results.
>>> experiment_name = "My Experiment:64e6e91"  # Or manually specify
>>> results = asyncio.run(
...     aevaluate_existing(
...         experiment_name,
...         evaluators=[accuracy],
...         summary_evaluators=[precision],
...     )
... )  # doctest: +ELLIPSIS
View the evaluation results for experiment:...
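The signature also accepts awaitable evaluators (Callable[[Run, Example | None], Awaitable[EvaluationResult | EvaluationResults]]), so a row-level evaluator can itself be a coroutine. A minimal sketch, with judged_correct as a hypothetical async evaluator and the experiment name as a placeholder:

import asyncio

from langsmith import aevaluate_existing
from langsmith.schemas import Example, Run


async def judged_correct(run: Run, example: Example) -> dict:
    # Hypothetical async evaluator: any awaitable work (an LLM-as-judge call,
    # an HTTP request, etc.) can run here before a score is returned.
    await asyncio.sleep(0)  # stand-in for real async work
    correct = run.outputs["output"].lower() == example.outputs["answer"].lower()
    return {"key": "judged_correct", "score": correct}


results = asyncio.run(
    aevaluate_existing(
        "My Experiment:64e6e91",  # placeholder experiment name
        evaluators=[judged_correct],
    )
)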