> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wandb.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluations overview

> Evaluation-driven LLM application development to systematically improve applications

*Evaluation-driven LLM application development* helps you systematically improve LLM applications by measuring their behavior against consistent, curated examples.

<Tabs>
  <Tab title="Python">
    In Weave, the core of the workflow is the *`Evaluation` object*, which defines:

    * A [`Dataset`](../core-types/datasets) or list of dictionaries for test examples.
    * One or more [scoring functions](../evaluation/scorers).
    * Optional configuration like [input preprocessing](#format-dataset-rows-before-evaluating).

    Once you've defined the `Evaluation`, you can run it against a [`Model`](../core-types/models) object or any custom function containing LLM application logic. Each call to `.evaluate()` triggers an *evaluation run*. Think of the `Evaluation` object as a blueprint, and each run as a measurement of how your application performs under that setup.
  </Tab>

  <Tab title="TypeScript">
    In Weave, the core of the workflow is the *`Evaluation` object*, which defines:

    * A [`Dataset`](../core-types/datasets) or array of objects for test examples.
    * One or more [scoring functions](../evaluation/scorers).

    Once you've defined the `Evaluation`, you can run it against any function wrapped with `weave.op`. Each call to `.evaluate()` triggers an *evaluation run*. Think of the `Evaluation` object as a blueprint, and each run as a measurement of how your application performs under that setup.

    <Note>
      The TypeScript SDK uses function-based models and scorers. Class-based `Model` and `Scorer` types are not yet available in TypeScript.
    </Note>
  </Tab>
</Tabs>

To get started with evaluations, complete the following steps:

1. [Create an `Evaluation` object](#1-create-an-evaluation-object)
2. [Define a dataset of examples](#2-define-a-dataset-of-test-examples)
3. [Define scoring functions](#3-define-scoring-functions)
4. [Define a model or function to evaluate](#4-define-a-model-or-function-to-evaluate)
5. [Run the evaluation](#5-run-the-evaluation)

For a complete evaluation code sample, see [Full evaluation code example](#full-evaluation-code-example). You can also learn more about [advanced evaluation features](#advanced-evaluation-usage) like [Saved views](#saved-views) and [Imperative evaluations](#imperative-evaluations-evaluationlogger).

## 1. Create an `Evaluation` object

Creating an `Evaluation` object is the first step in setting up your evaluation configuration. An `Evaluation` consists of example data, scoring logic, and optional preprocessing. You’ll later use it to run one or more evaluations.

Weave will take each example, pass it through your application and score the output on multiple custom scoring functions. By doing this, you'll have a view of the performance of your application, and a rich UI to drill into individual outputs and scores.
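
To make that flow concrete, here is a minimal plain-Python sketch of the loop: each example goes through the application, and every scorer is applied to the output. It does not use Weave at all. `toy_model`, `exact_match`, and `run_evaluation` are hypothetical names chosen for illustration, and Weave's own implementation may differ.

```python
# Hypothetical stand-ins for an application and a scorer; neither is a Weave API.
def toy_model(question: str) -> dict:
    known_answers = {"What is the capital of France?": "Paris"}
    return {"generated_text": known_answers.get(question, "unknown")}


def exact_match(expected: str, output: dict) -> dict:
    return {"match": expected == output["generated_text"]}


def run_evaluation(examples: list, model, scorers: list) -> list:
    """Pass each example through the model, then apply every scorer to the output."""
    rows = []
    for example in examples:
        output = model(example["question"])
        scores = {
            scorer.__name__: scorer(example["expected"], output)
            for scorer in scorers
        }
        rows.append({"example": example, "output": output, "scores": scores})
    return rows


examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is the square root of 64?", "expected": "8"},
]
rows = run_evaluation(examples, toy_model, [exact_match])

# Aggregate the per-example scores into one summary number.
match_rate = sum(row["scores"]["exact_match"]["match"] for row in rows) / len(rows)
print(f"match rate: {match_rate:.2f}")
```

The per-row records correspond to what you drill into in the UI, and the aggregate corresponds to the summary view.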

### (Optional) Custom naming

<Tabs>
  <Tab title="Python">
    There are two types of customizable names in the evaluation flow:

    * [*Evaluation object name* (`evaluation_name`)](#name-the-evaluation-object): A persistent label for your configured `Evaluation` object.
    * [*Evaluation run display name* (`__weave["display_name"]`)](#name-individual-evaluation-runs): A label for a specific evaluation execution, shown in the UI.

    #### Name the `Evaluation` object

    To name the `Evaluation` object itself, pass an `evaluation_name` parameter to the `Evaluation` class. This name helps you identify the Evaluation in code and UI listings.

    ```python lines theme={null}
    evaluation = Evaluation(
        dataset=examples, scorers=[match_score1], evaluation_name="My Evaluation"
    )
    ```

    #### Name individual evaluation runs

    To name a specific evaluation run (a call to `evaluate()`), use the `__weave` dictionary with a `display_name`. This affects what is shown in the UI for that run.

    ```python lines theme={null}
    evaluation = Evaluation(
        dataset=examples, scorers=[match_score1]
    )
    evaluation.evaluate(model, __weave={"display_name": "My Evaluation Run"})
    ```
  </Tab>

  <Tab title="TypeScript">
    To name the `Evaluation` object, pass an `id` parameter to the `Evaluation` constructor. This name helps you identify the Evaluation in code and UI listings.

    ```typescript lines theme={null}
    const evaluation = new weave.Evaluation({
      id: 'my-evaluation',
      dataset: dataset,
      scorers: [matchScore],
    });
    ```
  </Tab>
</Tabs>

## 2. Define a dataset of test examples

Define a [Dataset](../core-types/datasets) object or a list of examples to evaluate. These examples are often failure cases that you want to test for, similar to unit tests in Test-Driven Development (TDD).

<Tabs>
  <Tab title="Python">
    The following example shows a dataset defined as a list of dictionaries:

    ```python lines theme={null}
    examples = [
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
        {"question": "What is the square root of 64?", "expected": "8"},
    ]
    ```
  </Tab>

  <Tab title="TypeScript">
    The following example shows a dataset defined as a `Dataset` object with an array of rows:

    ```typescript lines theme={null}
    const dataset = new weave.Dataset({
      id: 'my-dataset',
      rows: [
        {question: 'What is the capital of France?', expected: 'Paris'},
        {question: 'Who wrote "To Kill a Mockingbird"?', expected: 'Harper Lee'},
        {question: 'What is the square root of 64?', expected: '8'},
      ],
    });
    ```
  </Tab>
</Tabs>

## 3. Define scoring functions

Then, create one or more [scoring functions](../evaluation/scorers). These are used to score each example in the `Dataset`.

<Tabs>
  <Tab title="Python">
    Each scoring function must accept an `output` keyword argument and return a dictionary of scores. Any other arguments are user-defined and are filled from the dataset example: Weave matches each argument name to the example key of the same name and passes only those keys.

    <Tip>
      If your scorer expects an `output` argument but isn't receiving it, check whether it uses the legacy `model_output` key. To fix this, update your scorer function to accept `output` as a keyword argument.
    </Tip>

    The following example scorer function `match_score1` uses the `expected` value from the `examples` dictionary for scoring.

    ```python lines theme={null}
    import weave

    # Collect your examples
    examples = [
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
        {"question": "What is the square root of 64?", "expected": "8"},
    ]

    # Define any custom scoring function
    @weave.op()
    def match_score1(expected: str, output: dict) -> dict:
        # Here is where you'd define the logic to score the model output
        return {'match': expected == output['generated_text']}
    ```
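
    The name-based matching between scorer arguments and example keys can be sketched in plain Python with `inspect.signature`. This is only an illustration of the idea, not Weave's implementation. `call_scorer` and `plain_match_score` are hypothetical names.

    ```python
    import inspect


    def call_scorer(scorer, example: dict, output):
        """Pass `output` plus only the example keys that the scorer names as parameters."""
        params = inspect.signature(scorer).parameters
        kwargs = {name: example[name] for name in params if name != "output"}
        return scorer(output=output, **kwargs)


    # Hypothetical scorer: it asks for `expected` but not for `question`.
    def plain_match_score(expected: str, output: dict) -> dict:
        return {"match": expected == output["generated_text"]}


    example = {"question": "What is the capital of France?", "expected": "Paris"}
    result = call_scorer(plain_match_score, example, {"generated_text": "Paris"})
    print(result)  # {'match': True}
    ```

    The `question` key is never passed because the scorer does not declare a parameter with that name.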

    ### (Optional) Define a custom `Scorer` class

    Some applications call for a custom `Scorer` class. For example, you might create a standardized `LLMJudge` class with specific parameters (such as the chat model and prompt), its own scoring logic for each row, and its own calculation of an aggregate score.

    See the tutorial on defining a `Scorer` class in [Model-Based Evaluation of RAG applications](/weave/tutorial-rag#optional-defining-a-scorer-class) for more information.
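
    As a rough illustration of those three responsibilities (fixed parameters, row-level scoring, and an aggregate score), here is a plain-Python sketch. `ExactMatchJudge` is a hypothetical class that does not subclass Weave's `Scorer`, and the method names `score` and `summarize` are assumptions made for this example. See the linked tutorial for the real interface.

    ```python
    class ExactMatchJudge:
        """Plain-Python stand-in; it does not subclass Weave's `Scorer`."""

        def __init__(self, output_key: str = "generated_text"):
            # Parameter fixed once at construction and reused for every row.
            self.output_key = output_key

        def score(self, expected: str, output: dict) -> dict:
            # Row-level scoring.
            return {"match": expected == output[self.output_key]}

        def summarize(self, score_rows: list) -> dict:
            # Aggregate calculation over all row-level scores.
            matches = sum(row["match"] for row in score_rows)
            return {"accuracy": matches / len(score_rows)}


    judge = ExactMatchJudge()
    score_rows = [
        judge.score("Paris", {"generated_text": "Paris"}),
        judge.score("8", {"generated_text": "eight"}),
    ]
    summary = judge.summarize(score_rows)
    print(summary)  # {'accuracy': 0.5}
    ```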
  </Tab>

  <Tab title="TypeScript">
    Each scoring function is wrapped with `weave.op` and receives an object with `modelOutput` and `datasetRow` properties.

    The following example scorer function `matchScore` compares the model output to the `expected` value from the dataset row.

    ```typescript lines theme={null}
    import * as weave from 'weave';

    // Collect your examples into a dataset
    const dataset = new weave.Dataset({
      id: 'my-dataset',
      rows: [
        {question: 'What is the capital of France?', expected: 'Paris'},
        {question: 'Who wrote "To Kill a Mockingbird"?', expected: 'Harper Lee'},
        {question: 'What is the square root of 64?', expected: '8'},
      ],
    });

    // Define any custom scoring function
    const matchScore = weave.op(
      ({modelOutput, datasetRow}) => {
        return {match: modelOutput === datasetRow.expected};
      },
      {name: 'matchScore'}
    );
    ```

    <Note>
      Class-based `Scorer` types are not yet available in TypeScript. Use function-based scorers wrapped with `weave.op`.
    </Note>
  </Tab>
</Tabs>

## 4. Define a model or function to evaluate

<Tabs>
  <Tab title="Python">
    To evaluate a `Model`, call `.evaluate()` using an `Evaluation`. `Models` are used when you have parameters that you want to experiment with and capture in Weave.

    ```python lines theme={null}
    import asyncio

    import weave
    from weave import Model, Evaluation

    class MyModel(Model):
        prompt: str

        @weave.op()
        def predict(self, question: str):
            # here's where you would add your LLM call and return the output
            return {'generated_text': 'Hello, ' + self.prompt}

    model = MyModel(prompt='World')

    evaluation = Evaluation(
        dataset=examples, scorers=[match_score1]
    )
    weave.init('intro-example') # begin tracking results with weave
    asyncio.run(evaluation.evaluate(model))
    ```

    This runs `predict` on each example and scores the output with each scoring function.

    ### (Optional) Define a function to evaluate

    Alternatively, you can also evaluate a custom function tracked by `@weave.op()`.

    ```python lines theme={null}
    @weave.op
    def function_to_evaluate(question: str):
        # here's where you would add your LLM call and return the output
        return {'generated_text': 'some response'}

    asyncio.run(evaluation.evaluate(function_to_evaluate))
    ```
  </Tab>

  <Tab title="TypeScript">
    In TypeScript, you evaluate functions wrapped with `weave.op`. The function receives the dataset row and returns the model output.

    ```typescript lines theme={null}
    import * as weave from 'weave';

    // Initialize Weave
    await weave.init('intro-example');

    // Define the function to evaluate
    const myModel = weave.op(
      async ({question}) => {
        // here's where you would add your LLM call and return the output
        return 'Paris';
      },
      {name: 'myModel'}
    );

    // Create the evaluation
    const evaluation = new weave.Evaluation({
      id: 'my-evaluation',
      dataset: dataset,
      scorers: [matchScore],
    });

    // Run the evaluation
    const results = await evaluation.evaluate({model: myModel});
    ```

    This runs `myModel` on each example and scores the output with each scoring function.
  </Tab>
</Tabs>

## 5. Run the evaluation

To run an evaluation, call `.evaluate()` on the `Evaluation` object.

<Tabs>
  <Tab title="Python">
    Assuming an `Evaluation` object called `evaluation` and a `Model` object to evaluate called `model`, the following code starts an evaluation run.

    ```python lines theme={null}
    asyncio.run(evaluation.evaluate(model))
    ```

    ### (Optional) Run multiple trials

    You can set the `trials` parameter on the `Evaluation` object to run each example multiple times.

    ```python lines theme={null}
    evaluation = Evaluation(
        dataset=examples,
        scorers=[match_score1],
        trials=3
    )
    ```
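
    To estimate how many model calls a run makes, multiply the number of examples by the number of trials. The sketch below is plain Python. `expand_trials` is a hypothetical helper, and the order in which Weave actually schedules trials is not implied here.

    ```python
    def expand_trials(examples: list, trials: int) -> list:
        """Return one (trial_index, example) pair per model call."""
        return [(trial, example) for trial in range(trials) for example in examples]


    examples = [
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
        {"question": "What is the square root of 64?", "expected": "8"},
    ]
    calls = expand_trials(examples, trials=3)
    print(len(calls))  # 9
    ```

    Three examples with `trials=3` produce nine predictions, and each prediction is scored separately.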
  </Tab>

  <Tab title="TypeScript">
    Assuming an `Evaluation` object called `evaluation` and a model function called `myModel`, the following code runs the evaluation.

    ```typescript lines theme={null}
    const results = await evaluation.evaluate({model: myModel});
    ```

    ### (Optional) Run multiple trials

    You can set the `nTrials` parameter when calling `evaluate()` to run each example multiple times.

    ```typescript lines theme={null}
    const results = await evaluation.evaluate({
      model: myModel,
      nTrials: 3,
    });
    ```
  </Tab>
</Tabs>

With three trials, Weave passes each example to the model three times. Each trial is scored and displayed independently in Weave.

## Full evaluation code example

<Tabs>
  <Tab title="Python">
    The following code sample demonstrates a complete evaluation run from start to finish. The `examples` list is used by the `match_score1` and `match_score2` scoring functions to evaluate `MyModel` given the value of `prompt`, as well as a custom function `function_to_evaluate`. Both the `Model` and the function are evaluated with `asyncio.run(evaluation.evaluate(...))`.

    ```python lines theme={null}
    from weave import Evaluation, Model
    import weave
    import asyncio
    weave.init('intro-example')
    examples = [
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
        {"question": "What is the square root of 64?", "expected": "8"},
    ]

    @weave.op()
    def match_score1(expected: str, output: dict) -> dict:
        return {'match': expected == output['generated_text']}

    @weave.op()
    def match_score2(expected: str, output: dict) -> dict:
        return {'match': expected == output['generated_text']}

    class MyModel(Model):
        prompt: str

        @weave.op()
        def predict(self, question: str):
            # here's where you would add your LLM call and return the output
            return {'generated_text': 'Hello, ' + question + self.prompt}

    model = MyModel(prompt='World')
    evaluation = Evaluation(dataset=examples, scorers=[match_score1, match_score2])

    asyncio.run(evaluation.evaluate(model))

    @weave.op()
    def function_to_evaluate(question: str):
        # here's where you would add your LLM call and return the output
        return {'generated_text': 'some response' + question}

    # Pass the function itself, not the result of calling it
    asyncio.run(evaluation.evaluate(function_to_evaluate))
    ```
  </Tab>

  <Tab title="TypeScript">
    The following code sample demonstrates a complete evaluation run from start to finish. The dataset rows are used by the `matchScore` scoring function to evaluate `myModel`.

    ```typescript lines theme={null}
    import * as weave from 'weave';

    // Initialize Weave
    await weave.init('intro-example');

    // Collect your examples into a dataset
    const dataset = new weave.Dataset({
      id: 'my-dataset',
      rows: [
        {question: 'What is the capital of France?', expected: 'Paris'},
        {question: 'Who wrote "To Kill a Mockingbird"?', expected: 'Harper Lee'},
        {question: 'What is the square root of 64?', expected: '8'},
      ],
    });

    // Define scoring functions
    const matchScore = weave.op(
      ({modelOutput, datasetRow}) => {
        return {match: modelOutput === datasetRow.expected};
      },
      {name: 'matchScore'}
    );

    // Define the function to evaluate
    const myModel = weave.op(
      async ({question}) => {
        // here's where you would add your LLM call and return the output
        return 'Paris';
      },
      {name: 'myModel'}
    );

    // Create and run the evaluation
    const evaluation = new weave.Evaluation({
      id: 'my-evaluation',
      dataset: dataset,
      scorers: [matchScore],
    });

    const results = await evaluation.evaluate({model: myModel});
    console.log('Evaluation results:', results);
    ```
  </Tab>
</Tabs>

<Frame>
  <img src="https://mintcdn.com/wb-21fd5541/aRvhhwVWqlxBzke5/images/evals-hero.png?fit=max&auto=format&n=aRvhhwVWqlxBzke5&q=85&s=7d7466d666ad412ed3916bfab533d118" alt="Evals hero" width="4100" height="2160" data-path="images/evals-hero.png" />
</Frame>

## Advanced evaluation usage

### Format dataset rows before evaluating

<Tabs>
  <Tab title="Python">
    <Warning>
      The `preprocess_model_input` function is only applied to inputs before passing them to the model's prediction function. Scorer functions always receive the original dataset example, without any preprocessing applied.
    </Warning>

    The `preprocess_model_input` parameter allows you to transform your dataset examples before they are passed to your evaluation function. This is useful when you need to:

    * Rename fields to match your model's expected input
    * Transform data into the correct format
    * Add or remove fields
    * Load additional data for each example

    Here's a simple example that shows how to use `preprocess_model_input` to rename fields:

    ```python lines theme={null}
    import weave
    from weave import Evaluation
    import asyncio

    # Our dataset has "input_text" but our model expects "question"
    examples = [
        {"input_text": "What is the capital of France?", "expected": "Paris"},
        {"input_text": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
        {"input_text": "What is the square root of 64?", "expected": "8"},
    ]

    @weave.op()
    def preprocess_example(example):
        # Rename input_text to question
        return {
            "question": example["input_text"]
        }

    @weave.op()
    def match_score(expected: str, output: dict) -> dict:
        return {'match': expected == output['generated_text']}

    @weave.op()
    def function_to_evaluate(question: str):
        return {'generated_text': f'Answer to: {question}'}

    # Create evaluation with preprocessing
    evaluation = Evaluation(
        dataset=examples,
        scorers=[match_score],
        preprocess_model_input=preprocess_example
    )

    # Run the evaluation
    weave.init('preprocessing-example')
    asyncio.run(evaluation.evaluate(function_to_evaluate))
    ```

    In this example, our dataset contains examples with an `input_text` field, but our evaluation function expects a `question` parameter. The `preprocess_example` function transforms each example by renaming the field, allowing the evaluation to work correctly.

    The preprocessing function:

    1. Receives the raw example from your dataset
    2. Returns a dictionary with the fields your model expects
    3. Is applied to each example before it's passed to your evaluation function

    This is particularly useful when working with external datasets that may have different field names or structures than what your model expects.
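
    The split described in the warning above (the model sees preprocessed input, while scorers see the original example) can be sketched in plain Python. None of the names below are Weave APIs. `rename_fields`, `stub_model`, `stub_scorer`, and `evaluate_row` are hypothetical and only illustrate the data flow.

    ```python
    # Hypothetical stand-ins; none of these names are Weave APIs.
    def rename_fields(example: dict) -> dict:
        return {"question": example["input_text"]}


    seen_by_model = []


    def stub_model(question: str) -> dict:
        seen_by_model.append(question)
        return {"generated_text": f"Answer to: {question}"}


    def stub_scorer(expected: str, output: dict) -> dict:
        return {"match": expected == output["generated_text"]}


    def evaluate_row(example: dict, preprocess, model, scorer) -> dict:
        model_input = preprocess(example)  # only the model receives this
        output = model(**model_input)
        # The scorer reads from the original example, not from model_input.
        return scorer(expected=example["expected"], output=output)


    row = {"input_text": "What is the capital of France?", "expected": "Paris"}
    score = evaluate_row(row, rename_fields, stub_model, stub_scorer)
    print(score)  # {'match': False}
    ```

    The preprocessed input drops the `expected` field, but the scorer still receives it because it reads from the untouched row.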
  </Tab>

  <Tab title="TypeScript">
    In TypeScript, you can use `columnMapping` on the `Evaluation` object to map dataset column names to the names your scorers expect. This is useful when your dataset has different field names than your scorer functions expect.

    The following example maps the dataset's `expected` column to `expectedOutputTimesTwo`, the field name that the scorer reads:

    ```typescript lines theme={null}
    const myScorer = weave.op(
      ({modelOutput, datasetRow}) => {
        return modelOutput * 2 === datasetRow.expectedOutputTimesTwo;
      },
      {name: 'myScorer'}
    );

    const evaluation = new weave.Evaluation({
      id: 'my-evaluation',
      dataset: [{expected: 2}],
      scorers: [myScorer],
      columnMapping: {expectedOutputTimesTwo: 'expected'},
    });
    ```

    <Note>
      The `preprocess_model_input` parameter is not yet available in TypeScript. Use `columnMapping` to map dataset fields to scorer expectations.
    </Note>
  </Tab>
</Tabs>

### Use HuggingFace datasets with evaluations

<Tabs>
  <Tab title="Python">
    We are continuously improving our integrations with third-party services and libraries.

    While we work on building more seamless integrations, you can use `preprocess_model_input` as a temporary workaround for using HuggingFace Datasets in Weave evaluations.

    See our [Using HuggingFace datasets in evaluations cookbook](/weave/cookbooks/hf_dataset_evals) for the current approach.
  </Tab>

  <Tab title="TypeScript">
    ```plaintext theme={null}
    This feature is not currently available in TypeScript.
    ```
  </Tab>
</Tabs>

### Saved views

You can save your Evals table configurations, filters, and sorts as *saved views* for quick access to your preferred setup. You can configure and access saved views in the UI and with the Python SDK. For more information, see [Saved Views](/weave/guides/tools/saved-views).

### Imperative evaluations (`EvaluationLogger`)

If you prefer a more flexible evaluation framework, check out Weave's [`EvaluationLogger`](../evaluation/evaluation_logger). The `EvaluationLogger` is available in both Python and TypeScript and offers more flexibility for complex workflows, while the standard evaluation framework provides more structure and guidance.
