Introduction to Evaluations

This is an interactive notebook that you can run locally.

This notebook introduces W&B Weave evaluations by walking you through a minimal end-to-end example. You define a Weave Model, run it against a small dataset, score the outputs with a custom scoring function, and review the results in Weave. It’s intended for developers who are new to Weave and want a quick, hands-on starting point before exploring more advanced evaluation workflows.

Prerequisites

Before you can run a Weave evaluation, complete the following prerequisites.

Install the W&B Weave SDK and log in with your W&B API key.
Install the OpenAI SDK and set your OpenAI API key.
Initialize your W&B project.

# Install dependencies and imports
!pip install wandb weave openai -q

import os
from getpass import getpass

from openai import OpenAI
from pydantic import BaseModel

import weave

# 🔑 Set up your API keys
# Running this cell prompts you for each API key with `getpass`, which does not echo your input to the terminal.
#####
print("---")
print(
    "Create a W&B API key at: https://wandb.ai/settings#apikeys"
)
os.environ["WANDB_API_KEY"] = getpass("Enter your W&B API key: ")
print("---")
print("You can generate your OpenAI API key here: https://platform.openai.com/api-keys")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
print("---")
#####

# 🏠 Enter your W&B project name
weave_client = weave.init("MY_PROJECT_NAME")  # 🐝 Your W&B project name
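
If the keys are already exported in your shell or CI environment, prompting on every run is unnecessary. The following is a minimal sketch, not part of the original notebook, that prompts only when a variable is missing. The helper name `ensure_env_key` and its `ask` parameter are hypothetical, introduced here for illustration only.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env_key(name: str, prompt: str, ask: Callable[[str], str] = getpass) -> str:
    """Return os.environ[name], prompting for it only when it is unset or empty.

    `ensure_env_key` is a hypothetical helper, not part of Weave or the OpenAI SDK.
    `ask` defaults to getpass so the key is not echoed; it can be swapped out in tests.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = ask(prompt).strip()
        if not value:
            raise ValueError(f"{name} is required but was not provided")
        # Store the key so libraries that read the environment can find it.
        os.environ[name] = value
    return value
```

In the notebook, you could call `ensure_env_key("WANDB_API_KEY", "Enter your W&B API key: ")` and `ensure_env_key("OPENAI_API_KEY", "Enter your OpenAI API key: ")` before `weave.init`, in place of the unconditional `getpass` prompts.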

Run your first evaluation

With your environment set up, you’re ready to define and run an evaluation. The following code sample evaluates an LLM using Weave’s Model and Evaluation APIs in four steps:

Define a Weave model by subclassing weave.Model, specifying the model name and prompt template, and tracking a predict method with @weave.op. The predict method sends a prompt to OpenAI and parses the response into a structured output using a Pydantic schema (FruitExtract).
Create a small evaluation dataset that consists of input sentences and expected targets.
Define a custom scoring function (also tracked using @weave.op) that compares the flavor in the model’s output to the flavor in the target label.
Wrap everything in a weave.Evaluation, specifying your dataset and scorers, and call evaluate() to run the evaluation pipeline asynchronously.

# 1. Construct a Weave model
class FruitExtract(BaseModel):
    fruit: str
    color: str
    flavor: str

class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    def predict(self, sentence: str) -> FruitExtract:
        client = OpenAI()

        response = client.beta.chat.completions.parse(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": self.prompt_template.format(sentence=sentence),
                }
            ],
            response_format=FruitExtract,
        )
        result = response.choices[0].message.parsed
        return result

model = ExtractFruitsModel(
    name="gpt4o",
    model_name="gpt-4o",
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) as json, from the following text : {sentence}',
)

# 2. Collect some samples
sentences = [
    "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them.",
]
labels = [
    {"fruit": "neoskizzles", "color": "purple", "flavor": "candy"},
    {"fruit": "pounits", "color": "green", "flavor": "savory"},
    {"fruit": "glowls", "color": "orange", "flavor": "sour, bitter"},
]
examples = [
    {"id": "0", "sentence": sentences[0], "target": labels[0]},
    {"id": "1", "sentence": sentences[1], "target": labels[1]},
    {"id": "2", "sentence": sentences[2], "target": labels[2]},
]

# 3. Define a scoring function for your evaluation
# Note: despite its name, fruit_name_score compares the flavor field, not the fruit name.
@weave.op()
def fruit_name_score(target: dict, output: FruitExtract) -> dict:
    target_flavors = [f.strip().lower() for f in target["flavor"].split(",")]
    output_flavors = [f.strip().lower() for f in output.flavor.split(",")]
    # Correct if any target flavor appears as a substring of any output flavor
    matches = any(tf in of for tf in target_flavors for of in output_flavors)
    return {"correct": matches}

# 4. Run your evaluation
evaluation = weave.Evaluation(
    name="fruit_eval",
    dataset=examples,
    scorers=[fruit_name_score],
)
await evaluation.evaluate(model)
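
Top-level `await` works in a notebook cell because Jupyter already runs an event loop. A plain Python script has no running loop, so the coroutine must be driven with `asyncio.run`. The sketch below shows that pattern with a stand-in coroutine. `fake_evaluate` and its return shape are hypothetical: they only mimic an awaitable call such as `evaluation.evaluate(model)` and do not use Weave.

```python
import asyncio


async def fake_evaluate(rows: list[dict]) -> dict:
    """Hypothetical stand-in for an awaitable evaluation call; does not use Weave."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    correct = sum(1 for row in rows if row["correct"])
    return {"correct": {"true_count": correct, "total": len(rows)}}


def main() -> dict:
    rows = [{"correct": True}, {"correct": False}, {"correct": True}]
    # asyncio.run creates an event loop, runs the coroutine to completion, and closes the loop.
    return asyncio.run(fake_evaluate(rows))


if __name__ == "__main__":
    print(main())
```

In a script version of this notebook, the equivalent change would be to replace the top-level `await` with `asyncio.run(evaluation.evaluate(model))`.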

After the evaluation finishes, Weave logs the model, dataset, and per-example scores to your project so you can inspect the results in the Weave UI.
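
Before spending API calls, it can help to sanity-check the scoring logic on hand-written outputs. The sketch below reproduces the scorer's flavor-matching rule as a plain function and runs it against stand-in predictions. `FruitExtractStub`, `flavor_match`, and the sample predictions are assumptions made for this illustration. The stub is a dataclass rather than the Pydantic model, so the snippet has no third-party dependencies.

```python
from dataclasses import dataclass


@dataclass
class FruitExtractStub:
    """Hypothetical stand-in for the FruitExtract Pydantic model."""
    fruit: str
    color: str
    flavor: str


def flavor_match(target: dict, output: FruitExtractStub) -> dict:
    """Same rule as the notebook's scorer: any target flavor is a substring of any output flavor."""
    target_flavors = [f.strip().lower() for f in target["flavor"].split(",")]
    output_flavors = [f.strip().lower() for f in output.flavor.split(",")]
    matches = any(tf in of for tf in target_flavors for of in output_flavors)
    return {"correct": matches}


# Hand-written (target, prediction) pairs; the predictions are invented for this check.
cases = [
    ({"flavor": "candy"}, FruitExtractStub("neoskizzles", "purple", "Candy")),
    ({"flavor": "savory"}, FruitExtractStub("pounits", "green", "more savory than sweet")),
    ({"flavor": "sour, bitter"}, FruitExtractStub("glowls", "orange", "sweet")),
]

results = [flavor_match(target, output)["correct"] for target, output in cases]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

The second case passes only because the rule is substring-based: "savory" is found inside "more savory than sweet". An exact-match rule would reject it, which is worth keeping in mind when reading the scores in the Weave UI.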

Looking for more examples

Now that you’ve run a basic evaluation, you can explore more advanced workflows in the following tutorials:

Learn how to build an evaluation pipeline end-to-end.
Learn how to evaluate a RAG application.
