> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wandb.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Introduction to Evaluations

> Learn how to use introduction to evaluations with W&B Weave

<Note>
  This is an interactive notebook. You can run it locally or use the links below:

  * [Open in Google Colab](https://colab.research.google.com/github/wandb/docs/blob/main/weave/cookbooks/source/Intro_to_Weave_Hello_Eval.ipynb)
  * [View source on GitHub](https://github.com/wandb/docs/blob/main/weave/cookbooks/source/Intro_to_Weave_Hello_Eval.ipynb)
</Note>

## 🔑 Prerequisites

Before you can run a Weave evaluation, complete the following prerequisites.

1. Install the W\&B Weave SDK and log in with your [API key](https://wandb.ai/settings#apikeys).
2. Install the OpenAI SDK and log in with your [API key](https://platform.openai.com/api-keys).
3. Initialize your W\&B project.

```python lines theme={null}
# Install dependencies and imports
!pip install wandb weave openai -q

import os
from getpass import getpass

from openai import OpenAI
from pydantic import BaseModel

import weave

# 🔑 Setup your API keys
# Running this cell will prompt you for your API key with `getpass` and will not echo to the terminal.
#####
print("---")
print(
    "Create a W&B API key at: https://wandb.ai/settings#apikeys"
)
os.environ["WANDB_API_KEY"] = getpass("Enter your W&B API key: ")
print("---")
print("You can generate your OpenAI API key here: https://platform.openai.com/api-keys")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
print("---")
#####

# 🏠 Enter your W&B project name
weave_client = weave.init("MY_PROJECT_NAME")  # 🐝 Your W&B project name
```

## 🐝 Run your first evaluation

The following code sample shows how to evaluate an LLM using Weave’s `Model` and `Evaluation` APIs. First, define a Weave model by subclassing `weave.Model`, specifying the model name and prompt format, and tracking a `predict` method with `@weave.op`. The `predict` method sends a prompt to OpenAI and parses the response into a structured output using a Pydantic schema (`FruitExtract`). Then, create a small evaluation dataset consisting of input sentences and expected targets. Next, define a custom scoring function (also tracked using `@weave.op`) that compares the model’s output to the target label. Finally,  wrap everything in a `weave.Evaluation`, specifying your dataset and scorers, and call `evaluate()` to run the evaluation pipeline asynchronously.

```python lines theme={null}
# 1. Construct a Weave model
class FruitExtract(BaseModel):
    fruit: str
    color: str
    flavor: str

class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    def predict(self, sentence: str) -> dict:
        client = OpenAI()

        response = client.beta.chat.completions.parse(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": self.prompt_template.format(sentence=sentence),
                }
            ],
            response_format=FruitExtract,
        )
        result = response.choices[0].message.parsed
        return result

model = ExtractFruitsModel(
    name="gpt4o",
    model_name="gpt-4o",
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) as json, from the following text : {sentence}',
)

# 2. Collect some samples
sentences = [
    "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them.",
]
labels = [
    {"fruit": "neoskizzles", "color": "purple", "flavor": "candy"},
    {"fruit": "pounits", "color": "green", "flavor": "savory"},
    {"fruit": "glowls", "color": "orange", "flavor": "sour, bitter"},
]
examples = [
    {"id": "0", "sentence": sentences[0], "target": labels[0]},
    {"id": "1", "sentence": sentences[1], "target": labels[1]},
    {"id": "2", "sentence": sentences[2], "target": labels[2]},
]

# 3. Define a scoring function for your evaluation
@weave.op()
def fruit_name_score(target: dict, output: FruitExtract) -> dict:
    target_flavors = [f.strip().lower() for f in target["flavor"].split(",")]
    output_flavors = [f.strip().lower() for f in output.flavor.split(",")]
    # Check if any target flavor is present in the output flavors
    matches = any(tf in of for tf in target_flavors for of in output_flavors)
    return {"correct": matches}

# 4. Run your evaluation
evaluation = weave.Evaluation(
    name="fruit_eval",
    dataset=examples,
    scorers=[fruit_name_score],
)
await evaluation.evaluate(model)
```

## 🚀 Looking for more examples?

* Learn how to build an [evlauation pipeline end-to-end](/weave/tutorial-eval).
* Learn how to evaluate a [RAG application by building](/weave/tutorial-rag).
