This is an interactive notebook. You can run it locally or use the following links:
Model, run it against a small dataset, score the outputs with a custom scoring function, and review the results in Weave. It’s intended for developers who are new to Weave and want a quick, hands-on starting point before exploring more advanced evaluation workflows.
Prerequisites
Before you can run a Weave evaluation, complete the following prerequisites.- Install the W&B Weave SDK and log in with your API key.
- Install the OpenAI SDK and log in with your API key.
- Initialize your W&B project.
Run your first evaluation
With your environment set up, you’re ready to define and run an evaluation. The following code sample shows how to evaluate an LLM using Weave’sModel and Evaluation APIs. First, define a Weave model by subclassing weave.Model, specifying the model name and prompt format, and tracking a predict method with @weave.op. The predict method sends a prompt to OpenAI and parses the response into a structured output using a Pydantic schema (FruitExtract). Then, create a small evaluation dataset that consists of input sentences and expected targets. Next, define a custom scoring function (also tracked using @weave.op) that compares the model’s output to the target label. Finally, wrap everything in a weave.Evaluation, specifying your dataset and scorers, and call evaluate() to run the evaluation pipeline asynchronously.
Looking for more examples
Now that you’ve run a basic evaluation, you can explore more advanced workflows in the following tutorials:- Learn how to build an evaluation pipeline end-to-end.
- Learn how to evaluate a RAG application.