Leaderboard Quickstart
This quickstart shows you how to use the W&B Weave Leaderboard to compare model performance across multiple datasets and scoring functions. By the end, you will have published a leaderboard that ranks several models against a shared set of evaluations. You can then identify which model performs best on each metric. This guide is for developers familiar with running Weave evaluations who want to compare results side by side.
Specifically, you will:
- Generate a dataset of fake zip code data.
- Author some scoring functions and evaluate a baseline model.
- Use these techniques to evaluate a matrix of models versus evaluations.
- Review the leaderboard in the Weave UI.
Step 1: Generate a dataset of fake zip code data
First, create a function generate_dataset_rows that generates a list of fake zip code data. This synthetic dataset gives the leaderboard a consistent set of inputs and expected values to score each model against.
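As a rough illustration of what such a function could look like, the sketch below builds rows with Python's standard `random` module instead of calling a model. The field names (`zip_code`, `city`, `state`, `population`, `median_income`, `known_for`) are assumptions inferred from the scorers described in Step 2, and the `count` and `seed` parameters and the sample places are hypothetical.

```python
import random

# Hypothetical sample places; the values are illustrative, not real statistics.
_PLACES = [
    ("Springfield", "IL", "Abraham Lincoln's home"),
    ("Portland", "OR", "food carts and bridges"),
    ("Austin", "TX", "live music"),
]


def generate_dataset_rows(count: int = 5, seed: int = 0) -> list[dict]:
    """Return `count` fake zip code rows; a fixed seed keeps them reproducible."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        city, state, known_for = rng.choice(_PLACES)
        rows.append(
            {
                # Random five-digit string; not a real zip code for the city.
                "zip_code": f"{rng.randrange(100000):05d}",
                "city": city,
                "state": state,
                "population": rng.randint(1_000, 100_000),
                "median_income": rng.randint(30_000, 120_000),
                "known_for": known_for,
            }
        )
    return rows


rows = generate_dataset_rows(count=5)
```

Seeding the generator means every model is scored against identical rows, which is what makes the leaderboard rankings comparable.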
Step 2: Author scoring functions
Next, author three scoring functions. Each scorer evaluates a different aspect of the model output so the leaderboard can rank models on distinct dimensions of quality:
- check_concrete_fields: Checks if the model output matches the expected city and state.
- check_value_fields: Checks if the model output is within 10% of the expected population and median income.
- check_subjective_fields: Uses an LLM to check if the model output matches the expected “known for” field.
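The two deterministic scorers can be sketched in plain Python as shown below. This is a minimal sketch under assumptions: the model output is taken to be a dict with `city`, `state`, `population`, and `median_income` keys, the result key names are hypothetical, and `within_tolerance` is a helper introduced here for illustration. In Weave, functions like these are typically decorated with `@weave.op`; the decorator is left out so the sketch runs without dependencies. The LLM-based `check_subjective_fields` is omitted because it needs a model call.

```python
def check_concrete_fields(city: str, state: str, output: dict) -> dict:
    """Exact-match check on the categorical fields."""
    return {
        "city_match": city == output.get("city"),
        "state_match": state == output.get("state"),
    }


def within_tolerance(expected: float, actual, tolerance: float = 0.10) -> bool:
    """True when `actual` is within `tolerance` (as a fraction) of `expected`."""
    if actual is None:
        return False
    return abs(actual - expected) <= tolerance * abs(expected)


def check_value_fields(population: int, median_income: int, output: dict) -> dict:
    """Numeric fields pass when the prediction is within 10% of the expected value."""
    return {
        "population_match": within_tolerance(population, output.get("population")),
        "median_income_match": within_tolerance(
            median_income, output.get("median_income")
        ),
    }


# Example: a prediction that gets the state wrong and the income too high.
prediction = {
    "city": "Austin",
    "state": "OK",
    "population": 1_050,
    "median_income": 80_000,
}
concrete = check_concrete_fields("Austin", "TX", prediction)
values = check_value_fields(1_000, 60_000, prediction)
```

Returning a dict of named booleans, rather than a single score, lets each field appear as its own metric when results are aggregated.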
Step 3: Create an evaluation
Next, define an evaluation using the fake data and scoring functions. The Evaluation object pairs the dataset with the scorers, so you can run any model against the same benchmark.
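To make the pairing concrete, the sketch below is a plain-Python stand-in, not Weave's `Evaluation` class. The `SimpleEvaluation` name, its `evaluate` method, and the argument-matching helper are all hypothetical. It assumes each scorer receives the dataset columns named in its signature plus the model's `output`, and it reports the fraction of rows that pass each check.

```python
import inspect
from dataclasses import dataclass
from typing import Callable


def _call_with_matching_args(fn: Callable, available: dict):
    """Call `fn` with only the keys of `available` that it declares as parameters."""
    params = inspect.signature(fn).parameters
    return fn(**{k: v for k, v in available.items() if k in params})


@dataclass
class SimpleEvaluation:
    """Hypothetical stand-in that pairs one dataset with a list of scorers."""

    dataset: list[dict]
    scorers: list[Callable]

    def evaluate(self, model: Callable) -> dict:
        """Run `model` on every row and return the pass rate per scorer field."""
        totals: dict[str, int] = {}
        for row in self.dataset:
            output = _call_with_matching_args(model, row)
            for scorer in self.scorers:
                scores = _call_with_matching_args(scorer, {**row, "output": output})
                for name, passed in scores.items():
                    key = f"{scorer.__name__}.{name}"
                    totals[key] = totals.get(key, 0) + int(bool(passed))
        return {key: hits / len(self.dataset) for key, hits in totals.items()}


# Tiny illustrative dataset, scorer, and model.
rows = [
    {"zip_code": "62701", "city": "Springfield", "state": "IL"},
    {"zip_code": "97201", "city": "Portland", "state": "OR"},
]


def check_concrete_fields(city: str, state: str, output: dict) -> dict:
    return {
        "city_match": city == output.get("city"),
        "state_match": state == output.get("state"),
    }


def always_springfield(zip_code: str) -> dict:
    """Toy model that ignores its input."""
    return {"city": "Springfield", "state": "IL"}


evaluation = SimpleEvaluation(dataset=rows, scorers=[check_concrete_fields])
summary = evaluation.evaluate(always_springfield)
```

Because the evaluation object holds the dataset and scorers, swapping in a different model callable is the only change needed to produce another comparable row of results.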