This is an interactive notebook that you can run locally.
By the end of this guide, you’ll have a working evaluation pipeline that references rows in a HuggingFace dataset by index, preprocesses each row into the format your model expects, and tracks the scored results in Weave. This pattern is useful when you want to evaluate models against existing HuggingFace datasets without first converting them into a Weave-native dataset.
This guide demonstrates a workaround for using HuggingFace datasets with Weave evaluations. The approach works today, and more seamless integrations for external datasets are in development.
Setup and imports
First, initialize Weave and connect to W&B so that Weave tracks the evaluation runs and their results in your Weave project.
Load and prepare HuggingFace dataset
Next, load the HuggingFace dataset and build a lightweight index that the evaluation iterates over. Rather than passing the dataset rows directly to Weave, pass a list of index references and resolve them to full rows during preprocessing. This approach keeps the evaluation tied to the original HuggingFace dataset and maintains references to it.
In the index, encode the hf_hub_name along with the hf_id to ensure each row has a unique identifier. Weave uses this unique digest value to track and reference specific dataset entries during evaluations.
Define processing and evaluation functions
With the index in place, define the three functions that make up the evaluation pipeline: one to turn each index reference into a usable example, one to score the model’s output, and one representing the model being evaluated. The processing pipeline uses the following functions:
- preprocess_example: Transforms the index reference into the actual data needed for evaluation.
- hf_eval: Defines how to score the model outputs.
- function_to_evaluate: The actual function or model being evaluated.
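The index and the three functions described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's original code: the in-memory hf_rows list stands in for a split loaded with the datasets library, the dataset name and the question/response column names are hypothetical, and the scoring rule is a placeholder. In a Weave project, each function would typically also be decorated with @weave.op() so that Weave traces its calls.

```python
# Stand-in for a loaded HuggingFace split. The rows, column names, and
# dataset name below are hypothetical and exist only for illustration.
HF_HUB_NAME = "example-org/example-dataset"
hf_rows = [
    {"question": "What is 2 + 2?", "response": "4"},
    {"question": "What is the capital of France?", "response": "Paris"},
]

# Lightweight index: one small reference per row instead of the row itself.
# Pairing hf_hub_name with hf_id keeps each reference unique across datasets.
hf_index = [{"hf_id": i, "hf_hub_name": HF_HUB_NAME} for i in range(len(hf_rows))]


def preprocess_example(example: dict) -> dict:
    """Resolve an index reference into the inputs the model expects."""
    row = hf_rows[example["hf_id"]]
    return {"prompt": row["question"]}


def function_to_evaluate(prompt: str) -> dict:
    """Placeholder model: echoes the prompt back."""
    return {"generated_text": f"echo: {prompt}"}


def hf_eval(hf_id: int, output: dict) -> dict:
    """Placeholder scorer: checks whether the reference answer appears in the output."""
    expected = hf_rows[hf_id]["response"]
    return {"contains_answer": expected in output["generated_text"]}
```

Because the scorer receives hf_id rather than the full row, it looks up the reference answer itself, which keeps the index entries small.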
Create and run the evaluation
Finally, wire the index, scorer, and preprocessing function into a Weave Evaluation and run it against the model. For each entry in hf_index, Weave performs the following steps:
- preprocess_example gets the corresponding data from the HuggingFace dataset.
- Weave passes the preprocessed data to function_to_evaluate.
- hf_eval scores the output.
- Weave tracks the results.
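To make the order of the steps above concrete, here is a plain-Python sketch of the same loop. It is not how Weave is implemented: run_evaluation is a hypothetical helper, and the rows, dataset name, and scoring rule are illustrative stand-ins so the example runs on its own. In the actual pipeline, the Weave Evaluation object takes the place of this loop and records the results in your project.

```python
# Hypothetical stand-ins so the sketch runs without external services.
hf_rows = [
    {"question": "What is 2 + 2?", "response": "4"},
    {"question": "What is the capital of France?", "response": "Paris"},
]
hf_index = [
    {"hf_id": i, "hf_hub_name": "example-org/example-dataset"}
    for i in range(len(hf_rows))
]


def preprocess_example(example):
    # Step 1: resolve the index reference into model inputs.
    return {"prompt": hf_rows[example["hf_id"]]["question"]}


def function_to_evaluate(prompt):
    # Placeholder model that always returns the same text.
    return {"generated_text": "The answer is 4."}


def hf_eval(hf_id, output):
    # Placeholder scorer: is the reference answer contained in the output?
    return {"contains_answer": hf_rows[hf_id]["response"] in output["generated_text"]}


def run_evaluation(index, preprocess, model, scorer):
    """Simplified stand-in for the per-entry steps an evaluation performs."""
    results = []
    for example in index:
        model_input = preprocess(example)                      # step 1
        output = model(**model_input)                          # step 2
        score = scorer(hf_id=example["hf_id"], output=output)  # step 3
        results.append(                                        # step 4
            {"hf_id": example["hf_id"], "output": output, "score": score}
        )
    return results


results = run_evaluation(hf_index, preprocess_example, function_to_evaluate, hf_eval)
```

Note that the model only ever sees the preprocessed inputs, while the scorer receives the original index fields (here hf_id) together with the model output.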