Evaluate models
Evaluate models with Weave
W&B Weave is a purpose-built toolkit for evaluating LLMs and GenAI applications. It provides comprehensive evaluation capabilities including scorers, judges, and detailed tracing to help you understand and improve model performance. Weave integrates with W&B Models, allowing you to evaluate models stored in your Model Registry.

Key features for model evaluation
- Scorers and judges: Pre-built and custom evaluation metrics for accuracy, relevance, coherence, and more
- Evaluation datasets: Structured test sets with ground truth for systematic evaluation
- Model versioning: Track and compare different versions of your models
- Detailed tracing: Debug model behavior with complete input/output traces (see the sketch after this list)
- Cost tracking: Monitor API costs and token usage across evaluations
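Tracing requires nothing beyond decorating the functions you want to monitor: once weave.init() has been called, every call to a weave.op-decorated function is recorded with its inputs and outputs. A minimal sketch, where the project path and function are placeholders:

import weave

weave.init("your-entity/your-project")

@weave.op()
def summarize(text: str) -> str:
    # Each call to this function is traced with its inputs and outputs
    return text[:100]

summarize("Weave records this call, its argument, and its return value.")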
Getting started: Evaluate a model from W&B Registry
Download a model from W&B Models Registry and evaluate it using Weave:
import weave
import wandb
from typing import Any

# Initialize Weave
weave.init("your-entity/your-project")

# Define a ChatModel that loads from W&B Registry
class ChatModel(weave.Model):
    model_name: str
    model_path: str = ""

    def model_post_init(self, __context: Any) -> None:
        # Download model from W&B Models Registry
        run = wandb.init(project="your-project", job_type="model_download")
        artifact = run.use_artifact(self.model_name)
        self.model_path = artifact.download()
        # Initialize your model from self.model_path here

    @weave.op()
    async def predict(self, query: str) -> str:
        # Your model inference logic
        return self.model.generate(query)

# Create evaluation dataset
dataset = weave.Dataset(name="eval_dataset", rows=[
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2+2?", "expected": "4"},
])

# Define scorers
@weave.op()
def exact_match_scorer(expected: str, output: str) -> dict:
    return {"correct": expected.lower() == output.lower()}

# Run evaluation
model = ChatModel(model_name="wandb-entity/registry-name/model:version")
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[exact_match_scorer]
)
results = await evaluation.evaluate(model)
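Evaluation.evaluate is a coroutine, so the await in the last line must run inside an async context. In a standalone script, one way to drive it is with asyncio:

import asyncio

async def main():
    results = await evaluation.evaluate(model)
    print(results)

asyncio.run(main())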
Integrate Weave evaluations with W&B Models
The Models and Weave Integration Demo shows the complete workflow:
- Load models from Registry: Download fine-tuned models stored in W&B Models Registry
- Create evaluation pipelines: Build comprehensive evaluations with custom scorers
- Log results back to W&B: Connect evaluation metrics to your model runs
- Version evaluated models: Save improved models back to the Registry
Log evaluation results to both Weave and W&B Models:
# Run evaluation with W&B tracking
with weave.attributes({"wandb-run-id": wandb.run.id}):
    summary, call = await evaluation.evaluate.call(evaluation, model)

# Log metrics to W&B Models
wandb.run.log(summary)
wandb.run.config.update({
    "weave_eval_url": f"https://wandb.ai/{entity}/{project}/r/call/{call.id}"
})
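The last step of the workflow, versioning the evaluated model, can be done by logging the improved model as an artifact and linking it back to the Registry. A minimal sketch, where the artifact name, local directory, and target Registry path are placeholders:

# Placeholders: artifact name, local directory with updated weights,
# and the Registry collection to link into.
model_artifact = wandb.Artifact(name="evaluated-chat-model", type="model")
model_artifact.add_dir("path/to/updated-model")
logged_artifact = wandb.run.log_artifact(model_artifact)
wandb.run.link_artifact(logged_artifact, "wandb-entity/registry-name/model")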
Advanced Weave features
Custom scorers and judges
Create sophisticated evaluation metrics tailored to your use case:
@weave.op()
async def llm_judge_scorer(expected: str, output: str, judge_model) -> dict:
    prompt = f"Is this answer correct? Expected: {expected}, Got: {output}"
    judgment = await judge_model.predict(prompt)
    return {"judge_score": judgment}
Batch evaluations
Evaluate multiple model versions or configurations:
models = [
    ChatModel(model_name="model:v1"),
    ChatModel(model_name="model:v2"),
]

for model in models:
    results = await evaluation.evaluate(model)
    print(f"{model.model_name}: {results}")
Evaluate models with tables
Use W&B Tables to:
- Compare model predictions: View side-by-side comparisons of how different models perform on the same test set
- Track prediction changes: Monitor how predictions evolve across training epochs or model versions
- Analyze errors: Filter and query to find commonly misclassified examples and error patterns
- Visualize rich media: Display images, audio, text, and other media types alongside predictions and metrics
Basic example: Log evaluation results
import wandb

# Initialize a run
run = wandb.init(project="model-evaluation")

# Create a table with evaluation results
columns = ["id", "input", "ground_truth", "prediction", "confidence", "correct"]
eval_table = wandb.Table(columns=columns)

# Add evaluation data
for idx, (input_data, label) in enumerate(test_dataset):
    prediction = model(input_data)
    confidence = prediction.max()
    predicted_class = prediction.argmax()
    eval_table.add_data(
        idx,
        wandb.Image(input_data),  # Log images or other media
        label,
        predicted_class,
        confidence,
        label == predicted_class
    )

# Log the table
run.log({"evaluation_results": eval_table})
Advanced table workflows
Compare multiple models
Log evaluation tables from different models to the same key for direct comparison:
# Model A evaluation
with wandb.init(project="model-comparison", name="model_a") as run:
    eval_table_a = create_eval_table(model_a, test_data)
    run.log({"test_predictions": eval_table_a})

# Model B evaluation
with wandb.init(project="model-comparison", name="model_b") as run:
    eval_table_b = create_eval_table(model_b, test_data)
    run.log({"test_predictions": eval_table_b})
Track predictions over time
Log tables at different training epochs to visualize improvement:
for epoch in range(num_epochs):
    train_model(model, train_data)

    # Evaluate and log predictions for this epoch
    eval_table = wandb.Table(columns=["image", "truth", "prediction"])
    for image, label in test_subset:
        pred = model(image)
        eval_table.add_data(wandb.Image(image), label, pred.argmax())

    wandb.log({f"predictions_epoch_{epoch}": eval_table})
Interactive analysis in the W&B UI
Once logged, you can:
- Filter results: Click on column headers to filter by prediction accuracy, confidence thresholds, or specific classes
- Compare tables: Select multiple table versions to see side-by-side comparisons
- Query data: Use the query bar to find specific patterns (for example, "correct" = false AND "confidence" > 0.8)
- Group and aggregate: Group by predicted class to see per-class accuracy metrics
Example: Error analysis with enriched tables
# Create a mutable table to add analysis columns
eval_table = wandb.Table(
    columns=["id", "image", "label", "prediction"],
    log_mode="MUTABLE"  # Allows adding columns later
)

# Initial predictions
for idx, (img, label) in enumerate(test_data):
    pred = model(img)
    eval_table.add_data(idx, wandb.Image(img), label, pred.argmax())

run.log({"eval_analysis": eval_table})

# Add confidence scores for error analysis
confidences = [model(img).max() for img, _ in test_data]
eval_table.add_column("confidence", confidences)

# Add error types
error_types = classify_errors(eval_table.get_column("label"),
                              eval_table.get_column("prediction"))
eval_table.add_column("error_type", error_types)

run.log({"eval_analysis": eval_table})