This guide shows you how to use EvaluationLogger to record predictions and scores from your existing Python or TypeScript code, so you can evaluate model performance in Weave without first defining a full dataset and scorer suite. This approach is helpful in complex workflows where the entire dataset or all scorers might not be defined upfront. In contrast to the standard Evaluation object, which requires a predefined Dataset and list of Scorer objects, the EvaluationLogger lets you log individual predictions and their associated scores incrementally as they become available.
Prefer a more structured evaluation? If you prefer a more opinionated evaluation framework with predefined datasets and scorers, see Weave’s standard Evaluation framework. The EvaluationLogger offers flexibility, while the standard framework offers structure and guidance.

Basic workflow

Following these steps records a complete evaluation in Weave, with per-prediction scores and an aggregated summary that you can review in the Weave UI.
  1. Initialize the logger: Create an instance of EvaluationLogger, optionally providing metadata about the model and dataset. If you omit the metadata, Weave uses defaults.
    To capture token usage and cost for LLM calls (for example, OpenAI), initialize EvaluationLogger before any LLM invocations. If you call your LLM first and then log predictions afterward, Weave doesn’t capture token and cost data.
  2. Log predictions: Call log_prediction() for each input and output pair from your system.
  3. Log scores: Call log_score() on the returned ScoreLogger to record a score for the prediction. You can log multiple scores per prediction.
  4. Finish prediction: Always call finish() after logging scores for a prediction to finalize it.
  5. Log summary: After all predictions are processed, call log_summary() to aggregate scores and add optional custom metrics.
After calling finish() on a prediction, no more scores can be logged for it.
For a Python example that demonstrates this workflow, see the Basic example. If the output and all scores are available at once, Python users can combine steps 2 through 4 into a single call using log_example().

Basic example

The following example shows how to use EvaluationLogger to log predictions and scores inline with your existing code.
The user_model function is defined and applied to a list of inputs. For each example:
  • The input and output are logged using log_prediction.
  • A correctness score (correctness_score) is logged through log_score.
  • finish() finalizes logging for that prediction.
Finally, log_summary records any aggregate metrics and triggers automatic score summarization in Weave.
import weave
from openai import OpenAI
from weave import EvaluationLogger

weave.init('your-team/your-project')

# Initialize EvaluationLogger BEFORE calling the model to ensure token tracking
eval_logger = EvaluationLogger(
    model="my_model",
    dataset="my_dataset"
)

# Example input data (this can be any data structure you want)
eval_samples = [
    {'inputs': {'a': 1, 'b': 2}, 'expected': 3},
    {'inputs': {'a': 2, 'b': 3}, 'expected': 5},
    {'inputs': {'a': 3, 'b': 4}, 'expected': 7},
]

# Example model logic using OpenAI
@weave.op
def user_model(a: int, b: int) -> int:
    oai = OpenAI()
    response = oai.chat.completions.create(
        messages=[{"role": "user", "content": f"What is {a}+{b}?"}],
        model="gpt-4o-mini"
    )
    # Use the response in some way (here we just return a + b for simplicity)
    return a + b

# Iterate through examples, predict, and log
for sample in eval_samples:
    inputs = sample["inputs"]
    model_output = user_model(**inputs) # Pass inputs as kwargs

    # Log the prediction input and output
    pred_logger = eval_logger.log_prediction(
        inputs=inputs,
        output=model_output
    )

    # Calculate and log a score for this prediction
    expected = sample["expected"]
    correctness_score = model_output == expected
    pred_logger.log_score(
        scorer="correctness", # Simple string name for the scorer
        score=correctness_score
    )

    # Finish logging for this specific prediction
    pred_logger.finish()

# Log a final summary for the entire evaluation.
# Weave auto-aggregates the 'correctness' scores logged above.
summary_stats = {"subjective_overall_score": 0.8}
eval_logger.log_summary(summary_stats)

print("Evaluation logging complete. View results in the Weave UI.")

Simplified logging with log_example()

Use log_example() to log inputs, an output, and scores in a single call. This convenience method combines log_prediction(), log_score(), and finish() into one step, and it’s useful when you already have the inputs, model outputs, and scores ready to log, such as during batch or offline evaluations.
import weave
from weave import EvaluationLogger

weave.init('your-team-name/your-project-name')

eval_logger = EvaluationLogger(
    model="my_model",
    dataset="my_dataset"
)

eval_samples = [
    {'inputs': {'a': 1, 'b': 2}, 'expected': 3},
    {'inputs': {'a': 2, 'b': 3}, 'expected': 5},
    {'inputs': {'a': 3, 'b': 4}, 'expected': 7},
]

for sample in eval_samples:
    inputs = sample['inputs']
    output = inputs['a'] + inputs['b']

    eval_logger.log_example(
        inputs=inputs,
        output=output,
        scores={"correctness": output == sample['expected']}
    )

eval_logger.log_summary({"avg_score": 1.0})
The previous log_example() call is equivalent to:
pred = eval_logger.log_prediction(inputs=inputs, output=output)
pred.log_score(scorer="correctness", score=output == sample['expected'])
pred.finish()
log_example() is not available for the Weave TypeScript SDK. TypeScript users should use the logPrediction() and logScore() pattern shown in the basic example.
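The examples above pass a hard-coded value such as {"avg_score": 1.0} to log_summary(). In practice you would usually compute custom summary metrics from the scores you logged. The following is a minimal sketch, assuming you collect each example's scores in a list as you go. summarize_scores is a hypothetical helper and is not part of Weave.

```python
from statistics import mean


def summarize_scores(score_rows):
    """Average each named score across examples; booleans count as 0 or 1."""
    names = sorted({name for row in score_rows for name in row})
    summary = {}
    for name in names:
        values = [float(row[name]) for row in score_rows if name in row]
        summary[f"avg_{name}"] = mean(values)
    return summary


# Scores collected while looping over the samples (illustrative values only)
collected = [
    {"correctness": True},
    {"correctness": True},
    {"correctness": False},
]

summary = summarize_scores(collected)
print(summary)
```

You could then pass the resulting dict to eval_logger.log_summary(summary) in place of the literal value.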

Advanced usage

The EvaluationLogger offers flexible patterns beyond the basic workflow to accommodate more complex evaluation scenarios. The following sections describe advanced techniques, including how to use context managers for automatic resource management, separate model execution from logging, work with rich media data, and compare multiple model evaluations side by side.

Use context managers

The EvaluationLogger supports context managers (with statements) for both predictions and scores. Using with statements gives you:
  • Automatic finish() calls when exiting the context.
  • Better token and cost tracking for nested operations, such as LLM judge calls.
  • The ability to set the output after the model runs, within the prediction context.
import openai
import weave

weave.init("nested-evaluation-example")
oai = openai.OpenAI()

# Initialize the logger
ev = weave.EvaluationLogger(
    model="gpt-4o-mini",
    dataset="joke_dataset"
)

user_prompt = "Tell me a joke"

# Use context manager for prediction - no need to call finish()
with ev.log_prediction(inputs={"user_prompt": user_prompt}) as pred:
    # Make your model call within the context
    result = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_prompt}],
    )

    # Set the output after the model call
    pred.output = result.choices[0].message.content

    # Log simple scores
    pred.log_score("correctness", 1.0)
    pred.log_score("ambiguity", 0.3)
    
    # Use nested context manager for scores that require LLM calls
    with pred.log_score("llm_judge") as score:
        judge_result = oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate how funny the joke is from 1-5"},
                {"role": "user", "content": pred.output},
            ],
        )
        # Set the score value after computation
        score.value = judge_result.choices[0].message.content

# finish() is automatically called when exiting the 'with' block

ev.log_summary({"avg_score": 1.0})
This pattern ensures that all nested operations are tracked and attributed to the parent prediction, giving you accurate token usage and cost data in the Weave UI.
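In the example above, the llm_judge score is set to the judge's raw reply, which is a string. If you want a numeric score, one option is to parse the rating before assigning it. The following is a minimal sketch, assuming the judge replies with a rating from 1 to 5 somewhere in its text. parse_rating is a hypothetical helper and is not part of Weave or the OpenAI SDK.

```python
import re
from typing import Optional


def parse_rating(text: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in `text` that falls within [low, high], else None."""
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


print(parse_rating("I'd rate this joke a 4 out of 5."))  # 4
print(parse_rating("Sorry, I can't rate that."))  # None
```

Inside the nested with block, you could then write score.value = parse_rating(judge_result.choices[0].message.content). This parsing is fragile, because a reply such as "Out of 5, I give it 3" would yield 5. Instructing the judge to answer with a single digit makes it more reliable.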
When you pass raw datasets as inputs to log_prediction, Weave reimports the data with every evaluation run. This stores duplicate data, which wastes space when the dataset is large or when many evaluations reuse it. To avoid the duplication, publish your dataset to Weave before running any evaluations, then pass the published dataset’s rows as inputs. Weave then points to the published rows by reference instead of reimporting the data. This gives you the same linked experience as the standard Evaluation framework, where each prediction links back to a specific dataset row in the Weave UI. The following example publishes a dataset and then retrieves it by name so that its rows can be used as inputs.
import weave
from weave import EvaluationLogger

weave.init("your-team-name/your-project-name")

# Publish the dataset (only needs to happen once)
dataset = weave.Dataset(
    name="my_eval_dataset",
    rows=[
        {"question": "What is the capital of France?", "expected": "Paris"},
        {"question": "What U.S. state is Seattle in?", "expected": "Washington"},
        {"question": "In what country is Mount Fuji located?", "expected": "Japan"},
    ],
)
weave.publish(dataset)

# Retrieve the published dataset
dataset = weave.ref("my_eval_dataset").get()
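
The listing above stops after retrieving the dataset. The following is a minimal sketch of the remaining step. It assumes that the retrieved object exposes its rows as dataset.rows, that EvaluationLogger accepts the dataset object through its dataset argument, and that a row can be passed directly as inputs. log_published_rows and predict are hypothetical names used for illustration.

```python
def log_published_rows(ev, rows, predict):
    """Log one example per dataset row and return the accuracy.

    `ev` is expected to behave like an EvaluationLogger (it needs
    log_example() and log_summary()); `predict` is your own model function.
    """
    results = []
    for row in rows:
        output = predict(row["question"])
        is_correct = output == row["expected"]
        # Pass the row itself as inputs so the prediction can link back to it.
        ev.log_example(
            inputs=row,
            output=output,
            scores={"correctness": is_correct},
        )
        results.append(is_correct)
    accuracy = sum(results) / len(results) if results else 0.0
    ev.log_summary({"accuracy": accuracy})
    return accuracy


# Assumed usage with the dataset retrieved above (not verified here):
#   ev = EvaluationLogger(model="my_model", dataset=dataset)
#   log_published_rows(ev, dataset.rows, my_model)
```

Keeping the loop in a function that only relies on log_example() and log_summary() also makes it easy to exercise with a stub logger in unit tests.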

Get outputs before logging

You can first compute your model outputs, then separately log predictions and scores. This separates evaluation and logging logic, which can make code easier to test and maintain when different parts of your system handle prediction generation and scoring.
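from weave import EvaluationLogger

# `your_dataset` (a list of input dicts) and `your_output_generator` are placeholders for your own data and model.
# Initialize EvaluationLogger BEFORE calling the model to ensure token tracking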
ev = EvaluationLogger(
    model="example_model",
    dataset="example_dataset"
)

# Model outputs (e.g. OpenAI calls) must happen after logger init for token tracking
outputs = [your_output_generator(**inputs) for inputs in your_dataset]
preds = [ev.log_prediction(inputs, output) for inputs, output in zip(your_dataset, outputs)]
for pred, output in zip(preds, outputs):
    pred.log_score(scorer="greater_than_5_scorer", score=output > 5)
    pred.log_score(scorer="greater_than_7_scorer", score=output > 7)
    pred.finish()

ev.log_summary()

Log rich media

Inputs, outputs, and scores can include rich media such as images, videos, audio, or structured tables. Logging rich media lets you inspect the actual content alongside scores in the Weave UI, which is helpful for qualitative analysis of multimodal models. Pass a dict or media object into the log_prediction or log_score methods.
import io
import wave
import struct
from PIL import Image
import random
from typing import Any
import weave
from weave import EvaluationLogger

def generate_random_audio_wave_read(duration=2, sample_rate=44100):
    n_samples = duration * sample_rate
    amplitude = 32767  # 16-bit max amplitude

    buffer = io.BytesIO()

    # Write wave data to the buffer
    with wave.open(buffer, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(sample_rate)

        for _ in range(n_samples):
            sample = random.randint(-amplitude, amplitude)
            wf.writeframes(struct.pack('<h', sample))

    # Rewind the buffer to the beginning so we can read from it
    buffer.seek(0)

    # Return a Wave_read object
    return wave.open(buffer, 'rb')

rich_media_dataset = [
    {
        'image': Image.new(
            "RGB",
            (100, 100),
            color=(
                random.randint(0, 255),
                random.randint(0, 255),
                random.randint(0, 255),
            ),
        ),
        "audio": generate_random_audio_wave_read(),
    }
    for _ in range(5)
]

@weave.op
def your_output_generator(image: Image.Image, audio) -> dict[str, Any]:
    return {
        "result": random.randint(0, 10),
        "image": image,
        "audio": audio,
    }

ev = EvaluationLogger(model="example_model", dataset="example_dataset")

for inputs in rich_media_dataset:
    output = your_output_generator(**inputs)
    pred = ev.log_prediction(inputs, output)
    pred.log_score(scorer="greater_than_5_scorer", score=output["result"] > 5)
    pred.log_score(scorer="greater_than_7_scorer", score=output["result"] > 7)
    # Finalize this prediction before moving on to the next one
    pred.finish()

ev.log_summary()

Log and compare multiple evaluations

With EvaluationLogger, you can log and compare multiple evaluations side by side in the Weave UI. This is useful for assessing how different models perform on the same dataset.
  1. Run the following code sample.
  2. In the Weave UI, navigate to the Evals tab.
  3. Select the evals that you want to compare.
  4. Click the Compare button. In the Compare view, you can:
    • Choose which Evals to add or remove.
    • Choose which metrics to show or hide.
    • Page through specific examples to see how different models performed for the same input on a given dataset.
For more information about comparisons, see Comparisons.
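import weave
from weave import EvaluationLogger

# `your_dataset` and `your_output_generator` are placeholders for your own data and model.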

models = [
    "model1",
    "model2",
    {"name": "model3", "metadata": {"coolness": 9001}},
]

for model in models:
    # EvaluationLogger must be initialized before model calls to capture tokens
    ev = EvaluationLogger(
        name="comparison-eval",
        model=model,
        dataset="example_dataset",
        scorers=["greater_than_3_scorer", "greater_than_5_scorer", "greater_than_7_scorer"],
        eval_attributes={"experiment_id": "exp_123"}
    )
    for inputs in your_dataset:
        output = your_output_generator(**inputs)
        pred = ev.log_prediction(inputs=inputs, output=output)
        pred.log_score(scorer="greater_than_3_scorer", score=output > 3)
        pred.log_score(scorer="greater_than_5_scorer", score=output > 5)
        pred.log_score(scorer="greater_than_7_scorer", score=output > 7)
        pred.finish()

    ev.log_summary()
[Image: The Evals tab]
[Image: The Comparison view]

Usage tips

The following tips help you get the most out of EvaluationLogger:
  • Call finish() promptly after each prediction.
  • Use log_summary to capture metrics not tied to single predictions (for example, overall latency).
  • Rich media logging is useful for qualitative analysis.
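
As an illustration of the log_summary tip, the following sketch measures per-call latency and builds a summary dict from it. It uses only the Python standard library. timed_call, latency_summary, and fake_model are hypothetical names, and the metric keys are arbitrary.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def latency_summary(latencies):
    """Build a summary dict of latency metrics, in seconds."""
    ordered = sorted(latencies)
    return {
        "mean_latency_s": sum(ordered) / len(ordered),
        "max_latency_s": ordered[-1],
    }


def fake_model(a, b):
    # Stand-in for a real model call
    return a + b


latencies = []
for inputs in [{"a": 1, "b": 2}, {"a": 2, "b": 3}]:
    output, elapsed = timed_call(fake_model, **inputs)
    latencies.append(elapsed)

print(latency_summary(latencies))
```

The resulting dict could be passed to log_summary() alongside any other evaluation-level metrics.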