This tutorial shows you how to build, trace, and evaluate a computer vision pipeline that performs named entity recognition (NER) on images of handwritten patient information. By the end, you’ll have a working optical character recognition (OCR) pipeline backed by a vision-language model (VLM) and a W&B Weave Evaluation that measures how accurately the pipeline extracts structured fields from images. This guide is for developers who want to use Weave to iterate on prompts and systematically measure the quality of multimodal extraction pipelines. The following sections walk through five stages: creating and iterating on prompts, retrieving the dataset, building the NER pipeline, defining scorers, and running an evaluation.

Prerequisites

Before you begin, install and import the required libraries, get your W&B API key, and initialize your Weave project. Completing this step ensures that your environment can authenticate with W&B and log traces to your Weave project.
# Install the required dependencies
!pip install openai weave -q
import json
import os

from google.colab import userdata
from openai import OpenAI

import weave
# Get API Keys
os.environ["OPENAI_API_KEY"] = userdata.get(
    "OPENAI_API_KEY"
)  # Set the keys as Colab environment secrets from the menu on the left
os.environ["WANDB_API_KEY"] = userdata.get("WANDB_API_KEY")

# Set project name
# Replace the PROJECT value with your project name
PROJECT = "vlm-handwritten-ner"

# Initialize the Weave project
weave.init(PROJECT)
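The google.colab import above is only available inside Colab. When running the notebook elsewhere, one option is to read the keys from environment variables that are already set in your shell. The following is a minimal sketch under that assumption; `require_env` is a hypothetical helper name, not part of Weave or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for running outside Colab, where `userdata` is unavailable.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before continuing")
    return value
```

For example, calling `require_env("OPENAI_API_KEY")` and `require_env("WANDB_API_KEY")` before `weave.init(PROJECT)` surfaces a missing key early instead of at the first API call.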

Create and iterate on prompts with Weave

Good prompt engineering is critical to guiding the model to properly extract entities. In this section, you author an initial prompt, publish it to Weave so you can track changes over time, and then refine it with stricter validation rules. First, create a basic prompt that gives the model the instructions on what to extract from the image data and how to format it. Then, store the prompt in Weave for tracking and iteration.
# Create your prompt object with Weave
prompt = """
Extract all readable text from this image. Format the extracted entities as a valid JSON.
Do not return any extra text, just the JSON. Do not include ```json```
Use the following format:
{"Patient Name": "James James","Date": "4/22/2025","Patient ID": "ZZZZZZZ123","Group Number": "3452542525"}
"""
system_prompt = weave.StringPrompt(prompt)
# Publish your prompt to Weave
weave.publish(system_prompt, name="NER-prompt")
Next, improve the prompt by adding more instructions and validation rules to help reduce errors in the output. Publishing the revised version under the same name lets Weave track the prompt as a new version so you can compare results across iterations.
better_prompt = """
You are a precision OCR assistant. Given an image of patient information, extract exactly these fields into a single JSON object (and nothing else):

- Patient Name
- Date (MM/DD/YY)
- Patient ID
- Group Number

Validation rules:
1. Date must match MM/DD/YY; if not, set Date to "".
2. Patient ID must be alphanumeric; if unreadable, set to "".
3. Always zero-pad months and days (e.g. "04/07/25").
4. Omit any markup, commentary, or code fences.
5. Return strictly valid JSON with only those four keys.

Do not return any extra text, just the JSON. Do not include ```json```
Example output:
{"Patient Name":"James James","Date":"04/22/25","Patient ID":"ZZZZZZZ123","Group Number":"3452542525"}
"""
# Edit the prompt
system_prompt = weave.StringPrompt(better_prompt)
# Publish the edited prompt to Weave
weave.publish(system_prompt, name="NER-prompt")
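The validation rules in the revised prompt can also be checked locally against a parsed model response, which helps confirm whether a prompt change actually reduced formatting errors. The following is a minimal sketch, not part of the tutorial's pipeline; `validate_extraction` and the regular expressions are illustrative assumptions based on the MM/DD/YY example output in the prompt.

```python
import re

REQUIRED_KEYS = {"Patient Name", "Date", "Patient ID", "Group Number"}
# Zero-padded month and day, two-digit year, as in the prompt's example output
DATE_PATTERN = re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/[0-9]{2}")
ALNUM_PATTERN = re.compile(r"[A-Za-z0-9]+")


def validate_extraction(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems = []
    if set(record) != REQUIRED_KEYS:
        problems.append("keys must be exactly the four required fields")
    # The prompt allows "" for an unreadable Date or Patient ID, so skip empty values
    date = str(record.get("Date") or "")
    if date and not DATE_PATTERN.fullmatch(date):
        problems.append("Date is not zero-padded MM/DD/YY")
    patient_id = str(record.get("Patient ID") or "")
    if patient_id and not ALNUM_PATTERN.fullmatch(patient_id):
        problems.append("Patient ID is not alphanumeric")
    return problems
```

An empty list means the record satisfies the rules that can be checked without looking at the image; it says nothing about whether the values are correct.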

Get the dataset

With a prompt in place, you next need input data to run through the pipeline. Retrieve the dataset of handwritten notes that serves as input for the OCR pipeline. The images in the dataset are already base64 encoded, so the VLM can consume them without any preprocessing.
# Retrieve the dataset from the following Weave project
dataset = weave.ref(
    "weave://wandb-smle/vlm-handwritten-ner/object/NER-eval-dataset:G8MEkqWBtvIxPYAY23sXLvqp8JKZ37Cj0PgcG19dGjw"
).get()

# Access a specific example in the dataset
example_image = dataset.rows[3]["image_base64"]

# Display the example_image
from IPython.display import HTML, display

html = f'<img src="{example_image}" style="max-width: 100%; height: auto;">'
display(HTML(html))

Build the NER pipeline

Now that you have a prompt and a dataset, build the NER pipeline that connects them to the VLM. The pipeline consists of two functions:
  • An encode_image function that takes a PIL image and returns a base64 encoded string representation of the image that can be passed to the VLM. Because the images in this dataset are already base64 encoded, the code in this tutorial does not define or call this function.
  • An extract_named_entities_from_image function that takes a base64 encoded image, sends it to the VLM together with the system prompt, and returns the extracted entities as a JSON string in the format the prompt describes.
# Extraction function that sends the image and prompt to GPT-4.1
def extract_named_entities_from_image(image_base64) -> str:
    # init LLM Client
    client = OpenAI()

    # Set up the instruction prompt
    # You can optionally use a prompt stored in Weave with
    # weave.ref("weave://wandb-smle/vlm-handwritten-ner/object/NER-prompt:FmCv4xS3RFU21wmNHsIYUFal3cxjtAkegz2ylM25iB8").get().content.strip()
    prompt = better_prompt

    response = client.responses.create(
        model="gpt-4.1",
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": prompt},
                    {
                        "type": "input_image",
                        "image_url": image_base64,
                    },
                ],
            }
        ],
    )

    return response.output_text
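The encode_image step described above is not needed for this dataset, but it matters when you bring your own images. The following is a minimal sketch of such a helper, assuming the VLM expects a data URL of the form `data:<mime>;base64,<payload>`. The name `encode_image_bytes` and its parameters are hypothetical, and it takes raw bytes rather than a PIL image to stay dependency-free. For a PIL image, you would first save it to an in-memory buffer such as `io.BytesIO` to obtain the bytes.

```python
import base64


def encode_image_bytes(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{payload}"
```

The returned string can be passed wherever the tutorial passes `image_base64`, provided the assumed data URL format matches what your images require.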
Now, create a function called named_entity_recognation that:
  • Passes the image data to the NER pipeline.
  • Returns correctly formatted JSON with the results.
Use the @weave.op() decorator to automatically track and trace function execution in the W&B UI. Every time named_entity_recognation runs, the full trace results are visible in the Weave UI. To view the traces, navigate to the Traces tab of your Weave project.
# NER Function for evaluations
@weave.op()
def named_entity_recognation(image_base64, id):
    result = {}
    try:
        # 1) call the vision op, get back a JSON string
        output_text = extract_named_entities_from_image(image_base64)

        # 2) parse JSON exactly once
        result = json.loads(output_text)

        print(f"Processed: {str(id)}")
    except Exception as e:
        print(f"Failed to process {str(id)}: {e}")
    return result
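Even when the prompt forbids it, a model can occasionally wrap its JSON in Markdown code fences, which makes `json.loads` fail and leaves `result` empty. The following is a minimal sketch of a more forgiving parser that could replace the direct `json.loads` call; `parse_model_json` is a hypothetical helper, not part of Weave or the OpenAI SDK.

```python
import json

# A Markdown code fence (three backticks), built programmatically for readability
FENCE = "`" * 3


def parse_model_json(output_text: str) -> dict:
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    text = output_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json"
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

Any other malformed output still raises an exception, so the `try`/`except` block in named_entity_recognation continues to catch and report it.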
Finally, run the pipeline over the dataset, and view the results. This step produces the model outputs that you evaluate in the next section. The following code loops over the dataset and stores the results in a local file processing_results.json. The results are also viewable in the Weave UI.
# Output results
results = []

# loop over all images in the dataset
for row in dataset.rows:
    result = named_entity_recognation(row["image_base64"], str(row["id"]))
    result["image_id"] = str(row["id"])
    results.append(result)

# Save all results to a JSON file
output_file = "processing_results.json"
with open(output_file, "w") as f:
    json.dump(results, f, indent=2)

print(f"Results saved to: {output_file}")
You see something similar to the following in the Traces table in the Weave UI.
Weave Traces table showing NER pipeline execution results.

Evaluate the pipeline using Weave

Now that you have created a pipeline to perform NER using a VLM, you can use Weave to systematically evaluate it and find out how well it performs. Evaluating the pipeline lets you measure extraction quality across the full dataset rather than relying on spot checks. For more information about Evaluations in Weave, see Evaluations Overview. A fundamental part of a Weave Evaluation is the Scorer. Scorers evaluate AI outputs and return evaluation metrics. They take the AI’s output, analyze it, and return a dictionary of results. Scorers can use your input data as reference if needed and can also output extra information, such as explanations or reasoning from the evaluation. In this section, you create two scorers to evaluate the pipeline:
  • Programmatic scorer.
  • LLM-as-a-judge scorer.

Programmatic scorer

The first Scorer is a deterministic check that runs without an LLM. The programmatic scorer, check_for_missing_fields_programatically, takes the model output (the output of the named_entity_recognation function) and checks whether any required key is missing or empty. It returns False as soon as it finds one and True otherwise. This check is useful for identifying samples where the model failed to capture one or more fields.
# Add weave.op() to track execution of the scorer
@weave.op()
def check_for_missing_fields_programatically(model_output):
    # Required keys for every entry
    required_fields = {"Patient Name", "Date", "Patient ID", "Group Number"}

    for key in required_fields:
        if (
            key not in model_output
            or model_output[key] is None
            or str(model_output[key]).strip() == ""
        ):
            return False  # This entry has a missing or empty field

    return True  # All required fields are present and non-empty
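The scorer above reports only whether a sample passed. When you also want to know which fields were missing, a variant can return a dictionary instead of a Boolean. The following is a minimal sketch; `find_missing_fields` and the result keys `all_present` and `missing_fields` are hypothetical names. It omits the `@weave.op()` decorator so that it runs standalone, and you could add the decorator to trace it like the scorer above.

```python
REQUIRED_FIELDS = ("Patient Name", "Date", "Patient ID", "Group Number")


def find_missing_fields(model_output: dict) -> dict:
    """Report which required fields are absent, None, or blank in the model output."""
    missing = [
        key
        for key in REQUIRED_FIELDS
        if model_output.get(key) is None or str(model_output[key]).strip() == ""
    ]
    return {"all_present": not missing, "missing_fields": missing}
```

Returning the list of field names makes it easier to spot patterns, such as one field that the model misses far more often than the others.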

LLM-as-a-judge scorer

Because the programmatic scorer only catches missing or empty fields, you need a second scorer to check whether the extracted values match what appears in the image. This scorer receives both the image data and the model’s output, so the assessment reflects actual NER performance rather than the shape of the output alone. The scorer, check_for_missing_fields_with_llm, uses an LLM to perform scoring (specifically OpenAI’s gpt-4o). As specified by eval_prompt, it returns a JSON object with a Boolean Correct field and a Reason string. If all fields match the information in the image and the formatting is correct, Correct is true. If any field is missing, empty, incorrect, or mismatched, Correct is false and Reason explains the problem.
# The system prompt for the LLM-as-a-judge

eval_prompt = """
You are an OCR validation system. Your role is to assess whether the structured text extracted from an image accurately reflects the information in that image.
Only validate the structured text and use the image as your source of truth.

Expected input text format:
{"Patient Name": "First Last", "Date": "04/23/25", "Patient ID": "131313JJH", "Group Number": "35453453"}

Evaluation criteria:
- All four fields must be present.
- No field should be empty or contain placeholder/malformed values.
- The "Date" must be in zero-padded MM/DD/YY format (e.g., "04/07/25").

Scoring:
- Return: {"Correct": true, "Reason": ""} if **all fields** match the information in the image and formatting is correct.
- Return: {"Correct": false, "Reason": "EXPLANATION"} if **any** field is missing, empty, incorrect, or mismatched.

Output requirements:
- Respond with a valid JSON object only.
- "Correct" must be a JSON boolean: true or false (not a string or number).
- "Reason" must be a short, specific string listing every problem found, for example "Patient Name mismatch", "Date not zero-padded", or "Missing Group Number".
- Do not return any additional explanation or formatting.

Your response must be exactly one of the following:
{"Correct": true, "Reason": ""}
OR
{"Correct": false, "Reason": "EXPLANATION_HERE"}
"""

# Add weave.op() to track execution of the Scorer
@weave.op()
def check_for_missing_fields_with_llm(model_output, image_base64):
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer", "content": [{"text": eval_prompt, "type": "text"}]},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_base64,
                        },
                    },
                    {"type": "text", "text": str(model_output)},
                ],
            },
        ],
        response_format={"type": "json_object"},
    )
    response = json.loads(response.choices[0].message.content)
    return response
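An LLM judge does not always follow its output instructions exactly. For example, it might return "Correct" as the string "true", or omit "Reason". The following is a minimal sketch of a check you could apply to the parsed response before returning it from the scorer; `normalize_judge_response` is a hypothetical helper.

```python
def normalize_judge_response(response: dict) -> dict:
    """Validate the judge's verdict and coerce a missing or null reason to ""."""
    correct = response.get("Correct")
    if not isinstance(correct, bool):
        # Reject strings such as "true" so that aggregate metrics stay reliable
        raise ValueError(f'"Correct" must be a JSON boolean, got {correct!r}')
    reason = response.get("Reason")
    return {"Correct": correct, "Reason": "" if reason is None else str(reason)}
```

Failing loudly on a malformed verdict is a design choice: it keeps a non-Boolean value from being silently counted in the evaluation summary.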

Run the evaluation

With both Scorers defined, you can now run the evaluation. Define an Evaluation that loops over the dataset you pass to it and logs the results together in the Weave UI. The following code starts the evaluation and applies the two Scorers to every output from the NER pipeline. Results are visible in the Evals tab in the Weave UI.
evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[
        check_for_missing_fields_with_llm,
        check_for_missing_fields_programatically,
    ],
    name="Evaluate_4.1_NER",
)

print(await evaluation.evaluate(named_entity_recognation))
When the previous code runs, Weave generates a link to the Evaluation table in the Weave UI. Follow the link to view the results and compare different iterations of the pipeline across models, prompts, and datasets of your choice. The Weave UI automatically creates a visualization like the following for your team.
Weave Evaluation results comparing scorer outputs across the dataset.
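The top-level `await` in the evaluation code works in a notebook because a notebook already has a running event loop. In a plain Python script, the coroutine returned by `evaluation.evaluate` needs to be driven with `asyncio.run` instead. The following is a minimal sketch of that pattern; `evaluate_stub` and `example_model` are hypothetical stand-ins so that the example runs without Weave or network access.

```python
import asyncio


async def evaluate_stub(model) -> dict:
    """Hypothetical stand-in for `evaluation.evaluate(model)`."""
    await asyncio.sleep(0)  # yield to the event loop once, as real async work would
    return {"model": model.__name__}


def example_model(image_base64, id):
    """Hypothetical stand-in for `named_entity_recognation`."""
    return {}


def run_evaluation(evaluate, model) -> dict:
    """Drive an async evaluate function to completion from synchronous code."""
    return asyncio.run(evaluate(model))
```

In a script, the equivalent of the notebook cell would be `print(run_evaluation(evaluation.evaluate, named_entity_recognation))`.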