Evaluate using local scorers - Weights & Biases Documentation

Weave’s local scorers are a suite of small language models that run locally on your machine with minimal latency. These models evaluate the safety and quality of your AI system’s inputs, context, and outputs. This page walks through each available scorer with usage notes and examples. Use it to choose a scorer for your use case, install the dependencies, and add scorers to your AI system’s inputs, context, or outputs.

W&B fine-tunes some of these models, while others are state-of-the-art open source models trained by the community. W&B Reports document the training and evaluation process. For the full details, see this list of W&B Reports.

The model weights are publicly available in W&B Artifacts, and Weave automatically downloads them when you instantiate the scorer class. To download them yourself, see the artifact paths in weave.scorers.default_models.

These scorers return an object with a passed boolean attribute that indicates whether the input text is safe or high quality. The object also includes a metadata attribute with more detail, such as the raw score from the model.

You can run local scorers on CPUs, but W&B recommends GPUs for best performance.

Local scorers are only available for the Weave Python SDK. They aren’t yet available for the Weave TypeScript SDK. To use Weave scorers in TypeScript, see function-based scorers.
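Because every local scorer returns an object with passed and metadata attributes, one common pattern is to gate a response on passed. The sketch below illustrates that pattern only. StubResult, StubScorer, and guard are hypothetical names invented for this example and are not part of Weave; the stub mimics the result shape described above so the snippet runs without downloading any model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StubResult:
    """Hypothetical stand-in for a scorer result: a passed flag plus metadata."""
    passed: bool
    metadata: Dict[str, Any] = field(default_factory=dict)


class StubScorer:
    """Hypothetical stand-in scorer: fails any text that contains a blocked word."""

    def __init__(self, blocked_words: List[str]) -> None:
        self.blocked_words = [word.lower() for word in blocked_words]

    def score(self, *, output: str) -> StubResult:
        lowered = output.lower()
        hits = [word for word in self.blocked_words if word in lowered]
        return StubResult(passed=not hits, metadata={"hits": hits})


def guard(scorer: Any, text: str, fallback: str = "[response withheld]") -> str:
    """Return the text unchanged when the scorer passes it, otherwise the fallback."""
    result = scorer.score(output=text)
    return text if result.passed else fallback


scorer = StubScorer(["worst"])
print(guard(scorer, "Mars is lovely this time of year"))
print(guard(scorer, "People from the south pole of Mars are the worst"))
```

In this sketch, any object with a score(output=...) method that returns a result exposing passed can take the place of StubScorer.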

Prerequisites

Before you can use Weave local scorers, install the additional dependencies that bundle the required scorer libraries:

pip install weave[scorers]

With the dependencies installed, you’re ready to choose a scorer for your evaluation.

Select a scorer

The following local scorers are available. Select a scorer based on your use case, then continue to the matching section below for usage notes and an example.

Scorer	Scenario
WeaveToxicityScorerV1	Identify toxic or harmful content in your AI system’s inputs and outputs, including hate speech or threats.
WeaveBiasScorerV1	Detect biased or stereotypical content in your AI system’s inputs and outputs. Useful for reducing harmful biases in generated text.
WeaveHallucinationScorerV1	Identify whether your RAG system generates hallucinations in its output based on the input and context provided.
WeaveContextRelevanceScorerV1	Measure whether the context your RAG system retrieves is relevant to the input query.
WeaveCoherenceScorerV1	Evaluate the coherence and logical structure of the AI system’s output.
WeaveFluencyScorerV1	Measure whether the AI system’s output is fluent.
WeaveTrustScorerV1	Combine the toxicity, hallucination, context relevance, fluency, and coherence scorers into a single trust signal.
PresidioScorer	Detect personally identifiable information (PII) in your AI system’s inputs and outputs using the Presidio library from Microsoft.

`WeaveBiasScorerV1`

This scorer assesses bias along two dimensions:

Race and Origin: Racism and bias against a country or region of origin, immigration status, or ethnicity.
Gender and Sexuality: Sexism, misogyny, homophobia, transphobia, or sexual harassment.

WeaveBiasScorerV1 uses a fine-tuned deberta-small-long-nli model. For more details on the model, dataset, and calibration process, see the WeaveBiasScorerV1 W&B Report.

Usage notes

The score method expects a string to be passed to the output parameter.
A higher score means that there is a stronger prediction of bias in the text.
The threshold parameter has a default value, which you can override on initialization.

Usage example

import weave
from weave.scorers import WeaveBiasScorerV1

bias_scorer = WeaveBiasScorerV1()
result = bias_scorer.score(output="Martian men are terrible at cleaning")

print(f"The text is biased: {not result.passed}")
print(result)

`WeaveToxicityScorerV1`

This scorer assesses the input text for toxicity along five dimensions:

Race and Origin: Racism and bias against a country or region of origin, immigration status, or ethnicity.
Gender and Sexuality: Sexism, misogyny, homophobia, transphobia, or sexual harassment.
Religious: Bias or stereotypes against someone’s religion.
Ability: Bias related to someone’s physical, mental, or intellectual ability or disability.
Violence and Abuse: Overly graphic descriptions of violence, threats of violence, or incitement of violence.

The WeaveToxicityScorerV1 uses the open source Celadon model from PleIAs. For more information, see the WeaveToxicityScorerV1 W&B Report.

Usage notes

The score method expects a string to be passed to the output parameter.
The model returns scores from 0 to 3 across five different categories:
- If the sum of these scores is above total_threshold (default value 5), the input is flagged as toxic.
- If any single category has a score higher than category_threshold (default 2), the input is flagged as toxic.
To make filtering more aggressive, override category_threshold or total_threshold during initialization.
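The two flagging rules above can be sketched in plain Python. This is an illustration of the rule as worded on this page, not the scorer's actual implementation. The function name is_flagged_toxic and the category names are hypothetical, and the strict comparisons follow the words "above" and "higher than", so behavior exactly at a threshold is an assumption.

```python
from typing import Mapping


def is_flagged_toxic(
    category_scores: Mapping[str, int],
    total_threshold: int = 5,
    category_threshold: int = 2,
) -> bool:
    """Apply the two rules: total above total_threshold, or any category above category_threshold."""
    scores = list(category_scores.values())
    exceeds_total = sum(scores) > total_threshold
    exceeds_category = any(score > category_threshold for score in scores)
    return exceeds_total or exceeds_category


# Hypothetical category names and scores, for illustration only.
mild = {"race_origin": 1, "gender_sexuality": 1, "religious": 1, "ability": 1, "violence_abuse": 1}
spread = {"race_origin": 2, "gender_sexuality": 2, "religious": 2, "ability": 0, "violence_abuse": 0}
spike = {"race_origin": 3, "gender_sexuality": 0, "religious": 0, "ability": 0, "violence_abuse": 0}

print(is_flagged_toxic(mild))    # total is 5 and no category exceeds 2
print(is_flagged_toxic(spread))  # total is 6, which exceeds total_threshold
print(is_flagged_toxic(spike))   # one category is 3, which exceeds category_threshold
print(is_flagged_toxic(mild, total_threshold=4))  # a lower threshold filters more aggressively
```

Lowering either threshold flags more text, which is the sense in which overriding them makes filtering more aggressive.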

Usage example

import weave
from weave.scorers import WeaveToxicityScorerV1

toxicity_scorer = WeaveToxicityScorerV1()
result = toxicity_scorer.score(output="people from the south pole of Mars are the worst")

print(f"Input is toxic: {not result.passed}")
print(result)

`WeaveHallucinationScorerV1`

This scorer checks whether your AI system’s output contains hallucinations based on the input data. The WeaveHallucinationScorerV1 uses the open source HHEM 2.1 model from Vectara. For more information, see the WeaveHallucinationScorerV1 W&B Report.

Usage notes

The score method expects values to be passed to the query, context, and output parameters.
Pass the context to the context parameter as a string or list of strings.
A higher output score means a stronger prediction of hallucination in the output.
The threshold parameter has a default value, which you can override on initialization.

Usage example

import weave
from weave.scorers import WeaveHallucinationScorerV1

hallucination_scorer = WeaveHallucinationScorerV1()

result = hallucination_scorer.score(
    query="What is the capital of Antarctica?",
    context="People in Antarctica love the penguins.",
    output="While Antarctica is known for its sea life, penguins aren't liked there."
)

print(f"Output is hallucinated: {not result.passed}")
print(result)

`WeaveContextRelevanceScorerV1`

Use this scorer to evaluate RAG systems. It scores the relevance of the context to the query. The WeaveContextRelevanceScorerV1 uses a fine-tuned deberta-small-long-nli model from tasksource. For more details, see the WeaveContextRelevanceScorerV1 W&B Report.

Usage notes

The score method expects values for query and output.
The context should be passed to the output parameter (string or list of strings).
A higher score means a stronger prediction that the context is relevant to the query.
You can pass verbose=True to the score method to get per-chunk scores.

Usage example

import weave
from weave.scorers import WeaveContextRelevanceScorerV1

context_relevance_scorer = WeaveContextRelevanceScorerV1()

result = context_relevance_scorer.score(
    query="What is the capital of Antarctica?",
    output="The Antarctic has the happiest penguins."  # context is passed to the output parameter
)

print(f"Context is relevant: {result.passed}")
print(result)

`WeaveCoherenceScorerV1`

This scorer checks whether the input text is coherent. The WeaveCoherenceScorerV1 uses a fine-tuned deberta-small-long-nli model from tasksource. For more information, see the WeaveCoherenceScorerV1 W&B Report.

Usage notes

The score method expects text to be passed to the query and output parameters.
A higher output score means a stronger prediction of coherence.

Usage example

import weave
from weave.scorers import WeaveCoherenceScorerV1

coherence_scorer = WeaveCoherenceScorerV1()

result = coherence_scorer.score(
    query="What is the capital of Antarctica?",
    output="but why not monkey up day"
)

print(f"Output is coherent: {result.passed}")
print(result)

`WeaveFluencyScorerV1`

This scorer checks whether the input text is fluent (readable and similar to natural human language). It evaluates grammar, syntax, and overall readability. The WeaveFluencyScorerV1 uses a fine-tuned ModernBERT-base model from AnswerDotAI. For more information, see the WeaveFluencyScorerV1 W&B Report.

Usage notes

The score method expects text to be passed to the output parameter.
A higher output score indicates higher fluency.

Usage example

import weave
from weave.scorers import WeaveFluencyScorerV1

fluency_scorer = WeaveFluencyScorerV1()

result = fluency_scorer.score(
    output="The cat did stretching lazily into warmth of sunlight."
)

print(f"Output is fluent: {result.passed}")
print(result)

`WeaveTrustScorerV1`

Use this scorer when you want a single trust signal that combines multiple scorers instead of running each one individually. The WeaveTrustScorerV1 is a composite scorer for RAG systems. It evaluates the trustworthiness of model outputs by grouping other scorers into two categories: Critical and Advisory. Based on the composite score, it returns a trust level:

high: No issues detected.
medium: Only Advisory issues detected.
low: Critical issues detected or input is empty.

Any input that fails a Critical scorer results in a low trust level. Failing an Advisory scorer results in medium.

Critical:
- WeaveToxicityScorerV1
- WeaveHallucinationScorerV1
- WeaveContextRelevanceScorerV1
Advisory:
- WeaveFluencyScorerV1
- WeaveCoherenceScorerV1
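The mapping from failed scorers to a trust level can be sketched as follows. This is a minimal illustration of the rule described above, not the composite scorer's actual code. The function trust_level and its input, a mapping from scorer name to a pass flag, are hypothetical, and the sketch does not model the empty-input case that also yields low.

```python
from typing import Mapping

CRITICAL = {
    "WeaveToxicityScorerV1",
    "WeaveHallucinationScorerV1",
    "WeaveContextRelevanceScorerV1",
}
ADVISORY = {
    "WeaveFluencyScorerV1",
    "WeaveCoherenceScorerV1",
}


def trust_level(passed_by_scorer: Mapping[str, bool]) -> str:
    """Return 'low', 'medium', or 'high' based on which scorers failed."""
    failed = {name for name, passed in passed_by_scorer.items() if not passed}
    if failed & CRITICAL:
        return "low"     # any Critical failure dominates
    if failed & ADVISORY:
        return "medium"  # only Advisory failures
    return "high"        # no failures


all_pass = {name: True for name in CRITICAL | ADVISORY}
print(trust_level(all_pass))
print(trust_level({**all_pass, "WeaveFluencyScorerV1": False}))
print(trust_level({**all_pass, "WeaveFluencyScorerV1": False, "WeaveToxicityScorerV1": False}))
```

Checking the Critical set first means a Critical failure yields low even when Advisory scorers also fail.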

Usage notes

Use this scorer to evaluate RAG pipelines.
It requires query, context, and output keys for correct scoring.

Usage example

import weave
from weave.scorers import WeaveTrustScorerV1

trust_scorer = WeaveTrustScorerV1()

def print_trust_scorer_result(result):
    print()
    print(f"Output is trustworthy: {result.passed}")
    print(f"Trust level: {result.metadata['trust_level']}")
    if not result.passed:
        print("Triggered scorers:")
        for scorer_name, scorer_data in result.metadata['raw_outputs'].items():
            if not scorer_data.passed:
                print(f"  - {scorer_name} did not pass")
    print()
    print(f"WeaveToxicityScorerV1 scores: {result.metadata['scores']['WeaveToxicityScorerV1']}")
    print(f"WeaveHallucinationScorerV1 scores: {result.metadata['scores']['WeaveHallucinationScorerV1']}")
    print(f"WeaveContextRelevanceScorerV1 score: {result.metadata['scores']['WeaveContextRelevanceScorerV1']}")
    print(f"WeaveCoherenceScorerV1 score: {result.metadata['scores']['WeaveCoherenceScorerV1']}")
    print(f"WeaveFluencyScorerV1 score: {result.metadata['scores']['WeaveFluencyScorerV1']}")
    print()

result = trust_scorer.score(
    query="What is the capital of Antarctica?",
    context="People in Antarctica love the penguins.",
    output="The cat stretched lazily in the warm sunlight."
)

print_trust_scorer_result(result)
print(result)

`PresidioScorer`

This scorer uses the Presidio library to detect personally identifiable information (PII) in your AI system’s inputs and outputs.

Usage notes

To detect specific entity types, such as emails or phone numbers, pass a list of Presidio entities to the selected_entities parameter. Otherwise, Presidio detects all entity types in its default entities list.
You can pass custom recognizers through the custom_recognizers parameter as a list of presidio_analyzer.EntityRecognizer instances.
To handle non-English input, use the language parameter to specify the language.

Usage example

import weave
from weave.scorers import PresidioScorer

presidio_scorer = PresidioScorer()

result = presidio_scorer.score(
    output="Mary Jane is a software engineer at XYZ company and her email is mary.jane@xyz.com."
)

print(f"Output contains PII: {not result.passed}")
print(result)
