weave.scorers.default_models.
These scorers return an object with a passed boolean attribute that indicates whether the input text is safe or high quality. The object also includes a metadata attribute with more detail, such as the raw score from the model.
You can run local scorers on CPUs, but W&B recommends GPUs for best performance.
Local scorers are only available for the Weave Python SDK. They aren’t yet available for the Weave TypeScript SDK. To use Weave scorers in TypeScript, see function-based scorers.
Prerequisites
Before you can use Weave local scorers, install the additional dependencies that bundle the required scorer libraries.
Select a scorer
The following local scorers are available. Select a scorer based on your use case, then continue to the matching section below for usage notes and an example.

| Scorer | Scenario |
|---|---|
| WeaveToxicityScorerV1 | Identify toxic or harmful content in your AI system’s inputs and outputs, including hate speech or threats. |
| WeaveBiasScorerV1 | Detect biased or stereotypical content in your AI system’s inputs and outputs. Useful for reducing harmful biases in generated text. |
| WeaveHallucinationScorerV1 | Identify whether your RAG system generates hallucinations in its output based on the input and context provided. |
| WeaveContextRelevanceScorerV1 | Measure whether the AI system’s output is relevant to the input and context provided. |
| WeaveCoherenceScorerV1 | Evaluate the coherence and logical structure of the AI system’s output. |
| WeaveFluencyScorerV1 | Measure whether the AI system’s output is fluent. |
| WeaveTrustScorerV1 | Combine the toxicity, hallucination, context relevance, fluency, and coherence scorers into a single trust signal. |
| PresidioScorer | Detect personally identifiable information (PII) in your AI system’s inputs and outputs using the Presidio library from Microsoft. |
WeaveBiasScorerV1
This scorer assesses gender and race or origin bias along two dimensions:
- Race and Origin: Racism and bias against a country or region of origin, immigration status, or ethnicity.
- Gender and Sexuality: Sexism, misogyny, homophobia, transphobia, or sexual harassment.
WeaveBiasScorerV1 uses a fine-tuned deberta-small-long-nli model. For more details on the model, dataset, and calibration process, see the WeaveBiasScorerV1 W&B Report.
Usage notes
- The `score` method expects a string to be passed to the `output` parameter.
- A higher score means that there is a stronger prediction of bias in the text.
- The `threshold` parameter has a default value that you can override on initialization.
Usage example
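The following is a minimal sketch of the call pattern described in the usage notes, not an official example. It assumes `WeaveBiasScorerV1` is importable from `weave.scorers` and that the result exposes the `passed` and `metadata` attributes described at the top of this page. The helper name `flag_bias`, the sample text, and the `0.7` threshold are hypothetical and chosen only for illustration.

```python
from typing import Any


def flag_bias(scorer: Any, text: str) -> dict:
    """Score `text` and summarize the result (hypothetical helper)."""
    # The score method takes the text through the `output` parameter.
    result = scorer.score(output=text)
    return {
        "biased": not result.passed,  # passed is False when the text is flagged
        "details": result.metadata,   # extra detail, such as the raw score
    }


def main() -> None:
    # Imported here so the helper above can be used without loading the model.
    from weave.scorers import WeaveBiasScorerV1

    default_scorer = WeaveBiasScorerV1()
    # Assumption: 0.7 is an arbitrary illustrative value, not a recommended threshold.
    custom_scorer = WeaveBiasScorerV1(threshold=0.7)

    text = "Martian men are terrible at cleaning"
    print(flag_bias(default_scorer, text))
    print(flag_bias(custom_scorer, text))
```

Call `main()` in an environment where the local scorer dependencies are installed.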
WeaveToxicityScorerV1
This scorer assesses the input text for toxicity along five dimensions:
- Race and Origin: Racism and bias against a country or region of origin, immigration status, or ethnicity.
- Gender and Sexuality: Sexism, misogyny, homophobia, transphobia, or sexual harassment.
- Religious: Bias or stereotypes against someone’s religion.
- Ability: Bias related to someone’s physical, mental, or intellectual ability or disability.
- Violence and Abuse: Overly graphic descriptions of violence, threats of violence, or incitement of violence.
WeaveToxicityScorerV1 uses the open source Celadon model from PleIAs. For more information, see the WeaveToxicityScorerV1 W&B Report.
Usage notes
- The `score` method expects a string to be passed to the `output` parameter.
- The model returns scores from `0` to `3` across five different categories:
  - If the sum of these scores is above `total_threshold` (default value `5`), the input is flagged as toxic.
  - If any single category has a score higher than `category_threshold` (default `2`), the input is flagged as toxic.
- To make filtering more aggressive, override `category_threshold` or `total_threshold` during initialization.
Usage example
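To make the two flagging rules in the usage notes concrete, here is a small standalone sketch. It is not the library's implementation: the function name `is_flagged_toxic` is hypothetical, and it reads "above" and "higher than" as strict comparisons, which may differ from what the scorer does internally.

```python
from typing import Sequence


def is_flagged_toxic(
    category_scores: Sequence[int],
    total_threshold: int = 5,
    category_threshold: int = 2,
) -> bool:
    """Apply the two flagging rules from the usage notes (illustrative only)."""
    # Rule 1: the summed score across all categories exceeds total_threshold.
    exceeds_total = sum(category_scores) > total_threshold
    # Rule 2: any single category exceeds category_threshold.
    exceeds_category = any(score > category_threshold for score in category_scores)
    return exceeds_total or exceeds_category


# Five categories, each scored from 0 to 3.
print(is_flagged_toxic([1, 1, 1, 1, 1]))  # sum 5, max 1 -> False
print(is_flagged_toxic([2, 2, 2, 0, 0]))  # sum 6 -> True
print(is_flagged_toxic([3, 0, 0, 0, 0]))  # one category at 3 -> True
# Lowering a threshold makes filtering more aggressive.
print(is_flagged_toxic([1, 1, 1, 1, 1], total_threshold=4))  # sum 5 > 4 -> True
```

With the real scorer, the same thresholds would be set at initialization, for example `WeaveToxicityScorerV1(total_threshold=4, category_threshold=1)`, where both values are illustrative.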
WeaveHallucinationScorerV1
This scorer checks whether your AI system’s output contains hallucinations based on the input data.
The WeaveHallucinationScorerV1 uses the open source HHEM 2.1 model from Vectara. For more information, see the WeaveHallucinationScorerV1 W&B Report.
Usage notes
- The `score` method expects values to be passed to the `query` and `output` parameters.
- The context should be passed to the `output` parameter (as a string or list of strings).
- A higher output score means a stronger prediction of hallucination in the output.
- The `threshold` parameter has a default value that you can override on initialization.
Usage example
WeaveContextRelevanceScorerV1
Use this scorer to evaluate RAG systems. It scores the relevance of the context to the query.
The WeaveContextRelevanceScorerV1 uses a fine-tuned deberta-small-long-nli model from tasksource. For more details, see the WeaveContextRelevanceScorerV1 W&B Report.
Usage notes
- The `score` method expects values for `query` and `output`.
- The context should be passed to the `output` parameter (string or list of strings).
- A higher score means a stronger prediction that the context is relevant to the query.
- You can pass `verbose=True` to the `score` method to get per-chunk scores.
Usage example
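The sketch below illustrates the call pattern from the usage notes, including the `verbose` flag. It assumes `WeaveContextRelevanceScorerV1` is importable from `weave.scorers`; the helper name `score_context` and the sample strings are hypothetical. Where the per-chunk scores appear in the result is not stated here, so the sketch prints the whole `metadata` attribute.

```python
from typing import Any, List, Union


def score_context(
    scorer: Any,
    query: str,
    context: Union[str, List[str]],
    verbose: bool = False,
) -> Any:
    """Score how relevant `context` is to `query` (hypothetical helper)."""
    # Per the usage notes, the context is passed through the `output` parameter.
    return scorer.score(query=query, output=context, verbose=verbose)


def main() -> None:
    # Imported here so the helper above can be used without loading the model.
    from weave.scorers import WeaveContextRelevanceScorerV1

    scorer = WeaveContextRelevanceScorerV1()
    result = score_context(
        scorer,
        query="What is the capital of Antarctica?",
        context=[
            "The Antarctic has the happiest penguins.",
            "Antarctica has no capital city.",
        ],
        verbose=True,  # request per-chunk scores
    )
    print(f"Context is relevant: {result.passed}")
    print(result.metadata)
```

Call `main()` in an environment where the local scorer dependencies are installed.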
WeaveCoherenceScorerV1
This scorer checks whether the input text is coherent.
The WeaveCoherenceScorerV1 uses a fine-tuned deberta-small-long-nli model from tasksource. For more information, see the WeaveCoherenceScorerV1 W&B Report.
Usage notes
- The `score` method expects text to be passed to the `query` and `output` parameters.
- A higher output score means a stronger prediction of coherence.
Usage example
WeaveFluencyScorerV1
This scorer checks whether the input text is fluent (readable and similar to natural human language). It evaluates grammar, syntax, and overall readability.
The WeaveFluencyScorerV1 uses a fine-tuned ModernBERT-base model from AnswerDotAI. For more information, see the WeaveFluencyScorerV1 W&B Report.
Usage notes
- The `score` method expects text to be passed to the `output` parameter.
- A higher output score indicates higher fluency.
Usage example
WeaveTrustScorerV1
Use this scorer when you want a single trust signal that combines multiple scorers instead of running each one individually.
The WeaveTrustScorerV1 is a composite scorer for RAG systems. It evaluates the trustworthiness of model outputs by grouping other scorers into two categories: Critical and Advisory. Based on the composite score, it returns a trust level:
- `high`: No issues detected.
- `medium`: Only Advisory issues detected.
- `low`: Critical issues detected or input is empty.

Failing a Critical scorer results in a `low` trust level. Failing an Advisory scorer results in `medium`.
- Critical:
  - WeaveToxicityScorerV1
  - WeaveHallucinationScorerV1
  - WeaveContextRelevanceScorerV1
- Advisory:
  - WeaveFluencyScorerV1
  - WeaveCoherenceScorerV1
Usage notes
- Use this scorer to evaluate RAG pipelines.
- It requires `query`, `context`, and `output` keys for correct scoring.
Usage example
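The mapping from scorer results to a trust level can be sketched in plain Python. This is an illustration of the rules stated in this section, not the scorer's actual implementation: the function name `trust_level` and its inputs are hypothetical, and it assumes "empty" means the empty string.

```python
from typing import Dict

# Grouping taken from the Critical and Advisory lists in this section.
CRITICAL = frozenset(
    {
        "WeaveToxicityScorerV1",
        "WeaveHallucinationScorerV1",
        "WeaveContextRelevanceScorerV1",
    }
)
ADVISORY = frozenset({"WeaveFluencyScorerV1", "WeaveCoherenceScorerV1"})


def trust_level(passed_by_scorer: Dict[str, bool], text: str) -> str:
    """Map per-scorer pass/fail results to a trust level (illustrative only)."""
    # Assumption: an empty string counts as empty input.
    if not text:
        return "low"
    failed = {name for name, passed in passed_by_scorer.items() if not passed}
    if failed & CRITICAL:
        return "low"  # at least one Critical scorer failed
    if failed & ADVISORY:
        return "medium"  # only Advisory scorers failed
    return "high"  # no issues detected


results = {
    "WeaveToxicityScorerV1": True,
    "WeaveHallucinationScorerV1": True,
    "WeaveContextRelevanceScorerV1": True,
    "WeaveFluencyScorerV1": False,
    "WeaveCoherenceScorerV1": True,
}
print(trust_level(results, "Paris is the capital of France."))  # medium
```

With the real scorer, the equivalent call would presumably take the form `WeaveTrustScorerV1().score(query=..., context=..., output=...)`, since the usage notes list those three keys as required.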
PresidioScorer
This scorer uses the Presidio library to detect personally identifiable information (PII) in your AI system’s inputs and outputs.
Usage notes
- To detect specific entity types, such as emails or phone numbers, pass a list of Presidio entities to the `selected_entities` parameter. Otherwise, Presidio detects all entity types in its default entities list.
- You can pass custom recognizers through the `custom_recognizers` parameter as a list of `presidio.EntityRecognizer` instances.
- To handle non-English input, use the `language` parameter to specify the language.
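The options in the usage notes can be combined as in the following sketch. Several parts are assumptions not confirmed by this page: that `PresidioScorer` is importable from `weave.scorers`, that its `score` method takes the text through an `output` parameter like the other scorers, and that `"en"` is an accepted language code. The helper `make_pii_scorer` is hypothetical; `EMAIL_ADDRESS` and `PHONE_NUMBER` are entity names from Presidio's supported entities.

```python
from typing import Any, Optional, Sequence


def make_pii_scorer(
    scorer_cls: Any,
    entities: Optional[Sequence[str]] = None,
    language: Optional[str] = None,
) -> Any:
    """Build a scorer, passing only the options that were set (hypothetical helper)."""
    kwargs = {}
    if entities:
        # Restrict detection to specific Presidio entity types.
        kwargs["selected_entities"] = list(entities)
    if language:
        kwargs["language"] = language
    # Omitting selected_entities leaves Presidio's default entities list in effect.
    return scorer_cls(**kwargs)


def main() -> None:
    # Imported here so the helper above can be used without loading Presidio.
    from weave.scorers import PresidioScorer

    scorer = make_pii_scorer(
        PresidioScorer,
        entities=["EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    # Assumption: the text is passed through `output`, as with the other scorers.
    result = scorer.score(output="Reach me at jane@example.com or 555-0100.")
    print(f"PII detected: {not result.passed}")
    print(result.metadata)
```

Call `main()` in an environment where the local scorer dependencies are installed.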