Build an evaluation - Weights & Biases Documentation

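In this tutorial, you build an end-to-end evaluation pipeline in Weave and learn how to measure and track the quality of an LLM application as you iterate on it. Evaluations let you compare changes against a consistent set of examples and catch regressions before they reach users. The tutorial is aimed at developers who build LLM-powered applications and want a repeatable way to test them. Weave natively supports evaluation tracking through its Model and Evaluation classes. The API makes minimal assumptions, so it adapts to a wide range of use cases.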

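What you will learn

In this guide, you will:

Set up a Model.
Create a dataset for testing LLM responses.
Define a scoring function that compares the model output with the expected output.
Run an evaluation of the model against the dataset, using the scoring function and an additional built-in Scorer.
View the evaluation results in the Weave UI.

By the end, you will have a working evaluation pipeline that scores a sample model against a dataset and logs the results to Weave.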

Prerequisites

A W&B account
Python 3.10+ or Node.js 18+
The required packages installed:
- Python: pip install weave openai
- TypeScript: npm install weave openai
An OpenAI API key set as an environment variable.
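
A missing API key usually surfaces only at the first model call, so it can help to check for it up front. The following is a minimal sketch, assuming the key lives in the OPENAI_API_KEY environment variable (the name the OpenAI client reads by default). The helper name require_env is hypothetical and is not part of Weave or the OpenAI SDK.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or raise a clear error.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to test.
    """
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the tutorial")
    return value
```

Calling require_env("OPENAI_API_KEY") at the top of the script fails fast with a readable message instead of an authentication error partway through an evaluation.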

Import the required libraries and functions

Import the following libraries into your script.

Python
TypeScript

import json
import openai
import asyncio
import weave
from weave.scorers import MultiTaskBinaryClassificationF1

import * as weave from 'weave';
import OpenAI from 'openai';

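Create a `Model`

With the libraries in place, the next step is to define the model you want to evaluate. In Weave, Models are objects that capture both the behavior of your model or agent (logic, prompts, parameters) and its versioned metadata (parameters, code, micro-configuration). This lets you reliably track, compare, evaluate, and iterate on your models.

When you instantiate a Model, Weave automatically captures its configuration and behavior and updates the version whenever something changes, so you can track performance over time as you iterate.

To declare a Model, subclass Model and implement a predict function that takes one example and returns a response. The following example model uses OpenAI to extract the name, color, and flavor of an alien fruit from an input sentence.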

Python
TypeScript

class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()

        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
            ],
        )
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        parsed = json.loads(result)
        return parsed

// Note: `weave.Model` is not yet supported in TypeScript.
// Instead, wrap a model-like function with `weave.op`.

import * as weave from 'weave';
import OpenAI from 'openai';

const openaiClient = new OpenAI();

const model = weave.op(async function myModel({datasetRow}) {
  const prompt = `Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: ${datasetRow.sentence}`;
  const response = await openaiClient.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  return JSON.parse(response.choices[0].message.content);
});

Because the ExtractFruitsModel class inherits from (subclasses) weave.Model, Weave can track the instantiated object. The @weave.op decorator on the predict function tracks its inputs and outputs. You can instantiate a Model object as follows.

Python
TypeScript

# Set your team and project name
weave.init('[YOUR-TEAM]/eval_pipeline_quickstart')

model = ExtractFruitsModel(
    model_name='gpt-3.5-turbo-1106',
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
)

sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."

print(asyncio.run(model.predict(sentence)))
# If you are using a Jupyter Notebook, run this instead:
# await model.predict(sentence)

await weave.init('eval_pipeline_quickstart');

const sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.";

const result = await model({ datasetRow: { sentence } });

console.log(result);
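
The Python predict method above passes the model reply straight to json.loads, which raises an error if the reply is wrapped in a Markdown code fence or is not a JSON object. The following is a minimal sketch of a more defensive parser. The helper name parse_json_reply is hypothetical, and whether a given model wraps its reply in a fence is an assumption, not documented behavior.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
# Matches a reply that consists of one fenced block, optionally tagged "json".
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    cleaned = text.strip()
    match = _FENCED.match(cleaned)
    if match:
        cleaned = match.group(1)
    parsed = json.loads(cleaned)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed
```

Inside predict, return parse_json_reply(result) could then stand in for the direct json.loads call. Invalid JSON still raises an error.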

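Create a dataset

Now that you have defined a Model, you need a dataset to evaluate it against. A Dataset is a collection of examples stored as a Weave object. When you publish a dataset to Weave, it is versioned and can be reused across evaluation runs. The following example dataset defines three input sentences and their correct answers (labels), then formats them as a table of JSON-style records that the scoring functions can read. This example builds the list of examples in code, but you can also log them one at a time from a running application.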

Python
TypeScript

sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
"Pounits are a bright green color and are more savory than sweet.",
"Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]

const sentences = [
  "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
  "Pounits are a bright green color and are more savory than sweet.",
  "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
];
const labels = [
  { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' },
  { fruit: 'pounits', color: 'bright green', flavor: 'savory' },
  { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' }
];
const examples = sentences.map((sentence, i) => ({
  id: i.toString(),
  sentence,
  target: labels[i]
}));

Next, create and publish the dataset with the weave.Dataset class.

Python
TypeScript

weave.init('eval_pipeline_quickstart')
dataset = weave.Dataset(name='fruits', rows=examples)
weave.publish(dataset)

import * as weave from 'weave';
await weave.init('eval_pipeline_quickstart');
const dataset = new weave.Dataset({
  name: 'fruits',
  rows: examples
});
await dataset.save();
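
The examples list above is written out by hand, which gets error-prone as the dataset grows. The following is a minimal sketch of building the same row shape (id, sentence, target) from two parallel lists while checking that every label carries the expected fields. The helper build_examples and the REQUIRED_FIELDS constant are hypothetical names introduced for illustration.

```python
REQUIRED_FIELDS = ("fruit", "color", "flavor")


def build_examples(sentences: list[str], labels: list[dict]) -> list[dict]:
    """Pair each sentence with its label and validate the label fields."""
    if len(sentences) != len(labels):
        raise ValueError(
            f"got {len(sentences)} sentences but {len(labels)} labels"
        )
    examples = []
    for i, (sentence, label) in enumerate(zip(sentences, labels)):
        missing = [field for field in REQUIRED_FIELDS if field not in label]
        if missing:
            raise ValueError(f"label {i} is missing fields: {missing}")
        # Same row shape as the hand-written examples in this tutorial.
        examples.append({"id": str(i), "sentence": sentence, "target": label})
    return examples
```

The returned list can be passed as rows when creating the dataset, exactly like the hand-written examples list.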

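Define a custom scoring function

With a model and a dataset in place, you need a way to measure how the model performs on each example. A scoring function compares the model output with the expected target and produces the metrics that the evaluation reports. Weave evaluations require a target to compare the output against. The following scoring function takes two dictionaries (target and output) and returns a dictionary of boolean values that indicate whether the output matches the target. The @weave.op() decorator lets Weave track each call to the scoring function.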

Python
TypeScript

@weave.op()
def fruit_name_score(target: dict, output: dict) -> dict:
    return {'correct': target['fruit'] == output['fruit']}

import * as weave from 'weave';

const fruitNameScorer = weave.op(
  function fruitNameScore({target, output}) {
    return { correct: target.fruit === output.fruit };
  }
);

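To learn how to write your own scoring functions, see the Scorer guide. Some applications call for a custom Scorer class. For example, you can create a standardized LLMJudge class with specific parameters (such as the chat model and prompt), per-row scoring, and calculation of an aggregate score. For details, see the tutorial on defining a Scorer class in Model-based evaluation of RAG applications.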

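Run an evaluation with a built-in Scorer

With the model, dataset, and custom Scorer ready, all that remains is to combine them into an evaluation run. In addition to custom scoring functions, you can use Weave's built-in Scorers. In the evaluation below, weave.Evaluation() uses the fruit_name_score function defined in the previous section together with the built-in MultiTaskBinaryClassificationF1 Scorer, which computes F1 scores. The following example evaluates ExtractFruitsModel on the fruits dataset with both scoring functions and logs the results to Weave. The TypeScript example uses only the custom fruitNameScorer.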

Python
TypeScript

weave.init('eval_pipeline_quickstart')

evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset, 
    scorers=[
        MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), 
        fruit_name_score
    ],
)
print(asyncio.run(evaluation.evaluate(model)))
# If you are running in a Jupyter Notebook, run this instead:
# await evaluation.evaluate(model)

import * as weave from 'weave';

await weave.init('eval_pipeline_quickstart');

const evaluation = new weave.Evaluation({
  name: 'fruit_eval',
  dataset: dataset,
  scorers: [fruitNameScorer],
});
const results = await evaluation.evaluate(model);
console.log(results);

When you run from a Python script, you must use asyncio.run. In a Jupyter Notebook, you can use await directly.
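
To help read the F1 numbers that the built-in Scorer reports, the following sketch computes precision, recall, and F1 from raw counts using the textbook definitions. It is not a copy of the MultiTaskBinaryClassificationF1 implementation, whose exact counting rules are not described in this guide. The function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from true/false positive and false negative counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.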

Complete example

The full evaluation pipeline in a single script:

Python
TypeScript

import json
import asyncio
import openai
import weave
from weave.scorers import MultiTaskBinaryClassificationF1

# Initialize Weave once
weave.init('eval_pipeline_quickstart')

# 1. Define the model
class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    async def predict(self, sentence: str) -> dict:
        client = openai.AsyncClient()
        response = await client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": self.prompt_template.format(sentence=sentence)}],
        )
        result = response.choices[0].message.content
        if result is None:
            raise ValueError("No response from model")
        return json.loads(result)

# 2. Instantiate the model
model = ExtractFruitsModel(
    model_name='gpt-3.5-turbo-1106',
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
)

# 3. Create the dataset
sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
"Pounits are a bright green color and are more savory than sweet.",
"Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
labels = [
    {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
    {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
    {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
]
examples = [
    {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
    {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
    {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
]

dataset = weave.Dataset(name='fruits', rows=examples)
weave.publish(dataset)

# 4. Define the scoring function
@weave.op()
def fruit_name_score(target: dict, output: dict) -> dict:
    return {'correct': target['fruit'] == output['fruit']}

# 5. Run the evaluation
evaluation = weave.Evaluation(
    name='fruit_eval',
    dataset=dataset,
    scorers=[
        MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]),
        fruit_name_score
    ],
)
print(asyncio.run(evaluation.evaluate(model)))

import * as weave from 'weave';
import OpenAI from 'openai';

// Initialize Weave once
await weave.init('eval_pipeline_quickstart');

// 1. Define the model
// Note: weave.Model is not yet supported in TypeScript.
// Instead, wrap a model-like function with weave.op.
const openaiClient = new OpenAI();

const model = weave.op(async function myModel({datasetRow}) {
  const prompt = `Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: ${datasetRow.sentence}`;
  const response = await openaiClient.chat.completions.create({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
  });
  return JSON.parse(response.choices[0].message.content);
});

// 2. Create the dataset
const sentences = [
  "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
  "Pounits are a bright green color and are more savory than sweet.",
  "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
];
const labels = [
  { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' },
  { fruit: 'pounits', color: 'bright green', flavor: 'savory' },
  { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' }
];
const examples = sentences.map((sentence, i) => ({
  id: i.toString(),
  sentence,
  target: labels[i]
}));

const dataset = new weave.Dataset({
  name: 'fruits',
  rows: examples
});
await dataset.save();

// 3. Define the scoring function
const fruitNameScorer = weave.op(
  function fruitNameScore({target, output}) {
    return { correct: target.fruit === output.fruit };
  }
);

// 4. Run the evaluation
const evaluation = new weave.Evaluation({
  name: 'fruit_eval',
  dataset: dataset,
  scorers: [fruitNameScorer],
});
const results = await evaluation.evaluate(model);
console.log(results);

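View the evaluation results

Once the evaluation completes, you can inspect each prediction and the Scorer results in the Weave UI. Weave automatically records a trace for every prediction and score. Click the link printed when the evaluation runs to open the results.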
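
Besides browsing results in the UI, you may want to compare the summaries printed by two evaluation runs in code, for example to flag a regression after a prompt change. The following is a rough sketch under two loud assumptions: that each summary is a nested dictionary whose leaves are numbers, and that higher values are better for every metric (which is not true for latency-style metrics). The structure of the summary is not specified in this guide, and the names flatten and regressions are hypothetical.

```python
def flatten(summary: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into {'a.b.c': number}, skipping non-numeric leaves."""
    flat = {}
    for key, value in summary.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[path] = float(value)
    return flat


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    """Return metrics present in both summaries where candidate dropped below baseline.

    Assumes higher is better for every metric (an illustrative simplification).
    """
    base, cand = flatten(baseline), flatten(candidate)
    return {
        key: (base[key], cand[key])
        for key in base
        if key in cand and cand[key] < base[key] - tolerance
    }
```

An empty result means no shared metric dropped by more than the tolerance.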

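Learn more about Weave evaluations

You now have a complete evaluation pipeline. To go deeper into Weave's evaluation features, see the following resources:

Learn more about how to build and use scorers.
Explore Weave's built-in scoring functions.
Learn about model-based evaluation, which uses an LLM as a judge.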

Next steps

Build a RAG application to learn about evaluating retrieval-augmented generation.

