Introduction to Evaluations

This is an interactive notebook. You can run it locally or use the links below:

🔑 Prerequisites

Before running a Weave evaluation, complete the following prerequisites.

Install the W&B Weave SDK and log in with your API key.
Install the OpenAI SDK and log in with your API key.
Initialize your W&B project.

# Install dependencies and import libraries
!pip install wandb weave openai -q

import os
from getpass import getpass

from openai import OpenAI
from pydantic import BaseModel

import weave

# 🔑 Set up your API keys
# Running this cell prompts for your API keys via `getpass`, so the input is not echoed to the terminal.
#####
print("---")
print(
    "Create a W&B API key at: https://wandb.ai/settings#apikeys"
)
os.environ["WANDB_API_KEY"] = getpass("Enter your W&B API key: ")
print("---")
print("Create an OpenAI API key at: https://platform.openai.com/api-keys")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
print("---")
#####

# 🏠 Enter your W&B project name
weave_client = weave.init("MY_PROJECT_NAME")  # 🐝 Your W&B project name
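
If the keys are already exported in your shell, prompting for them again may be unnecessary. The sketch below shows one way to check for that before calling getpass. The helper name `missing_keys` is hypothetical (it is not part of Weave or the OpenAI SDK); it assumes only the two environment variable names used in the setup cell.

```python
import os


def missing_keys(required=("WANDB_API_KEY", "OPENAI_API_KEY"), env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the real process environment; passing a plain dict
    makes the helper easy to try out without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"WANDB_API_KEY": "abc123", "OPENAI_API_KEY": ""}
print(missing_keys(env=example_env))
```

In the notebook, you could then prompt with getpass only for the names this helper returns.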

🐝 Run your first evaluation

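The following code sample shows how to evaluate an LLM with Weave's Model and Evaluation APIs. First, define a Weave model by subclassing weave.Model, specify the model name and prompt format, and track a predict method with @weave.op. The predict method sends a prompt to OpenAI and parses the response into structured output using a Pydantic schema (FruitExtract). Next, create a small evaluation dataset of input sentences and their expected targets. Then define a custom scoring function (also tracked with @weave.op) that compares the model's output against the target label. Finally, wrap everything in a weave.Evaluation, specifying the dataset and scorers, and call evaluate() to run the evaluation pipeline asynchronously.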

# 1. Build a Weave model
class FruitExtract(BaseModel):
    fruit: str
    color: str
    flavor: str

class ExtractFruitsModel(weave.Model):
    model_name: str
    prompt_template: str

    @weave.op()
    def predict(self, sentence: str) -> FruitExtract:
        client = OpenAI()

        response = client.beta.chat.completions.parse(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": self.prompt_template.format(sentence=sentence),
                }
            ],
            response_format=FruitExtract,
        )
        result = response.choices[0].message.parsed
        return result

model = ExtractFruitsModel(
    name="gpt4o",
    model_name="gpt-4o",
    prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) as json, from the following text : {sentence}',
)

# 2. Collect some samples
sentences = [
    "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them.",
]
labels = [
    {"fruit": "neoskizzles", "color": "purple", "flavor": "candy"},
    {"fruit": "pounits", "color": "green", "flavor": "savory"},
    {"fruit": "glowls", "color": "orange", "flavor": "sour, bitter"},
]
examples = [
    {"id": "0", "sentence": sentences[0], "target": labels[0]},
    {"id": "1", "sentence": sentences[1], "target": labels[1]},
    {"id": "2", "sentence": sentences[2], "target": labels[2]},
]

# 3. Define a scoring function for the evaluation
@weave.op()
def fruit_name_score(target: dict, output: FruitExtract) -> dict:
    target_flavors = [f.strip().lower() for f in target["flavor"].split(",")]
    output_flavors = [f.strip().lower() for f in output.flavor.split(",")]
    # Check whether any target flavor appears as a substring of any output flavor
    matches = any(tf in of for tf in target_flavors for of in output_flavors)
    return {"correct": matches}

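# 4. Run the evaluation
# Top-level `await` works in a notebook; in a plain script, use asyncio.run(evaluation.evaluate(model)).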
evaluation = weave.Evaluation(
    name="fruit_eval",
    dataset=examples,
    scorers=[fruit_name_score],
)
await evaluation.evaluate(model)
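
Before spending API calls on a full evaluation, it can help to try the scoring logic on its own. The sketch below repeats the substring matching from fruit_name_score without @weave.op, so it runs with the standard library only. `FruitExtractStub` and `flavor_matches` are hypothetical names introduced for illustration; the stub is a dataclass standing in for the Pydantic FruitExtract model, and the sample outputs are made up rather than real model responses.

```python
from dataclasses import dataclass


@dataclass
class FruitExtractStub:
    """Hypothetical stand-in for the Pydantic FruitExtract model."""
    fruit: str
    color: str
    flavor: str


def flavor_matches(target: dict, output: FruitExtractStub) -> dict:
    """Same matching rule as fruit_name_score, minus the Weave tracking."""
    target_flavors = [f.strip().lower() for f in target["flavor"].split(",")]
    output_flavors = [f.strip().lower() for f in output.flavor.split(",")]
    # Correct if any target flavor is a substring of any output flavor.
    matches = any(tf in of for tf in target_flavors for of in output_flavors)
    return {"correct": matches}


# A made-up output that mentions one of the target flavors.
hit = flavor_matches(
    {"fruit": "glowls", "color": "orange", "flavor": "sour, bitter"},
    FruitExtractStub(fruit="glowls", color="pale orange", flavor="Very sour and bitter"),
)
# A made-up output that mentions none of the target flavors.
miss = flavor_matches(
    {"fruit": "pounits", "color": "green", "flavor": "savory"},
    FruitExtractStub(fruit="pounits", color="green", flavor="sweet"),
)
print(hit, miss)
```

Note that the rule only inspects the flavor field, despite the scorer's name, and that a single matching flavor is enough for an example to count as correct.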

🚀 Looking for more examples?

Learn how to build an end-to-end evaluation pipeline.
Learn how to build and evaluate a RAG application.
