Learn Weave with W&B Inference

This guide shows how to use W&B Weave with W&B Inference. With W&B Inference, you can build and trace LLM applications on live open-source models without setting up your own infrastructure or managing API keys from multiple providers. A W&B API key is all you need to interact with every model hosted by W&B Inference.

What you'll learn

In this guide, you will:

Set up Weave and W&B Inference
Build a basic LLM application with automatic tracing
Compare multiple models
Evaluate model performance on a dataset
View your results in the Weave UI

Prerequisites

A W&B account
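Python 3.9+ or Node.js 18+ (the Python examples use built-in generic annotations such as list[str], which need Python 3.9 or later)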
Required packages installed:
- Python: pip install weave openai
- TypeScript: npm install weave openai
A W&B API key (no OpenAI API key is needed), optionally set as the WANDB_API_KEY environment variable
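
The TypeScript examples in this guide read the key from the `WANDB_API_KEY` environment variable, while the Python examples use a placeholder string. A minimal sketch of the Python equivalent follows; the helper name `get_wandb_api_key` is hypothetical and not part of Weave or the OpenAI client.

```python
import os


def get_wandb_api_key(env_var: str = "WANDB_API_KEY") -> str:
    """Return the W&B API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it before running the examples, "
            f"for example: export {env_var}=<your-key>"
        )
    return key
```

With this in place, the client could be created with `api_key=get_wandb_api_key()` instead of a hardcoded string, which keeps the key out of source control.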

Trace your first LLM call

To get started, copy and paste the following code example. It uses Llama 3.1 8B on W&B Inference. When you run it, Weave:

Traces the LLM call automatically.
Logs inputs, outputs, latency, and token usage.
Provides a link to view the trace in the Weave UI.

Python

import weave
import openai

# Initialize Weave. Replace <team-name> with your W&B team name.
weave.init("<team-name>/inference-quickstart")

# Create an OpenAI-compatible client that points to W&B Inference
client = openai.OpenAI(
    base_url='https://api.inference.wandb.ai/v1',
    api_key="YOUR_WANDB_API_KEY",  # Replace with your actual API key
    project="<team-name>/my-first-weave-project",  # Required for usage tracking
)

# Decorate the function to enable tracing; the client is the standard OpenAI client
@weave.op()
def ask_llama(question: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question}
        ],
    )
    return response.choices[0].message.content

# Call the function - Weave traces everything automatically
result = ask_llama("What are the benefits of using W&B Weave for LLM development?")
print(result)

TypeScript

import * as weave from 'weave';
import OpenAI from 'openai';

// Initialize Weave. Replace the value enclosed in "<>" with your own.
await weave.init("<team-name>/inference-quickstart");

// Create an OpenAI-compatible client that points to W&B Inference
const client = new OpenAI({
    baseURL: 'https://api.inference.wandb.ai/v1',  // W&B Inference endpoint
    apiKey: process.env.WANDB_API_KEY || 'YOUR_WANDB_API_KEY', // Replace with your API key or set the WANDB_API_KEY environment variable
});

// Wrap the function with weave.op to enable tracing
const askLlama = weave.op(async function askLlama(question: string): Promise<string> {
    const response = await client.chat.completions.create({
        model: 'meta-llama/Llama-3.1-8B-Instruct',
        messages: [
            { role: 'system', content: 'You are a helpful assistant.' },
            { role: 'user', content: question }
        ],
    });
    return response.choices[0].message.content || '';
});

// Call the function - Weave traces everything automatically
const result = await askLlama('What are the benefits of using W&B Weave for LLM development?');
console.log(result);
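
To build intuition for what gets recorded when a function is wrapped as an op, here is a toy sketch that uses only the standard library. It is not how Weave is implemented, and it does not capture token usage; `traced`, `TRACE_LOG`, and `fake_llm` are hypothetical names chosen for illustration.

```python
import functools
import time
from typing import Any, Callable

# Hypothetical in-memory store; Weave sends traces to its backend instead.
TRACE_LOG: list[dict[str, Any]] = []


def traced(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Record inputs, output, and latency for each call of fn."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        TRACE_LOG.append(
            {
                "op": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_s": time.perf_counter() - start,
            }
        )
        return output

    return wrapper


@traced
def fake_llm(question: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return f"echo: {question}"
```

Calling `fake_llm("hi")` appends one record to `TRACE_LOG`, which is roughly the kind of per-call information the trace view surfaces.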

Build a text summarization application

Next, run this basic summarization app, which shows how Weave traces nested operations:

Python

import weave
import openai

# Initialize Weave. Replace the value enclosed in "<>" with your own.
weave.init("<team-name>/inference-quickstart")

client = openai.OpenAI(
    base_url='https://api.inference.wandb.ai/v1',
    api_key="YOUR_WANDB_API_KEY",  # Replace with your actual API key
    project="<team-name>/my-first-weave-project",  # Required for usage tracking
)

@weave.op()
def extract_key_points(text: str) -> list[str]:
    """Extract key points from a text."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Extract 3-5 key points from the text. Return each point on a new line."},
            {"role": "user", "content": text}
        ],
    )
    # Return the response lines, skipping blank ones
    return [line for line in response.choices[0].message.content.strip().splitlines() if line.strip()]

@weave.op()
def create_summary(key_points: list[str]) -> str:
    """Create a concise summary based on key points."""
    points_text = "\n".join(f"- {point}" for point in key_points)
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Create a one-sentence summary based on these key points."},
            {"role": "user", "content": f"Key points:\n{points_text}"}
        ],
    )
    return response.choices[0].message.content

@weave.op()
def summarize_text(text: str) -> dict:
    """Main summarization pipeline."""
    key_points = extract_key_points(text)
    summary = create_summary(key_points)
    return {
        "key_points": key_points,
        "summary": summary
    }

# Try it with sample text
sample_text = """
The Apollo 11 mission was a historic spaceflight that landed the first humans on the Moon 
on July 20, 1969. Commander Neil Armstrong and lunar module pilot Buzz Aldrin descended 
to the lunar surface while Michael Collins remained in orbit. Armstrong became the first 
person to step onto the Moon, followed by Aldrin 19 minutes later. They spent about 
two and a quarter hours together outside the spacecraft, collecting samples and taking photographs.
"""

result = summarize_text(sample_text)
print("Key Points:", result["key_points"])
print("\nSummary:", result["summary"])

TypeScript

import * as weave from 'weave';
import OpenAI from 'openai';

// Initialize Weave. Replace <team-name> with your W&B team name.
await weave.init('<team-name>/inference-quickstart');

const client = new OpenAI({
    baseURL: 'https://api.inference.wandb.ai/v1',
    apiKey: process.env.WANDB_API_KEY || 'YOUR_WANDB_API_KEY',  // Replace with your API key or set the WANDB_API_KEY environment variable
});

const extractKeyPoints = weave.op(async function extractKeyPoints(text: string): Promise<string[]> {
    const response = await client.chat.completions.create({
        model: 'meta-llama/Llama-3.1-8B-Instruct',
        messages: [
            { role: 'system', content: 'Extract 3-5 key points from the text. Return each point on a new line.' },
            { role: 'user', content: text }
        ],
    });
    // Return the response lines, skipping blank ones
    const content = response.choices[0].message.content || '';
    return content.split('\n').map(line => line.trim()).filter(line => line.length > 0);
});

const createSummary = weave.op(async function createSummary(keyPoints: string[]): Promise<string> {
    const pointsText = keyPoints.map(point => `- ${point}`).join('\n');
    const response = await client.chat.completions.create({
        model: 'meta-llama/Llama-3.1-8B-Instruct',
        messages: [
            { role: 'system', content: 'Create a one-sentence summary based on these key points.' },
            { role: 'user', content: `Key points:\n${pointsText}` }
        ],
    });
    return response.choices[0].message.content || '';
});

const summarizeText = weave.op(async function summarizeText(text: string): Promise<{key_points: string[], summary: string}> {
    const keyPoints = await extractKeyPoints(text);
    const summary = await createSummary(keyPoints);
    return {
        key_points: keyPoints,
        summary: summary
    };
});

// Try it with sample text
const sampleText = `
The Apollo 11 mission was a historic spaceflight that landed the first humans on the Moon 
on July 20, 1969. Commander Neil Armstrong and lunar module pilot Buzz Aldrin descended 
to the lunar surface while Michael Collins remained in orbit. Armstrong became the first 
person to step onto the Moon, followed by Aldrin 19 minutes later. They spent about 
two and a quarter hours together outside the spacecraft, collecting samples and taking photographs.
`;

const result = await summarizeText(sampleText);
console.log('Key Points:', result.key_points);
console.log('\nSummary:', result.summary);
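
Models often prefix each point with a bullet or a number even when asked for one point per line, and `create_summary` then adds its own "- " prefix on top. One possible way to normalize the output before passing it on is sketched below. It assumes only common markers ("-", "*", "•", "1.", "2)"), and `parse_key_points` is a hypothetical helper, not part of Weave.

```python
import re

# Matches leading list markers such as "- ", "* ", "• ", "1. " or "2) ".
_MARKER = re.compile(r"^\s*(?:[-*•]\s+|\d+[.)]\s+)")


def parse_key_points(raw: str) -> list[str]:
    """Split model output into key points, dropping blanks and list markers."""
    points = []
    for line in raw.splitlines():
        cleaned = _MARKER.sub("", line).strip()
        if cleaned:
            points.append(cleaned)
    return points
```

In the Python pipeline, `extract_key_points` could return `parse_key_points(response.choices[0].message.content)` in place of the list comprehension. Lines that merely start with a number or a hyphenated value, such as "1969 was the year" or "-5 degrees", are left untouched because the pattern requires whitespace after the marker.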

Compare multiple models

W&B Inference provides access to multiple models. Use the following code to compare the responses of Llama and DeepSeek:

Python

import weave
import openai

# Initialize Weave. Replace <team-name> with your W&B team name.
weave.init("<team-name>/inference-quickstart")

client = openai.OpenAI(
    base_url='https://api.inference.wandb.ai/v1',
    api_key="YOUR_WANDB_API_KEY",  # Replace with your actual API key
    project="<team-name>/my-first-weave-project",  # Required for usage tracking
)

# Define a Model class for comparing different LLMs
class InferenceModel(weave.Model):
    model_name: str
    
    @weave.op()
    def predict(self, question: str) -> str:
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "user", "content": question}
            ],
        )
        return response.choices[0].message.content

# Create an instance for each model
llama_model = InferenceModel(model_name="meta-llama/Llama-3.1-8B-Instruct")
deepseek_model = InferenceModel(model_name="deepseek-ai/DeepSeek-V3-0324")

# Compare the responses
test_question = "Explain quantum computing in one paragraph for a high school student."

print("Llama 3.1 8B response:")
print(llama_model.predict(test_question))
print("\n" + "="*50 + "\n")
print("DeepSeek V3 response:")
print(deepseek_model.predict(test_question))

TypeScript

import * as weave from 'weave';
import OpenAI from 'openai';

// Initialize Weave. Replace <team-name> with your W&B team name.
await weave.init("<team-name>/inference-quickstart")

const client = new OpenAI({
  baseURL: 'https://api.inference.wandb.ai/v1',
  apiKey: process.env.WANDB_API_KEY || 'YOUR_WANDB_API_KEY', // Replace with your API key or set the WANDB_API_KEY environment variable
});

// Create model functions with weave.op (weave.Model is not supported in TypeScript)
function createModel(modelName: string) {
  return weave.op(async function predict(question: string): Promise<string> {
    const response = await client.chat.completions.create({
      model: modelName,
      messages: [
        { role: 'user', content: question }
      ],
    });
    return response.choices[0].message.content || '';
  });
}

// Create an instance for each model
const llamaModel = createModel('meta-llama/Llama-3.1-8B-Instruct');
const deepseekModel = createModel('deepseek-ai/DeepSeek-V3-0324');

// Compare the responses
const testQuestion = 'Explain quantum computing in one paragraph for a high school student.';

console.log('Llama 3.1 8B response:');
console.log(await llamaModel(testQuestion));
console.log('\n' + '='.repeat(50) + '\n');
console.log('DeepSeek V3 response:');
console.log(await deepseekModel(testQuestion));
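
Printing two responses one after another works for a pair of models, but it does not scale well. A small sketch of a generic side-by-side helper follows, assuming each model is exposed as a plain callable that takes a question and returns a string. `compare_models` is a hypothetical name, and the wall-clock latency it measures is only a rough client-side figure, not the latency Weave reports.

```python
import time
from typing import Callable


def compare_models(
    models: dict[str, Callable[[str], str]], question: str
) -> list[dict]:
    """Call every model with the same question and collect output and latency."""
    rows = []
    for name, predict in models.items():
        start = time.perf_counter()
        output = predict(question)
        rows.append(
            {
                "model": name,
                "output": output,
                "latency_s": round(time.perf_counter() - start, 3),
            }
        )
    return rows
```

With the Python classes defined earlier, this could be called as `compare_models({"llama": llama_model.predict, "deepseek": deepseek_model.predict}, test_question)`.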

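Evaluate model performance

Use EvaluationLogger, which is built into Weave, to evaluate model performance on a Q&A task. It provides structured evaluation tracking with automatic aggregation, token usage capture, and rich comparison features in the UI. Append the following code to the script from the previous section: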

Python

from typing import Optional
from weave import EvaluationLogger

# Create a simple dataset
dataset = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Name a primary color", "expected_one_of": ["red", "blue", "yellow"]},
]

# Define a scorer
@weave.op()
def accuracy_scorer(expected: str, output: str, expected_one_of: Optional[list[str]] = None) -> dict:
    """Score the accuracy of the model output."""
    output_clean = output.strip().lower()
    
    if expected_one_of:
        is_correct = any(option.lower() in output_clean for option in expected_one_of)
    else:
        is_correct = expected.lower() in output_clean
    
    return {"correct": is_correct, "score": 1.0 if is_correct else 0.0}

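# Evaluate a model with Weave's EvaluationLogger
def evaluate_model(model: InferenceModel, dataset: list[dict]):
    """Run an evaluation over a dataset using Weave's built-in evaluation framework."""
    # Initialize EvaluationLogger before calling the model so that token usage is captured.
    # This matters especially for tracking costs with W&B Inference.
    # Make the model name safe for logging by replacing "/", "-", and "." with underscores.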
    safe_model_name = model.model_name.replace("/", "_").replace("-", "_").replace(".", "_")
    eval_logger = EvaluationLogger(
        model=safe_model_name,
        dataset="qa_dataset"
    )
    
    for example in dataset:
        # Get the model prediction
        output = model.predict(example["question"])
        
        # Log the prediction
        pred_logger = eval_logger.log_prediction(
            inputs={"question": example["question"]},
            output=output
        )
        
        # Score the output
        score = accuracy_scorer(
            expected=example.get("expected", ""),
            output=output,
            expected_one_of=example.get("expected_one_of")
        )
        
        # Log the score
        pred_logger.log_score(
            scorer="accuracy",
            score=score["score"]
        )
        
        # Finish logging for this prediction
        pred_logger.finish()
    
    # Log the summary - Weave aggregates the accuracy scores automatically
    eval_logger.log_summary()
    print(f"Evaluation complete for {model.model_name} (logged as: {safe_model_name}). View results in the Weave UI.")

# Compare multiple models - a core feature of Weave's evaluation framework
models_to_compare = [
    llama_model,
    deepseek_model,
]

for model in models_to_compare:
    evaluate_model(model, dataset)

# In the Weave UI, go to the Evals tab to compare results across models
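
The scorer above uses substring matching, so an expected answer of "4" also matches "14", "red" matches "hundred", and an example with no expected value always counts as correct because the empty string is contained in every output. If that is too lenient for your data, one possible stricter variant is sketched below. `contains_word` and `strict_accuracy` are hypothetical names, and whole-word matching is an assumption about what "correct" should mean here, not something the guide prescribes.

```python
import re
from typing import Optional


def contains_word(text: str, word: str) -> bool:
    """True if word appears in text as a whole token, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def strict_accuracy(
    output: str,
    expected: Optional[str] = None,
    expected_one_of: Optional[list[str]] = None,
) -> dict:
    """Score 1.0 only when an expected answer appears as a whole word."""
    candidates = expected_one_of or ([expected] if expected else [])
    is_correct = any(contains_word(output, c) for c in candidates)
    return {"correct": is_correct, "score": 1.0 if is_correct else 0.0}
```

The return shape matches `accuracy_scorer`, so the function could be swapped into `evaluate_model` (optionally decorated with `@weave.op()`) without changing the logging calls.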

TypeScript

import { EvaluationLogger } from 'weave';

// Create a simple dataset
interface DatasetExample {
  question: string;
  expected?: string;
  expected_one_of?: string[];
}

const dataset: DatasetExample[] = [
  { question: 'What is 2 + 2?', expected: '4' },
  { question: 'What is the capital of France?', expected: 'Paris' },
  { question: 'Name a primary color', expected_one_of: ['red', 'blue', 'yellow'] },
];

// Define a scorer
const accuracyScorer = weave.op(function accuracyScorer(args: {
  expected: string;
  output: string;
  expected_one_of?: string[];
}): { correct: boolean; score: number } {
  const outputClean = args.output.trim().toLowerCase();
  
  let isCorrect: boolean;
  if (args.expected_one_of) {
    isCorrect = args.expected_one_of.some(option => 
      outputClean.includes(option.toLowerCase())
    );
  } else {
    isCorrect = outputClean.includes(args.expected.toLowerCase());
  }
  
  return { correct: isCorrect, score: isCorrect ? 1.0 : 0.0 };
});

// Evaluate a model with Weave's EvaluationLogger
async function evaluateModel(
  model: (question: string) => Promise<string>,
  modelName: string,
  dataset: DatasetExample[]
): Promise<void> {
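  // Initialize EvaluationLogger before calling the model so that token usage is captured.
  // This matters especially for tracking costs with W&B Inference.
  // Make the model name safe for logging by replacing "/", "-", and "." with underscores.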
  const safeModelName = modelName.replace(/\//g, '_').replace(/-/g, '_').replace(/\./g, '_');
  const evalLogger = new EvaluationLogger({
    name: 'inference_evaluation',
    model: { name: safeModelName },
    dataset: 'qa_dataset'
  });
  
  for (const example of dataset) {
    // Get the model prediction
    const output = await model(example.question);
    
    // Log the prediction
    const predLogger = evalLogger.logPrediction(
      { question: example.question },
      output
    );
    
    // Score the output
    const score = await accuracyScorer({
      expected: example.expected || '',
      output: output,
      expected_one_of: example.expected_one_of
    });
    
    // Log the score
    predLogger.logScore('accuracy', score.score);
    
    // Finish logging for this prediction
    predLogger.finish();
  }
  
  // Log the summary - Weave aggregates the accuracy scores automatically
  await evalLogger.logSummary();
  console.log(`Evaluation complete for ${modelName} (logged as: ${safeModelName}). View results in the Weave UI.`);
}

// Compare multiple models - a core feature of Weave's evaluation framework
const modelsToCompare = [
  { model: llamaModel, name: 'meta-llama/Llama-3.1-8B-Instruct' },
  { model: deepseekModel, name: 'deepseek-ai/DeepSeek-V3-0324' },
];

for (const { model, name } of modelsToCompare) {
  await evaluateModel(model, name, dataset);
}

// In the Weave UI, go to the Evals tab to compare results across models
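
The evaluation code builds a safe model name by chaining three replace calls, which covers only "/", "-", and ".". A model ID containing any other character, such as ":", would pass through unchanged. A more general sketch using a regular expression is shown below. `to_safe_name` is a hypothetical helper, and collapsing runs of special characters into a single underscore is a design choice of this sketch rather than a documented requirement of EvaluationLogger.

```python
import re


def to_safe_name(model_name: str) -> str:
    """Replace each run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", model_name).strip("_")
```

For the two model IDs used in this guide, the result is identical to the chained replace calls, for example `meta_llama_Llama_3_1_8B_Instruct`.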

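Running these examples prints trace links in the terminal. Click a link to view the trace in the Weave UI, where you can:

Review a timeline of all LLM calls
Inspect the inputs and outputs of each operation
View token usage and estimated cost (captured automatically by EvaluationLogger)
Analyze latency and performance metrics
Go to the Evals tab to see aggregated evaluation results
Use the Compare feature to analyze performance across models
Page through specific examples to see how different models performed on the same input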

Available models

For the complete list of available models, see the Available Models section of the W&B Inference documentation.

Next steps

Use the Playground: Try models interactively in the Weave Playground.
Build evaluations: Learn about systematic evaluation of LLM applications.
Try other integrations: Weave works with OpenAI, Anthropic, and many others.
