RAG 애플리케이션 평가하기

Try in Colab GitHub source

RAG (Retrieval Augmented Generation)는 커스텀 지식 베이스에 액세스할 수 있는 생성형 AI 애플리케이션을 구축하는 일반적인 방법입니다.

학습할 내용:

이 가이드에서는 다음 방법을 설명합니다:

지식 베이스 구축
관련 문서를 찾는 검색 단계가 포함된 RAG 애플리케이션 생성
Weave 로 검색 단계 추적
LLM judge를 사용하여 컨텍스트 정밀도(context precision)를 측정하여 RAG 애플리케이션 평가
커스텀 스코어링 함수 정의

Prerequisites

A W&B account
Python 3.8+ or Node.js 18+
Required packages installed:
- Python: pip install weave openai
- TypeScript: npm install weave openai
An OpenAI API key set as an environment variable

지식 베이스 구축

먼저, 아티클에 대한 임베딩을 계산합니다. 일반적으로는 아티클에 대해 이 작업을 한 번 수행하고 임베딩과 메타데이터를 데이터베이스에 저장하지만, 여기서는 단순함을 위해 스크립트가 실행될 때마다 수행합니다.

Python
TypeScript

from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the moon's surface while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

# 참고: 일반적으로는 아티클에 대해 이 작업을 한 번 수행하고 임베딩 및 메타데이터를 데이터베이스에 넣습니다.
article_embeddings = docs_to_embeddings(articles)

require('dotenv').config();
import { OpenAI } from 'openai';
import * as weave from 'weave';

interface Article {
    text: string;
    embedding?: number[];
}

const articles: Article[] = [
    { 
        text: `Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too, one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.`
    },
    { 
        text: `Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.`
    },
    { 
        text: `Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities`
    }
];

function cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (magnitudeA * magnitudeB);
}

const docsToEmbeddings = weave.op(async function(docs: Article[]): Promise<Article[]> {
    const openai = new OpenAI();
    const enrichedDocs = await Promise.all(docs.map(async (doc) => {
        const response = await openai.embeddings.create({
            input: doc.text,
            model: "text-embedding-3-small"
        });
        return {
            ...doc,
            embedding: response.data[0].embedding
        };
    }));
    return enrichedDocs;
});

RAG 앱 생성

다음으로, 검색 함수인 get_most_relevant_document를 weave.op() 데코레이터로 감싸고 Model 클래스를 생성합니다. 나중에 검토할 수 있도록 함수의 모든 입력과 출력을 추적하기 시작하려면 weave.init('<team-name>/rag-quickstart')를 호출하세요. 팀 이름을 지정하지 않으면 결과는 W&B 기본 팀 또는 Entity에 기록됩니다.

Python
TypeScript

from openai import OpenAI
import weave
from weave import Model
import numpy as np
import asyncio

@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # 가장 유사한 문서의 인덱스 가져오기
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

    @weave.op()
    def predict(self, question: str) -> dict: # 참고: `question`은 나중에 evaluation 행에서 데이터를 선택하는 데 사용됩니다.
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

# 팀 및 프로젝트 이름 설정
weave.init('<team-name>/rag-quickstart')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)
model.predict("What significant result was reported about Zealand Pharma's obesity trial?")

class RAGModel {
    private openai: OpenAI;
    private systemMessage: string;
    private modelName: string;
    private articleEmbeddings: Article[];

    constructor(config: {
        systemMessage: string;
        modelName?: string;
        articleEmbeddings: Article[];
    }) {
        this.openai = new OpenAI();
        this.systemMessage = config.systemMessage;
        this.modelName = config.modelName || "gpt-3.5-turbo-1106";
        this.articleEmbeddings = config.articleEmbeddings;
        this.predict = weave.op(this, this.predict);
    }

    async predict(question: string): Promise<{
        answer: string;
        context: string;
    }> {
        const context = await this.getMostRelevantDocument(question);
        
        const response = await this.openai.chat.completions.create({
            model: this.modelName,
            messages: [
                { role: "system", content: this.systemMessage },
                { role: "user", content: `Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
                    Context:
                    """
                    ${context}
                    """
                    Question: ${question}` }
            ],
            temperature: 0
        });

        return {
            answer: response.choices[0].message.content || "",
            context
        };
    }
}

LLM Judge로 평가하기

애플리케이션을 평가하는 간단한 방법이 없을 때 한 가지 방법은 LLM을 사용하여 측면을 평가하는 것입니다. 다음은 LLM judge를 사용하여 컨텍스트가 주어진 답변에 도달하는 데 유용했는지 확인하도록 프롬프트를 작성하여 컨텍스트 정밀도를 측정하는 예입니다. 이 프롬프트는 널리 사용되는 RAGAS 프레임워크를 바탕으로 보강되었습니다.

스코어링 함수 정의하기

Evaluation 파이프라인 구축 튜토리얼에서와 같이, 앱을 테스트할 예시 행 세트와 스코어링 함수를 정의합니다. 스코어링 함수는 한 행을 가져와서 평가합니다. 입력 인수는 행 사전의 해당 키와 일치해야 하므로, 여기서 question은 행 사전에서 가져옵니다. output은 모델의 결과입니다. 모델의 입력은 해당 입력 인수를 기반으로 예시에서 가져오므로 여기에서도 question이 사용됩니다. 이 예제에서는 병렬로 빠르게 실행되도록 async 함수를 사용합니다. async에 대한 빠른 소개가 필요한 경우 여기에서 찾을 수 있습니다.

Python
TypeScript

from openai import OpenAI
import weave
import asyncio

@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
asyncio.run(evaluation.evaluate(model)) # 참고: 평가할 모델을 정의해야 합니다.

const contextPrecisionScore = weave.op(async function(args: {
    datasetRow: QuestionRow;
    modelOutput: { answer: string; context: string; }
}): Promise<ScorerResult> {
    const openai = new OpenAI();
    
    const prompt = `Given question, answer and context verify if the context was useful...`;

    const response = await openai.chat.completions.create({
        model: "gpt-4-turbo-preview",
        messages: [{ role: "user", content: prompt }],
        response_format: { type: "json_object" }
    });

    const result = JSON.parse(response.choices[0].message.content || "{}");
    return {
        verdict: parseInt(result.verdict) === 1
    };
});

const evaluation = new weave.Evaluation({
    dataset: createQuestionDataset(),
    scorers: [contextPrecisionScore]
});

await evaluation.evaluate({
    model: weave.op((args: { datasetRow: QuestionRow }) => 
        model.predict(args.datasetRow.question)
    )
});

선택 사항: `Scorer` 클래스 정의하기

일부 애플리케이션에서는 커스텀 평가 클래스를 만들고 싶을 수 있습니다. 예를 들어, 특정 파라미터(예: 채팅 모델, 프롬프트), 각 행의 구체적인 스코어링 및 집계 점수의 구체적인 계산을 포함하는 표준화된 LLMJudge 클래스를 생성해야 하는 경우입니다. Weave 는 즉시 사용 가능한 Scorer 클래스 목록을 정의하며 커스텀 Scorer를 쉽게 만들 수 있도록 지원합니다. 다음 예제는 커스텀 class CorrectnessLLMJudge(Scorer)를 만드는 방법을 보여줍니다. 상위 수준에서 커스텀 Scorer를 생성하는 단계는 매우 간단합니다:

weave.flow.scorer.Scorer를 상속받는 커스텀 클래스를 정의합니다.
함수의 각 호출을 추적하려면 score 함수를 오버라이드하고 @weave.op()를 추가합니다.
- 이 함수는 평가되는 모델의 예측값이 전달될 output 인수를 정의해야 합니다. 모델이 “None”을 반환할 수 있는 경우를 대비해 Optional[dict] 타입으로 정의하세요.
- 나머지 인수는 일반적인 Any 또는 dict이거나, weave.Evaluate 클래스를 사용하여 모델을 평가하는 데 사용되는 데이터셋에서 특정 컬럼을 선택할 수 있습니다. preprocess_model_input이 사용되는 경우 단일 행의 컬럼 이름 또는 키와 정확히 동일한 이름을 가져야 합니다.
선택 사항: 집계 점수 계산을 커스터마이징하려면 summarize 함수를 오버라이드합니다. 커스텀 함수를 정의하지 않으면 Weave 는 기본적으로 weave.flow.scorer.auto_summarize 함수를 사용합니다.
- 이 함수에는 @weave.op() 데코레이터가 있어야 합니다.

Python
TypeScript

from weave import Scorer

class CorrectnessLLMJudge(Scorer):
    prompt: str
    model_name: str
    device: str

    @weave.op()
    async def score(self, output: Optional[dict], query: str, answer: str) -> Any:
        """pred, query, target을 비교하여 예측의 정확성을 평가합니다.
        인수:
            - output: 평가되는 모델이 제공할 사전(dict)
            - query: 데이터셋에 정의된 질문
            - answer: 데이터셋에 정의된 타겟 답변
        반환:
            - 단일 사전 {metric name: single evaluation value}"""

        # get_model은 제공된 파라미터(OpenAI, HF...)를 기반으로 하는 일반적인 모델 getter로 정의됩니다.
        eval_model = get_model(
            model_name = self.model_name,
            prompt = self.prompt
            device = self.device,
        )
        # 평가 속도를 높이기 위한 async 평가 - 반드시 async일 필요는 없습니다.
        grade = await eval_model.async_predict(
            {
                "query": query,
                "answer": answer,
                "result": output.get("result"),
            }
        )
        # 결과 파싱 - pydantic으로 더 견고하게 수행할 수 있습니다.
        evaluation = "incorrect" not in grade["text"].strip().lower()

        # Weave에 표시될 컬럼 이름
        return {"correct": evaluation}

    @weave.op()
    def summarize(self, score_rows: list) -> Optional[dict]:
        """스코어링 함수에 의해 각 행에 대해 계산된 모든 점수를 집계합니다.
        인수:
            - score_rows: 메트릭과 점수를 포함하는 사전들의 리스트
        반환:
            - 입력과 동일한 구조의 중첩된 사전"""

        # 아무것도 제공되지 않으면 weave.flow.scorer.auto_summarize 함수가 사용됩니다.
        # return auto_summarize(score_rows)

        valid_data = [x.get("correct") for x in score_rows if x.get("correct") is not None]
        count_true = list(valid_data).count(True)
        int_data = [int(x) for x in valid_data]

        sample_mean = np.mean(int_data) if int_data else 0
        sample_variance = np.var(int_data) if int_data else 0
        sample_error = np.sqrt(sample_variance / len(int_data)) if int_data else 0

        # 추가적인 "correct" 레이어는 필수는 아니지만 UI에 구조를 더해줍니다.
        return {
            "correct": {
                "true_count": count_true,
                "true_fraction": sample_mean,
                "stderr": sample_error,
            }
        }

이 기능은 아직 TypeScript에서 사용할 수 없습니다.

이를 스코어러로 사용하려면 다음과 같이 초기화하여 Evaluation의 scorers 인수에 전달합니다:

Python
TypeScript

evaluation = weave.Evaluation(dataset=questions, scorers=[CorrectnessLLMJudge()])

이 기능은 아직 TypeScript에서 사용할 수 없습니다.

모든 과정 종합하기

RAG 앱에 대해 동일한 결과를 얻으려면:

LLM 호출 및 검색 단계 함수를 weave.op()로 감쌉니다.
(선택 사항) predict 함수와 앱 상세 정보를 포함하는 Model 서브클래스를 생성합니다.
평가할 예시를 수집합니다.
하나의 예시를 평가하는 스코어링 함수를 생성합니다.
Evaluation 클래스를 사용하여 예시에 대한 평가를 실행합니다.

참고: 때때로 Evaluation 의 비동기 실행이 OpenAI, Anthropic 등의 모델에 대한 rate limit을 트리거할 수 있습니다. 이를 방지하기 위해 병렬 워커 수를 제한하는 환경 변수(예: WEAVE_PARALLELISM=3)를 설정할 수 있습니다. 전체 코드는 다음과 같습니다.

Python
TypeScript

from openai import OpenAI
import weave
from weave import Model
import numpy as np
import json
import asyncio

# 평가에 사용할 예시
articles = [
    "Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial “tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too,” one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.",
    "Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.",
    "Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if it's stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities",
    "Rivian and Lucid shares plunge after weak EV earnings reports Shares of electric vehicle makers Rivian and Lucid fell Thursday after the companies reported stagnant production in their fourth-quarter earnings after the bell Wednesday. Rivian shares sank about 25 percent, and Lucids stock dropped around 17 percent. Rivian forecast it will make 57,000 vehicles in 2024, slightly less than the 57,232 vehicles it produced in 2023. Lucid said it expects to make 9,000 vehicles in 2024, more than the 8,428 vehicles it made in 2023.",
    "Mauritius blocks Norwegian cruise ship over fears of a potential cholera outbreak Local authorities on Sunday denied permission for the Norwegian Dawn ship, which has 2,184 passengers and 1,026 crew on board, to access the Mauritius capital of Port Louis, citing “potential health risks.” The Mauritius Ports Authority said Sunday that samples were taken from at least 15 passengers on board the cruise ship. A spokesperson for the U.S.-headquartered Norwegian Cruise Line Holdings said Sunday that 'a small number of guests experienced mild symptoms of a stomach-related illness' during Norwegian Dawns South Africa voyage.",
    "Intuitive Machines lands on the moon in historic first for a U.S. company Intuitive Machines Nova-C cargo lander, named Odysseus after the mythological Greek hero, is the first U.S. spacecraft to soft land on the lunar surface since 1972. Intuitive Machines is the first company to pull off a moon landing — government agencies have carried out all previously successful missions. The company's stock surged in extended trading Thursday, after falling 11 percent in regular trading.",
    "Lunar landing photos: Intuitive Machines Odysseus sends back first images from the moon Intuitive Machines cargo moon lander Odysseus returned its first images from the surface. Company executives believe the lander caught its landing gear sideways on the surface of the moon while touching down and tipped over. Despite resting on its side, the company's historic IM-1 mission is still operating on the moon.",
]

def docs_to_embeddings(docs: list) -> list:
    openai = OpenAI()
    document_embeddings = []
    for doc in docs:
        response = (
            openai.embeddings.create(input=doc, model="text-embedding-3-small")
            .data[0]
            .embedding
        )
        document_embeddings.append(response)
    return document_embeddings

# 참고: 일반적으로는 아티클에 대해 이 작업을 한 번 수행하고 임베딩 및 메타데이터를 데이터베이스에 넣습니다.
article_embeddings = docs_to_embeddings(articles)

# 검색 단계에 데코레이터 추가
@weave.op()
def get_most_relevant_document(query):
    openai = OpenAI()
    query_embedding = (
        openai.embeddings.create(input=query, model="text-embedding-3-small")
        .data[0]
        .embedding
    )
    similarities = [
        np.dot(query_embedding, doc_emb)
        / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
        for doc_emb in article_embeddings
    ]
    # 가장 유사한 문서의 인덱스 가져오기
    most_relevant_doc_index = np.argmax(similarities)
    return articles[most_relevant_doc_index]

# 응답을 생성하는 predict 함수와 함께 앱에 대한 상세 정보를 포함하는 Model 서브클래스 생성
class RAGModel(Model):
    system_message: str
    model_name: str = "gpt-3.5-turbo-1106"

    @weave.op()
    def predict(self, question: str) -> dict: # 참고: `question`은 나중에 evaluation 행에서 데이터를 선택하는 데 사용됩니다.
        from openai import OpenAI
        context = get_most_relevant_document(question)
        client = OpenAI()
        query = f"""Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
        Context:
        \"\"\"
        {context}
        \"\"\"
        Question: {question}"""
        response = client.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": query},
            ],
            temperature=0.0,
            response_format={"type": "text"},
        )
        answer = response.choices[0].message.content
        return {'answer': answer, 'context': context}

# 팀 및 프로젝트 이름 설정
weave.init('<team-name>/rag-quickstart')
model = RAGModel(
    system_message="You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source."
)

# 질문과 결과물을 사용하여 점수를 생성하는 스코어링 함수
@weave.op()
async def context_precision_score(question, output):
    context_precision_prompt = """Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: {question}
    context: {context}
    answer: {answer}
    verdict: """
    client = OpenAI()

    prompt = context_precision_prompt.format(
        question=question,
        context=output['context'],
        answer=output['answer'],
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        response_format={ "type": "json_object" }
    )
    response_message = response.choices[0].message
    response = json.loads(response_message.content)
    return {
        "verdict": int(response["verdict"]) == 1,
    }

questions = [
    {"question": "What significant result was reported about Zealand Pharma's obesity trial?"},
    {"question": "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?"},
    {"question": "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?"},
    {"question": "What were Rivian and Lucid's vehicle production forecasts for 2024?"},
    {"question": "Why was the Norwegian Dawn cruise ship denied access to Mauritius?"},
    {"question": "Which company achieved the first U.S. moon landing since 1972?"},
    {"question": "What issue did Intuitive Machines' lunar lander encounter upon landing on the moon?"}
]

# Evaluation 오브젝트를 정의하고 스코어링 함수와 함께 예시 질문을 전달합니다.
evaluation = weave.Evaluation(dataset=questions, scorers=[context_precision_score])
asyncio.run(evaluation.evaluate(model))

require('dotenv').config();
import { OpenAI } from 'openai';
import * as weave from 'weave';

interface Article {
    text: string;
    embedding?: number[];
}

const articles: Article[] = [
    { 
        text: `Novo Nordisk and Eli Lilly rival soars 32 percent after promising weight loss drug results Shares of Denmarks Zealand Pharma shot 32 percent higher in morning trade, after results showed success in its liver disease treatment survodutide, which is also on trial as a drug to treat obesity. The trial tells us that the 6mg dose is safe, which is the top dose used in the ongoing [Phase 3] obesity trial too, one analyst said in a note. The results come amid feverish investor interest in drugs that can be used for weight loss.`
    },
    { 
        text: `Berkshire shares jump after big profit gain as Buffetts conglomerate nears $1 trillion valuation Berkshire Hathaway shares rose on Monday after Warren Buffetts conglomerate posted strong earnings for the fourth quarter over the weekend. Berkshires Class A and B shares jumped more than 1.5%, each. Class A shares are higher by more than 17% this year, while Class B has gained more than 18%. Berkshire was last valued at $930.1 billion, up from $905.5 billion where it closed on Friday, according to FactSet. Berkshire on Saturday posted fourth-quarter operating earnings of $8.481 billion, about 28 percent higher than the $6.625 billion from the year-ago period, driven by big gains in its insurance business. Operating earnings refers to profits from businesses across insurance, railroads and utilities. Meanwhile, Berkshires cash levels also swelled to record levels. The conglomerate held $167.6 billion in cash in the fourth quarter, surpassing the $157.2 billion record the conglomerate held in the prior quarter.`
    },
    { 
        text: `Highmark Health says its combining tech from Google and Epic to give doctors easier access to information Highmark Health announced it is integrating technology from Google Cloud and the health-care software company Epic Systems. The integration aims to make it easier for both payers and providers to access key information they need, even if its stored across multiple points and formats, the company said. Highmark is the parent company of a health plan with 7 million members, a provider network of 14 hospitals and other entities`
    }
];

function cosineSimilarity(a: number[], b: number[]): number {
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (magnitudeA * magnitudeB);
}

const docsToEmbeddings = weave.op(async function(docs: Article[]): Promise<Article[]> {
    const openai = new OpenAI();
    const enrichedDocs = await Promise.all(docs.map(async (doc) => {
        const response = await openai.embeddings.create({
            input: doc.text,
            model: "text-embedding-3-small"
        });
        return {
            ...doc,
            embedding: response.data[0].embedding
        };
    }));
    return enrichedDocs;
});

class RAGModel {
    private openai: OpenAI;
    private systemMessage: string;
    private modelName: string;
    private articleEmbeddings: Article[];

    constructor(config: {
        systemMessage: string;
        modelName?: string;
        articleEmbeddings: Article[];
    }) {
        this.openai = new OpenAI();
        this.systemMessage = config.systemMessage;
        this.modelName = config.modelName || "gpt-3.5-turbo-1106";
        this.articleEmbeddings = config.articleEmbeddings;
        this.predict = weave.op(this, this.predict);
    }

    private async getMostRelevantDocument(query: string): Promise<string> {
        const queryEmbedding = await this.openai.embeddings.create({
            input: query,
            model: "text-embedding-3-small"
        });

        const similarities = this.articleEmbeddings.map(doc => {
            if (!doc.embedding) return 0;
            return cosineSimilarity(queryEmbedding.data[0].embedding, doc.embedding);
        });

        const mostRelevantIndex = similarities.indexOf(Math.max(...similarities));
        return this.articleEmbeddings[mostRelevantIndex].text;
    }

    async predict(question: string): Promise<{
        answer: string;
        context: string;
    }> {
        const context = await this.getMostRelevantDocument(question);
        
        const response = await this.openai.chat.completions.create({
            model: this.modelName,
            messages: [
                { role: "system", content: this.systemMessage },
                { 
                    role: "user", 
                    content: `Use the following information to answer the subsequent question. If the answer cannot be found, write "I don't know."
                    Context:
                    """
                    ${context}
                    """
                    Question: ${question}`
                }
            ],
            temperature: 0
        });

        return {
            answer: response.choices[0].message.content || "",
            context
        };
    }
}

interface ScorerResult {
    verdict: boolean;
}

interface QuestionRow {
    question: string;
}

function createQuestionDataset(): weave.Dataset<QuestionRow> {
    return new weave.Dataset<QuestionRow>({
        id: 'rag-questions',
        rows: [
            { question: "What significant result was reported about Zealand Pharma's obesity trial?" },
            { question: "How much did Berkshire Hathaway's cash levels increase in the fourth quarter?" },
            { question: "What is the goal of Highmark Health's integration of Google Cloud and Epic Systems technology?" }
        ]
    });
}

const contextPrecisionScore = weave.op(async function(args: {
    datasetRow: QuestionRow;
    modelOutput: { answer: string; context: string; }
}): Promise<ScorerResult> {
    const openai = new OpenAI();
    
    const prompt = `Given question, answer and context verify if the context was useful in arriving at the given answer. Give verdict as "1" if useful and "0" if not with json output.
    Output in only valid JSON format.

    question: ${args.datasetRow.question}
    context: ${args.modelOutput.context}
    answer: ${args.modelOutput.answer}
    verdict: `;

    const response = await openai.chat.completions.create({
        model: "gpt-4-turbo-preview",
        messages: [{ role: "user", content: prompt }],
        response_format: { type: "json_object" }
    });

    const result = JSON.parse(response.choices[0].message.content || "{}");
    return {
        verdict: parseInt(result.verdict) === 1
    };
});

async function main() {
    # 팀 및 프로젝트 이름 설정
    await weave.init('<team-name>/rag-quickstart');
    
    const articleEmbeddings = await docsToEmbeddings(articles);
    
    const model = new RAGModel({
        systemMessage: "You are an expert in finance and answer questions related to finance, financial services, and financial markets. When responding based on provided information, be sure to cite the source.",
        articleEmbeddings
    });

    const evaluation = new weave.Evaluation({
        dataset: createQuestionDataset(),
        scorers: [contextPrecisionScore]
    });

    const results = await evaluation.evaluate({
        model: weave.op((args: { datasetRow: QuestionRow }) => 
            model.predict(args.datasetRow.question)
        )
    });
    
    console.log('Evaluation results:', results);
}

if (require.main === module) {
    main().catch(console.error);
}

결론

이 튜토리얼에서는 이 예제의 검색 단계와 같이 애플리케이션의 다양한 단계에 관찰성(observability)을 구축하는 방법을 보여주었습니다. 또한 애플리케이션 응답의 자동 평가를 위해 LLM judge와 같은 복잡한 스코어링 함수를 구축하는 방법도 배웠습니다.

다음 단계

엔지니어를 위한 실용적인 RAG 기술에 대한 심도 있는 학습을 위해 RAG++ 코스를 확인해 보세요. Weights & Biases, Cohere 및 Weaviate가 제공하는 프로덕션 레벨의 솔루션을 통해 성능을 최적화하고 비용을 절감하며 애플리케이션의 정확성과 관련성을 높이는 방법을 배울 수 있습니다.

Get Started

Guides

Cookbooks

Reference

Open Source

Community

학습할 내용:

Prerequisites

지식 베이스 구축

RAG 앱 생성

LLM Judge로 평가하기

스코어링 함수 정의하기

선택 사항: `Scorer` 클래스 정의하기

모든 과정 종합하기

결론

다음 단계

Get Started

Guides

Cookbooks

Reference

Open Source

Community

​학습할 내용:

​Prerequisites

​지식 베이스 구축

​RAG 앱 생성

​LLM Judge로 평가하기

​스코어링 함수 정의하기

​선택 사항: Scorer 클래스 정의하기

​모든 과정 종합하기

​결론

​다음 단계

학습할 내용:

Prerequisites

지식 베이스 구축

RAG 앱 생성

LLM Judge로 평가하기

스코어링 함수 정의하기

선택 사항: `Scorer` 클래스 정의하기

모든 과정 종합하기

결론

다음 단계