This guide shows you how to use Weave to trace, debug, and evaluate LlamaIndex applications. By following along, you learn how Weave automatically captures calls made through the LlamaIndex Python library so you can monitor RAG pipelines, agent steps, and LLM calls without writing custom logging code. This guide is for developers building LLM applications with LlamaIndex who want visibility into their workflows for debugging, performance analysis, and evaluation.

When you work with LLMs, debugging is inevitable. Whether a model call fails, an output is misformatted, or nested model calls create confusion, pinpointing issues can be challenging. LlamaIndex applications often consist of multiple steps and LLM calls, so it is crucial to understand the inner workings of your chains and agents.

Weave streamlines this process by automatically capturing traces for your LlamaIndex applications. This lets you monitor and analyze your application’s performance, which helps you debug and optimize your LLM workflows. Weave also supports your evaluation workflows.

Get started

To get started, call weave.init() at the beginning of your script. This initializes Weave and starts capturing traces for any LlamaIndex calls that follow. The argument in weave.init() is a project name that helps you organize your traces.
import weave
from llama_index.core.chat_engine import SimpleChatEngine

# Initialize Weave with your project name
weave.init("llamaindex_demo")

chat_engine = SimpleChatEngine.from_defaults()
response = chat_engine.chat(
    "Say something profound and romantic about fourth of July"
)
print(response)
The previous example creates a basic LlamaIndex chat engine which, by default, makes an OpenAI call internally. After you run this code, Weave captures a trace of the chat engine’s execution that you can inspect in the Weave web interface. See the following trace:
[Image: simple_llamaindex.png]

Traces

This section explains how Weave captures multi-step LlamaIndex workflows like RAG pipelines. LlamaIndex is known for its ease of connecting data with LLMs. A basic RAG application requires an embedding step, a retrieval step, and a response synthesis step. As complexity grows, it becomes important to store traces of individual steps in a central database during both development and production. These traces are essential for debugging and improving your application.

Weave automatically tracks all calls made through the LlamaIndex library, including prompt templates, LLM calls, tools, and agent steps. You can view the traces in the Weave web interface. The following example shows a basic RAG pipeline from LlamaIndex’s Starter Tutorial (OpenAI):
import weave
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Initialize Weave with your project name
weave.init("llamaindex_demo")

# Assuming you have a `.txt` file in the `data` directory
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
The trace timeline captures not only the “events” but also the execution time, cost, and token counts where applicable. Drill down into the trace to see the inputs and outputs of each step.
[Image: llamaindex_rag.png]
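To make the idea of a nested trace concrete, the following is a toy sketch in plain Python. It is not how Weave is implemented and uses no Weave or LlamaIndex APIs. The `traced` decorator, the `TRACE` list, and the stand-in `embed`, `retrieve`, and `synthesize` functions are all hypothetical names chosen for illustration. The sketch records the name, inputs, output, nesting depth, and duration of each step, which is roughly the kind of information a trace timeline displays.

```python
import functools
import time

# Hypothetical in-memory trace store (Weave stores traces in its own backend).
TRACE = []
_depth = 0


def traced(func):
    """Record name, inputs, output, nesting depth, and duration of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _depth
        record = {"name": func.__name__, "depth": _depth, "inputs": (args, kwargs)}
        TRACE.append(record)  # a parent is appended before its children
        _depth += 1
        start = time.perf_counter()
        try:
            record["output"] = func(*args, **kwargs)
            return record["output"]
        finally:
            record["seconds"] = time.perf_counter() - start
            _depth -= 1

    return wrapper


@traced
def embed(text):
    # Stand-in for an embedding call: one number instead of a vector.
    return float(len(text))


@traced
def retrieve(query, chunks):
    # Stand-in for retrieval: pick the chunk whose "embedding" is closest.
    target = embed(query)
    return min(chunks, key=lambda chunk: abs(embed(chunk) - target))


@traced
def synthesize(query, context):
    # Stand-in for the LLM call that writes the final answer.
    return f"Answer to {query!r} using {context!r}"


@traced
def rag_query(query, chunks):
    return synthesize(query, retrieve(query, chunks))


answer = rag_query("lisp?", ["He wrote essays.", "Lisp!"])

# Print the call tree, indented by nesting depth.
for record in TRACE:
    print("  " * record["depth"] + record["name"])
```

The printed tree shows `rag_query` at the top, with `retrieve` (and its `embed` calls) and `synthesize` nested beneath it, mirroring the embedding, retrieval, and response synthesis steps of a basic RAG application.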

One-click observability

This section explains how the Weave integration plugs into LlamaIndex’s built-in observability system, so you don’t have to configure handlers manually. LlamaIndex provides one-click observability to let you build principled LLM applications in a production setting. The Weave integration uses this capability and automatically assigns WeaveCallbackHandler() to llama_index.core.global_handler. As a user of LlamaIndex and Weave, you only need to initialize a Weave run with weave.init([NAME_OF_PROJECT]), where [NAME_OF_PROJECT] is your project name.

Create a Model for easier experimentation

Organizing and evaluating LLMs in applications for various use cases is challenging when you have multiple components such as prompts, model configurations, and inference parameters. With weave.Model, you can capture and organize experimental details like system prompts or the models you use, which helps you compare different iterations. The following example shows how to build a LlamaIndex query engine in a weave.Model, using data that you can find in the weave/data folder:
import weave

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.core import PromptTemplate


PROMPT_TEMPLATE = """
You are given relevant information about Paul Graham. Answer the user query based only on the information provided. Don't make up stuff.

User Query: {query_str}
Context: {context_str}
Answer:
"""

class SimpleRAGPipeline(weave.Model):
    chat_llm: str = "gpt-4"
    temperature: float = 0.1
    similarity_top_k: int = 2
    chunk_size: int = 256
    chunk_overlap: int = 20
    prompt_template: str = PROMPT_TEMPLATE

    def get_llm(self):
        return OpenAI(temperature=self.temperature, model=self.chat_llm)

    def get_template(self):
        return PromptTemplate(self.prompt_template)

    def load_documents_and_chunk(self, data):
        documents = SimpleDirectoryReader(data).load_data()
        splitter = SentenceSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
        )
        nodes = splitter.get_nodes_from_documents(documents)
        return nodes

    def get_query_engine(self, data):
        nodes = self.load_documents_and_chunk(data)
        index = VectorStoreIndex(nodes)

        llm = self.get_llm()
        prompt_template = self.get_template()

        return index.as_query_engine(
            similarity_top_k=self.similarity_top_k,
            llm=llm,
            text_qa_template=prompt_template,
        )

    @weave.op()
    def predict(self, query: str):
        query_engine = self.get_query_engine(
            # This data can be found in the weave repo under data/paul_graham
            "data/paul_graham",
        )
        response = query_engine.query(query)
        return {"response": response.response}

weave.init("test-llamaindex-weave")

rag_pipeline = SimpleRAGPipeline()
response = rag_pipeline.predict("What did the author do growing up?")
print(response)
The SimpleRAGPipeline class, which subclasses weave.Model, organizes the important parameters for this RAG pipeline. Decorating the predict method with weave.op() enables tracing. With this structure in place, you can now version, compare, and evaluate different configurations of your RAG pipeline in Weave.
[Image: llamaindex_model.png]
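One way to compare configurations is to enumerate the parameter combinations you want to try. The following is a minimal sketch in plain Python. The `search_space` values and the `config_variants` helper are hypothetical and not part of Weave. The keys are assumed to mirror the fields declared on SimpleRAGPipeline.

```python
from itertools import product

# Hypothetical search space; the keys mirror fields of SimpleRAGPipeline.
search_space = {
    "chunk_size": [256, 512],
    "similarity_top_k": [2, 4],
}


def config_variants(space):
    """Yield one dict per combination of the listed values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


variants = list(config_variants(search_space))
for config in variants:
    print(config)
    # With the class from this section in scope, each variant could be
    # instantiated as: rag_pipeline = SimpleRAGPipeline(**config)
```

Because the fields are declared on the class, each variant differs only in the values passed to the constructor, which is what makes the iterations straightforward to tell apart.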

Evaluate with weave.Evaluation

This section shows how to measure your model’s performance on a fixed dataset so you can compare iterations quantitatively. Evaluations help you measure the performance of your applications. With the weave.Evaluation class, you can capture how well your model performs on specific tasks or datasets, which helps you compare different models and iterations of your application. The following example shows how to evaluate the model created in the previous section:
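import asyncio

import weave
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# Assumes SimpleRAGPipeline from the previous section is defined
# and weave.init() has already been called.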

eval_examples = [
    {
        "id": "0",
        "query": "What programming language did Paul Graham learn to teach himself AI when he was in college?",
        "ground_truth": "Paul Graham learned Lisp to teach himself AI when he was in college.",
    },
    {
        "id": "1",
        "query": "What was the name of the startup Paul Graham co-founded that was eventually acquired by Yahoo?",
        "ground_truth": "The startup Paul Graham co-founded that was eventually acquired by Yahoo was called Viaweb.",
    },
    {
        "id": "2",
        "query": "What is the capital city of France?",
        "ground_truth": "I cannot answer this question because no information was provided in the text.",
    },
]

llm_judge = OpenAI(model="gpt-4", temperature=0.0)
evaluator = CorrectnessEvaluator(llm=llm_judge)

@weave.op()
def correctness_evaluator(query: str, ground_truth: str, output: dict):
    result = evaluator.evaluate(
        query=query, reference=ground_truth, response=output["response"]
    )
    return {"correctness": float(result.score)}

evaluation = weave.Evaluation(dataset=eval_examples, scorers=[correctness_evaluator])

rag_pipeline = SimpleRAGPipeline()

asyncio.run(evaluation.evaluate(rag_pipeline))
This evaluation builds on the example in the previous section. Evaluating with weave.Evaluation requires three components: an evaluation dataset, a scorer function, and a weave.Model. The following requirements apply to them:
  • The keys of the evaluation sample dicts must match the argument names of the scorer function and of the weave.Model’s predict method. In the example, predict takes query, and the scorer takes query and ground_truth.
  • The weave.Model must have a method named predict, infer, or forward. You must decorate this method with weave.op() for tracing.
  • The scorer function must be decorated with weave.op() and must have output as a named argument.
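The name-matching requirement above can be checked before you run an evaluation. The following is a sketch in plain Python. `missing_columns` is a hypothetical helper, not part of Weave, and it assumes that dataset keys are matched to function parameters by name as described in the list above. The stand-in functions only copy the signatures used in this guide.

```python
import inspect


def missing_columns(func, row, provided=()):
    """Return required parameters of `func` that `row` and `provided` do not supply."""
    missing = []
    for name, param in inspect.signature(func).parameters.items():
        if name == "self" or param.default is not inspect.Parameter.empty:
            continue  # skip the instance and optional parameters
        if param.kind in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD):
            continue  # *args and **kwargs accept anything
        if name not in row and name not in provided:
            missing.append(name)
    return missing


# Stand-ins with the same signatures as the functions in this guide.
def predict(query: str):
    return {"response": "..."}


def correctness_evaluator(query: str, ground_truth: str, output: dict):
    return {"correctness": 0.0}


# A deliberately mismatched scorer: no dataset key is named "question".
def mismatched_scorer(question: str, output: dict):
    return {"correctness": 0.0}


row = {"id": "0", "query": "example query", "ground_truth": "example answer"}

print(missing_columns(predict, row))  # []
print(missing_columns(correctness_evaluator, row, provided=("output",)))  # []
print(missing_columns(mismatched_scorer, row, provided=("output",)))  # ['question']
```

The `provided` argument lists names that come from somewhere other than the dataset. Here that is `output`, which holds the model's prediction.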
[Image: llamaindex_evaluation.png]

By integrating Weave with LlamaIndex, you can ensure comprehensive logging and monitoring of your LLM applications, which streamlines debugging and performance optimization through evaluation.