Use Weave with W&B Models

This notebook walks through an end-to-end workflow that combines W&B Models and Weave to build, evaluate, and publish a Retrieval-Augmented Generation (RAG) application. You retrieve a fine-tuned chat model from W&B Registry, swap it into an existing RagModel tracked in Weave, evaluate the updated application with weave.Evaluation, and publish the new RAG model back to the Registry. This workflow is intended for teams who train and fine-tune models with W&B Models and want to integrate them into LLM applications tracked and evaluated with Weave.

This is an interactive notebook. You can run it locally or in Google Colab.

Prerequisites

First, install the required libraries, set up API keys, log in to W&B, and create a new W&B project.

Install weave, pandas, unsloth, wandb, litellm, pydantic, torch, and faiss-gpu using pip.

%%capture
!pip install weave wandb pandas pydantic litellm faiss-gpu

%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Add the required API keys from your environment.

import os

from google.colab import userdata

os.environ["WANDB_API_KEY"] = userdata.get("WANDB_API_KEY")  # W&B Models and Weave
os.environ["OPENAI_API_KEY"] = userdata.get(
    "OPENAI_API_KEY"
)  # OpenAI - for retrieval embeddings
os.environ["GEMINI_API_KEY"] = userdata.get(
    "GEMINI_API_KEY"
)  # Gemini - for the base chat model

import pandas as pd
import wandb

import weave

wandb.login()

PROJECT = "weave-cookboook-demo"
ENTITY = "wandb-smle"

weave.init(ENTITY + "/" + PROJECT)
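
The cell above reads secrets through `google.colab.userdata`, which is only available in Colab. When running the notebook locally, one option is to read the same keys from environment variables and fail early if any are missing. The following is a minimal sketch under that assumption; the helper name `missing_api_keys` is hypothetical and not part of W&B or Weave.

```python
import os

# Keys used in this notebook (names taken from the Colab cell above).
REQUIRED_KEYS = ("WANDB_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"WANDB_API_KEY": "abc", "OPENAI_API_KEY": ""}
print(missing_api_keys(example_env))  # ['OPENAI_API_KEY', 'GEMINI_API_KEY']
```

Calling `missing_api_keys()` with no arguments checks the real environment, so a local run could raise an error before `wandb.login()` if the returned list is not empty.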

Download `ChatModel` from W&B Registry and implement `UnslothLoRAChatModel`

In this scenario, the Model Team has already fine-tuned the Llama-3.2 model with the unsloth library for performance optimization, and the model is available in the W&B Registry. In this step, you retrieve the fine-tuned ChatModel from the Registry and convert it into a weave.Model to make it compatible with the RagModel.

The RagModel referenced in the following code is a top-level weave.Model that can be considered a complete RAG application. It contains a ChatModel, a vector database, and a prompt. The ChatModel is also a weave.Model, and it contains the code that downloads an artifact from the W&B Registry. Because the ChatModel is a separate component, you can swap it for any other LLM chat model without changing the rest of the RagModel. For more information, view the model in Weave.

To load the ChatModel, use unsloth.FastLanguageModel or peft.AutoPeftModelForCausalLM with adapters for efficient integration into the app. After you download the model from the Registry, set up the initialization and prediction logic with the model_post_init method. The required code for this step is available in the Use tab of the Registry, and you can copy it directly into your implementation.

The following code defines the UnslothLoRAChatModel class to manage, initialize, and use the fine-tuned Llama-3.2 model retrieved from the W&B Registry. UnslothLoRAChatModel uses unsloth.FastLanguageModel for optimized inference. The model_post_init method downloads and sets up the model, while the predict method processes user queries and generates responses.

To adapt the code for your use case, update the MODEL_REG_URL with the correct registry path for your fine-tuned model, and adjust parameters like max_seq_length or dtype based on your hardware or requirements.

from typing import Any

from pydantic import PrivateAttr
from unsloth import FastLanguageModel

import weave

class UnslothLoRAChatModel(weave.Model):
    """
    A separate ChatModel class lets us store and version more parameters than just the model name.
    This is especially relevant for fine-tuning (locally or as a service), which involves model-specific parameters.
    """

    chat_model: str
    cm_temperature: float
    cm_max_new_tokens: int
    cm_quantize: bool
    inference_batch_size: int
    dtype: Any
    device: str
    _model: Any = PrivateAttr()
    _tokenizer: Any = PrivateAttr()

    def model_post_init(self, __context):
        # we can simply paste this from the "Use" tab from the registry
        run = wandb.init(project=PROJECT, job_type="model_download")
        artifact = run.use_artifact(f"{self.chat_model}")
        model_path = artifact.download()

        # unsloth version (enable native 2x faster inference)
        self._model, self._tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_path,
            max_seq_length=self.cm_max_new_tokens,
            dtype=self.dtype,
            load_in_4bit=self.cm_quantize,
        )
        FastLanguageModel.for_inference(self._model)

    @weave.op()
    async def predict(self, query: list[dict]) -> str:
        # add_generation_prompt = true - Must add for generation
        input_ids = self._tokenizer.apply_chat_template(
            query,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to("cuda")

        output_ids = self._model.generate(
            input_ids=input_ids,
            max_new_tokens=64,
            use_cache=True,
            temperature=1.5,
            min_p=0.1,
        )

        decoded_outputs = self._tokenizer.batch_decode(
            output_ids[0][input_ids.shape[1] :], skip_special_tokens=True
        )

        return "".join(decoded_outputs).strip()

MODEL_REG_URL = "wandb32/wandb-registry-RAG Chat Models/Finetuned Llama-3.2:v3"

max_seq_length = 2048  # Choose any! We auto support RoPE Scaling internally!
dtype = (
    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
)
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.

new_chat_model = UnslothLoRAChatModel(
    name="UnslothLoRAChatModelRag",
    chat_model=MODEL_REG_URL,
    cm_temperature=1.0,
    cm_max_new_tokens=max_seq_length,
    cm_quantize=load_in_4bit,
    inference_batch_size=max_seq_length,
    dtype=dtype,
    device="auto",
)

await new_chat_model.predict(
    [{"role": "user", "content": "What is the capital of Germany?"}]
)
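
In `predict`, `generate` returns the prompt tokens followed by the newly generated tokens, so the slice `output_ids[0][input_ids.shape[1]:]` keeps only the completion. The sketch below shows the same slicing with plain Python lists. The token IDs are made up for illustration and no model is involved.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the first `prompt_length` tokens, which echo the prompt."""
    return output_ids[prompt_length:]


# Made-up token IDs: the first four stand in for the templated prompt.
prompt_ids = [101, 2054, 2003, 102]
generated = prompt_ids + [3000, 3001, 3002]  # prompt followed by new tokens

completion = strip_prompt(generated, len(prompt_ids))
print(completion)  # [3000, 3001, 3002]
```

Without this step, the decoded response would repeat the user's question and the chat template before the answer.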

Integrate the new `ChatModel` version into `RagModel`

Building a RAG application from a fine-tuned chat model lets you reuse tailored components without rebuilding the entire pipeline. In this step, you retrieve the existing RagModel from your Weave project and update its ChatModel to use the fine-tuned model. Swapping in the new chat model leaves the other components, like the vector database and prompts, untouched, preserving the application’s overall structure while improving performance.

The following code retrieves the RagModel object using a reference from the Weave project. The code then updates the chat_model attribute of the RagModel to use the new UnslothLoRAChatModel instance created in the previous step. After this, the code publishes the updated RagModel to create a new version. Finally, the code runs a sample prediction query with the updated RagModel to verify that it uses the new chat model.

RagModel = weave.ref(
    "weave://wandb-smle/weave-cookboook-demo/object/RagModel:cqRaGKcxutBWXyM0fCGTR1Yk2mISLsNari4wlGTwERo"
).get()

RagModel.chat_model.chat_model

await RagModel.predict("When was the first conference on climate change?")

# MAGIC: exchange chat_model and publish new version (no need to worry about other RAG components)
RagModel.chat_model = new_chat_model

RagModel.chat_model.chat_model

# first publish new version so that in prediction we reference new version
PUB_REFERENCE = weave.publish(RagModel, "RagModel")

await RagModel.predict("When was the first conference on climate change?")
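
The swap works because the RAG application holds its chat model as an attribute, so replacing that one attribute leaves the retriever and prompt alone. The sketch below illustrates the idea with plain dataclasses. `ToyChatModel` and `ToyRagModel` are hypothetical stand-ins and do not reproduce the real `RagModel` or any Weave API.

```python
from dataclasses import dataclass


@dataclass
class ToyChatModel:
    """Hypothetical stand-in for a chat model; it only tags its answers."""
    name: str

    def predict(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class ToyRagModel:
    """Hypothetical stand-in for a RAG app: chat model + documents + prompt."""
    chat_model: ToyChatModel
    documents: list
    prompt_template: str

    def predict(self, question: str) -> str:
        # Toy retrieval: keep documents that share a word with the question.
        words = set(question.lower().split())
        context = [d for d in self.documents if words & set(d.lower().split())]
        prompt = self.prompt_template.format(
            context=" ".join(context), question=question
        )
        return self.chat_model.predict(prompt)


rag = ToyRagModel(
    chat_model=ToyChatModel("base"),
    documents=["climate conference history", "unrelated text"],
    prompt_template="Context: {context} Question: {question}",
)
before = rag.predict("first climate conference")

# Swap only the chat model; documents and prompt template stay as they were.
rag.chat_model = ToyChatModel("fine-tuned")
after = rag.predict("first climate conference")
```

In this toy version, `before` and `after` differ only in the model tag at the start of the string, which mirrors how the real swap changes the generator while retrieval and prompting stay the same.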

Run a `weave.Evaluation`

With the updated RagModel published, the next step is to confirm that the new fine-tuned chat model performs as expected within the RAG application. In this step, you evaluate the updated RagModel with an existing weave.Evaluation. To streamline integration and enable collaboration between the Models and Apps teams, you log evaluation results to both the model’s W&B run and the Weave workspace.

In Models:

The evaluation summary is logged to the W&B run used to download the fine-tuned chat model. This includes summary metrics and graphs displayed in a workspace view for analysis.
The evaluation trace ID is added to the run’s configuration, linking directly to the Weave page for improved traceability by the Model Team.

In Weave:

The artifact or registry link for the ChatModel is stored as an input to the RagModel.
The W&B run ID is saved as an extra column in the evaluation traces for better context.

The following code demonstrates how to retrieve an evaluation object, run the evaluation with the updated RagModel, and log the results to both W&B and Weave. Ensure that the evaluation reference (WEAVE_EVAL) matches your project setup.

# MAGIC: we can simply get an evaluation with a eval dataset and scorers and use them
WEAVE_EVAL = "weave://wandb-smle/weave-cookboook-demo/object/climate_rag_eval:ntRX6qn3Tx6w3UEVZXdhIh1BWGh7uXcQpOQnIuvnSgo"
climate_rag_eval = weave.ref(WEAVE_EVAL).get()

with weave.attributes({"wandb-run-id": wandb.run.id}):
    # use .call attribute to retrieve both the result and the call in order to save eval trace to Models
    summary, call = await climate_rag_eval.evaluate.call(climate_rag_eval, RagModel)

# log to models
wandb.run.log(pd.json_normalize(summary, sep="/").to_dict(orient="records")[0])
wandb.run.config.update(
    {"weave_url": f"https://wandb.ai/wandb-smle/weave-cookboook-demo/r/call/{call.id}"}
)
wandb.run.finish()
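
The `pd.json_normalize(summary, sep="/")` call flattens the nested evaluation summary into a single level with slash-separated keys, so each metric can be logged under its own name. The sketch below reproduces that flattening for nested dictionaries in plain Python. The summary shape, scorer names, and numbers are invented for illustration and are not real evaluation results.

```python
def flatten(nested, sep="/", prefix=""):
    """Flatten nested dicts into one dict with `sep`-joined keys."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along.
            flat.update(flatten(value, sep=sep, prefix=full_key))
        else:
            flat[full_key] = value
    return flat


# Invented summary shape: scorer names and numbers are placeholders.
example_summary = {
    "toy_scorer": {"score": {"mean": 0.5}},
    "model_latency": {"mean": 1.25},
}
print(flatten(example_summary))
# {'toy_scorer/score/mean': 0.5, 'model_latency/mean': 1.25}
```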

Save the new RAG model to the Registry

Now that you’ve evaluated the updated RagModel, the final step is to publish it to the W&B Registry as a reference artifact so that both the Models and Apps teams can discover and reuse it.

The following code retrieves the Weave object version and name for the updated RagModel and uses them to create reference links. The code then creates a new artifact in W&B with metadata containing the model’s Weave URL, logs this artifact, and links it to a designated registry path. Before running the code, ensure the ENTITY and PROJECT variables match your W&B setup, and specify the correct target registry path.

After running the code in this section, your updated RagModel is available in the W&B Registry as a reference artifact, completing the round-trip between W&B Models and Weave.

MODELS_OBJECT_VERSION = PUB_REFERENCE.digest  # weave object version
MODELS_OBJECT_NAME = PUB_REFERENCE.name  # weave object name

models_url = f"https://wandb.ai/{ENTITY}/{PROJECT}/weave/objects/{MODELS_OBJECT_NAME}/versions/{MODELS_OBJECT_VERSION}"
models_link = (
    f"weave://{ENTITY}/{PROJECT}/object/{MODELS_OBJECT_NAME}:{MODELS_OBJECT_VERSION}"
)

with wandb.init(project=PROJECT, entity=ENTITY) as run:
    # create new Artifact
    artifact_model = wandb.Artifact(
        name="RagModel",
        type="model",
        description="Models Link from RagModel in Weave",
        metadata={"url": models_url},
    )
    artifact_model.add_reference(models_link, name="model", checksum=False)

    # log new artifact
    run.log_artifact(artifact_model, aliases=[MODELS_OBJECT_VERSION])

    # link to registry
    run.link_artifact(
        artifact_model, target_path="wandb32/wandb-registry-RAG Models/RAG Model"
    )
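
The `models_link` string follows the `weave://<entity>/<project>/object/<name>:<digest>` pattern that also appears in the references used earlier in this notebook. The sketch below splits such a string into its parts with plain string handling. The helper `parse_weave_ref` is hypothetical and only handles this one pattern; it is not Weave's own parser, and the example values are placeholders.

```python
from typing import NamedTuple


class WeaveRefParts(NamedTuple):
    entity: str
    project: str
    kind: str
    name: str
    digest: str


def parse_weave_ref(uri: str) -> WeaveRefParts:
    """Split 'weave://entity/project/kind/name:digest' into its parts."""
    scheme = "weave://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a weave:// reference: {uri!r}")
    # Split into at most four pieces so the name:digest part stays together.
    entity, project, kind, rest = uri[len(scheme):].split("/", 3)
    name, _, digest = rest.partition(":")
    return WeaveRefParts(entity, project, kind, name, digest)


parts = parse_weave_ref("weave://my-entity/my-project/object/RagModel:abc123")
print(parts.name, parts.digest)  # RagModel abc123
```

A check like this can help confirm that the entity and project embedded in a reference match the `ENTITY` and `PROJECT` variables before the artifact is logged.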
