> ## Documentation Index
> Fetch the complete documentation index at: https://docs.wandb.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# HuggingFace Dataset Evaluations

> Learn how to use huggingface dataset evaluations with W&B Weave

<Note>
  This is an interactive notebook. You can run it locally or use the links below:

  * [Open in Google Colab](https://colab.research.google.com/github/wandb/docs/blob/main/weave/cookbooks/source/hf_dataset_evals.ipynb)
  * [View source on GitHub](https://github.com/wandb/docs/blob/main/weave/cookbooks/source/hf_dataset_evals.ipynb)
</Note>

# Using HuggingFace Datasets in evaluations with `preprocess_model_input`

## Note: This is a temporary workaround

> This guide demonstrates a workaround for using HuggingFace Datasets with Weave evaluations.<br /><br />
> We are actively working on developing more seamless integrations that will simplify this process.\
> While this approach works, expect improvements and updates in the near future that will make working with external datasets more straightforward.

## Setup and imports

First, we initialize Weave and connect to Weights & Biases for tracking experiments.

```python lines theme={null}
!pip install datasets wandb weave
python
# Initialize variables
HUGGINGFACE_DATASET = "wandb/ragbench-test-sample"
WANDB_KEY = ""
WEAVE_TEAM = ""
WEAVE_PROJECT = ""

# Init weave and required libraries
import asyncio

import nest_asyncio
import wandb
from datasets import load_dataset

import weave
from weave import Evaluation

# Login to wandb and initialize weave
wandb.login(key=WANDB_KEY)
client = weave.init(f"{WEAVE_TEAM}/{WEAVE_PROJECT}")

# Apply nest_asyncio to allow nested event loops (needed for some notebook environments)
nest_asyncio.apply()
```

## Load and prepare HuggingFace dataset

* We load a HuggingFace dataset.
* Create an index mapping to reference the dataset rows.
* This index approach allows us to maintain references to the original dataset.

> **Note:**<br />
> In the index, we encode the `hf_hub_name` along with the `hf_id` to ensure each row has a unique identifier.\
> This unique digest value is used for tracking and referencing specific dataset entries during evaluations.

```python lines theme={null}
# Load the HuggingFace dataset
ds = load_dataset(HUGGINGFACE_DATASET)
row_count = ds["train"].num_rows

# Create an index mapping for the dataset
# This creates a list of dictionaries with HF dataset indices
# Example: [{"hf_id": 0}, {"hf_id": 1}, {"hf_id": 2}, ...]
hf_index = [{"hf_id": i, "hf_hub_name": HUGGINGFACE_DATASET} for i in range(row_count)]
```

## Define processing and evaluation functions

### Processing pipeline

* `preprocess_example`: Transforms the index reference into the actual data needed for evaluation
* `hf_eval`: Defines how to score the model outputs
* `function_to_evaluate`: The actual function/model being evaluated

```python lines theme={null}
@weave.op()
def preprocess_example(example):
    """
    Preprocesses each example before evaluation.
    Args:
        example: Dict containing hf_id
    Returns:
        Dict containing the prompt from the HF dataset
    """
    hf_row = ds["train"][example["hf_id"]]
    return {"prompt": hf_row["question"], "answer": hf_row["response"]}

@weave.op()
def hf_eval(hf_id: int, output: dict) -> dict:
    """
    Scoring function for evaluating model outputs.
    Args:
        hf_id: Index in the HF dataset
        output: The output from the model to evaluate
    Returns:
        Dict containing evaluation scores
    """
    hf_row = ds["train"][hf_id]
    return {"scorer_value": True}

@weave.op()
def function_to_evaluate(prompt: str):
    """
    The function that will be evaluated (e.g., your model or pipeline).
    Args:
        prompt: Input prompt from the dataset
    Returns:
        Dict containing model output
    """
    return {"generated_text": "testing "}
```

### Create and run evaluation

* For each index in hf\_index:
  1. `preprocess_example` gets the corresponding data from the HF dataset.
  2. The preprocessed data is passed to `function_to_evaluate`.
  3. The output is scored using `hf_eval`.
  4. Results are tracked in Weave.

```python lines theme={null}
# Create evaluation object
evaluation = Evaluation(
    dataset=hf_index,  # Use our index mapping
    scorers=[hf_eval],  # List of scoring functions
    preprocess_model_input=preprocess_example,  # Function to prepare inputs
)

# Run evaluation asynchronously
async def main():
    await evaluation.evaluate(function_to_evaluate)

asyncio.run(main())
```
