Spin up a single node GPU cluster with Minikube
Set up W&B Launch on a Minikube cluster that can schedule and run GPU workloads.
This tutorial is intended for users with direct access to a machine that has multiple GPUs. It is not intended for users who rent a cloud machine.
If you want to set up a Minikube cluster on a cloud machine, W&B recommends that you instead create a Kubernetes cluster with GPU support using your cloud provider's tooling. For example, AWS, GCP, Azure, CoreWeave, and other cloud providers offer tools to create Kubernetes clusters with GPU support.
If you want to schedule GPUs on a machine that has a single GPU, W&B recommends you use a Launch Docker queue instead. You can still follow this tutorial for fun, but the GPU scheduling will not be very useful.
Background
The Nvidia container toolkit has made it easy to run GPU-enabled workloads with Docker. One limitation is the lack of native support for scheduling GPUs by volume. If you want to use GPUs with the `docker run` command, you must either request specific GPUs by ID or request all GPUs present, which makes many distributed GPU-enabled workloads impractical. Kubernetes offers support for scheduling GPUs by volume request, but until recently, setting up a local Kubernetes cluster with GPU scheduling took considerable time and effort. Minikube, one of the most popular tools for running single-node Kubernetes clusters, recently released support for GPU scheduling 🎉 In this tutorial, we will create a Minikube cluster on a multi-GPU machine and launch concurrent stable diffusion inference jobs to the cluster using W&B Launch 🚀
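To make the limitation concrete: `docker run --gpus` accepts either `all` or explicit device IDs (for example `--gpus '"device=0,1"'`), while a Kubernetes pod can request a *count* of GPUs and let the scheduler choose the devices. A minimal sketch of such a pod spec (the pod name and image tag are illustrative, not from this tutorial):

```yaml
# Sketch: request any 2 GPUs by count; the Kubernetes scheduler picks which.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2
```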
Prerequisites
Before getting started, you will need:
- A W&B account.
- A Linux machine with the following installed and running:
- Docker runtime
- Drivers for any GPU you want to use
- Nvidia container toolkit
For testing and creating this tutorial, we used an `n1-standard-16` Google Cloud Compute Engine instance with 4 NVIDIA Tesla T4 GPUs attached.
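Before continuing, it is worth confirming that both the host driver and Docker can see your GPUs. A quick check, sketched below (the CUDA image tag is just an example):

```shell
# On the host, and then inside a container, list the visible GPUs:
#
#   nvidia-smi -L
#   docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi -L
#
# Each command prints one line per GPU. Sample output for a 2-GPU machine:
sample='GPU 0: Tesla T4 (UUID: GPU-xxxx)
GPU 1: Tesla T4 (UUID: GPU-yyyy)'
# Counting those lines gives the number of GPUs reported:
printf '%s\n' "$sample" | grep -c '^GPU '
```

If the host and the container report different GPU counts, recheck your driver and Nvidia container toolkit installation before continuing.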
Create a queue for launch jobs
First, create a queue for your launch jobs.
- Navigate to wandb.ai/launch (or `<your-wandb-url>/launch` if you use a private W&B server).
- In the top right corner of your screen, click the blue Create a queue button. A queue creation drawer will slide out from the right side of your screen.
- Select an entity, enter a name, and select Kubernetes as the type for your queue.
- The Config section of the drawer is where you will enter a Kubernetes job specification for the launch queue. Any runs launched from this queue will be created using this job specification, so you can modify this configuration as needed to customize your jobs. For this tutorial, you can copy and paste the sample config below in your queue config as YAML or JSON:
YAML:

```yaml
spec:
  template:
    spec:
      containers:
        - image: ${image_uri}
          resources:
            limits:
              cpu: 4
              memory: 12Gi
              nvidia.com/gpu: '{{gpus}}'
      restartPolicy: Never
  backoffLimit: 0
```

JSON:

```json
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "image": "${image_uri}",
            "resources": {
              "limits": {
                "cpu": 4,
                "memory": "12Gi",
                "nvidia.com/gpu": "{{gpus}}"
              }
            }
          }
        ],
        "restartPolicy": "Never"
      }
    },
    "backoffLimit": 0
  }
}
```
For more information about queue configurations, see the Set up Launch on Kubernetes and the Advanced queue setup guide.
The `${image_uri}` and `{{gpus}}` strings are examples of the two kinds of variable templates that you can use in your queue configuration. The `${image_uri}` template will be replaced by the agent with the image URI of the job you are launching. The `{{gpus}}` template will be used to create a template variable that you can override from the launch UI, CLI, or SDK when submitting a job. These values are placed in the job specification so that they modify the correct fields to control the image and GPU resources used by the job.
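For example, if the agent launches a job whose image is tagged `my-image:v0` (a placeholder) and the submitter sets `gpus` to 2, the job specification the agent submits would look roughly like:

```yaml
spec:
  template:
    spec:
      containers:
        - image: my-image:v0        # ${image_uri} filled in by the agent
          resources:
            limits:
              cpu: 4
              memory: 12Gi
              nvidia.com/gpu: '2'   # {{gpus}} filled in at submission time
      restartPolicy: Never
  backoffLimit: 0
```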
- Click the Parse configuration button to begin customizing your `gpus` template variable.
- Set the Type to `Integer` and the Default, Min, and Max to values of your choosing. Attempts to submit a run to this queue that violate the constraints of the template variable will be rejected.
- Click Create queue to create your queue. You will be redirected to the queue page for your new queue.
In the next section, we will set up an agent that can pull and execute jobs from the queue you created.