NVIDIA NeMo Inference Microservice Deploy Job

Deploy a model artifact from W&B to a NVIDIA NeMo Inference Microservice. To do this, use W&B Launch. W&B Launch converts model artifacts to NVIDIA NeMo Model and deploys to a running NIM/Triton server.

W&B Launch currently accepts the following compatible model types:

  1. Llama2
  2. StarCoder
  3. NV-GPT (coming soon)

Quickstart

  1. Create a launch queue if you don’t have one already. See an example queue config below.

    net: host
    gpus: all # can be a specific set of GPUs or `all` to use everything
    runtime: nvidia # also requires nvidia container runtime
    volume:
      - model-store:/model-store/
    
    YAML
    image
  2. Create this job in your project:

    wandb job create -n "deploy-to-nvidia-nemo-inference-microservice" \
       -e $ENTITY \
       -p $PROJECT \
       -E jobs/deploy_to_nvidia_nemo_inference_microservice/job.py \
       -g andrew/nim-updates \
       git https://github.com/wandb/launch-jobs
    
    Bash
  3. Launch an agent on your GPU machine:

    wandb launch-agent -e $ENTITY -p $PROJECT -q $QUEUE
    
    Bash
  4. Submit the deployment launch job with your desired configs from the Launch UI

    1. You can also submit via the CLI:
      wandb launch -d gcr.io/playground-111/deploy-to-nemo:latest \
        -e $ENTITY \
        -p $PROJECT \
        -q $QUEUE \
        -c $CONFIG_JSON_FNAME
      
      Bash
      image
  5. You can track the deployment process in the Launch UI. image

  6. Once complete, you can immediately curl the endpoint to test the model. The model name is always ensemble.

     #!/bin/bash
     curl -X POST "http://0.0.0.0:9999/v1/completions" \
         -H "accept: application/json" \
         -H "Content-Type: application/json" \
         -d '{
             "model": "ensemble",
             "prompt": "Tell me a joke",
             "max_tokens": 256,
             "temperature": 0.5,
             "n": 1,
             "stream": false,
             "stop": "string",
             "frequency_penalty": 0.0
             }'
    
    Bash