Vision

Load model

Just like the Chat model, the vision model requires two files to load:

  • the GGUF model
  • the mmproj (multimodal projector) model

You can load the model using:

Load Model
curl -X POST 'http://127.0.0.1:3928/inferences/llamacpp/loadmodel' \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/gguf/model/",
    "mmproj": "/path/to/mmproj/model/",
    "ctx_len": 2048,
    "ngl": 100,
    "cont_batching": false,
    "embedding": false,
    "system_prompt": "",
    "user_prompt": "\n### Instruction:\n",
    "ai_prompt": "\n### Response:\n"
  }'

Download the models here:

  • LLaVA model: Large Language and Vision Assistant, which achieves SoTA on 11 benchmarks.
  • BakLLaVA model: a Mistral 7B base augmented with the LLaVA architecture.

Inference

Nitro currently only works with images converted to base64 format. Use a base64 converter to prepare your images, or encode them from the command line as shown below.
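If you work from the terminal, here is a minimal sketch using the standard base64 utility (the file name image.jpg is a placeholder):

base64 -w 0 image.jpg > image.b64   # GNU coreutils (Linux); -w 0 disables line wrapping
base64 -i image.jpg -o image.b64    # BSD/macOS variant

The resulting single-line string is what you paste into the request below.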

To get the model's understanding of an image, do the following:

Inference
curl http://127.0.0.1:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4-vision-preview",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What’s in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "<base64>"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

If the base64 string is too long and causes errors, consider using Postman as an alternative.
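Alternatively, you can keep the JSON payload in a file and let curl read it with -d @file, which sidesteps shell argument-length limits. A minimal sketch, assuming the request body above is saved as vision_request.json (a placeholder name):

curl http://127.0.0.1:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @vision_request.json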