Quickstart

Step 1: Install Nitro

Download and install Nitro on your system.

From the release page

You can download a pre-built binary compatible with your system from the

Nitro Release Page

After you have downloaded the binary, you can run it directly with "./nitro".
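
On Linux or macOS, the unpack-and-run sequence looks roughly like this. The archive name below is a placeholder, not the real asset name; substitute the exact file you downloaded from the release page:

Example install (placeholder archive name)
tar -xzf nitro-<version>-<os>-<arch>.tar.gz
chmod +x ./nitro
./nitro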

If you want to build from source rather than use a pre-built binary, see: Build from Source

Step 2: Download a Model

For this example, we'll use the Llama 2 7B Chat model.

mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
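
The model file is several gigabytes, so the download may take a while. Once it finishes, confirm the file is present and non-empty:

Check the download
ls -lh llama-2-7b-model.gguf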

Step 3: Run the Nitro Server

Run Nitro server
nitro

To check if the Nitro server is running:

Nitro Health Status
curl http://localhost:3928/healthz
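
A healthy server answers with HTTP 200. For scripting, you can have curl print just the status code instead of the response body:

Nitro Health Status (status code only)
curl -s -o /dev/null -w "%{http_code}" http://localhost:3928/healthz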

Step 4: Load the Model

To load the downloaded model into the Nitro server, run the following (adjust "llama_model_path" if your model file is in a different location):

Load model
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "model/llama-2-7b-model.gguf",
    "ctx_len": 512,
    "ngl": 100
  }'
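
Here "ctx_len" sets the context window in tokens, and "ngl" is the number of model layers to offload to the GPU. If you don't have a supported GPU, the same call should work CPU-only with "ngl" set to 0 (a sketch; only the parameter changes):

Load model (CPU only)
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "model/llama-2-7b-model.gguf",
    "ctx_len": 512,
    "ngl": 0
  }'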

Step 5: Make an Inference

Finally, let's chat with the model using Nitro.

Nitro Inference
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ]
  }'
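
Because the endpoint follows OpenAI's chat-completions shape, common request parameters generally carry over. As a sketch, assuming Nitro honors the standard "stream" flag (which its OpenAI-compatible endpoint is intended to), you can request a streamed reply:

Nitro Inference (streaming)
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "stream": true
  }'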

A key benefit of Nitro is its alignment with OpenAI's API structure: the inference call closely mirrors an OpenAI chat-completions request, which makes the switch easier for anyone already building on OpenAI's API.
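
For comparison, the equivalent request against OpenAI's hosted API differs mainly in the base URL, the API key header, and the required "model" field (here $OPENAI_API_KEY is your own key):

OpenAI Inference (for comparison)
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'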