
Load and Unload models

Load model

The loadmodel endpoint in Nitro lets you load a local model into the server. It's an upgrade from llama.cpp, offering more features and customization options.

You can load the model using:

Load Model
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512
  }'

For more details on loading a model, please refer to the [Table of parameters](#table-of-parameters).

Enabling GPU Inference

To enable GPU inference in Nitro, a simple POST request is used. This request will instruct Nitro to load the specified model into the GPU, significantly boosting the inference throughput.

GPU enable
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100
  }'

You can adjust the ngl parameter based on your requirements and GPU capabilities.
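
If the model does not fully fit in your GPU memory, you can offload only part of it by lowering ngl. The request below is an illustrative sketch; the right layer count depends on the model size and the VRAM available on your GPU:

Partial GPU offload
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 32
  }'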

Unload model

To unload a model, use a curl command similar to the one for loading, changing the endpoint to /unloadmodel.

Unload the model
curl http://localhost:3928/inferences/llamacpp/unloadmodel

Status

The modelStatus function provides the current status of the model, including whether it is loaded and its properties. This function offers improved monitoring capabilities compared to llama.cpp.

Check Model Status
curl http://localhost:3928/inferences/llamacpp/modelstatus

If the model loads correctly, the response will be:

Load Model Successfully
{"message":"Model loaded successfully", "code": "ModelloadedSuccessfully"}

If you get an error while loading the model, check that the model path is correct.

Load Model Failed
{"message":"No model loaded", "code": "NoModelLoaded"}

Table of parameters

| Parameter | Type | Description |
|---|---|---|
| llama_model_path | String | The file path to the LLaMA model. |
| ngl | Integer | The number of GPU layers to use. |
| ctx_len | Integer | The context length for the model operations. |
| embedding | Boolean | Whether to use embedding in the model. |
| n_parallel | Integer | The number of parallel operations. |
| cont_batching | Boolean | Whether to use continuous batching. |
| cpu_threads | Integer | The number of threads for CPU inference. |
| user_prompt | String | The prompt to use for the user. |
| ai_prompt | String | The prompt to use for the AI assistant. |
| system_prompt | String | The prompt for system rules. |
| pre_prompt | String | The prompt to use for internal configuration. |
| clean_cache_threshold | Integer | Number of chats that will trigger a clean cache action. |
| grp_attn_n | Integer | Group attention factor in self-extend. |
| grp_attn_w | Integer | Group attention width in self-extend. |
| mlock | Boolean | Prevent the system from swapping the model to disk (macOS). |
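
As an illustration, the request below combines several of these parameters in a single loadmodel call. It is a sketch rather than a recommended configuration; the path, prompt strings, and numeric values are placeholders to adapt to your own model and hardware:

Load Model with extra parameters
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 1,
    "cont_batching": false,
    "cpu_threads": 4,
    "pre_prompt": "A chat between a curious user and an AI assistant.",
    "system_prompt": "SYSTEM: ",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
  }'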