Nitro API

Please see https://nitro.jan.ai/ for documentation.

Chat Completion

Given a list of messages comprising a conversation, the model will return a response.

Create a chat with the model.

Request Body schema: application/json
messages
array

Contains input data or prompts for the model to process

stream
boolean
Default: true

Enables continuous output generation, allowing for streaming of model responses

model
string

Specifies the model being used for inference or processing tasks

max_tokens
number
Default: 2048

The maximum number of tokens the model will generate in a single response

stop
array

Defines specific tokens or phrases at which the model will stop generating further output

frequency_penalty
number
Default: 0

Penalizes tokens in proportion to how often they have already appeared, reducing repeated words and phrases in the output

presence_penalty
number
Default: 0

Penalizes tokens that have already appeared at all, encouraging the model to introduce new topics and concepts

temperature
number
Default: 0.7

Controls the randomness of the model's output

top_p
number
Default: 0.95

Sets the cumulative probability threshold for nucleus (top-p) sampling; only tokens within this probability mass are considered

Responses

Request samples

Content type
application/json
{
  "messages": [],
  "stream": true,
  "model": "gpt-3.5-turbo",
  "max_tokens": 2048,
  "stop": [],
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "temperature": 0.7,
  "top_p": 0.95
}
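
A minimal curl sketch of this request, assuming Nitro serves its OpenAI-compatible chat route at /v1/chat/completions on the default port 3928 (the port used in the health-check sample below); the single user message is illustrative, and "stream": false is set so the reply arrives as one JSON object rather than streamed chunks:

curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "stream": false,
    "messages": [{ "role": "user", "content": "Hello" }]
  }'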

Response samples

Content type
application/json
{
  "choices": [],
  "created": 1700193928,
  "id": "ebwd2niJvJB1Q2Whyvkz",
  "model": "_",
  "object": "chat.completion",
  "system_fingerprint": "_",
  "usage": {}
}

Embeddings

Get a vector representation of a given input.

Creates an embedding vector representing the input text.

Request Body schema: application/json
input
any

Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays.

encoding_format
any

The encoding format for the returned embeddings (e.g., "float")

Responses

Request samples

Content type
application/json
{
  "input": "hello",
  "encoding_format": "float"
}
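
A matching curl sketch, assuming the embeddings route follows the same OpenAI-compatible pattern at /v1/embeddings on port 3928; per the input description above, a JSON array of strings can be passed instead to embed multiple inputs in one request:

curl http://localhost:3928/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{ "input": "hello", "encoding_format": "float" }'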

Response samples

Content type
application/json
{
  "data": [],
  "model": "_",
  "object": "list",
  "usage": {}
}

Health Check

Check the current status of the Nitro server.

Responses

Request samples

curl http://localhost:3928/healthz

Response samples

Content type
application/json
{
  "message": "Nitro is alive!!!"
}

Load Model

Load a model into the Nitro Inference Server.

Request Body schema: application/json
llama_model_path
required
string

Path to your local LLM file.

ngl
number or null [ 0 .. 100 ]
Default: 100

The number of layers to load onto the GPU for acceleration.

ctx_len
number or null
Default: 2048

The context length for model operations; the maximum value depends on the specific model used.

embedding
boolean or null
Default: true

Whether to enable embedding.

cont_batching
boolean or null
Default: false

Whether to use continuous batching.

n_parallel
integer or null
Default: 1

The number of parallel operations. Only set this when continuous batching is enabled.

cpu_threads
integer or null

The number of threads for CPU-based inference.

pre_prompt
string or null
Default: "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what."

The prompt to use for internal configuration.

system_prompt
string or null
Default: "ASSISTANT's RULE:"

The prefix for the system prompt.

user_prompt
string or null
Default: "USER:"

The prefix for the user prompt.

ai_prompt
string or null
Default: "ASSISTANT:"

The prefix for the assistant prompt.

clean_cache_threshold
integer or null
Default: 5

The number of chats that will trigger a cache-clean action.

Responses

Request samples

Content type
application/json
{
  "llama_model_path": "nitro/model/zephyr-7b-beta.Q5_K_M.gguf",
  "ngl": 100,
  "ctx_len": 2048,
  "embedding": true,
  "cont_batching": false,
  "n_parallel": 1,
  "cpu_threads": 4,
  "pre_prompt": "A chat between a curious user and an artificial intelligence assistant. The assistant follows the given rules no matter what.",
  "system_prompt": "ASSISTANT's RULE:",
  "user_prompt": "USER:",
  "ai_prompt": "ASSISTANT:",
  "clean_cache_threshold": 5
}
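
Mirroring the unload sample further below, the load route is presumably /inferences/llamacpp/loadmodel on port 3928; a curl sketch sending a minimal body (llama_model_path is the only required field, the rest fall back to their defaults):

curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H "Content-Type: application/json" \
  -d '{ "llama_model_path": "nitro/model/zephyr-7b-beta.Q5_K_M.gguf" }'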

Response samples

Content type
application/json
{
  "message": "Model loaded successfully",
  "code": "Model loaded successfully"
}

Unload Model

Unload a model from the Nitro Inference Server.

Responses

Request samples

curl http://localhost:3928/inferences/llamacpp/unloadmodel

Response samples

Content type
application/json
{
  "message": "Model unloaded successfully"
}

Status

Check the current status of the model on the Nitro server.

Responses

Request samples

curl http://localhost:3928/inferences/llamacpp/modelstatus

Response samples

Content type
application/json
{
  "model_data": {}
}