obj | repo | website | rev |
---|---|---|---|
application | https://github.com/ollama/ollama | https://ollama.ai | 2024-03-18 |
Ollama
Ollama is a local large language model runner.
CLI
Create a model
ollama create is used to create a model from a Modelfile.
ollama create mymodel -f ./Modelfile
Pull a model
ollama pull llama2
This command can also be used to update a local model; only the diff will be pulled. Models are pulled from ollama.ai.
Remove a model
ollama rm llama2
Copy a model
ollama cp llama2 my-llama2
Multimodal models
>>> What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.
Pass in prompt as arguments
$ ollama run llama2 "Summarize this file: $(cat README.md)"
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
List models on your computer
ollama list
Start Ollama
ollama serve is used when you want to start Ollama without running the desktop application.
API
Generate a completion
POST /api/generate
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.
Parameters
- model: (required) the model name
- prompt: the prompt to generate a response for
- images: (optional) a list of base64-encoded images (for multimodal models such as llava)
Advanced parameters (optional):
- format: the format to return a response in. Currently the only accepted value is json
- options: additional model parameters listed in the documentation for the Modelfile, such as temperature
- system: system message to use (overrides what is defined in the Modelfile)
- template: the prompt template to use (overrides what is defined in the Modelfile)
- context: the context parameter returned from a previous request to /generate; this can be used to keep a short conversational memory
- stream: if false, the response will be returned as a single response object rather than a stream of objects
- raw: if true, no formatting will be applied to the prompt. You may choose to use the raw parameter if you are specifying a full templated prompt in your request to the API
- keep_alive: controls how long the model will stay loaded in memory following the request (default: 5m)
JSON mode
Enable JSON mode by setting the format parameter to json. This will structure the response as a valid JSON object. See the JSON mode example below.
Note: it's important to instruct the model to use JSON in the prompt. Otherwise, the model may generate large amounts of whitespace.
Examples
Generate request (Streaming)
Request
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?"
}'
Response
A stream of JSON objects is returned:
{
"model": "llama2",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"response": "The",
"done": false
}
The final response in the stream also includes additional data about the generation:
- total_duration: total time spent in nanoseconds generating the response
- load_duration: time spent in nanoseconds loading the model
- prompt_eval_count: number of tokens in the prompt
- prompt_eval_duration: time spent in nanoseconds evaluating the prompt
- eval_count: number of tokens in the response
- eval_duration: time spent in nanoseconds generating the response
- context: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory
- response: empty if the response was streamed; if not streamed, this will contain the full response
To calculate how fast the response is generated in tokens per second (token/s), divide eval_count by eval_duration and multiply by 10^9, since eval_duration is reported in nanoseconds.
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "",
"done": true,
"context": [1, 2, 3],
"total_duration": 10706818083,
"load_duration": 6338219291,
"prompt_eval_count": 26,
"prompt_eval_duration": 130079000,
"eval_count": 259,
"eval_duration": 4232710000
}
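Because the durations are reported in nanoseconds, a tokens-per-second figure needs a conversion to seconds. The following is a minimal sketch using the field names from the final response object above; the helper name `tokens_per_second` is hypothetical and not part of Ollama.

```python
def tokens_per_second(final: dict) -> float:
    """Return generation speed from the final response object of a request."""
    eval_count = final["eval_count"]            # tokens in the response
    eval_duration_ns = final["eval_duration"]   # nanoseconds spent generating
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Convert nanoseconds to seconds before dividing.
    return eval_count / (eval_duration_ns / 1e9)


# Values copied from the final response object shown above.
final = {"eval_count": 259, "eval_duration": 4232710000}
print(f"{tokens_per_second(final):.1f} tokens/s")
```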
Request (No streaming)
Request
A response can be received in one reply when streaming is off.
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Response
If stream is set to false, the response will be a single JSON object:
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The sky is blue because it is the color of the sky.",
"done": true,
"context": [1, 2, 3],
"total_duration": 5043500667,
"load_duration": 5025959,
"prompt_eval_count": 26,
"prompt_eval_duration": 325953000,
"eval_count": 290,
"eval_duration": 4709213000
}
Request (JSON mode)
When format is set to json, the output will always be a well-formed JSON object. It's important to also instruct the model to respond in JSON.
Request
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "What color is the sky at different times of the day? Respond using JSON",
"format": "json",
"stream": false
}'
Response
{
"model": "llama2",
"created_at": "2023-11-09T21:07:55.186497Z",
"response": "{\n\"morning\": {\n\"color\": \"blue\"\n},\n\"noon\": {\n\"color\": \"blue-gray\"\n},\n\"afternoon\": {\n\"color\": \"warm gray\"\n},\n\"evening\": {\n\"color\": \"orange\"\n}\n}\n",
"done": true,
"context": [1, 2, 3],
"total_duration": 4648158584,
"load_duration": 4071084,
"prompt_eval_count": 36,
"prompt_eval_duration": 439038000,
"eval_count": 180,
"eval_duration": 4196918000
}
The value of response
will be a string containing JSON similar to:
{
"morning": {
"color": "blue"
},
"noon": {
"color": "blue-gray"
},
"afternoon": {
"color": "warm gray"
},
"evening": {
"color": "orange"
}
}
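Since the `response` field holds JSON as a string, a client has to decode twice: once for the API reply and once for the model output. The sketch below simulates a reply locally rather than calling a server; the helper name `parse_json_mode` is hypothetical.

```python
import json


def parse_json_mode(body: str) -> dict:
    """Decode the API reply, then decode the JSON string held in `response`."""
    outer = json.loads(body)
    return json.loads(outer["response"])


# Simulated reply: `response` is itself a JSON-encoded string.
inner = {"morning": {"color": "blue"}, "evening": {"color": "orange"}}
body = json.dumps({"model": "llama2", "response": json.dumps(inner), "done": True})

colors = parse_json_mode(body)
print(colors["evening"]["color"])
```

If the model was not instructed to answer in JSON, the inner decode may fail, so real code would likely wrap it in a `try`/`except json.JSONDecodeError`.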
Request (with images)
To submit images to multimodal models such as llava or bakllava, provide a list of base64-encoded images:
Request
curl http://localhost:11434/api/generate -d '{
"model": "llava",
"prompt":"What is in this picture?",
"stream": false,
"images": ["base64..."]
}'
Response
{
"model": "llava",
"created_at": "2023-11-03T15:36:02.583064Z",
"response": "A happy cartoon character, which is cute and cheerful.",
"done": true,
"context": [1, 2, 3],
"total_duration": 2938432250,
"load_duration": 2559292,
"prompt_eval_count": 1,
"prompt_eval_duration": 2195557000,
"eval_count": 44,
"eval_duration": 736432000
}
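The `"base64..."` placeholder in the request stands for the base64 encoding of the raw image bytes. A minimal sketch of building such a request body follows; the helper name `build_image_request` and the placeholder bytes are assumptions for illustration, and in practice the bytes would come from reading an image file.

```python
import base64
import json


def build_image_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """Return a JSON request body with one base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "images": [encoded],
    }
    return json.dumps(payload)


# Placeholder bytes stand in for the contents of a real image file.
request_body = build_image_request("llava", "What is in this picture?", b"\x89PNG placeholder")
print(request_body)
```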
Request (Raw Mode)
In some cases, you may wish to bypass the templating system and provide a full prompt. In this case, you can use the raw parameter to disable templating. Also note that raw mode will not return a context.
Request
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "[INST] why is the sky blue? [/INST]",
"raw": true,
"stream": false
}'
Response
{
"model": "mistral",
"created_at": "2023-11-03T15:36:02.583064Z",
"response": " The sky appears blue because of a phenomenon called Rayleigh scattering.",
"done": true,
"total_duration": 8493852375,
"load_duration": 6589624375,
"prompt_eval_count": 14,
"prompt_eval_duration": 119039000,
"eval_count": 110,
"eval_duration": 1779061000
}
Generate request (With options)
If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the options parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override.
Request
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9,
"tfs_z": 0.5,
"typical_p": 0.7,
"repeat_last_n": 33,
"temperature": 0.8,
"repeat_penalty": 1.2,
"presence_penalty": 1.5,
"frequency_penalty": 1.0,
"mirostat": 1,
"mirostat_tau": 0.8,
"mirostat_eta": 0.6,
"penalize_newline": true,
"stop": ["\n", "user:"],
"numa": false,
"num_ctx": 1024,
"num_batch": 2,
"num_gqa": 1,
"num_gpu": 1,
"main_gpu": 0,
"low_vram": false,
"f16_kv": true,
"vocab_only": false,
"use_mmap": true,
"use_mlock": false,
"embedding_only": false,
"rope_frequency_base": 1.1,
"rope_frequency_scale": 0.8,
"num_thread": 8
}
}'
Response
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The sky is blue because it is the color of the sky.",
"done": true,
"context": [1, 2, 3],
"total_duration": 4935886791,
"load_duration": 534986708,
"prompt_eval_count": 26,
"prompt_eval_duration": 107345000,
"eval_count": 237,
"eval_duration": 4289432000
}
Load a model
If an empty prompt is provided, the model will be loaded into memory.
Request
curl http://localhost:11434/api/generate -d '{
"model": "llama2"
}'
Response
A single JSON object is returned:
{
"model": "llama2",
"created_at": "2023-12-18T19:52:07.071755Z",
"response": "",
"done": true
}
Generate a chat completion
POST /api/chat
Generate the next message in a chat with a provided model. This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using "stream": false. The final response object will include statistics and additional data from the request.
Parameters
- model: (required) the model name
- messages: the messages of the chat; this can be used to keep a chat memory
The message object has the following fields:
- role: the role of the message, either system, user or assistant
- content: the content of the message
- images (optional): a list of images to include in the message (for multimodal models such as llava)
Advanced parameters (optional):
- format: the format to return a response in. Currently the only accepted value is json
- options: additional model parameters listed in the documentation for the Modelfile, such as temperature
- template: the prompt template to use (overrides what is defined in the Modelfile)
- stream: if false, the response will be returned as a single response object rather than a stream of objects
- keep_alive: controls how long the model will stay loaded in memory following the request (default: 5m)
Examples
Chat Request (Streaming)
Request
Send a chat message with a streaming response.
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
]
}'
Response
A stream of JSON objects is returned:
{
"model": "llama2",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"message": {
"role": "assistant",
"content": "The",
"images": null
},
"done": false
}
Final response:
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"done": true,
"total_duration": 4883583458,
"load_duration": 1334875,
"prompt_eval_count": 26,
"prompt_eval_duration": 342546000,
"eval_count": 282,
"eval_duration": 4535599000
}
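A client usually has to stitch the streamed pieces back together. The sketch below assumes the stream arrives as one JSON object per line (an assumption about the wire format, not stated above) and works on sample lines instead of a live connection; `collect_chat_stream` is a hypothetical helper.

```python
import json


def collect_chat_stream(lines):
    """Join the message content of streamed chat objects.

    Returns the full text and the final object (the one with done == true).
    """
    parts = []
    final = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        # The final object may carry no message, so read it defensively.
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            final = chunk
            break
    return "".join(parts), final


sample = [
    '{"model": "llama2", "message": {"role": "assistant", "content": "The"}, "done": false}',
    '{"model": "llama2", "message": {"role": "assistant", "content": " sky"}, "done": false}',
    '{"model": "llama2", "done": true, "eval_count": 282}',
]
text, stats = collect_chat_stream(sample)
print(text)
```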
Chat request (No streaming)
Request
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
],
"stream": false
}'
Response
{
"model": "registry.ollama.ai/library/llama2:latest",
"created_at": "2023-12-12T14:13:43.416799Z",
"message": {
"role": "assistant",
"content": "Hello! How are you today?"
},
"done": true,
"total_duration": 5191566416,
"load_duration": 2154458,
"prompt_eval_count": 26,
"prompt_eval_duration": 383809000,
"eval_count": 298,
"eval_duration": 4799921000
}
Chat request (With History)
Send a chat message with a conversation history. You can use this same approach to start the conversation using multi-shot or chain-of-thought prompting.
Request
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
},
{
"role": "assistant",
"content": "due to rayleigh scattering."
},
{
"role": "user",
"content": "how is that different than mie scattering?"
}
]
}'
Response
A stream of JSON objects is returned:
{
"model": "llama2",
"created_at": "2023-08-04T08:52:19.385406455-07:00",
"message": {
"role": "assistant",
"content": "The"
},
"done": false
}
Final response:
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"done": true,
"total_duration": 8113331500,
"load_duration": 6396458,
"prompt_eval_count": 61,
"prompt_eval_duration": 398801000,
"eval_count": 468,
"eval_duration": 7701267000
}
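Keeping a conversation going amounts to resending the earlier messages together with the new one. A minimal sketch of that bookkeeping, with hypothetical helper names `add_turn` and `next_request`:

```python
def add_turn(history, user_text, assistant_text):
    """Return a new history with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


def next_request(model, history, user_text):
    """Build the body for the next chat call, including prior turns."""
    messages = history + [{"role": "user", "content": user_text}]
    return {"model": model, "messages": messages}


history = add_turn([], "why is the sky blue?", "due to rayleigh scattering.")
body = next_request("llama2", history, "how is that different than mie scattering?")
print([m["role"] for m in body["messages"]])
```

Both helpers return new lists instead of mutating their input, so the same history can be reused to try different follow-up questions.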
Chat request (with images)
Request
Send a chat message that includes an image.
curl http://localhost:11434/api/chat -d '{
"model": "llava",
"messages": [
{
"role": "user",
"content": "what is in this image?",
"images": ["base64..."]
}
]
}'
Response
{
"model": "llava",
"created_at": "2023-12-13T22:42:50.203334Z",
"message": {
"role": "assistant",
"content": " The image features a cute, little pig with an angry facial expression. It's wearing a heart on its shirt and is waving in the air. This scene appears to be part of a drawing or sketching project.",
"images": null
},
"done": true,
"total_duration": 1668506709,
"load_duration": 1986209,
"prompt_eval_count": 26,
"prompt_eval_duration": 359682000,
"eval_count": 83,
"eval_duration": 1303285000
}
Create a Model
POST /api/create
Create a model from a Modelfile. It is recommended to set modelfile to the content of the Modelfile rather than just setting path; this is a requirement for remote create. For remote model creation, any file blobs referenced by fields such as FROM and ADAPTER must also be created explicitly on the server using Create a Blob, with the field's value set to the path indicated in the response.
Parameters
- name: name of the model to create
- modelfile (optional): contents of the Modelfile
- stream: (optional) if false, the response will be returned as a single response object rather than a stream of objects
- path (optional): path to the Modelfile
Examples
Create a new model
Create a new model from a Modelfile.
Request
curl http://localhost:11434/api/create -d '{
"name": "mario",
"modelfile": "FROM llama2\nSYSTEM You are mario from Super Mario Bros."
}'
Response
A stream of JSON objects. Notice that the final JSON object shows a "status": "success".
{"status":"reading model metadata"}
{"status":"creating system layer"}
{"status":"using already created layer sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2"}
{"status":"using already created layer sha256:8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b"}
{"status":"using already created layer sha256:7c23fb36d80141c4ab8cdbb61ee4790102ebd2bf7aeff414453177d4f2110e5d"}
{"status":"using already created layer sha256:2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988"}
{"status":"using already created layer sha256:2759286baa875dc22de5394b4a925701b1896a7e3f8e53275c36f75a877a82c9"}
{"status":"writing layer sha256:df30045fe90f0d750db82a058109cecd6d4de9c90a3d75b19c09e5f64580bb42"}
{"status":"writing layer sha256:f18a68eb09bf925bb1b669490407c1b1251c5db98dc4d3d81f3088498ea55690"}
{"status":"writing manifest"}
{"status":"success"}
Check if a Blob Exists
HEAD /api/blobs/:digest
Ensures that the file blob used for a FROM or ADAPTER field exists on the server. This is checking your Ollama server and not Ollama.ai.
Query Parameters
- digest: the SHA256 digest of the blob
Examples
Request
curl -I http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2
Response
Return 200 OK if the blob exists, 404 Not Found if it does not.
Create a Blob
POST /api/blobs/:digest
Create a blob from a file on the server. Returns the server file path.
Query Parameters
- digest: the expected SHA256 digest of the file
Examples
Request
curl -T model.bin -X POST http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2
Response
Return 201 Created if the blob was successfully created, 400 Bad Request if the digest used is not expected.
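The `:digest` part of the blob URLs is the SHA256 of the file contents, prefixed with `sha256:` as in the examples above. A minimal sketch of computing it in chunks, so large model files need not fit in memory; `blob_digest` is a hypothetical helper and the in-memory stream stands in for an opened file.

```python
import hashlib
import io


def blob_digest(stream, chunk_size=1 << 20):
    """Return 'sha256:<hex>' for the bytes read from a binary stream."""
    hasher = hashlib.sha256()
    # Read until the stream returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


# An in-memory stream stands in for open("model.bin", "rb").
digest = blob_digest(io.BytesIO(b"example model bytes"))
print(f"/api/blobs/{digest}")
```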
List Local Models
GET /api/tags
List models that are available locally.
Examples
Request
curl http://localhost:11434/api/tags
Response
A single JSON object will be returned.
{
"models": [
{
"name": "codellama:13b",
"modified_at": "2023-11-04T14:56:49.277302595-07:00",
"size": 7365960935,
"digest": "9f438cb9cd581fc025612d27f7c1a6669ff83a8bb0ed86c94fcf4c5440555697",
"details": {
"format": "gguf",
"family": "llama",
"families": null,
"parameter_size": "13B",
"quantization_level": "Q4_0"
}
},
{
"name": "llama2:latest",
"modified_at": "2023-12-07T09:32:18.757212583-08:00",
"size": 3825819519,
"digest": "fe938a131f40e6f6d40083c9f0f430a515233eb2edaa6d72eb85c50d64f2300e",
"details": {
"format": "gguf",
"family": "llama",
"families": null,
"parameter_size": "7B",
"quantization_level": "Q4_0"
}
}
]
}
Show Model Information
POST /api/show
Show information about a model including details, modelfile, template, parameters, license, and system prompt.
Parameters
- name: name of the model to show
Examples
Request
curl http://localhost:11434/api/show -d '{
"name": "llama2"
}'
Response
{
"modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE \"\"\"{{ .System }}\nUSER: {{ .Prompt }}\nASSSISTANT: \"\"\"\nPARAMETER num_ctx 4096\nPARAMETER stop \"\u003c/s\u003e\"\nPARAMETER stop \"USER:\"\nPARAMETER stop \"ASSSISTANT:\"",
"parameters": "num_ctx 4096\nstop \u003c/s\u003e\nstop USER:\nstop ASSSISTANT:",
"template": "{{ .System }}\nUSER: {{ .Prompt }}\nASSSISTANT: ",
"details": {
"format": "gguf",
"family": "llama",
"families": ["llama", "clip"],
"parameter_size": "7B",
"quantization_level": "Q4_0"
}
}
Copy a Model
POST /api/copy
Copy a model. Creates a model with another name from an existing model.
Examples
Request
curl http://localhost:11434/api/copy -d '{
"source": "llama2",
"destination": "llama2-backup"
}'
Response
Returns a 200 OK if successful, or a 404 Not Found if the source model doesn't exist.
Delete a Model
DELETE /api/delete
Delete a model and its data.
Parameters
- name: model name to delete
Examples
Request
curl -X DELETE http://localhost:11434/api/delete -d '{
"name": "llama2:13b"
}'
Response
Returns a 200 OK if successful, 404 Not Found if the model to be deleted doesn't exist.
Pull a Model
POST /api/pull
Download a model from the ollama library. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress.
Parameters
- name: name of the model to pull
- insecure: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
- stream: (optional) if false, the response will be returned as a single response object rather than a stream of objects
Examples
Request
curl http://localhost:11434/api/pull -d '{
"name": "llama2"
}'
Response
If stream is not specified, or set to true, a stream of JSON objects is returned.
The first object is the manifest:
{ "status": "pulling manifest" }
Then there is a series of downloading responses. Until some of the download has completed, the completed key may not be included. The number of files to be downloaded depends on the number of layers specified in the manifest.
{
"status": "downloading digestname",
"digest": "digestname",
"total": 2142590208,
"completed": 241970
}
After all the files are downloaded, the final responses are:
{ "status": "verifying sha256 digest" }
{ "status": "writing manifest" }
{ "status": "removing any unused layers" }
{ "status": "success" }
If stream is set to false, then the response is a single JSON object:
{ "status": "success" }
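The `total` and `completed` keys of the downloading responses are enough to drive a progress display. A minimal sketch, treating a missing `completed` key as zero as described above; `pull_progress` is a hypothetical helper.

```python
def pull_progress(status: dict):
    """Return percent complete for a download status object, or None."""
    total = status.get("total")
    if not total:
        return None  # objects such as {"status": "pulling manifest"} have no total
    completed = status.get("completed", 0)  # key may be absent early on
    return 100.0 * completed / total


print(pull_progress({"status": "pulling manifest"}))
print(pull_progress({"status": "downloading digestname", "total": 200, "completed": 50}))
```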
Push a Model
POST /api/push
Upload a model to a model library. Requires registering for ollama.ai and adding a public key first.
Parameters
- name: name of the model to push, in the form of <namespace>/<model>:<tag>
- insecure: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.
- stream: (optional) if false, the response will be returned as a single response object rather than a stream of objects
Examples
Request
curl http://localhost:11434/api/push -d '{
"name": "mattw/pygmalion:latest"
}'
Response
If stream is not specified, or set to true, a stream of JSON objects is returned:
{ "status": "retrieving manifest" }
and then:
{
"status": "starting upload",
"digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab",
"total": 1928429856
}
Then there is a series of uploading responses:
{
"status": "starting upload",
"digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab",
"total": 1928429856
}
Finally, when the upload is complete:
{"status":"pushing manifest"}
{"status":"success"}
If stream is set to false, then the response is a single JSON object:
{ "status": "success" }
Generate Embeddings
POST /api/embeddings
Generate embeddings from a model
Parameters
- model: name of model to generate embeddings from
- prompt: text to generate embeddings for
Advanced parameters:
- options: additional model parameters listed in the documentation for the Modelfile, such as temperature
- keep_alive: controls how long the model will stay loaded in memory following the request (default: 5m)
Examples
Request
curl http://localhost:11434/api/embeddings -d '{
"model": "llama2",
"prompt": "Here is an article about llamas..."
}'
Response
{
"embedding": [
0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313,
0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281
]
}
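A common use of embedding vectors is comparing two texts by cosine similarity. This is a general technique, not something the endpoint does for you; the sketch below is plain Python with a hypothetical helper name and made-up short vectors in place of real `embedding` arrays.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors; real ones would come from the "embedding" field.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```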
Modelfile
A model file is the blueprint to create and share models with Ollama.
Format
The format of the Modelfile:
# comment
INSTRUCTION arguments
Instruction | Description |
---|---|
FROM (required) | Defines the base model to use. |
PARAMETER | Sets the parameters for how Ollama will run the model. |
TEMPLATE | The full prompt template to be sent to the model. |
SYSTEM | Specifies the system message that will be set in the template. |
ADAPTER | Defines the (Q)LoRA adapters to apply to the model. |
LICENSE | Specifies the legal license. |
MESSAGE | Specifies the message history. |
Basic Modelfile
An example of a Modelfile creating a mario blueprint:
FROM llama2
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096
# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.
To use this:
- Save it as a file (e.g. Modelfile)
- ollama create choose-a-model-name -f <location of the file e.g. ./Modelfile>
- ollama run choose-a-model-name
- Start using the model!
FROM (Required)
The FROM instruction defines the base model to use when creating a model.
FROM <model name>:<tag>
Build from llama2
FROM llama2
Build from a bin file
FROM ./ollama-model.bin
This bin file location should be specified as an absolute path or relative to the Modelfile location.
PARAMETER
The PARAMETER instruction defines a parameter that can be set when the model is run.
PARAMETER <parameter> <parametervalue>
Valid Parameters and Values
Parameter | Description | Value Type | Example Usage |
---|---|---|---|
mirostat | Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | int | mirostat 0 |
mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1) | float | mirostat_eta 0.1 |
mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0) | float | mirostat_tau 5.0 |
num_ctx | Sets the size of the context window used to generate the next token. (Default: 2048) | int | num_ctx 4096 |
num_gqa | The number of GQA groups in the transformer layer. Required for some models, for example it is 8 for llama2:70b | int | num_gqa 1 |
num_gpu | The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. | int | num_gpu 50 |
num_thread | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). | int | num_thread 8 |
repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 |
repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 |
temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 |
seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a Modelfile. | string | stop "AI assistant:" |
tfs_z | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1) | float | tfs_z 1 |
num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 |
top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9) | float | top_p 0.9 |
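When parameters are kept in code, the corresponding Modelfile lines can be generated from a dictionary. The sketch below is illustrative only: `render_modelfile` is a hypothetical helper, it emits one `PARAMETER stop` line per stop sequence as the table describes, and it does not escape quotes inside string values.

```python
def render_modelfile(base, parameters, system=None):
    """Return Modelfile text with FROM, PARAMETER and optional SYSTEM lines."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        # A list means the parameter is repeated, e.g. several stop sequences.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, str):
                item = '"' + item + '"'  # quote strings, as in: stop "AI assistant:"
            lines.append(f"PARAMETER {name} {item}")
    if system:
        lines.append(f"SYSTEM {system}")
    return "\n".join(lines)


modelfile = render_modelfile(
    "llama2",
    {"temperature": 1, "num_ctx": 4096, "stop": ["AI assistant:", "USER:"]},
    system="You are Mario from super mario bros, acting as an assistant.",
)
print(modelfile)
```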
TEMPLATE
The TEMPLATE instruction defines the full prompt template to be passed into the model. It may optionally include a system message and a user's prompt. This is used to create a full custom prompt, and the syntax may be model specific. You can usually find the template for a given model in the readme for that model.
Template Variables
Variable | Description |
---|---|
{{ .System }} | The system message used to specify custom behavior; this must also be set in the Modelfile as an instruction. |
{{ .Prompt }} | The incoming prompt; this is not specified in the Modelfile and will be set based on input. |
{{ .Response }} | The response from the LLM; if not specified, the response is appended to the end of the template. |
{{ .First }} | A boolean value used to render specific template information for the first generation of a session. |
TEMPLATE """
{{- if .First }}
### System:
{{ .System }}
{{- end }}
### User:
{{ .Prompt }}
### Response:
"""
SYSTEM """<system message>"""
SYSTEM
The SYSTEM instruction specifies the system message to be used in the template, if applicable.
SYSTEM """<system message>"""
ADAPTER
The ADAPTER instruction specifies the LoRA adapter to apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile, and the file must be in a GGML file format. The adapter should be tuned from the base model; otherwise the behaviour is undefined.
ADAPTER ./ollama-lora.bin
LICENSE
The LICENSE instruction allows you to specify the legal license under which the model used with this Modelfile is shared or distributed.
LICENSE """
<license text>
"""
MESSAGE
The MESSAGE instruction allows you to specify a message history for the model to use when responding:
MESSAGE user Is Toronto in Canada?
MESSAGE assistant yes
MESSAGE user Is Sacramento in Canada?
MESSAGE assistant no
MESSAGE user Is Ontario in Canada?
MESSAGE assistant yes
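If a conversation already exists as a list of role/content pairs, the matching MESSAGE lines can be generated from it. A minimal sketch with a hypothetical helper name; it only handles single-line content, which is all the example above shows.

```python
def render_message_history(history):
    """Return one 'MESSAGE <role> <content>' line per history entry."""
    lines = []
    for message in history:
        content = message["content"]
        if "\n" in content:
            raise ValueError("this sketch only handles single-line messages")
        lines.append(f"MESSAGE {message['role']} {content}")
    return "\n".join(lines)


history = [
    {"role": "user", "content": "Is Toronto in Canada?"},
    {"role": "assistant", "content": "yes"},
]
print(render_message_history(history))
```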
OpenAI Compatibility
Ollama now has built-in compatibility with the OpenAI Chat Completions API, making it possible to use more tooling and applications with Ollama locally.
Usage
OpenAI Python library
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    # required but ignored
    api_key='ollama',
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='llama2',
)
OpenAI JavaScript library
import OpenAI from 'openai'

const openai = new OpenAI({
  baseURL: 'http://localhost:11434/v1/',
  // required but ignored
  apiKey: 'ollama',
})

const chatCompletion = await openai.chat.completions.create({
  messages: [{ role: 'user', content: 'Say this is a test' }],
  model: 'llama2',
})
curl
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
Model names
For tooling that relies on default OpenAI model names such as gpt-3.5-turbo, use ollama cp to copy an existing model name to a temporary name:
ollama cp llama2 gpt-3.5-turbo
Afterwards, this new model name can be specified in the model field:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
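The same request can be made without the OpenAI client library. The sketch below builds it with Python's standard `urllib.request` and only sends it if you uncomment the last lines, since that requires a running server; `build_chat_request` is a hypothetical helper, and the URL is the one used in the curl example.

```python
import json
import urllib.request


def build_chat_request(model, messages, base_url="http://localhost:11434/v1"):
    """Return a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("gpt-3.5-turbo", [{"role": "user", "content": "Hello!"}])
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as reply:
#     print(json.load(reply))
```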
Libraries & Applications
Docker
Run Ollama with Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
To run Ollama with the GPU, either use a rocm tagged Docker image or the NVIDIA container runtime:
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama