`ollama create` is used to create a model from a Modelfile.
```shell
ollama create mymodel -f ./Modelfile
```
### Pull a model
```shell
ollama pull llama2
```
> This command can also be used to update a local model; only the diff will be pulled. Models are downloaded from ollama.ai.
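Models can also be pulled by tag using the `<model>:<tag>` form; for example (the tag here is illustrative):
```shell
ollama pull llama2:13b
```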
### Remove a model
```shell
ollama rm llama2
```
### Copy a model
```shell
ollama cp llama2 my-llama2
```
### Multimodal models
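To reach the prompt shown below, run a multimodal model interactively and include an image path in the prompt. A minimal sketch (the model name is one of the multimodal models mentioned later, such as `llava`):
```shell
ollama run llava
```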
```
>>> What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.
```
### Pass in prompt as arguments
```shell
$ ollama run llama2 "Summarize this file: $(cat README.md)"
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.
```
### List models on your computer
```shell
ollama list
```
### Start Ollama
`ollama serve` is used when you want to start ollama without running the desktop application.
## API
### Generate a completion
```http
POST /api/generate
```
Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request.
#### Parameters
- `model`: (required) the model name
- `prompt`: the prompt to generate a response for
- `images`: (optional) a list of [base64](../../files/Base64.md)-encoded images (for multimodal models such as `llava`)
Advanced parameters (optional):
- `format`: the format to return a response in. Currently the only accepted value is `json`
- `options`: additional model parameters listed in the documentation for the Modelfile, such as `temperature`
- `system`: system message (overrides what is defined in the `Modelfile`)
- `template`: the prompt template to use (overrides what is defined in the `Modelfile`)
- `context`: the context parameter returned from a previous request to `/generate`; this can be used to keep a short conversational memory
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
- `raw`: if `true` no formatting will be applied to the prompt. You may choose to use the `raw` parameter if you are specifying a full templated prompt in your request to the API
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`); see the sketch after this list
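For instance, `keep_alive` accepts a duration string like the `5m` default. A sketch of a request that keeps the model loaded for an hour afterwards (the `1h` value is an assumed example, not taken from the original docs):
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"keep_alive": "1h"
}'
```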
##### JSON mode
Enable [JSON](../../files/JSON.md) mode by setting the `format` parameter to `json`. This will structure the response as a valid [JSON](../../files/JSON.md) object. See the JSON mode example below.
> Note: it's important to instruct the model to use [JSON](../../files/JSON.md) in the `prompt`. Otherwise, the model may generate large amounts of whitespace.
#### Examples
##### Generate request (Streaming)
###### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?"
}'
```
###### Response
A stream of [JSON](../../files/JSON.md) objects is returned:
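Each object in the stream carries a fragment of the generated text and has `"done"` set to `false` until generation finishes; an illustrative chunk (values are placeholders):
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The",
"done": false
}
```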
The final response in the stream also includes additional data about the generation:
- `total_duration`: time spent generating the response
- `load_duration`: time spent in nanoseconds loading the model
- `prompt_eval_count`: number of tokens in the prompt
- `prompt_eval_duration`: time spent in nanoseconds evaluating the prompt
- `eval_count`: number of tokens in the response
- `eval_duration`: time in nanoseconds spent generating the response
- `context`: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory
- `response`: empty if the response was streamed; if not streamed, this will contain the full response
To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` by `eval_duration` and multiply by `10^9` (since `eval_duration` is in nanoseconds). For the sample response below, that is 259 / 4232710000 × 10^9 ≈ 61 token/s.
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "",
"done": true,
"context": [1, 2, 3],
"total_duration": 10706818083,
"load_duration": 6338219291,
"prompt_eval_count": 26,
"prompt_eval_duration": 130079000,
"eval_count": 259,
"eval_duration": 4232710000
}
```
##### Request (No streaming)
###### Request
A response can be received in one reply when streaming is off.
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false
}'
```
###### Response
If `stream` is set to `false`, the response will be a single [JSON](../../files/JSON.md) object:
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The sky is blue because it is the color of the sky.",
"done": true,
"context": [1, 2, 3],
"total_duration": 5043500667,
"load_duration": 5025959,
"prompt_eval_count": 26,
"prompt_eval_duration": 325953000,
"eval_count": 290,
"eval_duration": 4709213000
}
```
##### Request (JSON mode)
> When `format` is set to `json`, the output will always be a well-formed [JSON](../../files/JSON.md) object. It's important to also instruct the model to respond in [JSON](../../files/JSON.md).
###### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "What color is the sky at different times of the day? Respond using JSON",
The value of `response` will be a string containing [JSON](../../files/JSON.md) similar to:
```json
{
"morning": {
"color": "blue"
},
"noon": {
"color": "blue-gray"
},
"afternoon": {
"color": "warm gray"
},
"evening": {
"color": "orange"
}
}
```
##### Request (with images)
To submit images to multimodal models such as `llava` or `bakllava`, provide a list of [base64](../../files/Base64.md)-encoded `images`:
###### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llava",
"prompt":"What is in this picture?",
"stream": false,
"images": ["base64..."]
}'
```
###### Response
```json
{
"model": "llava",
"created_at": "2023-11-03T15:36:02.583064Z",
"response": "A happy cartoon character, which is cute and cheerful.",
"done": true,
"context": [1, 2, 3],
"total_duration": 2938432250,
"load_duration": 2559292,
"prompt_eval_count": 1,
"prompt_eval_duration": 2195557000,
"eval_count": 44,
"eval_duration": 736432000
}
```
##### Request (Raw Mode)
In some cases, you may wish to bypass the templating system and provide a full prompt. In this case, you can use the `raw` parameter to disable templating. Also note that raw mode will not return a context.
###### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "[INST] why is the sky blue? [/INST]",
"raw": true,
"stream": false
}'
```
###### Response
```json
{
"model": "mistral",
"created_at": "2023-11-03T15:36:02.583064Z",
"response": " The sky appears blue because of a phenomenon called Rayleigh scattering.",
"done": true,
"total_duration": 8493852375,
"load_duration": 6589624375,
"prompt_eval_count": 14,
"prompt_eval_duration": 119039000,
"eval_count": 110,
"eval_duration": 1779061000
}
```
##### Generate request (With options)
If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the `options` parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override.
###### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9,
"tfs_z": 0.5,
"typical_p": 0.7,
"repeat_last_n": 33,
"temperature": 0.8,
"repeat_penalty": 1.2,
"presence_penalty": 1.5,
"frequency_penalty": 1.0,
"mirostat": 1,
"mirostat_tau": 0.8,
"mirostat_eta": 0.6,
"penalize_newline": true,
"stop": ["\n", "user:"],
"numa": false,
"num_ctx": 1024,
"num_batch": 2,
"num_gqa": 1,
"num_gpu": 1,
"main_gpu": 0,
"low_vram": false,
"f16_kv": true,
"vocab_only": false,
"use_mmap": true,
"use_mlock": false,
"embedding_only": false,
"rope_frequency_base": 1.1,
"rope_frequency_scale": 0.8,
"num_thread": 8
}
}'
```
###### Response
```json
{
"model": "llama2",
"created_at": "2023-08-04T19:22:45.499127Z",
"response": "The sky is blue because it is the color of the sky.",
"done": true,
"context": [1, 2, 3],
"total_duration": 4935886791,
"load_duration": 534986708,
"prompt_eval_count": 26,
"prompt_eval_duration": 107345000,
"eval_count": 237,
"eval_duration": 4289432000
}
```
##### Load a model
If an empty prompt is provided, the model will be loaded into memory.
###### Request
```shell
curl http://localhost:11434/api/generate -d '{
"model": "llama2"
}'
```
###### Response
A single JSON object is returned:
```json
{
"model": "llama2",
"created_at": "2023-12-18T19:52:07.071755Z",
"response": "",
"done": true
}
```
### Generate a chat completion
```http
POST /api/chat
```
Generate the next message in a chat with a provided model. This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using `"stream": false`. The final response object will include statistics and additional data from the request.
#### Parameters
- `model`: (required) the model name
- `messages`: the messages of the chat; this can be used to keep a chat memory
The `message` object has the following fields:
- `role`: the role of the message, either `system`, `user` or `assistant`
- `content`: the content of the message
- `images` (optional): a list of images to include in the message (for multimodal models such as `llava`)
Advanced parameters (optional):
- `format`: the format to return a response in. Currently the only accepted value is `json`
- `options`: additional model parameters listed in the documentation for the Modelfile such as `temperature`
- `template`: the prompt template to use (overrides what is defined in the `Modelfile`)
- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects
- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`)
#### Examples
##### Chat request (With History)
Send a chat message with a conversation history. You can use this same approach to start the conversation using multi-shot or chain-of-thought prompting.
###### Request
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
},
{
"role": "assistant",
"content": "due to rayleigh scattering."
},
{
"role": "user",
"content": "how is that different than mie scattering?"
}
]
}'
```
##### Chat request (with images)
Send a chat message that includes one or more images, for multimodal models such as `llava`. Images are supplied as a list of base64-encoded strings in the message's `images` field; a sketch of such a request follows.
"content": " The image features a cute, little pig with an angry facial expression. It's wearing a heart on its shirt and is waving in the air. This scene appears to be part of a drawing or sketching project.",
"images": null
},
"done": true,
"total_duration": 1668506709,
"load_duration": 1986209,
"prompt_eval_count": 26,
"prompt_eval_duration": 359682000,
"eval_count": 83,
"eval_duration": 1303285000
}
```
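As with `/api/generate`, streaming can be disabled for chat by setting `"stream": false`, in which case a single response object is returned. A minimal sketch:
```shell
curl http://localhost:11434/api/chat -d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
],
"stream": false
}'
```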
### Create a Model
```http
POST /api/create
```
Create a model from a `Modelfile`. It is recommended to set `modelfile` to the content of the Modelfile rather than just setting `path`. This is a requirement for remote create. Remote model creation must also explicitly create any file blobs referenced by fields such as `FROM` and `ADAPTER` on the server, using Create a Blob, and set those fields to the path indicated in the response.
#### Parameters
- `name`: name of the model to create
- `modelfile` (optional): contents of the Modelfile
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
- `path` (optional): path to the Modelfile
#### Examples
##### Create a new model
Create a new model from a `Modelfile`.
###### Request
```shell
curl http://localhost:11434/api/create -d '{
"name": "mario",
"modelfile": "FROM llama2\nSYSTEM You are mario from Super Mario Bros."
}'
```
###### Response
A stream of [JSON](../../files/JSON.md) objects. Notice that the final [JSON](../../files/JSON.md) object shows a `"status": "success"`.
```json
{"status":"reading model metadata"}
{"status":"creating system layer"}
{"status":"using already created layer sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2"}
{"status":"using already created layer sha256:8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b"}
{"status":"using already created layer sha256:7c23fb36d80141c4ab8cdbb61ee4790102ebd2bf7aeff414453177d4f2110e5d"}
{"status":"using already created layer sha256:2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988"}
{"status":"using already created layer sha256:2759286baa875dc22de5394b4a925701b1896a7e3f8e53275c36f75a877a82c9"}
{"status":"success"}
```
### Show Model Information
```http
POST /api/show
```
Show information about a model, including details, modelfile, template, parameters, license, and system prompt.
#### Parameters
- `name`: name of the model to show
#### Examples
##### Request
```shell
curl http://localhost:11434/api/show -d '{
"name": "llama2"
}'
```
##### Response
```json
{
"modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE \"\"\"{{ .System }}\nUSER: {{ .Prompt }}\nASSSISTANT: \"\"\"\nPARAMETER num_ctx 4096\nPARAMETER stop \"\u003c/s\u003e\"\nPARAMETER stop \"USER:\"\nPARAMETER stop \"ASSSISTANT:\"",
Returns a 200 OK if successful, 404 Not Found if the model to be deleted doesn't exist.
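A deletion request might look like the following sketch (the model name and tag are illustrative):
```shell
curl -X DELETE http://localhost:11434/api/delete -d '{
"name": "llama2:13b"
}'
```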
### Pull a Model
```http
POST /api/pull
```
Download a model from the ollama library. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress.
#### Parameters
- `name`: name of the model to pull
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
#### Examples
##### Request
```shell
curl http://localhost:11434/api/pull -d '{
"name": "llama2"
}'
```
##### Response
If `stream` is not specified, or set to `true`, a stream of JSON objects is returned:
The first object is the manifest:
```json
{ "status": "pulling manifest" }
```
Then there is a series of downloading responses. Until a download is completed, the `completed` key may not be included. The number of files to be downloaded depends on the number of layers specified in the manifest.
```json
{
"status": "downloading digestname",
"digest": "digestname",
"total": 2142590208,
"completed": 241970
}
```
After all the files are downloaded, the final responses are:
```json
{ "status": "verifying sha256 digest" }
{ "status": "writing manifest" }
{ "status": "removing any unused layers" }
{ "status": "success" }
```
If `stream` is set to `false`, then the response is a single [JSON](../../files/JSON.md) object:
```json
{ "status": "success" }
```
### Push a Model
```http
POST /api/push
```
Upload a model to a model library. Requires registering for ollama.ai and adding a public key first.
#### Parameters
- `name`: name of the model to push, in the form of `<namespace>/<model>:<tag>`
- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development.
- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects
#### Examples
##### Request
```shell
curl http://localhost:11434/api/push -d '{
"name": "mattw/pygmalion:latest"
}'
```
##### Response
If `stream` is not specified, or set to `true`, a stream of [JSON](../../files/JSON.md) objects is returned.
## Modelfile
### PARAMETER
The `PARAMETER` instruction defines a parameter that can be set when the model is run.
| Parameter | Description | Value Type | Example Usage |
| --------- | ----------- | ---------- | ------------- |
| mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1) | float | mirostat_eta 0.1 |
| mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0) | float | mirostat_tau 5.0 |
| num_ctx | Sets the size of the context window used to generate the next token. (Default: 2048) | int | num_ctx 4096 |
| num_gqa | The number of GQA groups in the transformer layer. Required for some models, for example it is 8 for llama2:70b | int | num_gqa 1 |
| num_gpu | The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. | int | num_gpu 50 |
| num_thread | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). | int | num_thread 8 |
| repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 |
| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 |
| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 |
| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 |
| stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" |
| tfs_z | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1) | float | tfs_z 1 |
| num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 |
| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 |
| top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9) | float | top_p 0.9 |
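For example, several of these parameters could be combined in a Modelfile; the values below are illustrative and follow the example-usage column above:
```modelfile
FROM llama2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "AI assistant:"
```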
### TEMPLATE
`TEMPLATE` specifies the full prompt template to be passed into the model. It may optionally include a system message and a user's prompt. This is used to create a full custom prompt, and syntax may be model-specific. You can usually find the template for a given model in that model's readme.
| Variable | Description |
| -------- | ----------- |
| `{{ .System }}` | The system message used to specify custom behavior; this must also be set in the Modelfile as an instruction. |
| `{{ .Prompt }}` | The incoming prompt; this is not specified in the Modelfile and will be set based on input. |
| `{{ .Response }}` | The response from the LLM; if not specified, the response is appended to the end of the template. |
| `{{ .First }}` | A boolean value used to render specific template information for the first generation of a session. |
```modelfile
TEMPLATE """
{{- if .First }}
### System:
{{ .System }}
{{- end }}
### User:
{{ .Prompt }}
### Response:
"""
SYSTEM """<systemmessage>"""
```
### SYSTEM
The `SYSTEM` instruction specifies the system message to be used in the template, if applicable.
```modelfile
SYSTEM """<systemmessage>"""
```
### ADAPTER
The `ADAPTER` instruction specifies the LoRA adapter to apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile, and the file must be in GGML format. The adapter should be tuned from the base model, otherwise the behaviour is undefined.
```modelfile
ADAPTER ./ollama-lora.bin
```
### LICENSE
The `LICENSE` instruction allows you to specify the legal license under which the model used with this Modelfile is shared or distributed.
```modelfile
LICENSE """
<licensetext>
"""
```
### MESSAGE
The `MESSAGE` instruction allows you to specify a message history for the model to use when responding:
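A sketch of what this could look like, assuming a `MESSAGE <role> <content>` form with the chat roles described earlier (the exchange itself is illustrative):
```modelfile
MESSAGE user Is Toronto in Canada?
MESSAGE assistant Yes, Toronto is in Canada.
```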
## OpenAI compatibility
Ollama now has built-in compatibility with the OpenAI Chat Completions API, making it possible to use more tooling and applications with Ollama locally.
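As a sketch of what this enables, an OpenAI-style chat completion request can be pointed at the local Ollama server. This assumes the compatibility layer exposes the standard `/v1/chat/completions` path; the model and message are illustrative:
```shell
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama2",
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
```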