From 4c31a4e960f2ec27842df28ff8724bb9baa6dbfd Mon Sep 17 00:00:00 2001 From: JMARyA Date: Mon, 19 Feb 2024 10:09:18 +0100 Subject: [PATCH] add ollama --- technology/applications/utilities/Ollama.md | 1065 +++++++++++++++++++ 1 file changed, 1065 insertions(+) create mode 100644 technology/applications/utilities/Ollama.md diff --git a/technology/applications/utilities/Ollama.md b/technology/applications/utilities/Ollama.md new file mode 100644 index 0000000..2c61ebc --- /dev/null +++ b/technology/applications/utilities/Ollama.md @@ -0,0 +1,1065 @@ +--- +obj: application +repo: https://github.com/ollama/ollama +website: https://ollama.ai +--- + +# Ollama +Ollama is a local large language model runner. + +## CLI +### Create a model +`ollama create` is used to create a model from a Modelfile. +```shell +ollama create mymodel -f ./Modelfile +``` + +### Pull a model +```shell +ollama pull llama2 +``` + +> This command can also be used to update a local model. Only the diff will be pulled. Models will be pulled from ollama.ai + +### Remove a model +```shell +ollama rm llama2 +``` + +### Copy a model +```shell +ollama cp llama2 my-llama2 +``` + +### Multimodal models +``` +>>> What's in this image? /Users/jmorgan/Desktop/smile.png +The image features a yellow smiley face, which is likely the central focus of the picture. +``` + +### Pass in prompt as arguments +```shell +$ ollama run llama2 "Summarize this file: $(cat README.md)" +Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications. +``` + +### List models on your computer +```shell +ollama list +``` + +### Start Ollama +`ollama serve` is used when you want to start ollama without running the desktop application. 
+ +## API +### Generate a completion +```http +POST /api/generate +``` + +Generate a response for a given prompt with a provided model. This is a streaming endpoint, so there will be a series of responses. The final response object will include statistics and additional data from the request. + +#### Parameters +- `model`: (required) the model name +- `prompt`: the prompt to generate a response for +- `images`: (optional) a list of [base64](../../files/Base64.md)-encoded images (for multimodal models such as `llava`) + +Advanced parameters (optional): +- `format`: the format to return a response in. Currently the only accepted value is `json` +- `options`: additional model parameters listed in the documentation for the Modelfile such as `temperature` +- `system`: system message to use (overrides what is defined in the `Modelfile`) +- `template`: the prompt template to use (overrides what is defined in the `Modelfile`) +- `context`: the context parameter returned from a previous request to `/generate`; this can be used to keep a short conversational memory +- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects +- `raw`: if `true` no formatting will be applied to the prompt. You may choose to use the `raw` parameter if you are specifying a full templated prompt in your request to the API +- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) + +##### JSON mode +Enable [JSON](../../files/JSON.md) mode by setting the `format` parameter to `json`. This will structure the response as a valid [JSON](../../files/JSON.md) object. See the JSON mode example below. + +> Note: it's important to instruct the model to use [JSON](../../files/JSON.md) in the `prompt`. Otherwise, the model may generate large amounts of whitespace. 
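
The parameters above can also be assembled programmatically. The following is a minimal Python sketch, not part of Ollama itself: the helper names `build_generate_payload`, `collect_stream` and `generate` are hypothetical, the server address is the default `localhost:11434` used on this page, and the stream is assumed to arrive as one JSON object per line.

```python
import json
import urllib.request

# Assumption: a local Ollama server on the default port used throughout this page.
OLLAMA_URL = "http://localhost:11434"


def build_generate_payload(model, prompt, stream=True, **advanced):
    """Build the JSON body for POST /api/generate.

    `advanced` carries optional fields such as format, options, system,
    template, context, raw or keep_alive; fields set to None are omitted.
    """
    payload = {"model": model, "prompt": prompt, "stream": stream}
    payload.update({k: v for k, v in advanced.items() if v is not None})
    return payload


def collect_stream(lines):
    """Join the `response` fragments of a stream; return (text, final_object)."""
    parts, final = [], None
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            final = obj  # the last object carries the statistics
            break
    return "".join(parts), final


def generate(model, prompt, **advanced):
    """Hypothetical helper: send the request to a running server and collect the reply."""
    body = json.dumps(build_generate_payload(model, prompt, **advanced)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return collect_stream(reply)
```

With a server running, `generate("llama2", "Why is the sky blue?")` would return the concatenated text together with the final statistics object. Keeping the payload builder and the stream parser separate from the network call makes both easy to check without a server.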
+ +#### Examples +##### Generate request (Streaming) +###### Request +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "llama2", + "prompt": "Why is the sky blue?" +}' +``` + +###### Response +A stream of [JSON](../../files/JSON.md) objects is returned: +```json +{ + "model": "llama2", + "created_at": "2023-08-04T08:52:19.385406455-07:00", + "response": "The", + "done": false +} +``` + +The final response in the stream also includes additional data about the generation: +- `total_duration`: total time in nanoseconds spent handling the request +- `load_duration`: time spent in nanoseconds loading the model +- `prompt_eval_count`: number of tokens in the prompt +- `prompt_eval_duration`: time spent in nanoseconds evaluating the prompt +- `eval_count`: number of tokens in the response +- `eval_duration`: time in nanoseconds spent generating the response +- `context`: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory +- `response`: empty if the response was streamed; if not streamed, this will contain the full response + +To calculate how fast the response is generated in tokens per second (token/s), divide `eval_count` by `eval_duration` and multiply by 10^9, since durations are reported in nanoseconds. +```json +{ + "model": "llama2", + "created_at": "2023-08-04T19:22:45.499127Z", + "response": "", + "done": true, + "context": [1, 2, 3], + "total_duration": 10706818083, + "load_duration": 6338219291, + "prompt_eval_count": 26, + "prompt_eval_duration": 130079000, + "eval_count": 259, + "eval_duration": 4232710000 +} +``` + +##### Request (No streaming) +###### Request +A response can be received in one reply when streaming is off. 
+ +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "llama2", + "prompt": "Why is the sky blue?", + "stream": false +}' +``` + +###### Response +If `stream` is set to `false`, the response will be a single [JSON](../../files/JSON.md) object: + +```json +{ + "model": "llama2", + "created_at": "2023-08-04T19:22:45.499127Z", + "response": "The sky is blue because it is the color of the sky.", + "done": true, + "context": [1, 2, 3], + "total_duration": 5043500667, + "load_duration": 5025959, + "prompt_eval_count": 26, + "prompt_eval_duration": 325953000, + "eval_count": 290, + "eval_duration": 4709213000 +} +``` + +##### Request (JSON mode) + +> When `format` is set to `json`, the output will always be a well-formed [JSON](../../files/JSON.md) object. It's important to also instruct the model to respond in [JSON](../../files/JSON.md). + +###### Request +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "llama2", + "prompt": "What color is the sky at different times of the day? 
Respond using JSON", + "format": "json", + "stream": false +}' +``` + +###### Response +```json +{ + "model": "llama2", + "created_at": "2023-11-09T21:07:55.186497Z", + "response": "{\n\"morning\": {\n\"color\": \"blue\"\n},\n\"noon\": {\n\"color\": \"blue-gray\"\n},\n\"afternoon\": {\n\"color\": \"warm gray\"\n},\n\"evening\": {\n\"color\": \"orange\"\n}\n}\n", + "done": true, + "context": [1, 2, 3], + "total_duration": 4648158584, + "load_duration": 4071084, + "prompt_eval_count": 36, + "prompt_eval_duration": 439038000, + "eval_count": 180, + "eval_duration": 4196918000 +} +``` + +The value of `response` will be a string containing [JSON](../../files/JSON.md) similar to: +```json +{ + "morning": { + "color": "blue" + }, + "noon": { + "color": "blue-gray" + }, + "afternoon": { + "color": "warm gray" + }, + "evening": { + "color": "orange" + } +} +``` + +##### Request (with images) +To submit images to multimodal models such as `llava` or `bakllava`, provide a list of [base64](../../files/Base64.md)-encoded `images`: + +##### Request +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "llava", + "prompt":"What is in this picture?", + "stream": false, + "images": ["base64..."] +}' +``` + +##### Response +```json +{ + "model": "llava", + "created_at": "2023-11-03T15:36:02.583064Z", + "response": "A happy cartoon character, which is cute and cheerful.", + "done": true, + "context": [1, 2, 3], + "total_duration": 2938432250, + "load_duration": 2559292, + "prompt_eval_count": 1, + "prompt_eval_duration": 2195557000, + "eval_count": 44, + "eval_duration": 736432000 +} +``` + +##### Request (Raw Mode) +In some cases, you may wish to bypass the templating system and provide a full prompt. In this case, you can use the `raw` parameter to disable templating. Also note that raw mode will not return a context. + +###### Request +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "mistral", + "prompt": "[INST] why is the sky blue? 
[/INST]", + "raw": true, + "stream": false +}' +``` + +###### Response +```json +{ + "model": "mistral", + "created_at": "2023-11-03T15:36:02.583064Z", + "response": " The sky appears blue because of a phenomenon called Rayleigh scattering.", + "done": true, + "total_duration": 8493852375, + "load_duration": 6589624375, + "prompt_eval_count": 14, + "prompt_eval_duration": 119039000, + "eval_count": 110, + "eval_duration": 1779061000 +} +``` + +##### Generate request (With options) +If you want to set custom options for the model at runtime rather than in the Modelfile, you can do so with the `options` parameter. This example sets every available option, but you can set any of them individually and omit the ones you do not want to override. + +###### Request +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "llama2", + "prompt": "Why is the sky blue?", + "stream": false, + "options": { + "num_keep": 5, + "seed": 42, + "num_predict": 100, + "top_k": 20, + "top_p": 0.9, + "tfs_z": 0.5, + "typical_p": 0.7, + "repeat_last_n": 33, + "temperature": 0.8, + "repeat_penalty": 1.2, + "presence_penalty": 1.5, + "frequency_penalty": 1.0, + "mirostat": 1, + "mirostat_tau": 0.8, + "mirostat_eta": 0.6, + "penalize_newline": true, + "stop": ["\n", "user:"], + "numa": false, + "num_ctx": 1024, + "num_batch": 2, + "num_gqa": 1, + "num_gpu": 1, + "main_gpu": 0, + "low_vram": false, + "f16_kv": true, + "vocab_only": false, + "use_mmap": true, + "use_mlock": false, + "embedding_only": false, + "rope_frequency_base": 1.1, + "rope_frequency_scale": 0.8, + "num_thread": 8 + } +}' +``` + +###### Response +```json +{ + "model": "llama2", + "created_at": "2023-08-04T19:22:45.499127Z", + "response": "The sky is blue because it is the color of the sky.", + "done": true, + "context": [1, 2, 3], + "total_duration": 4935886791, + "load_duration": 534986708, + "prompt_eval_count": 26, + "prompt_eval_duration": 107345000, + "eval_count": 237, + "eval_duration": 4289432000 +} +``` 
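
The statistics in the responses above are easier to read once converted. Because the durations are reported in nanoseconds, a tokens-per-second figure needs a factor of 10^9. The sketch below is only an illustration; the helper names `tokens_per_second` and `duration_summary` are hypothetical and not part of any Ollama library.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(final_response):
    """Return generation speed in tokens/s from a final response object."""
    eval_count = final_response["eval_count"]
    eval_duration_ns = final_response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / eval_duration_ns * NANOSECONDS_PER_SECOND


def duration_summary(final_response):
    """Convert every *_duration field from nanoseconds to seconds."""
    return {
        key: value / NANOSECONDS_PER_SECOND
        for key, value in final_response.items()
        if key.endswith("_duration")
    }
```

Applied to the response directly above (`eval_count` 237, `eval_duration` 4289432000), this gives roughly 55 tokens per second.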
+ +##### Load a model +If an empty prompt is provided, the model will be loaded into memory. + +###### Request +```shell +curl http://localhost:11434/api/generate -d '{ + "model": "llama2" +}' +``` + +###### Response +A single JSON object is returned: + +```json +{ + "model": "llama2", + "created_at": "2023-12-18T19:52:07.071755Z", + "response": "", + "done": true +} +``` + +### Generate a chat completion +```http +POST /api/chat +``` + +Generate the next message in a chat with a provided model. This is a streaming endpoint, so there will be a series of responses. Streaming can be disabled using `"stream": false`. The final response object will include statistics and additional data from the request. + +#### Parameters +- `model`: (required) the model name +- `messages`: the messages of the chat, this can be used to keep a chat memory + +The `message` object has the following fields: +- `role`: the role of the message, either `system`, `user` or `assistant` +- `content`: the content of the message +- `images` (optional): a list of images to include in the message (for multimodal models such as `llava`) + +Advanced parameters (optional): +- `format`: the format to return a response in. Currently the only accepted value is `json` +- `options`: additional model parameters listed in the documentation for the Modelfile such as `temperature` +- `template`: the prompt template to use (overrides what is defined in the `Modelfile`) +- `stream`: if `false` the response will be returned as a single response object, rather than a stream of objects +- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) + +#### Examples +##### Chat Request (Streaming) +###### Request +Send a chat message with a streaming response. + +```shell +curl http://localhost:11434/api/chat -d '{ + "model": "llama2", + "messages": [ + { + "role": "user", + "content": "why is the sky blue?" 
+ } + ] +}' +``` + +###### Response +A stream of JSON objects is returned: + +```json +{ + "model": "llama2", + "created_at": "2023-08-04T08:52:19.385406455-07:00", + "message": { + "role": "assistant", + "content": "The", + "images": null + }, + "done": false +} +``` + +Final response: +```json +{ + "model": "llama2", + "created_at": "2023-08-04T19:22:45.499127Z", + "done": true, + "total_duration": 4883583458, + "load_duration": 1334875, + "prompt_eval_count": 26, + "prompt_eval_duration": 342546000, + "eval_count": 282, + "eval_duration": 4535599000 +} +``` + +##### Chat request (No streaming) +###### Request +```shell +curl http://localhost:11434/api/chat -d '{ + "model": "llama2", + "messages": [ + { + "role": "user", + "content": "why is the sky blue?" + } + ], + "stream": false +}' +``` + +###### Response +```json +{ + "model": "registry.ollama.ai/library/llama2:latest", + "created_at": "2023-12-12T14:13:43.416799Z", + "message": { + "role": "assistant", + "content": "Hello! How are you today?" + }, + "done": true, + "total_duration": 5191566416, + "load_duration": 2154458, + "prompt_eval_count": 26, + "prompt_eval_duration": 383809000, + "eval_count": 298, + "eval_duration": 4799921000 +} +``` + +##### Chat request (With History) +Send a chat message with a conversation history. You can use this same approach to start the conversation using multi-shot or chain-of-thought prompting. + +###### Request +```shell +curl http://localhost:11434/api/chat -d '{ + "model": "llama2", + "messages": [ + { + "role": "user", + "content": "why is the sky blue?" + }, + { + "role": "assistant", + "content": "due to rayleigh scattering." + }, + { + "role": "user", + "content": "how is that different than mie scattering?" 
+ } + ] +}' +``` + +###### Response +A stream of JSON objects is returned: + +```json +{ + "model": "llama2", + "created_at": "2023-08-04T08:52:19.385406455-07:00", + "message": { + "role": "assistant", + "content": "The" + }, + "done": false +} +``` + +Final response: +```json +{ + "model": "llama2", + "created_at": "2023-08-04T19:22:45.499127Z", + "done": true, + "total_duration": 8113331500, + "load_duration": 6396458, + "prompt_eval_count": 61, + "prompt_eval_duration": 398801000, + "eval_count": 468, + "eval_duration": 7701267000 +} +``` + +##### Chat request (with images) +###### Request +Send a chat message with an attached image. +```shell +curl http://localhost:11434/api/chat -d '{ + "model": "llava", + "messages": [ + { + "role": "user", + "content": "what is in this image?", + "images": ["base64..."] + } + ] +}' +``` + +###### Response +```json +{ + "model": "llava", + "created_at": "2023-12-13T22:42:50.203334Z", + "message": { + "role": "assistant", + "content": " The image features a cute, little pig with an angry facial expression. It's wearing a heart on its shirt and is waving in the air. This scene appears to be part of a drawing or sketching project.", + "images": null + }, + "done": true, + "total_duration": 1668506709, + "load_duration": 1986209, + "prompt_eval_count": 26, + "prompt_eval_duration": 359682000, + "eval_count": 83, + "eval_duration": 1303285000 +} +``` + +### Create a Model +```http +POST /api/create +``` + +Create a model from a `Modelfile`. It is recommended to set `modelfile` to the content of the Modelfile rather than just set `path`. This is a requirement for remote create. Remote model creation must also explicitly create any file blobs referenced by fields such as `FROM` and `ADAPTER` on the server using Create a Blob, and set the value of those fields to the path indicated in the response. 
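
Sending the Modelfile content rather than a path can be sketched in Python as follows. This is an illustration under assumptions: the helper names `build_create_payload` and `create_payload_from_file` are hypothetical, and only the `name`, `modelfile` and `stream` fields described for this endpoint are used.

```python
import json
from pathlib import Path


def build_create_payload(name, modelfile_text, stream=True):
    """Build the JSON body for POST /api/create from Modelfile content."""
    payload = {"name": name, "modelfile": modelfile_text}
    if not stream:
        payload["stream"] = False  # ask for a single response object
    return payload


def create_payload_from_file(name, modelfile_path):
    """Read a local Modelfile and embed its content in the request body."""
    text = Path(modelfile_path).read_text(encoding="utf-8")
    return build_create_payload(name, text)


if __name__ == "__main__":
    body = build_create_payload(
        "mario", "FROM llama2\nSYSTEM You are mario from Super Mario Bros."
    )
    # json.dumps escapes the newline, so the body can be sent as-is.
    print(json.dumps(body))
```

Embedding the content means the server never needs access to the client's file system, which is why this form also works for remote creation.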
+ +#### Parameters +- `name`: name of the model to create +- `modelfile` (optional): contents of the Modelfile +- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects +- `path` (optional): path to the Modelfile + +#### Examples +##### Create a new model +Create a new model from a `Modelfile`. + +###### Request +```shell +curl http://localhost:11434/api/create -d '{ + "name": "mario", + "modelfile": "FROM llama2\nSYSTEM You are mario from Super Mario Bros." +}' +``` + +###### Response +A stream of [JSON](../../files/JSON.md) objects. Notice that the final [JSON](../../files/JSON.md) object shows a `"status": "success"`. + +```json +{"status":"reading model metadata"} +{"status":"creating system layer"} +{"status":"using already created layer sha256:22f7f8ef5f4c791c1b03d7eb414399294764d7cc82c7e94aa81a1feb80a983a2"} +{"status":"using already created layer sha256:8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b"} +{"status":"using already created layer sha256:7c23fb36d80141c4ab8cdbb61ee4790102ebd2bf7aeff414453177d4f2110e5d"} +{"status":"using already created layer sha256:2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988"} +{"status":"using already created layer sha256:2759286baa875dc22de5394b4a925701b1896a7e3f8e53275c36f75a877a82c9"} +{"status":"writing layer sha256:df30045fe90f0d750db82a058109cecd6d4de9c90a3d75b19c09e5f64580bb42"} +{"status":"writing layer sha256:f18a68eb09bf925bb1b669490407c1b1251c5db98dc4d3d81f3088498ea55690"} +{"status":"writing manifest"} +{"status":"success"} +``` + +#### Check if a Blob Exists +```http +HEAD /api/blobs/:digest +``` + +Ensures that the file blob used for a FROM or ADAPTER field exists on the server. This is checking your Ollama server and not Ollama.ai. 
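
The digest in the blob URL is the SHA256 hash of the file, written with a `sha256:` prefix as in the requests on this page. A minimal Python sketch for computing it locally might look like this; the helper names `blob_digest` and `blob_url` are hypothetical, and the default server address is an assumption.

```python
import hashlib


def blob_digest(path, chunk_size=1024 * 1024):
    """Return the digest of a file in the `sha256:<hex>` form used by /api/blobs."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large model files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return "sha256:" + sha.hexdigest()


def blob_url(digest, base_url="http://localhost:11434"):
    """Build the blob endpoint URL for a digest (assumes the default local server)."""
    return f"{base_url}/api/blobs/{digest}"
```

The resulting URL can be used with a `HEAD` request to check for the blob, or with a `POST` request to upload it.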
+ +##### Query Parameters +- `digest`: the SHA256 digest of the blob + +##### Examples +###### Request +```shell +curl -I http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2 +``` + +###### Response +Return 200 OK if the blob exists, 404 Not Found if it does not. + +#### Create a Blob +```http +POST /api/blobs/:digest +``` + +Create a blob from a file on the server. Returns the server file path. + +##### Query Parameters +- `digest`: the expected SHA256 digest of the file + +##### Examples +###### Request +```shell +curl -T model.bin -X POST http://localhost:11434/api/blobs/sha256:29fdb92e57cf0827ded04ae6461b5931d01fa595843f55d36f5b275a52087dd2 +``` + +###### Response +Return 201 Created if the blob was successfully created, 400 Bad Request if the digest used is not expected. + +### List Local Models +```http +GET /api/tags +``` + +List models that are available locally. + +#### Examples +##### Request +```shell +curl http://localhost:11434/api/tags +``` + +##### Response +A single [JSON](../../files/JSON.md) object will be returned. + +```json +{ + "models": [ + { + "name": "codellama:13b", + "modified_at": "2023-11-04T14:56:49.277302595-07:00", + "size": 7365960935, + "digest": "9f438cb9cd581fc025612d27f7c1a6669ff83a8bb0ed86c94fcf4c5440555697", + "details": { + "format": "gguf", + "family": "llama", + "families": null, + "parameter_size": "13B", + "quantization_level": "Q4_0" + } + }, + { + "name": "llama2:latest", + "modified_at": "2023-12-07T09:32:18.757212583-08:00", + "size": 3825819519, + "digest": "fe938a131f40e6f6d40083c9f0f430a515233eb2edaa6d72eb85c50d64f2300e", + "details": { + "format": "gguf", + "family": "llama", + "families": null, + "parameter_size": "7B", + "quantization_level": "Q4_0" + } + } + ] +} +``` + +### Show Model Information +```http +POST /api/show +``` + +Show information about a model including details, modelfile, template, parameters, license, and system prompt. 
+ +#### Parameters +- `name`: name of the model to show + +#### Examples +##### Request +```shell +curl http://localhost:11434/api/show -d '{ + "name": "llama2" +}' +``` + +##### Response +```json +{ + "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM llava:latest\n\nFROM /Users/matt/.ollama/models/blobs/sha256:200765e1283640ffbd013184bf496e261032fa75b99498a9613be4e94d63ad52\nTEMPLATE \"\"\"{{ .System }}\nUSER: {{ .Prompt }}\nASSSISTANT: \"\"\"\nPARAMETER num_ctx 4096\nPARAMETER stop \"\u003c/s\u003e\"\nPARAMETER stop \"USER:\"\nPARAMETER stop \"ASSSISTANT:\"", + "parameters": "num_ctx 4096\nstop \u003c/s\u003e\nstop USER:\nstop ASSSISTANT:", + "template": "{{ .System }}\nUSER: {{ .Prompt }}\nASSSISTANT: ", + "details": { + "format": "gguf", + "family": "llama", + "families": ["llama", "clip"], + "parameter_size": "7B", + "quantization_level": "Q4_0" + } +} +``` + +### Copy a Model +```http +POST /api/copy +``` + +Copy a model. Creates a model with another name from an existing model. + +#### Examples +##### Request +```shell +curl http://localhost:11434/api/copy -d '{ + "source": "llama2", + "destination": "llama2-backup" +}' +``` + +##### Response +Returns a 200 OK if successful, or a 404 Not Found if the source model doesn't exist. + +### Delete a Model +```http +DELETE /api/delete +``` + +Delete a model and its data. + +#### Parameters +- `name`: model name to delete + +#### Examples +##### Request +```shell +curl -X DELETE http://localhost:11434/api/delete -d '{ + "name": "llama2:13b" +}' +``` + +##### Response +Returns a 200 OK if successful, 404 Not Found if the model to be deleted doesn't exist. + +### Pull a Model +```http +POST /api/pull +``` + +Download a model from the ollama library. Cancelled pulls are resumed from where they left off, and multiple calls will share the same download progress. 
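
While a pull is streaming, each downloading status object reports `total` and, once available, `completed`. The following Python sketch turns one such object into a progress fraction. It is only an illustration: `pull_progress` is a hypothetical helper, and each status object is assumed to arrive as one JSON document per line.

```python
import json


def pull_progress(status_line):
    """Return (status, fraction) for one streamed status object.

    fraction is None for objects without a `total` (e.g. "pulling manifest")
    and 0.0 while `completed` has not been reported yet.
    """
    obj = json.loads(status_line)
    total = obj.get("total")
    completed = obj.get("completed", 0)
    fraction = completed / total if total else None
    return obj.get("status", ""), fraction
```

A caller could print `f"{fraction:.1%}"` whenever the fraction is not `None`, and stop once the status is `success`.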
+ +#### Parameters +- `name`: name of the model to pull +- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development. +- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects + +#### Examples +##### Request +```shell +curl http://localhost:11434/api/pull -d '{ + "name": "llama2" +}' +``` + +##### Response +If `stream` is not specified, or set to `true`, a stream of JSON objects is returned: + +The first object is the manifest: + +```json +{ "status": "pulling manifest" } +``` + +Then there is a series of downloading responses. Until part of a download has completed, the `completed` key may be omitted. The number of files to be downloaded depends on the number of layers specified in the manifest. + +```json +{ + "status": "downloading digestname", + "digest": "digestname", + "total": 2142590208, + "completed": 241970 +} +``` + +After all the files are downloaded, the final responses are: +```json +{ "status": "verifying sha256 digest" } +{ "status": "writing manifest" } +{ "status": "removing any unused layers" } +{ "status": "success" } +``` + +If `stream` is set to `false`, then the response is a single [JSON](../../files/JSON.md) object: +```json +{ "status": "success" } +``` + +### Push a Model +```http +POST /api/push +``` + +Upload a model to a model library. Requires registering for ollama.ai and adding a public key first. + +#### Parameters +- `name`: name of the model to push in the form of `/:` +- `insecure`: (optional) allow insecure connections to the library. Only use this if you are pushing to your library during development. 
+- `stream`: (optional) if `false` the response will be returned as a single response object, rather than a stream of objects + +#### Examples +##### Request +```shell +curl http://localhost:11434/api/push -d '{ + "name": "mattw/pygmalion:latest" +}' +``` + +##### Response +If `stream` is not specified, or set to `true`, a stream of [JSON](../../files/JSON.md) objects is returned: + +```json +{ "status": "retrieving manifest" } +``` + +and then: +```json +{ + "status": "starting upload", + "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab", + "total": 1928429856 +} +``` + +Then there is a series of uploading responses: +```json +{ + "status": "starting upload", + "digest": "sha256:bc07c81de745696fdf5afca05e065818a8149fb0c77266fb584d9b2cba3711ab", + "total": 1928429856 +} +``` + +Finally, when the upload is complete: +```json +{"status":"pushing manifest"} +{"status":"success"} +``` + +If `stream` is set to `false`, then the response is a single [JSON](../../files/JSON.md) object: +```json +{ "status": "success" } +``` + +### Generate Embeddings +```http +POST /api/embeddings +``` + +Generate embeddings from a model + +#### Parameters +- `model`: name of model to generate embeddings from +- `prompt`: text to generate embeddings for + +Advanced parameters: +- `options`: additional model parameters listed in the documentation for the Modelfile such as `temperature` +- `keep_alive`: controls how long the model will stay loaded into memory following the request (default: `5m`) + +#### Examples +##### Request +```shell +curl http://localhost:11434/api/embeddings -d '{ + "model": "llama2", + "prompt": "Here is an article about llamas..." 
+}' +``` + +##### Response +```json +{ + "embedding": [ + 0.5670403838157654, 0.009260174818336964, 0.23178744316101074, -0.2916173040866852, -0.8924556970596313, + 0.8785552978515625, -0.34576427936553955, 0.5742510557174683, -0.04222835972905159, -0.137906014919281 + ] +} +``` + +## Modelfile +A model file is the blueprint to create and share models with Ollama. + +### Format +The format of the `Modelfile`: + +```modelfile +# comment +INSTRUCTION arguments +``` + +| Instruction | Description | +| ----------------- | -------------------------------------------------------------- | +| `FROM` (required) | Defines the base model to use. | +| `PARAMETER` | Sets the parameters for how Ollama will run the model. | +| `TEMPLATE` | The full prompt template to be sent to the model. | +| `SYSTEM` | Specifies the system message that will be set in the template. | +| `ADAPTER` | Defines the (Q)LoRA adapters to apply to the model. | +| `LICENSE` | Specifies the legal license. | +| `MESSAGE` | Specify message history. | + +### Basic `Modelfile` +An example of a `Modelfile` creating a mario blueprint: + +```modelfile +FROM llama2 +# sets the temperature to 1 [higher is more creative, lower is more coherent] +PARAMETER temperature 1 +# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token +PARAMETER num_ctx 4096 + +# sets a custom system message to specify the behavior of the chat assistant +SYSTEM You are Mario from super mario bros, acting as an assistant. +``` + +To use this: + +1. Save it as a file (e.g. `Modelfile`) +2. `ollama create choose-a-model-name -f '` +3. `ollama run choose-a-model-name` +4. Start using the model! + +### FROM (Required) +The `FROM` instruction defines the base model to use when creating a model. 
+ +```modelfile +FROM : +``` + +#### Build from llama2 + +```modelfile +FROM llama2 +``` + +#### Build from a `bin` file + +```modelfile +FROM ./ollama-model.bin +``` + +This bin file location should be specified as an absolute path or relative to the `Modelfile` location. + +### PARAMETER +The `PARAMETER` instruction defines a parameter that can be set when the model is run. + +```modelfile +PARAMETER +``` + +#### Valid Parameters and Values + +| Parameter | Description | Value Type | Example Usage | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- | -------------------- | +| mirostat | Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) | int | mirostat 0 | +| mirostat_eta | Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1) | float | mirostat_eta 0.1 | +| mirostat_tau | Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0) | float | mirostat_tau 5.0 | +| num_ctx | Sets the size of the context window used to generate the next token. (Default: 2048) | int | num_ctx 4096 | +| num_gqa | The number of GQA groups in the transformer layer. Required for some models, for example it is 8 for llama2:70b | int | num_gqa 1 | +| num_gpu | The number of layers to send to the GPU(s). On macOS it defaults to 1 to enable metal support, 0 to disable. | int | num_gpu 50 | +| num_thread | Sets the number of threads to use during computation. By default, Ollama will detect this for optimal performance. 
It is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). | int | num_thread 8 | +| repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 | +| repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 | +| temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 | +| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 | +| stop | Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate `stop` parameters in a modelfile. | string | stop "AI assistant:" | +| tfs_z | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1) | float | tfs_z 1 | +| num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 | +| top_k | Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) | int | top_k 40 | +| top_p | Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. 
(Default: 0.9) | float | top_p 0.9 | + +### TEMPLATE +`TEMPLATE` defines the full prompt template to be passed into the model. It may include (optionally) a system message and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the readme for that model. + +#### Template Variables + +| Variable | Description | +| ----------------- | ------------------------------------------------------------------------------------------------------------- | +| `{{ .System }}` | The system message used to specify custom behavior; this must also be set in the Modelfile as an instruction. | +| `{{ .Prompt }}` | The incoming prompt; this is not specified in the model file and will be set based on input. | +| `{{ .Response }}` | The response from the LLM; if not specified, the response is appended to the end of the template. | +| `{{ .First }}` | A boolean value used to render specific template information for the first generation of a session. | + +```modelfile +TEMPLATE """ +{{- if .First }} +### System: +{{ .System }} +{{- end }} + +### User: +{{ .Prompt }} + +### Response: +""" + +SYSTEM """""" +``` + +### SYSTEM +The `SYSTEM` instruction specifies the system message to be used in the template, if applicable. + +```modelfile +SYSTEM """""" +``` + +### ADAPTER +The `ADAPTER` instruction specifies the LoRA adapter to apply to the base model. The value of this instruction should be an absolute path or a path relative to the Modelfile, and the file must be in a GGML file format. The adapter should be tuned from the base model; otherwise the behaviour is undefined. + +```modelfile +ADAPTER ./ollama-lora.bin +``` + +### LICENSE +The `LICENSE` instruction allows you to specify the legal license under which the model used with this Modelfile is shared or distributed. 
+ +```modelfile +LICENSE """ + +""" +``` + +### MESSAGE +The `MESSAGE` instruction allows you to specify a message history for the model to use when responding: + +```modelfile +MESSAGE user Is Toronto in Canada? +MESSAGE assistant yes +MESSAGE user Is Sacramento in Canada? +MESSAGE assistant no +MESSAGE user Is Ontario in Canada? +MESSAGE assistant yes +``` + +## Libraries & Applications +- [ollama-rs](https://github.com/pepperoni21/ollama-rs) +- [ollama-webui](https://github.com/ollama-webui/ollama-webui) + +## Docker +Run Ollama with [docker](../../tools/Docker.md): +```shell +docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama +``` + +To run Ollama with the GPU either use a `rocm` tagged docker image or the NVIDIA container runtime: +```shell +sudo apt-get install -y nvidia-container-toolkit +sudo nvidia-ctk runtime configure --runtime=docker +sudo systemctl restart docker +docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama +``` \ No newline at end of file