ML | Ollama

Ollama

Ollama is an open-source tool designed to run large language models (LLMs) directly on your local machine, improving privacy and performance. It lets developers, researchers, and businesses use powerful AI models without relying on cloud-based services, keeping full control over their data and reducing the security risks that come with external servers.

Key Features of Ollama

  • Local execution: Ollama makes it possible to run LLMs locally, which eases privacy concerns and strengthens data security. Users do not need to upload sensitive information to the cloud; all processing happens on their own device.

  • Broad model library: the platform supports a wide range of pre-trained models, including the popular LLaMA 2 and Code Llama. Users can easily pick a model suited to a specific task, enabling versatile AI applications.

  • Customization and fine-tuning: Ollama lets users customize and fine-tune language models to their needs, including prompt engineering and few-shot learning, so the output better matches their goals.

  • Seamless integration: the tool integrates well with a variety of programming languages and frameworks, making it easy for developers to incorporate LLMs into their projects.

  • No usage caps: unlike many online AI services that impose usage limits, Ollama does not restrict how much text you can generate, giving users far more flexibility.

Applications of Ollama

Ollama can be used in many domains, including:

  • Chatbots and virtual assistants: improve customer-service experiences with intelligent automated responses.
  • Code generation and assistance: streamline development workflows by generating code snippets or helping with debugging.
  • Natural language processing: support tasks such as translation, summarization, and content generation.
  • Research and knowledge discovery: analyze large datasets to extract insights or generate hypotheses.

In short, Ollama is a powerful tool for running LLMs locally, with clear advantages in privacy, performance, and customization. Because it works without cloud infrastructure, it is especially attractive to anyone who cares about data security while using advanced AI.

Ollama is a popular open-source command-line tool and engine that allows you to download quantized versions of the most popular LLM chat models.

Ollama is a separate application that you need to download first and connect to. Ollama supports running LLMs on both CPU and GPU.

Local Deployment

Ollama

Ollama can be installed as follows:

  • Official Ollama release: https://ollama.com/
    • Mac
      • Apple Silicon is supported
      • On a Mac, download and unpack the archive, then simply follow the prompts to move it into Applications. Although it is packaged as an application, models are actually run from the command line.
    • Linux
      • curl -fsSL https://ollama.com/install.sh | sh # Needs sudo privileges
      • Installation without root privileges (see the sketch after this list)
        • Download a release from https://github.com/ollama/ollama/releases, e.g. `wget https://github.com/ollama/ollama/releases/download/v0.5.7/ollama-linux-amd64.tgz`
          • ollama-linux-amd64.tgz
        • ./ollama serve &
          • source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7)"
        • ./ollama run llama2
        • Models are stored under $HOME/.ollama
    • Ollama is open-source software that lets users run, create, and share large language model services on their own hardware. The platform suits anyone who wants to run models locally: it protects privacy and makes setup and interaction easy through a command-line interface. Ollama supports many models, including Llama 2 and Mistral, and offers flexible customization options such as importing models from other formats and setting runtime parameters.
  • Web UI front end: Page Assist - A Web UI for Local AI Models | Chrome Extension
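
The non-root steps above can be combined into one short sequence. This is a minimal sketch: the release version and model name come from the list above, and the location of the extracted binary is an assumption that may differ between releases.

# Download and unpack a release tarball (v0.5.7 is only an example; check the releases page)
wget https://github.com/ollama/ollama/releases/download/v0.5.7/ollama-linux-amd64.tgz
tar -xzf ollama-linux-amd64.tgz

# Start the server in the background; it listens on 127.0.0.1:11434 by default
./ollama serve &      # depending on the release layout, the binary may instead be at ./bin/ollama

# Pull and chat with a model; weights end up under $HOME/.ollama
./ollama run llama2
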
$ ./ollama --help
Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
$ ./ollama list
[GIN] 2025/02/04 - 10:53:10 | 200 | 171.11µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/04 - 10:53:11 | 200 | 43.04762ms | 127.0.0.1 | GET "/api/tags"
NAME           ID              SIZE      MODIFIED
llama3.2:1b    baf6a787fdff    1.3 GB    3 minutes ago

$ ./ollama show llama3.2:1b
[GIN] 2025/02/04 - 10:53:25 | 200 | 60.457µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/04 - 10:53:25 | 200 | 99.470023ms | 127.0.0.1 | POST "/api/show"
  Model
    architecture        llama
    parameters          1.2B
    context length      131072
    embedding length    2048
    quantization        Q8_0

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT
    Llama 3.2 Version Release Date: September 25, 2024

$ ./ollama ps
[GIN] 2025/02/04 - 10:53:58 | 200 | 75.923µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/04 - 10:53:58 | 200 | 303.327µs | 127.0.0.1 | GET "/api/ps"
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.2:1b    baf6a787fdff    2.2 GB    100% CPU     58 seconds from now
$ ./ollama help serve
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
    OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
    OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
    OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
    OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
    OLLAMA_MAX_QUEUE           Maximum number of queued requests
    OLLAMA_MODELS              The path to the models directory
    OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
    OLLAMA_NOPRUNE             Do not prune model blobs on startup
    OLLAMA_ORIGINS             A comma separated list of allowed origins
    OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs

    OLLAMA_FLASH_ATTENTION     Enabled flash attention
    OLLAMA_KV_CACHE_TYPE       Quantization type for the K/V cache (default: f16)
    OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
    OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
    OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")
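
These variables are read when ollama serve starts. A minimal sketch, assuming you want to expose the API on the LAN and keep models loaded longer than the 5-minute default (the values here are examples, not recommendations):

# listen on all interfaces and keep models in memory for an hour
OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=1h ollama serve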

Where does Ollama store models?

  • macOS: ~/.ollama/models
  • Linux: /usr/share/ollama/.ollama/models
  • Windows: C:\Users\<username>\.ollama\models
$ ls ~/.ollama
history id_ed25519 id_ed25519.pub logs models

[~/.ollama/models]$ du -sh ./*
39G ./blobs
24K ./manifests
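
If the default location sits on a small disk, the models directory can be relocated with the OLLAMA_MODELS variable listed earlier. A minimal sketch for a manually launched server; the target path is only an example:

# keep model blobs on a larger data disk (path is hypothetical)
export OLLAMA_MODELS=/data/ollama/models
ollama serve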

WebUI

  1. Web UI front end: Page Assist - A Web UI for Local AI Models | Chrome Extension
  2. Open WebUI
    1. micromamba install python=3.11
    2. pip install open-webui
    3. open-webui serve # you can access at http://localhost:8080
INFO  [open_webui.env] Embedding model set: sentence-transformers/all-MiniLM-L6-v2
WARNI [langchain_community.utils.user_agent] USER_AGENT environment variable not set, consider setting it to identify your requests.

___ __ __ _ _ _ ___
/ _ \ _ __ ___ _ __ \ \ / /__| |__ | | | |_ _|
| | | | '_ \ / _ \ '_ \ \ \ /\ / / _ \ '_ \| | | || |
| |_| | |_) | __/ | | | \ V V / __/ |_) | |_| || |
\___/| .__/ \___|_| |_| \_/\_/ \___|_.__/ \___/|___|
|_|


v0.5.7 - building the best open-source AI user interface.

https://github.com/open-webui/open-webui
  • ollama serve on the server (default port: 11434)
    • source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7)"
  • ssh tunnel to the server (e.g. ssh -N -L 11434:localhost:11434 10.4.7.1)
  • http://localhost:11434/ --> shows "Ollama is running"
  • http://localhost:8080/ opens the Open-WebUI page (port 8080).
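
Before opening the Web UI, it is worth confirming that the tunnel and the Ollama API respond. A quick check (the host IP is the example from the list above):

# forward the remote Ollama port to this machine
ssh -N -L 11434:localhost:11434 10.4.7.1 &

# sanity checks against the tunnelled server
curl http://localhost:11434/           # prints "Ollama is running"
curl http://localhost:11434/api/tags   # lists the models installed on the server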

Installation with Default Configuration

  • If Ollama is on your computer, use this command:

    docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
  • If Ollama is on a Different Server, use this command:

    To connect to Ollama on another server, change the OLLAMA_BASE_URL to the server's URL:

    docker run -d -p 3000:8080 -e OLLAMA_BASE_URL=https://example.com -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
  • To run Open WebUI with Nvidia GPU support, use this command:

    docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda

Installation for OpenAI API Usage Only

  • If you're only using OpenAI API, use this command:

    docker run -d -p 3000:8080 -e OPENAI_API_KEY=your_secret_key -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Installing Open WebUI with Bundled Ollama Support

This installation method uses a single container image that bundles Open WebUI with Ollama, allowing for a streamlined setup via a single command. Choose the appropriate command based on your hardware setup:

  • With GPU Support: Utilize GPU resources by running the following command:

    docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
  • For CPU Only: If you're not using a GPU, use this command instead:

    docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

Both commands facilitate a built-in, hassle-free installation of both Open WebUI and Ollama, ensuring that you can get everything up and running swiftly.

After installation, you can access Open WebUI at http://localhost:3000.
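
Once the container is up, standard Docker commands are enough to inspect or remove it (the container name matches the --name flag used above):

docker ps --filter name=open-webui               # confirm the container is running and see the port mapping
docker logs -f open-webui                        # follow the startup logs
docker stop open-webui && docker rm open-webui   # remove the container; the open-webui volume keeps your data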

Run

$ ollama run phi3
  • ollama pull <name> downloads a model without running it
  • ollama list shows which models have been downloaded locally
  • ollama rm <name> deletes a downloaded model

Models are stored under ~/.ollama/models/; the server log is at ~/.ollama/logs/server.log.
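
Besides the interactive CLI, the local server exposes an HTTP API on port 11434. A minimal sketch of a one-off, non-streaming request (the model name is an example; pull it first):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'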

Stop Ollama

"What kind of silly question is that? Isn't Ctrl-C or Ctrl-D enough?"

  • Actually, after ollama run "finishes", the service keeps running and stays bound to its port, as ps will show. The model is unloaded from memory after it has been idle for a while, but if you want a truly clean exit on a Mac, click the Ollama icon at the top right of the menu bar and choose Quit.

  • Linux
    • Identify the Process: ps aux | grep ollama
    • pkill ollama or kill -9 <PID>

Remove Ollama

  • Stop the Ollama Service
  • Remove Ollama
    sudo systemctl stop ollama.service  # Stop the service
    sudo apt remove ollama # Remove (Debian/Ubuntu)
    sudo dnf remove ollama # Remove (Fedora/RHEL)
    sudo snap remove ollama # Remove (Snap)
    brew uninstall ollama # Remove (Homebrew)
    rm -rf ~/.ollama # Remove configuration files (optional)
    or
    sudo rm /usr/local/bin/ollama   # Adjust the path as necessary
    rm -rf ~/.ollama # Remove configuration files
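
If Ollama was installed with the official install.sh script, it also creates a systemd unit and a dedicated ollama user. A fuller cleanup, sketched from the upstream Linux documentation (paths and user names may differ on your system):

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)            # usually /usr/local/bin/ollama
sudo rm -r /usr/share/ollama       # models downloaded by the service user
sudo userdel ollama && sudo groupdel ollama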

Ollama-library

  • https://ollama.com/library

  • The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge. These models are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors.

Model examples

LLaMA (Large Language Model Meta AI)

LLaMA (Large Language Model Meta AI) is a family of large language models developed by Meta AI, with its initial release in February 2023. The latest version, LLaMA 3.3, was launched in December 2024. (The largest model, LLaMA 3.1 with 405 billion parameters, requires approximately 854 GB of memory without quantization. With techniques like 8-bit quantization, this can be reduced to around 427 GB, though it still demands substantial computational resources.)

Llama 3.2 1B and 3B models:

Model           Total Parameters    Context Length    Memory Requirements
Llama 3.2 1B    1 billion           128,000 tokens    BF16/FP16: ~2.5 GB; FP8: ~1.25 GB; INT4: ~0.75 GB
Llama 3.2 3B    3 billion           128,000 tokens    BF16/FP16: ~6.5 GB; FP8: ~3.2 GB; INT4: ~1.75 GB

Key Features:

  • Multilingual Support: Trained on up to 9 trillion tokens, supporting languages like English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
  • Optimized for Efficiency: Designed for on-device applications such as prompt rewriting and knowledge retrieval.
  • High Performance: Outperforms many existing open-access models of similar sizes and is competitive with larger models.
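
To try these locally with Ollama, the corresponding library tags can be pulled directly (sizes on disk reflect the quantized downloads, not the FP16 figures above):

ollama pull llama3.2:1b
ollama pull llama3.2:3b
ollama show llama3.2:3b   # reports architecture, parameter count, context length, and quantization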

Customize a model (.gguf)

Import from GGUF

Ollama supports importing GGUF models in the Modelfile:

  1. Create a file named Modelfile, with a FROM instruction pointing to the local file path of the model you want to import.

    FROM ./vicuna-33b.Q4_0.gguf
  2. Create the model in Ollama

    ollama create example -f Modelfile
  3. Run the model

    ollama run example
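
The Modelfile can also set runtime parameters and a system prompt when importing a GGUF file. A minimal sketch using a heredoc; the GGUF filename follows the example above, and the parameter values are arbitrary:

cat > Modelfile <<'EOF'
FROM ./vicuna-33b.Q4_0.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a concise assistant."
EOF

ollama create example -f Modelfile
ollama run example "Hello"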

