ML | DeepSeek
DeepSeek (深度求索), full name Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., is a Chinese artificial intelligence and large language model company headquartered in Hangzhou, Zhejiang, mainland China. It was founded by the Chinese hedge fund High-Flyer (幻方量化); its founder and CEO is Liang Wenfeng. DeepSeek drew wide attention because its performance on reasoning tasks rivals OpenAI's ChatGPT while its development cost and resource consumption are only a fraction of the latter's.
DeepSeek Model
DeepSeek-R1 and DeepSeek-V3 are advanced language models developed by DeepSeek-AI, leveraging a Mixture of Experts (MoE) architecture to optimize performance and resource efficiency.
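The core idea of an MoE layer is that a router scores every expert for each token, only the top-k experts actually run, and their outputs are mixed by the normalized router scores. The toy sketch below (function names and shapes are illustrative, not DeepSeek's; DeepSeekMoE additionally uses fine-grained and shared experts) shows that routing-and-mixing step in NumPy:

```python
import numpy as np

def moe_forward(x, expert_weights, router_weights, k=2):
    """Route one token to its top-k experts and mix their outputs.

    x: (d,) token hidden state
    expert_weights: (n_experts, d, d) one toy linear layer per expert
    router_weights: (n_experts, d) router projection
    """
    logits = router_weights @ x                    # (n_experts,) router scores
    topk = np.argsort(logits)[-k:]                 # indices of the k best experts
    gate = np.exp(logits[topk] - logits[topk].max())
    gate = gate / gate.sum()                       # softmax over the selected experts only
    out = np.zeros_like(x)
    for g, e in zip(gate, topk):
        out += g * (expert_weights[e] @ x)         # only k experts do any compute
    return out, topk

rng = np.random.default_rng(0)
d, n_experts = 8, 4
y, active = moe_forward(rng.normal(size=d),
                        rng.normal(size=(n_experts, d, d)),
                        rng.normal(size=(n_experts, d)))
print(y.shape, sorted(active.tolist()))
```

This is why a 671B-parameter model can activate only 37B parameters per token: most experts are simply skipped for any given token.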
- DeepSeek-V3
- Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... & Piao, Y. (2024). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.
- 【0410 LLM News】Mixture of Experts, the new benchmark for LLMs in 2024
- DeepSeek-R1
- DeepSeek-R1 requires at least 800 GB of HBM memory in FP8 format for inference.
| Model | #Total Params | #Activated Params | Context Length | Download | VRAM (FP8) |
|---|---|---|---|---|---|
| DeepSeek-R1-Zero | 671B | 37B | 128K | https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero | at least 800 GB |
| DeepSeek-R1 | 671B | 37B | 128K | https://huggingface.co/deepseek-ai/DeepSeek-R1 | at least 800 GB |
| DeepSeek-V3 | 671B | 37B | 128K | https://huggingface.co/deepseek-ai/DeepSeek-V3 | at least 800 GB |
DeepSeek-R1-Zero and DeepSeek-R1 are trained on top of DeepSeek-V3-Base. For details of the model architecture, refer to the DeepSeek-V3 repository. (DeepSeek-V3-671B: 671 billion parameters; VRAM (FP16) ~1,543 GB; VRAM (4-bit quantization) ~386 GB.)
FP16 Precision
: Requires higher-VRAM GPUs or multiple GPUs because of the larger memory footprint.

4-bit Quantization
: Lets lower-VRAM GPUs handle larger models, reducing the need for extensive multi-GPU setups.

Lower-Spec GPUs
: Models can still run on GPUs below the recommendations above, as long as total VRAM meets or exceeds the requirement. The setup will not be optimal, however, and will likely need tuning, such as adjusting batch sizes and processing settings.
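The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic: weights alone need (parameters × bits per parameter ÷ 8) bytes. The sketch below computes that raw number for 671B parameters; published figures (e.g. ~1,543 GB FP16, at least 800 GB FP8) run higher because the KV cache, activations, and runtime overhead are not included here.

```python
def weight_vram_gb(n_params, bits_per_param):
    """Rough VRAM for the weights alone, in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead, so real
    deployments need noticeably more than this.
    """
    return n_params * bits_per_param / 8 / 1e9

N = 671e9  # DeepSeek-V3 / R1 total parameter count
for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_vram_gb(N, bits):,.0f} GB for weights")
# FP16 ≈ 1,342 GB, FP8 ≈ 671 GB, 4-bit ≈ 336 GB (weights only)
```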
Local Deployment
Ollama
We can install it via Ollama:
- Ollama official site: https://ollama.com/
curl -fsSL https://ollama.com/install.sh | sh
- Ollama is open-source software that lets users run, create, and share large language model services on their own hardware. The platform suits users who want to run models locally: it protects privacy and makes setup and interaction easy through a command-line interface. Ollama supports a range of models, including Llama 2 and Mistral, and offers flexible customization options, such as importing models from other formats and setting runtime parameters.
- Web UI client: Page Assist - A Web UI for Local AI Models | Chrome Extension
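Besides the CLI and browser extensions, a running Ollama server exposes a REST API on port 11434. The sketch below builds a request for its `/api/generate` endpoint and sends it with the standard library; `deepseek-r1:7b` is one commonly used distilled tag and is an assumption here — substitute whatever model you have actually pulled, and note that `ask()` only works while `ollama serve` is running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def build_payload(model, prompt):
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model, prompt):
    """POST a prompt to a locally running Ollama server and return its reply.

    Requires the Ollama server to be running and the model to be pulled.
    """
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Only shows the request body; call ask() once the server is up.
    print(json.dumps(build_payload("deepseek-r1:7b", "Why is the sky blue?")))
```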
Where Ollama stores models
- macOS:
~/.ollama/models
- Linux:
/usr/share/ollama/.ollama/models
- Windows:
C:\Users\<username>\.ollama\models
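The defaults above can be resolved programmatically. A small sketch, assuming the documented `OLLAMA_MODELS` environment variable override and the per-OS defaults listed above (the function name is illustrative):

```python
import os
import platform
from pathlib import Path

def ollama_models_dir(system=None, home=None):
    """Return the default Ollama model store for the given OS.

    The OLLAMA_MODELS environment variable overrides the default
    on every platform.
    """
    override = os.environ.get("OLLAMA_MODELS")
    if override:
        return Path(override)
    system = system or platform.system()
    home = Path(home) if home else Path.home()
    if system == "Linux":
        return Path("/usr/share/ollama/.ollama/models")
    # macOS ("Darwin") and Windows both default to the user's home profile
    return home / ".ollama" / "models"

print(ollama_models_dir())
```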
Examples
- The Llama 3.2 1B and 3B models support a context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge. These models are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors.




Distilled models
The DeepSeek team has demonstrated that the reasoning patterns of larger models can be distilled into smaller models, yielding better performance than the reasoning patterns discovered through RL on small models.
Below are the models created by fine-tuning several dense models widely used in the research community on reasoning data generated by DeepSeek-R1. The evaluation results show that the distilled smaller dense models perform exceptionally well on benchmarks.
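Mechanically, this kind of distillation is supervised fine-tuning: the student minimizes cross-entropy on token sequences that the teacher (here, DeepSeek-R1) generated. A minimal NumPy sketch of that per-token objective, with illustrative names and toy shapes (no claim that this matches DeepSeek's exact training recipe):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sft_distill_loss(student_logits, teacher_token_ids):
    """Mean cross-entropy of the student on a teacher-generated sequence.

    student_logits: (seq_len, vocab) student predictions at each position
    teacher_token_ids: (seq_len,) tokens sampled from the teacher's reasoning trace
    """
    probs = softmax(student_logits)
    picked = probs[np.arange(len(teacher_token_ids)), teacher_token_ids]
    return -np.log(picked).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))        # toy student outputs: 5 tokens, vocab of 10
targets = rng.integers(0, 10, size=5)    # toy teacher-generated token ids
print(sft_distill_loss(logits, targets))
```

The loss drops toward zero as the student assigns high probability to exactly the tokens the teacher produced, which is how the teacher's reasoning patterns transfer into the smaller model.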




