
MiniMax-M2.7 Is Now Open Source: A Local Deployment Guide

2026-04-12 10:55:52　Source: Ai學習的老章


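MiniMax-M2.7 was released last month and, a little over half a month later, has just been open-sourced.

The open-source release reveals many finer details, but I won't go through them here.

After some brief testing, the model fell short of my expectations, so this post focuses on local deployment instead.

I tested it on the hosted endpoint provided by Nvidia, using my usual test case: reading comprehension + SVG code generation + aesthetics.

The results were surprising, and not in a good way: it felt like it was only at about Qwen3's level.

It is roughly on par with GLM-5.1.

Both fall far short of Qwen3.6 Plus (this is only my personal opinion, based on this single test case).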

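Overview

M2.7's key highlights:

Model self-evolution: M2.7 can autonomously update its memory, build skills, and improve its own learning process; after 100+ rounds of autonomous optimization, performance improved by 30%
Professional software engineering: scores 56.22% on SWE-Pro, on par with GPT-5.3-Codex; production incident recovery time is cut to under 3 minutes
Professional office work: GDPval-AA ELO of 1495, the highest among open-source models; high-fidelity multi-turn editing of Word/Excel/PPT
Native Agent Teams: supports multi-agent collaboration with stable roles and autonomous decision-making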

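[Figure: M2.7 model technical specification blueprint]

Its coding ability is claimed to rival GPT-5.3-Codex

[Figure: M2.7 benchmark performance overview]

Deployment options: a flourishing ecosystem

[Figure: M2.7 deployment ecosystem overview]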

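The entry cost is 230GB of GPU memory. I suspect even 2 H200s would be a stretch; the official recommendation is at least 4 H200s

Quantized versions are presumably being rushed out; as of this writing, only the empty repository folders have been created

Judging by Unsloth's track record, compressing the model down to a few dozen GB should not be hard

[Figure: MLE Bench Lite self-evolution performance]

Ollama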

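The latest Ollama release already offers minimax-m2.7:cloud, free to use

[Figure: M2.7 is live on Ollama's cloud, available under a commercial license]

# Use together with OpenClaw
ollama launch openclaw --model minimax-m2.7:cloud


# Chat directly
ollama run minimax-m2.7:cloud

[Figure: Ollama supports MiniMax M2.7]

Note that M2.7 on Ollama currently runs via cloud inference (the :cloud tag), because a 230B-parameter MoE model needs far too much GPU memory to run locally

Once quantized versions are released, a locally runnable version should follow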

vLLM

vLLM provides Day-0 support and is currently one of the most mature deployment options

# Basic deployment (4× H200/H100/A100)
vllm serve MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --enable-auto-tool-choice \
  --trust-remote-code


# 8-GPU deployment (DP+EP mode)
vllm serve MiniMaxAI/MiniMax-M2.7 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice

One-command launch with Docker:

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.7 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --trust-remote-code
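
Once the container is up, it helps to confirm that the server is actually serving the model before pointing clients at it. The sketch below is a minimal check, assuming the server exposes the OpenAI-compatible `/v1/models` route on port 8000 as mapped in the Docker command above; the helper names `extract_model_ids` and `list_served_models` are made up for illustration.

```python
import json
import urllib.request


def extract_model_ids(models_response: dict) -> list[str]:
    """Pull the model ids out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000",
                       timeout: float = 5.0) -> list[str]:
    """Ask the running server which models it serves.

    Assumption: the server answers GET {base_url}/v1/models with the
    usual {"object": "list", "data": [{"id": ...}, ...]} shape.
    """
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling `list_served_models()` against a healthy deployment should return a list containing `MiniMaxAI/MiniMax-M2.7`; a connection error usually just means the weights are still loading.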

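vLLM supports both major platforms, NVIDIA and AMD:

NVIDIA: 4× H200/H100/A100 with tensor parallelism, or 8 GPUs in DP+EP/TP+EP mode
AMD: 2× or 4× MI300X/MI325X/MI350X/MI355X, with AITER acceleration supported

System requirements: the weights need about 220GB of GPU memory, plus an additional 240GB for every 1 million context tokens.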
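
The memory figures above lend themselves to a quick back-of-the-envelope estimate. The sketch below only uses the two numbers quoted in this article (220GB for weights, 240GB per 1 million context tokens); the 141GB per-GPU capacity for the H200 is an assumption taken from NVIDIA's published spec, and the estimate ignores activations, framework overhead, and any constraint on valid tensor-parallel sizes.

```python
import math

WEIGHTS_GB = 220                  # weight footprint quoted in the article
KV_GB_PER_MILLION_TOKENS = 240    # extra memory per 1M context tokens (article)
H200_MEMORY_GB = 141              # assumption: H200 capacity per NVIDIA's spec


def estimate_memory_gb(context_tokens: int) -> float:
    """Weights plus context memory, for the total number of tokens held at once."""
    return WEIGHTS_GB + KV_GB_PER_MILLION_TOKENS * context_tokens / 1_000_000


def min_gpus(total_gb: float, per_gpu_gb: float = H200_MEMORY_GB) -> int:
    """Lower bound on GPU count; real deployments need extra headroom."""
    return math.ceil(total_gb / per_gpu_gb)


if __name__ == "__main__":
    for tokens in (0, 200_000, 1_000_000):
        total = estimate_memory_gb(tokens)
        print(f"{tokens:>9} tokens: {total:.0f} GB, at least {min_gpus(total)} x H200")
```

Under these assumptions the weights alone already fill most of two H200s, which is consistent with the impression that two cards are a stretch and four are the practical minimum.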

SGLang

SGLang also provides Day-0 support

sglang serve \
  --model-path MiniMaxAI/MiniMax-M2.7 \
  --tp 4 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --trust-remote-code \
  --mem-fraction-static 0.85

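One notable feature of SGLang is its Thinking-mode support: with the minimax-append-think parser, the reasoning process and the final content can be shown separately.

A quick test to check that the deployment works: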

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M2.7",
    "messages": [
      {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
      {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
    ]
  }'

Recommended inference parameters for M2.7 on SGLang: temperature=1.0, top_p=0.95, top_k=40.
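
As a rough illustration of how those sampling parameters might be passed, here is a standard-library sketch that builds the same chat request as the curl test and adds the three values. It assumes the SGLang server listens on port 30000 as in the curl example and accepts `top_k` as a top-level request field (it is not part of the official OpenAI schema); `build_chat_payload` and `chat` are illustrative names.

```python
import json
import urllib.request

# Sampling values recommended in the article for M2.7 on SGLang.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_payload(prompt: str, model: str = "MiniMaxAI/MiniMax-M2.7") -> dict:
    """Build a chat-completions request body with the recommended sampling."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED_SAMPLING,  # assumption: the server accepts top_k here
    }


def chat(prompt: str, base_url: str = "http://localhost:30000/v1") -> str:
    """POST the payload to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If the server rejects the extra field, dropping `top_k` from `RECOMMENDED_SAMPLING` leaves a request that matches the plain OpenAI schema.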

NVIDIA Backing

NVIDIA has put substantial support behind MiniMax this time

[Figure: NVIDIA supports MiniMax M2.7]

GPU-accelerated endpoint: M2.7 can be tried for free at build.nvidia.com/minimaxai/minimax-m2.7

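Inference optimization: NVIDIA worked with the open-source community on two key optimizations for vLLM and SGLang:

QK RMS Norm Kernel: fuses computation and communication operations into a single kernel, reducing kernel-launch and memory read/write overhead
FP8 MoE: integrates TensorRT-LLM's modular FP8 MoE kernels, optimized specifically for MoE models

The results are striking. On NVIDIA Blackwell Ultra GPUs:

vLLM throughput up 2.5× (achieved within one month)
SGLang throughput up 2.7× (achieved within one month)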

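NemoClaw: NVIDIA provides NemoClaw, an open-source reference stack for one-click deployment of an always-on OpenClaw assistant

Fine-tuning support: post-training is available through the NeMo AutoModel library, which supports an EP + PP training scheme. The NeMo RL library also provides sample GRPO reinforcement-learning recipes (8K and 16K sequence lengths)

Fine-tuning recipes:

# NeMo AutoModel fine-tuning recipe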
https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/minimax_m2/minimax_m2.7_hellaswag_pp.yaml


# Distributed training documentation
https://github.com/NVIDIA-NeMo/Automodel/discussions/1786

Transformers

You can also load the model directly with HuggingFace Transformers; see the Transformers deployment guide (huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/docs/transformers_deploy_guide.md)

ModelScope

Users in mainland China can also download the model weights from ModelScope (modelscope.cn/models/MiniMax/MiniMax-M2.7)

Tool Calling and Thinking Mode

M2.7 supports both tool calling and thinking mode, which makes it more flexible in agent scenarios.

Tool-calling example (using SGLang):

from openai import OpenAI

# Point the OpenAI client at the local SGLang server; the key is a placeholder.
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# One tool, described with a JSON Schema for its arguments.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.7",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools
)

# Print any tool calls the model decided to make.
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool Call: {tool_call.function.name}")
        print(f"   Arguments: {tool_call.function.arguments}")
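
The example above stops at printing the tool call. A minimal sketch of the next step, executing the call locally and packaging the result for the model, might look like the following. The `get_weather` body is a hypothetical stub, and `TOOL_REGISTRY`, `run_tool_call`, and `tool_result_message` are illustrative names, not part of any library.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stub; a real version would query a weather service."""
    return f"Sunny in {location}"


# Maps tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Decode the JSON arguments string and invoke the matching local tool."""
    arguments = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**arguments)


def tool_result_message(tool_call_id: str, result: str) -> dict:
    """Wrap a tool result in the OpenAI-style 'tool' message format."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}
```

In the usual OpenAI-style flow, the assistant message containing the tool calls and one `tool_result_message(...)` per call are appended to `messages`, and `client.chat.completions.create` is called again so the model can produce its final answer.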

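Thinking mode: the reasoning process is wrapped inside the content using ... tags. In streaming scenarios, these tags can be parsed in real time so that the reasoning and the final answer are shown separately.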
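
For a completed (non-streaming) response, separating the two parts can be sketched as a simple string split. The tag names are not shown in the text above, so the `<think>` and `</think>` defaults below are an assumption and are exposed as parameters; `split_thinking` is an illustrative helper, not a library function.

```python
def split_thinking(text: str,
                   open_tag: str = "<think>",      # assumption: actual tag may differ
                   close_tag: str = "</think>") -> tuple[str, str]:
    """Return (thinking, answer) from a full response string.

    If no complete tag pair is found, the thinking part is empty and the
    whole text is treated as the answer.
    """
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", text.strip()
    thinking = text[start + len(open_tag):end].strip()
    answer = (text[:start] + text[end + len(close_tag):]).strip()
    return thinking, answer
```

A streaming version would need to buffer chunks until a full tag has arrived, since a tag can be split across two chunks; this sketch deliberately leaves that out.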

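Quick Start

If you want to try M2.7 quickly, these are the simplest options:

Option 1: API access

Sign up for a developer account at platform.minimax.io and call the model through the API.

Option 2: MiniMax Agent

Chat online directly at agent.minimax.io.

Option 3: Ollama cloud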

ollama run minimax-m2.7:cloud

Option 4: NVIDIA free endpoint

Visit build.nvidia.com/minimaxai/minimax-m2.7 to test it directly in the browser.

