First Impressions of the Qwen2.5 Models
1. Environment Setup
Hardware: 3x NVIDIA A100 GPUs with 40 GB of memory each
Software: CUDA 12.2, a conda virtual environment
conda create -n my_vllm python==3.9.19 pip
conda activate my_vllm
pip install modelscope
pip install vllm
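Before downloading the model it can help to confirm that the GPUs are visible from the new environment. A minimal sanity-check sketch (assuming torch was installed as a vLLM dependency) could look like this:
# Sanity check: vLLM imports cleanly and the A100s are visible from this conda env
import torch
import vllm

print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
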
2. Model Download
Because of the hardware constraints, after several attempts I could only get the Qwen2.5-72B-Instruct-GPTQ-Int4 build deployed;
it may also be that my deployment approach was wrong, since larger variants kept hitting OOM errors...
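A rough weight-only memory estimate makes the constraint plausible (this ignores the KV cache and activation memory, so it understates real usage):
# Rough estimate: why only the GPTQ-Int4 build fits on 3 x 40 GB A100s
params = 72e9                       # ~72B parameters
fp16_gb = params * 2 / 1024**3      # FP16: 2 bytes per parameter -> ~134 GB
int4_gb = params * 0.5 / 1024**3    # GPTQ-Int4: ~0.5 bytes per parameter -> ~34 GB
print(f"FP16 weights ~= {fp16_gb:.0f} GB, Int4 weights ~= {int4_gb:.0f} GB, total GPU memory = 3 x 40 = 120 GB")
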
# Model download
# modelscope's default download path: /root/.cache/modelscope/hub/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4
from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen2.5-72B-Instruct-GPTQ-Int4', local_dir='/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4')
Reference docs:
ModelScope Community
Efficiency Evaluation - Qwen
3. Launching Directly on the Server with vLLM for Testing
vllm serve /home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2
Once startup succeeds, the server listens on port 8000 by default and exposes an OpenAI-compatible API.
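A quick way to confirm the server is reachable (assuming the default port 8000) is to query the model list endpoint of the OpenAI-compatible API:
# Liveness check against the OpenAI-compatible API served by vLLM
import requests

print(requests.get("http://localhost:8000/v1/models").json())
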
Reference doc:
https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html
4. Test Code
import json
import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

question = 'balabala'
data = {
    "model": "/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": question}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}

response = requests.post(url, headers=headers, data=json.dumps(data))

# Print the response
print(response.json())
print(response.json()['choices'][0]['message']['content'][8:-4])
The request returns results normally.
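As an alternative to raw requests, the same call can be made with the official openai Python client. This is a sketch assuming `pip install openai` (1.x) and the same local vLLM server; the standard sampling parameters are kept and repetition_penalty is left out for simplicity:
# Same chat request via the openai client pointed at the local vLLM server
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not check the key unless --api-key is set
)

completion = client.chat.completions.create(
    model="/home/models/qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    messages=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "balabala"},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print(completion.choices[0].message.content)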