Deploying Microsoft's Latest Lightweight Multimodal Model: Phi-3.5-vision-instruct
Phi-3.5-vision-instruct is an AI model in Microsoft's newly released Phi-3.5 family. It focuses on multimodal tasks, with particular strength in visual reasoning.
The model offers broad image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video-clip summarization, making it well suited to a wide range of AI-driven applications. It shows notable gains on image- and video-processing benchmarks.
Architecturally, Phi-3.5-vision-instruct is a 4.2-billion-parameter system that combines an image encoder, a connector, a projector, and the Phi-3 Mini language model. It was trained on 256 NVIDIA A100-80G GPUs over 6 days.
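As a rough, optional illustration (my addition, not from the original article), once the weights are downloaded (see the steps below) you can list the model's top-level modules and confirm the parameter count; the local path here is a placeholder:

import torch
from transformers import AutoModelForCausalLM

# Placeholder path to a local copy of the weights.
model = AutoModelForCausalLM.from_pretrained(
    "./Phi-3.5-vision-instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Inspect the top-level module layout.
for name, module in model.named_children():
    print(name, type(module).__name__)

# Total parameter count; should come out to roughly 4.2B.
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B parameters")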
On the MMMU benchmark, Phi-3.5-vision scores 43.0, an improvement over the previous version that reflects stronger handling of complex image-understanding tasks.
GitHub project: https://github.com/microsoft/Phi-3CookBook
I. Environment Setup
1. Python environment
Python 3.10 or later is recommended.
2. Installing pip packages
pip install torch==2.3.0+cu118 torchvision==0.18.0+cu118 torchaudio==2.3.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install --upgrade transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install flash-attn --no-build-isolation
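As an optional sanity check (my addition), confirm that the CUDA build of PyTorch and the optional flash-attn wheel installed correctly:

import torch

print(torch.__version__)          # expected: 2.3.0+cu118
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine

try:
    import flash_attn  # only needed for _attn_implementation='flash_attention_2'
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; use the default attention implementation")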
3. Model download:
git lfs install
git clone https://modelscope.cn/models/LLM-Research/Phi-3.5-vision-instruct
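Alternatively (a sketch of my own, not from the original article), the same weights can be fetched with ModelScope's Python SDK (pip install modelscope); the cache directory below is an assumption:

from modelscope import snapshot_download

# Downloads the repository and returns the local path to the weights.
model_dir = snapshot_download(
    'LLM-Research/Phi-3.5-vision-instruct',
    cache_dir='./models',  # hypothetical cache directory
)
print(model_dir)  # use this as --model_path in the test script below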
II. Functionality Testing
1. Running the test:
(1) Calling the model from Python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
import argparse


class VisionInstructModel:
    def __init__(self, model_path, local_image_path, torch_dtype='auto'):
        self.model_path = model_path
        self.local_image_path = local_image_path
        self.torch_dtype = torch_dtype
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self._load_model_and_processor()

    def _load_model_and_processor(self):
        # trust_remote_code is required: the model ships custom processor/model code.
        self.processor = AutoProcessor.from_pretrained(self.model_path, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            trust_remote_code=True,
            torch_dtype=self.torch_dtype,
            # Requires the flash-attn package; change to 'eager' if it is not installed.
            _attn_implementation='flash_attention_2'
        ).to(self.device)

    def _prepare_input(self, prompt, image_path):
        # The processor takes the prompt plus a list of images.
        image = Image.open(image_path)
        return self.processor(prompt, [image], return_tensors="pt").to(self.device)

    def generate_response(self, prompt, max_new_tokens=1000):
        inputs = self._prepare_input(prompt, self.local_image_path)
        generate_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=self.processor.tokenizer.eos_token_id
        )
        # Drop the prompt tokens so only the newly generated text is decoded.
        generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
        response = self.processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
        return response

    def describe_image(self):
        # Phi-3.5-vision chat format: <|user|> ... <|end|> <|assistant|>,
        # with <|image_1|> marking where the image is inserted.
        user_prompt = '<|user|>\n'
        assistant_prompt = '<|assistant|>\n'
        prompt_suffix = "<|end|>\n"
        prompt = f"{user_prompt}<|image_1|>\nDescribe the picture{prompt_suffix}{assistant_prompt}"
        response = self.generate_response(prompt)
        print("response:", response)
        return response


def main(model_path, image_path):
    # Pass a concrete torch dtype; torch.bfloat16 is safer across
    # transformers versions than the string 'bfloat16'.
    model = VisionInstructModel(model_path, image_path, torch_dtype=torch.bfloat16)
    model.describe_image()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run VisionInstructModel to describe an image.")
    parser.add_argument("--model_path", type=str, required=True, help="Path to the model directory.")
    parser.add_argument("--image_path", type=str, required=True, help="Path to the image file.")
    args = parser.parse_args()
    main(args.model_path, args.image_path)
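Assuming the file above is saved as phi35_vision_demo.py (a hypothetical name), a test run looks like:

python phi35_vision_demo.py --model_path ./Phi-3.5-vision-instruct --image_path ./test.jpg

The model also accepts multi-image input, which the multi-image/video-clip summarization use case relies on. Below is a minimal sketch (my addition, not from the original article) of how that could be expressed with the class above; it simply generalizes the single-image prompt to numbered <|image_i|> placeholders, and the file paths are placeholders:

from PIL import Image
import torch

def summarize_images(vim, image_paths, question="Summarize the contents of the images."):
    # One numbered placeholder per image, mirroring describe_image().
    images = [Image.open(p) for p in image_paths]
    placeholders = "".join(f"<|image_{i + 1}|>\n" for i in range(len(images)))
    prompt = f"<|user|>\n{placeholders}{question}<|end|>\n<|assistant|>\n"
    inputs = vim.processor(prompt, images, return_tensors="pt").to(vim.device)
    generate_ids = vim.model.generate(
        **inputs,
        max_new_tokens=500,
        eos_token_id=vim.processor.tokenizer.eos_token_id,
    )
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    return vim.processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

# Example usage (placeholder paths; the constructor's image argument is unused here):
# vim = VisionInstructModel("./Phi-3.5-vision-instruct", "placeholder.jpg", torch_dtype=torch.bfloat16)
# print(summarize_images(vim, ["frame1.jpg", "frame2.jpg"]))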
To be continued...
For more details, follow: 杰哥新技术