TensorRT-LLM in Seven Days: Day 2

news 2025/7/10 1:29:50

The goal is to spend seven days getting familiar with the TensorRT-LLM code architecture, how cuBLAS is used, and how to tune flash attention.

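Yesterday we got stuck on environment setup. After a day of waiting, the pip installation has mostly finished, so we continue configuring the environment by following link 1.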

pip install -r requirements.txt
pip install --upgrade transformers  # Llama 3.1 requires transformers 4.43.0 or later.
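Since the upgrade exists only to satisfy the 4.43.0 minimum, it can be worth confirming what actually got installed. The following is a minimal sketch using only the standard library; the helper names (`version_tuple`, `meets_minimum`, `check_package`) are hypothetical, and the comparison looks at release numbers only, ignoring pre-release suffixes.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.43.0' or '4.45.0.dev0' into a tuple of its leading integers."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release numbers, padding with zeros so '4.43' equals '4.43.0'."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name: str, minimum: str) -> str:
    """Return a short status line for an installed (or missing) package."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name} is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"{name} {installed}: {status} (need >= {minimum})"


if __name__ == "__main__":
    # 4.43.0 is the minimum quoted in the install comment above.
    print(check_package("transformers", "4.43.0"))
```

Running it inside the same environment that will run the conversion script avoids surprises from a second Python installation.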

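Next, we want to run inference with Llama 3.1, so the model has to be downloaded first.

Once the download finishes, the model checkpoint needs to be converted.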

python3 convert_checkpoint.py --model_dir Meta-Llama-3.1-8B-Instruct \
--output_dir llama-3.1-8b-ckpt

Then the model needs to be compiled into an engine.

# Compile model
trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
    --gemm_plugin float16 \
    --output_dir ./llama-3.1-8b-engine
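The conversion and build steps can also be driven from Python, which makes it easier to swap directories when trying a different model. This is only a sketch: the function names `pipeline_commands` and `run_pipeline` are hypothetical, the flags are exactly the ones used in the shell commands above, and the default is a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def pipeline_commands(model_dir: str, ckpt_dir: str, engine_dir: str) -> list:
    """Assemble the convert and build commands as argument lists."""
    convert = [
        "python3", "convert_checkpoint.py",
        "--model_dir", model_dir,
        "--output_dir", ckpt_dir,
    ]
    build = [
        "trtllm-build",
        "--checkpoint_dir", ckpt_dir,
        "--gemm_plugin", "float16",
        "--output_dir", engine_dir,
    ]
    return [convert, build]


def run_pipeline(commands: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    commands = pipeline_commands(
        "Meta-Llama-3.1-8B-Instruct",
        "llama-3.1-8b-ckpt",
        "./llama-3.1-8b-engine",
    )
    run_pipeline(commands, dry_run=True)
```

Passing argument lists instead of a single shell string sidesteps the line-continuation and quoting mistakes that are easy to make when copying multi-line commands.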

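So what is trtllm-build itself? Inspecting the file shows that it is a small executable script. Despite being invoked like a shell command, it is a Python file (note the #!/usr/bin/python3 shebang) whose only job is to import main from tensorrt_llm.commands.build and call it.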

# trtllm-build
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from tensorrt_llm.commands.build import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
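The only non-obvious line in the wrapper is the re.sub call. The script has the shape of the launcher that pip typically generates for a package's console entry point (an assumption; the post does not say how the file was produced), and the substitution strips Windows-style launcher suffixes so that argv[0] reads as the plain command name. A small sketch of that behavior, with `normalize_argv0` as a hypothetical helper name:

```python
import re


def normalize_argv0(argv0: str) -> str:
    """Apply the same substitution the wrapper applies to sys.argv[0]."""
    # Removes a trailing '-script.pyw' or '.exe' if present; otherwise no change.
    return re.sub(r'(-script\.pyw|\.exe)?$', '', argv0)


if __name__ == "__main__":
    for name in (
        "/usr/local/bin/trtllm-build",
        "trtllm-build.exe",
        "trtllm-build-script.pyw",
    ):
        print(f"{name!r} -> {normalize_argv0(name)!r}")
```

On Linux the path is left untouched, so in the setup described here the line has no effect; all the real work happens inside main().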

Checking with pip shows that tensorrt_llm is an installed Python package:

$user: pip show tensorrt_llm
Name: tensorrt-llm
Version: 0.14.0.dev2024092401
Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
Home-page: https://github.com/NVIDIA/TensorRT-LLM
Author: NVIDIA Corporation
Author-email: 
License: Apache License 2.0
Location: /usr/local/lib/python3.10/dist-packages

Looking further into the package contents, we can see that it consists of many Python files, and that libs also contains several .so files.

root@6952aaf5d371:/usr/local/lib/python3.10/dist-packages/tensorrt_llm# tree -L 1
.
|-- __init__.py
|-- __pycache__
|-- _common.py
|-- _ipc_utils.py
|-- _utils.py
|-- auto_parallel
|-- bench
|-- bin
|-- bindings
|-- bindings.cpython-310-x86_64-linux-gnu.so
|-- builder.py
|-- commands
|-- executor.py
|-- functional.py
|-- graph_rewriting.py
|-- hlapi
|-- layers
|-- libs
|-- logger.py
|-- lora_manager.py
|-- mapping.py
|-- models
|-- module.py
|-- network.py
|-- parameter.py
|-- plugin
|-- profiler.py
|-- quantization
|-- runtime
|-- tools
|-- top_model_mixin.py
`-- version.py
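The same inspection can be done from Python without importing the package, which is handy when an import is slow or fails. This is a minimal sketch under the assumption that the package is a regular directory on disk; `package_dir` and `summarize` are hypothetical helper names, and the demo uses the standard-library `json` package so it runs anywhere.

```python
import importlib.util
from pathlib import Path


def package_dir(name: str):
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def summarize(name: str):
    """List top-level entries and any shared libraries under the package."""
    root = package_dir(name)
    if root is None:
        return None
    entries = sorted(p.name for p in root.iterdir())
    # Matches names such as libfoo.so and libfoo.so.1 at any depth.
    shared = sorted(str(p.relative_to(root)) for p in root.rglob("*.so*"))
    return {"root": root, "entries": entries, "shared_libs": shared}


if __name__ == "__main__":
    info = summarize("json")  # replace with "tensorrt_llm" where it is installed
    print(info["root"])
    print(info["entries"])
    print(info["shared_libs"])
```

For tensorrt_llm, the `shared_libs` list should surface the compiled bindings module and the contents of libs seen in the tree output above.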

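At this point we have a basic picture of the TensorRT-LLM library: Python serves as the interface, while underneath it still relies on the closed-source, high-performance implementations in TensorRT and CUDA. Looking at the TensorRT-LLM source code, there is also a pybind folder, which contains the binding code that converts and bridges interfaces between Python and C++.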

With that understood, we run the build on the converted checkpoint:

# Compile model
trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
    --gemm_plugin float16 \
    --output_dir ./llama-3.1-8b-engine

Next we run inference, but it fails with an out-of-memory (OOM) runtime error.

python3 ../run.py --engine_dir ./llama-3.1-8b-engine  --max_output_len 50 --tokenizer_dir Meta-Llama-3.1-8B-Instruct --input_text "Hello"

RuntimeError: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:125)

The error above appears to be caused by the model being too large for the available GPU memory, so we switch to a smaller model, TinyLlama (2.1G).
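A back-of-the-envelope estimate supports this reading. The sketch below assumes float16 weights at 2 bytes per parameter and takes the parameter counts from the model names (roughly 8e9 and 1.1e9, both approximations); it counts weights only, so the KV cache, activations, and runtime buffers come on top. The GPU's memory capacity is not stated in the post, so no comparison against it is made here.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB (float16 by default)."""
    return num_params * bytes_per_param / GIB


if __name__ == "__main__":
    # Parameter counts are approximations read off the model names.
    for name, params in [("Llama 3.1 8B", 8e9), ("TinyLlama 1.1B", 1.1e9)]:
        print(f"{name}: ~{weight_gib(params):.1f} GiB of float16 weights")
```

Under these assumptions the 8B model needs close to 15 GiB for weights before any inference buffers are allocated, while TinyLlama needs about 2 GiB, which is in the same range as the 2.1G size quoted above.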

With the following script, inference runs successfully.

from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model_path = "/data/TinyLlama-1.1B-Chat-v1.0"
llm = LLM(model=model_path)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")