vLLM Distributed Serving Deployment and Benchmarking Guide
Table of Contents
- Docker Deployment
- Starting the Service
- API Testing
- Load Testing
- PD (Prefill/Decode) Disaggregation
References:
- Model weights: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
- vLLM quickstart: https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#installation
Docker Deployment
Start the vLLM server container
```bash
docker run -t -d \
  --name="vllm" \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --network=host \
  --gpus all \
  --privileged \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /mnt:/mnt \
  registry.cn-hangzhou.aliyuncs.com/lky-deploy/llm:vllm-server-0.7.2
```
Key flags:
- `--gpus all`: expose all GPUs to the container
- `--network=host`: use host networking
- `-v /mnt:/mnt`: mount the host storage volume into the container
- `--privileged`: run the container in privileged mode
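To confirm the container came up and can see the GPUs, a minimal check (assuming the container name `vllm` from the command above):

```bash
# Confirm the container is running and GPUs are visible inside it
docker ps --filter name=vllm
docker exec vllm nvidia-smi
```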
Starting the Service
1. Initialize the environment
```bash
# Interfaces used by Gloo and NCCL for inter-node communication
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0

# On the head node
ray start --head --dashboard-host 0.0.0.0

# On each worker node, join the cluster (replace ${master_ip} with the head node's IP)
ray start --address="${master_ip}:6379"

# Verify that all nodes have joined
ray status
```
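Note that `eth0` above is an assumption: `GLOO_SOCKET_IFNAME`/`NCCL_SOCKET_IFNAME` must name the NIC that actually carries inter-node traffic on your hosts. To list the candidate interfaces:

```bash
# Print interface names; pick the one on the cluster network
ip -o link show | awk -F': ' '{print $2}'
```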
2. Start the vLLM service
```bash
# Single node: 7B model with 8-way tensor parallelism
vllm serve /mnt/7B \
  --tensor-parallel-size 8 \
  --served-model-name vllm \
  --trust-remote-code \
  --enable-chunked-prefill \
  --host 0.0.0.0

# Multi-node (over the Ray cluster above): DeepSeek-R1 with TP=16, PP=2
vllm serve /mnt/DeepSeek-R1 \
  --tensor-parallel-size 16 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --pipeline-parallel-size 2 \
  --host 0.0.0.0 \
  --max-num-batched-tokens 131072 &> vllm.log &
```
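The DeepSeek-R1 command runs in the background, so startup progress goes to vllm.log. A quick way to confirm the server is up (vLLM serves a /health endpoint on its API port, 8000 by default):

```bash
tail -n 50 vllm.log                   # watch startup progress
curl -i http://localhost:8000/health  # returns HTTP 200 once the engine is ready
```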
Flag reference: https://docs.vllm.com.cn/en/latest/configuration/engine_args.html#modelconfig
Core flags:

| Flag | Description |
| --- | --- |
| `--tensor-parallel-size 16` | Tensor-parallel degree |
| `--pipeline-parallel-size 2` | Pipeline-parallel degree |
| `--max-num-batched-tokens 131072` | Maximum number of batched tokens (128K) |

A deployment needs tensor-parallel size × pipeline-parallel size GPUs in total, so the DeepSeek-R1 launch above spans 16 × 2 = 32 GPUs across the Ray cluster.
API Testing
1. Chat completions endpoint
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/DeepSeek-V3/DeepSeek-V3/",
    "messages": [{"role": "user", "content": "Write a 20-character poem about the moon"}],
    "stream": false
  }'
```
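The "model" field must match the server's --served-model-name (or the model path when that flag was not set). To see which names this server accepts, query the OpenAI-compatible model list:

```bash
# Lists the model IDs registered with this vLLM instance
curl http://localhost:8000/v1/models
```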
2. Completions endpoint
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/DeepSeek-R1",
    "prompt": "Is Paris the capital of France?",
    "max_tokens": 50
  }'
```
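Both endpoints also accept "stream": true. A sketch of the streaming form of the same completions request; -N keeps curl from buffering the server-sent events:

```bash
curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/mnt/DeepSeek-R1", "prompt": "Is Paris the capital of France?", "max_tokens": 50, "stream": true}'
```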
Load Testing
1. Start the benchmark container
```bash
docker run --name vllm-benchmark -it -v /mnt:/mnt -d \
  registry.cn-hangzhou.aliyuncs.com/lky-deploy/llm:vllm-benchmark-tagv0.7.2 bash
```
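The container starts detached; attach a shell to it to run the following steps inside:

```bash
docker exec -it vllm-benchmark bash
```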
2. Prepare the test dataset
```bash
modelscope download --dataset gliang1001/ShareGPT_V3_unfiltered_cleaned_split \
  ShareGPT_V3_unfiltered_cleaned_split.json --local_dir /root/
```
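A quick sanity check that the dataset landed where the benchmark script below expects it:

```bash
ls -lh /root/ShareGPT_V3_unfiltered_cleaned_split.json
```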
3. Run the benchmark script
```bash
#!/bin/bash
# Model name as registered with the server (the path, since --served-model-name was not set)
NAME="/mnt/DeepSeek-R1/"

# concurrency:num_prompts pairs -- each level issues 10x its concurrency in requests
PAIRS=( "1:10" "8:80" "16:160" "24:240" "32:320" )

RANDOM_INPUT_LEN=1
RANDOM_OUTPUT_LEN=128
RANDOM_RANGE_RATIO=1

for pair in "${PAIRS[@]}"; do
  IFS=":" read -r concurrency num_prompt <<< "$pair"
  output_file="${RANDOM_INPUT_LEN}_${RANDOM_OUTPUT_LEN}_${RANDOM_RANGE_RATIO}_c${concurrency}_n${num_prompt}.txt"

  python3 /root/vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model "$NAME" \
    --served-model-name "$NAME" \
    --trust-remote-code \
    --dataset-name random \
    --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json \
    --random-input-len $RANDOM_INPUT_LEN \
    --random-output-len $RANDOM_OUTPUT_LEN \
    --random-range-ratio $RANDOM_RANGE_RATIO \
    --num-prompts $num_prompt \
    --max-concurrency $concurrency \
    --request-rate inf \
    --host 10.112.201.25 \
    --port 8000 \
    --endpoint /v1/completions \
    &> "$output_file"

  echo "Finished run: concurrency=${concurrency}, num_prompts=${num_prompt}"
  sleep 3
done

echo "All benchmark runs complete!"
```
Benchmark configuration:
- Test mode: random input/output lengths
- Input length: 1 token
- Output length: 128 tokens
- Concurrency sweep: 1 to 32 concurrent requests
- Request-count sweep: 10 to 320 requests (10× the concurrency at each level)
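Each run writes its full report to a per-level .txt file named after the input/output lengths, concurrency, and request count. A minimal sketch for pulling the headline numbers out of those files; the grep patterns assume the metric labels printed by benchmark_serving.py (throughput, TTFT, TPOT) and may need adjusting to your vLLM version:

```bash
# Summarize throughput and latency lines from every result file
for f in *_c*_n*.txt; do
  echo "== $f =="
  grep -E "throughput|TTFT|TPOT" "$f"
done
```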
PD (Prefill/Decode) Disaggregation
Official documentation: https://docs.vllm.ai/en/latest/features/disagg_prefill.html?h=prefill#why-disaggregated-prefilling
1. Prefill node configuration
vLLM currently supports five connector types:
- SharedStorageConnector: shares the KV cache through a shared storage path
- LMCacheConnectorV1: combines LMCache caching with NIXL-based KV cache transfer
- NixlConnector: transfers the KV cache over NIXL
- P2pNcclConnector: transfers the KV cache over NVIDIA NCCL
- MultiConnector: composes multiple connectors

```bash
# Prefill instance 0: KV producer, rank 0 of a 2-way KV transfer group
CUDA_VISIBLE_DEVICES=0 vllm serve /data/7B \
  --port 8100 \
  --max-model-len 100 \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config '{
    "kv_connector": "PyNcclConnector",
    "kv_role": "kv_producer",
    "kv_rank": 0,
    "kv_parallel_size": 2,
    "kv_ip": "172.25.79.18",
    "kv_port": 14579
  }'

# Prefill instance 1: same flags, only the port and kv_rank change
CUDA_VISIBLE_DEVICES=1 vllm serve /data/7B \
  --port 8101 \
  ... \
  --kv-transfer-config '{
    ...
    "kv_rank": 1,
    ...
  }'
```
2. Decode node configuration
```bash
# Decode instance 0: KV consumer
CUDA_VISIBLE_DEVICES=1 vllm serve /data/7B \
  --port 8200 \
  ... \
  --kv-transfer-config '{
    "kv_role": "kv_consumer",
    "kv_rank": 1,
    ...
  }'

# Decode instance 1
CUDA_VISIBLE_DEVICES=2 vllm serve /data/7B \
  --port 8201 \
  ... \
  --kv-transfer-config '{
    ...
    "kv_rank": 0,
    ...
  }'
```
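Before wiring up the proxy, it is worth confirming that every instance is serving; vLLM exposes a /health liveness endpoint on each API port:

```bash
# Expect HTTP 200 from each instance once its engine is ready
for port in 8100 8101 8200 8201; do
  curl -s -o /dev/null -w "port ${port}: %{http_code}\n" "http://localhost:${port}/health"
done
```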
3. Proxy (router) configuration
https://github.com/vllm-project/vllm/blob/main/examples/online_serving/disaggregated_serving/disagg_proxy_demo.py
```bash
# Start the disaggregated-serving proxy in front of the prefill and decode instances
python3 examples/online_serving/disaggregated_serving/disagg_proxy_demo.py \
  --model your_model_name \
  --prefill localhost:8100 localhost:8101 \
  --decode localhost:8200 localhost:8201 \
  --port 8000

# Alternative: the benchmark proxy from the vLLM repo, plus its dependencies
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py -O disagg_prefill_proxy_server.py
python3 -m pip install --ignore-installed blinker quart -i https://pypi.tuna.tsinghua.edu.cn/simple
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/disagg_benchmarks/rate_limiter.py -O rate_limiter.py
wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/benchmarks/disagg_benchmarks/request_queue.py -O request_queue.py
```
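With the proxy listening on port 8000, clients send requests to it exactly as in the earlier API tests, and it routes each request through a prefill and a decode instance. A sketch, assuming the /data/7B model path used above as the model name:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/data/7B", "prompt": "Hello", "max_tokens": 16}'
```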
Parameter Notes
Common flags
| Flag | Description |
| --- | --- |
| `--max-model-len` | Maximum sequence length (must fit within available memory) |
| `--gpu-memory-utilization` | GPU memory utilization cap (0.0–1.0) |
| `--kv-parallel-size` | KV-transfer parallel degree (must match the number of participating nodes) |
KV transfer parameters
```json
{
  "kv_connector": "PyNcclConnector",
  "kv_role": "kv_producer",
  "kv_rank": 0,
  "kv_ip": "master_ip",
  "kv_port": 14579
}
```

Set `kv_role` to `"kv_producer"` on prefill instances and `"kv_consumer"` on decode instances; `"master_ip"` is a placeholder for the producer node's reachable address.