(vllm) root@autodl-container-50604192c4-d9d01c36:~/autodl-tmp#
# Change the --model local model path and the --port binding port below to match your setup
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name autoglm-phone-9b \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm_processor_cache_type shm \
    --mm_processor_kwargs "{\"max_pixels\":5000000}" \
    --max-model-len 25480 \
    --chat-template-content-format string \
    --limit-mm-per-prompt "{\"image\":10}" \
    --model /root/autodl-tmp/autoglm-phone-9b \
    --port 6006
(APIServer pid=1433) INFO 12-10 22:50:46 [api_server.py:1772] vLLM API server version 0.12.0
(APIServer pid=1433) INFO 12-10 22:50:46 [utils.py:253] non-default args: {'port': 6006, 'chat_template_content_format': 'string', 'model': '/root/autodl-tmp/autoglm-phone-9b', 'allowed_local_media_path': '/', 'max_model_len': 25480, 'served_model_name': ['autoglm-phone-9b'], 'limit_mm_per_prompt': {'image': 10}, 'mm_processor_kwargs': {'max_pixels': 5000000}, 'mm_processor_cache_type': 'shm', 'mm_encoder_tp_mode': 'data'}
(APIServer pid=1433) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'partial_rotary_factor', 'mrope_section'}
(APIServer pid=1433) INFO 12-10 22:50:46 [model.py:637] Resolved architecture: Glm4vForConditionalGeneration
(APIServer pid=1433) INFO 12-10 22:50:46 [model.py:1750] Using max model len 25480
(APIServer pid=1433) INFO 12-10 22:50:46 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
(EngineCore_DP0 pid=1485) INFO 12-10 22:50:55 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='/root/autodl-tmp/autoglm-phone-9b', speculative_config=None, tokenizer='/root/autodl-tmp/autoglm-phone-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=25480, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=autoglm-phone-9b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=1485) INFO 12-10 22:50:57 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.10:47959 backend=nccl
(EngineCore_DP0 pid=1485) INFO 12-10 22:50:58 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1485) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=1485) Keyword argument `max_pixels` is not a valid argument for this processor and will be ignored.
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:05 [gpu_model_runner.py:3467] Starting to load model /root/autodl-tmp/autoglm-phone-9b...
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:05 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00, 4.26it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02, 1.43it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01, 1.15it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00, 1.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00, 1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00, 1.07it/s]
(EngineCore_DP0 pid=1485)
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:10 [default_loader.py:308] Loading weights took 4.87 seconds
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:11 [gpu_model_runner.py:3549] Model loading took 19.2562 GiB memory and 5.143751 seconds
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:11 [gpu_model_runner.py:4306] Encoder cache will be initialized with a budget of 18622 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:22 [backends.py:655] Using cache directory: /root/.cache/vllm/torch_compile_cache/19b1386448/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:22 [backends.py:715] Dynamo bytecode transform time: 7.25 s
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:22 [backends.py:257] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:29 [backends.py:288] Compiling a graph for dynamic shape takes 6.80 s
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:30 [monitor.py:34] torch.compile takes 14.05 s in total
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:31 [gpu_worker.py:359] Available KV cache memory: 19.17 GiB
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:32 [kv_cache_utils.py:1286] GPU KV cache size: 502,496 tokens
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:32 [kv_cache_utils.py:1291] Maximum concurrency for 25,480 tokens per request: 19.72x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:03<00:00, 16.86it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 23.12it/s]
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:37 [gpu_model_runner.py:4466] Graph capturing finished in 5 secs, took 0.69 GiB
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:37 [core.py:254] init engine (profile, create kv cache, warmup model) took 26.32 seconds
(APIServer pid=1433) INFO 12-10 22:51:40 [api_server.py:1520] Supported tasks: ['generate']
(APIServer pid=1433) INFO 12-10 22:51:40 [api_server.py:1847] Starting vLLM API server 0 on http://0.0.0.0:6006
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:38] Available routes are:
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1433) INFO:     Started server process [1433]
(APIServer pid=1433) INFO:     Waiting for application startup.
(APIServer pid=1433) INFO:     Application startup complete.
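Once "Application startup complete" is printed, the server exposes an OpenAI-compatible API on port 6006. A minimal smoke test from inside the container, assuming curl is available (the /health, /v1/models, and /v1/chat/completions routes and the served model name all appear in the log above; the prompt text and max_tokens value are only illustrative):

# confirm the engine is alive and the model is registered
curl http://localhost:6006/health
curl http://localhost:6006/v1/models

# text-only chat completion against the served model name
curl http://localhost:6006/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "autoglm-phone-9b",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'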
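Because the server was launched with --allowed-local-media-path /, images already stored inside the container can be referenced with a file:// URL in the OpenAI-style multimodal message format, up to the 10 images per prompt permitted by --limit-mm-per-prompt. A sketch of such a request, assuming a hypothetical screenshot at /root/autodl-tmp/screenshot.png (replace the path and the instruction text with your own):

# multimodal request: one local screenshot plus a text instruction
curl http://localhost:6006/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "autoglm-phone-9b",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "image_url",
               "image_url": {"url": "file:///root/autodl-tmp/screenshot.png"}},
              {"type": "text", "text": "Describe the current screen."}
            ]
          }],
          "max_tokens": 256
        }'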