(vllm) root@autodl-container-50604192c4-d9d01c36:~/autodl-tmp#
# Change the --model local model path and the --port binding port below to match your setup
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name autoglm-phone-9b \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm_processor_cache_type shm \
    --mm_processor_kwargs "{\"max_pixels\":5000000}" \
    --max-model-len 25480 \
    --chat-template-content-format string \
    --limit-mm-per-prompt "{\"image\":10}" \
    --model /root/autodl-tmp/autoglm-phone-9b \
    --port 6006
(APIServer pid=1433) INFO 12-10 22:50:46 [api_server.py:1772] vLLM API server version 0.12.0
(APIServer pid=1433) INFO 12-10 22:50:46 [utils.py:253] non-default args: {'port': 6006, 'chat_template_content_format': 'string', 'model': '/root/autodl-tmp/autoglm-phone-9b', 'allowed_local_media_path': '/', 'max_model_len': 25480, 'served_model_name': ['autoglm-phone-9b'], 'limit_mm_per_prompt': {'image': 10}, 'mm_processor_kwargs': {'max_pixels': 5000000}, 'mm_processor_cache_type': 'shm', 'mm_encoder_tp_mode': 'data'}
(APIServer pid=1433) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'partial_rotary_factor', 'mrope_section'}
(APIServer pid=1433) INFO 12-10 22:50:46 [model.py:637] Resolved architecture: Glm4vForConditionalGeneration
(APIServer pid=1433) INFO 12-10 22:50:46 [model.py:1750] Using max model len 25480
(APIServer pid=1433) INFO 12-10 22:50:46 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
(EngineCore_DP0 pid=1485) INFO 12-10 22:50:55 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='/root/autodl-tmp/autoglm-phone-9b', speculative_config=None, tokenizer='/root/autodl-tmp/autoglm-phone-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=25480, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=autoglm-phone-9b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=1485) INFO 12-10 22:50:57 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.10:47959 backend=nccl
(EngineCore_DP0 pid=1485) INFO 12-10 22:50:58 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1485) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=1485) Keyword argument `max_pixels` is not a valid argument for this processor and will be ignored.
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:05 [gpu_model_runner.py:3467] Starting to load model /root/autodl-tmp/autoglm-phone-9b...
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:05 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00, 4.26it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02, 1.43it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01, 1.15it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:03<00:00, 1.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00, 1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:04<00:00, 1.07it/s]
(EngineCore_DP0 pid=1485)
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:10 [default_loader.py:308] Loading weights took 4.87 seconds
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:11 [gpu_model_runner.py:3549] Model loading took 19.2562 GiB memory and 5.143751 seconds
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:11 [gpu_model_runner.py:4306] Encoder cache will be initialized with a budget of 18622 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:22 [backends.py:655] Using cache directory: /root/.cache/vllm/torch_compile_cache/19b1386448/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:22 [backends.py:715] Dynamo bytecode transform time: 7.25 s
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:22 [backends.py:257] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:29 [backends.py:288] Compiling a graph for dynamic shape takes 6.80 s
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:30 [monitor.py:34] torch.compile takes 14.05 s in total
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:31 [gpu_worker.py:359] Available KV cache memory: 19.17 GiB
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:32 [kv_cache_utils.py:1286] GPU KV cache size: 502,496 tokens
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:32 [kv_cache_utils.py:1291] Maximum concurrency for 25,480 tokens per request: 19.72x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:03<00:00, 16.86it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 23.12it/s]
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:37 [gpu_model_runner.py:4466] Graph capturing finished in 5 secs, took 0.69 GiB
(EngineCore_DP0 pid=1485) INFO 12-10 22:51:37 [core.py:254] init engine (profile, create kv cache, warmup model) took 26.32 seconds
(APIServer pid=1433) INFO 12-10 22:51:40 [api_server.py:1520] Supported tasks: ['generate']
(APIServer pid=1433) INFO 12-10 22:51:40 [api_server.py:1847] Starting vLLM API server 0 on http://0.0.0.0:6006
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:38] Available routes are:
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1433) INFO 12-10 22:51:40 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1433) INFO:     Started server process [1433]
(APIServer pid=1433) INFO:     Waiting for application startup.
(APIServer pid=1433) INFO:     Application startup complete.
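Once "Application startup complete" is printed, the server exposes an OpenAI-compatible API on port 6006. A minimal smoke test from inside the container, assuming curl is available (the /health, /v1/models, and /v1/chat/completions routes and the served model name all appear in the log above; the prompt text and max_tokens value are only illustrative):

# confirm the engine is alive and the model is registered
curl http://localhost:6006/health
curl http://localhost:6006/v1/models

# text-only chat completion against the served model name
curl http://localhost:6006/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "autoglm-phone-9b",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'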
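Because the server was launched with --allowed-local-media-path /, images already stored inside the container can be referenced with a file:// URL in the OpenAI-style multimodal message format, up to the 10 images per prompt permitted by --limit-mm-per-prompt. A sketch of such a request, assuming a hypothetical screenshot at /root/autodl-tmp/screenshot.png (replace the path and the instruction text with your own):

# multimodal request: one local screenshot plus a text instruction
curl http://localhost:6006/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "autoglm-phone-9b",
          "messages": [{
            "role": "user",
            "content": [
              {"type": "image_url",
               "image_url": {"url": "file:///root/autodl-tmp/screenshot.png"}},
              {"type": "text", "text": "Describe the current screen."}
            ]
          }],
          "max_tokens": 256
        }'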