Failed to find a kernel that can implement the WNA16 linear layer

#1 opened by traphix

Hardware: 2 × NVIDIA L40S (compute capability 8.9)

vLLM version: 0.13.0rc2.dev6+g434ac76a7
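
Both GPUs report compute capability 8.9 (Ada), which is below the 9.0 required by the Cutlass W4A8 and Machete kernels mentioned in the error below. A quick check, if useful:

# Prints the compute capability of GPU 0; an L40S should report (8, 9)
python3 -c "import torch; print(torch.cuda.get_device_capability())"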

Launch command

python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen3-next-80b-a3b-instruct \
    --model /data/model-cache/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --host 0.0.0.0 \
    --port 30522

Error log

(APIServer pid=7091) INFO 12-14 17:29:29 [api_server.py:1351] vLLM API server version 0.13.0rc2.dev6+g434ac76a7
(APIServer pid=7091) INFO 12-14 17:29:29 [utils.py:253] non-default args: {'host': '0.0.0.0', 'port': 30522, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/data/model-cache/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound', 'served_model_name': ['qwen3-next-80b-a3b-instruct'], 'tensor_parallel_size': 2, 'enable_expert_parallel': True}
(APIServer pid=7091) INFO 12-14 17:29:29 [model.py:629] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=7091) INFO 12-14 17:29:29 [model.py:1755] Using max model len 262144
(APIServer pid=7091) INFO 12-14 17:29:29 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=7091) INFO 12-14 17:29:29 [config.py:310] Disabling cascade attention since it is not supported for hybrid models.
(APIServer pid=7091) INFO 12-14 17:29:30 [config.py:437] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=7091) INFO 12-14 17:29:30 [config.py:461] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=7248) INFO 12-14 17:29:38 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev6+g434ac76a7) with config: model='/data/model-cache/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound', speculative_config=None, tokenizer='/data/model-cache/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=qwen3-next-80b-a3b-instruct, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=7248) WARNING 12-14 17:29:38 [multiproc_executor.py:880] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
/usr/local/lib/python3.12/dist-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
/usr/local/lib/python3.12/dist-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
  _C._set_float32_matmul_precision(precision)
INFO 12-14 17:29:47 [parallel_state.py:1203] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:46489 backend=nccl
INFO 12-14 17:29:47 [parallel_state.py:1203] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:46489 backend=nccl
[rank1]:[W1214 17:29:52.854884589 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank0]:[W1214 17:29:52.875819062 ProcessGroupGloo.cpp:516] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 12-14 17:29:53 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 12-14 17:29:53 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
WARNING 12-14 17:29:53 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
INFO 12-14 17:29:55 [parallel_state.py:1411] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
INFO 12-14 17:29:55 [parallel_state.py:1411] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(Worker_TP0_EP0 pid=7399) INFO 12-14 17:29:56 [gpu_model_runner.py:3551] Starting to load model /data/model-cache/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound...
(Worker_TP0_EP0 pid=7399) INFO 12-14 17:29:56 [gptq_marlin.py:376] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1_EP1 pid=7400) INFO 12-14 17:29:56 [gptq_marlin.py:376] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750] WorkerProc failed to start.
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750] Traceback (most recent call last):
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 722, in worker_main
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 562, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.worker.load_model()
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 284, in load_model
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3568, in load_model
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.model = model_loader.load_model(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     model = initialize_model(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]             ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1197, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.model = Qwen3NextModel(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                  ^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 291, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     old_init(self, **kwargs)
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 983, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                                                     ^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 977, in get_layer
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     return Qwen3NextDecoderLayer(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 832, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.linear_attn = Qwen3NextGatedDeltaNet(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                        ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 297, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.in_proj_ba = ColumnParallelLinear(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                       ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 484, in __init__
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     self.quant_method.create_weights(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 373, in create_weights
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     kernel_type = choose_mp_linear_kernel(mp_linear_kernel_config)
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/__init__.py", line 106, in choose_mp_linear_kernel
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]     raise ValueError(
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750] CutlassW4A8LinearKernel requires capability 90, current compute  capability is 89
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750] MacheteLinearKernel requires capability 90, current compute  capability is 89
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 32 is not divisible by  min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  Dynamic4bitLinearKernel cannot implement due to: Only CPU is supported
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  BitBLASLinearKernel cannot implement due to: bitblas is not installed. Please install bitblas by running `pip install bitblas>=0.1.0`
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
(Worker_TP0_EP0 pid=7399) ERROR 12-14 17:29:57 [multiproc_executor.py:750]  XPUwNa16LinearKernel cannot implement due to: IPEX wNa16 only supported on XPU/CPU devices
(Worker_TP0_EP0 pid=7399) INFO 12-14 17:29:57 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP1_EP1 pid=7400) INFO 12-14 17:29:57 [multiproc_executor.py:709] Parent process exited, terminating worker
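
Per the Marlin reason in the log, the in_proj_ba projection of the gated delta net ends up with output_size_per_partition = 32 under tensor_parallel_size=2, which is not divisible by Marlin's min_thread_n = 64. The error text itself suggests two workarounds; a sketch of the retries I would attempt (untested, same model path and port assumed):

# Option A: keep TP=2 but force the plain GPTQ kernels instead of Marlin,
# as suggested by the error message itself
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen3-next-80b-a3b-instruct \
    --model /data/model-cache/Qwen3-Next-80B-A3B-Instruct-int4-AutoRound \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --quantization gptq \
    --host 0.0.0.0 \
    --port 30522

# Option B: drop to --tensor-parallel-size 1 so the projection is not split
# below Marlin's minimum (only viable if the int4 weights fit on a single L40S)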
