agray3 committed on
Commit
4816f6a
·
1 Parent(s): 1a5606e

Avoid unnecessarily disabling CUDA graphs (llama/7302)


As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts.
This fixes the issue by preventing the consecutive-update counter from incrementing unnecessarily
for tokens in which CUDA graphs are disabled due to batch size > 1.

Files changed (1)
  1. ggml-cuda.cu +1 -1
ggml-cuda.cu CHANGED
@@ -2558,7 +2558,7 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
     }
 
     // Disable CUDA graphs (from the next token) if the use-case is demanding too many consecutive graph updates.
-    if (cuda_graph_update_required) {
+    if (use_cuda_graph && cuda_graph_update_required) {
         cuda_ctx->cuda_graph->number_consecutive_updates++;
     } else {
         cuda_ctx->cuda_graph->number_consecutive_updates = 0;
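
The effect of the added `use_cuda_graph` guard can be sketched as follows. This is a minimal, self-contained model of the counter logic, not the actual ggml-cuda code: the `graph_state` struct, the `update_counter` helper, and the threshold value of 4 are illustrative assumptions; only the field and variable names mirror the diff.

```cpp
#include <cassert>

// Hypothetical simplified model of the consecutive-update tracking in
// ggml_backend_cuda_graph_compute (names mirror the diff).
struct graph_state {
    int  number_consecutive_updates = 0;
    bool graphs_disabled            = false;
};

void update_counter(graph_state &st, bool use_cuda_graph,
                    bool cuda_graph_update_required) {
    // With the fix, a token only counts toward "consecutive updates" when a
    // CUDA graph is actually in use for it. Before the fix, prompt tokens with
    // batch size > 1 (use_cuda_graph == false) still incremented the counter,
    // so a long prompt could permanently disable graphs.
    if (use_cuda_graph && cuda_graph_update_required) {
        st.number_consecutive_updates++;
    } else {
        st.number_consecutive_updates = 0;
    }

    const int max_consecutive_updates = 4; // illustrative threshold
    if (st.number_consecutive_updates >= max_consecutive_updates) {
        st.graphs_disabled = true;
    }
}
```

Feeding a run of batched prompt tokens through this model leaves the counter at zero, so graphs stay enabled for the subsequent single-token generation phase; only genuinely consecutive graph updates trip the threshold.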