Commit 5a97aba (unverified) · Parent(s): 01169e0
Committed by OccamRazor, SlyEcho, Concedo, slaren, ggerganov

ggml : add Vulkan backend (llama/2059)


* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* Add CMake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitIdle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <[email protected]>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* Refactor multi-buffer handling

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* Propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <[email protected]>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <[email protected]>
Co-authored-by: Concedo <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

Files changed (9)
  1. ggml-alloc.c +82 -24
  2. ggml-backend-impl.h +6 -0
  3. ggml-backend.c +102 -2
  4. ggml-backend.h +3 -0
  5. ggml-cuda.cu +3 -0
  6. ggml-metal.m +1 -0
  7. ggml-opencl.cpp +2 -0
  8. ggml.c +42 -3
  9. ggml.h +1 -0
ggml-alloc.c CHANGED
@@ -778,38 +778,26 @@ size_t ggml_allocr_alloc_graph(ggml_allocr_t alloc, struct ggml_cgraph * graph)
 }
 
 // utils
-ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
-    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
-
-    size_t alignment = ggml_backend_buft_get_alignment(buft);
-
-    size_t nbytes = 0;
-    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
-        if (t->data == NULL && t->view_src == NULL) {
-            nbytes += GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
-        }
-    }
-
-    if (nbytes == 0) {
-        // all the tensors in the context are already allocated
-#ifndef NDEBUG
-        fprintf(stderr, "%s: all tensors in the context are already allocated\n", __func__);
-#endif
-        return NULL;
-    }
 
-    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, nbytes);
+static bool alloc_tensor_range(struct ggml_context * ctx,
+        struct ggml_tensor * first, struct ggml_tensor * last,
+        ggml_backend_buffer_type_t buft, size_t size,
+        ggml_backend_buffer_t ** buffers, size_t * n_buffers) {
+    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, size);
     if (buffer == NULL) {
-        // failed to allocate buffer
 #ifndef NDEBUG
-        fprintf(stderr, "%s: failed to allocate buffer\n", __func__);
+        fprintf(stderr, "%s: failed to allocate %s buffer of size %zu\n", __func__, ggml_backend_buft_name(buft), size);
 #endif
-        return NULL;
+        for (size_t i = 0; i < *n_buffers; i++) {
+            ggml_backend_buffer_free(*buffers[i]);
+        }
+        free(buffers);
+        return false;
     }
 
     ggml_tallocr_t tallocr = ggml_tallocr_new_from_buffer(buffer);
 
-    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+    for (struct ggml_tensor * t = first; t != last; t = ggml_get_next_tensor(ctx, t)) {
         if (t->data == NULL) {
             if (t->view_src == NULL) {
                 ggml_tallocr_alloc(tallocr, t);
@@ -826,6 +814,76 @@ ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_conte
 
     ggml_tallocr_free(tallocr);
 
+    *buffers = realloc(*buffers, sizeof(ggml_backend_buffer_t) * (*n_buffers + 1));
+    (*buffers)[(*n_buffers)++] = buffer;
+
+    return true;
+}
+
+ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
+    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
+
+    size_t alignment = ggml_backend_buft_get_alignment(buft);
+    size_t max_size  = ggml_backend_buft_get_max_size(buft);
+
+    ggml_backend_buffer_t * buffers = NULL;
+    size_t n_buffers = 0;
+
+    size_t cur_buf_size = 0;
+    struct ggml_tensor * first = ggml_get_first_tensor(ctx);
+    for (struct ggml_tensor * t = first; t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+        size_t this_size = 0;
+        if (t->data == NULL && t->view_src == NULL) {
+            this_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
+        }
+
+        if (this_size > max_size) {
+            // tensor is too large to fit in a single buffer
+            fprintf(stderr, "%s: tensor %s is too large to fit in a %s buffer (tensor size: %zu, max buffer size: %zu)\n",
+                    __func__, t->name,
+                    ggml_backend_buft_name(buft),
+                    this_size, max_size);
+            for (size_t i = 0; i < n_buffers; i++) {
+                ggml_backend_buffer_free(buffers[i]);
+            }
+            free(buffers);
+            return NULL;
+        }
+
+        if ((cur_buf_size + this_size) > max_size) {
+            // allocate tensors in the current buffer
+            if (!alloc_tensor_range(ctx, first, t, buft, cur_buf_size, &buffers, &n_buffers)) {
+                return NULL;
+            }
+            first = t;
+            cur_buf_size = this_size;
+        } else {
+            cur_buf_size += this_size;
+        }
+    }
+
+    // allocate remaining tensors
+    if (cur_buf_size > 0) {
+        if (!alloc_tensor_range(ctx, first, NULL, buft, cur_buf_size, &buffers, &n_buffers)) {
+            return NULL;
+        }
+    }
+
+    if (n_buffers == 0) {
+        // all the tensors in the context are already allocated
+#ifndef NDEBUG
+        fprintf(stderr, "%s: all tensors in the context are already allocated\n", __func__);
+#endif
+        return NULL;
+    }
+
+    ggml_backend_buffer_t buffer;
+    if (n_buffers == 1) {
+        buffer = buffers[0];
+    } else {
+        buffer = ggml_backend_multi_buffer_alloc_buffer(buffers, n_buffers);
+    }
+    free(buffers);
     return buffer;
 }
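The reworked allocator accumulates padded tensor sizes until the buffer type's max_size would be exceeded, hands each range to alloc_tensor_range, and wraps split allocations in a multi-buffer; a single range still comes back as a plain buffer. A minimal usage sketch (not part of this diff; it assumes the CPU buffer type and made-up tensor shapes):

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch: place all tensors of a no_alloc context into backend buffers.
// The allocator only splits across several buffers if the buffer type
// reports a get_max_size smaller than the total; the CPU type reports SIZE_MAX.
static ggml_backend_buffer_t alloc_ctx_example(struct ggml_context ** out_ctx) {
    struct ggml_init_params params = {
        /* .mem_size   = */ 8 * ggml_tensor_overhead(),
        /* .mem_buffer = */ NULL,
        /* .no_alloc   = */ true, // required: tensor data is allocated by the backend
    };
    struct ggml_context * ctx = ggml_init(params);

    ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096); // hypothetical weight
    ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);       // hypothetical bias

    ggml_backend_buffer_t buf =
        ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_cpu_buffer_type());

    *out_ctx = ctx; // keep the context alive while its tensors are in use
    return buf;     // release later with ggml_backend_buffer_free()
}
```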
ggml-backend-impl.h CHANGED
@@ -19,6 +19,7 @@ extern "C" {
         const char *          (*GGML_CALL get_name)        (ggml_backend_buffer_type_t buft);
         ggml_backend_buffer_t (*GGML_CALL alloc_buffer)    (ggml_backend_buffer_type_t buft, size_t size);
         size_t                (*GGML_CALL get_alignment)   (ggml_backend_buffer_type_t buft); // tensor alignment
+        size_t                (*GGML_CALL get_max_size)    (ggml_backend_buffer_type_t buft); // allocation max size
         size_t                (*GGML_CALL get_alloc_size)  (ggml_backend_buffer_type_t buft, const struct ggml_tensor * tensor); // data size needed to allocate the tensor, including padding
         bool                  (*GGML_CALL supports_backend)(ggml_backend_buffer_type_t buft, ggml_backend_t backend); // check if the buffer type is usable by the backend
         // check if tensor data is in host memory
@@ -63,6 +64,11 @@ extern "C" {
     // do not use directly, use ggml_backend_tensor_copy instead
     bool ggml_backend_buffer_copy_tensor(const struct ggml_tensor * src, struct ggml_tensor * dst);
 
+    // buffer that contains a collection of buffers
+    GGML_CALL ggml_backend_buffer_t ggml_backend_multi_buffer_alloc_buffer(ggml_backend_buffer_t * buffers, size_t n_buffers);
+    GGML_CALL bool                  ggml_backend_buffer_is_multi_buffer(ggml_backend_buffer_t buffer);
+    GGML_CALL void                  ggml_backend_multi_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage);
+
     //
     // Backend
     //
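The new get_max_size hook is optional; leaving it NULL means "no limit" (SIZE_MAX, see ggml-backend.c below). A hypothetical backend could fill it in as sketched here; example_buffer_type_get_max_size and the 2 GiB figure are placeholders, not part of this commit:

```c
// Sketch of a backend-side implementation of the new hook. The Vulkan backend
// in this change derives its limit from min(maxMemoryAllocationSize, maxBufferSize);
// here a fixed placeholder value stands in for the device query.
static size_t example_buffer_type_get_max_size(ggml_backend_buffer_type_t buft) {
    (void) buft;                                // unused in this sketch
    return (size_t) 2u * 1024u * 1024u * 1024u; // placeholder: 2 GiB
}

// wired into the buffer type's interface table:
//     /* .get_max_size = */ example_buffer_type_get_max_size, // or NULL => SIZE_MAX
```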
ggml-backend.c CHANGED
@@ -27,6 +27,14 @@ size_t ggml_backend_buft_get_alignment(ggml_backend_buffer_type_t buft) {
     return buft->iface.get_alignment(buft);
 }
 
+size_t ggml_backend_buft_get_max_size(ggml_backend_buffer_type_t buft) {
+    // get_max_size is optional, defaults to SIZE_MAX
+    if (buft->iface.get_max_size) {
+        return buft->iface.get_max_size(buft);
+    }
+    return SIZE_MAX;
+}
+
 GGML_CALL size_t ggml_backend_buft_get_alloc_size(ggml_backend_buffer_type_t buft, struct ggml_tensor * tensor) {
     // get_alloc_size is optional, defaults to ggml_nbytes
     if (buft->iface.get_alloc_size) {
@@ -57,8 +65,6 @@ GGML_CALL ggml_backend_buffer_t ggml_backend_buffer_init(
         size_t size) {
     ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));
 
-    GGML_ASSERT(iface.get_base != NULL);
-
     (*buffer) = (struct ggml_backend_buffer) {
         /* .interface = */ iface,
         /* .buft      = */ buft,
@@ -108,6 +114,10 @@ size_t ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer) {
     return ggml_backend_buft_get_alignment(ggml_backend_buffer_get_type(buffer));
 }
 
+size_t ggml_backend_buffer_get_max_size(ggml_backend_buffer_t buffer) {
+    return ggml_backend_buft_get_max_size(ggml_backend_buffer_get_type(buffer));
+}
+
 size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
     return ggml_backend_buft_get_alloc_size(ggml_backend_buffer_get_type(buffer), tensor);
 }
@@ -122,6 +132,11 @@ bool ggml_backend_buffer_is_host(ggml_backend_buffer_t buffer) {
 
 void ggml_backend_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage) {
     buffer->usage = usage;
+
+    // FIXME: add a generic callback to the buffer interface
+    if (ggml_backend_buffer_is_multi_buffer(buffer)) {
+        ggml_backend_multi_buffer_set_usage(buffer, usage);
+    }
 }
 
 ggml_backend_buffer_type_t ggml_backend_buffer_get_type(ggml_backend_buffer_t buffer) {
@@ -171,6 +186,10 @@ size_t ggml_backend_get_alignment(ggml_backend_t backend) {
     return ggml_backend_buft_get_alignment(ggml_backend_get_default_buffer_type(backend));
 }
 
+size_t ggml_backend_get_max_size(ggml_backend_t backend) {
+    return ggml_backend_buft_get_max_size(ggml_backend_get_default_buffer_type(backend));
+}
+
 void ggml_backend_tensor_set_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
     GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
     GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
@@ -349,6 +368,11 @@ GGML_CALL static void ggml_backend_registry_init(void) {
     extern GGML_CALL ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void);
     ggml_backend_register("Metal", ggml_backend_reg_metal_init, ggml_backend_metal_buffer_type(), NULL);
 #endif
+
+#ifdef GGML_USE_VULKAN
+    extern GGML_CALL int ggml_backend_vk_reg_devices(void);
+    ggml_backend_vk_reg_devices();
+#endif
 }
 
 GGML_CALL void ggml_backend_register(const char * name, ggml_backend_init_fn init_fn, ggml_backend_buffer_type_t default_buffer_type, void * user_data) {
@@ -552,6 +576,7 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_cpu_buffer_type(void) {
            /* .get_name         = */ ggml_backend_cpu_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_cpu_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
+           /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
            /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
            /* .supports_backend = */ ggml_backend_cpu_buffer_type_supports_backend,
            /* .is_host          = */ ggml_backend_cpu_buffer_type_is_host,
@@ -607,6 +632,7 @@ ggml_backend_buffer_type_t ggml_backend_cpu_hbm_buffer_type(void) {
            /* .get_name         = */ ggml_backend_cpu_hbm_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_cpu_hbm_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
+           /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
            /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
            /* .supports_backend = */ ggml_backend_cpu_buffer_type_supports_backend,
            /* .is_host          = */ ggml_backend_cpu_buffer_type_is_host,
@@ -763,6 +789,80 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cpu_init(const char * params, v
     GGML_UNUSED(user_data);
 }
 
+// multi-buffer buffer
+
+struct ggml_backend_multi_buffer_context {
+    ggml_backend_buffer_t * buffers;
+    size_t n_buffers;
+};
+
+typedef struct ggml_backend_multi_buffer_context * ggml_backend_multi_buffer_context_t;
+
+GGML_CALL static const char * ggml_backend_multi_buffer_get_name(ggml_backend_buffer_t buffer) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+
+    return ctx->buffers[0]->iface.get_name(ctx->buffers[0]);
+}
+
+GGML_CALL static void ggml_backend_multi_buffer_free_buffer(ggml_backend_buffer_t buffer) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+    for (size_t i = 0; i < ctx->n_buffers; i++) {
+        ggml_backend_buffer_free(ctx->buffers[i]);
+    }
+
+    free(ctx->buffers);
+    free(ctx);
+}
+
+GGML_CALL static void ggml_backend_multi_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+    for (size_t i = 0; i < ctx->n_buffers; i++) {
+        ggml_backend_buffer_clear(ctx->buffers[i], value);
+    }
+}
+
+static struct ggml_backend_buffer_i ggml_backend_multi_buffer_context_interface(void) {
+    static struct ggml_backend_buffer_i multi_backend_buffer_i = {
+        /* .get_name        = */ ggml_backend_multi_buffer_get_name,
+        /* .free_buffer     = */ ggml_backend_multi_buffer_free_buffer,
+        /* .get_base        = */ NULL,
+        /* .init_tensor     = */ NULL,
+        /* .set_tensor      = */ NULL,
+        /* .get_tensor      = */ NULL,
+        /* .cpy_tensor      = */ NULL,
+        /* .clear           = */ ggml_backend_multi_buffer_clear,
+        /* .reset           = */ NULL,
+    };
+
+    return multi_backend_buffer_i;
+}
+
+GGML_CALL ggml_backend_buffer_t ggml_backend_multi_buffer_alloc_buffer(ggml_backend_buffer_t * buffers, size_t n_buffers) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) malloc(sizeof(struct ggml_backend_multi_buffer_context));
+    ctx->n_buffers = n_buffers;
+    ctx->buffers = (ggml_backend_buffer_t *) malloc(n_buffers * sizeof(ggml_backend_buffer_t));
+
+    size_t total_size = 0;
+    for (size_t i = 0; i < n_buffers; i++) {
+        ctx->buffers[i] = buffers[i];
+        total_size += ggml_backend_buffer_get_size(buffers[i]);
+    }
+
+    return ggml_backend_buffer_init(buffers[0]->buft, ggml_backend_multi_buffer_context_interface(), ctx, total_size);
+}
+
+GGML_CALL bool ggml_backend_buffer_is_multi_buffer(ggml_backend_buffer_t buffer) {
+    return buffer->iface.get_name == ggml_backend_multi_buffer_get_name;
+}
+
+GGML_CALL void ggml_backend_multi_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage) {
+    GGML_ASSERT(ggml_backend_buffer_is_multi_buffer(buffer));
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+    for (size_t i = 0; i < ctx->n_buffers; i++) {
+        ggml_backend_buffer_set_usage(ctx->buffers[i], usage);
+    }
+}
+
 
 // scheduler
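For reference, the new query functions compose as in this small sketch; print_alloc_limits is a hypothetical caller, and buffer types that leave get_max_size NULL report SIZE_MAX:

```c
#include <stdio.h>
#include "ggml-backend.h"

// Sketch: report the allocation limits of a backend's default buffer type.
static void print_alloc_limits(ggml_backend_t backend) {
    ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);
    printf("%s: alignment = %zu, max allocation = %zu\n",
           ggml_backend_buft_name(buft),
           ggml_backend_buft_get_alignment(buft),
           ggml_backend_buft_get_max_size(buft));
}
```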
ggml-backend.h CHANGED
@@ -20,6 +20,7 @@ extern "C" {
     GGML_API           const char *          ggml_backend_buft_name            (ggml_backend_buffer_type_t buft);
     GGML_API GGML_CALL ggml_backend_buffer_t ggml_backend_buft_alloc_buffer    (ggml_backend_buffer_type_t buft, size_t size);
     GGML_API           size_t                ggml_backend_buft_get_alignment   (ggml_backend_buffer_type_t buft);
+    GGML_API           size_t                ggml_backend_buft_get_max_size    (ggml_backend_buffer_type_t buft);
     GGML_API GGML_CALL size_t                ggml_backend_buft_get_alloc_size  (ggml_backend_buffer_type_t buft, struct ggml_tensor * tensor);
     GGML_API           bool                  ggml_backend_buft_supports_backend(ggml_backend_buffer_type_t buft, ggml_backend_t backend);
     GGML_API           bool                  ggml_backend_buft_is_host         (ggml_backend_buffer_type_t buft);
@@ -36,6 +37,7 @@ extern "C" {
     GGML_API           size_t      ggml_backend_buffer_get_size      (ggml_backend_buffer_t buffer);
     GGML_API GGML_CALL void        ggml_backend_buffer_init_tensor   (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
     GGML_API           size_t      ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer);
+    GGML_API           size_t      ggml_backend_buffer_get_max_size  (ggml_backend_buffer_t buffer);
     GGML_API           size_t      ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
     GGML_API           void        ggml_backend_buffer_clear         (ggml_backend_buffer_t buffer, uint8_t value);
     GGML_API           bool        ggml_backend_buffer_is_host       (ggml_backend_buffer_t buffer);
@@ -54,6 +56,7 @@ extern "C" {
     GGML_API ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(ggml_backend_t backend);
     GGML_API ggml_backend_buffer_t      ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size);
     GGML_API size_t                     ggml_backend_get_alignment(ggml_backend_t backend);
+    GGML_API size_t                     ggml_backend_get_max_size(ggml_backend_t backend);
 
     GGML_API void ggml_backend_tensor_set_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
     GGML_API void ggml_backend_tensor_get_async(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
ggml-cuda.cu CHANGED
@@ -10440,6 +10440,7 @@ static ggml_backend_buffer_type_i ggml_backend_cuda_buffer_type_interface = {
     /* .get_name         = */ ggml_backend_cuda_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_cuda_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_cuda_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
     /* .get_alloc_size   = */ ggml_backend_cuda_buffer_type_get_alloc_size,
     /* .supports_backend = */ ggml_backend_cuda_buffer_type_supports_backend,
     /* .is_host          = */ NULL,
@@ -10715,6 +10716,7 @@ static ggml_backend_buffer_type_i ggml_backend_cuda_split_buffer_type_interface
     /* .get_name         = */ ggml_backend_cuda_split_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_cuda_split_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_cuda_split_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
     /* .get_alloc_size   = */ ggml_backend_cuda_split_buffer_type_get_alloc_size,
     /* .supports_backend = */ ggml_backend_cuda_split_buffer_type_supports_backend,
     /* .is_host          = */ ggml_backend_cuda_split_buffer_type_is_host,
@@ -10794,6 +10796,7 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type() {
     /* .get_name         = */ ggml_backend_cuda_host_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_cuda_host_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_cpu_buffer_type()->iface.get_alignment,
+    /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
     /* .get_alloc_size   = */ ggml_backend_cpu_buffer_type()->iface.get_alloc_size,
     /* .supports_backend = */ ggml_backend_cpu_buffer_type()->iface.supports_backend,
     /* .is_host          = */ ggml_backend_cpu_buffer_type()->iface.is_host,
ggml-metal.m CHANGED
@@ -2400,6 +2400,7 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void) {
            /* .get_name         = */ ggml_backend_metal_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_metal_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_metal_buffer_type_get_alignment,
+           /* .get_max_size     = */ NULL, // TODO: return device.maxBufferLength
            /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
            /* .supports_backend = */ ggml_backend_metal_buffer_type_supports_backend,
            /* .is_host          = */ ggml_backend_metal_buffer_type_is_host,
ggml-opencl.cpp CHANGED
@@ -2136,6 +2136,7 @@ static ggml_backend_buffer_type_i ggml_backend_opencl_buffer_type_interface = {
     /* .get_name         = */ ggml_backend_opencl_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_opencl_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_opencl_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // TODO: return from device info
     /* .get_alloc_size   = */ NULL,
     /* .supports_backend = */ ggml_backend_opencl_buffer_type_supports_backend,
     /* .is_host          = */ NULL,
@@ -2192,6 +2193,7 @@ ggml_backend_buffer_type_t ggml_backend_opencl_host_buffer_type() {
     /* .get_name         = */ ggml_backend_opencl_host_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_opencl_host_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_cpu_buffer_type()->iface.get_alignment,
+    /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
     /* .get_alloc_size   = */ ggml_backend_cpu_buffer_type()->iface.get_alloc_size,
     /* .supports_backend = */ ggml_backend_cpu_buffer_type()->iface.supports_backend,
     /* .is_host          = */ ggml_backend_cpu_buffer_type()->iface.is_host,
ggml.c CHANGED
@@ -248,6 +248,8 @@ inline static void * ggml_aligned_malloc(size_t size) {
 #include "ggml-cuda.h"
 #elif defined(GGML_USE_CLBLAST)
 #include "ggml-opencl.h"
+#elif defined(GGML_USE_VULKAN)
+#include "ggml-vulkan.h"
 #elif defined(GGML_USE_SYCL)
 #include "ggml-sycl.h"
 #endif
@@ -2295,6 +2297,8 @@ struct ggml_context * ggml_init(struct ggml_init_params params) {
         ggml_init_cublas();
 #elif defined(GGML_USE_CLBLAST)
         ggml_cl_init();
+#elif defined(GGML_USE_VULKAN)
+        ggml_vk_init();
 #elif defined(GGML_USE_SYCL)
         ggml_init_sycl();
 #endif
@@ -8019,7 +8023,7 @@ static void ggml_compute_forward_mul_f32(
     const int ith = params->ith;
     const int nth = params->nth;
 
-#ifdef GGML_USE_CLBLAST
+#if defined(GGML_USE_CLBLAST)
     if (src1->backend == GGML_BACKEND_GPU) {
         // TODO: OpenCL kernel support full broadcast
         GGML_ASSERT(ggml_can_repeat_rows(src1, src0));
@@ -14703,6 +14707,18 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
     }
     GGML_ASSERT(tensor->src[0] == NULL || tensor->src[0]->backend == GGML_BACKEND_CPU);
     GGML_ASSERT(tensor->src[1] == NULL || tensor->src[1]->backend == GGML_BACKEND_CPU);
+#elif defined(GGML_USE_VULKAN)
+    const bool skip_cpu = ggml_vk_compute_forward(params, tensor);
+#ifdef GGML_VULKAN_CHECK_RESULTS
+    if (skip_cpu) {
+        ggml_vk_check_results_1(params, tensor);
+    }
+#endif
+    if (skip_cpu) {
+        return;
+    }
+    GGML_ASSERT(tensor->src[0] == NULL || tensor->src[0]->backend == GGML_BACKEND_CPU);
+    GGML_ASSERT(tensor->src[1] == NULL || tensor->src[1]->backend == GGML_BACKEND_CPU);
 #endif // GGML_USE_CUBLAS
 
 #ifdef GGML_USE_SYCL
@@ -17105,6 +17121,17 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
         }
     }
 
+#ifdef GGML_USE_VULKAN
+    for (int i = 0; i < cgraph->n_nodes; i++) {
+        ggml_vk_preallocate_buffers_graph(cgraph->nodes[i]);
+    }
+    ggml_vk_preallocate_buffers();
+
+    for (int i = 0; i < cgraph->n_nodes; i++) {
+        ggml_vk_build_graph(cgraph->nodes[i], i == cgraph->n_nodes - 1);
+    }
+#endif
+
     const int n_threads = cplan->n_threads;
 
     struct ggml_compute_state_shared state_shared = {
@@ -17156,6 +17183,10 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
         }
     }
 
+#ifdef GGML_USE_VULKAN
+    ggml_vk_graph_cleanup();
+#endif
+
     // performance stats (graph)
     {
         int64_t perf_cycles_cur = ggml_perf_cycles() - perf_start_cycles;
@@ -20290,7 +20321,7 @@ int ggml_cpu_has_wasm_simd(void) {
 }
 
 int ggml_cpu_has_blas(void) {
-#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST) || defined(GGML_USE_SYCL)
+#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_VULKAN) || defined(GGML_USE_CLBLAST) || defined(GGML_USE_SYCL)
     return 1;
 #else
     return 0;
@@ -20313,6 +20344,14 @@ int ggml_cpu_has_clblast(void) {
 #endif
 }
 
+int ggml_cpu_has_vulkan(void) {
+#if defined(GGML_USE_VULKAN)
+    return 1;
+#else
+    return 0;
+#endif
+}
+
 int ggml_cpu_has_sycl(void) {
 #if defined(GGML_USE_SYCL)
     return 1;
@@ -20322,7 +20361,7 @@ int ggml_cpu_has_sycl(void) {
 }
 
 int ggml_cpu_has_gpublas(void) {
-    return ggml_cpu_has_cublas() || ggml_cpu_has_clblast() || ggml_cpu_has_sycl();
+    return ggml_cpu_has_cublas() || ggml_cpu_has_clblast() || ggml_cpu_has_vulkan() || ggml_cpu_has_sycl();
 }
 
 int ggml_cpu_has_sse3(void) {
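The new ggml_cpu_has_vulkan() flag slots into the existing compile-time feature report; a short sketch of how a caller might use it (print_backend_flags is a hypothetical helper, not part of this commit):

```c
#include <stdio.h>
#include "ggml.h"

// Sketch: print which GPU backends this build of ggml was compiled with.
static void print_backend_flags(void) {
    printf("CUBLAS  = %d\n", ggml_cpu_has_cublas());
    printf("CLBLAST = %d\n", ggml_cpu_has_clblast());
    printf("VULKAN  = %d\n", ggml_cpu_has_vulkan());
    printf("GPUBLAS = %d\n", ggml_cpu_has_gpublas());
}
```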
ggml.h CHANGED
@@ -2263,6 +2263,7 @@ extern "C" {
     GGML_API int ggml_cpu_has_blas       (void);
     GGML_API int ggml_cpu_has_cublas     (void);
     GGML_API int ggml_cpu_has_clblast    (void);
+    GGML_API int ggml_cpu_has_vulkan     (void);
     GGML_API int ggml_cpu_has_gpublas    (void);
     GGML_API int ggml_cpu_has_sse3       (void);
     GGML_API int ggml_cpu_has_ssse3      (void);