ggml : add Vulkan backend (llama/2059)
* Vulkan loader code
* Fix matmul kernel, continue implementation
* Continue implementation
* Vulkan memory management
* Vulkan development
* Matmul call
* Add aligned malloc and free for VMA
* Continue implementation
* First matmul success
* GEMM Kernel optimization
* 1D Blocktiling
* 2D Blocktiling
* Write coalescing
* Continue vulkan implementation and optimization
* First FP16 attempt, disabled for now
* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel
* Enable device extensions properly, restore fp16 matmul op
* Fix mulmat_f16
* Output FP32 in fp16 matmul shader
* Fix f16_to_f32 kernel
* dequant_q4_0 kernel
* Add VMA library
* Avoid requesting dedicated memory, VMA can decide that by itself
* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly
* add cmake commands
* Add 2d write operation, profiling code
* Fix 2d write
* Fix queue selection for AMD RADV
* Fix trailing whitespace in vk_mem_alloc.h
* Add WIP warp tile mat mul shaders
* Disable glslc optimization
* Disable glslc optimization for CMake
* Optimize warptile matmul shader, replace blocktile with it
* Add split-k optimization for small matrix multiplication
Use semaphores for synchronization instead of fences or waitIdle
Rework async write/read for synchronization
* Fix validation errors, improve compatibility with AMD GPUs
* Rework command buffer handling
* Variable matmul kernel using specialization constants
* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints
* Reuse semaphores
* Handle stage flags during command buffer submission properly
* Increase matmul test runs for consistent results
* Fix F32 matmul
* Add vectorized loading and zeropadding for matrix multiplication
* Use pinned memory for f16 preprocessing
* Don't force aligned matmul
* Don't free before queue done
* Replace VMA library with native Vulkan buffer management
* Basic offloading support with mul_f32 and dmmv for q4_0
* Run glslc commands in parallel
* Unroll loops in dmmv shader
* Reduce usage of waitIdle
* Reuse pinned allocation for f16 conversion
* Handle devices with only a single queue
* Fix trailing whitespace in CMakeLists.txt
* Allow parallel execution of kernels, parallelize third and fourth dimension calls
* Add fallback for devices only supporting one DescriptorSet per DescriptorPool
* Move to graph function similar to CUDA implementation
* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function
* Add F32 dmmv shaders
* Batch submissions
* Add .spv to gitignore
* Split off matrix vector multiplication for separate optimization
* Use single command buffer for matrix vector multiplication ops
* Reduce overhead of mul_f32 calls by using a single command buffer
* Add submission batching to mul_f32
* Fix tests
* Add missing barrier
* Add further missing barrier
* Add further ops
* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions
* Remove unnecessary cblas link
* Fix descriptor set pre-allocation assert
* Add runtime shader compilation, start transferring shaders to this approach
* Transfer remaining shaders to header and compile on runtime
* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16
* Add support for q4_1, q5_0, q5_1 and q8_0
* Remove unnecessary scalar layout extension
* Parse graph early to pre-record command buffers
* Add q6_k support
* Add multi-submit for command buffers
* Fix q6_k dequant shader for AMD
* Fix q6_k for GPUs without fp16 support
* Simplify q6_k fp16 fix
* Minor fixes
* Fix wg_denom of m-mulmat shaders
* Add Python-based Vulkan shader generator
* Replace shaderc dependency with precompiled shaders
Fix python script to generate shaders
* Clean up code
* Fix shader generator script Windows compatibility
Co-authored-by: Concedo <[email protected]>
* Close file before deletion
* Fix vulkan shader fp32 name
* Add q2_k and q3_k support
Add validation check to compare shader results to cpu results
* Add q4_k support
* Add q5_k support
* Bake SPIR-V bytecode into the library instead of loading shaders from file
* Switch to signal semaphores for flexibility
Prepare broadcasting support for mul mat
* Finish broadcasting mul mat support for GQA
* Clean up unused functions
Add repeat op
* Add further ops, not yet enabled. Improve semaphore code
* Reduce number of used semaphores by utilizing timelines more properly
* Remove queue information
* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations
* Add Vulkan to llama-bench
* Remove cblas dependency
* Fix matmul k-split bug
* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader
* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug
* Fix issues with float16 overflows in shaders
* Fix issues with older Vulkan headers on Ubuntu 22.04
* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers
* Implement further ops, rework op_f32 calls, fix bugs
* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code
* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders
* Merge upstream changes, fix conflicts, adapt soft_max op
* Fix Python and shader header format
* Free model gpu buffers on exit
* Use single queue per device to simplify code
* Add matmul shader support for running multiple calculations in parallel
* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
* Fix missing event cast
* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
* Fix warning about empty C function parameters
* Fix compiler warnings
* Properly implement Vulkan backend buffer handling
* Fix oversized host staging buffers
* Simplify barrier synchronization calls
* Fix gcc warnings
* Implement max_size for backend buffer types to limit the size of a single allocation
* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
* refactor multi buf
* Disable unsupported ops to fix tests
* Check for maintenance4 support before using it
* Handle devices with only a single queue
* Fix single queue logic
* propagate buffer usage in multi buffers
* Implement rope_neox op
* Cleanup header and other files
* Simplify gpu_extras by removing events and putting staging memcpys into contexts
* Move queue into context
Add not-yet-enabled async backend ops
* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization
* Add get_max_size to SYCL backend.
Co-authored-by: Georgi Gerganov <[email protected]>
* llama : fix trailing whitespace
---------
Co-authored-by: Henri Vasserman <[email protected]>
Co-authored-by: Concedo <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
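The commit adds runtime build-feature queries for the new backend: ggml_cpu_has_vulkan() is declared in ggml.h and ggml_cpu_has_gpublas() now also reports Vulkan builds; a Vulkan build can additionally be forced onto the FP32 shader path with the GGML_VULKAN_DISABLE_F16 environment variable mentioned in the changelog above. A minimal sketch, using only the ggml_cpu_has_* functions touched by this commit (the surrounding program and build flags are assumed, not part of the commit):

// Report which GPU code paths this ggml build was compiled with.
#include <stdio.h>
#include "ggml.h"

int main(void) {
    printf("cublas : %d\n", ggml_cpu_has_cublas());
    printf("clblast: %d\n", ggml_cpu_has_clblast());
    printf("vulkan : %d\n", ggml_cpu_has_vulkan());  // new in this commit
    printf("sycl   : %d\n", ggml_cpu_has_sycl());
    printf("gpublas: %d\n", ggml_cpu_has_gpublas()); // now also 1 for Vulkan builds
    return 0;
}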
- ggml-alloc.c +82 -24
- ggml-backend-impl.h +6 -0
- ggml-backend.c +102 -2
- ggml-backend.h +3 -0
- ggml-cuda.cu +3 -0
- ggml-metal.m +1 -0
- ggml-opencl.cpp +2 -0
- ggml.c +42 -3
- ggml.h +1 -0
ggml-alloc.c
@@ -778,38 +778,26 @@ size_t ggml_allocr_alloc_graph(ggml_allocr_t alloc, struct ggml_cgraph * graph)
 }
 
 // utils
-ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
-    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
-
-    size_t alignment = ggml_backend_buft_get_alignment(buft);
-
-    size_t nbytes = 0;
-    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
-        if (t->data == NULL && t->view_src == NULL) {
-            nbytes += GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
-        }
-    }
-
-    if (nbytes == 0) {
-        // all the tensors in the context are already allocated
-#ifndef NDEBUG
-        fprintf(stderr, "%s: all tensors in the context are already allocated\n", __func__);
-#endif
-        return NULL;
-    }
 
-    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, nbytes);
+static bool alloc_tensor_range(struct ggml_context * ctx,
+        struct ggml_tensor * first, struct ggml_tensor * last,
+        ggml_backend_buffer_type_t buft, size_t size,
+        ggml_backend_buffer_t ** buffers, size_t * n_buffers) {
+    ggml_backend_buffer_t buffer = ggml_backend_buft_alloc_buffer(buft, size);
     if (buffer == NULL) {
-        // failed to allocate buffer
 #ifndef NDEBUG
-        fprintf(stderr, "%s: failed to allocate buffer\n", __func__);
+        fprintf(stderr, "%s: failed to allocate %s buffer of size %zu\n", __func__, ggml_backend_buft_name(buft), size);
 #endif
-        return NULL;
+        for (size_t i = 0; i < *n_buffers; i++) {
+            ggml_backend_buffer_free(*buffers[i]);
+        }
+        free(buffers);
+        return false;
     }
 
     ggml_tallocr_t tallocr = ggml_tallocr_new_from_buffer(buffer);
 
-    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+    for (struct ggml_tensor * t = first; t != last; t = ggml_get_next_tensor(ctx, t)) {
         if (t->data == NULL) {
             if (t->view_src == NULL) {
                 ggml_tallocr_alloc(tallocr, t);
@@ -826,6 +814,76 @@ ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_conte
 
     ggml_tallocr_free(tallocr);
 
+    *buffers = realloc(*buffers, sizeof(ggml_backend_buffer_t) * (*n_buffers + 1));
+    (*buffers)[(*n_buffers)++] = buffer;
+
+    return true;
+}
+
+ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
+    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);
+
+    size_t alignment = ggml_backend_buft_get_alignment(buft);
+    size_t max_size  = ggml_backend_buft_get_max_size(buft);
+
+    ggml_backend_buffer_t * buffers = NULL;
+    size_t n_buffers = 0;
+
+    size_t cur_buf_size = 0;
+    struct ggml_tensor * first = ggml_get_first_tensor(ctx);
+    for (struct ggml_tensor * t = first; t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+        size_t this_size = 0;
+        if (t->data == NULL && t->view_src == NULL) {
+            this_size = GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
+        }
+
+        if (this_size > max_size) {
+            // tensor is too large to fit in a single buffer
+            fprintf(stderr, "%s: tensor %s is too large to fit in a %s buffer (tensor size: %zu, max buffer size: %zu)\n",
+                    __func__, t->name,
+                    ggml_backend_buft_name(buft),
+                    this_size, max_size);
+            for (size_t i = 0; i < n_buffers; i++) {
+                ggml_backend_buffer_free(buffers[i]);
+            }
+            free(buffers);
+            return NULL;
+        }
+
+        if ((cur_buf_size + this_size) > max_size) {
+            // allocate tensors in the current buffer
+            if (!alloc_tensor_range(ctx, first, t, buft, cur_buf_size, &buffers, &n_buffers)) {
+                return NULL;
+            }
+            first = t;
+            cur_buf_size = this_size;
+        } else {
+            cur_buf_size += this_size;
+        }
+    }
+
+    // allocate remaining tensors
+    if (cur_buf_size > 0) {
+        if (!alloc_tensor_range(ctx, first, NULL, buft, cur_buf_size, &buffers, &n_buffers)) {
+            return NULL;
+        }
+    }
+
+    if (n_buffers == 0) {
+        // all the tensors in the context are already allocated
+#ifndef NDEBUG
+        fprintf(stderr, "%s: all tensors in the context are already allocated\n", __func__);
+#endif
+        return NULL;
+    }
+
+    ggml_backend_buffer_t buffer;
+    if (n_buffers == 1) {
+        buffer = buffers[0];
+    } else {
+        buffer = ggml_backend_multi_buffer_alloc_buffer(buffers, n_buffers);
+    }
+    free(buffers);
     return buffer;
 }
 
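The rewritten ggml_backend_alloc_ctx_tensors_from_buft above packs tensors greedily into buffers no larger than the buffer type's max_size, then wraps the resulting buffers in a multi-buffer. A small standalone sketch of just that splitting policy, operating on a plain array of already padded sizes; the names split_ranges and range are invented for illustration and are not part of ggml:

// Walk tensor sizes in order; start a new buffer whenever adding the next
// tensor would exceed max_size (the real code then calls alloc_tensor_range()).
#include <stddef.h>
#include <stdio.h>

struct range { size_t first, last, bytes; }; // [first, last) indices into sizes[]

static size_t split_ranges(const size_t * sizes, size_t n, size_t max_size,
                           struct range * out, size_t out_cap) {
    size_t n_ranges = 0;
    size_t first = 0;
    size_t cur = 0;
    for (size_t i = 0; i < n; i++) {
        if (sizes[i] > max_size) {
            return 0; // a single tensor larger than max_size cannot be placed at all
        }
        if (cur + sizes[i] > max_size) {
            if (n_ranges < out_cap) out[n_ranges] = (struct range){ first, i, cur };
            n_ranges++;
            first = i;
            cur   = sizes[i];
        } else {
            cur += sizes[i];
        }
    }
    if (cur > 0) {
        if (n_ranges < out_cap) out[n_ranges] = (struct range){ first, n, cur };
        n_ranges++;
    }
    return n_ranges;
}

int main(void) {
    const size_t sizes[] = { 400, 300, 500, 200, 100 }; // already padded to alignment
    struct range r[8];
    size_t n = split_ranges(sizes, 5, 1000, r, 8);
    for (size_t i = 0; i < n; i++) {
        printf("buffer %zu: tensors [%zu, %zu), %zu bytes\n", i, r[i].first, r[i].last, r[i].bytes);
    }
    return 0; // with max_size = 1000 this yields two buffers of 700 and 800 bytes
}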
ggml-backend-impl.h
@@ -19,6 +19,7 @@ extern "C" {
         const char *          (*GGML_CALL get_name)        (ggml_backend_buffer_type_t buft);
         ggml_backend_buffer_t (*GGML_CALL alloc_buffer)    (ggml_backend_buffer_type_t buft, size_t size);
         size_t                (*GGML_CALL get_alignment)   (ggml_backend_buffer_type_t buft); // tensor alignment
+        size_t                (*GGML_CALL get_max_size)    (ggml_backend_buffer_type_t buft); // allocation max size
         size_t                (*GGML_CALL get_alloc_size)  (ggml_backend_buffer_type_t buft, const struct ggml_tensor * tensor); // data size needed to allocate the tensor, including padding
         bool                  (*GGML_CALL supports_backend)(ggml_backend_buffer_type_t buft, ggml_backend_t backend); // check if the buffer type is usable by the backend
         // check if tensor data is in host memory
@@ -63,6 +64,11 @@ extern "C" {
     // do not use directly, use ggml_backend_tensor_copy instead
     bool ggml_backend_buffer_copy_tensor(const struct ggml_tensor * src, struct ggml_tensor * dst);
 
+    // buffer that contains a collection of buffers
+    GGML_CALL ggml_backend_buffer_t ggml_backend_multi_buffer_alloc_buffer(ggml_backend_buffer_t * buffers, size_t n_buffers);
+    GGML_CALL bool                  ggml_backend_buffer_is_multi_buffer(ggml_backend_buffer_t buffer);
+    GGML_CALL void                  ggml_backend_multi_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage);
+
     //
     // Backend
     //
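The new get_max_size slot in ggml_backend_buffer_type_i is optional; a NULL entry falls back to SIZE_MAX (see ggml_backend_buft_get_max_size in ggml-backend.c below). A hedged sketch of how a device backend might fill it in, following the changelog note that Vulkan uses the minimum of maxMemoryAllocationSize and maxBufferSize; the context struct, its fields, and the "example" names are invented for illustration and assume the buffer type stores its device limits in its context pointer:

// Hypothetical implementation of the optional get_max_size callback.
struct example_buft_context {
    size_t max_memory_allocation_size; // e.g. VkPhysicalDeviceMaintenance3Properties::maxMemoryAllocationSize
    size_t max_buffer_size;            // e.g. VkPhysicalDeviceMaintenance4Properties::maxBufferSize
};

GGML_CALL static size_t ggml_backend_example_buffer_type_get_max_size(ggml_backend_buffer_type_t buft) {
    struct example_buft_context * ctx = (struct example_buft_context *) buft->context;
    size_t max_size = ctx->max_memory_allocation_size;
    if (ctx->max_buffer_size < max_size) {
        max_size = ctx->max_buffer_size; // min(maxMemoryAllocationSize, maxBufferSize), as in the commit message
    }
    return max_size;
}

// wired into the buffer type interface as:
//     /* .get_max_size  = */ ggml_backend_example_buffer_type_get_max_size,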
ggml-backend.c
@@ -27,6 +27,14 @@ size_t ggml_backend_buft_get_alignment(ggml_backend_buffer_type_t buft) {
     return buft->iface.get_alignment(buft);
 }
 
+size_t ggml_backend_buft_get_max_size(ggml_backend_buffer_type_t buft) {
+    // get_max_size is optional, defaults to SIZE_MAX
+    if (buft->iface.get_max_size) {
+        return buft->iface.get_max_size(buft);
+    }
+    return SIZE_MAX;
+}
+
 GGML_CALL size_t ggml_backend_buft_get_alloc_size(ggml_backend_buffer_type_t buft, struct ggml_tensor * tensor) {
     // get_alloc_size is optional, defaults to ggml_nbytes
     if (buft->iface.get_alloc_size) {
@@ -57,8 +65,6 @@ GGML_CALL ggml_backend_buffer_t ggml_backend_buffer_init(
         size_t size) {
     ggml_backend_buffer_t buffer = malloc(sizeof(struct ggml_backend_buffer));
 
-    GGML_ASSERT(iface.get_base != NULL);
-
     (*buffer) = (struct ggml_backend_buffer) {
         /* .interface = */ iface,
         /* .buft      = */ buft,
@@ -108,6 +114,10 @@ size_t ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer) {
     return ggml_backend_buft_get_alignment(ggml_backend_buffer_get_type(buffer));
 }
 
+size_t ggml_backend_buffer_get_max_size(ggml_backend_buffer_t buffer) {
+    return ggml_backend_buft_get_max_size(ggml_backend_buffer_get_type(buffer));
+}
+
 size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
     return ggml_backend_buft_get_alloc_size(ggml_backend_buffer_get_type(buffer), tensor);
 }
@@ -122,6 +132,11 @@ bool ggml_backend_buffer_is_host(ggml_backend_buffer_t buffer) {
 
 void ggml_backend_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage) {
     buffer->usage = usage;
+
+    // FIXME: add a generic callback to the buffer interface
+    if (ggml_backend_buffer_is_multi_buffer(buffer)) {
+        ggml_backend_multi_buffer_set_usage(buffer, usage);
+    }
 }
 
 ggml_backend_buffer_type_t ggml_backend_buffer_get_type(ggml_backend_buffer_t buffer) {
@@ -171,6 +186,10 @@ size_t ggml_backend_get_alignment(ggml_backend_t backend) {
     return ggml_backend_buft_get_alignment(ggml_backend_get_default_buffer_type(backend));
 }
 
+size_t ggml_backend_get_max_size(ggml_backend_t backend) {
+    return ggml_backend_buft_get_max_size(ggml_backend_get_default_buffer_type(backend));
+}
+
 void ggml_backend_tensor_set_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
     GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
     GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
@@ -349,6 +368,11 @@ GGML_CALL static void ggml_backend_registry_init(void) {
     extern GGML_CALL ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void);
     ggml_backend_register("Metal", ggml_backend_reg_metal_init, ggml_backend_metal_buffer_type(), NULL);
 #endif
+
+#ifdef GGML_USE_VULKAN
+    extern GGML_CALL int ggml_backend_vk_reg_devices(void);
+    ggml_backend_vk_reg_devices();
+#endif
 }
 
 GGML_CALL void ggml_backend_register(const char * name, ggml_backend_init_fn init_fn, ggml_backend_buffer_type_t default_buffer_type, void * user_data) {
@@ -552,6 +576,7 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_cpu_buffer_type(void) {
             /* .get_name         = */ ggml_backend_cpu_buffer_type_get_name,
             /* .alloc_buffer     = */ ggml_backend_cpu_buffer_type_alloc_buffer,
             /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
+            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
            /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
            /* .supports_backend = */ ggml_backend_cpu_buffer_type_supports_backend,
            /* .is_host          = */ ggml_backend_cpu_buffer_type_is_host,
@@ -607,6 +632,7 @@ ggml_backend_buffer_type_t ggml_backend_cpu_hbm_buffer_type(void) {
            /* .get_name         = */ ggml_backend_cpu_hbm_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_cpu_hbm_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
+            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
            /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
            /* .supports_backend = */ ggml_backend_cpu_buffer_type_supports_backend,
            /* .is_host          = */ ggml_backend_cpu_buffer_type_is_host,
@@ -763,6 +789,80 @@ GGML_CALL static ggml_backend_t ggml_backend_reg_cpu_init(const char * params, v
     GGML_UNUSED(user_data);
 }
 
+// multi-buffer buffer
+
+struct ggml_backend_multi_buffer_context {
+    ggml_backend_buffer_t * buffers;
+    size_t n_buffers;
+};
+
+typedef struct ggml_backend_multi_buffer_context * ggml_backend_multi_buffer_context_t;
+
+GGML_CALL static const char * ggml_backend_multi_buffer_get_name(ggml_backend_buffer_t buffer) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+
+    return ctx->buffers[0]->iface.get_name(ctx->buffers[0]);
+}
+
+GGML_CALL static void ggml_backend_multi_buffer_free_buffer(ggml_backend_buffer_t buffer) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+    for (size_t i = 0; i < ctx->n_buffers; i++) {
+        ggml_backend_buffer_free(ctx->buffers[i]);
+    }
+
+    free(ctx->buffers);
+    free(ctx);
+}
+
+GGML_CALL static void ggml_backend_multi_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+    for (size_t i = 0; i < ctx->n_buffers; i++) {
+        ggml_backend_buffer_clear(ctx->buffers[i], value);
+    }
+}
+
+static struct ggml_backend_buffer_i ggml_backend_multi_buffer_context_interface(void) {
+    static struct ggml_backend_buffer_i multi_backend_buffer_i = {
+        /* .get_name        = */ ggml_backend_multi_buffer_get_name,
+        /* .free_buffer     = */ ggml_backend_multi_buffer_free_buffer,
+        /* .get_base        = */ NULL,
+        /* .init_tensor     = */ NULL,
+        /* .set_tensor      = */ NULL,
+        /* .get_tensor      = */ NULL,
+        /* .cpy_tensor      = */ NULL,
+        /* .clear           = */ ggml_backend_multi_buffer_clear,
+        /* .reset           = */ NULL,
+    };
+
+    return multi_backend_buffer_i;
+}
+
+GGML_CALL ggml_backend_buffer_t ggml_backend_multi_buffer_alloc_buffer(ggml_backend_buffer_t * buffers, size_t n_buffers) {
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) malloc(sizeof(struct ggml_backend_multi_buffer_context));
+    ctx->n_buffers = n_buffers;
+    ctx->buffers = (ggml_backend_buffer_t *) malloc(n_buffers * sizeof(ggml_backend_buffer_t));
+
+    size_t total_size = 0;
+    for (size_t i = 0; i < n_buffers; i++) {
+        ctx->buffers[i] = buffers[i];
+        total_size += ggml_backend_buffer_get_size(buffers[i]);
+    }
+
+    return ggml_backend_buffer_init(buffers[0]->buft, ggml_backend_multi_buffer_context_interface(), ctx, total_size);
+}
+
+GGML_CALL bool ggml_backend_buffer_is_multi_buffer(ggml_backend_buffer_t buffer) {
+    return buffer->iface.get_name == ggml_backend_multi_buffer_get_name;
+}
+
+GGML_CALL void ggml_backend_multi_buffer_set_usage(ggml_backend_buffer_t buffer, enum ggml_backend_buffer_usage usage) {
+    GGML_ASSERT(ggml_backend_buffer_is_multi_buffer(buffer));
+    ggml_backend_multi_buffer_context_t ctx = (ggml_backend_multi_buffer_context_t) buffer->context;
+    for (size_t i = 0; i < ctx->n_buffers; i++) {
+        ggml_backend_buffer_set_usage(ctx->buffers[i], usage);
+    }
+}
+
 
 // scheduler
 
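ggml_backend_multi_buffer_alloc_buffer wraps several already-allocated buffers behind the ordinary ggml_backend_buffer_t interface: the wrapper reports the summed size, and clear, free, and (via the hook in ggml_backend_buffer_set_usage) usage changes fan out to the children. A minimal sketch of a caller combining two buffers of the same type, as ggml_backend_alloc_ctx_tensors_from_buft now does internally; buft is assumed to be a valid buffer type, error handling is omitted, and the usage value is only an example from the ggml_backend_buffer_usage enum:

#include <stdio.h>
#include "ggml-backend-impl.h" // declares ggml_backend_multi_buffer_alloc_buffer in this commit

static ggml_backend_buffer_t make_combined_buffer(ggml_backend_buffer_type_t buft) {
    ggml_backend_buffer_t parts[2];
    parts[0] = ggml_backend_buft_alloc_buffer(buft, 64u * 1024 * 1024);
    parts[1] = ggml_backend_buft_alloc_buffer(buft, 32u * 1024 * 1024);

    ggml_backend_buffer_t combined = ggml_backend_multi_buffer_alloc_buffer(parts, 2);

    // the wrapper reports the summed size and forwards set_usage/clear/free to both parts
    printf("combined size: %zu\n", ggml_backend_buffer_get_size(combined));
    ggml_backend_buffer_set_usage(combined, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

    return combined; // freeing `combined` later also frees parts[0] and parts[1]
}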
ggml-backend.h
@@ -20,6 +20,7 @@ extern "C" {
     GGML_API           const char *          ggml_backend_buft_name            (ggml_backend_buffer_type_t buft);
     GGML_API GGML_CALL ggml_backend_buffer_t ggml_backend_buft_alloc_buffer    (ggml_backend_buffer_type_t buft, size_t size);
     GGML_API           size_t                ggml_backend_buft_get_alignment   (ggml_backend_buffer_type_t buft);
+    GGML_API           size_t                ggml_backend_buft_get_max_size    (ggml_backend_buffer_type_t buft);
     GGML_API GGML_CALL size_t                ggml_backend_buft_get_alloc_size  (ggml_backend_buffer_type_t buft, struct ggml_tensor * tensor);
     GGML_API           bool                  ggml_backend_buft_supports_backend(ggml_backend_buffer_type_t buft, ggml_backend_t backend);
     GGML_API           bool                  ggml_backend_buft_is_host         (ggml_backend_buffer_type_t buft);
@@ -36,6 +37,7 @@ extern "C" {
     GGML_API           size_t ggml_backend_buffer_get_size      (ggml_backend_buffer_t buffer);
     GGML_API GGML_CALL void   ggml_backend_buffer_init_tensor   (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
     GGML_API           size_t ggml_backend_buffer_get_alignment (ggml_backend_buffer_t buffer);
+    GGML_API           size_t ggml_backend_buffer_get_max_size  (ggml_backend_buffer_t buffer);
     GGML_API           size_t ggml_backend_buffer_get_alloc_size(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor);
     GGML_API           void   ggml_backend_buffer_clear         (ggml_backend_buffer_t buffer, uint8_t value);
     GGML_API           bool   ggml_backend_buffer_is_host       (ggml_backend_buffer_t buffer);
@@ -54,6 +56,7 @@ extern "C" {
     GGML_API ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(ggml_backend_t backend);
     GGML_API ggml_backend_buffer_t      ggml_backend_alloc_buffer(ggml_backend_t backend, size_t size);
     GGML_API size_t                     ggml_backend_get_alignment(ggml_backend_t backend);
+    GGML_API size_t                     ggml_backend_get_max_size(ggml_backend_t backend);
 
     GGML_API void ggml_backend_tensor_set_async(ggml_backend_t backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
     GGML_API void ggml_backend_tensor_get_async(ggml_backend_t backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);
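With the new public getters, a caller can check the per-allocation limit before requesting one large buffer. A short sketch, assuming backend is an already initialized ggml_backend_t:

#include "ggml-backend.h"

// Returns NULL when a single allocation of this size is not possible on the backend.
static ggml_backend_buffer_t try_alloc(ggml_backend_t backend, size_t nbytes) {
    if (nbytes > ggml_backend_get_max_size(backend)) {
        // too large for one allocation; the caller would have to split it,
        // as ggml_backend_alloc_ctx_tensors_from_buft now does automatically
        return NULL;
    }
    return ggml_backend_alloc_buffer(backend, nbytes);
}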
ggml-cuda.cu
@@ -10440,6 +10440,7 @@ static ggml_backend_buffer_type_i ggml_backend_cuda_buffer_type_interface = {
     /* .get_name         = */ ggml_backend_cuda_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_cuda_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_cuda_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
     /* .get_alloc_size   = */ ggml_backend_cuda_buffer_type_get_alloc_size,
     /* .supports_backend = */ ggml_backend_cuda_buffer_type_supports_backend,
     /* .is_host          = */ NULL,
@@ -10715,6 +10716,7 @@ static ggml_backend_buffer_type_i ggml_backend_cuda_split_buffer_type_interface
     /* .get_name         = */ ggml_backend_cuda_split_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_cuda_split_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_cuda_split_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
     /* .get_alloc_size   = */ ggml_backend_cuda_split_buffer_type_get_alloc_size,
     /* .supports_backend = */ ggml_backend_cuda_split_buffer_type_supports_backend,
     /* .is_host          = */ ggml_backend_cuda_split_buffer_type_is_host,
@@ -10794,6 +10796,7 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_cuda_host_buffer_type() {
             /* .get_name         = */ ggml_backend_cuda_host_buffer_type_name,
             /* .alloc_buffer     = */ ggml_backend_cuda_host_buffer_type_alloc_buffer,
             /* .get_alignment    = */ ggml_backend_cpu_buffer_type()->iface.get_alignment,
+            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
             /* .get_alloc_size   = */ ggml_backend_cpu_buffer_type()->iface.get_alloc_size,
             /* .supports_backend = */ ggml_backend_cpu_buffer_type()->iface.supports_backend,
             /* .is_host          = */ ggml_backend_cpu_buffer_type()->iface.is_host,
ggml-metal.m
@@ -2400,6 +2400,7 @@ GGML_CALL ggml_backend_buffer_type_t ggml_backend_metal_buffer_type(void) {
             /* .get_name         = */ ggml_backend_metal_buffer_type_get_name,
             /* .alloc_buffer     = */ ggml_backend_metal_buffer_type_alloc_buffer,
             /* .get_alignment    = */ ggml_backend_metal_buffer_type_get_alignment,
+            /* .get_max_size     = */ NULL, // TODO: return device.maxBufferLength
             /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
             /* .supports_backend = */ ggml_backend_metal_buffer_type_supports_backend,
             /* .is_host          = */ ggml_backend_metal_buffer_type_is_host,
ggml-opencl.cpp
@@ -2136,6 +2136,7 @@ static ggml_backend_buffer_type_i ggml_backend_opencl_buffer_type_interface = {
     /* .get_name         = */ ggml_backend_opencl_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_opencl_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_opencl_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // TODO: return from device info
     /* .get_alloc_size   = */ NULL,
     /* .supports_backend = */ ggml_backend_opencl_buffer_type_supports_backend,
     /* .is_host          = */ NULL,
@@ -2192,6 +2193,7 @@ ggml_backend_buffer_type_t ggml_backend_opencl_host_buffer_type() {
             /* .get_name         = */ ggml_backend_opencl_host_buffer_type_name,
             /* .alloc_buffer     = */ ggml_backend_opencl_host_buffer_type_alloc_buffer,
             /* .get_alignment    = */ ggml_backend_cpu_buffer_type()->iface.get_alignment,
+            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
             /* .get_alloc_size   = */ ggml_backend_cpu_buffer_type()->iface.get_alloc_size,
             /* .supports_backend = */ ggml_backend_cpu_buffer_type()->iface.supports_backend,
             /* .is_host          = */ ggml_backend_cpu_buffer_type()->iface.is_host,
ggml.c
@@ -248,6 +248,8 @@ inline static void * ggml_aligned_malloc(size_t size) {
 #include "ggml-cuda.h"
 #elif defined(GGML_USE_CLBLAST)
 #include "ggml-opencl.h"
+#elif defined(GGML_USE_VULKAN)
+#include "ggml-vulkan.h"
 #elif defined(GGML_USE_SYCL)
 #include "ggml-sycl.h"
 #endif
@@ -2295,6 +2297,8 @@ struct ggml_context * ggml_init(struct ggml_init_params params) {
         ggml_init_cublas();
 #elif defined(GGML_USE_CLBLAST)
         ggml_cl_init();
+#elif defined(GGML_USE_VULKAN)
+        ggml_vk_init();
 #elif defined(GGML_USE_SYCL)
         ggml_init_sycl();
 #endif
@@ -8019,7 +8023,7 @@ static void ggml_compute_forward_mul_f32(
     const int ith = params->ith;
     const int nth = params->nth;
 
-#ifdef GGML_USE_CLBLAST
+#if defined(GGML_USE_CLBLAST)
     if (src1->backend == GGML_BACKEND_GPU) {
         // TODO: OpenCL kernel support full broadcast
         GGML_ASSERT(ggml_can_repeat_rows(src1, src0));
@@ -14703,6 +14707,18 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
     }
     GGML_ASSERT(tensor->src[0] == NULL || tensor->src[0]->backend == GGML_BACKEND_CPU);
     GGML_ASSERT(tensor->src[1] == NULL || tensor->src[1]->backend == GGML_BACKEND_CPU);
+#elif defined(GGML_USE_VULKAN)
+    const bool skip_cpu = ggml_vk_compute_forward(params, tensor);
+#ifdef GGML_VULKAN_CHECK_RESULTS
+    if (skip_cpu) {
+        ggml_vk_check_results_1(params, tensor);
+    }
+#endif
+    if (skip_cpu) {
+        return;
+    }
+    GGML_ASSERT(tensor->src[0] == NULL || tensor->src[0]->backend == GGML_BACKEND_CPU);
+    GGML_ASSERT(tensor->src[1] == NULL || tensor->src[1]->backend == GGML_BACKEND_CPU);
 #endif // GGML_USE_CUBLAS
 
 #ifdef GGML_USE_SYCL
@@ -17105,6 +17121,17 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
         }
     }
 
+#ifdef GGML_USE_VULKAN
+    for (int i = 0; i < cgraph->n_nodes; i++) {
+        ggml_vk_preallocate_buffers_graph(cgraph->nodes[i]);
+    }
+    ggml_vk_preallocate_buffers();
+
+    for (int i = 0; i < cgraph->n_nodes; i++) {
+        ggml_vk_build_graph(cgraph->nodes[i], i == cgraph->n_nodes - 1);
+    }
+#endif
+
     const int n_threads = cplan->n_threads;
 
     struct ggml_compute_state_shared state_shared = {
@@ -17156,6 +17183,10 @@ int ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan) {
         }
     }
 
+#ifdef GGML_USE_VULKAN
+    ggml_vk_graph_cleanup();
+#endif
+
     // performance stats (graph)
     {
         int64_t perf_cycles_cur = ggml_perf_cycles() - perf_start_cycles;
@@ -20290,7 +20321,7 @@ int ggml_cpu_has_wasm_simd(void) {
 }
 
 int ggml_cpu_has_blas(void) {
-#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_CLBLAST) || defined(GGML_USE_SYCL)
+#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS) || defined(GGML_USE_CUBLAS) || defined(GGML_USE_VULKAN) || defined(GGML_USE_CLBLAST) || defined(GGML_USE_SYCL)
     return 1;
 #else
     return 0;
@@ -20313,6 +20344,14 @@ int ggml_cpu_has_clblast(void) {
 #endif
 }
 
+int ggml_cpu_has_vulkan(void) {
+#if defined(GGML_USE_VULKAN)
+    return 1;
+#else
+    return 0;
+#endif
+}
+
 int ggml_cpu_has_sycl(void) {
 #if defined(GGML_USE_SYCL)
     return 1;
@@ -20322,7 +20361,7 @@ int ggml_cpu_has_sycl(void) {
 }
 
 int ggml_cpu_has_gpublas(void) {
-    return ggml_cpu_has_cublas() || ggml_cpu_has_clblast() || ggml_cpu_has_sycl();
+    return ggml_cpu_has_cublas() || ggml_cpu_has_clblast() || ggml_cpu_has_vulkan() || ggml_cpu_has_sycl();
 }
 
 int ggml_cpu_has_sse3(void) {
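The ggml.c changes hook Vulkan into graph execution at three points: a preallocate/build pass over the whole graph at the start of ggml_graph_compute, a per-node ggml_vk_compute_forward call in ggml_compute_forward that can skip the CPU kernel (with optional result checking under GGML_VULKAN_CHECK_RESULTS), and ggml_vk_graph_cleanup once the graph is done. A condensed, self-contained sketch of that call order with the Vulkan entry points stubbed out; only the sequencing mirrors the diff, the stub names and bodies are placeholders:

#include <stdbool.h>
#include <stdio.h>

static void vk_preallocate_buffers_graph_stub(int node) { printf("preallocate node %d\n", node); }
static void vk_preallocate_buffers_stub(void)           { printf("preallocate\n"); }
static void vk_build_graph_stub(int node, bool last)    { printf("build node %d%s\n", node, last ? " (last)" : ""); }
static bool vk_compute_forward_stub(int node)           { printf("compute node %d\n", node); return true; }
static void vk_graph_cleanup_stub(void)                 { printf("cleanup\n"); }

int main(void) {
    const int n_nodes = 3;

    // 1) before launching the worker threads: size buffers and record command buffers
    for (int i = 0; i < n_nodes; i++) vk_preallocate_buffers_graph_stub(i);
    vk_preallocate_buffers_stub();
    for (int i = 0; i < n_nodes; i++) vk_build_graph_stub(i, i == n_nodes - 1);

    // 2) per node during compute: the Vulkan path may take the op and skip the CPU kernel
    for (int i = 0; i < n_nodes; i++) {
        bool skip_cpu = vk_compute_forward_stub(i);
        if (!skip_cpu) { /* fall back to the CPU implementation of the op */ }
    }

    // 3) after the graph has been computed
    vk_graph_cleanup_stub();
    return 0;
}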
ggml.h
@@ -2263,6 +2263,7 @@ extern "C" {
     GGML_API int ggml_cpu_has_blas       (void);
     GGML_API int ggml_cpu_has_cublas     (void);
     GGML_API int ggml_cpu_has_clblast    (void);
+    GGML_API int ggml_cpu_has_vulkan     (void);
     GGML_API int ggml_cpu_has_gpublas    (void);
     GGML_API int ggml_cpu_has_sse3       (void);
     GGML_API int ggml_cpu_has_ssse3      (void);