Commit History
CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132)
c768824
musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236)
4168dda
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)
1d24833
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949)
149f5a5
CUDA: skip masked KV slices for all FA kernels (llama/14924)
0c60f80
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945)
e37eff3
HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (llama/14930)
f9dbd96
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (llama/14624)
5422b31
deepsek
committed on
musa: upgrade musa sdk to rc4.2.0 (llama/14498)
a687ec3
musa: fix build warnings (unused variable) (llama/14561)
891b1d1
CUDA: add dynamic shared mem to softmax, refactor general usage (llama/14497)
8e1f56c
musa: enable fp16 mma (all) and cublas on qy2 (llama/13842)
e35329b
CUDA/HIP: optimize mmv paths taken for HIP devices (llama/14324)
1a9d2d3
CUDA: mul_mat_v support for batch sizes > 1 (llama/14262)
2d1e6e7
HIP: enable vec fattn on RDNA4 (llama/14323)
b6dc6a1
uvos
committed on
CUDA: add mean operation (llama/14313)
7cee55b
cuda : synchronize graph capture and cublas handle destruction (llama/14288)
39c4fa5
Diego Devesa
committed on
HIP: disable rocwmma on gfx12 by default until rocm 7.0 (llama/14202)
f95736f
uvos
committed on
HIP: Replace usage of deprecated preprocessor macro __AMDGCN_WAVEFRONT_SIZE__ (llama/14183)
c3467c7
uvos
committed on
ggml-cpu : split arch-specific implementations (llama/13892)
8c833e9
CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (llama/13895)
a75e157
cuda : avoid cuGetErrorString (llama/13791)
cdf95d3
CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
507d30c
whisper: remove MSVC warnings pragmas (#3090)
e0d130c
musa: fix typo in cc control (llama/13144)
5fb7320
R0CKSTAR
committed on
Simplify and improve CUDA graphs through use of indirect copy pointers (llama/9017)
a2fdbe6
Alan Gray
slaren
committed on
musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (llama/12611)
12bb60d
R0CKSTAR
committed on
HIP: Add support for RDNA4 targets (llama/12372)
a73f01f
Slobodan Josic
committed on
CUDA: Fix clang warnings (llama/12540)
efa6dac
R0CKSTAR
committed on
musa: refine compute capability (llama/12493)
5e508d2
R0CKSTAR
committed on
cuda : enable CUDA Graph on CUDA Toolkit < 12.x (llama/12394)
1e69b8c
Gaurav Garg
committed on
CUDA/HIP: refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. (llama/12177)
1f75790
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)
a027c1d
David Huang
committed on
CUDA: app option to compile without FlashAttention (llama/12025)
fbc5f16
MUSA: support ARM64 and enable dp4a .etc (llama/11843)
ab96dac
Bodhi
Bodhi Hu
committed on
CUDA: use async data loading for FlashAttention (llama/11894)
5b9980d
CUDA: fix CUDART_VERSION checks (llama/11821)
04f123a
CUDA: use arch list for compatibility check (llama/11775)
b88e163
CUDA/HIP: add support for selectable warp size to mmv (llama/11519)
ed08269
uvos
committed on
HIP: add GGML_CUDA_CC_IS_* for AMD families, as increasing cc architectures for AMD GPUs are not supersets of each other (llama/11601)
4850c24
uvos
committed on
CUDA: use mma PTX instructions for FlashAttention (llama/11583)
f328957
HIP: Prepare reduction operators for wave 64
bc1c1a4
uvos
committed on
CUDA/HIP: add warp_size to cuda_device_info
e538e2c
uvos
committed on
AMD: parse the architecture as supplied by gcnArchName (llama/11244)
04b01d8
Haus1
committed on
HIP: disable VMM on HIP as it seems that it doesn't work in some configurations (llama/11420)
2cc4df4
uvos
committed on
hip : Add hipGraph and VMM support to ROCM (llama/11362)
089afa0
uvos
committed on
CUDA: rename macros to avoid conflicts with WinAPI (llama/10736)
8544072
Andreas Kieslinger
committed on
Add some minimal optimizations for CDNA (llama/10498)
bf49bbe
uvos
committed on