Commit History
HIP: Cleanup hipification header (llama/15285)
7cdf9cd
cuda : fix GGML_CUDA_GRAPHS=OFF (llama/15300)
59c694d
Sigbjørn Skjæret
committed on
finetune: SGD optimizer, more CLI args (llama/13873)
f585fe7
HIP: bump requirement to rocm 6.1 (llama/15296)
58a3802
CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (llama/15132)
c768824
HIP: disable sync warp shuffle operators from clr amd_warp_sync_functions.h (llama/15273)
8fca6dd
CUDA cmake: add `-lineinfo` for easier debug (llama/15260)
008e169
musa: fix failures in test-backend-ops for mul_mat_id op (llama/15236)
4168dda
cuda: refactored ssm_scan and use CUB (llama/13291)
7a187d1
David Zhao
committed on
CUDA: add attention sinks for tile and wmma (llama/15178)
46e7c87
ggml : fix field name when new ggml_backend (llama/14944)
685748d
AN Long
committed on
CUDA: attention sinks for mma FlashAttention (llama/15157)
0ab9aba
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (llama/15131)
1d24833
CUDA: use mma FA kernel for gqa > 4 on RTX 4000 (llama/15035)
9e85264
cuda: make im2col a little faster (llama/15025)
9a85c65
cuda, sycl : fix batched gemm when ne02 == 1 && ne03 > 1 (llama/15038)
cc3a2ed
CUDA: fix MMQ nwarps for AMD with warp_size==32 (llama/15014)
fbc3cd1
HIP: enable mfma mmq on gfx908 and gfx90a for select datatypes and shapes (llama/14949)
149f5a5
CUDA: skip masked KV slices for all FA kernels (llama/14924)
0c60f80
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945)
e37eff3
HIP: add GGML_HIP_MMQ_MFMA option to allow disabling the MFMA path. (llama/14930)
f9dbd96
HIP: Ignore unsupported unroll transformation in fattn-vec (llama/14931)
8e133f7
cuda : add softcap fusion (llama/14907)
2237878
Sigbjørn Skjæret
committed on
CUDA: add roll (llama/14919)
d41a4ec
CUDA: fix pointer incrementation in FA (llama/14916)
eb84e7e
HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 (llama/14624)
5422b31
deepsek
committed on
musa: fix build warnings (unused variable) (llama/14869)
f38d409
musa: upgrade musa sdk to rc4.2.0 (llama/14498)
a687ec3
CUDA: fix overflow in FA, tune performance (llama/14840)
10ac92f
CUDA: fix compilation with GGML_CUDA_F16 (llama/14837)
2746afd
CUDA: fix quantized KV cache + multiple sequences (llama/14822)
88864af
CUDA: add fused rms norm (llama/14800)
79bc58c
cuda : implement bf16 cpy ops and enable bf16 cont (llama/14763)
b54b644
Sigbjørn Skjæret
committed on
cuda: remove linking to cublasLt (llama/14790)
fafaa8b
vulkan/cuda: Fix im2col when KW!=KH (llama/14789)
0be0329
cuda : Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs (llama/14741)
bb523fb
Oliver Simons
committed on
CUDA: set_rows + cpy.cu refactor (llama/14712)
536128f
llama : add high-throughput mode (llama/14363)
b2d73a2
cuda: fix build warnings in set-rows.cu (unused variable) (llama/14687)
1e145c7
cuda : add set rows for bf16 (llama/14664)
1f97ff4
Sigbjørn Skjæret
committed on
cuda : add ELU support (llama/14657)
cbe8006
Yavor Ivanov
committed on
ggml : add build-time message to remind about ggml_set_rows (llama/14661)
0f5d4ba
CUDA: add set rows for f32 and f16 (llama/14551)
e51f2d4
model : support LiquidAI LFM2 hybrid family (llama/14620)
07ff90a
Tarek Dakhran
committed on
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (llama/14634)
4354560
Slobodan Josic
committed on