| whispercpp | |
| ========== | |
|  | |
| Ruby bindings for [whisper.cpp][], an interface to an automatic speech recognition model. | |
| Installation | |
| ------------ | |
| Install the gem and add it to the application's Gemfile by executing: | |
| $ bundle add whispercpp | |
| If bundler is not being used to manage dependencies, install the gem by executing: | |
| $ gem install whispercpp | |
| You can pass build options for whisper.cpp, for instance: | |
| $ bundle config build.whispercpp --enable-ggml-cuda | |
| or, | |
| $ gem install whispercpp -- --enable-ggml-cuda | |
| See whisper.cpp's [README](https://github.com/ggml-org/whisper.cpp/blob/master/README.md) for available options. You need to convert the options listed in the README to Ruby-style options. | |
| For boolean options like `GGML_CUDA`, the README says `-DGGML_CUDA=1`. You need to strip `-D`, prepend `--enable-` for `1` or `ON` (`--disable-` for `0` or `OFF`), and convert the name to kebab-case: `--enable-ggml-cuda`. | |
| For options that take an argument, like `CMAKE_CUDA_ARCHITECTURES`, the README says `-DCMAKE_CUDA_ARCHITECTURES="86"`. You need to strip `-D`, prepend `--`, convert the name to kebab-case, and append `=` followed by the argument: `--cmake-cuda-architectures="86"`. | |
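The two conversion rules above can be sketched as a small Ruby helper. `ruby_style_option` is a hypothetical name and not part of the gem. The sketch assumes that any value of `1`, `ON`, `0` or `OFF` marks a boolean option, so an argument option whose value happens to be `1` would be misclassified.

```ruby
# Hypothetical helper (not part of whispercpp): converts a CMake-style option
# from whisper.cpp's README into the Ruby-style option described above.
def ruby_style_option(cmake_option)
  name, value = cmake_option.delete_prefix("-D").split("=", 2)
  raise ArgumentError, "expected -DNAME=VALUE" if value.nil?

  kebab = name.downcase.tr("_", "-")

  # Assumption: 1/ON/0/OFF always mean a boolean option.
  case value
  when "1", "ON"  then "--enable-#{kebab}"
  when "0", "OFF" then "--disable-#{kebab}"
  else "--#{kebab}=#{value}"
  end
end

puts ruby_style_option("-DGGML_CUDA=1")
puts ruby_style_option('-DCMAKE_CUDA_ARCHITECTURES="86"')
```

The result can then be passed after `--` to `gem install whispercpp`, or to `bundle config build.whispercpp`.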
| Usage | |
| ----- | |
| ```ruby | |
| require "whisper" | |
| whisper = Whisper::Context.new("base") | |
| params = Whisper::Params.new( | |
|   language: "en", | |
|   offset: 10_000, | |
|   duration: 60_000, | |
|   max_text_tokens: 300, | |
|   translate: true, | |
|   print_timestamps: false, | |
|   initial_prompt: "Initial prompt here." | |
| ) | |
| whisper.transcribe("path/to/audio.wav", params) do |whole_text| | |
|   puts whole_text | |
| end | |
| ``` | |
| ### Preparing model ### | |
| Some models are prepared up-front: | |
| ```ruby | |
| base_en = Whisper::Model.pre_converted_models["base.en"] | |
| whisper = Whisper::Context.new(base_en) | |
| ``` | |
| The first time you use a model, it is downloaded automatically. After that, the cached file is used. To clear the cache, call `#clear_cache`: | |
| ```ruby | |
| Whisper::Model.pre_converted_models["base"].clear_cache | |
| ``` | |
| You can also use a shorthand for pre-converted models: | |
| ```ruby | |
| whisper = Whisper::Context.new("base.en") | |
| ``` | |
| You can list the names of the prepared models with `Whisper::Model.pre_converted_models.keys`: | |
| ```ruby | |
| puts Whisper::Model.pre_converted_models.keys | |
| # tiny | |
| # tiny.en | |
| # tiny-q5_1 | |
| # tiny.en-q5_1 | |
| # tiny-q8_0 | |
| # base | |
| # base.en | |
| # base-q5_1 | |
| # base.en-q5_1 | |
| # base-q8_0 | |
| # : | |
| # : | |
| ``` | |
| You can also use local model files you prepared: | |
| ```ruby | |
| whisper = Whisper::Context.new("path/to/your/model.bin") | |
| ``` | |
| Or, you can download model files: | |
| ```ruby | |
| whisper = Whisper::Context.new("https://example.net/uri/of/your/model.bin") | |
| # Or | |
| whisper = Whisper::Context.new(URI("https://example.net/uri/of/your/model.bin")) | |
| ``` | |
| See the [models][] page for details. | |
| ### Preparing audio file ### | |
| Currently, whisper.cpp accepts only 16-bit WAV files. | |
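One way to check a file against this requirement before transcribing is to read its RIFF header. The sketch below is only an illustration: `wav_format` and `pcm16?` are hypothetical helpers, not part of the gem. It assumes a standard RIFF/WAVE layout and accepts only the plain PCM format tag (`1`), so a 16-bit file stored with an extensible format tag would be rejected.

```ruby
# Hypothetical helper (not part of whispercpp): reads the "fmt " chunk of a
# RIFF/WAVE stream and returns its main fields.
def wav_format(io)
  riff, _size, wave = io.read(12).to_s.unpack("a4Va4")
  raise ArgumentError, "not a RIFF/WAVE stream" unless riff == "RIFF" && wave == "WAVE"

  # Walk the chunks until the "fmt " chunk is found.
  while (header = io.read(8)) && header.bytesize == 8
    id, size = header.unpack("a4V")
    if id == "fmt "
      format, channels, sample_rate, _byte_rate, _block_align, bits =
        io.read(16).to_s.unpack("vvVVvv")
      return { format: format, channels: channels, sample_rate: sample_rate, bits_per_sample: bits }
    end
    io.seek(size + (size & 1), IO::SEEK_CUR) # chunks are padded to an even size
  end
  raise ArgumentError, "fmt chunk not found"
end

# Assumption: "16-bit WAV" is taken to mean PCM (format tag 1) with 16 bits per sample.
def pcm16?(io)
  fmt = wav_format(io)
  fmt[:format] == 1 && fmt[:bits_per_sample] == 16
end
```

For a file on disk, this could be called as `File.open("path/to/audio.wav", "rb") { |f| pcm16?(f) }`.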
| API | |
| --- | |
| ### Segments ### | |
| Once `Whisper::Context#transcribe` has been called, you can retrieve segments with `#each_segment`: | |
| ```ruby | |
| def format_time(time_ms) | |
|   sec, decimal_part = time_ms.divmod(1000) | |
|   min, sec = sec.divmod(60) | |
|   hour, min = min.divmod(60) | |
|   "%02d:%02d:%02d.%03d" % [hour, min, sec, decimal_part] | |
| end | |
| whisper | |
|   .transcribe("path/to/audio.wav", params) | |
|   .each_segment.with_index do |segment, index| | |
|     line = "[%{nth}: %{st} --> %{ed}] %{text}" % { | |
|       nth: index + 1, | |
|       st: format_time(segment.start_time), | |
|       ed: format_time(segment.end_time), | |
|       text: segment.text | |
|     } | |
|     line << " (speaker turned)" if segment.speaker_next_turn? | |
|     puts line | |
|   end | |
| ``` | |
| You can also register a hook on the params that is called for each new segment: | |
| ```ruby | |
| # Add hook before calling #transcribe | |
| params.on_new_segment do |segment| | |
|   line = "[%{st} --> %{ed}] %{text}" % { | |
|     st: format_time(segment.start_time), | |
|     ed: format_time(segment.end_time), | |
|     text: segment.text | |
|   } | |
|   line << " (speaker turned)" if segment.speaker_next_turn? | |
|   puts line | |
| end | |
| whisper.transcribe("path/to/audio.wav", params) | |
| ``` | |
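The same three readers (`start_time`, `end_time`, `text`) are enough to produce subtitles. The following is a minimal sketch of SRT output. `to_srt`, `srt_time` and the `Segment` struct are hypothetical names; the struct only stands in for the gem's segment objects so that the example runs without a model.

```ruby
# Stand-in for the gem's segment objects (assumption: times are in milliseconds,
# as in format_time above).
Segment = Struct.new(:start_time, :end_time, :text)

# SRT uses a comma before the milliseconds part.
def srt_time(time_ms)
  sec, ms = time_ms.divmod(1000)
  min, sec = sec.divmod(60)
  hour, min = min.divmod(60)
  format("%02d:%02d:%02d,%03d", hour, min, sec, ms)
end

# Builds one numbered SRT cue per segment, separated by blank lines.
def to_srt(segments)
  segments.each_with_index.map { |segment, index|
    "#{index + 1}\n" \
      "#{srt_time(segment.start_time)} --> #{srt_time(segment.end_time)}\n" \
      "#{segment.text.strip}\n"
  }.join("\n")
end

segments = [
  Segment.new(0, 2_500, " Hello."),
  Segment.new(2_500, 3_661_042, " Goodbye.")
]
puts to_srt(segments)
```

With the gem, the enumerator returned by `#each_segment` could presumably be passed to `to_srt` in place of the array.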
| ### Models ### | |
| You can see model information: | |
| ```ruby | |
| whisper = Whisper::Context.new("base") | |
| model = whisper.model | |
| model.n_vocab # => 51864 | |
| model.n_audio_ctx # => 1500 | |
| model.n_audio_state # => 512 | |
| model.n_audio_head # => 8 | |
| model.n_audio_layer # => 6 | |
| model.n_text_ctx # => 448 | |
| model.n_text_state # => 512 | |
| model.n_text_head # => 8 | |
| model.n_text_layer # => 6 | |
| model.n_mels # => 80 | |
| model.ftype # => 1 | |
| model.type # => "base" | |
| ``` | |
| ### Logging ### | |
| You can set a log callback: | |
| ```ruby | |
| prefix = "[MyApp] " | |
| log_callback = ->(level, buffer, user_data) { | |
|   case level | |
|   when Whisper::LOG_LEVEL_NONE | |
|     puts "#{user_data}none: #{buffer}" | |
|   when Whisper::LOG_LEVEL_INFO | |
|     puts "#{user_data}info: #{buffer}" | |
|   when Whisper::LOG_LEVEL_WARN | |
|     puts "#{user_data}warn: #{buffer}" | |
|   when Whisper::LOG_LEVEL_ERROR | |
|     puts "#{user_data}error: #{buffer}" | |
|   when Whisper::LOG_LEVEL_DEBUG | |
|     puts "#{user_data}debug: #{buffer}" | |
|   when Whisper::LOG_LEVEL_CONT | |
|     puts "#{user_data}same as previous: #{buffer}" | |
|   end | |
| } | |
| Whisper.log_set log_callback, prefix | |
| ``` | |
| Using this feature, you can also suppress log output: | |
| ```ruby | |
| Whisper.log_set ->(level, buffer, user_data) { | |
|   # do nothing | |
| }, nil | |
| Whisper::Context.new("base") | |
| ``` | |
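Between printing everything and suppressing everything, the callback could also forward messages to Ruby's standard `Logger`. The sketch below shows one possible adapter. `build_log_callback` and the severity map are hypothetical; only `Whisper.log_set` and the `LOG_LEVEL_*` constants come from the text above. Levels missing from the map reuse the previous severity, which is how the sketch handles `LOG_LEVEL_CONT`.

```ruby
require "logger"

# Hypothetical adapter (not part of whispercpp): returns a lambda with the
# (level, buffer, user_data) signature used by the callbacks above.
def build_log_callback(logger, severities)
  last = Logger::INFO
  ->(level, buffer, _user_data) {
    # Unknown levels (for example a "continue" level) keep the previous severity.
    last = severities.fetch(level, last)
    logger.add(last, buffer.chomp)
  }
end

# With the gem loaded, the wiring might look like this (untested assumption):
#   severities = {
#     Whisper::LOG_LEVEL_DEBUG => Logger::DEBUG,
#     Whisper::LOG_LEVEL_INFO  => Logger::INFO,
#     Whisper::LOG_LEVEL_WARN  => Logger::WARN,
#     Whisper::LOG_LEVEL_ERROR => Logger::ERROR
#   }
#   Whisper.log_set build_log_callback(Logger.new($stderr), severities), nil
```

Taking the severity map as an argument keeps the adapter independent of the gem's constants, so it can be exercised without loading a model.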
| ### Low-level API to transcribe ### | |
| You can also call `Whisper::Context#full` and `#full_parallel` with a Ruby array of samples. Calling `#transcribe` with an audio file path is recommended because it extracts PCM samples in C++ and is fast, but `#full` and `#full_parallel` give you more flexibility. | |
| ```ruby | |
| require "whisper" | |
| require "wavefile" | |
| reader = WaveFile::Reader.new("path/to/audio.wav", WaveFile::Format.new(:mono, :float, 16000)) | |
| samples = reader.enum_for(:each_buffer).map(&:samples).flatten | |
| whisper = Whisper::Context.new("base") | |
| whisper | |
|   .full(Whisper::Params.new, samples) | |
|   .each_segment do |segment| | |
|     puts segment.text | |
|   end | |
| ``` | |
| The second argument `samples` may be an array, an object with `length` and `each` methods, or a MemoryView. If you can prepare the audio data as a C array and export it as a MemoryView, whispercpp works with it directly, without copying. | |
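If the audio arrives as raw 16-bit PCM bytes rather than a WAV file, a plain Ruby array of samples can be built without extra gems. This is a minimal sketch: `pcm16_to_floats` is a hypothetical helper, and it assumes that `#full` expects mono Float samples normalized to roughly [-1.0, 1.0], as the `:float` format in the WaveFile example suggests.

```ruby
# Hypothetical helper (not part of whispercpp): converts raw little-endian
# signed 16-bit PCM bytes into an array of Floats in the range [-1.0, 1.0).
def pcm16_to_floats(bytes)
  bytes.unpack("s<*").map { |sample| sample / 32768.0 }
end

# Four example samples packed as 16-bit little-endian integers.
raw = [0, 16_384, -32_768, 32_767].pack("s<*")
p pcm16_to_floats(raw)
```

The resulting array could then be passed as the second argument of `#full`, in place of `samples` in the example above.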
| Development | |
| ----------- | |
| % git clone https://github.com/ggml-org/whisper.cpp.git | |
| % cd whisper.cpp/bindings/ruby | |
| % rake test | |
| The first call to `rake test` builds the extension and downloads a model for testing. After that, you can add tests in the `tests` directory and modify `ext/ruby_whisper.cpp`. | |
| If the build seems broken, running `rake clean` fixes it in some cases. | |
| License | |
| ------- | |
| The same as [whisper.cpp][]. | |
| [whisper.cpp]: https://github.com/ggml-org/whisper.cpp | |
| [models]: https://github.com/ggml-org/whisper.cpp/tree/master/models | |