KEMBAR78
Faster streaming support · Issue #137 · ggml-org/whisper.cpp · GitHub
Skip to content

Faster streaming support #137

@ameenba

Description

@ameenba

Have you tried building the spectrogram and encoder output in smaller chunks and appending? I think the spectrogram should generate fairly easily with minimal noise depending on the size of the chunk, and the encoder output can also be appended with sufficiently large chunks.

So the encoder instead takes in Bx80xN as its input and outputs Bx(N/2)x(embedding size), if you wanted to send in 1s of audio into the tiny.en model, for example, 1x80x100->1x50x384. This should result in much faster processing for short clips (when the audio clip is <30s), and allows real time streaming without much wasted computation (like having to calculate a full x1500 encoding for each chunk of audio).

Some noise may be introduced at various size of chunks (spectrogram chunk size can be independent from encoder chunk size), and some overlap of spectrogram input/encoder output may help further to reduce that noise. This allows us for better scheduled deployment where the decoder, encoder, and spectrogram can run on different threads at the same time to produce our transcription.

Choosing when to decode will be another challenge as you don't want to decode if a full word is not complete in the encoding, but there are definitely solutions around that as well.

Metadata

Metadata

Assignees

No one assigned

    Labels

    ideasInteresting ideas for experimentation

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions