

Combining multiple speech recognition engines in Nimble Streamer

Nimble Streamer supports AI-powered automatic speech recognition (ASR) through several technology providers: Whisper, Speechmatics and KWIKmotion.

This feature set enables automatic generation of closed captions and translated WebVTT subtitles for HLS live streams. WebVTT for MPEG-DASH is also supported; see the respective section below.
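For context, here’s roughly how a WebVTT subtitle track shows up in an HLS master playlist; this is a hypothetical fragment, as the actual group names and URIs generated by Nimble will differ:

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,URI="subs_en/playlist.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1920x1080,SUBTITLES="subs"
video_hd/playlist.m3u8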

There are cases when a streaming provider needs to use a combination of these recognition technologies on the same server. For instance, Whisper is used for English subtitles in one set of streams, and KWIKmotion is used for Arabic transcription and translation in another – all on the same server.

In this post we’ll describe how to use these recognition engines on the same Nimble Streamer instance.

Configuration mechanics

Before moving forward, let us recall how stream URLs work. This will help describe the processing logic.

URL structure and why it’s important

Let’s say there’s a stream URL https://servername/app/stream/playlist.m3u8. It’s a typical URL with a standard structure.

Here, “app” is the application name, as defined in Nimble Streamer live streams configuration. Applications are the way you separate groups of streams. E.g. if there are streams for a TV station called “TV1”, you can name the app “tv1” and keep all of its streams there. Going forward, the same set of settings can be applied to that station’s content: output streams, transcoding settings, paywall restrictions etc.

The “stream” is the name of a particular stream within that application. For example, the TV1 station may have a FullHD original stream, a bunch of downscaled renditions and streams in different languages. These stream names might be “original”, “abr-stream”, “espanol” etc.
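Putting these together, playback URLs for that hypothetical station would look like this:

https://servername/tv1/original/playlist.m3u8
https://servername/tv1/espanol/playlist.m3u8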

Configuration files and UI

There are three places where the ASR behavior can be configured:

  • WMSPanel live streams settings, where Transcription is enabled either at the server level or for particular applications. That’s why we emphasized the importance of the application in the stream URL.
  • nimble.conf file located at /etc/nimble/nimble.conf
  • transcriber-config.json file located at /etc/nimble/transcriber-config.json

WMSPanel UI

WMSPanel UI setup is simple and straightforward. Go to the Nimble Streamer top menu and choose Live streams settings.

In our setup, transcription is enabled at the server level in the Global tab, with HLS and SLDP as the output protocols. The same can be done in the Applications tab by creating particular apps with their respective names and settings.

If you don’t enable the transcriber in WMSPanel settings, the respective apps and streams will not be processed by any ASR engine.

nimble.conf

The nimble.conf file defines server-wide ASR behavior. Whatever is put here will apply to all apps and streams by default. For example, these are the lines for a Whisper setup:

transcriber_type = whisper
whisper_language = en

With this setting, the server will transcribe streams for apps that have transcription enabled in the WMSPanel UI, or all of the server’s streams if transcription is enabled globally.

To make the new parameters work, restart the Nimble instance:

sudo service nimble restart

You may omit ASR settings in nimble.conf entirely and leave only the transcriber-config.json path as shown below. In this case, the ASR behavior will be defined solely in transcriber-config.json.

transcriber-config.json

In order to enable transcriber-config.json processing, add this line to nimble.conf:

transcriber_config_path = /etc/nimble/transcriber-config.json
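For instance, a nimble.conf that keeps Whisper as the server-wide default and delegates per-app and per-stream overrides to the JSON file would simply combine the lines shown above:

transcriber_type = whisper
whisper_language = en
transcriber_config_path = /etc/nimble/transcriber-config.json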

The transcriber-config.json file contains JSON of the following structure:

{
    "whisper_params" : [
        {"app": "live", "lang": "en"}
        ... other Whisper AI params
    ],
    "speechmatics_params" : [
        {
          "api_url": "wss://eu2.rt.speechmatics.com/v2",
          "api_key": "<your_api_key>"
        },
        {"app":"live1", "lang":"en"},
        ... other Speechmatics params
    ],
    "kwikmotion_params" : [
        {
          "api_url": "wss://livecc01.kwikmotion.com:8004",
          "api_key": "<your_api_key>"
        },
        {"app":"live1", "stream":"stream1", "lang":"ar"},
        ... other KWIKmotion params
    ]
}

We’ll see more config samples below in the Setup example section.

You can see the whisper_params, speechmatics_params and kwikmotion_params sections, each describing the respective engine’s behavior. If you use only one of these engines, keep only its section. Learn more about each engine’s setup in the respective blog posts. For now, notice the “app” and “stream” elements that are used for defining priorities.
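Notice that the sample above uses “…” placeholders, so it won’t parse as-is; your actual file must be valid JSON. Before applying it, you may want to check the syntax with any JSON tool of your choice, for example:

python3 -m json.tool /etc/nimble/transcriber-config.json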

Processing priorities

The overall order of processing is as follows:

  1. Whatever is defined in nimble.conf is applied by default, so every app and stream on the server will be transcribed and translated by the engine specified there.
  2. If there is an application setting in transcriber-config.json for a particular engine, that application’s settings take priority. So if nimble.conf has Whisper as the ASR engine and transcriber-config.json has Speechmatics for a particular app, then that app and all of its streams will be processed by Speechmatics. Other apps and streams will be processed by Whisper.
  3. If there is a stream setting in transcriber-config.json for a particular engine, that stream’s setting takes priority. So if nimble.conf has Whisper as the ASR engine, transcriber-config.json has Speechmatics for a particular app, and that app also has a KWIKmotion setting for one particular stream, then that stream will be processed by KWIKmotion. Other apps and streams will be processed by Whisper and Speechmatics accordingly.

Let’s examine some config to see how it works.

Setup example

We’ll use comprehensive configs to show several variations.

The nimble.conf file looks like this:

transcriber_type = whisper
whisper_language = en

The transcriber-config.json file content is this:

{
    "kwikmotion_params" : [
      {
        "api_url": "wss://livecc01.kwikmotion.com:8004",
        "api_key": "<your_api_key>"
      },
      {"app":"live", "stream":"stream", "lang":"en"},
      {"app":"live", "stream":"stream-ar-hd", "lang":"ar", "target_langs": "ar", "webvtt_style": "line:70% position:50% align:center"}
    ],
    "speechmatics_params" : [
        {
          "api_url": "wss://eu2.rt.speechmatics.com/v2",
          "api_key": "<your_api_key>"
        },
        {"app":"live-es", "lang":"es"},
        {"app":"live", "stream":"stream-hd", "lang":"en", "target_langs": "es,fr,de"},
        {"app":"live", "stream":"med-conference", "lang":"en", "target_langs": "en", "transcription_config": { "operating_point": "enhanced", "domain": "medical"} }
    ],
    "whisper_params" : [
        {"app": "live", "stream": "stream-special", "lang": "en", "use_gpu": false, "whisper_model_path": "external/whisper/whisper.cpp/models/ggml-large-v2.bin", 
            "whisper_full_params" : {
                "n_threads" : 4,
                "temperature": 0.0,
                "best_of": 5,
                "temperature_inc": 0.2,
                "entropy_thold": 2.4,
                "logprob_thold": -1.0,
                "no_speech_thold": 0.6
            }
        }
    ]
}

Now let’s see which sets of parameters can be applied to which streams. We’ll use the notation “app/stream” as shorthand.
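E.g. live/stream-hd refers to the stream-hd stream in the live application, which corresponds to the https://servername/live/stream-hd/playlist.m3u8 playback URL in terms of the URL structure described above.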

  • Whisper is used for English transcription for any stream by default, except the specific app/stream combinations processed by other engines.
    • In addition to that, live/stream-special is also processed by Whisper with a set of specific parameters like a different model and a set of precise settings.
  • KWIKmotion is used for just 2 specific streams:
    • live/stream for English
    • live/stream-ar-hd for Arabic subtitles.
  • Speechmatics will process:
    • any streams from live-es app for Spanish,
    • live/stream-hd stream for Spanish, French and German translation,
    • live/med-conference stream for English with the medical thesaurus for better transcription.

Let’s take a few example streams to see which engine will process them.

  • origin/stream is processed by Whisper. We don’t see any parameters for the origin app in transcriber-config.json, so it’s processed by the default engine.
  • live/stream is processed by KWIKmotion for English subtitles.
  • live-es/original is processed by Speechmatics for Spanish subtitles.
  • live/abr is processed by Whisper. There are no parameters either for the live app as a whole or for the live/abr stream, so the default engine from nimble.conf is applied.

What about potential conflicts when different engines’ settings cover the same app/stream? E.g. here’s the config:

{"kwikmotion_params": [{"app":"live", "stream":"stream", "lang":"ar"}], "speechmatics_params": [{"app":"live", "stream":"stream", "lang":"en"}]}

There are two engines that can be applied to this stream. In this case, the server will process the stream according to Nimble’s default internal logic. This internal priority logic may change in future versions, so avoid ambiguous configurations.
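One way to avoid such ambiguity is to keep the engines’ scopes disjoint, e.g. by pointing them at different streams. Here’s a hypothetical fix; actual stream names depend on your setup:

{
    "kwikmotion_params": [
        {"app":"live", "stream":"stream-ar", "lang":"ar"}
    ],
    "speechmatics_params": [
        {"app":"live", "stream":"stream-en", "lang":"en"}
    ]
}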

WebVTT for MPEG-DASH

Besides WebVTT generation for HLS, Nimble Streamer is able to provide these subtitles for MPEG-DASH.

Add the following parameter into the nimble.conf file and restart the Nimble instance:

dash_webvtt_subtitles_enabled = true

This will enable WebVTT for DASH. Notice that it’s an experimental feature. If you face any issues, please contact us.
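For reference, WebVTT subtitles in a DASH manifest typically appear as a separate text AdaptationSet, roughly like the hypothetical fragment below; the actual MPD produced by Nimble will differ:

<AdaptationSet contentType="text" mimeType="text/vtt" lang="en">
    <Representation id="subs_en" bandwidth="256">
        <BaseURL>subs_en.vtt</BaseURL>
    </Representation>
</AdaptationSet>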

Nimble Streamer provides flexible automatic transcription and translation capabilities using all supported speech recognition engines. If you have questions about ASR configuration or need help designing your setup, feel free to contact our engineering team.