Speech recognition for closed captions in live streaming – Softvelum: efficient tools to build your streaming networks

Live streaming delivers not only video and audio but also closed captions and subtitles, which are transcribed text synchronized with the audio. Typically, captions are generated at the source and delivered to viewers via media servers. However, you may want to generate them on the fly using modern AI technologies. Speech-to-text is now supported in Nimble Streamer for your live streaming scenarios.

Why is closed captioning important?

An estimated 1.5 billion people worldwide have some degree of hearing loss, which makes watching videos challenging for them. Adding captions improves these individuals’ quality of life and lets them enjoy the same content experience as others. That’s why many countries require broadcasters and streaming companies to add closed captions, under laws such as the Americans with Disabilities Act (ADA), the European Accessibility Act (EAA) and others. In many cases, closed captions are therefore a core requirement for broadcasting and streaming.

Besides the humanitarian aspect, captions are simply convenient. Many people prefer to read captions, for example when watching a video in a crowded place where they cannot turn up the volume without disturbing others.

Translation for live streaming subtitles

Another challenge closely related to captions is subtitling in different languages.

There are more than 7,000 languages, and only about 20% of the world’s population speaks English. If you want your content to reach an audience beyond your home country, you need to translate it into the languages spoken where your viewers are located. This is why you need real-time translation to expand your viewership.

It’s a relatively well-understood task for VOD content, but adding it to live streaming is quite a challenge.

AI-based speech-to-text with Nimble Streamer

Nimble Streamer can help you solve these problems using two approaches:

– Use the Whisper AI automatic speech recognition (ASR) model.
– Integrate the Speechmatics STT service into your workflow, with transcription and translation.

All you need to do is follow these simple steps:

– Set up live input and output as you normally do for your streams.
– Enable AI speech-to-text processing for the designated streams with either Whisper or Speechmatics.
– Use a player that can present WebVTT subtitles on your website or in your app.
– Deliver your content via HLS as usual.

The output HLS stream will carry all the data needed to display closed captions, and your player will pick it up, so your viewers will have a great viewing experience.
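
As a quick check that an output stream advertises subtitles, the sketch below scans an HLS master playlist for #EXT-X-MEDIA entries of type SUBTITLES, which is how the HLS specification describes subtitle renditions. It is a minimal Python illustration, not Nimble Streamer code: the sample playlist, its file names and the helper subtitle_renditions are hypothetical, and the playlist your server actually produces may look different.

```python
import re

# Hypothetical master playlist written by hand for illustration only;
# it is NOT captured from a real Nimble Streamer instance.
SAMPLE_MASTER_PLAYLIST = """#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",DEFAULT=YES,AUTOSELECT=YES,URI="subtitles_en.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=2500000,CODECS="avc1.64001f,mp4a.40.2",SUBTITLES="subs"
video.m3u8
"""

# Matches NAME=value pairs where the value is either quoted or runs to the next comma.
ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def subtitle_renditions(playlist_text):
    """Return the attribute dicts of all SUBTITLES renditions in a master playlist."""
    renditions = []
    for line in playlist_text.splitlines():
        if not line.startswith("#EXT-X-MEDIA:"):
            continue
        attribute_list = line.split(":", 1)[1]
        attrs = {name: value.strip('"') for name, value in ATTR_RE.findall(attribute_list)}
        if attrs.get("TYPE") == "SUBTITLES":
            renditions.append(attrs)
    return renditions


if __name__ == "__main__":
    for rendition in subtitle_renditions(SAMPLE_MASTER_PLAYLIST):
        print(rendition.get("LANGUAGE"), rendition.get("URI"))
```

If the function returns an empty list for your real playlist, the player has no subtitle rendition to pick up from the playlist itself, which is a reasonable first thing to check when captions do not appear.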

Closed captions of a live stream from Nimble Streamer processed by Whisper, as shown in THEOPlayer

Transcription pricing

The pricing model is simple, and the affordable starter price is based on WMSPanel pricing.

To get transcription running, you need a WMSPanel account (basic price is 20 USD/month), a Nimble Streamer instance (50 USD/month), a Nimble Live Transcoder license for 50 USD and an Addenda package license for 50 USD. This makes a starter price of 170 USD.
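
The starter price is a plain sum of the four items listed above, which the short Python sketch below spells out. The dictionary keys are just labels chosen for this illustration, the amounts are the ones quoted in this article, and current prices should be checked against WMSPanel pricing.

```python
# Amounts in USD as quoted in this article; the labels are illustrative only.
STARTER_COSTS_USD = {
    "WMSPanel account (basic)": 20,
    "Nimble Streamer instance": 50,
    "Nimble Live Transcoder license": 50,
    "Addenda package license": 50,
}


def starter_total(costs):
    """Sum the listed cost items."""
    return sum(costs.values())


if __name__ == "__main__":
    for item, price in STARTER_COSTS_USD.items():
        print(f"{item}: {price} USD")
    print(f"Total: {starter_total(STARTER_COSTS_USD)} USD")  # Total: 170 USD
```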

Note that Speechmatics service pricing applies to the transcription process when you use Speechmatics; refer to Speechmatics for an exact quote. Softvelum is not affiliated with Speechmatics.

Whisper Speech-to-Text performance

Speech recognition is a heavy-duty task which requires a lot of computing resources.

At the moment, our ASR implementation with the Whisper base language model works only with NVidia accelerators. Their GPUs can handle all the processing this demanding task needs.

We’ve run some tests with the Nimble Streamer engine. The following hardware can process this many input streams into output HLS with closed captioning:
– NVidia GeForce RTX3070 can process 17 input streams.
– NVidia GeForce RTX4050 can process 10 input streams.

We’ll share more details as we run more tests on other hardware.
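
For rough capacity planning, the per-card figures above can be turned into a card count for a given number of streams. The Python sketch below does this under two assumptions that this article does not confirm: that capacity adds up linearly across several cards, and that your streams are comparable to the ones used in the tests. The function name gpus_needed is hypothetical.

```python
import math

# Streams per card as reported in the tests above.
STREAMS_PER_GPU = {
    "GeForce RTX3070": 17,
    "GeForce RTX4050": 10,
}


def gpus_needed(stream_count, gpu_model):
    """Estimate how many cards of one model are needed for stream_count streams.

    Assumes linear scaling across cards, which is an assumption, not a test result.
    """
    if stream_count < 0:
        raise ValueError("stream_count must not be negative")
    return math.ceil(stream_count / STREAMS_PER_GPU[gpu_model])


if __name__ == "__main__":
    for model in STREAMS_PER_GPU:
        print(model, gpus_needed(40, model))
```

Treat the result as a starting estimate and confirm it with your own tests on the target hardware.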

Start now

Here are the steps to set up closed captions and translation for your live streams:

– Create a WMSPanel account and subscribe.
– Install Nimble Streamer on an Ubuntu 24.04 machine with an NVidia graphics card.
– Create a Live Transcoder license and register it on your Nimble instance via the panel UI.
– Create an Addenda license and register it on your Nimble instance via the panel UI.
– Follow the setup instructions to generate WebVTT subtitles and auto-translation from Whisper.cpp.
– Or follow these instructions to integrate with Speechmatics.
– Apply these instructions to enable CEA-708 subtitles.

After the setup is done, your designated output streams will have closed captions and/or subtitles in them.

That’s it. You can now use the power of AI to improve your viewers’ experience.

Let us know if you have any feedback or issues when using our recognition features.

Nimble Streamer uses the Whisper.cpp library and model, available under the MIT license.