Live Encoding captions generation with Azure Speech Services

This Live Encoding tutorial shows how to generate captions automatically using Bitmovin’s integration of Azure Speech Services.


Azure Speech Services provides a high-accuracy speech-to-text transcription service. The Bitmovin Live Encoding service integrates Azure Speech Services to create auto-generated captions in real time with a simple setup. A full Java code example for starting a Live Encoding with an HLS manifest is linked.


  1. Encoder Version: v2.200.0 or higher
  2. Bitmovin Encoding API SDK Version: v1.195.0 or higher
  3. Azure Speech Services account
    1. Subscription key and region
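
The version requirements above can be checked before starting an encoding. The following is a minimal sketch, assuming versions are written as dot-separated numbers with an optional leading "v"; the class and method names are hypothetical and not part of the Bitmovin SDK. It compares the parts numerically, because a plain string comparison would wrongly rank v2.99.0 above v2.200.0.

```java
public class VersionCheckSketch {

    // Parses "v2.200.0" or "2.200.0" into {2, 200, 0}.
    static int[] parseVersion(String version) {
        String trimmed = version.startsWith("v") ? version.substring(1) : version;
        String[] parts = trimmed.split("\\.");
        int[] numbers = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    // Returns true when `actual` is the same as or newer than `minimum`.
    // Missing trailing parts are treated as 0, so "v2.200" equals "v2.200.0".
    static boolean isAtLeast(String actual, String minimum) {
        int[] a = parseVersion(actual);
        int[] m = parseVersion(minimum);
        int length = Math.max(a.length, m.length);
        for (int i = 0; i < length; i++) {
            int left = i < a.length ? a[i] : 0;
            int right = i < m.length ? m[i] : 0;
            if (left != right) {
                return left > right;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAtLeast("v2.200.0", "v2.200.0")); // true
        System.out.println(isAtLeast("v2.99.0", "v2.200.0"));  // false
    }
}
```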

How to Set Up


Currently this feature is only available through our API.

  1. Start a Bitmovin Live encoder using the Java API code example from SRT Live Encoding HLS with Azure Speech To Captions Filter
    1. You can configure the Live Input in any of the supported formats detailed in Live Inputs
  2. After adding the properties needed to create an encoding, as detailed in other tutorials, add the streams and codec configurations
    1. Add Video and Audio Codec Configurations as usual
    2. Follow the instructions in the section "Add a subtitle stream with AzureSpeechToCaptionsFilter" below
    3. Start the Live Encoding and start streaming into it once it is ready

Add a subtitle stream with AzureSpeechToCaptionsFilter

As detailed in the example SRT Live Encoding HLS with Azure Speech To Captions Filter, we first need to configure a subtitle stream for our encoding in order to use Azure Speech Services to create auto-generated captions.

Create WebVTT subtitle configuration

In this example we create a WebVTT subtitle configuration for our encoding to store the captions produced by the live encoder. We use the default values: cueIdentifierPolicy is set to INCLUDE_IDENTIFIERS, and appendOptionalZeroHour is set to true so that the hour field is written as zeroes when the hour is 0. For more details check the WebVTT configuration documentation.

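private static SubtitleConfiguration createWebVttConfig(String name) {
    WebVttConfiguration webVttConfiguration = new WebVttConfiguration();
    // Setter names are assumed from the property names described above.
    webVttConfiguration.setName(name);
    webVttConfiguration.setCueIdentifierPolicy(WebVttCueIdentifierPolicy.INCLUDE_IDENTIFIERS);
    webVttConfiguration.setAppendOptionalZeroHour(true);
    return bitmovinApi.encoding.configurations.subtitles.webvtt.create(webVttConfiguration);
}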
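
To make the effect of the two options concrete, here is a minimal sketch of how a WebVTT cue looks with and without the cue identifier and the optional zero hour. It only illustrates the WebVTT text format, in which the hour field may be omitted while it is zero; the class and method names are hypothetical, and this is not the code the encoder runs.

```java
import java.util.Locale;

public class WebVttCueSketch {

    // Formats a cue time in milliseconds as a WebVTT timestamp.
    // With appendOptionalZeroHour = true the hour field is always written;
    // otherwise it is left out while the hour value is zero.
    static String formatTimestamp(long millis, boolean appendOptionalZeroHour) {
        long hours = millis / 3_600_000;
        long minutes = (millis / 60_000) % 60;
        long seconds = (millis / 1_000) % 60;
        long fraction = millis % 1_000;
        if (hours == 0 && !appendOptionalZeroHour) {
            return String.format(Locale.ROOT, "%02d:%02d.%03d", minutes, seconds, fraction);
        }
        return String.format(Locale.ROOT, "%02d:%02d:%02d.%03d", hours, minutes, seconds, fraction);
    }

    // Builds one cue block; the identifier line is written only when requested.
    static String formatCue(int id, long startMs, long endMs, String text,
                            boolean includeIdentifier, boolean appendOptionalZeroHour) {
        StringBuilder cue = new StringBuilder();
        if (includeIdentifier) {
            cue.append(id).append('\n');
        }
        cue.append(formatTimestamp(startMs, appendOptionalZeroHour))
           .append(" --> ")
           .append(formatTimestamp(endMs, appendOptionalZeroHour))
           .append('\n')
           .append(text)
           .append('\n');
        return cue.toString();
    }

    public static void main(String[] args) {
        // Identifier included, zero hour written: "00:00:04.100 --> 00:00:05.100"
        System.out.print(formatCue(1, 4_100, 5_100, "Hello and welcome.", true, true));
    }
}
```

With both options switched off, the same cue would start directly with the timing line and read 00:04.100 --> 00:05.100.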

Add AzureSpeechToCaptionsFilter to a subtitle stream

To create the AzureSpeechToCaptionsFilter we first need to gather the credentials from Azure Speech Services. The subscription key and region can be found in the Azure portal.

With this information we can configure the filter as follows:

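private static AzureSpeechToCaptionsFilter createAzureSpeechToCaptionsFilter() {
    // Credentials: populated with the subscription key from the configProvider
    // (setter calls as in the linked full example).
    AzureSpeechServicesCredentials azureSpeechServicesCredentials = new AzureSpeechServicesCredentials();

    // Settings: carry the credentials, region, language, caption timing and profanity option.
    AzureSpeechToCaptionsSettings azureSpeechToCaptionsSettings = new AzureSpeechToCaptionsSettings();

    // Filter: wraps the settings and is created through the API.
    AzureSpeechToCaptionsFilter azureSpeechToCaptionsFilter = new AzureSpeechToCaptionsFilter();

    return bitmovinApi.encoding.filters.azureSpeechToCaptions.create(azureSpeechToCaptionsFilter);
}
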
  • The configProvider supplies the subscriptionKey and region for the filter.
  • The language is set to en-US (an IETF BCP 47 language tag), as documented in the list of supported languages in Azure's official documentation.
  • The captionDelay is set to 100 milliseconds to delay the display of each caption, to mimic a real-time experience.
  • The captionRemainTime is set so that each caption remains on screen for 1 second.
  • The profanityOption is configured to replace the letters in profane words with asterisk (*) characters.
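
Before passing these values to the filter, it can help to validate them locally. The following is a minimal sketch that uses only the JDK: CaptionSettings is a hypothetical holder class, not a Bitmovin SDK type, and the rule that both durations must be non-negative is our assumption. The language check only confirms that the tag is well-formed BCP 47; it does not confirm that Azure supports the language.

```java
import java.time.Duration;
import java.util.Locale;

public class CaptionSettingsSketch {

    // Hypothetical value holder mirroring the settings listed above.
    static final class CaptionSettings {
        final String language;
        final Duration captionDelay;
        final Duration captionRemainTime;

        CaptionSettings(String language, Duration captionDelay, Duration captionRemainTime) {
            this.language = normalizeLanguageTag(language);
            this.captionDelay = requireNonNegative(captionDelay, "captionDelay");
            this.captionRemainTime = requireNonNegative(captionRemainTime, "captionRemainTime");
        }
    }

    // Parses the tag with the JDK's BCP 47 parser and returns its canonical form.
    // Locale.Builder throws IllformedLocaleException for tags that are not well formed.
    static String normalizeLanguageTag(String tag) {
        return new Locale.Builder().setLanguageTag(tag).build().toLanguageTag();
    }

    // Rejects negative durations (assumed rule, see the lead-in).
    static Duration requireNonNegative(Duration value, String name) {
        if (value.isNegative()) {
            throw new IllegalArgumentException(name + " must not be negative");
        }
        return value;
    }

    public static void main(String[] args) {
        CaptionSettings settings =
                new CaptionSettings("en-US", Duration.ofMillis(100), Duration.ofSeconds(1));
        // Prints: en-US delay=100ms remain=1000ms
        System.out.println(settings.language
                + " delay=" + settings.captionDelay.toMillis() + "ms"
                + " remain=" + settings.captionRemainTime.toMillis() + "ms");
    }
}
```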

Now the filter can be added to the created subtitle stream:

Stream subtitleStream = createStream(encoding, input, webVttConfig);
addFiltersToStream(encoding, subtitleStream, getStreamFilterList(Collections.singletonList(azureSpeechToCaptionsFilter)));

Create Chunked Text Muxing

To package the WebVTT subtitles correctly, we create a Chunked Text muxing, which produces segmented WebVTT output that the manifest playlist loads the segments from.

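private static ChunkedTextMuxing createChunkedTextMuxing(
        Encoding encoding,
        Output output,
        String outputPath,
        Stream stream,
        Double segmentLength,
        Integer startOffset) {
    // Reference the subtitle stream that feeds this muxing.
    MuxingStream muxingStream = new MuxingStream();
    muxingStream.setStreamId(stream.getId());

    // Setters are assumed from the method parameters and the segment naming described below.
    ChunkedTextMuxing chunkedTextMuxing = new ChunkedTextMuxing();
    chunkedTextMuxing.addStreamsItem(muxingStream);
    chunkedTextMuxing.setSegmentLength(segmentLength);
    chunkedTextMuxing.setSegmentNaming("webvtt_segment_%number%.vtt");
    chunkedTextMuxing.setStartOffset(startOffset);
    chunkedTextMuxing.addOutputsItem(buildEncodingOutput(output, outputPath));
    return bitmovinApi.encoding.encodings.muxings.chunkedText.create(encoding.getId(), chunkedTextMuxing);
}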

In this example we set a segment length of 4.0 seconds, which matches that of the video and audio. The segments are named webvtt_segment_%number%.vtt, which translates to webvtt_segment_0.vtt, webvtt_segment_1.vtt, webvtt_segment_2.vtt, and so on.

createChunkedTextMuxing(encoding, output, "/subtitles", subtitleStream, 4.0, 10);
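
The segment naming pattern can be illustrated with a small sketch that expands the %number% placeholder locally. The class and method names are hypothetical, and the sketch assumes numbering starts at 0, as in the file names listed above; the encoder performs the real substitution.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentNamingSketch {

    // Replaces the %number% placeholder with the segment index.
    static String segmentName(String template, int index) {
        return template.replace("%number%", Integer.toString(index));
    }

    // Lists the names of the first `count` segments, starting at index 0.
    static List<String> firstSegmentNames(String template, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(segmentName(template, i));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints: [webvtt_segment_0.vtt, webvtt_segment_1.vtt, webvtt_segment_2.vtt]
        System.out.println(firstSegmentNames("webvtt_segment_%number%.vtt", 3));
    }
}
```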


On this page we learned how to configure a Live Encoding using SRT as an input, add a subtitle stream with the Azure Speech Services filter, and create auto-generated captions with an HLS manifest output using our default HLS manifest creation.

This combination makes your live streams more accessible and engaging for all viewers by pairing Azure's speech-to-text capabilities with Bitmovin's live encoding. You can now provide real-time captions, making your content more inclusive and professional.