Using Watson Speech to Text and OpenWhisk to discover even more dark data in videos


Fact: Video should soon represent up to 90% of all consumer internet traffic. It is a lot of information, often referred to as “dark data”, that is not simply searchable like a row in a database.

Last year, I built Dark Vision, a sample application to discover dark data in videos. Dark Vision uses IBM Bluemix OpenWhisk and Watson Visual Recognition.

The application extracts individual frames from a video. Watson Visual Recognition analyzes each frame and returns tags to qualify the image. These tags are used to improve the viewing experience, to provide better search results and recommendations. This experiment generated a lot of interest and triggered many discussions with developers and clients.

But processing the individual frames is only the beginning.

What about the audio track? If you want to know more about the topic being discussed in a video, the images may not be enough. Take a TED talk as an example: there is not much to see, but a lot to learn if you listen carefully and follow up with some research. As with the video frames, what if we could automatically detect the topics and concepts by listening to the audio track?

How about enhancing Dark Vision with a new sense? Hearing! Using Speech to Text.

Sounds like a nice improvement! To get insights from the audio, we need:

  • to extract the audio track from the video,
  • to convert the audio into written text,
  • to analyze the transcript.

[Figure: Speech to Text flow]

Extract the audio with ffmpeg

The first step is to get the audio track out of the video. Here ffmpeg comes to the rescue. With this tool, extracting the audio track from the video into its own file becomes trivial. Dark Vision already uses ffmpeg to extract the video frames, so the existing extractor action, packaged as a Docker image, can be updated to also extract the audio.

After a bit of trial and error, I ended up with this command: ffmpeg -i video.mp4 -qscale:a 3 -acodec vorbis -map a -strict -2 -ac 2 audio.ogg. How fancy! It tells ffmpeg:

  • to keep only the audio track from the input -map a
  • using the experimental Ogg Vorbis codec -acodec vorbis -strict -2
  • applying a bit of compression -qscale:a 3
  • and making sure the output has two audio channels -ac 2 (the Ogg Vorbis codec requires this).
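
As a rough sketch, the command above could be wrapped in a small Python helper like the one below. This is only an illustration, not how the Dark Vision extractor is actually written; the function names build_extract_audio_command and extract_audio are made up for this example, and the flags are exactly the ones listed above.

```python
import subprocess


def build_extract_audio_command(video_path, audio_path):
    """Return the ffmpeg argument list that extracts the audio as Ogg Vorbis.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, the flags are the ones quoted in the post.
    """
    return [
        "ffmpeg",
        "-i", video_path,      # input video
        "-qscale:a", "3",      # audio quality setting (a bit of compression)
        "-acodec", "vorbis",   # ffmpeg's built-in Vorbis encoder
        "-map", "a",           # keep only the audio streams
        "-strict", "-2",       # allow experimental encoders
        "-ac", "2",            # force two audio channels
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_audio_command(video_path, audio_path), check=True)
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems when a video file name contains spaces.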

Step 1 done! We’ve got the audio.

Transcribe the audio with Watson

Watson Speech to Text service, available in IBM Bluemix, transcribes speech from various languages and audio formats (one format is Ogg Vorbis!) to text with low latency. For most languages, the service supports two sampling rates, broadband and narrowband.

The Watson Speech to Text API has three interfaces to transcribe audio: a WebSocket interface, an HTTP REST interface, and an asynchronous HTTP interface. They have options to stream the audio or to send it as a single request.

Dark Vision is a serverless app. Looking at how to use the Speech to Text API as part of an OpenWhisk action, I hit a showstopper: transcribing an audio file may take more than 5 minutes, and OpenWhisk actions have a time limit of 5 minutes. Given this challenge, I first took a shortcut and only processed the first 3 minutes of any video. This approach has two flaws. First, it leaves precious insights on the table, since only the beginning of the video is processed. Second, the OpenWhisk action just sits there waiting for Speech to Text to complete its work. Since duration is one of the metrics you are charged on in a serverless platform, it does not make sense to pay for an action that only waits for another service to do the heavy processing. There has to be a better way. Fortunately there is: the asynchronous HTTP interface of Watson Speech to Text.

The asynchronous HTTP interface provides methods for transcribing audio via non-blocking calls to the service. You submit the audio and the service calls you back with the results once processing is done. This is just perfect for our serverless environment and another good use case for serverless platforms. The OpenWhisk action can send the audio file to Watson Speech to Text and, once Watson is done, another action can act as the callback to process the results.
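
To make the non-blocking call more concrete, here is a minimal sketch of how the submitting action might build its request. It assumes the asynchronous interface's /v1/recognitions endpoint with the callback_url, events and user_token query parameters, as I understand them from the service documentation, and it assumes the callback URL has already been registered with the service. The function name, the base URL and the Authorization value are placeholders for illustration, not Dark Vision code.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_recognition_request(base_url, audio_bytes, callback_url, user_token, auth_header):
    """Build (but do not send) the POST request that creates a recognition job.

    Hypothetical helper: base_url, user_token and auth_header are supplied by
    the caller; nothing here is taken from the Dark Vision code base.
    """
    query = urlencode({
        "callback_url": callback_url,                     # where the service calls back
        "events": "recognitions.completed_with_results",  # include results in the callback
        "user_token": user_token,                         # echoed back, e.g. a video id
    })
    return Request(
        url=f"{base_url}/v1/recognitions?{query}",
        data=audio_bytes,                                 # the Ogg Vorbis file content
        headers={
            "Content-Type": "audio/ogg;codecs=vorbis",
            "Authorization": auth_header,
        },
        method="POST",
    )
```

The action would send this request with urllib.request.urlopen and return right away. The user_token is what lets the callback action match the notification to the right video later on.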

[Figure: Speech to Text architecture diagram]

Here goes for Step 2. Off to processing this transcript now!
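
Before the transcript can be analyzed, the callback action has to pull it out of the notification it receives. The sketch below assumes the notification carries a results array whose entries each hold their own results list with alternatives, which is how I read the Speech to Text response format; treat that payload shape, and the function name extract_transcript, as assumptions made for this example.

```python
def extract_transcript(notification):
    """Join the best alternative of every final result into one string.

    The nested "results" -> "results" -> "alternatives" layout is an
    assumption about the callback payload, not something stated in the post.
    """
    pieces = []
    for job_result in notification.get("results", []):
        for result in job_result.get("results", []):
            alternatives = result.get("alternatives", [])
            # Skip interim results and entries without any alternative.
            if result.get("final", True) and alternatives:
                pieces.append(alternatives[0]["transcript"].strip())
    return " ".join(pieces)
```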

Analyze the transcript with Natural Language Understanding

Natural Language Understanding is a collection of natural language processing APIs that help you understand sentiment, keywords, entities, high-level concepts and more. We simply need to send the transcript to Natural Language Understanding to get entities, concepts and sentiment in return. Entities and concepts may have links to additional online information attached to them, which gives you even more insights to leverage in your apps.
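
Sending the transcript could look roughly like the sketch below. It assumes the service's /v1/analyze endpoint with a version query parameter and a JSON body listing the requested features, as I understand the API; the function name, the base URL, the default version date and the Authorization value are placeholders for illustration.

```python
import json
from urllib.request import Request


def build_analyze_request(base_url, transcript, auth_header, version="2017-02-27"):
    """Build (but do not send) a request asking for entities, concepts,
    keywords and sentiment on the transcript.

    Hypothetical helper: the parameter names and the default version date
    are assumptions made for this sketch.
    """
    payload = {
        "text": transcript,
        # An empty object asks for a feature with its default options.
        "features": {"entities": {}, "concepts": {}, "keywords": {}, "sentiment": {}},
    }
    return Request(
        url=f"{base_url}/v1/analyze?version={version}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )
```

Requesting all features in a single call keeps the callback action short, which matters when duration is what you pay for.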

And if we were to pass the transcript to Watson Personality Insights or Tone Analyzer, we would gather even more insights on the video. Basically any text-based API can contribute to our search for dark data hidden in videos.

Join us at InterConnect 2017

I will be at InterConnect, together with my colleague Ram Vennam, to go into deeper details on Dark Vision and how it uses OpenWhisk and Watson. Our session runs twice on Wednesday, March 22nd. Join us to chat serverless, Watson and Bluemix.

See for yourself

As of this writing, the development is happening in the audio branch of the IBM-Bluemix/openwhisk-darkvisionapp project on GitHub. Chances are that by the time InterConnect comes around, everything will have been merged back into the master codebase.

If you are living on the edge, detailed instructions on how to deploy this app with Bluemix are available in the project on GitHub. Make sure to switch to the audio branch – and if you don’t see it, it means the merge happened already. 🙂

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter: @L2FProd.
