Fact: video should soon represent up to 90% of all consumer internet traffic. That is a lot of information, often referred to as “dark data”, that cannot be searched as easily as a row in a database.
Last year, I built Dark Vision, a sample application to discover dark data in videos. Dark Vision uses IBM Bluemix OpenWhisk and Watson Visual Recognition.
The application extracts individual frames from a video. Watson Visual Recognition analyzes each frame and returns tags that describe the image. These tags are used to improve the viewing experience and to provide better search results and recommendations. This experiment generated a lot of interest and triggered many discussions with developers and clients.
But processing the individual frames is only the beginning.
What about the audio track? If you want to know more about the topic being discussed in a video, the images may not be enough. Take a TED talk as an example: there is not much to see, but a lot to learn if you listen carefully and follow up with some research. As we did for the video frames, what if we could automatically detect the topics and concepts by listening to the audio track?
How about enhancing Dark Vision with a new sense? Hearing! Using Speech to Text.
Sounds like a nice improvement! To get insights from the audio, we need:
to extract the audio track from the video,
to convert the audio into written text,
to analyze the transcript.
Extract the audio with ffmpeg
The first step is to get the audio track out of the video. Here ffmpeg comes to the rescue. With this framework it becomes trivial to extract the audio track from the video into its own file. Dark Vision already uses ffmpeg to extract the video frames, so the existing extractor action, packaged as a Docker image, can be updated to also extract the audio.
After a bit of trial and error, I ended up with this command: ffmpeg -i video.mp4 -qscale:a 3 -acodec vorbis -map a -strict -2 -ac 2 audio.ogg. How fancy! It tells ffmpeg:
to keep only the audio track (-map a),
to encode it with the experimental Ogg Vorbis codec (-acodec vorbis -strict -2),
to apply a bit of compression by setting the audio quality level (-qscale:a 3),
and to make sure the output has two audio channels (-ac 2), which the experimental Vorbis encoder requires.
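As a rough sketch of how an extractor could wrap this command, here is the same invocation assembled in Python. The function names and the Python wrapper are made up for illustration, and Dark Vision's actual extractor may be organized quite differently. Only the ffmpeg flags come from the command above.

```python
import subprocess


def build_extract_audio_command(video_path, audio_path):
    """Return the ffmpeg argument list that extracts the audio track.

    Hypothetical helper: the name and signature are illustrative,
    the flags mirror the command discussed in the text.
    """
    return [
        "ffmpeg",
        "-i", video_path,     # input video file
        "-qscale:a", "3",     # audio quality level (a bit of compression)
        "-acodec", "vorbis",  # ffmpeg's experimental Vorbis encoder
        "-map", "a",          # keep only the audio stream(s)
        "-strict", "-2",      # allow experimental codecs
        "-ac", "2",           # force two audio channels
        audio_path,           # output Ogg file
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_extract_audio_command(video_path, audio_path), check=True)
```

Calling extract_audio("video.mp4", "audio.ogg") would then write the Ogg file, provided ffmpeg is installed and on the PATH. Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.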
Step 1 done! We’ve got the audio.
Transcribe the audio with Watson
The Watson Speech to Text service, available in IBM Bluemix, transcribes speech from various languages and audio formats (one format is Ogg Vorbis!) to text with low latency. For most languages, the service supports two sampling rates, broadband and narrowband.
The Watson Speech to Text API has three interfaces to transcribe audio: a WebSocket interface, an HTTP REST interface, and an asynchronous HTTP interface. They have options to stream the audio or to send it as a single request.
Dark Vision is a serverless app. Looking at how to use the Speech to Text API as part of an OpenWhisk action, I hit a showstopper: transcribing an audio file may take more than 5 minutes, and OpenWhisk actions have a time limit of 5 minutes. Given this challenge, I first took a shortcut and only processed the first 3 minutes of any video. That shortcut has two problems. First, I’m leaving precious insights on the table, as I’m not processing the full video but only its beginning. Second, and this is the major flaw, my OpenWhisk action is just waiting for Speech to Text to complete its work. As duration is one of the metrics you are charged on in a serverless platform, it does not make sense to just wait for another service to do heavy processing. There has to be a better way. Fortunately there is: the asynchronous HTTP interface of Watson Speech to Text.
The asynchronous HTTP interface provides methods for transcribing audio via non-blocking calls to the service. You submit the audio and the service will call you back when it is done processing the results. This is just perfect for our serverless environment and another good use case for serverless platforms. The OpenWhisk action can send the audio file to Watson Speech to Text and once Watson is done, another action can act as the callback to process the results.
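To make the callback half of this flow more concrete, here is a minimal Python sketch of what the second action might do with the notification it receives. The payload shape is my assumption about what the asynchronous interface sends when a job completes with results: an id field, an event field, and a results list whose entries nest results, alternatives and transcript. Check it against the service documentation before relying on it. The function names are hypothetical and this is not Dark Vision's actual code.

```python
def transcript_from_recognition(recognition):
    """Join the top alternative of every recognized chunk into one string."""
    pieces = []
    for result in recognition.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # The first alternative is taken as the best guess.
            pieces.append(alternatives[0].get("transcript", "").strip())
    return " ".join(piece for piece in pieces if piece)


def handle_callback(notification):
    """Return (job_id, transcript) for a completed job, or None otherwise.

    Assumption: completed jobs are announced with the event name
    "recognitions.completed_with_results" and carry their results inline.
    """
    if notification.get("event") != "recognitions.completed_with_results":
        return None  # started, failed, or some other notification
    transcript = " ".join(
        transcript_from_recognition(recognition)
        for recognition in notification.get("results", [])
    )
    return notification.get("id"), transcript.strip()
```

The point of the split is that no action sits idle while Watson works: the first action returns as soon as the audio is submitted, and this handler only runs, briefly, once the service calls back.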
That takes care of Step 2. On to processing the transcript!
Analyze the transcript with Natural Language Understanding
Natural Language Understanding is a collection of natural language processing APIs that help you understand sentiment, keywords, entities, high-level concepts and more. We simply need to send the transcript to Natural Language Understanding to get entities, concepts, and sentiment in return. Entities and concepts may have links to additional online information attached to them, which means even more insights to leverage in your apps.
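As an illustration, the sketch below builds the kind of request body such an analysis call would take and pulls the attached links out of a response. The field names (features, entities, concepts, sentiment, disambiguation, dbpedia_resource) reflect my understanding of the Natural Language Understanding API and should be treated as assumptions. The helper names are hypothetical, and no network call is made here.

```python
def build_analyze_request(transcript):
    """Build a JSON body asking for entities, concepts, and sentiment."""
    return {
        "text": transcript,
        # An empty options object requests a feature with default settings.
        "features": {"entities": {}, "concepts": {}, "sentiment": {}},
    }


def collect_links(analysis):
    """Map each entity or concept name to its reference link, if it has one."""
    links = {}
    for entity in analysis.get("entities", []):
        # Assumption: entity links live under a "disambiguation" object.
        resource = entity.get("disambiguation", {}).get("dbpedia_resource")
        if resource:
            links[entity["text"]] = resource
    for concept in analysis.get("concepts", []):
        resource = concept.get("dbpedia_resource")
        if resource:
            links[concept["text"]] = resource
    return links
```

The dictionary returned by collect_links could then be stored next to the video's tags, so that a viewer can jump from a topic mentioned in the talk to background reading.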
And if we were to pass the transcript to Watson Personality Insights or Tone Analyzer, we would gather even more insights on the video. Basically any text-based API can contribute to our search for dark data hidden in videos.
Join us at InterConnect 2017
I will be at InterConnect, together with my colleague Ram Vennam, to go into deeper details on Dark Vision and how it uses OpenWhisk and Watson. Our session runs twice on Wednesday, March 22nd. Join us to chat serverless, Watson and Bluemix.
As of this writing, development is happening in the audio branch of the IBM-Bluemix/openwhisk-darkvisionapp project on GitHub. Chances are that by the time InterConnect comes around, everything will have been merged back into the master branch.
If you are living on the edge, detailed instructions on how to deploy this app with Bluemix are available in the project on GitHub. Make sure to switch to the audio branch – and if you don’t see it, it means the merge happened already. 🙂