What is Whisper AI by OpenAI?

Whisper AI is an automatic speech recognition system, but what can it do?

Updated: Jun 27, 2023 10:08 am



Table of Contents

Whisper AI training

Technical architecture

Model variations

Potential applications

Testing Whisper AI

Is Whisper AI free?

OpenAI released Whisper API

Is OpenAI Whisper free?

What is Whisper AI?

OpenAI, the research company known for AI models such as ChatGPT and DALL-E 2, released a speech recognition model called Whisper in September 2022.

Whisper was largely overshadowed by the hype around ChatGPT and DALL-E 2.

Whisper is an automatic speech recognition system that can transcribe and translate audio files in approximately 100 different languages from around the world.

The largest version of the model has roughly 1.55 billion parameters, and Whisper was trained on 680,000 hours of audio collected from the web. It shows robust zero-shot performance across a broad range of automatic speech recognition tasks, meaning it performs well on datasets it was never specifically tuned for.


Whisper AI training

One feature that sets Whisper apart from other state-of-the-art Automatic Speech Recognition (ASR) models is that it is not fine-tuned on a specific benchmark dataset. Instead, it is trained with “weak” supervision on a large and noisy dataset of speech audio collected from the internet, paired with transcription text.

According to OpenAI, the developers of Whisper, training on such a large and varied dataset has produced a model that generalizes well and delivers impressive zero-shot performance.

The field of Artificial Intelligence is making significant strides in speech-processing tasks, such as multilingual speech recognition, voice activity detection, spoken language identification, and speech translation. This technology is rapidly advancing and being applied to a broad range of use cases.

Technical architecture

Whisper uses an encoder-decoder Transformer architecture. Input audio is divided into 30-second segments, each segment is converted into a log-Mel spectrogram, and the result is fed into the encoder.
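The segmentation step can be sketched in a few lines of plain Python. This is an illustration only, not OpenAI's code: the 16 kHz sample rate is the one used by the open-source reference implementation, the function name `split_into_chunks` is made up for this example, and the log-Mel conversion itself is left out.

```python
SAMPLE_RATE = 16_000   # samples per second (rate used by the reference implementation)
CHUNK_SECONDS = 30     # segment length described in the article
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples):
    """Split a mono waveform (a list of floats) into 30-second chunks.

    The final chunk is zero-padded so that every chunk has the same length.
    This helper is hypothetical and only illustrates the idea.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        padding = CHUNK_SAMPLES - len(chunk)
        chunks.append(chunk + [0.0] * padding)
    return chunks


# 70 seconds of silence stands in for real audio here.
waveform = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(waveform)
print(len(chunks), len(chunks[-1]))  # 3 480000
```

A 70-second clip therefore becomes three segments, with the last one mostly padding.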

A decoder is then trained to predict the text caption that corresponds to the input audio. Special tokens mixed into the decoder's output direct the single model to perform specific tasks, such as language identification, multilingual speech transcription, phrase-level timestamps, and speech translation into English.
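To make the idea of task tokens more concrete, here is a rough sketch of how such a token sequence could be assembled. The token strings follow the ones used in the open-source Whisper tokenizer, but the helper `build_decoder_prompt` is hypothetical, and the real model works with integer token IDs rather than strings.

```python
def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Return the special tokens that tell the decoder what to do.

    Hypothetical helper for illustration; not part of the Whisper package.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")

    tokens = [
        "<|startoftranscript|>",  # marks the beginning of the output
        f"<|{language}|>",        # language of the speech in the audio
        f"<|{task}|>",            # transcribe as-is, or translate to English
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # skip timestamp prediction
    return tokens


# French audio that should come out as English text, without timestamps.
print(build_decoder_prompt(language="fr", task="translate"))
```

Changing only these leading tokens is what lets one model switch between transcription and translation.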

Whisper has the potential to significantly improve speech recognition and language translation in various applications, from virtual assistants to language learning tools. With its ability to recognize a wide range of accents and handle technical jargon, Whisper is a promising step toward making speech recognition more accessible and accurate for everyone.

Model variations

Whisper’s edge over other speech recognition systems lies in its training on multilingual and multitask data, making it a versatile performer with high accuracy.

The model comes in five sizes (tiny, base, small, medium, and large), four of which also have English-only versions. Depending on the desired application, each size of Whisper offers a different tradeoff between speed and accuracy.

Generally, the English-only .en models tend to perform better than their multilingual counterparts in English-only applications, especially the tiny.en and base.en models.

The difference becomes less significant for the small.en and medium.en models. The overall performance of Whisper also varies significantly depending on the language being used.
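As a rough illustration of how the naming works, the sketch below maps a size and an English-only flag to a model name. The helper names `pick_model_name` and `load` are invented for this example; `whisper.load_model` is the loading function from the open-source `openai-whisper` package, which must be installed separately.

```python
SIZES = ("tiny", "base", "small", "medium", "large")


def pick_model_name(size, english_only=False):
    """Return a Whisper model name such as 'base' or 'base.en'.

    Hypothetical helper; the large model has no English-only version.
    """
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if english_only:
        if size == "large":
            raise ValueError("the large model has no English-only version")
        return f"{size}.en"
    return size


def load(size, english_only=False):
    """Load the chosen model (downloads weights on first use)."""
    import whisper  # imported lazily: requires `pip install openai-whisper`
    return whisper.load_model(pick_model_name(size, english_only))


print(pick_model_name("tiny", english_only=True))  # tiny.en
print(pick_model_name("large"))                    # large
```

Smaller models run faster and need less memory, so a reasonable starting point is to try a small one first and only move up if the accuracy is not good enough.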


Potential applications

Due to its adaptability and precision, Whisper is an exceptional resource for producing transcriptions of interviews and podcasts. It can even translate podcasts recorded in languages other than English into English text, all running on your own device.

This combination of transcription and translation has the potential to revolutionize the transcription sector.

Testing Whisper AI

We put Whisper to the test by feeding it multiple samples, including a song by Selena Gomez, using the demonstration Python program available on GitHub. Whisper did an excellent job of transcribing the mp4 file into text, surpassing the performance of some AI-powered audio transcription services we’ve tried in the past. The turnaround is shown in the snapshot below.
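For readers who want to try something similar, here is a minimal sketch along the lines of the usage shown in the open-source package's README. It is not the exact program used for the test above. The helpers `format_timestamp`, `format_segments`, and `transcribe_file` are made up for this example, and the sample result is hand-written to show the shape of the output rather than real Whisper output.

```python
def format_timestamp(seconds):
    """Turn a number of seconds into MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result):
    """Render the 'segments' of a transcription result as timestamped lines."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_file(path, model_name="base"):
    """Transcribe an audio or video file with the open-source package."""
    import whisper  # imported lazily: requires `pip install openai-whisper` and ffmpeg
    model = whisper.load_model(model_name)
    return model.transcribe(path)


# Hand-written stand-in for a real result, so the formatting can be tried offline.
sample = {
    "text": " Hello there. Thanks for listening.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 65.0, "text": " Thanks for listening."},
    ],
}
print(format_segments(sample))
```

With the package installed, `format_segments(transcribe_file("song.mp4"))` would print the real transcript in the same layout; the file name here is just a placeholder.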

Is Whisper AI free?

Yes! Whisper AI is an open-source model that is freely available to the public. This means that anyone can make use of the impressive software.

There are a couple of ways of accessing the model. You can download Whisper AI from GitHub, but be aware that installation of the software is a slightly complicated affair and does require some technical skills and resources.

If that sounds like a bit too much effort, you can try out the software with Hugging Face’s free online version.

OpenAI released Whisper API

OpenAI recently announced that the Whisper model is now available through an API priced at $0.006 per minute, allowing developers to incorporate this advanced speech-to-text model into their apps and services.
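To get a feel for what that price means in practice, the sketch below estimates the cost of transcribing a recording. It assumes the charge scales linearly with audio length, which is a simplification; the article does not describe how OpenAI rounds billing. The function name `estimate_cost` is made up for this example.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.006")  # US dollars, as quoted in the article


def estimate_cost(duration_seconds):
    """Estimate the API cost in dollars for a recording of the given length.

    Assumes simple linear pro-rating; actual billing may round differently.
    """
    minutes = Decimal(duration_seconds) / Decimal(60)
    return (minutes * PRICE_PER_MINUTE).quantize(Decimal("0.0001"))


print(estimate_cost(60 * 60))  # one-hour podcast: 0.3600
print(estimate_cost(90))       # 90-second voice memo: 0.0090
```

Under that assumption, an hour of audio comes to about 36 cents.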

Is OpenAI Whisper free?

Whisper AI is a free and open-source model. However, the OpenAI API service is priced at $0.006 per minute.