Speech to Text¶
Introduction¶
The Audio API provides two speech to text endpoints, transcriptions and translations, based on OpenAI's state-of-the-art open source large-v2 Whisper model. They can be used to:
- Transcribe audio into whatever language the audio is in.
- Translate and transcribe the audio into English.
File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
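The limits above can be checked locally before uploading. The following is a minimal sketch, not part of the API: `check_upload`, `MAX_UPLOAD_BYTES`, and `SUPPORTED_EXTENSIONS` are hypothetical names, and reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Limits quoted in the text above. Treating "25 MB" as 25 * 1024 * 1024 bytes
# is an assumption; the service may count the limit differently.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems
```

In practice the size could come from `os.path.getsize(path)`. The check only looks at the file extension, so it does not verify that the contents are valid audio.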
Transcriptions¶
The transcriptions API takes as input the audio file you want to transcribe and the desired output file format for the transcription of the audio. We currently support multiple input and output file formats.
By default, the response type will be json with the raw text included.
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger."
}
from openai import OpenAI
client = OpenAI()
audio_file = open("data/audio.m4a", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)
print(transcript.text)
Sanat tarihiyle ilgilenen veya ilgilenmeyen herkesin kulağına mutlaka Rönesans kelimesi çalınmış olmalı. Duyar duymaz aklımıza bilim, sanat ve yenilikler gelir. Ancak bu tanımlama aslında dini bir terim olarak ortaya çıkmıştır. Kişinin yeniden hayata dönüşüne verilen isimdir Rönesans. Rönesans günümüzde kullanılan haliyle ilk kez 1860 yılında yazarı Jacob Burckhardt'ın kendi çabalarıyla yayımladığı İtalya'da Rönesans Kültürü adlı kitabında karşımıza çıkar. Şimdi daha geriye gidelim ve sizi sıkıcı bir takım tanımlamalardan uzaklaştıralım. Nedir bu Rönesans, ortaya çıkmasının sebepleri nedir ve nelere sebep olmuştur? Rönesans sanatını anlamak için öncelikle Ortaçağ'ı bilmemiz ve anlamamız gerekiyor. Rönesans öncesinde Ortaçağ, Carol Lange, Romanesque ve Gotik gibi sanat takımları hakimdi. Ancak şimdilik bu üç dönemi detaylandırmak yerine genel olarak Ortaçağ'ın karanlığından bahsetmek daha yerinde olacaktır.
# Try to set the response format to 'text'
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="text"
)
print(transcript)
Sanat tarihiyle ilgilenen veya ilgilenmeyen herkesin kulağına mutlaka Rönesans kelimesi çalınmış olmalı. Duyar duymaz aklımıza bilim, sanat ve yenilikler gelir. Ancak bu tanımlama aslında dini bir terim olarak ortaya çıkmıştır. Kişinin yeniden hayata dönüşüne verilen isimdir Rönesans. Rönesans günümüzde kullanılan haliyle ilk kez 1860 yılında yazarı Jacob Burckhardt'ın kendi çabalarıyla yayımladığı İtalya'da Rönesans Kültürü adlı kitabında karşımıza çıkar. Şimdi daha geriye gidelim ve sizi sıkıcı bir takım tanımlamalardan uzaklaştıralım. Nedir bu Rönesans, ortaya çıkmasının sebepleri nedir ve nelere sebep olmuştur? Rönesans sanatını anlamak için öncelikle Ortaçağ'ı bilmemiz ve anlamamız gerekiyor. Rönesans öncesinde Ortaçağ, Carol Lange, Romanesque ve Gotik gibi sanat takımları hakimdi. Ancak şimdilik bu üç dönemi detaylandırmak yerine genel olarak Ortaçağ'ın karanlığından bahsetmek daha yerinde olacaktır.
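The two calls above are printed differently: the default response is read through `transcript.text`, while the `response_format="text"` result is printed directly. A small helper can hide that difference from downstream code. This is a sketch; `extract_text` is a hypothetical name, and the stand-in objects below replace real API responses so that no request is made.

```python
from types import SimpleNamespace


def extract_text(result) -> str:
    """Return the transcript text whether `result` is a plain string or an object with `.text`."""
    if isinstance(result, str):
        return result
    return result.text


# Stand-ins for the two response shapes shown above (no API call is made here).
json_style = SimpleNamespace(text="Hello, world.")
text_style = "Hello, world."
print(extract_text(json_style) == extract_text(text_style))
```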
Translations¶
The translations API takes as input an audio file in any of the supported languages and transcribes it, translating the audio into English if necessary. This differs from the transcriptions endpoint because the output is not in the original input language; it is English text instead.
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file,
    temperature=0.5
)
print(translation.text)
Everyone who is interested in art history must have heard the word Renaissance. As soon as we hear it, we think of science, art, and innovation. However, this definition has actually emerged as a religious term. Renaissance is the name given to the person's return to life. Renaissance is used today, for the first time in 1860, writer Jacob Burckhardt published it with his own efforts. In his book called Renaissance Culture in Italy, Now let's go back and remove you from boring definitions. What is this Renaissance? What are the reasons for its emergence and what has caused it? To understand the Renaissance art, we must first know and understand the Middle Ages. Before the Renaissance, the Middle Ages The art teams such as Carolingian, Romanesque and Gothic dominated. However, instead of detailing these 3 periods for now, it would be better to talk about the darkness of the Middle Ages.
Supported languages¶
We currently support the following languages through both the transcriptions and translations endpoints:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the underlying model was trained on 98 languages, we only list the languages that achieved a word error rate (WER) below 50%. WER is an industry-standard benchmark for speech-to-text model accuracy. The model will return results for languages not listed above, but the quality will be low.
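For readers unfamiliar with the metric, word error rate is commonly defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. `word_error_rate` is a hypothetical helper, and it skips the text normalization (case, punctuation) that real evaluations usually apply, so it is not necessarily how the figures above were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word and one dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.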
Longer inputs¶
By default, the Whisper API only supports files that are less than 25 MB. If you have an audio file that is larger than that, you will need to break it up into chunks of 25 MB or less or use a compressed audio format. To get the best performance, we suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost.
One way to handle this is to use the PyDub open source Python package to split the audio:
from pydub import AudioSegment
large_audio_file = "data/large_audio.mp3"
song = AudioSegment.from_file(large_audio_file)
# PyDub handles time in milliseconds
two_minutes = 2 * 60 * 1000
first_2_minutes = song[:two_minutes]
first_2_minutes.export("output/large_audio_1.mp3", format="mp3")
<_io.BufferedRandom name='output/large_audio_1.mp3'>
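The example above exports only the first two minutes. To cover a whole file, the same slicing can be repeated over consecutive time ranges. The sketch below computes those ranges in milliseconds without touching any audio. `chunk_bounds` is a hypothetical helper, and it cuts at fixed times, so boundaries may fall mid-sentence unless they are adjusted (for example, by moving each cut to a nearby pause).

```python
def chunk_bounds(total_ms: int, chunk_ms: int) -> list[tuple[int, int]]:
    """Return (start, end) millisecond pairs covering [0, total_ms) in order."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


# A 5-minute file split into 2-minute chunks; the last chunk is shorter.
two_minutes = 2 * 60 * 1000
for start, end in chunk_bounds(5 * 60 * 1000, two_minutes):
    # With PyDub, each piece would be song[start:end], exported as shown above.
    print(start, end)
```

With PyDub, `len(song)` gives the duration in milliseconds, which can be passed as `total_ms`.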
Prompting¶
You can use a prompt to improve the quality of the transcripts generated by the Whisper API. The model will try to match the style of the prompt, so it will be more likely to use capitalization and punctuation if the prompt does too. However, the current prompting system is much more limited than that of our other language models and provides only limited control over the generated transcript. Here are some examples of how prompting can help in different scenarios:
- Prompts can be very helpful for correcting specific words or acronyms that the model may misrecognize in the audio. For example, the following prompt improves the transcription of the words DALL·E and GPT-3, which were previously written as "GDP 3" and "DALI": "The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity"
- To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.
- Sometimes the model might skip punctuation in the transcript. You can avoid this by using a simple prompt that includes punctuation: "Hello, welcome to my lecture."
- The model may also leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them: "Umm, let me think like, hmm... Okay, here's what I'm, like, thinking."
- Some languages can be written in different ways, such as simplified or traditional Chinese. The model might not always use the writing style that you want for your transcript by default. You can improve this by using a prompt in your preferred writing style.
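The second scenario in the list, carrying context across split segments, can be sketched as a loop that feeds the end of each transcript into the next request. Everything named here is hypothetical: `transcribe` stands for any callable that takes a file path and a prompt and returns text (in practice it might wrap `client.audio.transcriptions.create` with the `prompt` parameter), and `tail_words` trims by word count only as a rough stand-in for the model's token limit.

```python
from typing import Callable


def tail_words(text: str, max_words: int = 150) -> str:
    """Keep only the last `max_words` words of `text` (a rough stand-in for a token limit)."""
    if max_words <= 0:
        return ""
    words = text.split()
    return " ".join(words[-max_words:])


def transcribe_in_order(
    segment_paths: list[str],
    transcribe: Callable[[str, str], str],
) -> str:
    """Transcribe segments in order, prompting each call with the tail of the previous transcript."""
    pieces = []
    prompt = ""  # the first segment has no preceding context
    for path in segment_paths:
        text = transcribe(path, prompt)
        pieces.append(text)
        prompt = tail_words(text)
    return " ".join(pieces)
```

Trimming is optional, since the model ignores everything before the final part of the prompt, but it keeps requests small. Passing `transcribe` as an argument also makes the loop easy to test with a fake function.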
Improving reliability¶
As we explored in the prompting section, one of the most common challenges when using Whisper is that the model often does not recognize uncommon words or acronyms. To address this, we highlight two techniques that improve the reliability of Whisper in these cases:
Using the prompt parameter¶
The first method involves using the optional prompt parameter to pass a dictionary of the correct spellings.
Since it wasn't trained using instruction-following techniques, Whisper operates more like a base GPT model.
While it will increase reliability, this technique is limited to 224 tokens (the model only considers the final 224 tokens of the prompt), so your list of product names (SKUs) needs to be relatively small for this to be a scalable solution.
prompt="ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T."
Post-processing with GPT-4¶
The second method involves a post-processing step.
We start by providing instructions for GPT-4 through the system_prompt variable. Similar to what we did with the prompt parameter earlier, we can define our company and product names.
If you try this on your own audio file, you can see that GPT-4 manages to correct many misspellings in the transcript. Due to its larger context window, this method might be more scalable than using Whisper's prompt parameter and is more reliable since GPT-4 can be instructed and guided in ways that aren't possible with Whisper given the lack of instruction following.
audio_file = open("data/audio_with_concepts.m4a", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en"
)
print(f"Output without prompt:\n{transcript.text}")
Output without prompt: Welcome to our company Zintrik X. Today, we will talk about our new products, DGQ+, Synapse 5, VortiCore V8, Equinix Array, and also we will talk about our existing products and their performance, which are Brick, Quartz, and Flint.
audio_file = open("data/audio_with_concepts.m4a", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en",
    prompt="ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T."
)
print(f"Output with prompt:\n{transcript.text}")
Output with prompt: Welcome to our company ZyntriQix. Today we will talk about our new products, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, and also we will talk about our existing products and their performance, which are B.R.I.C.K., Q.U.A.R.T.Z., and F.L.I.N.T.
system_prompt = "You are a helpful assistant for the company ZyntriQix. Your task is to correct any spelling discrepancies in the transcribed text. Make sure that the names of the following products are spelled correctly: ZyntriQix, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, OrbitalLink Seven, DigiFractal Matrix, PULSE, RAPT, B.R.I.C.K., Q.U.A.R.T.Z., F.L.I.N.T. Only add necessary punctuation such as periods, commas, and capitalization, and use only the context provided."
def generate_corrected_transcript(system_prompt, content):
    # Ask the chat model to fix spelling in the transcript, guided by the system prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": content
            }
        ]
    )
    return response.choices[0].message.content
audio_file = open("data/audio_with_concepts.m4a", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    language="en"
)
corrected_text = generate_corrected_transcript(system_prompt, transcript.text)
print(f"Output corrected by GPT-4:\n{corrected_text}")
Output corrected by GPT-4: Welcome to our company ZyntriQix. Today, we will talk about our new products, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array, and also we will talk about our existing products and their performance, which are B.R.I.C.K., Q.U.A.R.T.Z., and F.L.I.N.T.
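One way to compare the outputs above without reading them by eye is to check which expected product names appear verbatim in each transcript. This is a sketch rather than part of the original workflow: `missing_terms` is a hypothetical helper, and the two strings are shortened copies of the outputs shown in this section.

```python
def missing_terms(transcript: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not appear verbatim in the transcript."""
    return [term for term in expected_terms if term not in transcript]


expected = ["ZyntriQix", "Digique Plus", "CynapseFive", "VortiQore V8", "EchoNix Array"]
uncorrected = (
    "Welcome to our company Zintrik X. Today, we will talk about our new "
    "products, DGQ+, Synapse 5, VortiCore V8, Equinix Array"
)
corrected = (
    "Welcome to our company ZyntriQix. Today, we will talk about our new "
    "products, Digique Plus, CynapseFive, VortiQore V8, EchoNix Array"
)
print(missing_terms(uncorrected, expected))  # every expected name is missing
print(missing_terms(corrected, expected))    # nothing is missing
```

The match is exact and case-sensitive, so it flags any spelling variant. It cannot tell whether a name that is present was actually spoken in the audio.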