from openai import OpenAI

client = OpenAI()

speech_file_path = "output/speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
)
response.write_to_file(speech_file_path)

from pydub import AudioSegment
from pydub.playback import play

audio = AudioSegment.from_file(speech_file_path, format="mp3")
play(audio)
# Try in Turkish
from openai import OpenAI

client = OpenAI()

speech_file_path = "output/speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Bugün kod yazmak için çok güzel bir gün!",
    response_format="mp3",
    speed=1.0,
)
response.write_to_file(speech_file_path)

from pydub import AudioSegment
from pydub.playback import play

audio = AudioSegment.from_file(speech_file_path, format="mp3")
play(audio)
Audio quality
For real-time applications, the standard tts-1 model provides the lowest latency, but at lower quality than the tts-1-hd model. Because of the way the audio is generated, tts-1 is more likely than tts-1-hd to produce audible static in certain situations. Depending on your listening device and the individual listener, the difference may not be noticeable.
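The easiest way to decide between the two models is to render the same text with each and compare by ear. The sketch below assumes the same OpenAI client setup as earlier; the output_path_for naming scheme is our own convention, not part of the API.

```python
# Hypothetical naming scheme for side-by-side comparison files:
def output_path_for(model: str) -> str:
    return f"output/speech-{model}.mp3"


def synthesize_with(model: str, text: str) -> str:
    """Render `text` with the given TTS model and return the output path."""
    # Imported here so the helpers above stay importable without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    out_path = output_path_for(model)
    response = client.audio.speech.create(model=model, voice="alloy", input=text)
    response.write_to_file(out_path)
    return out_path


# Listen to both files and compare:
# for model in ("tts-1", "tts-1-hd"):
#     synthesize_with(model, "Compare my audio quality across models.")
```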
Supported output formats
The default response format is "mp3", but other formats like "opus", "aac", or "flac" are available.
- Opus: For internet streaming and communication, low latency.
- AAC: For digital audio compression, preferred by YouTube, Android, iOS.
- FLAC: For lossless audio compression, favored by audio enthusiasts for archiving.
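Selecting a format is just a matter of passing response_format and saving with a matching extension. The helpers below are a sketch; the format-to-extension mapping is our own convention covering the four formats listed above (newer SDK versions may accept additional formats not shown here).

```python
# Our own mapping from response_format to a file extension (not part of the API):
FORMAT_EXTENSION = {"mp3": ".mp3", "opus": ".opus", "aac": ".aac", "flac": ".flac"}


def speech_file_for(fmt: str) -> str:
    """Build an output path whose extension matches the requested format."""
    if fmt not in FORMAT_EXTENSION:
        raise ValueError(f"format not covered by this example: {fmt}")
    return "output/speech" + FORMAT_EXTENSION[fmt]


def synthesize_as(fmt: str, text: str) -> str:
    # Imported here so the pure helpers above work without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    path = speech_file_for(fmt)
    response = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text, response_format=fmt
    )
    response.write_to_file(path)
    return path


# e.g. synthesize_as("flac", "Archive-quality audio, losslessly compressed.")
```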
Supported languages¶
The TTS model generally follows the Whisper model in terms of language support. Whisper supports the following languages and performs well despite the current voices being optimized for English:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
You can generate spoken audio in these languages by providing the input text in the language of your choice.
Streaming real-time audio
The Speech API supports real-time audio streaming using chunked transfer encoding, which means the audio can be played back before the full file has been generated.
from openai import OpenAI

client = OpenAI()

# Create text-to-speech audio file
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
) as response:
    response.stream_to_file(speech_file_path)

from pydub import AudioSegment
from pydub.playback import play

audio = AudioSegment.from_file(speech_file_path, format="mp3")
play(audio)
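stream_to_file is a convenience wrapper; if you want to process chunks as they arrive (for example, to feed an audio player or a network socket), the streaming response also exposes iter_bytes. The function below is a sketch using the same with_streaming_response API as above; the chunk size is an arbitrary choice.

```python
def stream_speech_to_file(text: str, path: str, chunk_size: int = 4096) -> int:
    """Stream TTS audio chunk by chunk to `path`; returns total bytes written."""
    # Imported here so this file parses without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    total = 0
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="alloy", input=text
    ) as response:
        with open(path, "wb") as f:
            # Each chunk is available before the full file has been generated,
            # which is what makes low-latency playback possible.
            for chunk in response.iter_bytes(chunk_size):
                f.write(chunk)
                total += len(chunk)
    return total


# e.g. stream_speech_to_file("Hello world!", "output/streamed.mp3")
```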