OpenAI Text to Speech (tts-1) and the Polish language

Does OpenAI Text to Speech support languages other than English?

OpenAI’s Text-to-Speech service transforms text into audio files with impressive quality, at least in English. I haven’t performed a blind test, but based on the few samples I heard, I don’t think I could easily tell whether the audio was recorded by a real human or generated by an artificial intelligence model.

However, I wondered if the service could perform comparably well with languages other than English. The documentation is a bit mysterious:

The TTS model generally follows the Whisper model in terms of language support. Whisper supports the following languages and performs well despite the current voices being optimized for English.

Source: Text to speech documentation

Since there were no samples on the page other than those in English, I generated some samples in Polish to see if the model is already useful there. Polish is my native language, so it’s the one where I can judge nuances most easily. It’s also substantially different from English, which makes it an interesting testing ground. I’m sharing the results here, enjoy! 🙂

Please note that the samples were generated on 2023-11-18 using the tts-1 model.

The API

I generated the samples using the createSpeech API endpoint. A typical call is as simple as sending the following request, with a proper authentication token added:

POST https://api.openai.com/v1/audio/speech

{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
}
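For illustration, here is a minimal Python sketch of the same request using only the standard library. The function names build_speech_request and save_speech and the YOUR_API_KEY placeholder are my own, not part of the API, and I am assuming the usual Bearer token header and that the response body is the raw audio file (MP3 unless another format is requested).

```python
import json
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, voice="alloy", api_key="YOUR_API_KEY"):
    """Build (but do not send) a POST request for the speech endpoint."""
    # Same three fields as in the raw request above; note there is no
    # language field, only the text and the voice.
    payload = {"model": "tts-1", "input": text, "voice": voice}
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: standard Bearer token authentication.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(text, path, voice="alloy", api_key="YOUR_API_KEY"):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, voice, api_key)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

With a real key, a call such as save_speech("Litwo, Ojczyzno moja!", "sample.mp3", voice="nova", api_key=my_key) should write the audio to sample.mp3.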

Interestingly, the API does not accept a language parameter, so it must rely on detecting the language from the input text itself. This has no chance of working for very short texts. Should the word “no” be read as English, Spanish, Polish, or something else? There is no way to know without more context. But it might work reasonably well for sentences and longer texts.

Example 1: Adam Mickiewicz – Pan Tadeusz (Inwokacja)

Here is our first sample:

Litwo, Ojczyzno moja! ty jesteś jak zdrowie;
Ile cię trzeba cenić, ten tylko się dowie, Kto cię stracił.
Dziś piękność twą w całej ozdobie;
Widzę i opisuję, bo tęsknię po tobie.

Source: Adam Mickiewicz – Pan Tadeusz – Inwokacja
Voice: echo
Voice: alloy
Voice: fable
Voice: nova
Voice: onyx
Voice: shimmer
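
To compare the voices, the same text has to be submitted once per voice. A rough sketch of how such a batch could be prepared is below. The helper plan_voice_samples and the output file naming are hypothetical, chosen only for illustration; the sketch only builds the request payloads and does not call the API.

```python
# The six voices used for the samples in this post.
VOICES = ["alloy", "echo", "fable", "nova", "onyx", "shimmer"]


def plan_voice_samples(text, prefix="sample"):
    """Return one entry per voice: a target file name and a request payload."""
    return [
        {
            # Hypothetical naming scheme: <prefix>-<voice>.mp3
            "file": f"{prefix}-{voice}.mp3",
            "payload": {"model": "tts-1", "input": text, "voice": voice},
        }
        for voice in VOICES
    ]
```

Each payload can then be sent to the speech endpoint and the response saved under the matching file name.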

Example 2: Grzegorz Brzęczyszczykiewicz

Now let me share a tongue twister to see how the model performs with an arguably difficult input.

Grzegorz Brzęczyszczykiewicz, Chrząszczyżewoszyce, powiat Łękołody.

Source: a tongue twister from the movie „Jak rozpętałem II Wojnę Światową”
Voice: fable
Voice: nova

I found the output truly impressive. Everything sounds correct. And the sentence is quite difficult to pronounce without mistakes even for a native speaker of the language 😉

Example 3: a few quotes from Polish comedies

Let me end with a few random, famous quotes from Polish comedies.

“Bunkrów nie ma, ale też jest zajebiście” generated with the fable voice
“Parówkowym skrytożercom mówimy stanowcze: NIE!” generated with the alloy voice
“Ciemność, widzę ciemność! Ciemność widzę!” generated with the fable voice

Discussion

How does it sound to a native Polish speaker? I think it sounds like a foreigner who speaks Polish very well and has done so for years. One can still hear a Western accent, however, notably around Polish-specific characters like “ć”, “ś”, and maybe “ę”. The stresses seem right, although that is debatable, since the main sample is a poem with its own rhythm.

It’s doing quite well with interpretation as well – text written with CAPITAL LETTERS is spoken with an emotional tone, and punctuation marks direct how the whole sentence sounds.

Is any voice significantly better than the others? I think nova sounds the best. I could mistake it for a human voice if I weren’t using good-quality headphones and listening so closely. Second place goes to fable.

Is it useful? I think it depends on the use case. At this stage, I would strongly prefer a human voice to demonstrate the correct pronunciation of words for language learning, although… it’s probably good enough when the alternative is no audio sample at all. For other use cases, where realism is not a top priority, it already looks fine to use.

I don’t know much about the current competitors, so I cannot compare the results to alternative models, but this API is definitely an interesting one, and not only for those generating speech in English.

Thanks for stopping by!
