OpenAI Text To Speech (tts-1) and Polish language

Does OpenAI Text to Speech support languages other than English?

OpenAI’s Text-to-Speech service transforms text into audio files with impressive quality, at least for English. I haven’t performed a blind test, but based on the few samples I heard, I don’t think I could easily tell whether the audio was recorded by a real human or generated by an artificial intelligence model.

However, I wondered if the service could perform comparably well with languages other than English. The documentation is a bit mysterious:

The TTS model generally follows the Whisper model in terms of language support. Whisper supports the following languages and performs well despite the current voices being optimized for English.
Source: Text to speech documentation

Since there were no samples on the page other than those in English, I generated some samples in Polish to see if the model is already useful there. Polish is my native language, so it’s the one where I can judge nuances most easily. It’s also substantially different from English, which makes it an interesting testing ground. I’m sharing the results here, enjoy! 🙂

Please note, the samples were generated on 2023-11-18 using model tts-1.

The API

I generated the samples using the createSpeech API endpoint. A typical call is as simple as sending the following request, with a proper authentication token added:

POST https://api.openai.com/v1/audio/speech

{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
}

Interestingly, the API does not accept a language parameter, so it must rely on auto-detecting the language from the input text. This has no chance of working for very short texts. Should the word “no” be read as English, Spanish, Polish, or something else? There is no way to know without more context. But it might work reasonably well for sentences and longer texts.
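For readers who want to reproduce the samples, the request above can be sketched in Python using only the standard library. This is a minimal sketch, not the script I used for this post: the helper names `build_speech_request` and `save_speech` are made up for illustration, the API key is assumed to be sent as a Bearer token, and the response body is assumed to be the raw audio bytes (MP3 unless another format is requested).

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"
# The six voices used for the samples in this post.
VOICES = ["alloy", "echo", "fable", "nova", "onyx", "shimmer"]


def build_speech_request(text, voice, api_key, model="tts-1"):
    """Build (but do not send) a POST request for the speech endpoint.

    Hypothetical helper. Note there is no language field in the payload:
    the model has to infer the language from `text` itself.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the token is passed as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(text, voice, api_key, path):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, voice, api_key)
    with urllib.request.urlopen(request) as response:
        with open(path, "wb") as audio_file:
            audio_file.write(response.read())


# Example usage (needs a valid key and network access, so it is commented out):
# import os
# for voice in VOICES:
#     save_speech("Litwo, Ojczyzno moja!", voice,
#                 os.environ["OPENAI_API_KEY"], f"{voice}.mp3")
```

Keeping request construction separate from sending makes it easy to inspect the payload without spending API credits.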

Example 1: Adam Mickiewicz – Pan Tadeusz (Inwokacja)

Here is our first sample:

Litwo, Ojczyzno moja! ty jesteś jak zdrowie;
Ile cię trzeba cenić, ten tylko się dowie, Kto cię stracił.
Dziś piękność twą w całej ozdobie;
Widzę i opisuję, bo tęsknię po tobie.
Source: Adam Mickiewicz – Pan Tadeusz – Inwokacja

Voice: echo

Voice: alloy

Voice: fable

Voice: nova

Voice: onyx

Voice: shimmer

Example 2: Grzegorz Brzęczyszczykiewicz

Now let me share a tongue twister to see how the model performs with an arguably difficult input.

Grzegorz Brzęczyszczykiewicz, Chrząszczyżewoszyce, powiat Łękołody.
Source: a tongue twister from the movie „Jak rozpętałem II Wojnę Światową”

Voice: fable

Voice: nova

I found the output truly impressive. Everything sounds correct. And the sentence is quite difficult to pronounce without mistakes even for a native speaker of the language 😉

Example 3: a few quotes from Polish comedies

Let me end with a few random, famous quotes from Polish comedies.

“Bunkrów nie ma, ale też jest zajebiście” generated with the fable voice

“Parówkowym skrytożercom mówimy stanowcze: NIE!” generated with the alloy voice

“Ciemność, widzę ciemność! Ciemność widzę!” generated with the fable voice

Discussion

How does it sound to a native Polish speaker? I think it sounds like a foreigner who speaks Polish very well, maybe after years of practice. One can still hear a Western accent, however, notably around the national characters like “ć”, “ś”, and maybe “ę”. The stresses seem right, although that might be debatable, since the main sample is a poem with its own rhythm.

It also handles interpretation quite well: text written in CAPITAL LETTERS is spoken with an emotional tone, and punctuation marks shape how the whole sentence sounds.

Is any voice significantly better than the others? I think nova sounds the best. I could mistake it for a human voice if I weren’t using good-quality headphones and listening so closely. Second place goes to fable.

Is it useful? I think it depends on the use case. At this stage, I would strongly prefer a human voice to demonstrate the correct pronunciation of words for language learning, although… it’s probably good enough when the alternative is no audio sample at all. For other use cases, where realism is not a top priority, it already looks fine to use.

I don’t know much about the current competitors, so I cannot compare the results to alternative models, but this API is definitely an interesting one, and not only for those generating speech in English.

Thanks for stopping by!

2 thoughts on “OpenAI Text To Speech (tts-1) and Polish language”

Wolfgang Schneider

January 5, 2025 at 9:11 am

Hi Tim, thanks for this excellent article! Very helpful for my Polish learning approach. 🙂 I wonder if you have any update on other models, since one year much happened in the AI realm. I searched myself but am unable to judge if the Polish pronunciation is really good or not.
- Tim Taurit
  
  January 5, 2025 at 12:25 pm
  Hey Wolfgang, thanks for your interest in learning Polish; always a pleasure to hear! 🙂 Here’s how I see it in January 2025:
  - The best voice models for Polish I have encountered so far (though I’m sure there are many more on the market) are Azure Text-to-speech models: `pl-PL-MarekNeural` (my favorite), `pl-PL-AgnieszkaNeural`, `pl-PL-ZofiaNeural`. All sound correct and natural; I could easily mistake them for human speech if it weren’t for sporadic loan word mistakes—solid 9/10.
  - OpenAI’s models from this blog post seem to have been updated; the outputs sound slightly different today. But I’d rate them 7/10. The stresses and melody are correct, yet a slight Western accent is still heard in the pronunciation of national characters. This might be good enough, but Azure’s TTS is better.
  - I also now use ChatGPT Advanced Voice Mode, which allows a voice chat in Polish. I think it’s the same engine; I’d rate it 6-7/10. The language’s grammar, stresses, and melody are correct and natural, but there is a Western accent in the pronunciation of national characters.
  I keep my fingers crossed for your progress! 😊