
1. Introduction
Nowadays, many applications, such as knowledge bases, assistants, and analytics engines, benefit greatly from neural network integration. One practical use case is converting text into speech. This process, known as Text-to-Speech (TTS), enables automated audio content creation with natural-sounding, human-like voices.
Modern TTS systems use deep learning to handle pronunciation, rhythm, intonation, and even emotion. Unlike early rule-based methods, these models are trained on large datasets and can generate expressive, multilingual speech, which is ideal for global applications like virtual assistants or inclusive education platforms.
In this tutorial, we’ll explore how to use OpenAI Text-to-Speech with Spring AI.
2. Dependencies and Configuration
We’ll start by adding the spring-ai-starter-model-openai dependency:
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>1.1.0</version>
</dependency>
Next, we’ll configure the Spring AI properties for the OpenAI model:
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.options.model=tts-1
spring.ai.openai.audio.speech.options.voice=alloy
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0
To use the OpenAI API, we must set the OpenAI API key. We also specify the text-to-speech model name, voice, response format, and audio speed.
3. Building a Text-to-Speech Application
Now, we'll build our text-to-speech application. First, we'll create the TextToSpeechService:
@Service
public class TextToSpeechService {

    private final OpenAiAudioSpeechModel openAiAudioSpeechModel;

    @Autowired
    public TextToSpeechService(OpenAiAudioSpeechModel openAiAudioSpeechModel) {
        this.openAiAudioSpeechModel = openAiAudioSpeechModel;
    }

    public byte[] makeSpeech(String text) {
        SpeechPrompt speechPrompt = new SpeechPrompt(text);
        SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);
        return response.getResult().getOutput();
    }
}
Here, we use OpenAiAudioSpeechModel, which Spring AI auto-configures from our properties. We also define the makeSpeech() method, which converts text into audio file bytes.
Next, we create the TextToSpeechController:
@RestController
public class TextToSpeechController {

    private final TextToSpeechService textToSpeechService;

    @Autowired
    public TextToSpeechController(TextToSpeechService textToSpeechService) {
        this.textToSpeechService = textToSpeechService;
    }

    @GetMapping("/text-to-speech")
    public ResponseEntity<byte[]> generateSpeechForText(@RequestParam String text) {
        return ResponseEntity.ok(textToSpeechService.makeSpeech(text));
    }
}
Finally, we test our endpoint:
@SpringBootTest
@ExtendWith(SpringExtension.class)
@AutoConfigureMockMvc
@EnabledIfEnvironmentVariable(named = "OPENAI_API_KEY", matches = ".*")
class TextToSpeechLiveTest {

    @Autowired
    private MockMvc mockMvc;

    @Autowired
    private TextToSpeechService textToSpeechService;

    @Test
    void givenTextToSpeechService_whenCallingTextToSpeechEndpoint_thenExpectedAudioFileBytesShouldBeObtained() throws Exception {
        byte[] audioContent = mockMvc.perform(get("/text-to-speech")
            .param("text", "Hello from Baeldung"))
            .andExpect(status().isOk())
            .andReturn()
            .getResponse()
            .getContentAsByteArray();

        assertNotEquals(0, audioContent.length);
    }
}
We call the text-to-speech endpoint and verify the response code and non-empty content. If we save the content to a file, we get an MP3 file with our speech.
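For instance, we can persist the returned bytes with java.nio. Here's a minimal, self-contained sketch, where a placeholder byte array stands in for the result of makeSpeech():

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveSpeechToFile {

    public static void main(String[] args) throws Exception {
        // In the real application these bytes come from textToSpeechService.makeSpeech(text);
        // here we use a placeholder array so the sketch runs on its own
        byte[] audioContent = new byte[] { 0x49, 0x44, 0x33 };

        Path target = Path.of("speech.mp3");
        Files.write(target, audioContent);

        System.out.println("Wrote " + Files.size(target) + " bytes to " + target.getFileName());
    }
}
```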
4. Add Streaming Real-time Audio Endpoint
We may face significant memory consumption when receiving large audio content as one huge byte array. Also, sometimes we want to start playing the audio before it's fully generated. For this, OpenAI supports streaming text-to-speech responses.
Let’s extend our TextToSpeechService to support this feature:
public Flux<byte[]> makeSpeechStream(String text) {
    SpeechPrompt speechPrompt = new SpeechPrompt(text);
    Flux<SpeechResponse> responseStream = openAiAudioSpeechModel.stream(speechPrompt);

    return responseStream
        .map(SpeechResponse::getResult)
        .map(Speech::getOutput);
}
We've added the makeSpeechStream() method. Here, we use the stream() method of OpenAiAudioSpeechModel to produce a stream of byte chunks.
Next, we create the endpoint to stream bytes over HTTP:
@GetMapping(value = "/text-to-speech-stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
public ResponseEntity<StreamingResponseBody> streamSpeech(@RequestParam("text") String text) {
    Flux<byte[]> audioStream = textToSpeechService.makeSpeechStream(text);

    StreamingResponseBody responseBody = outputStream -> {
        audioStream.toStream().forEach(bytes -> {
            try {
                outputStream.write(bytes);
                outputStream.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    };

    return ResponseEntity.ok()
        .contentType(MediaType.APPLICATION_OCTET_STREAM)
        .body(responseBody);
}
Here, we iterate over the byte stream and write each chunk into the StreamingResponseBody. If we used WebFlux, we could return the Flux directly from the endpoint. We also use the application/octet-stream content type to indicate that the response is a binary stream.
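On a WebFlux stack, the equivalent endpoint could be a one-liner. This is a sketch, assuming the same TextToSpeechService is injected into the controller:

```java
@GetMapping(value = "/text-to-speech-stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
public Flux<byte[]> streamSpeechReactive(@RequestParam("text") String text) {
    return textToSpeechService.makeSpeechStream(text);
}
```

With this variant, the reactive runtime handles backpressure and chunked writing for us, so no manual OutputStream handling is needed.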
Now let’s test our streaming method:
@Test
void givenStreamingEndpoint_whenCalled_thenReceiveAudioFileBytes() throws Exception {
    String longText = """
        Hello from Baeldung!
        Here, we explore the world of Java,
        Spring, and web development with clear, practical tutorials.
        Whether you're just starting out or diving deep into advanced
        topics, you'll find guides to help you write clean, efficient,
        and modern code.
        """;

    mockMvc.perform(get("/text-to-speech-stream")
        .param("text", longText)
        .accept(MediaType.APPLICATION_OCTET_STREAM))
        .andExpect(status().isOk())
        .andDo(result -> {
            byte[] response = result.getResponse().getContentAsByteArray();
            assertNotNull(response);
            assertTrue(response.length > 0);
        });
}
Here we call our streaming endpoint and verify it returns a byte array. MockMvc collects the full response body, but we can also read it as a stream.
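On the consumer side, reading the response in fixed-size chunks looks roughly like this. In this self-contained sketch, a ByteArrayInputStream stands in for the HTTP response stream of the /text-to-speech-stream endpoint:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class ChunkedAudioReader {

    public static void main(String[] args) throws Exception {
        // Stand-in for the HTTP response body; in a real client this would be
        // the input stream of the /text-to-speech-stream response
        InputStream body = new ByteArrayInputStream(new byte[10000]);

        byte[] buffer = new byte[4096];
        long total = 0;
        int read;
        while ((read = body.read(buffer)) != -1) {
            // each chunk could be handed to an audio player as it arrives
            total += read;
        }

        System.out.println("Read " + total + " bytes in chunks");
    }
}
```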
5. Customizing the Model Parameters for a Specific Call
Sometimes we need to override the model options for specific calls. For this, we can use OpenAiAudioSpeechOptions. Let's update our TextToSpeechService to support custom speech options:
Sometimes we need to override model options for specific calls. For this, we can use OpenAiAudioSpeechOptions. Let’s update our TextToSpeechService to support custom speech options:
public byte[] makeSpeech(String text, OpenAiAudioSpeechOptions speechOptions) {
    SpeechPrompt speechPrompt = new SpeechPrompt(text, speechOptions);
    SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);
    return response.getResult().getOutput();
}
We've overloaded makeSpeech() with an additional OpenAiAudioSpeechOptions parameter, which we pass to the OpenAI API call. If we pass an empty options object, the default options apply.
Now we create another endpoint that accepts speech parameters:
@GetMapping("/text-to-speech-customized")
public ResponseEntity<byte[]> generateSpeechForTextCustomized(@RequestParam("text") String text,
    @RequestParam Map<String, String> params) {

    OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
        .model(params.get("model"))
        .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(params.get("voice")))
        .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.valueOf(params.get("responseFormat")))
        .speed(Float.parseFloat(params.get("speed")))
        .build();

    return ResponseEntity.ok(textToSpeechService.makeSpeech(text, speechOptions));
}
Here we get a map of speech parameters and build the OpenAiAudioSpeechOptions.
Finally, let’s test the new endpoint:
@Test
void givenTextToSpeechService_whenCallingTextToSpeechEndpointWithAnotherVoiceOption_thenExpectedAudioFileBytesShouldBeObtained() throws Exception {
    byte[] audioContent = mockMvc.perform(get("/text-to-speech-customized")
        .param("text", "Hello from Baeldung")
        .param("model", "tts-1")
        .param("voice", "NOVA")
        .param("responseFormat", "MP3")
        .param("speed", "1.0"))
        .andExpect(status().isOk())
        .andReturn()
        .getResponse()
        .getContentAsByteArray();

    assertNotEquals(0, audioContent.length);
}
We call the endpoint with the NOVA voice for this request. As expected, we receive the audio bytes generated with the overridden voice.
6. Conclusion
Text-to-speech APIs make it possible to generate natural speech from text. With simple configuration and modern models, we can bring dynamic, spoken interaction into our applications.
In this article, we explored how to integrate our application with the OpenAI TTS model using Spring AI. In the same way, we can integrate with other TTS models or build our own.
As always, the code is available over on GitHub.
The post A Guide to OpenAI Text-to-Speech (TTS) in Spring AI first appeared on Baeldung.