Auralis
Making XTTS-v2 Fast, Resource-Efficient, and Async Safe
To enhance the magnificent XTTS-v2 by implementing a fast and resource-efficient version, the first step is actually understanding what the code is doing and why.
Coming from an absolute zero background in audio-related technology, it initially seemed overwhelming. However, with enough dedication (and perhaps a sleepless night), any code can be unraveled.
Usually, approaching a task like this means first understanding the architecture and then reading the code. Reading the paper first is generally a good strategy, but in this case I opted for a more hands-on approach: I started by running the code I needed to optimize, and wherever I couldn't follow the logic, I searched the paper for explanations.
So I grabbed the code from the repository and... nothing happened, because I discovered that Coqui had shut down during 2024 and their repo had not received an update in four months. I started questioning whether going this route made sense, but a quick look at the Hugging Face model card convinced me it had to be done: the model had more than 1 million downloads in the last month alone, so clearly many people could still benefit from such an implementation.
After a bit of messing around with the requirements file, I managed to get a compatible environment. The first thing I noticed was that the code was probably adapted from a training script. There's nothing intrinsically wrong with that, but it surely isn't optimized for a production environment. I'm not even sure why I decided to embark on this journey, but beyond making an effort to push the OSS community forward, I wanted to integrate a good, reliable text-to-speech system into our UI Pulsar, and possibly every other UI out there!
When the inference method is called, it starts by tokenizing the input. Here's where the 'problems' start: they use a custom-formatted tokenizer that cannot be loaded directly with Hugging Face. So, for maximum compatibility, I decided to port their tokenizer logic into a `FastPreTrainedTokenizer` compatible with the Hugging Face API. Looking at their repo, I also saw that the model checkpoint was still in `.pth` format (a pickle-based Torch format that is discouraged because unpickling the file can execute arbitrary code), and one file was even flagged as potentially dangerous.
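As an aside: if you want to avoid the unpickling risk in your own projects, one common mitigation (a general sketch, not necessarily what Auralis ships) is to load the checkpoint once with weights-only unpickling and re-save it as safetensors, which stores raw tensors and cannot execute code on load. The paths below are hypothetical.

```python
import torch
from safetensors.torch import save_file

# weights_only=True refuses to unpickle arbitrary objects, only tensors and plain containers.
checkpoint = torch.load("xtts_v2.pth", map_location="cpu", weights_only=True)  # hypothetical path

# Depending on how the checkpoint was saved, the tensors may sit under a nested key like "model".
tensors = {k: v.contiguous() for k, v in checkpoint.items() if isinstance(v, torch.Tensor)}
save_file(tensors, "xtts_v2.safetensors")
```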
Back to the tokenizer: it had some strange behavior. For every supported language there is a specified max length in characters, not tokens. I think they found these limits empirically and hard-coded them into the audio generation; in fact, they say that to contain the memory footprint you can truncate the input strings to the corresponding language's max character length. Since this project is heavily optimization-oriented, I decided to keep this strategy. To maintain the best quality, though, I optimized the tokenizer a bit and introduced a different splitting technique with three main cases:
1. *Hard splits*: ending punctuation such as `.` `?` `!`
2. *Mid splits*: punctuation that usually sits in the middle of a phrase but still marks a point where a speaker would inhale, such as `,` and `;`
3. *Weak splits*: the worst case, where we split at the nearest space
We search within +/- 30 characters of the target length for a split point, and if the resulting phrase ends with a dot, we remove it, since it tends to produce strange sounds (sometimes the model seemed to be trying to say .com or .net and then cut off abruptly; we did not find any loss of quality in the other cases).
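To make this concrete, here is a minimal sketch of the splitting logic as I just described it. The function names, exact punctuation sets, and the fallback behavior are illustrative; the real Auralis tokenizer carries extra bookkeeping around languages and token ids.

```python
HARD_SPLITS = set(".?!")   # end of a sentence
MID_SPLITS = set(",;")     # a pause where a speaker would inhale

def find_split(text: str, target_len: int, window: int = 30) -> int:
    """Find an index to cut `text`, preferring hard splits, then mid splits,
    then the nearest space, searched within +/- `window` chars of `target_len`."""
    lo = max(0, target_len - window)
    hi = min(len(text), target_len + window)
    candidates = range(hi - 1, lo - 1, -1)  # prefer the latest viable cut point
    for charset in (HARD_SPLITS, MID_SPLITS, {" "}):
        for i in candidates:
            if text[i] in charset:
                return i + 1  # cut right after the punctuation / space
    return target_len  # no good split found: fall back to a hard cut

def split_text(text: str, max_chars: int) -> list[str]:
    chunks = []
    while len(text) > max_chars:
        cut = find_split(text, max_chars)
        chunks.append(text[:cut].strip())
        text = text[cut:]
    chunks.append(text.strip())
    # Trailing dots tended to produce odd sounds, so drop them.
    return [c[:-1] if c.endswith(".") else c for c in chunks if c]
```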
I then implemented the tokenizer in a Hugging Face-compatible format.
Continuing the inference process, I found out that the XTTS-v2 architecture was composed mainly of a GPT-2-like model and a HiFi-GAN vocoder. In reality, there are a couple more components that are particularly noteworthy.
Conditioning Encoder: a module composed of ResNet blocks and convolutions (fact check needed) that outputs a conditioning tensor of shape `(1, seq_len, hidden_dim)`, where `seq_len` (sequence length) is determined by the number of reference audios and by the limit set in the config file for the conditioning inputs. This component also outputs a conditioning latent that will be used by the HiFi-GAN decoder to 'clone' the reference voice.
From what I've grasped so far, I think the first set of conditioning latents alters the speech modality, while the second alters the pitch and intonation.
In XTTS-v2, they introduced a conditioning perceiver, a module composed of attention blocks interleaved with causal convolutions that processes the conditioning inputs into a better (and standardized) conditioning input for the model. They also use a learned positional encoding, which is a rather novel approach.
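To make the shapes concrete, here is a toy sketch of the contract described above. The linear projection and all dimensions below are made up for illustration; the real path goes through the ResNet/conv encoder and the conditioning perceiver.

```python
import torch
import torch.nn as nn

hidden_dim = 1024
mel_bins = 80
proj = nn.Linear(mel_bins, hidden_dim)  # stand-in for the real conditioning encoder

# Two reference audios, already converted to mel spectrograms of different lengths.
reference_mels = [torch.randn(1, mel_bins, 120), torch.randn(1, mel_bins, 96)]

# Each reference contributes a chunk of conditioning latents; everything is concatenated
# along the sequence dimension, so seq_len grows with the number (and length) of references.
chunks = [proj(mel.transpose(1, 2)) for mel in reference_mels]  # each (1, frames, hidden_dim)
conditioning = torch.cat(chunks, dim=1)

print(conditioning.shape)  # torch.Size([1, 216, 1024]) -> (1, seq_len, hidden_dim)
```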
After the text has been tokenized, they add start and end of text generation tokens at the front and back of the sequence, and then encode it. XTTS-v2 has two embedding and head components (one set for text, one for audio/mel): at inference both embeddings are required, but only the audio (mel) head is needed.
So the text is encoded and marked with positional embeddings, and the result is concatenated with the conditioning inputs along the sequence length dimension. Basically, it's like we are treating the way the speaker speaks as an embedded, tokenized representation of the speaker's modality, *which is quite fascinating!*
So we have an embeddings tensor composed of `[[speaker conditioning][text embeddings]]`, which is then passed to the `get_forward` method. Here the code is a bit messy, but basically they preallocate a tensor of placeholder tokens (value 1), one for each position of that tensor's sequence length plus one. Essentially, since this prefix is fed to the model as embeddings rather than as token ids (and those embeddings have no direct correspondence with a tokenizer), the placeholder ids are set to 1 (a value not used in the audio generation process) simply to reserve the right amount of space in the key-value (KV) cache; it is the embeddings of this shape that are actually used.
They add 1 to the sequence length because they append a `start_audio_generation` token (with id `1024`) so that the model knows to start generating the audio tokens.
Here, the audio token (1024) is embedded, positionally encoded, and then concatenated to the previously built tensor. It is noteworthy that they do not positionally encode the token based on its id but rather on the latent representation of the embedded token, which is also *quite fascinating*. I did not find any mention of the motivation behind this; if you know, maybe write a comment, I would love to understand it better!
At this point we have `[[conditioning embeddings][text embeddings][start_audio_generation_token embeddings]]`, and we start the decoding process with the GPT-2 model. After that, we process the logits and sample the next token. Notably, the repetition penalty is way higher than in traditional language models: here it is set between 5 and 10 in the provided examples, whereas in language modeling it usually never exceeds 2!
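Here is a rough sketch of that step using the stock Hugging Face `GPT2LMHeadModel` as a stand-in. The real XTTS-v2 GPT has its own config, custom embedders, and positional encodings; all dimensions below are made up.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

cfg = GPT2Config(n_embd=1024, n_layer=2, n_head=8, vocab_size=1026)  # toy config
gpt = GPT2LMHeadModel(cfg).eval()

cond_emb  = torch.randn(1, 32, cfg.n_embd)  # [speaker conditioning]
text_emb  = torch.randn(1, 48, cfg.n_embd)  # [text embeddings]
start_emb = torch.randn(1, 1,  cfg.n_embd)  # start_audio_generation token, embedded

inputs_embeds = torch.cat([cond_emb, text_emb, start_emb], dim=1)

with torch.no_grad():
    # The logits at the last position predict the first audio token.
    logits = gpt(inputs_embeds=inputs_embeds).logits[:, -1, :]

# In the real loop a strong repetition penalty (5-10) is applied to these logits before sampling.
next_audio_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
```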
After the audio tokens have been generated, they re-add a `start_audio_generation` token (which was lost in the generation process) and four `stop_generation` tokens (which, together with the one already present, make five). I don't actually get why they do this, and probably neither do they; the code contains comments hinting that they did not fully understand it either ("Don't ask me why :)"). We'll get to my understanding of why later on. The process continues by embedding the audio tokens with the mel embedder and positionally encoding them (with standard id-based positional encoding this time); they are then concatenated to the original `[[speaker conditioning][text embeddings]]` tensor, giving `[[speaker conditioning][text embeddings][encoded generated audio tokens]]`.
This is then passed as `embedded_input` into the GPT-2 model for one last pass, after which we take not the next-token prediction but the latent representation, and we use it as-is as the latent input to the HiFi-GAN (along with the conditioning input from earlier). Notably, they drop the five added stop tokens from the resulting embeddings. I think they are stuck with adding them because of something in training, where perhaps the stop sequence was repeated for emphasis (or maybe there was padding and the model learned to stop only when five stop tokens are present). I suspect that when the last five tokens are not removed, the model starts to produce the very unnatural audio that is typical of TTS models asked to repeat a token for too long (like in Siri).
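Put together, that final pass looks roughly like this, again with a toy stand-in for the XTTS GPT trunk. The slice indices encode my reading of the code (keep only the audio positions and drop the five appended stop tokens); treat them as a sketch, not a faithful reimplementation.

```python
import torch
from transformers import GPT2Config, GPT2Model

cfg = GPT2Config(n_embd=1024, n_layer=2, n_head=8)
gpt = GPT2Model(cfg).eval()  # bare trunk: we want hidden states, not token predictions

cond_len, text_len, audio_len = 32, 48, 200    # toy lengths
seq_len = cond_len + text_len + audio_len + 5  # +5 for the appended stop tokens
full_emb = torch.randn(1, seq_len, cfg.n_embd)

with torch.no_grad():
    hidden = gpt(inputs_embeds=full_emb).last_hidden_state

# Keep the latents of the generated audio tokens and drop the five trailing stop tokens;
# these latents (plus the speaker conditioning latent) are what the HiFi-GAN decoder consumes.
audio_latents = hidden[:, cond_len + text_len : -5, :]  # (1, audio_len, n_embd)
```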
After understanding how it generates the audio, I started optimizing their code for inference. A lot of computations were repeated or unused in the inference process, so I decided to actually start afresh, incorporating only a few well-structured elements from their codebase.
The first thing that came to mind was to use [vLLM](https://vllm.ai/) to serve the internal GPT-2 model, and for the other components, I would manually render them to be asynchronous and non-blocking.
For the asynchronous part there was nothing special: I just executed these components through asyncio so they wouldn't block the rest of the pipeline. But since memory consumption was very high, I also optimized the HiFi-GAN model to do almost all of its ops in-place (the code will not be used for training), slashing memory usage.
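A minimal sketch of the non-blocking part, assuming the vocoder is an ordinary synchronous callable (the names here are illustrative, not Auralis' actual API): off-load the heavy call to a worker thread so the event loop keeps serving other requests while the waveform is being rendered.

```python
import asyncio

async def decode_audio(vocoder, latents, speaker_embedding):
    # `vocoder` is any blocking callable, e.g. a HiFi-GAN wrapper. PyTorch releases the GIL
    # during heavy ops, so running it in a thread keeps the asyncio event loop responsive.
    return await asyncio.to_thread(vocoder, latents, speaker_embedding)
```

(`asyncio.to_thread` needs Python 3.9+; on older versions, `loop.run_in_executor` does the same job.)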
Implementing the GPT-2 audio model in vLLM was not an easy task, however, mainly because vLLM is rather new to multimodality and there aren't many audio model implementations yet (to date, just one other than this).
The main difficulty in implementing a multimodal GPT-2 in vLLM was how the inputs are passed in the auto-regressive part and how continuous batching is implemented. The first thing we needed to make sure of was reserving the right number of tokens in the cache: the length of the prompt (the voice conditioning) plus the text token ids plus one (for the `start_audio_generation` token).
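In code, that bookkeeping boils down to something like this (a sketch with names of my choosing):

```python
def reserved_prompt_length(conditioning_embeds, text_token_ids) -> int:
    # Slots to reserve in the KV cache: one per conditioning embedding, one per text
    # token, plus one for the start_audio_generation token appended at the end.
    cond_len = conditioning_embeds.shape[1]  # conditioning is (1, cond_len, hidden_dim)
    return cond_len + len(text_token_ids) + 1

# The ids themselves are placeholders (value 1), since this prefix is fed as embeddings:
# placeholder_ids = [1] * reserved_prompt_length(conditioning_embeds, text_token_ids)
```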
We also found that the original implementation recomputed the same token and text embeddings at every forward pass, so we deduplicated this work and passed them in as an argument.
Then we followed the original flow, adapting the components to be parallel-safe (even though the model is so small it will probably never be split).
Another thing worth mentioning is that vLLM hard-limits the maximum value of the repetition penalty, a penalty that adjusts the logits (probabilities) of tokens that have already been generated. vLLM caps the value at 2, but we know that XTTS-v2 needs a repetition penalty between 5 and 10, so we created a logits processor that applies the repetition penalty without any limit on its value.
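Here is a sketch of such a processor. vLLM's `SamplingParams` accepts per-request `logits_processors` callables (at least in the versions I worked with); the function below applies the standard repetition-penalty rule with no cap on the value.

```python
import torch

def make_repetition_penalty(penalty: float):
    """Build a logits processor that penalizes every token already generated,
    with no upper bound on `penalty` (unlike vLLM's built-in parameter)."""
    def processor(generated_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        if not generated_token_ids:
            return logits
        seen = torch.tensor(sorted(set(generated_token_ids)), device=logits.device)
        picked = logits[seen]
        # Classic rule: divide positive logits by the penalty, multiply negative ones.
        logits[seen] = torch.where(picked > 0, picked / penalty, picked * penalty)
        return logits
    return processor

# Hypothetical usage:
# params = SamplingParams(temperature=0.75, logits_processors=[make_repetition_penalty(7.0)])
```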
Also, vLLM does not support returning the model outputs we need here (the hidden states), so for the last decoding step we modified the `SamplingParams` class to add a hidden state collector, an object that captures the hidden states right before sampling.
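The collector itself is conceptually simple; here is a sketch of the idea. The exact attachment point inside vLLM and the call site are implementation details I'm glossing over, and the names are mine.

```python
import torch

class HiddenStatesCollector:
    """Stores the hidden states of the final decoding step, keyed by request id,
    so they can be handed to the HiFi-GAN stage after vLLM finishes sampling."""

    def __init__(self) -> None:
        self._store: dict[str, torch.Tensor] = {}

    def __call__(self, request_id: str, hidden_states: torch.Tensor) -> None:
        # Called from the model code right before sampling; detach and move to CPU
        # so the tensor outlives the forward pass.
        self._store[request_id] = hidden_states.detach().cpu()

    def pop(self, request_id: str) -> torch.Tensor:
        return self._store.pop(request_id)
```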
The last problem to fix was that vLLM automatically sets the position ids for each sequence, but every sequence has a conditioning part that must be excluded from the token count during the positioning phase, so we introduced a tracker component that keeps track of the real position id associated with each token.
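The tracker boils down to subtracting the conditioning prefix when computing positions, roughly like this (illustrative names, not the actual Auralis classes; the real logic has a bit more state to manage across decoding steps):

```python
class PositionTracker:
    """Maps the absolute position of a token in the sequence (as vLLM sees it) to the
    'real' position id the XTTS GPT expects, i.e. counted without the conditioning prefix."""

    def __init__(self, conditioning_len: int) -> None:
        self.conditioning_len = conditioning_len

    def real_position(self, absolute_position: int) -> int:
        # Tokens inside the speaker-conditioning prefix do not advance the positional encoding.
        return max(absolute_position - self.conditioning_len, 0)
```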
After integrating all these optimizations, I finally had a version of XTTS-v2 that was not only faster and more resource-efficient but also asynchronous and safe(er) for production environments. The journey wasn't easy—filled with unexpected hurdles and deep dives into both the model's architecture and the tools I was using—but the end result was worth it.
By leveraging vLLM for the GPT-2 component and restructuring the code to eliminate unnecessary computations, we've significantly reduced inference time and resource consumption. The model now fits more naturally into production pipelines, and it will be integrated into our UI Pulsar, enhancing the user experience with reliable and efficient text-to-speech capabilities.
There's still room for improvement. For instance, understanding the exact reasons behind some of the original implementation choices—like the repeated stop tokens—could lead to even more optimizations. Additionally, refining the conditioning components and exploring the effects of different repetition penalties might unlock further performance gains.
Embarking on this project not only pushed me to expand my knowledge in audio-related technologies but also reinforced the importance of open-source contributions. By making these enhancements available, I hope to benefit others who are looking to implement fast and efficient text-to-speech solutions in their own applications.
If you're interested in exploring this optimized version or have insights into some of the quirks mentioned, feel free to reach out or contribute to the project. Together, we can continue to push forward the OSS community and make advanced technologies like XTTS-v2 accessible to all.