Google researchers have revealed a text-to-music AI that creates songs lasting as long as five minutes.
In a paper presenting their work and findings so far, the team introduced MusicLM to the world with a number of examples that bear a surprising resemblance to their text prompts.
The researchers claim their model “outperforms previous systems both in audio quality and adherence to the text description”.
The examples are 30-second snippets of the songs, and include their input captions such as:
- “The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls”.
- “A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable”.
- “A rising synth is playing an arpeggio with a lot of reverb. It is backed by pads, sub bass line and soft drums. This song is full of synth sounds creating a soothing and adventurous atmosphere. It may be playing at a festival during two songs for a buildup”.
Using AI to generate music is nothing new – but a tool that can generate passable music from a simple text prompt had yet to be showcased. That is, until now, according to the team behind MusicLM.
The researchers explain in their paper the various challenges facing AI music generation. First, there is the scarcity of paired audio and text data – unlike in text-to-image machine learning, where they say huge datasets have “contributed significantly” to recent advances.
OpenAI’s DALL-E and Stable Diffusion, for example, have both caused a swell of public interest in the area, as well as finding immediate use cases.
An additional challenge in AI music generation is that music is structured “along a temporal dimension” – a music track exists over a period of time. It is therefore much harder to capture the intent for a music track with a basic text caption than it is for a still image.
MusicLM is a step towards overcoming those challenges, the team says.
It is a “hierarchical sequence-to-sequence model for music generation” which uses machine learning to generate sequences for different levels of the song, such as the structure, the melody, and the individual sounds.
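The hierarchical idea can be illustrated with a toy sketch: a coarse, structure-level sequence is generated first, and each coarse token then conditions a run of fine-grained, sound-level tokens. The function names, token vocabularies, and hashing scheme below are purely illustrative stand-ins for trained models, not Google's implementation.

```python
def text_to_semantic(prompt, length=8):
    """Stage 1 (hypothetical): map a text prompt to coarse 'semantic'
    tokens capturing long-term structure. A simple character-sum hash
    stands in for a trained sequence-to-sequence model."""
    words = prompt.lower().split()
    return [sum(map(ord, words[i % len(words)])) % 64 for i in range(length)]

def semantic_to_acoustic(semantic_tokens, tokens_per_step=4):
    """Stage 2 (hypothetical): expand each coarse token into several
    fine-grained 'acoustic' tokens, which a neural audio codec would
    decode into a waveform."""
    acoustic = []
    for s in semantic_tokens:
        acoustic.extend((s * 7 + k) % 1024 for k in range(tokens_per_step))
    return acoustic

def generate(prompt):
    """Full toy pipeline: text -> structure-level -> sound-level tokens."""
    return semantic_to_acoustic(text_to_semantic(prompt))

tokens = generate("fast-paced arcade soundtrack with electric guitar")
print(len(tokens))  # 8 coarse steps x 4 fine tokens each = 32
```

The point of the hierarchy is that long-range structure is decided at the cheap, coarse level before any expensive fine-grained audio detail is committed.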
To learn how to do this, the model is trained on a large dataset of unlabeled music, along with MusicCaps, a music caption dataset of more than 5,500 examples prepared by musicians, which has been publicly released to support future research.
The model also allows for an audio input, in the form of whistling or humming for example, to help to inform the melody of the song, which will then be “rendered in the style described by the text prompt”.
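One way to picture this melody conditioning: the hummed or whistled audio is reduced to a coarse pitch contour, which is then fed to the generator alongside the style text. The quantization below (semitones relative to A4 at 440 Hz) and the pairing function are hypothetical simplifications, not MusicLM's actual melody embedding.

```python
import math

def audio_to_melody_tokens(pitches_hz):
    """Quantize a hummed/whistled pitch contour (in Hz) into coarse
    melody tokens: semitone offsets from A4 (440 Hz). A real system
    would use a learned melody representation instead."""
    return [round(12 * math.log2(f / 440.0)) for f in pitches_hz]

def generate_conditioned(text_prompt, melody_tokens):
    """Hypothetical joint conditioning: pair each melody token with the
    style described by the text so both inform generation."""
    style = text_prompt.split(",")[0]
    return [(style, t) for t in melody_tokens]

hummed = [440.0, 493.88, 523.25, 440.0]   # roughly A4, B4, C5, A4
melody = audio_to_melody_tokens(hummed)    # [0, 2, 3, 0]
print(generate_conditioned("a cappella chorus, uplifting", melody))
```

The design point is separation of concerns: the audio input fixes *what* notes are sung, while the text prompt controls *how* they are rendered.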
It has not yet been released to the public, with the authors acknowledging the risks of potential “misappropriation of creative content” should a generated song not differ sufficiently from the source material the model learned from.