Google MusicLM AI model claims to generate music from text
A team of researchers at Google published a paper on January 26 detailing a new ‘language model’ (LM) called MusicLM, an Artificial Intelligence (AI) tool trained on a massive amount of text and audio data to create an engine that can generate 30-second pieces of music from a written prompt. The model has not been released to the public yet, but it shows the potential for AI to one day assist musicians in producing their own audio tracks.
MusicLM is conceptually similar to AudioLM, an earlier AI tool that used partial audio data, such as an incomplete sentence spoken by a user, to generate completed sentences, paragraphs, or even music tracks based on that input. However, MusicLM has been trained on a much larger data set, according to the research paper published by the Google team. It can also take text inputs and turn them into audio, in a manner similar to Microsoft’s voice-simulating AI tool Vall-E, OpenAI’s music generator Jukebox, and even non-audio generative AI tools such as the text-generating ChatGPT and the image generators Dall-E, Midjourney and Lensa.
This new crop of AI tools falls under a category called generative AI: tools that use a set of basic human inputs to ‘generate’ varying forms of content, drawing on the thousands of hours of data these algorithms have been trained on. For instance, ChatGPT, which has seen widespread popularity and criticism alike, is reportedly built on a model with nearly 200 billion parameters. Google’s MusicLM, according to the research paper, is trained on over 280,000 hours of music and audio data, which corresponds to nearly 32 years of non-stop audio.
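That conversion can be verified with a quick back-of-the-envelope calculation; the sketch below only assumes the 280,000-hour figure reported in the paper, and the variable names are illustrative.

```python
# Sanity check: convert 280,000 hours of training audio (figure reported in
# the MusicLM paper) into years of continuous playback.
HOURS_OF_AUDIO = 280_000
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year

years = HOURS_OF_AUDIO / HOURS_PER_YEAR
print(f"{HOURS_OF_AUDIO:,} hours is about {years:.1f} years of non-stop audio")
# Prints: 280,000 hours is about 32.0 years of non-stop audio
```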
As demonstrated in a report by US publication TechCrunch, MusicLM can take prompts asking for a specific genre or type of audio track, such as ‘an arcade video game theme’. This is where it differs from AudioLM, which was largely limited to synthesizing statements and sentences. MusicLM is also different from Microsoft’s Vall-E, which the company showcased in demos on January 9: that tool listens to a three-second audio snippet spoken by a user and seeks to replicate other sentences in that user’s voice, articulation and pronunciation.
However, the researchers have refrained from releasing MusicLM for public use, for various reasons. According to the paper, approximately 1% of the audio snippets MusicLM generates reproduce copyrighted tracks from its training data. While that percentage may seem small, the researchers note that since generative tools can be used to create hundreds of thousands of tracks, copyright infringement could become a major issue if the tool were released immediately.
MusicLM can also generate and overlay human voices on the tracks it creates, but any lyrics it produces sound garbled and incoherent.
The tool arrives at a time when a growing number of artists have voiced concerns about the use of generative AI in creative fields such as painting, writing and, now, music. Groups of artists in the US have questioned whether using their copyrighted work to train these AI systems itself amounts to infringement of their intellectual property.
Generative AI tools also hit a roadblock recently with US publication Cnet, which issued a public letter last week explaining its use of AI in the newsroom to generate search engine-friendly content, a practice that fellow publications noted could help it garner larger advertising revenue even when the articles were not factually accurate. Beyond the factual errors, an article by Futurism published on January 16 reported that Cnet’s AI-written articles also contained plagiarized content, underscoring the issues with generative AI at the moment.