Meta introduces new genAI model for text, image generation
Adding to its range of generative artificial intelligence offerings, Meta has now introduced a new text-to-image and image-to-text generation tool called CM3leon. Pronounced as chameleon, the new tool achieves state-of-the-art performance in text-to-image generation despite being trained on five times less compute as compared to previous transformer-based methods, Meta said in a blog.
CM3leon's architecture is based on a decoder-only transformer, similar to other generative AI models like GPT-3, GPT–4, ChatGPT, and LaMDa. However, the ability to input and generate both text and images sets CM3leon apart from other text-based models, Meta claims.
Text-only generative models are tuned on a wide range of tasks to improve their ability to generate accurate results as per the instruction prompts. Image generation models, however, are generally specialised for particular tasks. For CM3leon, Meta engineers have applied large-scale multitask instruction tuning (generally used for text-only generative models) for both text and image generation. This allows improved performance on tasks like caption generation, visual question answering, text-based editing, and conditional image generation.
“With the goal of creating high-quality generative models, we believe CM3leon’s strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding. Models like CM3leon could ultimately help boost creativity and better applications in the metaverse,” the blog read.
Last month, Meta announced a generative AI model called Voicebox for converting text to speech; it includes features to edit audio and work across languages.
The system has been trained on more than 50,000 hours of unfiltered audio. Specifically, Meta used recorded speech and transcripts from a bunch of public-domain audiobooks written in English, French, Spanish, German, Polish, and Portuguese. That diverse data set, according to Meta researchers, allows the system to “generate more conversational sounding speech, regardless of the languages spoken”.