Multimodal AI built with Tar Heel research
A computer science professor and his student teamed with Microsoft Research to produce breakthrough technology.
In the past year, researchers at UNC-Chapel Hill helped engineer one of the most significant breakthroughs in artificial intelligence.
Working with a team at Microsoft Research, Carolina computer science professor Mohit Bansal and his student Zineng Tang, a Microsoft intern, created the CoDi AI system — a model capable of generating any combination of outputs (e.g., text, images, videos, audio) from any combination of inputs.
Microsoft Research featured the project on its website last summer, and a few months later, the team presented the revamped CoDi-2 to much fanfare.
Why all the fuss? What makes CoDi such a big deal?
Previous generative AI systems performed one-to-one tasks. For instance, a user might type in “draw a picture of a frog” and get an image of a frog (text-to-image) or submit a photo and get a caption (image-to-text).
CoDi isn’t limited to one-to-one tasks. Short for “composable diffusion,” CoDi was the first AI model that could take any combination of inputs – text, audio, image, video – and produce any combination of outputs, using an idea called “bridge alignment” that gives the tool immense creative power. Most importantly, it does so without requiring a separate training objective for every input-output pairing (which would be computationally infeasible) or paired training data for all of those combinations (which is largely unavailable).
“CoDi is a very novel model in the AI community because it can effectively and efficiently handle unseen combinations of input/output modalities without relying on training the model on such expensive and hard-to-find data,” said Bansal, the computer science department’s John R. & Louise S. Parker Professor and the director of its MURGe-Lab. “This opens up a lot of exciting new applications.”
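For readers curious how that works under the hood, the core idea can be sketched in a few lines of code. The snippet below is a minimal, illustrative sketch of bridge alignment, not the actual CoDi implementation: each modality gets its own encoder into a shared latent space, and every encoder is contrastively aligned to a common “bridge” modality (here, text), so modality pairs that were never trained together still land in a compatible space. The encoder sizes, loss, and random features are assumptions chosen for illustration.

```python
# Illustrative sketch of "bridge alignment" (not the CoDi codebase).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Stand-in encoder mapping one modality's features into a shared latent space."""
    def __init__(self, in_dim: int, latent_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, latent_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm latents make the contrastive alignment below well behaved.
        return F.normalize(self.proj(x), dim=-1)

def contrastive_alignment(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls paired latents from two modalities together."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Align image and audio encoders to the text encoder (the "bridge"), never to each other,
# so unseen image<->audio combinations still meet in the same shared space.
text_enc, image_enc, audio_enc = ModalityEncoder(768), ModalityEncoder(1024), ModalityEncoder(512)
text_feat, image_feat, audio_feat = torch.randn(8, 768), torch.randn(8, 1024), torch.randn(8, 512)

loss = (contrastive_alignment(image_enc(image_feat), text_enc(text_feat))
        + contrastive_alignment(audio_enc(audio_feat), text_enc(text_feat)))
loss.backward()
```

Because every encoder only needs paired data with the bridge modality, the number of training objectives grows with the number of modalities rather than with the number of modality combinations.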
The website for the CoDi project includes several examples of this multimodal generative process:
- User inputs a picture of Times Square, an audio clip of rain and the text “teddy bear on a skateboard,” and CoDi produces a video clip of a skating teddy bear on a rainy day in Times Square.
- User inputs a picture of a forest and an audio clip of a piano, and CoDi produces a picture of a man playing piano in the forest with the text “playing piano in a forest.”
- User types “train coming into station,” and CoDi produces a video, with audio, of a train pulling in.
The recently released CoDi-2 extends the original CoDi with a large language model framework, making it even more intuitive and interactive and able to handle more complex instructions that interleave multiple modalities.
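The article doesn’t detail CoDi-2’s interface, but the gist of an interleaved instruction can be sketched roughly as follows. In this hypothetical example (the Segment type, dimensions, and placeholder embedding functions are assumptions, not the model’s API), text and non-text pieces of a single instruction are flattened into one embedding sequence that a language-model backbone can reason over.

```python
# Illustrative sketch of an interleaved multimodal instruction (not CoDi-2's API).
from dataclasses import dataclass
from typing import List, Literal, Union
import torch

D_MODEL = 512  # assumed embedding width

@dataclass
class Segment:
    kind: Literal["text", "image", "audio"]
    content: Union[str, torch.Tensor]  # raw text, or precomputed modality features

def embed_text(text: str) -> torch.Tensor:
    # Placeholder: a real system would tokenize and embed the text.
    return torch.randn(len(text.split()), D_MODEL)

def project_modality(features: torch.Tensor) -> torch.Tensor:
    # Placeholder: a real system would project encoder features into the LLM's space.
    return features

def embed_interleaved(prompt: List[Segment]) -> torch.Tensor:
    """Flatten an interleaved instruction into one embedding sequence for an LLM backbone."""
    pieces = []
    for seg in prompt:
        pieces.append(embed_text(seg.content) if seg.kind == "text" else project_modality(seg.content))
    return torch.cat(pieces, dim=0)

# An instruction that interleaves text with image and audio features,
# e.g. "Make a video of <this scene> with <this sound> at dusk."
prompt = [
    Segment("text", "Make a video of"),
    Segment("image", torch.randn(16, D_MODEL)),  # placeholder image features
    Segment("text", "with"),
    Segment("audio", torch.randn(8, D_MODEL)),   # placeholder audio features
    Segment("text", "at dusk."),
]
sequence = embed_interleaved(prompt)  # shape: (total_tokens, D_MODEL)
```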
AI technology is still developing, but there’s no doubt that the CoDi project has made massive waves. Bansal’s student, Tang, was named a recipient of the prestigious 2023 Computing Research Association Outstanding Undergraduate Researcher Award — one of only four winners across North America. Tang received several top offers and is continuing his education as a doctoral student at the University of California, Berkeley.
Meanwhile, Bansal continues to dive into AI at Carolina, and he imagines a future where the technology could make a significant impact in the classroom. He is co-principal investigator and the core-AI lead for the National Science Foundation AI Institute for Engaged Learning. At the institute, researchers are applying similar multimodal technology, including Bansal’s newer work on video and diagram generation, to build AI assistants that improve the classroom experience for students and teachers.
“Teachers and students will be able to create interesting, visual stories, especially with CoDi-2,” Bansal said. “They can even talk to it or interact with it, create complex videos, even trailers of complex concepts to visually explain them more easily and interactively build them.”