Not long ago, Ilya Sutskever, co-founder and chief scientist of OpenAI, wrote in The Batch's 2020 year-end special issue, edited by Andrew Ng, that "in 2021, language models will begin to understand the visual world." At the beginning of this month, OpenAI announced two multimodal AI systems, DALL·E and CLIP, once again setting the AI community abuzz.
DALL·E: a 12-billion-parameter version of GPT-3 trained on a text-image dataset, which can generate a wide variety of images from text;
CLIP: learns visual concepts efficiently from natural-language supervision. Given nothing more than the names of the visual categories to be recognized, CLIP can perform arbitrary visual classification, much like the "zero-shot" capabilities of GPT-2 and GPT-3.
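The zero-shot step described above amounts to comparing an image embedding against a text embedding for each candidate class name and picking the closest match. The following is a minimal sketch of that scoring step only, using random NumPy vectors as hypothetical stand-ins for real CLIP encoder outputs (the function names and the 512-dimension size here are illustrative assumptions, not OpenAI's released API):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    # Cosine similarity between the image and each class-name prompt,
    # scaled by a temperature and softmaxed into class probabilities
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    logits = temperature * text_embs @ image_emb
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical embeddings standing in for real CLIP encoder outputs
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
class_names = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.normal(size=(len(class_names), 512))
# Make the "cat" prompt deliberately close to the image embedding
text_embs[0] = image_emb + 0.1 * rng.normal(size=512)

probs = zero_shot_classify(image_emb, text_embs)
print(class_names[int(np.argmax(probs))])  # the closest prompt wins: "a photo of a cat"
```

The key point is that the classifier is defined entirely by the list of prompt strings: swapping in different class names yields a different classifier, with no retraining.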
The name DALL·E is a portmanteau of the artist Salvador Dalí and Pixar's WALL·E, a name that itself evokes machine imagination and artistic exploration. DALL·E is very similar to GPT-3: it is also a transformer language model, one that receives both text and image as input and outputs the final generated image in many forms. It can edit the attributes of specific objects in an image, and can even control multiple objects and their attributes at the same time. This is a very complex task, because the network must understand the relationships between objects and create images based on that understanding.
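Concretely, "receives text and images as input" means the caption's BPE tokens and the image's discrete codes are laid out as one sequence that the transformer models autoregressively. The sketch below illustrates that input layout; the sequence lengths and the 8192-entry image codebook follow the DALL·E paper, while the text vocabulary size and helper function are illustrative assumptions:

```python
TEXT_LEN = 256      # DALL·E caps captions at 256 BPE tokens
IMAGE_GRID = 32     # images are compressed to a 32x32 grid of discrete codes
IMAGE_VOCAB = 8192  # by a discrete VAE with an 8192-entry codebook
TEXT_VOCAB = 16384  # assumed text BPE vocabulary size (illustrative)

def build_sequence(text_tokens, image_tokens):
    """Concatenate caption tokens and image codes into one token stream.
    Image ids are offset past the text vocabulary so both modalities
    can share a single embedding table without colliding."""
    assert len(text_tokens) <= TEXT_LEN
    assert len(image_tokens) == IMAGE_GRID * IMAGE_GRID
    return list(text_tokens) + [TEXT_VOCAB + t for t in image_tokens]

# A 10-token caption followed by a full 32x32 grid of image codes
seq = build_sequence(list(range(10)), [0] * 1024)
print(len(seq))  # 1034 tokens in the combined sequence
```

At generation time the model is given only the caption tokens and samples the image codes one at a time, which is what lets a single language model "output" pictures.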
Put simply, DALL·E is a 12-billion-parameter version of GPT-3 trained on a text-image dataset: given an input text, it generates the corresponding image, for example:
「a store front that has the word ‘OpenAI’ written on it」
「the exact same cat on the top as a sketch on the bottom」