DALL-E is a neural network that creates images from text captions. Trained on a dataset of text-image pairs, it can produce archetypal images of animals, artwork, and objects; it can also relate separate concepts in plausible ways and apply transformations to existing images. DALL-E has stirred up controversy because many people fear it might replace human artists, but don’t worry, this will not be the case.

DALL-E is a 12-billion-parameter version of GPT-3 (Generative Pre-trained Transformer) that receives both the text and the image as a single stream of 1,280 tokens. The newer version of the model, DALL-E 2, generates images with 4x greater resolution than the original DALL-E.
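
To give a rough sense of that single stream: the original DALL-E paper describes a budget of up to 256 BPE tokens for the caption plus a 32 x 32 grid of discrete image tokens, which together fill the 1,280 positions. The tiny Python sketch below just spells out that arithmetic (the helper function is purely illustrative):

```python
# Rough breakdown of DALL-E's single text+image token stream.
# Figures follow the original DALL-E paper; this helper is only for illustration.
TEXT_TOKENS = 256                        # caption, encoded with BPE, capped at 256 tokens
IMAGE_GRID = 32                          # image compressed to a 32 x 32 grid of discrete codes
IMAGE_TOKENS = IMAGE_GRID * IMAGE_GRID   # 1,024 image tokens

def stream_length(text_tokens: int = TEXT_TOKENS, image_tokens: int = IMAGE_TOKENS) -> int:
    """Length of the combined caption + image token stream."""
    return text_tokens + image_tokens

print(stream_length())  # 1280: text and image share one autoregressive sequence
```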

The model first encodes the input image-caption pair with CLIP, representing each as a vector: an image embedding and a text embedding. A prior model is then trained on top of CLIP to map the text embedding to a corresponding image embedding; together, this stage can be thought of as image-text processing. The predicted image embedding is then handed to a diffusion decoder, known as the unCLIP decoder, which generates the final image. DALL-E, in short, is the combination of image-text processing (CLIP plus the prior) and the unCLIP diffusion decoder.
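
A minimal sketch of that two-stage flow is shown below. Every function here is a stand-in for a real network (CLIP’s text encoder, the prior, and the diffusion decoder); the names and toy vectors are purely illustrative, and the point is only to show what feeds into what:

```python
import random

EMBED_DIM = 8  # toy dimensionality; real CLIP embeddings are much larger

def clip_text_encoder(caption: str) -> list[float]:
    """Stand-in for CLIP's text encoder: caption -> text embedding."""
    random.seed(hash(caption) % (2 ** 32))
    return [random.random() for _ in range(EMBED_DIM)]

def prior(text_embedding: list[float]) -> list[float]:
    """Stand-in for the prior: predicts a CLIP image embedding from the text embedding."""
    return [0.9 * x + 0.05 for x in text_embedding]

def unclip_decoder(image_embedding: list[float]) -> str:
    """Stand-in for the unCLIP diffusion decoder: image embedding -> pixels."""
    return f"<image generated from a {len(image_embedding)}-dimensional embedding>"

caption = "a house surrounded by hills with a lake in front"
text_emb = clip_text_encoder(caption)  # step 1: CLIP encodes the caption
image_emb = prior(text_emb)            # step 2: the prior maps text embedding -> image embedding
picture = unclip_decoder(image_emb)    # step 3: the diffusion decoder renders the image
print(picture)
```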

Let me simplify it with an example: “a house surrounded by hills with a lake in front.” Going from the sentence to the mental imagery is what image-text processing does. Translating the imagery in your mind into a real drawing is what unCLIP does. Now, think about which features best represent the sentence “a house surrounded by hills with a lake in front” (there is a house, sun, hill, etc.) and which represent the image (the colors, the shades, the styles…). Encoding the features of a sentence and an image into comparable vectors is what CLIP does.
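
To make the CLIP part concrete, here is a short sketch using an openly released CLIP checkpoint through Hugging Face’s transformers library. This is not the exact model inside DALL-E; the checkpoint name is just a public stand-in, and the blank placeholder image would be swapped for a real photo:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint used purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "a house surrounded by hills with a lake in front"
image = Image.new("RGB", (224, 224))  # placeholder; replace with a real image

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

text_embedding = outputs.text_embeds    # "which features represent the sentence"
image_embedding = outputs.image_embeds  # "which features represent the image"

# Cosine similarity between the two embeddings scores how well caption and image match.
similarity = torch.nn.functional.cosine_similarity(text_embedding, image_embedding)
print(similarity.item())
```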

art credit: ashwin sangareddypeta

DALL-E has been making an impact in many sectors. For example, it can help structure and restructure buildings: by rendering a building from a sequence of equally spaced viewing angles, it is possible to recover a smooth animation of its architecture. Because DALL-E is accessible through a public platform, people without specialized training can use it. Consider a builder who wants to restructure an existing apartment. He would normally have to sketch the new design manually or hire an experienced architect to produce a safe plan of the building. With DALL-E, he can type his instructions in plain words, and the model generates pictures of the restructured building within a few seconds.
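
As a sketch of that workflow, the snippet below shows roughly how such a prompt could be sent to DALL-E through OpenAI’s hosted API using the official Python SDK. The model name, prompt, and size are just examples, and a valid API key is assumed:

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# The builder describes the restructured apartment in plain words.
prompt = (
    "A modern two-storey apartment building with a glass facade, "
    "a rooftop garden, and an open parking area, architectural concept render"
)

response = client.images.generate(
    model="dall-e-2",
    prompt=prompt,
    n=2,              # ask for a couple of candidate designs
    size="1024x1024",
)

for i, candidate in enumerate(response.data):
    print(f"candidate {i}: {candidate.url}")
```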

DALL-E appears to be able to apply some optical distortions to scenes, such as “fisheye lens” and “spherical panorama” views, and it shows some ability to render reflections as well.

DALL-E has another cool feature, interpolation, with which it can transform one image into another. For example, if you want to see an ‘unmodernized’ version of an iPhone, DALL-E can transform the current version of the iPhone into an older-looking one. Likewise, if you want to turn a Victorian house into a modern house, DALL-E does so while maintaining the semantic coherence of the original picture. You can also ask for changes in objects, landscapes, clothing, and more simply by changing a word in the prompt, and get results in real time.
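
The closest publicly exposed version of this kind of image-to-image change is the edit (inpainting) and variation endpoints of OpenAI’s API; the sketch below assumes those endpoints and the same Python SDK as above, with placeholder file names:

```python
from openai import OpenAI

client = OpenAI()

# Inpainting-style edit: the transparent area of mask.png marks what may be repainted,
# and the prompt describes the desired result (file names are placeholders).
edited = client.images.edit(
    image=open("victorian_house.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="A modern minimalist house with large glass windows and a flat roof",
    n=1,
    size="1024x1024",
)
print(edited.data[0].url)

# Variation: keep the overall composition but explore alternative renderings.
variation = client.images.create_variation(
    image=open("victorian_house.png", "rb"),
    n=1,
    size="1024x1024",
)
print(variation.data[0].url)
```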

art credit: ashwin sangareddypeta

DALL-E has also become an extremely powerful tool in the fashion field. In a fashion show named ‘Trillo’s Creative Arsenal’, Trillo, a famous fashion designer, produced an impressive series of stop-motion composites combining real-world imagery with DALL-E’s synthetic creations. In collaboration with his wife, he put together a beautiful 30-second fashion show whose outfits were generated with AI. This is an interesting and smart way to brainstorm costumes and fashion ideas.

In this way, time, effort, and budget are saved, since much of the process is automated by DALL-E. The artist’s ideas can also be represented as images or videos even before they are exhibited on models. Designers can likewise vary costumes by gender, for example a male mannequin dressed in a blue shirt with a leather jacket or a female mannequin dressed in a gold pleated skirt.
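
A small brainstorming loop along those lines, again assuming the OpenAI Python SDK from the earlier sketch (the outfit descriptions are arbitrary placeholders):

```python
from openai import OpenAI

client = OpenAI()

outfits = [
    "a male mannequin dressed in a blue shirt with a black leather jacket",
    "a female mannequin dressed in a gold pleated skirt",
    "a male mannequin dressed in a linen suit with a straw hat",
]

for outfit in outfits:
    result = client.images.generate(
        model="dall-e-2",
        prompt=f"{outfit}, studio lighting, fashion lookbook photo",
        n=1,
        size="512x512",
    )
    print(outfit, "->", result.data[0].url)
```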

Some disputes have arisen because DALL-E can depict people and environments differently depending on identity attributes such as race, gender, and nationality. Its editing capability also enables deepfake-style manipulation, letting people add or remove objects or people from an image, even though this is prohibited by OpenAI’s content policy. While such fakes can look close to reality, they can also be deliberately misleading. DALL-E is also very bad at spelling, because it does not learn spelling information from the text appearing in its dataset images.

Furthermore, DALL-E cannot replace human work, owing to its technical limitations. Its lack of common sense, understanding, and coherence separates DALL-E from the way human beings think. The model can also be over-trained at times, which makes its behavior unreliable when prompts require compositional thinking (for example, binding the right attributes to the right objects).

We should neither underestimate nor overestimate DALL-E’s abilities. It is a creative, versatile tool that can make our work easier in many ways. Nonetheless, hiring across different sectors will not come to a halt: DALL-E can make the work easier, but it cannot replace human intelligence.