You’ve probably heard about some of the mind-blowing things that generative AI can do lately. Whether it’s photorealistic AI-generated images, human-like text, or creative works of art and music – generative AI is taking the world by storm.
The main types of generative AI include:
- Generative Adversarial Networks (GANs)
- Variational Autoencoders (VAEs)
- Diffusion Models
- Large Language Models (LLMs)
- Text-to-Image/Image-to-Text Models
- Audio/Speech Synthesis Models
But with all the different types and technical jargon floating around, it can be confusing to understand how it all works. Don’t worry, I’ve got you covered. In this post, we’ll break down the major categories and examples of generative AI in a way that’s easy to digest. Let’s dive in!
Generative Adversarial Networks (GANs)
First up, we have GANs or Generative Adversarial Networks. Think of these as an AI version of an art forger and an art detective working against each other. The “generator” neural network tries to create fake images, text, etc. that look realistic. Meanwhile, the “discriminator” neural net analyzes fakes and tries to spot them.
This back-and-forth training process continues until the generator produces counterfeits that are indistinguishable from the real thing. Through this adversarial training, GANs can learn to generate incredibly sharp and convincing AI-generated imagery, video, text passages, and more.
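If you're curious what that tug-of-war looks like in practice, here's a minimal sketch of a single GAN training step. It assumes PyTorch and uses tiny toy networks and random tensors in place of a real dataset, purely for illustration:

```python
import torch
import torch.nn as nn

# Toy generator: maps a random noise vector to a flattened 28x28 "image"
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
# Toy discriminator: scores how "real" a flattened image looks
discriminator = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, 784)  # stand-in for a batch of real training images

# --- Discriminator step: learn to tell real from fake ---
noise = torch.randn(32, 64)
fake_images = generator(noise).detach()
d_loss = loss_fn(discriminator(real_images), torch.ones(32, 1)) + \
         loss_fn(discriminator(fake_images), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# --- Generator step: learn to fool the discriminator ---
noise = torch.randn(32, 64)
g_loss = loss_fn(discriminator(generator(noise)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In a real training run these two steps simply repeat over the whole dataset, with the generator gradually getting better at fooling the discriminator.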
GANs have been used to produce highly realistic fake portraits of humans that don’t exist, sharpen and upscale low-resolution images, and create AI-generated avatars and characters for video games, films, and metaverse applications.
Key GAN Architectures and Use Cases:
- StyleGAN – Generates high-resolution AI faces/portraits
- CycleGAN – Unpaired image-to-image translation (e.g. horse to zebra)
- Pix2Pix – Paired image-to-image translation tasks
- Text-to-Image GANs – Generate images from text descriptions
- BiGAN/ALI – Learn an inverse mapping from images back to the latent space
Other emerging use cases include data augmentation to expand training datasets, automated content creation, AI image editing and manipulation, and even de-aging technology used in film/TV. GANs’ ability to generate realistic synthetic data is incredibly powerful.
Variational Autoencoders (VAEs)
Next, we have VAEs or Variational Autoencoders. These compress data like images or audio into a simplified code or latent space representation. Then they learn how to reconstruct the original data from that compressed code.
By introducing some randomness into that latent space, VAEs can generate brand-new data samples. The “variational” aspect allows them to model the probability distribution that represents the data.
Imagine having a friend describe a scene to you, and you draw a picture based on their description. That’s roughly how VAEs work, except the “description” is a compact mathematical code: the encoder compresses data into that code, and the decoder turns codes back into new visuals, audio clips, or other data.
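For the more technically inclined, here’s a minimal sketch of that encode-sample-decode loop, assuming PyTorch and toy layer sizes chosen purely for illustration. The reparameterization trick in the middle is what injects the controlled randomness into the latent code:

```python
import torch
import torch.nn as nn

latent_dim = 16

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())

x = torch.rand(8, 784)                     # stand-in batch of flattened images
mu, log_var = encoder(x).chunk(2, dim=-1)  # the encoder predicts a distribution, not a single point

# Reparameterization trick: sample z = mu + sigma * eps so gradients can still flow
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

x_recon = decoder(z)

# Loss = reconstruction error + KL term pulling the latent distribution toward a standard normal
recon_loss = nn.functional.binary_cross_entropy(x_recon, x, reduction="sum")
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
loss = recon_loss + kl_loss
```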
VAEs are used for all sorts of generative tasks like:
- Photo editing/manipulation
- Data augmentation and generation
- Image/video compression
- Recommendation engines
- Denoising and reconstruction tasks
By learning an underlying latent representation, VAEs disentangle and encode different factors of variation into the latent variables. This allows controlling and adjusting specific aspects when generating new outputs.
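A simple way to see that control in action is to walk between two latent codes and decode the points along the way. The sketch below is purely illustrative (an untrained toy decoder standing in for a real VAE), but the same interpolation trick is how smooth “morphing” effects are produced with trained models:

```python
import torch
import torch.nn as nn

latent_dim = 16
# Toy decoder standing in for a trained VAE decoder
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())

# Walk in a straight line between two latent codes and decode each step
z_a, z_b = torch.randn(latent_dim), torch.randn(latent_dim)
weights = torch.linspace(0, 1, 8).unsqueeze(1)   # 8 blend factors from 0 to 1
z_path = (1 - weights) * z_a + weights * z_b
morphed = decoder(z_path)                        # each row decodes to an in-between sample
```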
While their outputs are often not as sharp as GAN outputs, VAEs provide a flexible generative approach across domains like computer vision, natural language processing, and speech.
Diffusion Models
Now we get to diffusion models like DALL-E 2, Stable Diffusion, and Google’s Imagen, which have produced some of the most viral and impressive AI-generated artwork and imagery lately. These models work in reverse compared to most other generative approaches.
Instead of learning to generate data from scratch, diffusion models start by adding random visual noise or static to training images. Then they learn to undo that diffusion process step by step, converting pure noise back into a realistic, high-quality image.
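Here’s a rough sketch of that forward noising step, assuming PyTorch and a standard DDPM-style noise schedule; the specific numbers are illustrative rather than taken from any particular model:

```python
import torch

# A simple linear noise schedule over 1000 diffusion steps (DDPM-style)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.rand(1, 3, 64, 64)          # stand-in for a clean training image
t = torch.randint(0, T, (1,))          # pick a random timestep
noise = torch.randn_like(x0)

# Forward process: blend the clean image with Gaussian noise according to the schedule
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# Training then amounts to teaching a network to predict `noise` from (x_t, t);
# generation runs the learned denoising step in reverse, from pure noise back to an image.
```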
Think of it like shaking an Etch-A-Sketch into static and then redrawing the picture: the model starts from random noise and refines it step by step, guided by the text prompt, until a photorealistic final output emerges.
This iterative denoising process allows diffusion models to generate amazingly detailed visuals that accurately depict even complex concepts and scenes described in natural language prompts.
Beyond static images, diffusion models are also being applied to generate videos, 3D objects, audio, molecular structures for drug discovery, and other modalities from high-level text descriptions.
Diffusion is one of the newest and most promising frontiers in generative AI, allowing humans to transform their imaginations into almost any visual reality seamlessly.
Large Language Models (LLMs)
While they are sometimes discussed separately from image and audio generators, large language models like GPT-3, PaLM, Jurassic-1, and others have brought incredible text generation capabilities into the mainstream.
These transformer-based models use self-attention mechanisms and are trained on internet-scale datasets to understand language, context, and how to produce relevant human-like text on virtually any topic given a prompt or conversational context.
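The self-attention operation at the heart of these transformers is surprisingly compact. Here’s a stripped-down, single-head sketch in PyTorch (no masking, multi-head machinery, or positional encodings), just to show the shape of the computation:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of token vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # project tokens to queries/keys/values
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)              # how much each token attends to every other token
    return weights @ v                                    # weighted mix of the value vectors

seq_len, d_model = 10, 64
x = torch.randn(seq_len, d_model)                         # stand-in token embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                    # shape: (10, 64)
```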
So while they can’t directly generate images, audio, or other media from scratch, LLMs are generative in the sense that they generate brand-new written content like:
- Stories and creative writing
- Essays and analysis
- Code across dozens of programming languages
- Song lyrics and poetry
- Dialogue and conversational responses
LLMs can produce fluent long-form writing that stays coherent and on-topic for pages at a time. They are a core component of chatbots, writing assistants, code auto-completion tools, and other text-generation applications.
Their ability to model language at such a high level and produce substantive content from simple prompts makes large language models one of the most versatile and accessible forms of generative AI.
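If you want to try text generation yourself, the Hugging Face `transformers` library makes it easy to load a model. The snippet below uses the small, open GPT-2 model as a stand-in for the much larger proprietary LLMs mentioned above (it assumes `transformers` is installed along with a backend such as PyTorch):

```python
from transformers import pipeline

# GPT-2 is a small, open stand-in for the much larger models discussed above
generator = pipeline("text-generation", model="gpt2")

result = generator(
    "Write a short poem about the ocean:",
    max_new_tokens=60,     # how much text to generate beyond the prompt
    do_sample=True,        # sample rather than always picking the most likely token
    temperature=0.9,       # higher = more varied / less predictable output
)
print(result[0]["generated_text"])
```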
Text-to-Image/Image-to-Text Models
We’ve talked about models that generate images and text. But now we’re seeing increasingly sophisticated multimodal models that can map between and fuse these different data types.
Text-to-image models like DALL-E 2, Stable Diffusion, and Google’s Imagen leverage diffusion and other techniques to create photorealistic images from text prompts or descriptions. You describe a scene or concept, and the model generates a high-resolution visual rendering.
These models combine natural language understanding with computer vision and image generation. By encoding both language and image data, they learn a joint embedding space that maps between the two modalities.
Meanwhile, image-to-text models work in the opposite direction. Models like CLIP embed an image and candidate captions into the same space to find the best-matching description, while dedicated captioning models generate detailed text describing the content, attributes, and context of an image.
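To get a feel for that shared embedding space, here’s a small example that uses CLIP through the Hugging Face `transformers` library to score how well a set of candidate captions matches an image. It assumes `transformers`, `torch`, and `Pillow` are installed, and the image path is just a placeholder:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path to any local image
captions = ["a photo of a dog", "a photo of a cat", "a city skyline at night"]

# Embed the image and the candidate captions into the same space and compare them
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)  # similarity of the image to each caption
for caption, p in zip(captions, probs[0]):
    print(f"{p.item():.2%}  {caption}")
```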
The most advanced multimodal models can engage in back-and-forth dialogue, answering follow-up questions about generated images, making edits based on additional text feedback, or generating new visuals from those captions. These capabilities pave the way for generative AI systems that navigate between vision and language.
Audio/Speech Synthesis
Last, but not least, we have AI models that can generate realistic speech, music, sound effects, and other audio signals from text or simple symbolic inputs like musical notes.
For speech synthesis, models like WaveNet and Tacotron, along with flow-based approaches, can produce human-like voices speaking arbitrary sentences or conversational responses with natural cadence and intonation.
In music and audio generation, AI systems like Music Transformer, Riffusion, and GRUI learn to compose songs, melodies, rhythms, and more either from text descriptions like “generate a soulful blues song about heartbreak” or symbolic inputs like chord progressions.
These models leverage neural architectures that capture complex audio waveforms, modeling the low-level raw audio in addition to high-level semantic inputs. Applications span text-to-speech, audio editing, music generation, sound design for films/games, and more.
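At the lowest level, audio is just a long sequence of amplitude samples, which is exactly the representation waveform models like WaveNet learn to predict one step at a time. The toy snippet below involves no neural network at all; it simply writes a synthetic waveform to a WAV file (using NumPy and the standard library) to show what that raw representation looks like:

```python
import wave
import numpy as np

sample_rate = 16000                       # 16,000 amplitude samples per second
t = np.linspace(0, 1.0, sample_rate, endpoint=False)

# A 440 Hz tone with a little vibrato, standing in for the waveforms neural models generate
waveform = 0.5 * np.sin(2 * np.pi * 440 * t + 3 * np.sin(2 * np.pi * 5 * t))
samples = (waveform * 32767).astype(np.int16)   # scale to 16-bit PCM

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 2 bytes = 16-bit samples
    f.setframerate(sample_rate)
    f.writeframes(samples.tobytes())
```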
Neural audio synthesis is helping bring generative AI into auditory domains like music and voice in powerful new ways. The rich creative potential of AI-generated soundscapes and audio is just beginning to be tapped.
How Are Generative AI Models Trained?
Generative AI models go through a multi-stage training process to build their artificial creativity:
First is foundational pre-training on massive datasets of images, text, audio, etc. totaling billions of examples. This allows the models to learn broad patterns and distributions. Popular pre-training techniques include self-supervised tasks like next word/pixel prediction.
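For example, the next-word objective used to pre-train language models boils down to a shifted cross-entropy loss. Here’s a minimal PyTorch sketch with a toy vocabulary and a stand-in model, for illustration only:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 12

# Stand-in "model": embedding + linear head (a real LLM would put a transformer in between)
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (4, seq_len))   # a batch of token-ID sequences

logits = head(embed(tokens))                          # predicted scores for the next token at each position
# Self-supervised objective: each position is trained to predict the token that follows it
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),           # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1),                        # targets are the inputs shifted left by one
)
loss.backward()
```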
Next is the scaling challenge. State-of-the-art generative models now have billions of parameters requiring immense computing power. AI-optimized hardware like TPUs/GPUs, model/data parallelism, and improved optimization algorithms enable training these large-scale models.
Finally is the fine-tuning stage, where the broadly pre-trained model undergoes additional training on specialized datasets tailored to the target generative task – whether captioning images, generating reports, rendering objects, or composing music.
While expensive in terms of data and compute demands, this multi-stage training paradigm endows generative AI models with broad knowledge that can be purposefully steered toward creative output.
Their uncanny ability to synthesize novel content from training data, rather than simply regurgitating it, is what makes the generative AI field so exciting.
Conclusion
There you have it – a guided tour through some of the different types and techniques driving the generative AI revolution across images, language, audio, and more.
Whether generating photorealistic visuals, working code, creative writing, or entirely new musical compositions, these models are unleashing new waves of artificial intelligence that can produce substantive content rather than merely processing or regurgitating what already exists.