Classification of Generative AI Models

Generative AI Model Architecture: This is the model's basic structure or design. It includes how its layers, neural networks, and components are arranged and organized. The model's architecture determines how it processes and generates information, which makes it a critical aspect of its functionality and of its suitability for specific tasks. Table 4 describes the architecture components and training methods used in generative AI models.

Table 4. Architecture components and training methods used in generative AI models.

Model | Architecture Components | Training Method
Variational Autoencoders | Encoder–Decoder | Variational Inference
Generative Adversarial Networks | Generator–Discriminator | Adversarial
Diffusion Models | Noising (Forward)–Denoising | Iterative Refinement
Transformers | Encoder–Decoder | Supervised
Language Models | Recurrent Neural Networks | Supervised
Normalizing Flow Models | Coupling Layers | Maximum-Likelihood Estimation
Hybrid Models | Combination of Different Models | Varied

The classification of generative models based on architecture provides insights into the specific components and training methods that define each model, as shown in Figure 3. These architectural choices have significant implications for how the models generate new data points and learn from the available data. By understanding these distinctions, researchers and practitioners can choose the most suitable generative model for a specific task or explore hybrid approaches that combine different models to leverage their respective strengths.

Variational autoencoders (VAEs) have an encoder–decoder architecture and use variational inference for training. They learn compressed representations of input data and generate new samples by sampling from the learned latent space. Generative adversarial networks (GANs) consist of a generator and a discriminator. They are trained adversarially, with the generator producing synthetic samples to fool the discriminator, and they excel at generating realistic and diverse data. Diffusion models involve a noising step followed by a denoising step; they iteratively refine noisy inputs to generate high-quality samples, and training involves learning the dynamics of the diffusion process.

Transformers employ an encoder–decoder architecture and use self-attention mechanisms to capture global dependencies. They are commonly used in tasks such as machine translation and generate coherent sequences through supervised training. Language models, often based on recurrent neural networks (RNNs), generate sequences by predicting the next token; they are trained through supervised learning and excel at generating natural language sequences. Normalizing flow models use coupling layers to transform data while preserving density; they learn complex distributions by transforming a simple base distribution and are trained via maximum-likelihood estimation. Hybrid models combine different architectures and training methods to leverage their respective strengths, offering flexibility and tailored generative capabilities by integrating elements from multiple models.


Figure 3. Classification of the generative AI models based on the architecture.


Source: Ajay Bandi, Pydi Venkata Satya Ramesh Adapa, and Yudu Eswar Vinay Pratap Kumar Kuchi, https://www.mdpi.com/1999-5903/15/8/260
This work is licensed under a Creative Commons Attribution 4.0 License.

Variational Autoencoders (VAE)

A variational autoencoder (VAE) is a type of autoencoder that combines variational inference with an encoder–decoder architecture. Autoencoders consist of an encoder network that maps high-dimensional data to a low-dimensional representation and a decoder network that reconstructs the original input from the representation. However, traditional autoencoders lack the ability to generate new data points.

As shown in Figure 4, the encoder network of a VAE maps the input data (x) to the parameters of a probability distribution in a latent space (z) using an input layer and hidden layers composed of neural network units, such as dense or convolutional layers. This distribution is often modeled as a multivariate Gaussian whose mean and covariance parameters are produced by dedicated mean and variance layers. Samples are drawn from this latent distribution in a sampling layer and passed through the decoder network, with its own hidden and output layers, to produce new data points (y). By sampling from the approximate posterior distribution in the latent space, VAEs can generate diverse outputs resembling the training data.


Figure 4. Typical structure of variational autoencoder (VAE).

Neural networks, such as fully connected networks or convolutional neural networks (CNNs), are commonly used as encoders and decoders in VAEs. The specific architecture depends on the data and its complexity. For images or grid-like data, CNNs or deconvolutional neural networks (also known as convolutional transpose networks) are often employed as decoders. Training a VAE involves optimizing the model parameters by minimizing a loss function comprising a reconstruction loss and a regularization term. The reconstruction loss measures the discrepancy between the input data and the reconstructed output, while the regularization term computes the Kullback–Leibler (KL) divergence between the approximate posterior distribution and a chosen prior distribution in the latent space. This term promotes smoothness and regularization. The training process of a VAE includes selecting the network architecture, defining the loss function, and iterating through batches of input data. The encoder processes the data, latent space points are sampled, and the decoder reconstructs the input. The total loss, combining the reconstruction loss and regularization term, is computed, and gradients are used to update the model parameters through backpropagation.
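
To make this loss concrete, the following minimal sketch (written in PyTorch purely for illustration; the encoder, decoder, and layer sizes are hypothetical placeholders rather than the architecture used in the source article) shows how the reconstruction loss and the KL regularization term are combined, and how sampling from the latent space uses the reparameterization trick.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mean layer
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # reparameterization: sample a latent point
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction loss (assumes inputs scaled to [0, 1])
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # KL divergence between the approximate posterior and a standard normal prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

In practice the relative weight of the KL term is often tuned, but the overall structure of the objective stays the same.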

While VAEs offer generative modeling and capture complex latent representations, they may suffer from issues such as blurry reconstructions and challenges in evaluating the quality of generated samples. Researchers have proposed various improvements to VAEs to address these concerns, such as the vector-quantized variational autoencoder (VQ-VAE), which introduces a discrete latent space by quantizing encoder outputs, leading to a more structured and interpretable latent representation; the recurrent variational autoencoder (RVAE), which extends VAEs to sequential data by incorporating recurrent architectures, allowing for sequence generation and anomaly detection; the constrained graph variational autoencoder (CGVAE), which models graph-structured data using graph neural networks, enabling generation while preserving structural properties; the crystal diffusion variational autoencoder (CDVAE), which generates crystal structures in materials science by combining a VAE with a diffusion process to learn a continuous representation of crystal structures; and the junction tree variational autoencoder (JT-VAE), which leverages junction trees, a type of graphical model, to capture complex dependencies between variables in structured domains such as natural language processing or bioinformatics.

Normalizing Flow Models

Normalizing flow models are deterministic and invertible transformations between the raw data space and the latent space. Unlike other generative models such as GANs or VAEs, which introduce latent variables and transform them to generate new content, normalizing flow models directly learn the mapping between two distributions by manipulating the Jacobian determinant. In Figure 5, a normalizing flow applies a sequence of invertible transformations to a simple probability distribution (z) to model more complex probability distributions, using affine coupling layers in the encoder (flow). The decoding (inverse) function is designed to be the exact inverse of the encoding function, built from the same affine coupling layers, and is quick to calculate, giving normalizing flows the property of tractability.


Figure 5. Typical structure of normalizing flow model.

Coupling layers play a crucial role in normalizing flow models. They are used to perform reversible transformations on the input data and latent variables. Affine coupling transformations, a specific type of coupling layer, are commonly used in normalizing flows. These transformations model complex relationships between variables while maintaining invertibility. By using element-wise multiplication and addition, the Jacobian determinant can be efficiently computed. In a coupling layer, the input data are split into fixed and transformed parts. The fixed part is typically passed through unchanged, while the transformed part undergoes a transformation based on a function of the fixed part. This approach allows the model to focus on modeling complex relationships while preserving certain aspects of the input.

The design of an invertible function with expressive structures and efficient computation of the Jacobian determinant is a challenging task in normalizing flows. Affine coupling transformations address these challenges by providing a flexible and efficient way to model complex relationships and compute the Jacobian determinant. By applying a sequence of invertible transformations, normalizing flows can model complex probability distributions. These transformations are designed to be reversible, allowing for tractable likelihood computation. The encoder–decoder functions in normalizing flows are exact inverses of each other, enabling efficient calculations and maintaining tractability.
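
As an illustration of the split-and-transform idea described above, here is a minimal affine coupling layer sketched in PyTorch; the input dimension and the small conditioning network are assumptions made for the example, not a specific published architecture.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Transforms half of the input conditioned on the other (fixed) half."""
    def __init__(self, dim=4, hidden=64):
        super().__init__()
        self.half = dim // 2
        # Small network that predicts a scale and shift from the fixed half
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x_fixed, x_trans = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x_fixed).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep scales bounded for stability
        y_trans = x_trans * torch.exp(log_s) + t  # element-wise affine transform
        log_det = log_s.sum(dim=1)                # Jacobian log-determinant is cheap to compute
        return torch.cat([x_fixed, y_trans], dim=1), log_det

    def inverse(self, y):
        y_fixed, y_trans = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y_fixed).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x_trans = (y_trans - t) * torch.exp(-log_s)  # exact inverse of the forward pass
        return torch.cat([y_fixed, x_trans], dim=1)

Because the fixed half passes through unchanged, the inverse can be computed exactly, and the log-determinant is just the sum of the predicted log-scales.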

Normalizing flows offer the advantage of providing an exact likelihood evaluation and efficient sampling from complex probability distributions, enabling flexible generative modeling. However, a drawback of normalizing flows is the computational expense associated with training deep architectures, particularly for large-scale datasets. Additionally, achieving satisfactory performance may necessitate a significant amount of data during the training process. MoFlow is a flow-based graph generative model that learns invertible mappings between molecular graphs and their latent representations. In MoFlow, each component flow in the mixture is responsible for capturing different aspects of the data distribution. By combining these flows, MoFlow can better model diverse samples from complex data distributions. The mixture of flows can be learned using a gating mechanism that assigns weights or probabilities to each component flow, allowing the model to dynamically select the most appropriate flow for each input sample.

Generative Adversarial Networks (GAN)

Generative adversarial networks, or GANs, were first introduced by Ian Goodfellow in 2014. The GAN is based on the minimax two-person zero-sum game, in which one player profits only when the other suffers an equal loss. The two players in a GAN are the generator and the discriminator. The generator's purpose is to trick the discriminator, while the discriminator's goal is to identify whether a sample comes from the true data distribution. The discriminator's output is the probability that the input sample is a real one: a higher probability suggests that the sample is drawn from real-world data, while the closer the probability is to zero, the more likely the sample is fake. When this probability approaches one-half, the optimal solution has been reached, because the discriminator can no longer reliably tell fake samples from real ones.

Typically, the generator (G) and discriminator (D) are implemented using deep neural networks, working as latent function representations. The architecture of the GAN, illustrated in Figure 6, involves G learning the data distribution from real samples and mapping it, through dense or convolutional layers, to a new space of generated samples with a corresponding probability distribution. The primary objective of the GAN is to ensure that this probability distribution closely resembles the distribution of the training samples. D receives input data, which can be either real data (x) from the training set or generated data produced by the generator. The discriminator then outputs, through its own dense or convolutional layers, a probability or scalar value that indicates whether the input is likely to come from the real data distribution.


Figure 6. Typical structure of generative adversarial networks (GAN).
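
A minimal sketch of this adversarial training loop, written in PyTorch for illustration (the generator, discriminator, and layer sizes are hypothetical placeholders), shows how the discriminator and generator are updated in alternation:

import torch
import torch.nn as nn

# Illustrative generator and discriminator for flattened inputs (dimensions are placeholders).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                       # real: a batch of training samples
    z = torch.randn(real.size(0), 64)       # latent noise fed to the generator
    fake = G(z)

    # Discriminator update: push D(real) toward 1 and D(fake) toward 0
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to fool D, i.e. push D(fake) toward 1
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(real.size(0), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()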

GAN (generative adversarial network) training faces several challenges, including gradient disappearance, difficulty in training, and poor diversity. These problems arise from the loss function used in GANs, which involves measuring and minimizing the distance between the real data distribution (Pr) and the generated data distribution (Pg).

During training, the discriminator aims to minimize cross-entropy by differentiating between real and generated samples. The optimal discriminator (D) takes the form given below.


D(x) = Pr(x) / (Pr(x) + Pg(x))

On the other hand, the generator (G) seeks to minimize a generator-specific loss function that includes an independent item to ensure diversity.

The loss function for the generator can be written as,

V(G) = KL(Pg || Pr) − 2 JSD(Pr || Pg)

where KL is the Kullback–Leibler divergence and JSD is the Jensen–Shannon divergence. Minimizing the JS divergence helps the generated samples resemble real ones. However, if there is little or no overlap between Pr and Pg, the JS divergence becomes a constant, leading to vanishing gradients.

Additionally, training GANs can be challenging because the feedback from the discriminator can be close to zero when it is trained optimally, slowing down convergence. Moreover, determining when the discriminator is properly trained is difficult since there is no indicator for it.

Another problem is the poor diversity in the generated samples. The generator loss function V(G) can be reformulated to address this issue. Minimizing this loss function is equivalent to minimizing the KL divergence and the JSD, leading to more diverse generated samples. Several new models have been introduced to address these limitations of the original GAN, including issues like gradient disappearance, unstable training, and poor diversity. These new GAN models aim to enhance stability and improve the quality of the generated outputs.

Conditional generative adversarial networks (CGANs) have emerged as a solution to enhance the control and convergence speed of GANs on complex or large-scale datasets. By incorporating conditional variables, such as category labels, textual descriptions, or specific generation targets, CGANs provide guidance to the data generation process. This allows for supervised learning, targeted generation, and the ability to generate images with specific categories or labels. Moreover, CGANs can use image features as the condition to generate corresponding word vectors, enabling effective cross-modal generation, as illustrated in Figure 7.


Figure 7. Typical structure of conditional GAN.

Some of the GANs that incorporate this technique are the conditional generative adversarial network (CGAN), the CGAN with the Pix2Pix framework, the conditional tabular GAN (CTGAN), and text-conditioned GANs (TAC-GAN, TAGAN).

Wasserstein generative adversarial networks (WGANs) offer a novel approach to address the challenges faced by traditional GANs. By introducing the Wasserstein distance as a metric, WGANs provide a more stable training process and better gradient flow. The discriminator in WGANs, known as the "critic", assigns scores representing the distance between the real and fake data distributions. This distance is measured using the Wasserstein distance instead of the Jensen–Shannon divergence or Kullback–Leibler divergence used in other generative models. WGANs mitigate the issue of mode collapse, where GANs fail to capture the full diversity of the data, by effectively learning the underlying data distribution, even for complex and high-dimensional datasets. The generator and discriminator in WGANs are trained to minimize the Wasserstein distance, encouraging the generator to generate samples that closely resemble real data. This enables WGANs to produce more diverse and realistic outputs.
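
The following sketch contrasts the WGAN objective with the standard GAN loss. It follows the original weight-clipping formulation for illustration (gradient-penalty variants also exist), and the function names and the clipping constant are placeholders rather than a specific library API.

import torch

def critic_loss(critic, real, fake):
    # Wasserstein estimate: maximize E[critic(real)] - E[critic(fake)],
    # which is done by minimizing the negative of that quantity
    return -(critic(real).mean() - critic(fake).mean())

def generator_loss(critic, fake):
    # The generator tries to increase the critic's score on generated samples
    return -critic(fake).mean()

def clip_weights(critic, c=0.01):
    # The original WGAN keeps the critic approximately 1-Lipschitz by clipping its weights
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)

Note that the critic outputs an unbounded score rather than a probability, so no sigmoid or cross-entropy is involved.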

WGANs have found applications in various domains, such as image synthesis, text generation, and data augmentation. Their effectiveness in addressing mode collapse and providing a more reliable training process has made them a popular choice for researchers and practitioners working with generative models.

Deep convolutional generative adversarial networks (DCGANs) are a variant of GANs that leverage deep convolutional neural networks (CNNs) to enhance the quality of generated samples, particularly in the domain of image synthesis. DCGANs have proven to be highly effective in generating realistic and high-resolution images. DCGANs utilize convolutional layers in both the generator and discriminator networks, allowing them to capture spatial dependencies and patterns in the data. DCGANs introduce several key design principles, including the use of convolutional and transposed convolutional layers, batch normalization, and ReLU activation functions. These principles contribute to the stability of the training process, mitigate issues like mode collapse and allow for the generation of diverse and high-quality samples.
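
A minimal DCGAN-style generator, sketched in PyTorch for illustration, shows these design principles in practice: transposed convolutions to upsample, batch normalization between layers, ReLU activations, and a Tanh output. The channel counts and the 64x64 output size are assumptions made for the example, not a prescribed configuration.

import torch.nn as nn

# Maps a 100-dimensional noise vector (shaped as 100 x 1 x 1) to a 3 x 64 x 64 image.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 1x1 -> 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 4x4 -> 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 8x8 -> 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 16x16 -> 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                          # 32x32 -> 64x64
)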

The benefits of DCGANs extend beyond image synthesis, with applications in areas such as image-to-image translation, style transfer, and data augmentation. The combination of deep convolutional architectures and adversarial training has propelled DCGANs as a go-to choice for generating visually appealing and realistic images in the field of deep learning.

Generative adversarial networks (GANs) have revolutionized various domains of computer vision and machine learning. They can be classified into different categories based on their specific tasks and applications. Image-to-image translation GANs focus on translating images between domains, with subcategories such as CycleGAN, DiscoGAN, and DTN. Super-resolution GANs enhance the resolution of low-resolution images, including SRGAN and VSRResFeatGAN. Text-to-image GANs generate images from textual descriptions, exemplified by AttnGAN and StackGAN. Tabular data GANs generate synthetic tabular data, with examples like CTGAN and TGAN. Defense and security GANs address security-related applications, including defense against adversarial attacks and steganography, such as defense GANs and SSGAN. Style-based GANs capture and manipulate artistic styles, including StyleGAN and StyleCLIP. Other GAN types encompass diverse applications such as BigGAN for high-resolution images, ExGANs for variation generation, and SegAN for semantic segmentation, among various others. These categories demonstrate the versatility and advancement of GANs across domains, enabling tasks such as image translation, super-resolution, text-to-image synthesis, data generation, security applications, style manipulation, and more. GANs continue to drive innovation and push the boundaries of generative models in the field of artificial intelligence.

Diffusion Models

Diffusion models are a type of generative model that operates by progressively introducing noise into data until it conforms to a desired distribution. The main idea behind diffusion models is to learn the process of reversing this diffusion, enabling the generation of valid samples. In the forward pass of a diffusion model, shown in Figure 8, Gaussian noise is iteratively added to the data in a series of steps. This noise corrupts the original data, gradually degrading its quality; as the noise level increases with each step, the images become increasingly distorted or destroyed. The objective of the diffusion model is to learn the dynamics of this diffusion process. By observing the corrupted data and the corresponding noise levels, the model learns to estimate the conditional probability distribution that describes the relationship between the corrupted data and the noise levels. Once the diffusion process is learned, the model can then perform the reverse pass, starting from the corrupted data and progressively removing the noise at each step. This process of denoising leads to the generation of valid and realistic samples that resemble the original data distribution.


Figure 8. Typical structure of diffusion model.

There are three sub-types that differ in their implementation of the forward and backward diffusion pass. These sub-types are denoising diffusion probabilistic models (DDPMs), score-based generative models (SGMs), and stochastic differential equations (SDEs).

Denoising Diffusion Probabilistic Models (DDPMs): DDPMs, also known as denoising score-matching models, incorporate a two-step process for diffusion. They apply Markov chains to progressively corrupt data with Gaussian noise and then reverse the forward diffusion process by learning Markov transition kernels. DDPMs focus on modeling the diffusion process and its associated reversibility.

Score-based Generative Models (SGMs): SGMs, also referred to as score-matching models, work directly with the gradient of the log density (score function) of the data. They perturb the data with noise at multiple scales and jointly estimate the score function of all noisy data distributions using a neural network conditioned on different noise levels. This decoupling of training and inference steps enables flexible sampling.

Stochastic Differential Equations (SDEs): SDEs generalize diffusion models into continuous settings. They formulate noise perturbations and denoising processes as solutions to stochastic differential equations. By leveraging the probabilistic flow of these equations, the reverse generation process can be modeled. Probability flow ordinary differential equations (ODEs) can also be utilized to represent the reverse process.

Diffusion models employ neural network architectures to capture the complex dependencies and patterns in the data. These architectures can consist of various layers, such as convolutional layers for image data or recurrent layers for sequential data. The network is trained to learn the conditional probability distribution that describes the relationship between the corrupted data and the noise levels. The training objective of diffusion models is typically based on maximum-likelihood estimation or other probabilistic frameworks. The model parameters are optimized to minimize the discrepancy between the generated samples and the original data distribution. Various techniques such as gradient descent and backpropagation are employed to train the model effectively.
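
A common way to train such a model is the simplified DDPM-style objective, in which the network predicts the noise that was added at a randomly chosen step. The sketch below (PyTorch, for illustration only; the linear noise schedule and the model's call signature are placeholder assumptions) shows the forward noising step and the resulting mean-squared-error loss.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # a commonly used linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative product of (1 - beta_t)

def diffusion_loss(model, x0):
    """Simplified objective: the network predicts the noise added at step t."""
    t = torch.randint(0, T, (x0.size(0),))         # a random diffusion step per sample
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    # Forward (noising) process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    pred_noise = model(x_t, t)                     # model is conditioned on the step index t
    return torch.mean((pred_noise - noise) ** 2)   # MSE between true and predicted noise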

Diffusion models, such as deep diffusion generative models (DDGM), have gained prominence as strong generative models in recent years. They take a novel approach to modeling complicated data distributions by diffusing a given input iteratively towards a target distribution. However, to address specific difficulties or increase performance in specific scenarios, variations of diffusion models are necessary. The latent diffusion model (LDM) is a variant of the diffusion model that operates in latent space. It is a generative model that aims to learn the underlying data distribution by applying a diffusion process to latent variables instead of the observed data. By acting in latent space, the latent diffusion model can develop more meaningful representations and capture the underlying structure of the data distribution. It enables the efficient and effective generation of high-quality samples with desired attributes. The latent diffusion model has been used to produce varied and realistic samples in a variety of fields, including image generation, text generation, video generation, and audio synthesis.

The geometry complete diffusion model (GCDM) is an extension of the diffusion model that incorporates geometric constraints and priors into the diffusion process. It leverages the underlying geometric structure of the data to guide the diffusion process, resulting in improved generation quality and better preservation of geometric properties. The GCDM takes into account geometric relationships such as distances, angles, and shape characteristics, allowing for more precise and controlled generation of samples.

The video diffusion model (VDM) is a specific type of diffusion model designed for generating videos. It extends the diffusion process to the temporal dimension, allowing for the generation of coherent and dynamic sequences of frames. The VDM progressively corrupts the video frames with noise perturbations and then learns to denoise and generate realistic video sequences. It captures the temporal dependencies and dynamics of the data distribution, enabling the generation of videos with smooth transitions and realistic motion.

Language Models

Language models (LMs) have undergone a significant transformation in recent years, evolving from their traditional role of generating or evaluating fluent natural text to becoming powerful tools for text understanding. This shift has been achieved through the utilization of language modeling as a pre-training task for feature extractors, where the hidden vectors learned during language modeling are leveraged in downstream language understanding systems. LMs have proven instrumental in a wide range of applications, enabling tasks such as answering factoid questions, addressing commonsense queries, and extracting factual knowledge about entity relations. At its core, a language model is a computational framework that aims to understand and generate human-like text. It operates based on the fundamental principle of probabilistic prediction, where it learns patterns and dependencies in sequences of words to estimate the likelihood of a particular word given the preceding context. By capturing statistical regularities in language, LMs can generate coherent and contextually relevant text. This is achieved by training the model on vast amounts of text data, allowing it to learn the distribution of words, phrases, and syntactic structures in each language.

The components of a language model consist of the training data, the architecture of the model itself, and the inference mechanism used for generating text. The training data serve as the foundation for learning the underlying patterns and probabilities in language. The architecture of the model encompasses various neural network architectures, such as recurrent neural networks (RNNs), transformers, or a combination of both, which enable the model to capture long-range dependencies and contextual information. The inference mechanism involves using the trained model to generate text based on input prompts or to predict missing words in a given context. In the RNN architecture shown in Figure 9, the input sequence X is processed step by step, where X(t) represents the input at each time step. The goal is to predict an output sequence y. At each time step, the RNN takes the current input X(t) and the previous hidden state h(t − 1) as inputs. The hidden state h(t) represents the network's memory and is computed using a set of learnable parameters and activation functions. In some cases, a cell state is used alongside the hidden state, as seen in the long short-term memory (LSTM) and gated recurrent unit (GRU) variants; the cell state acts as a long-term memory component. The hidden state h(t) is then used to generate the output sequence y(t), which can be used for tasks like sequence-to-sequence prediction.


Figure 9. Recurrent Neural Network Architecture.
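
The next-token objective described above can be sketched in a few lines of PyTorch; the vocabulary size, embedding and hidden dimensions, and the toy batch below are placeholders chosen only for illustration.

import torch
import torch.nn as nn

class TinyLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # hidden + cell state
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len) of word ids
        h, _ = self.lstm(self.embed(tokens))    # h(t) depends on x(t) and h(t - 1)
        return self.head(h)                     # logits over the vocabulary at each step

# Training objective: predict token t+1 from the tokens up to t
model = TinyLanguageModel()
tokens = torch.randint(0, 10000, (2, 12))       # a toy batch of token ids
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), tokens[:, 1:].reshape(-1))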

Language models are used for a variety of tasks, which are supported by different types of language models. The visual language model (VLM) combines textual and visual information to understand and generate language in the context of visual data; by leveraging visual input such as images or videos, VLMs can accurately interpret the content and generate captions, answer questions, and perform other language-related tasks. A collaborative language model (CLM) is developed through the collective effort of multiple individuals or organizations; its collaborative nature incorporates diverse perspectives and insights, leveraging the collective wisdom of contributors and subject-matter experts to enhance the quality and reliability of its language generation capabilities. The large language model (LLM) represents language models that have been trained on extensive textual data and possess many parameters. With billions of parameters, LLMs such as GPT-3 demonstrate the ability to generate sophisticated and human-like text across a wide range of topics and writing styles. These language model variants play crucial roles in natural language processing and have the potential to enhance various applications and systems reliant on human-like text generation.

Transformers

The transformer model has revolutionized the field of natural language processing (NLP) by replacing traditional recurrent neural networks (RNNs) with a self-attention mechanism. This model has achieved state-of-the-art performance on various language tasks while being computationally efficient and highly parallelizable. The core component of the transformer model is the self-attention mechanism, which allows the model to focus on different parts of the input sequence simultaneously when making predictions. Unlike RNNs, which process sequential information step by step, the transformer considers the entire input sequence at once, effectively capturing dependencies between tokens. The transformer architecture consists of an encoder and a decoder, both comprising multiple layers of self-attention and feed-forward neural networks. The encoder processes the input sequence, while the decoder generates the output sequence. The self-attention mechanism in the transformer enables the model to selectively attend to relevant parts of the input sequence, facilitating the capture of long-range dependencies and improving translation quality, among other tasks.

The attention module in the transformer adopts a multi-head design. The self-attention is formulated as a scaled dot-product, where the input queries (Q), keys (K), and values (V) are combined to calculate the attention weights. The scaling factor of √dk is applied to normalize the dot-product scores. The resulting attention weights are then multiplied with the values and summed up to produce the final output.

Attention(Q, K, V) = softmax(QKT / √dk)V
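
This formula translates almost directly into code. The sketch below (PyTorch, for illustration) computes single-head scaled dot-product attention; the optional mask argument is an assumption added to show how padding or causal masking is typically handled.

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                  # attention weights
    return weights @ V                                        # weighted sum of the values

Multi-head attention simply runs several such computations in parallel on linearly projected copies of Q, K, and V and concatenates the results.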

The transformer model employs multiple layers of self-attention and fully connected point-wise layers in both the encoder and decoder components illustrated in Figure 10. This architecture allows the model to effectively capture and process the complex relationships and dependencies within the input and output sequences.

Futureinternet 15 00260 g010 550

Figure 10. Transformer architecture.

Transformers vary in their architectures, specific network designs, and training objectives depending on the application and input data.

BERT (Bidirectional Encoder Representations from Transformers): BERT consists of a multi-layer bidirectional transformer encoder. It employs a masked language modeling (MLM) objective during pre-training: it randomly masks words in the input text and trains the model to predict the masked words based on their context. BERT also uses a next sentence prediction (NSP) task, in which it learns to predict whether two sentences are consecutive in the original document. BERT is pre-trained on a large corpus of text, such as Wikipedia and Book Corpus. It utilizes unsupervised learning and large-scale transformer architectures to capture general language representations. After pre-training, BERT can be fine-tuned on specific downstream tasks using supervised learning with task-specific datasets.

GPT (Generative Pre-trained Transformer): GPT employs a multi-layer transformer decoder. GPT is trained using an autoregressive language modeling objective. It predicts the next word in a sequence based on the previous context, enabling the generation of fluent and contextually relevant text. GPT is pre-trained on a large corpus of text, such as web pages and books. It learns to generate text by conditioning on the preceding context. Fine-tuning of GPT can be performed on specific tasks by providing task-specific prompts or additional training data.

T5 (Text-to-Text Transfer Transformer): T5 employs a transformer architecture like BERT but follows a text-to-text framework. It can handle various NLP tasks using a unified approach. T5 is trained using a text-to-text format, where both input and output are text strings. It leverages a combination of unsupervised and supervised learning objectives for pre-training and fine-tuning.
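
For readers who want to experiment, pretrained transformer language models of this kind can be loaded through libraries such as Hugging Face Transformers. The sketch below is only an illustration of that workflow (the library and the GPT-2 checkpoint are not part of the source article); it shows how a prompt is tokenized, extended autoregressively, and decoded back to text.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small pretrained autoregressive language model (GPT-2 as an example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Generative AI models can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))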

The field of transformers has witnessed remarkable progress, leading to the development of several influential models for various natural language processing (NLP) tasks. One prominent model is the adaptive text-to-speech (AdaSpeech) system, which focuses on generating highly realistic and natural-sounding synthesized speech. It employs advanced techniques to overcome limitations in traditional text-to-speech systems, enabling more expressive and dynamic speech synthesis.

For code-related tasks, researchers have introduced specialized transformer models such as code understanding BERT (CuBERT), CodeBERT, CODEGEN, and CodeT5. CuBERT is specifically designed for code comprehension, leveraging the power of transformers to understand and analyze source code. CodeBERT, on the other hand, performs code-related tasks like code generation, bug detection, and code summarization. CODEGEN focuses on generating code snippets given natural language descriptions, facilitating the automation of programming tasks. CodeT5, inspired by the T5 architecture, excels in various code-related tasks, including code summarization, translation, and generation.

The feed-forward transformer (FFT) model is a versatile transformer architecture that has demonstrated exceptional performance across multiple NLP tasks. It leverages a feed-forward neural network to process and transform input sequences, enabling effective modeling of complex language patterns and semantic relationships. The GPT language model (Codex), based on the GPT-3 architecture, has gained significant attention for its ability to generate coherent and contextually relevant text. It excels in tasks such as text completion, question answering, and text generation. InstructGPT (GPT-3) is another powerful language model that can understand and generate human-like text based on specific prompts. It has been extensively used in various conversational AI applications, virtual assistants, and creative writing assistance.

Grapher is a transformer model designed to process and understand graphical data. It leverages graph neural networks and self-attention mechanisms to capture dependencies and relationships within structured data, enabling tasks such as graph classification, node-level prediction, and link prediction. Language models for dialog applications (LaMDA) are transformer-based models specifically tailored for conversational tasks. They enhance dialogue understanding and generation by capturing context, nuances, and conversational dynamics. LaMDA models have shown promise in improving conversational agents, chatbots, and virtual assistants.

In the realm of multimodal tasks that involve both text and visual information, transformer-based models have also made significant contributions. MotionCLIP focuses on understanding and generating textual descriptions of videos, bridging the gap between language and visual understanding. Muse explores the connection between text and image, enabling tasks such as text-based image retrieval and image captioning. The pre-trained language model (PLM)/visual GPT is a multimodal model that combines text and visual information to generate coherent and contextually relevant captions for images.

Other notable transformer models include T5X, the text-to-text transfer transformer (T5), TFix, w2v-BERT (Word2Vec and BERT), and WT5 (Why, T5?). T5X extends the T5 architecture to handle even more complex NLP tasks and demonstrates superior performance in tasks such as machine translation and text summarization. TFix focuses on addressing issues related to fairness, transparency, and explainability in transformer models. w2v-BERT combines Word2Vec and BERT to enhance the representation of word semantics within the transformer framework. WT5 focuses on training text-to-text models to explain their predictions. It builds upon the architecture of the text-to-text transfer transformer (T5) model. The primary objective of WT5 is to enhance the interpretability and explainability of the T5 model by providing insights into the reasoning behind its predictions.

Hybrid Models

Hybrid generative AI models are models that combine multiple generative AI techniques or architectures to leverage their respective strengths and produce improved results. These models aim to overcome limitations or enhance the capabilities of individual generative models by integrating different approaches.

Adversarial autoencoder (AAE): The AAE is a type of generative model that combines elements of both autoencoders and generative adversarial networks (GANs). It is designed to learn a compact latent representation of input data while generating realistic samples from that latent space. In an adversarial autoencoder, the autoencoder is integrated with a GAN framework: the autoencoder plays the role of the generator, producing samples in the latent space, while the discriminator network, instead of attempting to discriminate between real and fake data samples, seeks to distinguish between samples drawn from the true latent distribution and samples produced by the autoencoder, as shown in Figure 11.


Figure 11. Adversarial autoencoder architecture (AAE).

An AAE's training consists of two major stages. In the reconstruction stage, the autoencoder is trained to reconstruct the input data accurately; minimizing the reconstruction loss between the input and output encourages the autoencoder to develop a meaningful representation. In the adversarial stage, the discriminator is trained to differentiate samples drawn from the actual latent distribution from samples produced by the autoencoder, while the generator (the autoencoder) seeks to produce samples that deceive the discriminator. This adversarial training pushes the autoencoder to generate realistic latent space samples. By combining the reconstruction and adversarial phases, the AAE can learn a compact latent representation that captures the main features of the input data while generating realistic samples from that latent space. Adversarial training prevents mode collapse and encourages the generator to explore its entire latent space. Adversarial autoencoders have been employed in a wide range of applications, including image generation, anomaly detection, and data synthesis.

PixelCNN: PixelCNN is a type of generative model that belongs to the family of autoregressive models and is specifically tailored for generating images pixel by pixel. It utilizes convolutional layers to capture spatial dependencies within the image. PixelCNN models the conditional probability distribution of each pixel given its preceding context. By modeling this conditional distribution, PixelCNN can generate images that exhibit realistic textures and local coherence.

During training, PixelCNN is typically trained using maximum-likelihood estimation. The model takes an image as input and is trained to maximize the likelihood of generating that image. PixelCNN employs a process called autoregression for generating new images. It starts with an empty canvas and generates the pixels one by one, conditioning each prediction on the previously generated pixels. This autoregressive process allows the model to capture complex dependencies and generate coherent images. PixelCNN has demonstrated success in tasks such as image completion, super-resolution, and image synthesis.
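
The autoregressive ordering in PixelCNN is usually enforced with masked convolutions. The sketch below (PyTorch, for illustration; the layer sizes are placeholders) shows a masked convolution in which each pixel's receptive field is restricted to pixels above it and to its left, with mask type 'A' additionally excluding the current pixel for the first layer.

import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so each pixel only sees earlier pixels."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.register_buffer("mask", torch.ones_like(self.weight))
        _, _, h, w = self.weight.shape
        # Zero out connections to the current pixel (type 'A' only) and all later pixels.
        self.mask[:, :, h // 2, w // 2 + (mask_type == "B"):] = 0
        self.mask[:, :, h // 2 + 1:, :] = 0

    def forward(self, x):
        self.weight.data *= self.mask      # enforce the autoregressive ordering
        return super().forward(x)

# The first layer uses mask 'A' (excludes the current pixel); later layers use mask 'B'.
layer = MaskedConv2d("A", in_channels=1, out_channels=32, kernel_size=7, padding=3)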

Variational Autoencoder with Generative Adversarial Networks (VAE-GAN): This hybrid model combines the generative capabilities of variational autoencoders (VAEs) and generative adversarial networks (GANs). The VAE component helps encode and decode input data, while the GAN component enhances the realism and diversity of the generated samples. Introspective adversarial networks and Mol-CycleGAN are examples of this combination. Introspective adversarial networks employ additional techniques to improve performance, such as multiscale dilated convolution blocks and orthogonal regularization; these techniques help the model capture long-range dependencies in the image, prevent overfitting, and generate images that are more realistic and coherent. Mol-CycleGAN extends the CycleGAN framework to molecular data by using the latent space of the JT-VAE (junction tree variational autoencoder) as the embedding representation. The latent space is created by a neural network during the training process. The advantage of using the latent-space embedding is that the distance between molecules can be defined directly in this space, enabling the calculation of the loss function. VAE-GANs have been successfully applied in various domains, including image synthesis, text generation, and music composition.

Generative Adversarial Networks (GAN) with Dense Convolutional Neural Networks (DenseNet) or Residual Neural Networks (ResNet): Dense convolutional neural networks (DenseNets) are known for their dense connections, which facilitate feature reuse and enhance the flow of gradients throughout the network. DenseNet architectures have shown remarkable performance in image classification tasks by capturing intricate patterns and representations in the data. When combined with generative adversarial networks (GANs), a DenseNet can serve as the generator component of the GAN framework. By utilizing DenseNet as the generator, the hybrid model benefits from its powerful feature learning capabilities and its ability to capture complex patterns and details in the data. ResNet is used in a similar way, with a slight difference between the two: ResNet's skip connections enable the training of very deep networks, while DenseNet's dense connectivity promotes parameter efficiency and better information flow. This combination of models is used in CycleGAN and PGGAN. CycleGAN is a powerful framework for unsupervised image translation that leverages the concept of cycle consistency and utilizes architectures such as ResNet and PatchGAN to achieve impressive results in various image-to-image translation tasks. The PGGAN discriminator, formed by combining PatchGAN and G-GAN, provides fine-grained evaluation of local image patches and incorporates gradient penalty regularization, enhancing the training stability and the diversity of generated samples in the PGGAN framework.

Generative Adversarial Networks (GAN) with Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN): By combining RNNs or CNNs with GANs, it becomes possible to generate sequences that possess both coherence and realism. The RNN component provides the ability to model sequential dependencies, ensuring that the generated sequences flow naturally and exhibit contextual understanding. The GAN component, on the other hand, improves the diversity and quality of the generated sequences by leveraging the adversarial training process. In the RTT-GAN, the generator employs a hierarchical structure and attention mechanisms to retain contextual states at various levels. The hierarchy is formed by a paragraph-level recurrent neural network (RNN), a sentence-level RNN, and a word-level RNN, along with two attention modules. The paragraph RNN encodes the current paragraph state by considering preceding sentences. The spatial–visual attention module selectively focuses on semantic regions, guided by the current paragraph state, to generate the visual representation of the sentence. Consequently, the sentence RNN can encode a topic vector for the newly generated sentence. The discriminator LSTM RNN takes the sentence embeddings of all preceding sentences as inputs. It computes the topic smoothness value of the currently constructed paragraph description at each recurrent step, assessing the coherence of topics across the generated sentences. With these multi-level assessments, the model can generate long yet realistic descriptions, maintaining both sentence-level plausibility and topic coherence. In the CNN-GAN model, a convolutional encoder–decoder network is used to generate new content by jointly training it with adversarial networks. This training setup aims to ensure coherence between the generated pixels and the existing ones. These CNN-based methods have demonstrated the ability to generate realistic and plausible content in highly structured images, such as faces, objects, and scenes.

Generative Adversarial Networks (GAN) with Denoising Diffusion Probabilistic Models (DDPM) and Transformers: Combining DDPMs, GANs, and transformers can create a hybrid generative AI model with enhanced capabilities. This combination allows for the generation of diverse and high-quality samples while leveraging the strengths of each component. DiffGAN-TTS and ProDiff implement this combination of models. DiffGAN-TTS is a novel text-to-speech (TTS) model that achieves high-fidelity and efficient speech synthesis. It takes inspiration from the denoising diffusion GAN model and models the denoising distribution using an expressive acoustic generator. This generator is trained adversarially to match the true denoising distribution, ensuring high-quality output spectrograms. A key strength of DiffGAN-TTS is its ability to take large denoising steps during inference, which reduces the number of denoising steps required and accelerates the sampling process. To further enhance sampling efficiency, DiffGAN-TTS incorporates an active shallow diffusion mechanism. ProDiff utilizes generator-based parameterization, in which the denoising model directly predicts clean data using a neural network. This approach has shown advantages in accelerating sampling from complex distributions. By directly predicting clean data, ProDiff avoids the need to estimate gradients and achieves faster synthesis.

Transformer with Recurrent Neural Network (RNN): The combination of transformers and RNNs can leverage the strengths of both architectures, allowing for improved modeling of sequential data with long-term dependencies and global context understanding. This combination is useful for tasks such as speech recognition, time series forecasting, and video processing, where both local temporal dependencies and global context are crucial for accurate predictions. MolT5 implements three baseline models for the tasks of molecule captioning and molecule generation. The first baseline is a four-layer GRU recurrent neural network with a bidirectional encoder. This model leverages the sequential nature of the data and captures contextual information from both past and future. The second baseline is based on the transformer architecture, consisting of six encoder and decoder layers. Transformers utilize self-attention mechanisms to capture global dependencies and have been successful in various sequence-to-sequence tasks. The third baseline is based on the T5 model, a pre-trained sequence-to-sequence model. Three T5 checkpoints, namely small, base, and large, are fine-tuned for molecule captioning and molecule generation. T5 models have shown strong performance in natural language processing tasks.

Transformer with Graph Convolutional Network (GCN): For tasks that require graph-structured data, this hybrid model combines the strength of transformers and GCNs. Transformers excel at sequence-to-sequence tasks and have demonstrated success in natural language processing and image processing. GCNs, on the other hand, are especially intended to handle graph-structured data and capture node relationships. This hybrid model can effectively capture both the sequential dependencies of the data and the graph-based relationships by combining transformers and GCNs, enabling enhanced modeling and representation learning in graph-based tasks such as node classification, link prediction, graph generation and molecule structure generation.

Transformer with Long Short-Term Memory (LSTM): Long short-term memory (LSTM) networks are a type of recurrent neural network known for their ability to capture long-term dependencies in sequential data, while transformers are powerful models for sequence processing that leverage self-attention mechanisms to capture dependencies across the sequence. The GTR-LSTM encoder provides a graph-based approach to encoding triples, considering the structural relationships between entities in a knowledge graph. By incorporating attention mechanisms and entity masking, the model aims to generate coherent and meaningful output sequences based on the input graph.

Vision Transformers with Residual Neural Networks (ResNet): Vision transformers leverage the self-attention mechanism of transformers to capture long-range dependencies and enable effective modeling of image data. The combination of ResNet and vision transformers can benefit from both the local feature extraction capabilities of ResNet and the global context understanding of vision transformers, resulting in improved image understanding and representation.

Diffusion probabilistic models with Contrastive Language-Image Pretraining (CLIP): Diffusion modeling is a powerful technique for modeling complex data distributions and generating high-quality samples. CLIP, on the other hand, is a state-of-the-art method for learning visual representations from images and corresponding textual descriptions. DiffusionCLIP combines the power of diffusion modeling and the guidance of CLIP to enable precise and controlled image manipulation. It leverages pretrained diffusion models and the CLIP loss to fine-tune the diffusion model and generate samples that align with a target textual description, which opens new possibilities for image generation and manipulation tasks.

Convolutional Neural Network (CNN) with Bidirectional Encoder Representations from Transformers (BERT): CLAP (contrastive learning for audio and text pairing) is a model that jointly trains an audio encoder and a text encoder to learn the similarity or dissimilarity between audio and text pairs. The goal is to enable zero-shot classification by computing embeddings for audio and text and using cosine similarity to measure their similarity. The model takes audio and text pairs as input, which are separately processed by the audio encoder and text encoder. The encoders extract meaningful representations from the audio and text inputs. These representations are then projected into a joint multimodal space using linear projections.

Convolutional Sequence-to-Sequence Learning (ConvS2S): This is a neural network architecture that was introduced for sequence-to-sequence tasks, such as machine translation or speech recognition. It leverages convolutional neural networks (CNNs) to process input sequences and generate output sequences, providing an alternative to the commonly used recurrent neural networks (RNNs). Unlike RNN-based models that rely on sequential processing, ConvS2S applies parallel convolutions across the input sequence. This enables more efficient computation and allows for better utilization of parallel processing capabilities, leading to faster training and inference times. The use of convolutions also helps capture local dependencies in the input sequence, which can be beneficial for tasks where context is primarily determined by nearby elements. The architecture of ConvS2S typically consists of an encoder and a decoder. The encoder is composed of several layers of 1D convolutional filters followed by non-linear activation functions; these filters capture different patterns and features in the input sequence, allowing for effective representation learning. The decoder employs similar convolutional layers but with additional techniques, such as attention mechanisms, to generate the output sequence.