Variational autoencoder gives the same output image for every input MNIST image when using KL divergence

4

11

Without the KL divergence term, the VAE reconstructs MNIST images almost perfectly but fails to generate plausible new ones when given random noise.

With the KL divergence term, the VAE produces the same weird output whether it is reconstructing or generating images.

Here's the PyTorch code for the loss function:

def loss_function(recon_x, x, mu, logvar):
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), size_average=True)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    
    return (BCE+KLD)

recon_x is the reconstructed image, x is the original image, mu is the mean vector, and logvar is the vector containing the log of the variance.

What is going wrong here? Thanks in advance :)

Saxony answered 30/5, 2018 at 14:42 Comment(0)

11

A possible reason is the numerical imbalance between the two losses: your BCE loss is averaged over every element in the batch (cf. size_average=True), while the KLD loss is summed.
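One way to put the two terms on the same scale is sketched below: sum both over all elements, then divide by the batch size. This is a minimal sketch, assuming `recon_x` has shape (batch, 784) with values in (0, 1). The name `balanced_loss` and the tensor shapes in the usage lines are illustrative and do not come from the question.

```python
import torch
import torch.nn.functional as F


def balanced_loss(recon_x, x, mu, logvar):
    """Per-sample negative ELBO: both terms summed, then averaged over the batch."""
    batch_size = x.size(0)
    # Sum (not mean) over pixels so the BCE is on the same scale as the summed KLD.
    bce = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over batch and latent dims.
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (bce + kld) / batch_size


# Usage with random stand-in tensors (batch size 8 and latent size 20 are assumptions).
x = torch.rand(8, 1, 28, 28)
recon_x = torch.rand(8, 784).clamp(1e-6, 1 - 1e-6)
mu, logvar = torch.zeros(8, 20), torch.zeros(8, 20)
loss = balanced_loss(recon_x, x, mu, logvar)
```

With `mu` and `logvar` both zero the KLD term vanishes, so the result reduces to the summed BCE divided by the batch size, which is a quick sanity check on the reduction.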

Irresolvable answered 8/6, 2018 at 11:48 Comment(0)

3

Multiplying the KLD by 0.0001 did it. The generated images are a little distorted, but the identical-output issue is resolved.
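A rough check of why such a small factor can help, assuming the loss from the question (BCE averaged over every pixel in the batch, KLD summed over the batch): dividing the standard summed objective by batch_size * 784 leaves the BCE as a mean and scales the KLD by 1 / (batch_size * 784). The helper name and the batch sizes below are assumptions; the question does not state the batch size used.

```python
def kld_weight_for_mean_bce(batch_size, n_pixels=784):
    """Weight that puts a summed KLD on the same scale as a per-pixel mean BCE."""
    return 1.0 / (batch_size * n_pixels)


# Hypothetical batch sizes; the weight shrinks as the batch grows.
weights = {b: kld_weight_for_mean_bce(b) for b in (16, 64, 128)}
for batch_size, weight in weights.items():
    print(f"batch_size={batch_size}: equivalent KLD weight = {weight:.2e}")
```

For small batches this weight is of the same order of magnitude as 0.0001, which is consistent with the scale mismatch being the underlying problem.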

Saxony answered 30/5, 2018 at 15:0 Comment(2)

I had the same mistake as you, as pointed out by the initial answer, but even after fixing it, this was the only thing that really worked. Some stuff I read online suggested that this might be more of an issue with not learning the variance of the output distribution – Go 16/9, 2020 at 14:26

@Go can you point to the stuff you read online? I have the same issue and I would like to solve it. – Dreyfus 29/11, 2021 at 17:20

3

Yes, try different weight factors for the KLD loss term. Down-weighting the KLD term also resolves the identical-reconstruction issue on the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html).

Ragout answered 10/2, 2019 at 8:53 Comment(0)

0

There are many possible reasons for that. As benjaminplanche stated, you need to use a .mean reduction instead of .sum. Also, the best KLD term weight can differ between architectures and datasets, so try different weights and inspect both the reconstruction loss and the latent space to decide. There is a trade-off between the reconstruction loss (output quality) and the KLD term, which forces the model to shape a Gaussian-like latent space.

To evaluate different aspects of VAEs, I trained a vanilla autoencoder and a VAE with different KLD term weights. I used the MNIST handwritten digits dataset, with input size 784 = 28*28 and a 30-dimensional latent space. Although a Sigmoid activation is the usual choice for data in the range [0, 1], I used Tanh for experimental reasons.

Vanilla Autoencoder:

Autoencoder(
  (encoder): Encoder(
    (nn): Sequential(
      (0): Linear(in_features=784, out_features=30, bias=True)
    )
  )
  (decoder): Decoder(
    (nn): Sequential(
      (0): Linear(in_features=30, out_features=784, bias=True)
      (1): Tanh()
    )
  )
)

Afterward, I implemented the VAE model as shown in the following code block. I trained this model with KLD weights from the set {0.5, 1, 5}.

import torch
import torch.nn as nn


class VAE(nn.Module):

    def __init__(self, dim_latent_representation=2):

        super(VAE, self).__init__()

        class Encoder(nn.Module):
            def __init__(self, output_size=2):
                super(Encoder, self).__init__()
                # Single linear layer: flattened 28x28 image -> latent features
                self.nn = nn.Sequential(
                    nn.Linear(28 * 28, output_size),
                )

            def forward(self, x):
                # x is expected to be flattened to shape (batch, 784)
                return self.nn(x)

        class Decoder(nn.Module):
            def __init__(self, input_size=2):
                super(Decoder, self).__init__()
                # Single linear layer back to 784 pixels, squashed by Tanh
                self.nn = nn.Sequential(
                    nn.Linear(input_size, 28 * 28),
                    nn.Tanh(),
                )

            def forward(self, z):
                return self.nn(z)

        self.dim_latent_representation = dim_latent_representation
        self.encoder = Encoder(output_size=dim_latent_representation)
        # Separate linear heads for the mean and log-variance of q(z|x)
        self.mu_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
        self.logvar_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
        self.decoder = Decoder(input_size=dim_latent_representation)

    def reparameterise(self, mu, logvar):
        # Sample z = mu + std * eps during training; use the mean at eval time
        if self.training:
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std
        else:
            return mu

    def forward(self, x):
        # Encode, predict the posterior parameters, sample, then decode
        x = self.encoder(x)
        mu, logvar = self.mu_layer(x), self.logvar_layer(x)
        z = self.reparameterise(mu, logvar)
        return self.decoder(z), mu, logvar

Vanilla Autoencoder
- Training loss: 0.4089
- Validation loss (reconstruction error) : 0.4171
VAE Loss = MSE + 0.5 * KLD
- Training loss: 0.6420
- Validation loss (reconstruction error) : 0.6060
VAE Loss = MSE + 1 * KLD
- Training loss: 0.6821
- Validation loss (reconstruction error) : 0.6550
VAE Loss = MSE + 5 * KLD
- Training loss: 0.7122
- Validation loss (reconstruction error) : 0.7154
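
The weighted objectives listed above (MSE + weight * KLD) might be written as in the sketch below. The answer does not show its loss code or say how the terms were reduced, so the function name `weighted_vae_loss` and the reduction choice (sum over pixels or latent dimensions, then mean over the batch) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def weighted_vae_loss(recon_x, x, mu, logvar, kld_weight=1.0):
    """MSE reconstruction term plus a weighted KLD term, both averaged per sample."""
    x = x.view(x.size(0), -1)
    # Sum squared error over pixels, then average over the batch.
    mse = F.mse_loss(recon_x, x, reduction="sum") / x.size(0)
    # Closed-form KLD summed over latent dims, then averaged over the batch.
    kld = (-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)).mean()
    return mse + kld_weight * kld, mse, kld


# Stand-in tensors (batch of 4, latent size 30); real values come from the model.
x = torch.rand(4, 784)
recon_x = torch.tanh(torch.randn(4, 784))
mu, logvar = torch.ones(4, 30), torch.zeros(4, 30)
for w in (0.5, 1.0, 5.0):
    total, mse, kld = weighted_vae_loss(recon_x, x, mu, logvar, kld_weight=w)
    print(f"weight={w}: total={total.item():.3f} mse={mse.item():.3f} kld={kld.item():.3f}")
```

Returning the two terms separately makes it easier to see which one dominates as the weight changes.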

Here you can see output results from the different models. I also visualized the 30-dimensional latent space in 2D using the sklearn.manifold.TSNE transformation.

We observe a low loss value for the vanilla autoencoder with a 30-dimensional bottleneck, which results in high-quality reconstructed images. Although the loss values increased for the VAEs, the VAE arranged the latent space so that the gaps between latent representations of different classes shrank, which means we can get better outputs when manipulating (mixing) latents. Since the KLD term pushes the VAE's latent space toward an isotropic multivariate normal distribution, we can generate new, unseen images by sampling from the latent space, with higher quality than with the vanilla autoencoder. However, the reconstruction quality dropped (loss values increased), because the loss is a weighted combination of the MSE and KLD terms, and the KLD term forces the latent space to resemble a Gaussian distribution. As we increased the KLD weight, we obtained a more compact latent space closer to the prior distribution, at the cost of reconstruction quality.
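
Generating new images by sampling the latent space, as described above, might look like the following sketch. The `decoder` here is an untrained stand-in with the same layer layout as the Decoder in this answer; in practice it would be the decoder of the trained VAE. The sample count of 16 is arbitrary.

```python
import torch
import torch.nn as nn

latent_dim = 30  # latent size used in the experiments above

# Stand-in decoder with the same layout as the answer's Decoder (untrained here).
decoder = nn.Sequential(nn.Linear(latent_dim, 28 * 28), nn.Tanh())
decoder.eval()

with torch.no_grad():
    z = torch.randn(16, latent_dim)          # samples from the N(0, I) prior
    images = decoder(z).view(-1, 1, 28, 28)  # reshape to image tensors
```

Sampling from N(0, I) only gives sensible images if the KLD term was weighted strongly enough during training to pull the encoder's outputs toward that prior.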

Scurf answered 2/12, 2022 at 20:1 Comment(0)
