KL-Divergence of two GMMs

I have two GMMs that I used to fit two different sets of data in the same space, and I would like to calculate the KL-divergence between them.

Currently I am using the GMMs defined in sklearn (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html) and the SciPy implementation of KL-divergence (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html).

How would I go about doing this? Do I want to just create tons of random points, get their probabilities on each of the two models (call them P and Q) and then use those probabilities as my input? Or is there some more canonical way to do this within the SciPy/SKLearn environment?

Westerman answered 27/9, 2014 at 22:44 Comment(1)
The closed form doesn't exist. Take a look at this paper to get an approximation of it: scholar.google.co.kr/… – Queston

There's no closed form for the KL divergence between GMMs. You can easily do Monte Carlo, though. Recall that KL(p||q) = \int p(x) log(p(x) / q(x)) dx = E_p[log(p(x) / q(x))], so averaging log(p(x)) - log(q(x)) over samples drawn from p gives an unbiased estimate:

def gmm_kl(gmm_p, gmm_q, n_samples=10**5):
    X = gmm_p.sample(n_samples)
    log_p_X, _ = gmm_p.score_samples(X)
    log_q_X, _ = gmm_q.score_samples(X)
    return log_p_X.mean() - log_q_X.mean()

(mean(log(p(x) / q(x))) = mean(log(p(x)) - log(q(x))) = mean(log(p(x))) - mean(log(q(x))) is somewhat cheaper computationally.)

You don't want to use scipy.stats.entropy; that's for discrete distributions.
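With a current scikit-learn, sklearn.mixture.GMM has been replaced by sklearn.mixture.GaussianMixture, whose sample() returns an (X, labels) tuple and whose score_samples() returns the log-densities directly. A minimal sketch of the same estimator against that newer API, assuming gmm_p and gmm_q are fitted GaussianMixture objects:

def gmm_kl(gmm_p, gmm_q, n_samples=10**5):
    # GaussianMixture.sample returns (samples, component_labels)
    X, _ = gmm_p.sample(n_samples)
    # score_samples returns the per-sample log-density directly
    log_p_X = gmm_p.score_samples(X)
    log_q_X = gmm_q.score_samples(X)
    return log_p_X.mean() - log_q_X.mean()

Called on two fitted GaussianMixture objects, this returns the Monte Carlo estimate of KL(p||q) in nats.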

If you want the symmetrized and smoothed Jensen-Shannon divergence (KL(p||(p+q)/2) + KL(q||(p+q)/2)) / 2 instead, it's pretty similar:

import numpy as np

def gmm_js(gmm_p, gmm_q, n_samples=10**5):
    X = gmm_p.sample(n_samples)
    log_p_X, _ = gmm_p.score_samples(X)
    log_q_X, _ = gmm_q.score_samples(X)
    log_mix_X = np.logaddexp(log_p_X, log_q_X)

    Y = gmm_q.sample(n_samples)
    log_p_Y, _ = gmm_p.score_samples(Y)
    log_q_Y, _ = gmm_q.score_samples(Y)
    log_mix_Y = np.logaddexp(log_p_Y, log_q_Y)

    return (log_p_X.mean() - (log_mix_X.mean() - np.log(2))
            + log_q_Y.mean() - (log_mix_Y.mean() - np.log(2))) / 2

(log_mix_X / log_mix_Y are actually the logs of twice the mixture densities; subtracting the np.log(2) correction outside the mean saves a few flops.)
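Likewise, a sketch of gmm_js against the newer GaussianMixture API, under the same assumptions as the KL sketch above; the return value is in nats and falls between 0 and log 2, which can help with interpreting it:

import numpy as np

def gmm_js(gmm_p, gmm_q, n_samples=10**5):
    X, _ = gmm_p.sample(n_samples)             # samples from p
    log_p_X = gmm_p.score_samples(X)
    log_q_X = gmm_q.score_samples(X)
    log_mix_X = np.logaddexp(log_p_X, log_q_X)

    Y, _ = gmm_q.sample(n_samples)             # samples from q
    log_p_Y = gmm_p.score_samples(Y)
    log_q_Y = gmm_q.score_samples(Y)
    log_mix_Y = np.logaddexp(log_p_Y, log_q_Y)

    return (log_p_X.mean() - (log_mix_X.mean() - np.log(2))
            + log_q_Y.mean() - (log_mix_Y.mean() - np.log(2))) / 2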

Tana answered 27/9, 2014 at 22:59 Comment(2)
Hi Dougal, I am trying to use the gmm_js function that you defined to check the robustness of my model, but I'm not sure what n_samples is doing here or how to interpret the return value. If I fit gmm_p and gmm_q on all of my data with the same number of clusters and feed them into this function, can I confirm the robustness of my model from the JS metric? Thanks. – Scoreboard
I came here to see if I could calculate the JS divergence between two distributions, and it looks like I can (code adapted from this answer is posted as an answer here: stats.stackexchange.com/questions/345915/…). – Powerhouse

The logic in Danica's answer is correct, but when I tried it with sklearn 1.3.0 the results were not reliable (in particular from the score function), so I wrote the following code (with help from ChatGPT) to compute the KL divergence directly from the mixture parameters.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def gmm_pdf(x, weights, means, covariances):
    # Evaluate the GMM density at x as the weighted sum of component PDFs
    K = len(weights)
    pdf = 0.0

    for k in range(K):
        # Calculate the weighted PDF of each Gaussian component
        component_pdf = weights[k] * multivariate_normal.pdf(x, mean=means[k], cov=covariances[k])
        pdf += component_pdf

    return pdf


def gmm_kl(gmm_p, gmm_q, n_samples=10**7):
    # Monte Carlo estimate of KL(p || q) using samples drawn from p
    X = gmm_p.sample(n_samples)[0]
    p_X = gmm_pdf(X, gmm_p.weights_, gmm_p.means_, gmm_p.covariances_)
    q_X = gmm_pdf(X, gmm_q.weights_, gmm_q.means_, gmm_q.covariances_)
    return np.mean(np.log(p_X / q_X))
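One caveat with the density-space version above: in higher dimensions the component PDFs can underflow to zero, which makes the log of the ratio unstable. A log-domain variant using scipy.special.logsumexp and multivariate_normal.logpdf avoids that; this is a sketch under the same assumptions (fitted GaussianMixture models with full covariance matrices), and gmm_logpdf / gmm_kl_mc are just illustrative names:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal


def gmm_logpdf(x, weights, means, covariances):
    # log p(x) = log sum_k w_k N(x; mu_k, Sigma_k), computed stably in log space
    w = np.asarray(weights)
    log_components = np.stack([
        multivariate_normal.logpdf(x, mean=means[k], cov=covariances[k])
        for k in range(len(w))
    ])                                   # shape (K, n_samples)
    return logsumexp(log_components, axis=0, b=w[:, None])


def gmm_kl_mc(gmm_p, gmm_q, n_samples=10**5):
    # Monte Carlo estimate of KL(p || q) using samples drawn from p
    X = gmm_p.sample(n_samples)[0]
    log_p = gmm_logpdf(X, gmm_p.weights_, gmm_p.means_, gmm_p.covariances_)
    log_q = gmm_logpdf(X, gmm_q.weights_, gmm_q.means_, gmm_q.covariances_)
    return np.mean(log_p - log_q)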
Antefix answered 16/9, 2023 at 14:4 Comment(0)
