
Evaluation metrics for generative models

Nikhil Verma
5 min readAug 31, 2022


A machine learning algorithm is said to learn a task T from experience E if its performance on T, as measured by a performance measure P, improves with E.

To measure the performance of models, many quantitative techniques have been suggested in the literature for different tasks, such as:

  • Confusion matrix, AUC, PR curves, F1-score and accuracy for classification
  • MAE, MSE, RMSE, R² and adjusted R² for regression
  • Dunn’s index and silhouette coefficient for clustering
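To make the classification and regression metrics above concrete, here is a minimal sketch that computes a few of them from scratch with NumPy on toy data (the labels and values are illustrative, not from any real model):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    return np.mean(y_true == y_pred)

def f1(y_true, y_pred):
    # Harmonic mean of precision and recall for the positive class
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mae(y_true, y_pred):
    # Mean absolute error for regression
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    # Root mean squared error for regression
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy classification example
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])
print(accuracy(y_true, y_pred))  # 0.75
print(f1(y_true, y_pred))        # 0.8

# Toy regression example
r_true = np.array([3.0, 5.0, 2.5])
r_pred = np.array([2.5, 5.0, 3.0])
print(mae(r_true, r_pred), rmse(r_true, r_pred))
```

Libraries such as scikit-learn provide ready-made versions of all of these, but the point here is that each metric is a simple, well-defined function of predictions and ground truth — something generative modelling, as we will see, does not obviously have.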

Generative modelling is a different animal from the tasks discussed so far. It tries to learn the probability distribution underlying the experiences, and then to generate samples that look as if they originated from that learned distribution. Put simply, the model tries to Generate by Generalising. In this post we will look at some of the evaluation metrics used for generative modelling.
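As a toy illustration of Generate by Generalising, here is a minimal sketch where the "model" is just a 2-D Gaussian fit by maximum likelihood: it learns the distribution of the training points and then draws fresh samples from it, rather than replaying the training data (the data and model choice here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "experience": 500 points from some unknown 2-D distribution
train = rng.normal(loc=[2.0, -1.0], scale=[0.5, 1.5], size=(500, 2))

# "Learn" the distribution: fit a Gaussian by maximum likelihood
mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False)

# "Generate": draw new samples from the learned distribution —
# these are fresh points, not copies of the training set
samples = rng.multivariate_normal(mu, cov, size=5)
print(samples)
```

A real generative model (a diffusion model, a GAN, an autoregressive transformer) replaces the Gaussian with a far richer learned distribution, but the generate-by-sampling idea is the same.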

Before moving ahead let me ask you a question. Which of these two sets of generated samples “look” better?

[Sample images taken from recent Google research articles]

These are samples generated by two famous SOTA research works from Google, named IMAGEN (a diffusion-based generative model) and PARTI (an autoregressive generative model). Do you think we can say definitively that the images on the left are better, or that the ones on the right are?

Well, not really. The reason is that Generation by Generalising is hard to define and assess. One trivial way to generate is to memorise the training data well and then generalise as in classification or regression. But memorising the training set would give excellent samples close to the training distribution, which is clearly undesirable from the perspective of generating new data points. We can summarise this argument as:

Generation is a Qualitative task

And there are no universally good measures for judging things qualitatively, as it is very subjective. But to train a model — to decrease a loss function and reach a minimum — we need something quantitative that can evaluate the learning process. Many ways to quantify the generation task have been proposed, since quantitative evaluation of a qualitative task can have many answers.
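One widely used example of such a quantitative proxy (named here as an illustration, not taken from the excerpt) is the Fréchet Inception Distance (FID), which compares Gaussians fit to feature vectors of real and generated images. Here is a minimal sketch of its core computation, assuming the features have already been extracted by some network (the random arrays below stand in for those features):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two sets of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; discard tiny imaginary noise
    covmean = sqrtm(cov_a @ cov_b).real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))   # stand-in for real-image features
fake = rng.normal(0.1, 1.1, size=(1000, 8))   # stand-in for generated features
print(frechet_distance(real, real))  # ~0: identical sets are indistinguishable
print(frechet_distance(real, fake))  # positive: distributions differ
```

Lower is better: a score near zero means the generated feature distribution is statistically close to the real one. Note that this is still only a proxy — it measures distributional similarity, not whether any individual sample "looks" good.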
