{"id":692358,"date":"2020-09-24T10:32:48","date_gmt":"2020-09-24T17:32:48","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=692358"},"modified":"2021-04-28T11:16:56","modified_gmt":"2021-04-28T18:16:56","slug":"measuring-dataset-similarity-using-optimal-transport","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/measuring-dataset-similarity-using-optimal-transport\/","title":{"rendered":"Measuring dataset similarity using optimal transport"},"content":{"rendered":"\n
\"\"\/<\/figure>\n\n\n\n

Is FashionMNIST, a dataset of images of clothing items labeled by category, more similar to MNIST or to USPS, both of which are classification datasets of handwritten digits? This is a hard question to answer, but the solution could have an impact on various aspects of machine learning. For example, it could change how practitioners augment a particular dataset to improve the transfer of models across domains or how they select a dataset to pretrain on, especially in scenarios where labeled data from the target domain of interest is scarce.

In our recent paper, "Geometric Dataset Distances via Optimal Transport," we propose the Optimal Transport Dataset Distance, or OTDD for short, an approach to defining and computing similarities, or distances, between classification datasets. The OTDD relies on optimal transport (OT), a flexible geometric method for comparing probability distributions, and can be used to compare any two datasets, regardless of whether their label sets are directly comparable. As a bonus, the OTDD returns a coupling of the two datasets being compared, which can be understood as a set of soft correspondences between individual items in the datasets. These correspondences can be used to answer questions such as: given a data point in one dataset, what is its corresponding point in the other dataset? In this post, we show the distances and correspondences obtained with our method for five popular benchmark datasets and give an overview of how the OTDD is computed, what it has to do with shoveling dirt, and why it's a promising tool for transfer learning.

Why is measuring distance between labeled datasets hard?

Comparing any two distinct classification datasets, like the datasets of clothing and handwritten digits mentioned above, poses at least three obvious challenges:

1. They might have different cardinality, or number of points.
2. They might have different native dimensionality (for example, MNIST digits are 28 × 28 pixels, while USPS digits are 16 × 16).
3. Their labels might correspond to different concepts, as is the case when comparing FashionMNIST with MNIST or USPS (fashion items versus digits).

Note that the first two challenges also apply to unlabeled datasets, but the third challenge, which, as we'll see below, is the most difficult, is specific to labeled datasets.

Intuitively, the number of examples should have little bearing on the distance between datasets (after all, whether MNIST has 70,000 points or 30,000, it's still essentially MNIST). We can enforce this invariance to dataset size by thinking of datasets as probability distributions from which finitely many samples are drawn, and comparing those distributions instead. Similarly, the dimension of the input should not play a major role, if any, in the distance we seek. For example, the essence of MNIST is the same regardless of image size. Here, we'll assume that images are up- or down-sampled as needed so that images in the two datasets being compared have the same size (we discuss how to relax this assumption in the paper).
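To make the first two fixes concrete, here is a minimal sketch in Python of putting two image datasets on common footing before comparing them: subsample to a common number of points and resize every image to a shared resolution. The dataset arrays in the usage comment (`usps_16x16`, `mnist_28x28`) are illustrative placeholders, not real loaders.

```python
# Minimal sketch: enforce size and dimension invariance before comparison.
# Assumes each dataset is a numpy array of grayscale images.
import numpy as np
from scipy.ndimage import zoom

def as_empirical_sample(images, n_samples, target_hw=(28, 28), seed=0):
    """Subsample a dataset and resize every image to a common shape,
    so the result can be treated as a finite sample from a distribution."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=min(n_samples, len(images)), replace=False)
    h, w = target_hw
    resized = [
        zoom(img, (h / img.shape[0], w / img.shape[1]), order=1)  # bilinear resample
        for img in images[idx]
    ]
    return np.stack(resized).reshape(len(idx), -1)  # flatten to feature vectors

# Usage (hypothetical arrays): both datasets end up as (5000, 784) samples.
# X = as_empirical_sample(usps_16x16, 5000)
# Y = as_empirical_sample(mnist_28x28, 5000)
```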


The last of these challenges, dealing with datasets that have disjoint label sets, is much harder to overcome. Indeed, how can we compare the category "shoe" from FashionMNIST to the "6" category in MNIST? And what if the number of categories differs between the two datasets? For example, what if we're comparing MNIST, which has 10 categories, to ImageNet, which has 1,000? Our solution to this conundrum, in a nutshell, relies on representing each label by the collection of points with that label and, as with enforcing invariance to dataset size, formally treating these collections as probability distributions. We can then compare any two categories across different datasets by comparing their associated collections, understood, again, as probability distributions, in feature space, which in turn allows us to extend the comparison to the datasets themselves.
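One concrete way to implement this label-as-distribution idea is to approximate each label's point cloud in feature space by a Gaussian, for which the 2-Wasserstein distance (defined formally in the next section) has a closed form; the paper discusses a Gaussian approximation along these lines as a fast variant. A minimal sketch, with hypothetical feature and label arrays in the usage comment:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(X, Y):
    """Closed-form 2-Wasserstein distance between Gaussian fits of two point
    clouds X, Y of shape (n_points, n_features):
    W2^2 = ||mu1 - mu2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    mu1, mu2 = X.mean(axis=0), Y.mean(axis=0)
    S1, S2 = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    S2_half = sqrtm(S2).real                    # matrix square root
    cross = sqrtm(S2_half @ S1 @ S2_half).real  # geometric cross term
    w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * cross)
    return np.sqrt(max(w2_sq, 0.0))             # clamp tiny negative round-off

# Usage (hypothetical arrays): compare MNIST's "6" cloud to a FashionMNIST class.
# d = gaussian_w2(feats_mnist[labels_mnist == 6],
#                 feats_fashion[labels_fashion == SHOE])
```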

The approach we've sketched so far banks on being able to compute distances between two different kinds of probability distributions: those corresponding to the labels and those corresponding to the entire datasets. Ideally, we also need to do this calculation in a computationally feasible way. Enter optimal transport, which provides the backbone of our approach.

Optimal transport: Comparing by 'transporting'

Optimal transport traces its roots back to 18th-century France, where the mathematician Gaspard Monge was concerned with finding optimal ways to transport dirt and rubble from one location to another. Let's consider an individual using a shovel to move dirt, a simplified version of the scenario Monge had in mind. In his formulation (below), each movement of the shovel between two piles of dirt carries a cost proportional to the distance traveled by the shovel multiplied by the mass of dirt carried. The total cost of transporting dirt between the piles is then the sum of the costs of these individual movements.

    \"\"
    Optimal transport was born as a method to find least-cost schemes to transport dirt and rubble from one place to another. Thinking about probability distributions as piles of dirt, optimal transport intuitively quantifies their dissimilarity in terms of how much and how far the \u201cdirt,\u201d or probability mass, must be shoveled to transform one pile into the other.<\/figcaption><\/figure><\/div>\n\n\n\n

But what do dirt and shoveling have to do with statistics or machine learning? As it turns out, the intuitive framework devised by Monge provides an ideal formulation for comparing probability distributions. Let us think of probability density functions as piles of dirt, where the "height" of the pile at a point corresponds to the probability density at that point, and of shoveling dirt between the piles as moving probability mass from one point to another, at a cost proportional to the distance between the two points. Optimal transport gives us a way to quantify the similarity between two probability density functions in terms of the lowest total cost incurred by completely shoveling one pile into the shape and location of the other.
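As a toy illustration of this shoveling view, consider three one-dimensional "piles" (here via SciPy's one-dimensional Wasserstein distance; the sample locations are made up): moving the same mass farther costs proportionally more.

```python
from scipy.stats import wasserstein_distance

a = [0.0, 1.0, 2.0]  # pile A: three equal units of mass on the line
b = [0.5, 1.5, 2.5]  # pile A shifted right by 0.5
c = [5.0, 6.0, 7.0]  # pile A shifted right by 5.0

print(wasserstein_distance(a, b))  # 0.5: every unit of mass moves 0.5
print(wasserstein_distance(a, c))  # 5.0: same mass, ten times the haul
```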

Formally, the general optimal transport problem between two probability distributions \(\alpha\) and \(\beta\) over a space \(\mathcal{X}\) is defined as:

\(\min_{\pi \in \Pi(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{X}} c(x,x')\,\text{d}\pi(x,x')\)


Here \(\pi\) is a joint distribution (formally, a coupling) with marginals \(\alpha\) and \(\beta\). When the cost \(c(x,x')\) is taken to be a power of the distance between points, \(c(x,x') = d(x,x')^p = \|x - x'\|^p\), the \(p\)-th root of the value of this problem is known as the p-Wasserstein distance, denoted \(\text{W}_p\).
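In practice, for empirical distributions this minimization over couplings is a linear program that off-the-shelf solvers handle. A minimal sketch using the POT (Python Optimal Transport) library, with synthetic Gaussian samples standing in for real data:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))  # finite sample from alpha
Y = rng.normal(3.0, 1.0, size=(120, 2))  # finite sample from beta

a = np.full(len(X), 1.0 / len(X))  # uniform marginal weights for alpha
b = np.full(len(Y), 1.0 / len(Y))  # uniform marginal weights for beta

p = 2
M = ot.dist(X, Y, metric='euclidean') ** p  # pairwise costs ||x - x'||^p
W_p = ot.emd2(a, b, M) ** (1.0 / p)         # p-Wasserstein distance
pi = ot.emd(a, b, M)                        # the optimal coupling itself
```

The coupling `pi` is exactly the kind of soft correspondence between individual points mentioned at the start of this post.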

Distances between feature-label pairs

Using optimal transport to compare two probability distributions requires defining a distance between points sampled from those distributions. In our case, in which we're comparing two datasets, each point \(z\) is a pair comprising a feature vector (an image, for the datasets discussed here) and a label. So we need to be able to compute a distance between, say, the pair (\(x\), "six"), where \(x\) is an image of a "6" from MNIST, and the pair (\(x'\), "shoe"), where \(x'\) is an image of a shoe from FashionMNIST. The first part is easy: we can compute distances between the images using various standard approaches. Defining a distance between their labels is, as we discussed earlier, much more complicated. But is it worth it? What happens if we ignore the labels and just use the features to compute the distance? The visualization below shows what could go wrong: ignoring the labels might lead us to believe two datasets are very similar when in fact, from a classification perspective, they're quite different.
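Following the paper, the two ingredients combine into a single hybrid distance between feature-label pairs: a Euclidean term for the images plus a Wasserstein term for the labels' feature distributions. A minimal sketch, reusing `gaussian_w2` from above and assuming a hypothetical precomputed table of label-to-label distances:

```python
import numpy as np

def pair_distance(x1, y1, x2, y2, label_dists, p=2):
    """Distance between feature-label pairs z = (x, y) and z' = (x', y'):
    d_Z(z, z')^p = ||x - x'||^p + W_p(alpha_y, alpha_y')^p,
    where alpha_y is the feature distribution of label y and
    label_dists[(y1, y2)] holds precomputed label-to-label distances
    (for example, from gaussian_w2 above)."""
    feature_term = np.linalg.norm(x1 - x2) ** p
    label_term = label_dists[(y1, y2)] ** p
    return (feature_term + label_term) ** (1.0 / p)
```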
