May 18, 2015

New England Machine Learning Day 2015

9:00 AM

Location: Cambridge, MA, USA

9:50 – 10:00
Opening remarks

10:00 – 10:30, Tamara Broderick, MIT
Statistical and computational trade-offs from Bayesian approaches to unsupervised learning

The flexibility, modularity, and coherent uncertainty estimates provided by Bayesian posterior inference have made this approach indispensable in a variety of domains. Since posteriors for many problems of interest cannot be calculated exactly, much work has focused on delivering accurate posterior approximations—though the computational cost of these approximations can sometimes be prohibitive, particularly in a modern, large-data context. Focusing on unsupervised learning problems, we illustrate in a series of vignettes how we can trade off some typical Bayesian desiderata for computational gains and vice versa. On one end of the spectrum, we sacrifice learning uncertainty to deliver fast, flexible methods for point estimates. In particular, we consider taking limits of Bayesian posteriors to obtain novel K-means-like objective functions as well as scalable, distributed algorithms. On the other end, we consider mean-field variational Bayes (MFVB), a popular and fast posterior approximation method that is known to provide poor estimates of parameter covariance. We develop an augmentation to MFVB that delivers accurate estimates of posterior uncertainty for model parameters.
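
As a rough illustration of the first trade-off, the sketch below shows the standard small-variance asymptotics argument by which a Gaussian mixture posterior collapses to a K-means-style objective. The notation (sigma^2, mu_k, lambda) is illustrative only and is not necessarily the derivation given in the talk.

```latex
% Small-variance asymptotics sketch (standard argument; details may differ from the talk).
% Mixture of spherical Gaussians with shared covariance sigma^2 I:
%   x_n | z_n = k, mu  ~  N(mu_k, sigma^2 I),   n = 1, ..., N.
% Scaling the complete-data negative log-posterior by sigma^2 and letting sigma^2 -> 0,
% the mixing weights and normalizing constants vanish and, for fixed assignments z and means mu,
\[
  \lim_{\sigma^2 \to 0} \; -2\sigma^2 \log p(x, z \mid \mu)
  \;=\; \sum_{n=1}^{N} \lVert x_n - \mu_{z_n} \rVert^2 ,
\]
% i.e. MAP inference degenerates into the K-means objective. In the Dirichlet-process
% mixture case, shrinking the concentration parameter at a suitable exponential rate in
% the variance instead yields a K-means-like objective with a penalty on the number of
% clusters K:
\[
  \sum_{n=1}^{N} \lVert x_n - \mu_{z_n} \rVert^2 \;+\; \lambda K .
\]
```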

10:35 – 11:05, David Jensen, University of Massachusetts Amherst
Why ML Needs Causality

Machine learning has risen to extraordinary success and prominence primarily by representing and reasoning about statistical associations. Apparently, we can perform a great many useful tasks without having to concern ourselves with representing and reasoning about causality. In this talk, I will explain the key differences between associational and causal models and describe some of the recent technical developments in causal inference. I will explain why many existing machine learning tasks are really causal reasoning in disguise, why an increasing number of machine learning tasks will explicitly require causal models, and why researchers and practitioners who understand causal reasoning will succeed where others fail.
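
A toy numerical example of the associational/causal gap (my own illustration, not an example from the talk): when a hidden confounder drives both X and Y, regressing Y on observational X gives a large slope even though intervening on X has no effect at all.

```python
# Toy illustration of association vs. intervention under confounding (not from the talk).
# Hypothetical generative model: a confounder U drives both X and Y; X has no direct
# effect on Y, yet Y is strongly associated with X in observational data.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational data: U -> X and U -> Y, no X -> Y edge.
u = rng.normal(size=n)
x = 2.0 * u + rng.normal(size=n)
y = 3.0 * u + rng.normal(size=n)

# Associational estimate: slope of the regression of Y on X.
assoc_slope = np.cov(x, y)[0, 1] / np.var(x)

# Interventional data: do(X = x0) severs the U -> X edge, so X is set independently of U.
x_do = rng.normal(size=n)            # X fixed by intervention
y_do = 3.0 * u + rng.normal(size=n)  # Y still depends only on U
causal_slope = np.cov(x_do, y_do)[0, 1] / np.var(x_do)

print(f"associational slope ~ {assoc_slope:.2f}")   # roughly 1.2, pure confounding
print(f"interventional slope ~ {causal_slope:.2f}") # roughly 0.0, the true causal effect
```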

11:10 – 11:40, Tim Kraska, Brown
Tupleware: Redefining Modern Analytics

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of companies operate or rent clusters in the range of a few dozen nodes and analyze relatively small data sets of up to a few terabytes. Targeting these users fundamentally changes the way we should build analytics systems. In this talk, Tim will present Tupleware, a new system developed at Brown University specifically aimed at the challenges faced by the typical user. The main difference between Tupleware and other frameworks is that it automatically compiles complex machine learning workflows into highly efficient distributed programs instead of interpreting the workflows at run time. Our initial experiments show that Tupleware is 30x – 300x faster than Spark and up to 6000x faster than Hadoop for common machine learning algorithms.
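
The compile-versus-interpret distinction can be illustrated generically. The sketch below is not Tupleware's API; it uses numba as a stand-in compiler, and only shows why fusing a workflow into a single compiled loop avoids per-operator dispatch and temporary arrays.

```python
# Generic sketch of "compile the workflow" vs. "interpret the workflow at run time".
# This is only an illustration of the idea; it is NOT Tupleware's API.
import numpy as np
from numba import njit  # assumes numba is installed; used here as a stand-in compiler

data = np.random.rand(1_000_000)

# Interpreted style: each operator is dispatched separately, one full pass (and one
# temporary array) per operator.
ops = [lambda v: v * 2.0, lambda v: v + 1.0, lambda v: v ** 2]

def interpreted(xs):
    out = xs
    for op in ops:
        out = op(out)
    return out.sum()

# Compiled style: the same operators fused into a single native-code loop.
@njit
def compiled(xs):
    total = 0.0
    for v in xs:  # one pass, no temporaries
        total += ((v * 2.0) + 1.0) ** 2
    return total

print(interpreted(data), compiled(data))  # nearly identical results; one pass vs. three
```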

11:40 – 1:45
Lunch and posters

1:45 – 2:15, Suvrit Sra, MIT
Geometric optimization in machine learning

2:20 – 2:50, Jennifer Listgarten, Microsoft Research
Genome and Epigenome-Wide Association Studies

Understanding the genetic underpinnings of disease is important for screening, treatment, drug development, and basic biological insight. Genome-wide associations, wherein individual genetic markers or sets of markers are systematically scanned for association with disease, are one window into disease processes. Naively, these associations can be found by use of a simple statistical test. However, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. These confounding factors include population structure, family relatedness, cell type heterogeneity, and environmental confounders. I will discuss the state-of-the-art statistical approaches based on linear mixed models for conducting these analyses, in which the confounders are automatically deduced, and then corrected for, by the data and model.
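
For concreteness, the standard linear mixed model used in this line of work can be written as follows; the notation is generic and may differ from the talk's.

```latex
% Standard GWAS linear mixed model (generic notation, not necessarily the talk's).
% y: phenotypes of N individuals; x: genotype of the SNP being tested;
% W: fixed-effect covariates; K: genetic similarity (kinship) matrix from genome-wide SNPs.
\[
  y \;=\; x\beta \;+\; W\alpha \;+\; u \;+\; \varepsilon,
  \qquad
  u \sim \mathcal{N}\!\left(0,\; \sigma_g^2 K\right),
  \qquad
  \varepsilon \sim \mathcal{N}\!\left(0,\; \sigma_e^2 I\right).
\]
% The random effect u absorbs confounding due to population structure and relatedness;
% the association test is on beta, with the variance components estimated from the data.
```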

2:50 – 3:20
Coffee break

3:20 – 3:50, Krzysztof Gajos, Harvard
Interactive Intelligent Systems

3:55 – 4:25, Jennifer Dy, Northeastern
Learning in Complex Data

Machine learning as a field has become more and more important due to the ubiquity of data collection in various disciplines. Coupled with this data collection is the hope that new discoveries or knowledge can be learned from data. My research spans both fundamental research in machine learning and the application of those methods to biomedical imaging, health, science and engineering. Multi-disciplinary research is instrumental to the growth of the various areas involved. Real problems provide rich sources of complex data that challenge the state of the art in machine learning; at the same time, machine learning algorithms provide additional resources to scientists in making non-trivial contributions to their field of research. In this talk, I present two examples of innovation in machine learning motivated by challenges from complex medical domains.

Chronic Obstructive Pulmonary Disease (COPD) is a lung disease characterized by airflow limitation due to noxious particles or gases (e.g., cigarette smoke). COPD is known to be a heterogeneous disease with genetic factors predisposing individuals to varying levels of disease severity as a function of exposure. An improved understanding of this heterogeneity should lead to better stratification of patients for prognosis and personalized treatments. However, standard clustering algorithms are limited because they do not take into account the interplay of the different features. We introduce a transformative way of looking at subtyping/clustering by recasting it in terms of discovering associations of individuals to disease trajectories (i.e., grouping individuals based on their similarity in how their health changes as a response to environmental and/or disease-causing variables). This led us to the development of a nonparametric Bayesian model of mixture of disease trajectories.
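
A generic sketch of a Dirichlet-process mixture of trajectories in this spirit (my notation, not necessarily the exact model from the talk):

```latex
% Generic nonparametric mixture-of-trajectories sketch (illustrative notation only).
% Subject i has measurements y_{ij} at exposure/covariate values t_{ij}; subjects are
% clustered by the trajectory they follow, with the number of subtypes left open.
\[
  G \sim \mathrm{DP}(\alpha, G_0), \qquad
  \theta_i \sim G, \qquad
  y_{ij} \mid \theta_i \sim \mathcal{N}\!\left(f_{\theta_i}(t_{ij}),\; \sigma^2\right),
\]
% so two subjects share a subtype exactly when they share trajectory parameters theta,
% i.e. when their health changes similarly in response to the same exposures.
```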

Skin cancer is one of the most common types of cancer. Standard clinical screening and diagnosis are performed using dermoscopy and visual examination. When an abnormal lesion is suspected, a biopsy is carried out, which is invasive, painful, and leaves a scar. Our goal is to develop novel machine learning and image analysis algorithms that aid physicians in the early detection of skin cancer from 3D reflectance confocal microscopy images in vivo without resorting to biopsy. A key feature of interest is the dermal-epidermal junction, which separates the dermis and epidermis layers of the skin. However, automated segmentation is challenging due to poor image contrast and the high variability of objects in biology and medicine. This led us to the design of a novel generative probabilistic latent shape spatial Poisson process model that can take into account the uncertainty in the number of objects, their locations, shapes, and appearances. We develop a Gibbs sampler that addresses the variation in model order and the nonconjugacy that arise in this setting.
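
Very roughly, models of this kind can be sketched as a marked spatial Poisson process over objects; the notation below is illustrative rather than the talk's exact formulation.

```latex
% Illustrative sketch of a marked spatial Poisson process prior over an unknown
% number of objects (my notation; the talk's model is more detailed).
\[
  \{(c_k, s_k, a_k)\}_{k=1}^{K} \sim \mathrm{MarkedPoissonProcess}(\lambda),
\]
% where K (the number of objects) is random, c_k are locations, and s_k, a_k are latent
% shape and appearance marks; the observed image is then generated from the objects
% covering each pixel. Because K itself is uncertain, the Gibbs sampler must move
% between models of different order as well as update object parameters.
```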

4:30 – 5:00, Ankur Moitra, MIT
Beyond Matrix Completion

Here we study some of the statistical and algorithmic problems that arise in recommendation systems. We will be interested in what happens when we move beyond the matrix setting, to work with higher order objects — namely, tensors. To what extent does inference over more complex objects yield better predictions, but at the expense of the running time? We will explore the computational vs. statistical tradeoffs for some basic problems about recovering approximately low rank tensors from few observations, and will show that our algorithms are nearly optimal among all polynomial time algorithms, under natural complexity-theoretic assumptions.

This is based on joint work with Boaz Barak.
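
To make the setting concrete, here is a rough statement of low-rank tensor completion as usually formulated (my paraphrase; constants, logarithmic factors, and incoherence conditions are omitted).

```latex
% Rough statement of the noisy tensor completion setting (paraphrased, not verbatim
% from the talk). An unknown n x n x n tensor of rank r,
\[
  T \;=\; \sum_{i=1}^{r} a_i \otimes b_i \otimes c_i ,
  \qquad a_i, b_i, c_i \in \mathbb{R}^{n},
\]
% is observed, possibly with noise, on a small random subset of entries, and the goal
% is to predict the remaining entries. Information-theoretically, on the order of r*n
% observed entries suffice, while the best known polynomial-time algorithms need on the
% order of r * n^{3/2} entries; the near-optimality claim in the abstract refers to a
% gap of this kind being unavoidable for polynomial-time algorithms under natural
% complexity-theoretic assumptions.
```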