“Learning from Crowdsourced Multi-Labeling: A Variational Bayesian Approach” by Dr. Junming Yin
- Dr. Junming Yin
Assistant Professor
Department of Management Information Systems
University of Arizona
Dr. Yin plans to present two studies:
Learning from Crowdsourced Multi-Labeling: A Variational Bayesian Approach
Microtask crowdsourcing has emerged as a cost-effective approach for obtaining large-scale labeled data. Crowdsourcing platforms, such as MTurk, provide an online marketplace where task requesters can submit a batch of microtasks for a crowd of workers to complete in exchange for small monetary compensation. As the information collected from a crowd can be prone to errors, additional algorithmic techniques are needed to infer the ground truth labels from noisy annotations by workers with heterogeneous quality. Moreover, it would be very beneficial to identify and possibly filter out low-quality workers to foster the creation of a healthy and sustainable crowdsourcing ecosystem. Much of the existing literature on crowd labeling has focused on the single-label setting. However, in many application domains, it is common for each item to be annotated to belong to multiple categories simultaneously. In this paper, we present a variety of new approaches for modeling label dependency and worker quality in the context of multi-label crowdsourcing. To capture label dependency, we introduce three methods based on a Bayesian mixture of Bernoulli distributions, its Dirichlet process extension, and a multivariate logit-normal distribution. We also propose two distinct generative models for characterizing shared and hierarchical structures of worker quality. Efficient collapsed and Laplace variational inference algorithms are then developed to jointly infer ground truth labels and worker quality. Extensive simulation and MTurk experiments show that the models based on integrating Bernoulli mixtures and shared structure of worker quality achieve a significant improvement over other state-of-the-art methods. Our study clearly highlights that joint and effective modeling of label dependency and worker quality is crucial to the design of a multi-label crowdsourcing system.
The proposed framework also has great potential to be extended to a broader range of applications, in which different opinions need to be combined to measure multiple perspectives of an object.
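As a rough illustration of why a mixture of Bernoulli distributions can capture label dependency, the sketch below computes the exact mean and covariance of a label vector under such a mixture. It is only a toy calculation, not the model from the study: the function name `mixture_label_moments`, the two components, and all probability values are assumptions chosen for illustration.

```python
# Illustrative sketch only: the mixture weights and per-component label
# probabilities below are made-up values, not taken from the study.

def mixture_label_moments(weights, probs):
    """Return (mean, cov) of a binary label vector drawn from a mixture
    of components whose labels are independent Bernoulli variables.

    weights[k]  : mixing weight of component k (weights sum to 1)
    probs[k][j] : P(label j = 1 | component k)
    """
    n_labels = len(probs[0])
    mean = [sum(w * p[j] for w, p in zip(weights, probs))
            for j in range(n_labels)]
    cov = [[0.0] * n_labels for _ in range(n_labels)]
    for i in range(n_labels):
        for j in range(n_labels):
            if i == j:
                # A binary label with mean m has variance m * (1 - m).
                cov[i][j] = mean[i] * (1.0 - mean[i])
            else:
                # Labels are independent within a component, so
                # E[y_i * y_j] = sum_k w_k * p_ki * p_kj.
                second = sum(w * p[i] * p[j] for w, p in zip(weights, probs))
                cov[i][j] = second - mean[i] * mean[j]
    return mean, cov


# Two hypothetical components: one favors labels 0 and 1, the other label 2.
weights = [0.5, 0.5]
probs = [[0.9, 0.8, 0.1],
         [0.1, 0.2, 0.9]]
mean, cov = mixture_label_moments(weights, probs)
```

Although labels are independent within each component, the off-diagonal entries of `cov` are nonzero once the components are mixed: labels 0 and 1 come out positively correlated and labels 0 and 2 negatively correlated in this toy setting.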
Relaxed Multivariate Bernoulli Distribution and Its Applications to Deep Generative Models
Recent advances in variational auto-encoders (VAEs) have demonstrated the possibility of approximating the intractable posterior distribution with a variational distribution parameterized by a neural network. To optimize the variational objective of a VAE, the reparameterization trick is commonly applied to obtain a low-variance estimator of the gradient. The main idea of the trick is to express the variational distribution as a differentiable function of parameters and a random variable with a fixed distribution. To extend the reparameterization trick to inference involving discrete latent variables, a common approach is to use a continuous relaxation of the categorical distribution as the approximate posterior. However, when continuous relaxation is applied to the multivariate case, the variables are typically assumed to be independent, which is suboptimal in applications where modeling dependency is crucial to the overall performance. In this work, we propose a multivariate generalization of the Relaxed Bernoulli distribution, which can be reparameterized and can capture the correlation between variables via a Gaussian copula. We demonstrate its effectiveness in two tasks: density estimation with a Bernoulli VAE and semi-supervised multi-label classification.
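The idea of a reparameterizable, correlated relaxation can be sketched in a few lines for the two-variable case. The code below is one plausible construction, assumed here for illustration, and may differ from the parameterization used in the work: fixed-distribution Gaussian noise is correlated, mapped to uniforms through the standard normal CDF (a Gaussian copula), and then pushed through the usual Relaxed Bernoulli transform. The function names, the correlation `rho`, and the temperature value are all hypothetical.

```python
import math
import random


def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_bernoulli_pair(probs, rho, temperature, rng):
    """Draw one pair of relaxed Bernoulli variables in [0, 1] whose
    dependency is induced by a Gaussian copula with correlation rho
    (assumed construction, for illustration only)."""
    # Noise with a fixed distribution: two independent standard normals.
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Correlate the normals (Cholesky factor of a 2x2 correlation matrix).
    g = (e1, rho * e1 + math.sqrt(1.0 - rho * rho) * e2)
    tiny = 1e-12
    sample = []
    for p, gi in zip(probs, g):
        # Copula step: each marginal becomes Uniform(0, 1).
        u = min(max(std_normal_cdf(gi), tiny), 1.0 - tiny)
        # Relaxed Bernoulli transform: a differentiable function of the
        # parameter p and the noise u.
        logit_p = math.log(p) - math.log(1.0 - p)
        logistic_noise = math.log(u) - math.log(1.0 - u)
        sample.append(sigmoid((logit_p + logistic_noise) / temperature))
    return sample


def sample_correlation(pairs):
    """Pearson correlation between the two coordinates of the samples."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
dependent = [relaxed_bernoulli_pair((0.5, 0.5), 0.9, 0.5, rng)
             for _ in range(5000)]
independent = [relaxed_bernoulli_pair((0.5, 0.5), 0.0, 0.5, rng)
               for _ in range(5000)]
```

Comparing `sample_correlation(dependent)` with `sample_correlation(independent)` shows that the copula correlation carries over to the relaxed variables, while each sample remains a deterministic, differentiable function of the parameters given the noise.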