EE Systems Seminar
Query Complexity of Clustering
Abstract: Clustering is one of the most fundamental and popular methods for data classification, and used widely in statistical data analysis. However, due to inherent ambiguity of data representation and poor data quality, clustering is a very challenging task for any automated process. As a remedy, in recent years, a fruitful approach has been to incorporate some supervision to the process. Using a domain expert or crowd to obtain labeled data has resulted in dramatic improvement in accuracy of clustering. While many heuristics have been proposed, the field lacks systematic theoretical study to understand the performances and limitations of this model. We call this clustering with an oracle: where we are given a set of n elements to be clustered into k (unknown) clusters and an oracle which can interactively answer pair-wise queries of the form, "do u and v belong to the same cluster?'' The goal is to minimize the number of such queries while obtaining the optimum clustering. We provide a thorough analysis of the prominent heuristic algorithms for crowd-based clustering. We justify experimental observations with our analysis and information theoretic lower bounds. One of our main contributions is to show the power of side information in the form of a similarity matrix: a matrix that represents noisy pair-wise relationships such as one computed by some automated function on attributes of the elements. The reduction in query complexity under general models of similarity matrix is stunning and this remains true even when the answer of each query can be erroneous with certain probability and 'resampling' is not allowed. Our results include a general framework for proving lower bounds for classification problems in the interactive setting. Our algorithms are computationally efficient and are completely parameter free, i.e., works without any knowledge of k or the similarity matrix distribution. Time permitting we will also discuss interesting connections to popular community detection models such as the stochastic block model. (This is a joint work with Barna Saha).
Bio: Arya Mazumdar is an assistant professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst. Prior to this, Arya was an assistant professor at University of Minnesota-Twin Cities, and a postdoctoral scholar at Massachusetts Institute of Technology. Arya received his Ph.D. from University of Maryland, College Park, in 2011, where his thesis won a Distinguished Dissertation Fellowship Award. Arya is a recipient of the 2015 NSF CAREER award and the 2010 IEEE ISIT Jack K. Wolf Student Paper Award. He spent the summers of 2008 and 2010 at the Hewlett-Packard Laboratories, Palo Alto, CA, and IBM Almaden Research Center, San Jose, CA, respectively. Arya's research interests include information theory, coding theory, and their applications to CS theory/learning.
Host: Babak Hassibi
Contact: Katie Pichotta email@example.com