Thursday, February 23, 2017 at 3:00pm
Much of modern data is generated by humans and drives decisions made in a variety of settings, such as recommendations for online markets, analysis of social networks, or denoising crowdsourced labels. Due to the complexities of human behavior, the precise data model is often unknown, creating a need for flexible models with minimal assumptions. A minimal property that is natural for many datasets is "exchangeability", i.e., invariance under relabeling of the dataset, which naturally leads to a nonparametric latent variable model. The corresponding inference problem can be formulated as matrix or graphon estimation.
We propose similarity-based inference algorithms for such nonparametric latent variable models, and we provide theoretical guarantees that bound the error. Our method can be computed in a distributed manner, which lends itself to good scalability. As a byproduct, our analysis explains a longstanding mystery of why the collaborative filtering heuristic performs well in practice. While classical collaborative filtering typically requires a dense dataset, we propose a new method that compares larger-radius neighborhoods of data to compute similarities, and we show that the estimate converges even for very sparse datasets, which has implications for sparse graphon estimation. For denoising crowdsourced labels, our algorithm provides guarantees under flexible models that allow for heterogeneity of task and worker types.
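For readers unfamiliar with the classical collaborative filtering heuristic mentioned above, the following is a minimal illustrative sketch of the textbook user-user variant, not the speaker's algorithm. It assumes a small ratings matrix with NaN marking missing entries; the function name `predict_rating`, the parameters `k` and `min_overlap`, and the `demo` matrix are all hypothetical and chosen only for illustration.

```python
import numpy as np


def predict_rating(ratings, user, item, k=2, min_overlap=2):
    """Estimate ratings[user, item] by averaging the ratings that the
    k most similar users gave to `item` (classical user-user heuristic)."""
    observed = ~np.isnan(ratings)
    candidates = []
    for other in range(ratings.shape[0]):
        # Only users who actually rated the target item can contribute.
        if other == user or not observed[other, item]:
            continue
        # Similarity is computed on items both users have rated.
        common = observed[user] & observed[other]
        if common.sum() < min_overlap:
            continue
        # Mean squared difference: a smaller value means more similar users.
        dist = np.mean((ratings[user, common] - ratings[other, common]) ** 2)
        candidates.append((dist, ratings[other, item]))
    if not candidates:
        return float("nan")  # no comparable neighbor: the sparse-data failure mode
    candidates.sort(key=lambda pair: pair[0])
    return float(np.mean([value for _, value in candidates[:k]]))


# Hypothetical 4 users x 4 items matrix; NaN marks an unobserved rating.
demo = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [5.0, 4.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 5.0],
    [4.0, 5.0, 1.0, np.nan],
])
estimate = predict_rating(demo, user=0, item=2, k=1)
```

The `min_overlap` requirement shows why this direct comparison needs reasonably dense data: when two users share too few rated items, no similarity can be computed between them.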
Christina Lee is a Ph.D. candidate in the Laboratory for Information and Decision Systems (LIDS) at the Massachusetts Institute of Technology. She received her B.S. in Computer Science from the California Institute of Technology in 2011, and she received her M.S. in Electrical Engineering and Computer Science from MIT in 2013. She is a recipient of the MIT Jacobs Presidential Fellowship, the NSF Graduate Research Fellowship, and the Claude E. Shannon Research Assistantship. Her research focuses on designing scalable statistical algorithms for processing social data based on principles from statistical inference. Applications include recommendation systems, low-cost crowdsourcing, community structure in social networks, marketplaces like Uber, and image denoising.