CD3 Seminar

Tuesday, April 11, 2017, 2:00 PM

Building Training Sets for Astronomical Data; A Bayesian Feature Transformation for Domain Adaptation

Speaker: Pavlos Protopapas, Inst. for Applied Computational Science, Cambridge, MA
Location: Powell-Booth 100 (Seminar Room)
Abstract: Supervised data mining and machine learning rely on the availability of labeled data. When sufficient training data is available, supervised models achieve high performance in many domains. However, labeled data is often much scarcer than unlabeled data, and much more expensive and difficult to obtain. Moreover, when models that perform well in one setting are applied to data from a different but related domain (e.g., from a different telescope or sensor), performance often drops significantly. In astronomy the problem is compounded: unlabeled data is being generated at a rate that greatly surpasses the rate at which labeled data becomes available. Domain adaptation aims to learn from a domain where labeled data is available, the 'source domain', and through some adaptation perform well on a different domain, the 'target domain'. In this talk, I present a new probabilistic model that represents the source and target distributions as two Gaussian mixtures and finds a transformation between the feature spaces of the two domains, so that labeled data can be transferred between them. Our approach allows working with data available in one domain as if it belonged to the other, enabling models to be trained in the target domain on training sets adapted from the source domain. We evaluate our proposal on simulated data and on the problem of variable star classification. For the latter, we use data from multiple astronomical surveys that differ in sensor sensitivity, atmospheric conditions, and data sampling frequency, among other characteristics.

Contact: Tracy Sheffer at 4116, tracy@cd3.caltech.edu