Rigorous Systems Research Group (RSRG) Seminar
Structured data, such as sequences, trees, graphs and hypergraphs, are prevalent in a number of interdisciplinary areas such as network analysis, knowledge engineering, computational biology, drug design and materials science. The availability of large amounts of such structured data poses great challenges for the machine learning community. How should such data be represented to capture their similarities and differences? How can predictive models be learned efficiently from large amounts of such data? How can structured data be generated de novo given certain desired properties?
A common approach to these challenges is to first design a similarity measure between two data points, called the kernel function, based on either statistics of the substructures or probabilistic generative models, and then have a machine learning algorithm optimize a predictive model based on that similarity measure. However, this elegant two-stage approach has difficulty scaling up, and discriminative information is not exploited during the design of the similarity measure.
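To make the two-stage approach concrete, here is a minimal sketch of a kernel built from substructure statistics. It is a generic k-mer (spectrum) kernel for sequences, chosen for illustration and not taken from the talk; the function names `kmer_counts`, `spectrum_kernel` and `gram_matrix` are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Similarity of two sequences: inner product of their k-mer count vectors."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared k-mers contribute.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def gram_matrix(sequences, k=2):
    """Stage one: pairwise similarities for the whole data set (n x n entries)."""
    return [[spectrum_kernel(s, t, k) for t in sequences] for s in sequences]


if __name__ == "__main__":
    data = ["ABAB", "BABA", "CCCC"]
    for row in gram_matrix(data):
        print(row)
```

In stage two, a kernel method such as a support vector machine would be trained on this matrix. The matrix has one entry per pair of data points, which is one reason the approach is hard to scale, and the labels play no role in how the similarity itself is defined.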
In this talk, I will present Structure2Vec, an effective and scalable approach for representing structured data based on the idea of embedding latent variable models into a feature space, and learning such a feature space using discriminative information. Interestingly, Structure2Vec extracts features by performing a sequence of nested nonlinear operations in a way similar to graphical model inference procedures, such as mean field and belief propagation. In applications involving genome and protein sequences, drug molecules and energy materials, Structure2Vec consistently produces state-of-the-art predictive performance. Furthermore, in a materials property prediction problem involving 2.3 million data points, Structure2Vec produces a more accurate model that is also 10,000 times smaller. Finally, I will discuss potential improvements over current work, possible extensions to network analysis and computer vision, and thoughts on the structured data design problem.
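The "sequence of nested nonlinear operations" can be pictured with a rough sketch of a mean-field-style embedding update on a graph. This is an illustration under stated assumptions, not the implementation from the talk: the function name `embed_graph`, the ReLU nonlinearity, the sum pooling and the weight shapes are all assumed here, and the weights are random placeholders, whereas in the approach described above they would be learned from labels.

```python
import numpy as np


def embed_graph(node_features, adjacency, W1, W2, num_rounds=4):
    """Mean-field-style embedding sketch (assumed form, not the talk's exact model).

    node_features: (n, f) array of per-node input features
    adjacency:     (n, n) symmetric 0/1 adjacency matrix
    W1:            (d, f) weights applied to a node's own features
    W2:            (d, d) weights applied to the sum of neighbor embeddings
    """
    n = adjacency.shape[0]
    d = W1.shape[0]
    mu = np.zeros((n, d))  # one embedding vector per node, initialized to zero
    for _ in range(num_rounds):
        neighbor_sum = adjacency @ mu  # aggregate the current neighbor embeddings
        # Each round nests one more nonlinear operation around the previous round.
        mu = np.maximum(0.0, node_features @ W1.T + neighbor_sum @ W2.T)
    return mu.sum(axis=0)  # pool node embeddings into one vector for the graph


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy path graph on 4 nodes with 3 input features per node.
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    X = rng.normal(size=(4, 3))
    W1 = rng.normal(size=(5, 3))
    W2 = 0.1 * rng.normal(size=(5, 5))
    print(embed_graph(X, A, W1, W2))
```

The resulting vector has a fixed length regardless of graph size and does not depend on how the nodes are numbered, so it can be fed to any standard classifier or regressor, with the weights trained jointly against the prediction target.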