Seminar at SMU Delhi

January 9, 2013 (Wednesday), 3:30 PM at Webinar
Speaker: Manik Varma, Microsoft Research and IIT Delhi
Title: On Computational Advertising & Big Data: A Machine Learning Approach to Recommending Advertiser Bid Phrases from Web Pages
Abstract of Talk
Recommending phrases from web pages for advertisers to bid on against search engine queries is an important research problem with direct commercial impact. Most approaches have found it infeasible to determine the relevance of all possible queries to a given ad landing page and have focussed on making recommendations from a small set of phrases extracted (and expanded) from the page using NLP techniques. In this paper, we eschew this paradigm and demonstrate that it is possible to efficiently predict the relevant subset of queries from a large set of monetizable ones by posing the problem as a multi-label learning task, with each query represented by a separate label. Multi-label learning focuses on predicting the most relevant subset of labels, in contrast to multi-class classification, which involves predicting only a single label. We develop Multi-label Random Forests to tackle problems with millions of labels. We propose a novel node splitting criterion that: (a) facilitates large-scale training over millions of data points, features and categories by learning from only positive data; (b) overcomes the well-known random forest limitation of overfitting in high-dimensional spaces; and (c) leads to efficient prediction with costs that are logarithmic in the number of labels. To compensate for numerous missing and incorrect labels, we do not train on the given labels directly but extend our random forest classifier to train on beliefs inferred about the state of each label. These beliefs are generated through a novel sparse semi-supervised learning formulation optimized via an efficient distributed iterative hard thresholding algorithm. We carry out extensive experiments on benchmark data sets demonstrating that our approach yields superior classification results compared to state-of-the-art multi-label algorithms and other random forest formulations.
We then tackle bid phrase recommendation problems with 90 million training points, 5 million dimensions and 10 million labels, which are well beyond the scope of current multi-label algorithms. We show significant gains over NLP techniques on a large test set of 5 million ads using multiple metrics.
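
As a small illustration of the distinction the abstract draws between multi-class and multi-label prediction, the following Python sketch may help. It is not the Multi-label Random Forest method from the talk: the query names, the relevance scores, the 0.5 threshold, and both function names are hypothetical and chosen only for the example.

```python
# Hypothetical relevance scores of candidate queries for one ad landing page.
# These values are made up for illustration; they do not come from the talk.
scores = {
    "cheap flights": 0.82,
    "flight deals": 0.74,
    "hotel booking": 0.31,
    "car insurance": 0.05,
}


def predict_multiclass(scores):
    """Multi-class: return only the single highest-scoring label."""
    return max(scores, key=scores.get)


def predict_multilabel(scores, threshold=0.5):
    """Multi-label: return every label whose score meets the threshold."""
    return {label for label, score in scores.items() if score >= threshold}


if __name__ == "__main__":
    print(predict_multiclass(scores))
    print(sorted(predict_multilabel(scores)))
```

In the bid phrase setting, each label is a query, so a multi-label predictor can recommend several queries for one page, whereas a multi-class predictor would be limited to one.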