Seminar at SMU Delhi
January 9, 2013 (Wednesday),
3:30 PM at Webinar
Speaker:
Manik Varma,
Microsoft Research and IIT Delhi
Title:
On Computational Advertising & Big Data: A Machine Learning Approach to Recommending Advertiser Bid Phrases from Web Pages
Abstract of Talk
Recommending phrases from web pages for advertisers to bid on against
search engine queries is an important research problem with direct
commercial impact. Most approaches have found it infeasible to
determine the relevance of all possible queries to a given ad landing
page and have focussed on making recommendations from a small set of
phrases extracted (and expanded) from the page using NLP techniques.
In this paper, we eschew this paradigm, and demonstrate that it is
possible to efficiently predict the relevant subset of queries from a
large set of monetizable ones by posing the problem as a multi-label
learning task with each query being represented by a separate label.
Multi-label learning focuses on predicting the most relevant
subset of labels, in contrast to multi-class classification,
which involves predicting only a single label. In this paper, we
develop Multi-label Random Forests to tackle problems with millions of
labels. We propose a novel node splitting criterion that: (a)
facilitates large scale training over millions of data points,
features and categories by learning from only positive data; (b)
overcomes the well-known random forest limitation of overfitting in
high dimensional spaces; and (c) leads to efficient prediction with
costs that are logarithmic in the number of labels. To compensate for
numerous missing and incorrect labels, we do not train on the given
labels directly but extend our random forest classifier to train on
beliefs inferred about the state of each label. These beliefs are
generated through a novel sparse semi-supervised learning formulation
optimized via an efficient distributed iterative hard thresholding
algorithm.
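
The abstract names iterative hard thresholding but does not spell it out.
As a rough illustration only, the sketch below applies the generic form of
the technique to a plain sparse least-squares problem: take a gradient
step, then keep the k largest-magnitude entries. It is not the talk's
semi-supervised formulation and it is not distributed; the names
hard_threshold and iht, the step size, the iteration count and the toy
data are all assumptions made for this example.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def iht(A, y, k, step=1.0, iters=50):
    """Generic iterative hard thresholding (illustrative, not the talk's method).

    Approximately minimises ||y - A x||^2 subject to x having at most
    k non-zero entries. A is a list of rows; y is a list of targets.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(iters):
        # Residual r = y - A x
        r = [y[i] - sum(A[i][j] * x[j] for j in range(n_cols))
             for i in range(n_rows)]
        # Gradient step z = x + step * A^T r
        z = [x[j] + step * sum(A[i][j] * r[i] for i in range(n_rows))
             for j in range(n_cols)]
        # Projection onto the set of k-sparse vectors
        x = hard_threshold(z, k)
    return x


# Toy problem (hypothetical data): identity design matrix, two large
# entries and two small ones in y, sparsity budget k = 2.
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
y = [3.0, 0.1, -2.0, 0.05]
x_hat = iht(A, y, k=2)
print(x_hat)
```

On this toy input the two small entries are discarded and the two large
ones are kept. With a general matrix A, the step size typically has to be
chosen small enough relative to the scale of A for the iteration to
behave well.
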
We carry out extensive experiments on benchmark data sets
demonstrating that our approach yields superior classification results
as compared to state-of-the-art multi-label algorithms and other
random forest formulations. We then tackle bid phrase recommendation
problems with 90 million training points, 5 million dimensions and 10
million labels, which are well beyond the scope of current multi-label
algorithms. We show significant gains over NLP techniques on a large
test set of 5 million ads using multiple metrics.