When they're trained with machine learning, the information retrieval models underpinning search engines require reams of manually annotated data. That's because they must handle not only a wide range of queries, but also all manner of documents those queries might match. Fortunately, an approach ("Content-Based Weak Supervision for Ad-Hoc Re-Ranking") detailed by scientists at Amazon's Alexa division could pave the way for models that require less manual supervision. Such models, in turn, could grow training data sets from tens of thousands of entries to millions, leading to better-performing systems in the future.
As the team explains, AI-based retrieval algorithms are typically trained on a query and two documents: a "relevant" document that satisfies the user's search for information, and a related but non-relevant one that doesn't. Humans manually label the documents as relevant or non-relevant, and during training, the systems learn to score the relevant document higher than the non-relevant one, maximizing the gap between the two relevance scores.
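This pairwise objective can be sketched as a margin-based hinge loss: the loss is zero once the relevant document outscores the non-relevant one by a margin, and positive otherwise. This is a minimal illustration, not the paper's exact loss function, and the margin value of 1.0 is an assumption.

```python
def pairwise_hinge_loss(score_relevant, score_nonrelevant, margin=1.0):
    """Hinge loss for a (relevant, non-relevant) document pair.

    Returns 0.0 when the relevant document already outscores the
    non-relevant one by at least `margin`; otherwise returns the
    shortfall, which training would push toward zero.
    """
    return max(0.0, margin - (score_relevant - score_nonrelevant))
```

During training, a model minimizes this loss over many such pairs, which is how it "learns to maximize the difference" between the two relevance scores.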
By contrast, the researchers' method exploits the fact that large text collections, such as news articles and Wikipedia entries, already pair headlines and section titles with the relevant text they introduce. In other words, the researchers hypothesize that titles can stand in for search queries during training.
The team first collected millions of document-title pairs from the New York Times' online repository and from Wikipedia. From each pair, they used the title as a query and the associated text as the relevant document (along with texts related to the query but less relevant than the associated text) to train machine learning models. They then tapped a corpus from AOL consisting of customer queries and search results to establish a baseline, applying an algorithm that identified relevant and non-relevant texts for each query. Lastly, they supplemented the AOL data set with about 25,000 hand-annotated samples and algorithmically selected samples from the test data.
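The construction of training examples from title-document pairs can be sketched as follows. This is a simplified illustration with hypothetical field names ("title", "body"); for brevity it samples non-relevant documents at random from other articles, whereas the paper selects related but less relevant texts.

```python
import random

def make_training_triples(articles, num_negatives=1, seed=0):
    """Turn (title, body) pairs into (query, relevant, non-relevant) triples.

    Each article's title stands in for a user query and its body serves
    as the relevant document; bodies drawn from other articles serve as
    non-relevant documents. Random sampling here is a simplification.
    """
    rng = random.Random(seed)
    triples = []
    for i, article in enumerate(articles):
        others = [a["body"] for j, a in enumerate(articles) if j != i]
        for negative in rng.sample(others, min(num_negatives, len(others))):
            triples.append((article["title"], article["body"], negative))
    return triples
```

Because the pairs come "for free" from existing corpora, this step is what lets the training set scale to millions of examples without manual annotation.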
To suss out the efficacy of their approach, the team trained AI systems separately on each of the four data sets: New York Times (NYT), Wikipedia, AOL, and the hand-annotated set. They scored the cumulative relevance of the top 20 results in each case using a metric called normalized discounted cumulative gain (nDCG@20). They report that among the baselines, the combination of the AOL data set and an AI architecture called a position-aware convolutional recurrent relevance network (PACRR) yielded the best results. On the same architecture, the NYT data set delivered a 12% increase in nDCG, and when the system was trained on examples that were difficult to distinguish from data in the target domain, the score improved by 35%.
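For readers unfamiliar with the metric, nDCG rewards rankings that place highly relevant results near the top, discounting relevance logarithmically by rank and normalizing against the ideal ordering. A minimal sketch of the linear-gain variant (other variants use an exponential gain) over the top 20 results:

```python
import math

def dcg(relevances, k=20):
    # Discounted cumulative gain: relevance at rank r (1-indexed)
    # is discounted by log2(r + 1), so top ranks count the most.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=20):
    # Normalize by the DCG of the ideal (descending) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that buries its only relevant result at rank 3 scores half as much as one that surfaces it first, which is why nDCG is a common yardstick for re-ranking systems.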
“By using our approach, one can effectively train neural ranking models on new domains without behavioral data and with only limited in-domain data,” wrote the coauthors.