Learning to Select Actions for Resource-bounded Information Extraction

Pallika Kanani, Andrew McCallum

20 October 2011

Given a database with missing or uncertain information, our goal is to extract specific information from a large corpus such as the Web using limited resources. We cast the information-gathering task as a sequence of alternative, resource-consuming actions and propose a new algorithm that learns to select the best action to perform at each time step. The function that selects these actions is trained using an online, error-driven algorithm called SampleRank. We present a system that finds the faculty directory pages of top Computer Science departments in the U.S. and show that the learning-based approach accomplishes this task very efficiently under a limited action budget, reaching approximately 90% of the overall F1 while using fewer than 2% of the actions. Applied to the task of filling in missing values in a large-scale database with millions of rows and many columns, the system can obtain just the required information from the Web very efficiently.
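To make the idea in the abstract concrete, the following is a minimal sketch of budgeted action selection with an error-driven, pairwise weight update in the spirit of SampleRank. It is not the authors' implementation. The action names, feature vectors, gain values, and all function names (`score`, `pairwise_update`, `train`, `select_actions`) are hypothetical and chosen for illustration only. The sketch also ranks a fixed candidate set once, whereas the abstract describes choosing an action at each time step.

```python
def score(weights, features):
    """Linear score of one candidate action (assumed model form)."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_update(weights, feats_a, gain_a, feats_b, gain_b, lr=1.0):
    """Error-driven update: if the model fails to rank the truly better
    action above the worse one, move the weights toward the better one."""
    if gain_a == gain_b:
        return weights  # the objective has no preference, nothing to learn
    if gain_a > gain_b:
        better, worse = feats_a, feats_b
    else:
        better, worse = feats_b, feats_a
    if score(weights, better) <= score(weights, worse):
        weights = [w + lr * (b - c) for w, b, c in zip(weights, better, worse)]
    return weights


def train(candidates, epochs=5):
    """Sweep over all pairs of candidate actions for a few epochs."""
    weights = [0.0] * len(candidates[0]["features"])
    for _ in range(epochs):
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                weights = pairwise_update(
                    weights, a["features"], a["gain"], b["features"], b["gain"]
                )
    return weights


def select_actions(weights, candidates, budget):
    """Pick the highest-scoring actions until the action budget is spent."""
    ranked = sorted(
        candidates, key=lambda c: score(weights, c["features"]), reverse=True
    )
    return [c["name"] for c in ranked[:budget]]


# Hypothetical toy data: "gain" stands in for the improvement in the
# objective (for example F1) that performing the action would yield.
candidates = [
    {"name": "query_dept_directory", "features": [1.0, 0.0], "gain": 1.0},
    {"name": "download_random_page", "features": [0.0, 1.0], "gain": 0.0},
    {"name": "extract_from_directory", "features": [1.0, 1.0], "gain": 0.5},
]

weights = train(candidates)
chosen = select_actions(weights, candidates, budget=2)
```

With a budget of two actions, the learned weights rank the two useful actions above the unhelpful one, which is the behavior a resource-bounded selector is meant to show on a much larger candidate set.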


Venue: N/A

External Link: http://people.cs.umass.edu/~pallika/publications/Kanani2011TechReport.pdf