Resource-bounded Information Extraction: Acquiring Missing Feature Values On Demand

Pallika Kanani, Andrew McCallum, Shaohan Hu

24 June 2010

We present a general framework for the task of extracting specific information "on demand" from a large corpus such as the Web under resource constraints. Given a database with missing or uncertain information, the proposed system automatically formulates queries, issues them to a search interface, selects a subset of the documents, extracts the required information from them, and fills the missing values in the original database. We also exploit inherent dependency within the data to obtain useful information with fewer computational resources. We build such a system in the citation database domain that extracts the missing publication years using limited resources from the Web. We discuss a probabilistic approach for this task and present first results. The main contribution of this paper is to propose a general, comprehensive architecture for designing a system adaptable to different domains.
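The pipeline described in the abstract (formulate queries, issue them, select documents, extract, fill) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_queries` and `max_docs_per_query` budget parameters, the regex-based year extractor, and the toy in-memory corpus are all assumptions made for illustration, and a simple majority vote stands in for the probabilistic approach the paper discusses.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Optional

# Matches four-digit years from 1950 to 2049 (an arbitrary range for this sketch).
YEAR_RE = re.compile(r"\b(19[5-9]\d|20[0-4]\d)\b")


def formulate_queries(record: Dict[str, Optional[str]]) -> List[str]:
    """Build candidate queries from the known fields of a record (hypothetical scheme)."""
    title = record["title"]
    queries = [f'"{title}"']
    if record.get("author"):
        queries.append(f'"{title}" {record["author"]}')
    return queries


def extract_years(document: str) -> List[str]:
    """Return every year-like token found in the document text."""
    return YEAR_RE.findall(document)


def fill_missing_year(
    record: Dict[str, Optional[str]],
    search: Callable[[str], List[str]],
    max_queries: int,
    max_docs_per_query: int,
) -> Optional[str]:
    """Spend at most max_queries queries and max_docs_per_query documents per
    query, then return the most frequently extracted year (or None)."""
    votes: Counter = Counter()
    for query in formulate_queries(record)[:max_queries]:
        for doc in search(query)[:max_docs_per_query]:
            votes.update(extract_years(doc))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def toy_search(query: str) -> List[str]:
    """Stand-in for a Web search interface; the corpus is made-up toy data."""
    corpus = [
        "A Toy Paper. In Proceedings, 1999.",
        "Smith: A Toy Paper (1999)",
        "A Toy Paper, tech report 1998",
        "Unrelated page, 2003",
    ]
    quoted_title = query.split('"')[1].lower()
    return [d for d in corpus if quoted_title in d.lower()]


record = {"title": "A Toy Paper", "author": "Smith", "year": None}
record["year"] = fill_missing_year(record, toy_search, max_queries=2, max_docs_per_query=2)
print(record["year"])
```

The two budget parameters make the resource bound explicit: lowering them trades accuracy for fewer search calls and fewer documents processed, which is the trade-off the framework is designed to manage.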


Venue: N/A

External Link: http://people.cs.umass.edu/~pallika/publications/pakdd2010kanani.pdf