To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.