To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.