Modeling Query-Based Access to Text Databases

Searchable text databases abound on the web.  Applications that require access to such databases often resort to querying to extract relevant documents because of two main reasons.  First, some text databases on the web are not “crawlable”, and hence the only way to retrieve their documents is via querying.  Second, applications often require only a small fraction of a database’s contents, so retrieving relevant documents via querying is an attractive choice from an efficiency viewpoint, even for crawlable databases.  Often an application’s query-based strategy starts with a small number of user-provided queries.  Then, new queries are extracted-in an application-dependent way- from the documents in the initial query results, and the process iterates.  The success of this common type of strategy relies on retrieved documents “contributing” new queries.  If new documents fail to produce new queries, then the process might stall before all relevant documents are retrieved.  In this paper, we develop a graph-based “reachability” metric that allows to characterize when an application’s query-based strategy will successfully “reach” all documents that the application needs.  We complement our metric with an efficient sampling-based technique that accurately estimates the reachability associated with a text database and an application’s query-based strategy.  We report preliminary experiments backing the usefulness of our metric and the accuracy of the associated estimation technique over real text databases and for two applications.