Building Query Optimizers for Information Extraction: The SQoUT Project

Text documents often embed data that is structured in nature. This structured data is increasingly exposed by information extraction systems, which generate structured relations from documents and thereby make it possible to process expressive, structured queries over text databases. This paper discusses our SQoUT project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps admits several alternatives, and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user-specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
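To make the selection step concrete, the following is a minimal sketch of a cost-based choice among execution strategies. It is not the SQoUT implementation: the `Plan` class, the `choose_plan` function, the option names, and all time and quality estimates are hypothetical and serve only to illustrate the idea. The sketch assumes that each candidate strategy already comes with an estimated execution time and an estimated output quality, and that the user states an upper bound on time and a lower bound on quality.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Plan:
    """One hypothetical execution strategy: a choice for each basic step."""
    retrieval: str      # how documents are retrieved (illustrative label)
    extractor: str      # which extraction system is run (illustrative label)
    join: str           # how extracted relations are joined (illustrative label)
    est_time: float     # estimated execution time, arbitrary units (assumed given)
    est_quality: float  # estimated output quality in [0, 1] (assumed given)


def choose_plan(plans: Iterable[Plan],
                max_time: float,
                min_quality: float) -> Optional[Plan]:
    """Return the fastest plan that meets both user requirements, or None."""
    feasible = [p for p in plans
                if p.est_time <= max_time and p.est_quality >= min_quality]
    if not feasible:
        return None
    # Prefer lower estimated time; break ties by higher estimated quality.
    return min(feasible, key=lambda p: (p.est_time, -p.est_quality))


# Made-up candidates and estimates, for illustration only.
candidates = [
    Plan("scan", "extractor_a", "hash_join", est_time=100.0, est_quality=0.90),
    Plan("keyword_filter", "extractor_a", "hash_join", est_time=20.0, est_quality=0.60),
    Plan("keyword_filter", "extractor_b", "nested_loop", est_time=35.0, est_quality=0.75),
]

best = choose_plan(candidates, max_time=50.0, min_quality=0.7)
print(best)
```

In this sketch the requirements act as hard constraints and time is minimized among the feasible plans. Other formulations, such as maximizing quality under a time budget, would fit the same structure by changing the sort key.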