Text Joins for Data Cleansing and Integration
The SQL scripts described in [2,3]
SDARTS is a protocol for metasearching over document collections.
You may consider using SDARTS if:
SDARTS was developed as part of PERSIVAL (an NSF Digital Library Initiative–Phase 2 project) at the Computer Science Department of Columbia University. SDARTS is a hybrid of two earlier protocols, STARTS and SDLIP: it is essentially an instantiation of the SDLIP protocol with a richer set of metadata, which can be used to build sophisticated metasearchers. SDARTS makes a wide variety of collections with heterogeneous interfaces accessible under one uniform interface. The SDARTS toolkit provides ready-to-use, configurable wrappers, which can be used directly for wrapping locally available text and XML collections and for wrapping web-accessible databases.
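The idea of wrapping heterogeneous collections behind one uniform interface can be illustrated with a minimal sketch. This is not the SDARTS API or its wrapper toolkit: the names `Hit`, `CollectionWrapper`, `InMemoryTextWrapper`, and `metasearch`, as well as the term-count scoring, are hypothetical and chosen only to show the general shape of a metasearcher that queries several wrapped collections and merges their results.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Hit:
    """One search result, tagged with the collection it came from."""
    source: str
    doc_id: str
    score: float


class CollectionWrapper(Protocol):
    """Hypothetical uniform interface that every wrapped collection exposes."""
    name: str

    def search(self, query: str) -> list[Hit]: ...


class InMemoryTextWrapper:
    """Hypothetical wrapper around a local collection held as {doc_id: text}."""

    def __init__(self, name: str, docs: dict[str, str]) -> None:
        self.name = name
        self.docs = docs

    def search(self, query: str) -> list[Hit]:
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            # Assumed scoring: total number of query-term occurrences.
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append(Hit(self.name, doc_id, float(score)))
        return hits


def metasearch(wrappers: list[CollectionWrapper], query: str) -> list[Hit]:
    """Send the query to every wrapper and merge results by descending score."""
    hits = [hit for wrapper in wrappers for hit in wrapper.search(query)]
    return sorted(hits, key=lambda hit: hit.score, reverse=True)
```

Because the metasearcher depends only on the shared `search` method, a wrapper for an XML collection or a web-accessible database could be added without changing `metasearch`. A real metasearcher would also need the collection metadata that SDARTS exposes in order to select collections and to merge scores that are not directly comparable.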
Unstructured text data is ubiquitous and, not surprisingly, many users and applications rely on textual data for a variety of tasks. The current paradigm for handling text data, popularized by search engines, is essentially a keyword “lookup” operation, followed by a sophisticated ranking of the results. There is very limited support for “structured” queries, no support for queries that need to combine information from multiple sources, and no support for queries that need to aggregate results from multiple web pages. Furthermore, users have to go through a large number of returned documents to identify and construct the required answer. In recent years, research in information extraction has shown how to retrieve structured information from unstructured textual data. Such systems allow users to ask complicated questions over unstructured text and get concrete answers, so that users spend less time searching for information and more time analyzing and understanding the results.
You might have bought something on eBay and left a short feedback posting summarizing your interaction with the seller, such as “Lightning fast delivery! Sloppy packaging, though.” Similarly, you might have visited Amazon and written a review of the latest digital camera that you bought, such as “The picture quality is fantastic, but the shutter speed lags badly.” While reading an online review, you may also have come across identity-descriptive social information that reviewers disclose about themselves, such as their ‘Real name’, ‘Geographical location’, ‘Hobbies’, ‘Nickname’, etc. Or, while searching for a used product in electronic second-hand markets such as those hosted by Amazon, you might have come across a description posted by the seller, such as “Brand new device with original packaging! Factory authorized dealer! Full manufacturer’s warranty.”
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Hence traditional search engines do not index this valuable information. One way to facilitate access to “hidden-web” databases is through commercial Yahoo!-like directories, which organize these databases manually into categories that users can browse.