Research Projects

  • Data Cleaning

    Data Cleaning

    2001 - 2007

    Text Joins for Data Cleansing and Integration
    SQL Scripts

    The SQL scripts described in [2,3]

  • Data Cleaning

    SDARTS: A Protocol and Toolkit for Metasearching

    2001 - 2004

    SDARTS is a protocol for metasearching over document collections.

    You may consider using SDARTS if:

    • You want to search one or more text or XML collections that you have from a single search interface.
    • You want to search remote document collections that export their metadata under the Open Archives protocol.
    • You want to search multiple web-based document collections from a single search interface.

    SDARTS was developed as part of PERSIVAL (an NSF Digital Library Initiative–Phase 2 project) at the Computer Science Department of Columbia University. SDARTS is a hybrid of two previously existing protocols, STARTS and SDLIP. It is essentially an instantiation of the SDLIP protocol with a richer set of metadata, which can be used effectively for building sophisticated metasearchers. SDARTS makes a wide variety of collections with heterogeneous interfaces accessible under one uniform interface. The SDARTS toolkit provides ready-to-use, configurable wrappers. These can be used directly to wrap locally available text and XML collections, as well as web-accessible databases.
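
    The wrapper idea described above can be illustrated with a minimal Python sketch. This is not the SDARTS toolkit's API: the names Hit, CollectionWrapper, LocalTextWrapper, and metasearch are hypothetical, and the term-frequency scoring is a stand-in chosen only to keep the example short. It shows collections being queried through one shared interface and their results merged.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol


@dataclass
class Hit:
    collection: str  # name of the collection that produced the hit
    doc_id: str
    score: float


class CollectionWrapper(Protocol):
    """Hypothetical uniform interface that every wrapped collection exposes."""

    name: str

    def search(self, query: str) -> List[Hit]: ...


class LocalTextWrapper:
    """Wraps an in-memory text collection (an assumption for illustration).

    Scores each document by how often the query terms occur in it.
    """

    def __init__(self, name: str, docs: Dict[str, str]) -> None:
        self.name = name
        self.docs = docs

    def search(self, query: str) -> List[Hit]:
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.docs.items():
            words = text.lower().split()
            score = float(sum(words.count(term) for term in terms))
            if score > 0:
                hits.append(Hit(self.name, doc_id, score))
        return hits


def metasearch(wrappers: Iterable[CollectionWrapper], query: str) -> List[Hit]:
    """Send the query to every wrapper and merge hits, highest score first."""
    hits = [hit for wrapper in wrappers for hit in wrapper.search(query)]
    return sorted(hits, key=lambda hit: hit.score, reverse=True)


papers = LocalTextWrapper("papers", {
    "p1": "text joins for data cleaning",
    "p2": "metasearching protocols",
})
notes = LocalTextWrapper("notes", {"n1": "data data cleaning notes"})

for hit in metasearch([papers, notes], "data cleaning"):
    print(hit.collection, hit.doc_id, hit.score)
```

    Merging raw scores from different collections, as done here, is a simplification: scores from independent sources are generally not comparable, which is one reason a metasearcher benefits from richer per-collection metadata.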

  • Data Cleaning

    SQoUT – Structured Querying over Unstructured Text

    2006 - 2010

    Unstructured text data is ubiquitous and, not surprisingly, many users and applications rely on textual data for a variety of tasks. The current paradigm for handling text data, popularized by search engines, is essentially a keyword “lookup” operation, followed by a sophisticated ranking of the results. There is very limited support for “structured” queries, no support for queries that need to combine information from multiple sources, and no support for queries that need to aggregate results from multiple web pages. Furthermore, users have to go through a large number of returned documents to identify and construct the required answer. In recent years, research in information extraction has shown how to retrieve structured information from unstructured textual data. Such systems allow users to ask complicated questions over unstructured text and get concrete answers, thus enabling users to spend less time searching for information and more time analyzing and understanding the results.

  • Data Cleaning

    The EconoMining Project

    2006 - present

    You might have bought something on eBay and left a short feedback posting summarizing your interaction with the seller, such as “Lightning fast delivery! Sloppy packaging, though.” Similarly, you might have visited Amazon and written a review of the latest digital camera that you bought, such as “The picture quality is fantastic, but the shutter speed lags badly.” While reading an online review, you may also have come across identity-descriptive social information that reviewers disclose about themselves, such as their ‘Real name’, ‘Geographical location’, ‘Hobbies’, ‘Nick name’, etc. Or, while searching for a used product in electronic second-hand markets such as those hosted by Amazon, you might have come across a description posted by the seller, such as “Brand new device with original packaging! Factory authorized dealer! Full manufacturer’s warranty.”

  • Data Cleaning

    QProber: Classifying and Searching “Hidden-Web” Text Databases

    2008 - present

    Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Hence, traditional search engines do not index this valuable information. One way to facilitate access to “hidden-web” databases is through commercial Yahoo!-like directories, which organize these databases manually into categories that users can browse.