A Framework for Quality Assurance in Crowdsourcing

Working paper
  Jing Wang, Panagiotis G. Ipeirotis, and Foster Provost

 Venue: NYU-CBA Working Paper CBA-13-06

Dec 2015

The emergence of online paid crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), presents huge opportunities to distribute tasks to human workers around the world, on-demand and at scale. In such settings, online workers can come and complete tasks posted by a company, and work as much or as little as they wish. Given this absolute freedom of choice, crowdsourcing eliminates the overhead of the hiring (and dismissal) process. However, this flexibility introduces a different set of inefficiencies: verifying the quality of every submitted piece of work is an expensive operation, which often requires the same level of effort as performing the task itself. Many research challenges emerge in this paid-crowdsourcing setting. How can we ensure that the submitted work is accurate? How can we estimate the quality of the workers, and the quality of the submitted results? How should we pay online workers whose work is of imperfect quality? We present a comprehensive scheme for managing the quality of crowdsourcing processes. First, we present an algorithm for estimating the quality of the participating workers and, by extension, of the generated data. We show how we can separate systematic worker biases from unrecoverable errors and generate an unbiased “worker quality” measurement that can be used to objectively rank workers according to their performance. Next, we describe a pricing scheme that identifies the fair payment level for a worker, adjusting the payment level according to the information contributed by each worker. Our pricing policy, which pays workers based on their expected quality, reservation wage, and expected lifetime, not only estimates the payment level but also accommodates measurement uncertainties and allows the workers to receive a fair wage, even in the presence of temporarily incorrect estimates of quality. Our experimental results demonstrate that the proposed pricing strategy performs better than the commonly adopted uniform-pricing strategy. We conclude the paper by describing strategies that build on our quality-control and pricing framework to compose crowdsourced tasks of increasing complexity, while still maintaining tight quality control of the process, even if we allow participants of unknown quality to join the process.
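
To make the worker-quality estimation concrete, here is a minimal Python sketch (my own simplification with toy data and a majority-vote consensus instead of the full EM procedure; class names and numbers are made up): each worker gets a confusion matrix, and the bias-corrected quality is the expected cost of the “soft” label that the worker's answers induce, so a systematic label flipper scores nearly as well as an accurate worker.

```python
# A minimal sketch (not the paper's implementation): estimate each worker's
# confusion matrix against a consensus answer, then score the worker by the
# expected cost of the "soft label" that their answers induce.  A systematic
# flipper has recoverable bias and scores (almost) as well as a careful worker.
from collections import Counter, defaultdict

import numpy as np

CLASSES = ["spam", "not_spam"]          # hypothetical binary labeling task

def consensus(labels_per_item):
    """Majority vote per item; a stand-in for an EM-style estimate."""
    return {item: Counter(lbls).most_common(1)[0][0]
            for item, lbls in labels_per_item.items()}

def confusion_matrix(worker_labels, truth, smoothing=1.0):
    """P(worker answers j | true class i), with Laplace smoothing."""
    k = len(CLASSES)
    m = np.full((k, k), smoothing)
    for item, assigned in worker_labels.items():
        m[CLASSES.index(truth[item]), CLASSES.index(assigned)] += 1
    return m / m.sum(axis=1, keepdims=True)

def bias_corrected_quality(conf, prior):
    """Expected misclassification cost of the posterior ("soft") label induced
    by the worker's answer; 0 is perfect, and systematic flips are forgiven."""
    k = len(CLASSES)
    cost = 1 - np.eye(k)                       # 0/1 misclassification costs
    expected = 0.0
    for j in range(k):                         # each possible worker answer
        posterior = prior * conf[:, j]
        p_answer = posterior.sum()
        posterior = posterior / p_answer
        expected += p_answer * (cost.T @ posterior).min()
    return expected

# labels[worker][item] = assigned class (toy data); w2 flips labels consistently
labels = {
    "w1": {"a": "spam", "b": "not_spam", "c": "spam"},
    "w2": {"a": "not_spam", "b": "spam", "c": "not_spam"},
    "w3": {"a": "spam", "b": "not_spam", "c": "spam"},
}
per_item = defaultdict(list)
for assigned in labels.values():
    for item, lbl in assigned.items():
        per_item[item].append(lbl)
truth = consensus(per_item)
prior = np.full(len(CLASSES), 1.0 / len(CLASSES))
for w, worker_labels in labels.items():
    q = bias_corrected_quality(confusion_matrix(worker_labels, truth), prior)
    print(f"{w}: expected cost after bias correction = {q:.3f}")
```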


Link to publication page

Surviving Social Media Overload: Predicting Consumer Footprints on Product Search Engines

Working paper
  Beibei Li, Anindya Ghose, and Panagiotis Ipeirotis
Dec 2015

On Hiring Decisions in Online Labor Markets

Working paper
  Marios Kokkodis, Panagiotis Papadimitriou, and Panagiotis Ipeirotis
Dec 2015

Reputation Transferability in Online Labor Markets

Journal
  Marios Kokkodis and Panagiotis Ipeirotis

 Venue: Management Science, Forthcoming

Dec 2015 | Refereed

Online workplaces such as oDesk, Amazon Mechanical Turk, and TaskRabbit have been growing in importance over the last few years. In such markets, employers post tasks on which remote contractors work and deliver the product of their work online. As in most online marketplaces, reputation mechanisms play a very important role in facilitating transactions, since they instill trust and are often predictive of the employer’s future satisfaction. However, labor markets are usually highly heterogeneous in terms of available task categories; in such scenarios, past performance may not be an accurate signal of future performance. To account for this natural heterogeneity, in this work, we build models that predict the performance of a worker based on prior, category-specific feedback. Our models assume that each worker has a category-specific quality, which is latent and not directly observable; what is observable, though, is the set of feedback ratings of the worker and of other contractors with similar work histories. Based on this information, we provide a series of models of increasing complexity that successfully estimate the worker’s quality. We start by building a binomial and a multinomial model under the implicit assumption that the latent qualities of the workers are static. Next, we remove this assumption, and we build linear dynamic systems that capture the evolution of these latent qualities over time. We evaluate our models on a large corpus of over a million transactions (completed tasks) from oDesk, an online labor market with hundreds of millions of dollars in transaction volume. Our results show an improved accuracy of up to 25% compared to feedback baselines, and significant improvement over the commonly-used collaborative filtering approach. Our study clearly illustrates that reputation systems should present different reputation scores, depending on the context in which the worker has been previously evaluated and the job for which the worker is applying.
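
As a rough illustration of the static binomial model (my own simplification with toy counts, not the paper's estimator), the sketch below treats each (worker, category) quality as a latent success probability with a Beta prior centered on the category-wide rate, and predicts future performance by the posterior mean:

```python
# A minimal sketch of a static binomial model (my own simplification, not the
# paper's estimator): each (worker, category) pair has a latent success rate
# with a Beta prior centred on the category-wide rate; we predict performance
# on a new job in that category by the posterior mean.
from collections import defaultdict

# feedback[(worker, category)] = (number of "good" outcomes, total jobs); toy data
feedback = {
    ("w1", "web_dev"): (18, 20),
    ("w1", "writing"): (1, 4),
    ("w2", "writing"): (25, 30),
    ("w3", "web_dev"): (10, 20),
}

def category_rates(feedback):
    totals = defaultdict(lambda: [0, 0])
    for (_, cat), (good, n) in feedback.items():
        totals[cat][0] += good
        totals[cat][1] += n
    return {cat: good / n for cat, (good, n) in totals.items()}

def predicted_quality(worker, category, feedback, strength=5.0):
    """Posterior mean under a Beta prior of weight `strength` centred on the
    category-wide rate, updated with the worker's own history in the category."""
    rate = category_rates(feedback).get(category, 0.5)
    good, n = feedback.get((worker, category), (0, 0))
    return (strength * rate + good) / (strength + n)

print(predicted_quality("w1", "web_dev", feedback))   # strong history dominates
print(predicted_quality("w1", "writing", feedback))   # sparse history shrinks to prior
print(predicted_quality("w2", "web_dev", feedback))   # no history: category prior
```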


Link to publication page

Building a Bird Recognition App and Large Scale Dataset With Citizen Scientists: The Fine Print in Fine-Grained Dataset Collection

Conference
  Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie

 Venue: The Twenty-Eighth IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015)

Jun 2015 | Refereed

We introduce tools and methodologies to collect high quality, large scale fine-grained computer vision datasets using citizen scientists – crowd annotators who are passionate and knowledgeable about specific domains such as birds or airplanes. We worked with citizen scientists and domain experts to collect NABirds, a new high quality dataset containing 48,562 images of North American birds with 555 categories, part annotations, and bounding boxes. We find that citizen scientists are significantly more accurate than Mechanical Turkers at zero cost. We worked with bird experts to measure the quality of popular datasets like CUB-200-2011 and ImageNet and found class label error rates of at least 4%. Nevertheless, we found that learning algorithms are surprisingly robust to annotation errors and this level of training data corruption can lead to an acceptably small increase in test error if the training set has sufficient size. At the same time, we found that an expert-curated high quality test set like NABirds is necessary to accurately measure the performance of fine-grained computer vision systems. We used NABirds to train a publicly available bird recognition service deployed on the website of the Cornell Lab of Ornithology.


Link to publication page
pdf

The Global Opportunity in Online Outsourcing

Misc Presentation
  Siou Chew Kuek, Cecilia Paradi-Guilford, Toks Fayomi, Saori Imaizumi, Panos Ipeirotis, Patricia Pina, and Manpreet Singh

 Venue: The World Bank

Jun 2015 | Unrefereed

A System for Scalable and Reliable Technical-Skill Testing in Online Labor Markets

Journal
  Maria Christoforaki and Panos Ipeirotis

 Venue: Computer Networks, Forthcoming

May 2015 | Refereed

The emergence of online labor platforms, online crowdsourcing sites, and even Massive Open Online Courses (MOOCs), has created an increasing need for reliably evaluating the skills of the participating users (e.g., “does a candidate know Java”) in a scalable way. Many platforms already allow job candidates to take online tests to assess their competence in a variety of technical topics. However, the existing approaches face many problems. First, cheating is very common in online testing without supervision, as the test questions often “leak” and become easily available online along with the answers. Second, technical skills, such as programming, require the tests to be frequently updated in order to reflect the current state-of-the-art. Third, there is very limited evaluation of the tests themselves, and of how effectively they measure the skill that the users are tested for. In this article, we present a platform that continuously generates test questions and evaluates their quality as predictors of the user skill level. Our platform leverages content that is already available on question answering sites such as Stack Overflow and re-purposes these questions to generate tests. This approach has some major benefits: we continuously generate new questions, decreasing the impact of cheating, and we also create questions that are closer to the real problems that the skill holder is expected to solve in real life. Our platform leverages Item Response Theory to evaluate the quality of the questions. We also use external signals about the quality of the workers to examine the external validity of the generated test questions: questions that have external validity also have a strong predictive ability for identifying early the workers that have the potential to succeed in the online job marketplaces. Our experimental evaluation shows that our system generates questions of comparable or higher quality compared to existing tests, at a cost of approximately $3 to $5 per question, which is lower than the cost of licensing questions from existing test banks, and an order of magnitude lower than the cost of producing such questions from scratch using experts.
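
A minimal sketch of the Item Response Theory component, using a generic two-parameter logistic (2PL) model rather than the paper's exact calibration procedure (item parameters below are invented): questions with near-zero discrimination carry almost no information about skill and are natural candidates for removal.

```python
# A minimal Item Response Theory sketch (a generic 2PL model, not the paper's
# exact calibration): each question has a difficulty b and a discrimination a;
# Fisher information tells us how much a question reveals about latent skill.
import numpy as np

def p_correct(theta, a, b):
    """2PL probability that a user with skill theta answers correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one item at skill level theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

def estimate_theta(responses, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood skill estimate over a grid; `responses` is a list of
    (correct, a, b) triples for the questions the user answered."""
    loglik = np.zeros_like(grid)
    for correct, a, b in responses:
        p = p_correct(grid, a, b)
        loglik += np.log(p if correct else 1 - p)
    return grid[np.argmax(loglik)]

# Toy calibrated items (a, b).  The last item has near-zero discrimination,
# adds almost no information, and would be a candidate for removal.
items = [(1.5, -0.5), (1.2, 0.8), (0.05, 0.0)]
for a, b in items:
    print(f"a={a}, b={b}: information at theta=0 is {item_information(0.0, a, b):.3f}")

answers = [(1, 1.5, -0.5), (0, 1.2, 0.8), (1, 0.05, 0.0)]
print("estimated skill:", round(estimate_theta(answers), 2))
```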


Link to publication page

Getting More for Less: Optimized Crowdsourcing with Dynamic Tasks and Goals

Conference
  Ari Kobren, Chun How Tan, Panos Ipeirotis, and Evgeniy Gabrilovich

 Venue: 24th International World Wide Web Conference (WWW 2015)

May 2015 | Refereed | 131/929 = 14.1% accepted

In crowdsourcing systems, the interests of contributing participants and system designers are often at odds. Participants seek to perform easy tasks, which offer them instant gratification, while system designers want them to complete more difficult tasks, which bring higher value to the crowdsourced application. Here we present techniques that optimize the crowdsourcing process for both parties involved, by jointly maximizing the worker longevity in the system and the true value that the system derives from worker participation.

We first present models that predict the “survival probability” of a worker at any given moment, that is, the probability that she will proceed to the next task offered by the system. We then leverage this survival model to dynamically decide what task to assign and what motivating goals to present to the worker. This allows us to jointly optimize for the short-term (getting a difficult task done) and for the long-term (keeping workers engaged for longer periods of time).

We show that dynamically assigning tasks significantly increases the value of a crowdsourcing system.
In an extensive empirical evaluation, we observed that our task allocation strategy increases the amount of information collected by up to 117.8%. We also explored the utility of motivating workers with goals. We demonstrate that setting specific, static goals can be highly detrimental to the long-term worker participation, as the completion of a goal (e.g., earning a badge) is also a common drop-off point for many workers. We show that setting the goals dynamically, in conjunction with judicious allocation of tasks, increases the amount of information collected by the crowdsourcing system by up to 249%, compared to the existing baselines that use fixed objectives.
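
A toy sketch of the allocation trade-off (assumed survival probabilities and task values, not the paper's fitted models): a policy that warms workers up with easy tasks before offering valuable hard ones can dominate both static extremes.

```python
# A toy sketch of the allocation idea (assumed survival probabilities, not the
# paper's fitted model): choose between "easy" and "hard" tasks by trading off
# immediate value against the chance that the worker keeps going.
def expected_value(policy, survive, value, horizon=50):
    """Expected total value of offering tasks according to `policy`, where
    policy(step) -> task name, survive[task] is the probability the worker
    continues after that task, and value[task] is the task's value."""
    total, p_alive = 0.0, 1.0
    for step in range(horizon):
        task = policy(step)
        total += p_alive * value[task]
        p_alive *= survive[task]
    return total

survive = {"easy": 0.95, "hard": 0.80}     # assumed survival probabilities
value = {"easy": 1.0, "hard": 3.0}         # assumed value of a completed task

policies = {
    "always easy": lambda step: "easy",
    "always hard": lambda step: "hard",
    "warm up, then hard": lambda step: "easy" if step < 5 else "hard",
}
for name, policy in policies.items():
    print(f"{name:>20}: {expected_value(policy, survive, value):.1f}")
```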


Link to publication page

The Dynamics of Micro-Task Crowdsourcing — The Case of Amazon Mechanical Turk

Conference
  Djellel Eddine Difallah, Michele Catasta, Gianluca Demartini, Panos Ipeirotis, and Philippe Cudré-Mauroux

 Venue: 24th International World Wide Web Conference (WWW 2015)

May 2015 | Refereed | 131/929 = 14.1% accepted

Micro-task crowdsourcing is rapidly gaining popularity among research communities and businesses as a means to leverage Human Computation in their daily operations. Unlike any other service, a crowdsourcing platform is in fact a marketplace subject to human factors that affect its performance, both in terms of speed and quality. Indeed, such factors shape the dynamics of the crowdsourcing market. For example, a known behavior of such markets is that increasing the reward of a set of tasks would lead to faster results. However, it is still unclear how different dimensions interact with each other: reward, task type, market competition, requester reputation, etc.

In this paper, we adopt a data-driven approach to (A) perform a long-term analysis of a popular micro-task crowdsourcing platform and understand the evolution of its main actors (workers, requesters, tasks, and platform). (B) We leverage the main findings of our five-year log analysis to propose features used in a predictive model aiming at determining the expected performance of any batch at a specific point in time. We show that the number of tasks left in a batch and the time at which the batch was posted are two key features of the prediction. (C) Finally, we conduct an analysis of the demand (new tasks posted by the requesters) and supply (number of tasks completed by the workforce) and show how they affect task prices on the marketplace.


Link to publication page

Beat the Machine: Challenging Humans to Find a Predictive Model’s “Unknown Unknowns”

Journal
  Joshua Attenberg, Panagiotis G. Ipeirotis, and Foster Provost

 Venue: ACM Journal on Data and Information Quality, Volume 6, Issue 1, March 2015

Feb 2015 | Refereed

We present techniques for gathering data that expose errors of automatic predictive models. In certain common settings, traditional methods for evaluating predictive models tend to miss rare-but-important errors—most importantly, cases for which the model is confident of its prediction (but wrong). In this paper, we present a system that, in a game-like setting, asks humans to identify cases that will cause the predictive model-based system to fail. Such techniques are valuable in discovering problematic cases that may not reveal themselves during the normal operation of the system, and may include cases that are rare but catastrophic. We describe the design of the system, including design iterations that did not quite work. In particular, the system incentivizes humans to provide examples that are difficult for the model to handle, by providing a reward proportional to the magnitude of the predictive model’s error. The humans are asked to “Beat the Machine” and find cases where the automatic model (“the Machine”) is wrong. Experiments show that the humans using Beat the Machine identify more errors than traditional techniques for discovering errors in predictive models, and indeed, they identify many more errors where the machine is (wrongly) confident it is correct. Further, the cases the humans identify seem to be not simply outliers, but coherent areas missed completely by the model. Beat the Machine identifies the “unknown unknowns.” Beat the Machine has been deployed at industrial scale by several companies. The main impact has been that firms are changing their perspective on and practice of evaluating predictive models.
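
The incentive rule at the core of the system can be sketched in a few lines (the payment constants below are invented, not the deployed values): a verified case pays in proportion to the magnitude of the model's error, so confidently wrong predictions earn the largest bonus.

```python
# A toy sketch of the incentive rule described above (payment constants are
# made up): a submitted case pays in proportion to the model's error, so
# confidently wrong predictions ("unknown unknowns") earn the largest bonus.
def reward(model_prob_positive, true_label, base=0.05, bonus=1.00):
    """model_prob_positive: the model's predicted P(positive); true_label: 0 or
    1, established by independent human verification of the submitted case."""
    error = abs(true_label - model_prob_positive)   # magnitude of the error
    return base + bonus * error

print(reward(0.98, 1))   # model confident and right  -> near-minimum payment
print(reward(0.55, 0))   # mildly wrong               -> modest bonus
print(reward(0.97, 0))   # confident but wrong        -> near-maximum bonus
```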


Link to publication page

Hiring Behavior Models for Online Labor Markets

Conference
  Marios Kokkodis, Panagiotis Papadimitriou, and Panagiotis G. Ipeirotis

 Venue: Eighth ACM International Conference on Web Search and Data Mining (WSDM 2015)

Feb 2015 | Refereed | 40/238 = 16.8% accepted

In an online labor marketplace, employers post jobs, receive freelancer applications, and make hiring decisions. These hiring decisions are based on the freelancer’s observed (e.g., education) and latent (e.g., ability) characteristics. Because of the heterogeneity that appears in the observed characteristics, and the existence of latent ones, identifying and hiring the best possible applicant is a very challenging task. In this work we study and model the employer’s hiring behavior. We assume that employers are utility maximizers and make rational decisions by hiring the best possible applicant at hand. Based on this premise, we propose a series of probabilistic models that estimate the hiring probability of each applicant. We train and test our models on more than 600,000 job applications obtained from oDesk.com, and we show evidence that the proposed models outperform currently in-use baselines. To get further insights, we conduct an econometric analysis and observe that the attributes that are strongly correlated with the hiring probability are whether or not the freelancer and the employer have previously worked together, the available information on the freelancer’s profile, the countries of the employer and the freelancer, and the skill set of the freelancer. Finally, we find that the faster a freelancer applies to an opening, the higher the probability of getting the job.
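
One way to make the utility-maximization premise concrete is a generic conditional-logit sketch (the feature weights below are hypothetical, not the paper's fitted coefficients): hiring probabilities are a softmax over the utility scores of the applicants to a single opening.

```python
# A minimal sketch of the utility-maximizing view (generic conditional logit
# with made-up weights, not the paper's fitted models): hiring probabilities
# are a softmax over the utility scores of the applicants to one opening.
import numpy as np

FEATURES = ["worked_with_employer_before", "profile_completeness",
            "same_country_as_employer", "hours_until_application"]
weights = np.array([2.5, 1.0, 0.4, -0.05])         # hypothetical coefficients

def hiring_probabilities(applicants):
    """applicants: one row per applicant, columns in the order of FEATURES."""
    utilities = applicants @ weights
    exp_u = np.exp(utilities - utilities.max())    # numerically stable softmax
    return exp_u / exp_u.sum()

applicants = np.array([
    [1, 0.9, 1, 2.0],     # past collaborator, complete profile, applied fast
    [0, 0.8, 0, 1.0],
    [0, 0.3, 1, 30.0],    # sparse profile, applied a day late
])
print(hiring_probabilities(applicants).round(3))
```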


Link to publication page

The Utility of Skills in Online Labor Markets

Conference
  Marios Kokkodis and Panagiotis Ipeirotis

 Venue: Thirty-Fifth International Conference on Information Systems (ICIS 2014)

Dec 2014 | Refereed

In this work, we define the utility of having a certain skill in an Online Labor Market (OLM), and we propose that this utility is strongly correlated with the level of expertise of a given worker. However, the actual level of expertise for a given skill and a given worker is both latent and dynamic. What is observable is a series of characteristics that are intuitively correlated with the level of expertise in a given skill. We propose to build a Hidden Markov Model (HMM), which estimates the latent and dynamic levels of expertise based on the observed characteristics. We build and evaluate our approaches on a unique transactional dataset from oDesk.com. Finally, we estimate the utility of a series of skills and discuss how certain skills (e.g., ‘editing’) provide, once mastered, a higher expected payoff than others (e.g., ‘microsoft excel’).
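
A small sketch of the HMM idea (toy transition and emission matrices, not the model fitted in the paper): expertise levels are the hidden states, and the forward algorithm tracks their posterior from noisy, observable job-outcome signals.

```python
# A small sketch of the HMM idea (toy transition and emission matrices, not
# the fitted model from the paper): expertise levels are hidden states and the
# forward algorithm tracks their posterior from observed job-outcome signals.
import numpy as np

STATES = ["novice", "intermediate", "expert"]
OBS = ["bad_feedback", "good_feedback"]

start = np.array([0.7, 0.25, 0.05])
transition = np.array([[0.85, 0.14, 0.01],     # expertise drifts upward slowly
                       [0.05, 0.85, 0.10],
                       [0.01, 0.09, 0.90]])
emission = np.array([[0.50, 0.50],             # P(observation | state)
                     [0.25, 0.75],
                     [0.10, 0.90]])

def forward_filter(observations):
    """Posterior P(current expertise level | observations so far)."""
    belief = start * emission[:, observations[0]]
    belief /= belief.sum()
    for obs in observations[1:]:
        belief = (transition.T @ belief) * emission[:, obs]
        belief /= belief.sum()
    return belief

history = [OBS.index(o) for o in
           ["bad_feedback", "good_feedback", "good_feedback", "good_feedback"]]
posterior = forward_filter(history)
print(dict(zip(STATES, posterior.round(3))))
```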


Link to publication page

STEP: A Scalable Testing and Evaluation Platform

Conference
  Maria Christoforaki and Panos Ipeirotis

 Venue: Second AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2014)

Nov 2014 | Refereed | 26/82 ~ 30% acceptance rate; Best Paper Award

The emergence of online crowdsourcing sites, online work platforms, and even Massive Open Online Courses (MOOCs), has created an increasing need for reliably evaluating the skills of the participating users in a scalable way. Many platforms already allow users to take online tests and verify their skills, but the existing approaches face many problems. First of all, cheating is very common in online testing without supervision, as the test questions often “leak” and become easily available online together with the answers. Second, technical skills, such as programming, require the tests to be frequently updated in order to reflect the current state-of-the-art. Third, there is very limited evaluation of the tests themselves, and how effectively they measure the skill that the users are tested for.

In this paper, we present a Scalable Testing and Evaluation Platform (STEP) that allows continuous generation and evaluation of test questions. STEP leverages content that is already available on Question Answering sites such as Stack Overflow and re-purposes these questions to generate tests. The system utilizes a crowdsourcing component for the editing of the questions, while it uses automated techniques for identifying promising QA threads that can be successfully re-purposed for testing. This continuous question generation decreases the impact of cheating and also creates questions that are closer to the real problems that the skill holder is expected to solve in real life. STEP also leverages Item Response Theory to evaluate the quality of the questions. In addition, we use external signals about the quality of the workers to identify the questions with the strongest predictive ability for distinguishing workers who have the potential to succeed in online job marketplaces; existing approaches, in contrast, use only internal consistency metrics to evaluate the questions. Finally, our system employs an automatic “leakage detector” that queries the Internet to identify leaked versions of our questions. We then mark these questions as “practice only,” effectively removing them from the pool of questions used for evaluation. Our experimental evaluation shows that our system generates questions of comparable or higher quality compared to existing tests, at a cost of approximately $3 to $5 per question, which is lower than the cost of licensing questions from existing test banks.


Link to publication page

Efficient Filtering on Hidden Document Streams

Conference
  Eduardo Ruiz, Vagelis Hristidis, and Panagiotis G. Ipeirotis

 Venue: 8th International AAAI Conference on Weblogs and Social Media (ICWSM 2014)

Jun 2014 | Refereed | 64/279 = 23% accepted

Many online services like Twitter and GNIP offer streaming programming interfaces that allow real-time information filtering based on keyword or other conditions. However, all these services specify strict access constraints, or charge a cost based on the usage. We refer to such streams as “hidden streams” to draw a parallel to the well-studied hidden Web, which similarly restricts access to the contents of a database through a querying interface. At the same time, the users’ interest is often captured by complex classification models that, implicitly or explicitly, specify hundreds of keyword-based rules, along with the rules’ accuracies.

In this paper, we study how to best utilize a constrained streaming access interface to maximize the number of retrieved relevant items, with respect to a classifier, expressed as a set of rules. We consider two problem variants. The static version assumes that the popularity of the keywords is known and constant across time. The dynamic version lifts this assumption, and can be viewed as an exploration-vs.-exploitation problem. We show that both problems are NP-hard, and propose exact and bounded approximation algorithms for various settings, including various access constraint types. We experimentally evaluate our algorithms on real Twitter data.
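
For intuition, the static variant resembles budgeted weighted coverage, so a greedy heuristic (my own simplification with toy numbers, not one of the paper's algorithms) already conveys the idea: repeatedly pick the keyword with the highest rule-weighted marginal gain per unit of stream volume until the access budget is exhausted.

```python
# A hedged sketch of the static variant as budgeted weighted coverage (my own
# simplification with toy numbers, not the paper's algorithms): greedily pick
# keywords by marginal expected relevant items per unit of stream volume.
def greedy_keywords(rules, volume, budget):
    """rules: {rule_id: (weight, set_of_keywords_that_trigger_it)};
    volume[kw]: expected stream volume consumed by subscribing to kw;
    budget: total volume the constrained API allows."""
    chosen, covered, used = [], set(), 0.0
    while True:
        best, best_gain = None, 0.0
        for kw, cost in volume.items():
            if kw in chosen or used + cost > budget:
                continue
            gain = sum(w for rid, (w, kws) in rules.items()
                       if rid not in covered and kw in kws) / cost
            if gain > best_gain:
                best, best_gain = kw, gain
        if best is None:
            return chosen
        chosen.append(best)
        used += volume[best]
        covered |= {rid for rid, (_, kws) in rules.items() if best in kws}

rules = {                       # weight ~ rule accuracy x expected matches
    "r1": (0.9, {"flu", "fever"}),
    "r2": (0.6, {"fever", "cough"}),
    "r3": (0.3, {"vaccine"}),
}
volume = {"flu": 10.0, "fever": 25.0, "cough": 40.0, "vaccine": 5.0}
print(greedy_keywords(rules, volume, budget=30.0))
```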


Link to publication page

Composing and Analyzing Crowdsourcing Workflows

Conference
  Panagiotis G. Ipeirotis, Greg Little, and Thomas W. Malone

 Venue: Collective Intelligence 2014

Jun 2014 | Unrefereed

When trying to crowdsource complex tasks, it is often better to break the big task into a set of smaller tasks and connect them into a production workflow. Such workflows often allow a set of workers to work together on a task, creating an “assembly line for knowledge work,” and allow the completion of tasks that are too complex or difficult for any participating individual to accomplish. Recent research has focused on analyzing and optimizing existing workflows, using Partially Observable Markov Decision Processes (POMDPs). The results indicate that using decision-theoretic approaches for dynamically adjusting the settings of crowdsourcing workflows and switching among multiple existing workflows can lead to substantial improvements in both the quality and the overall cost of executing a workflow.

In this work, we present a method for creating and analyzing arbitrary workflows, and generating predictions about the resulting quality and cost of a given workflow. Specifically, we focus on the following questions:
(1) Given a workflow for a task, can we predict the cost and time for completing the task, together with the resulting quality of the generated product?
(2) Given a workflow for a task, can we generate alternative workflows that are expected to be better than the current one?


Link to publication page

Quizz: Targeted Crowdsourcing with a Billion (Potential) Users

Conference
  Panos Ipeirotis and Evgeniy Gabrilovich

 Venue: 23rd International World Wide Web Conference (WWW 2014)

Apr 2014 | Refereed | 84/650=12.9% accepted

We describe Quizz, a gamified crowdsourcing system that simultaneously assesses the knowledge of users and acquires new knowledge from them. Quizz operates by asking users to complete short quizzes on specific topics; as a user answers the quiz questions, Quizz estimates the user’s competence. To acquire new knowledge, Quizz also incorporates questions for which we do not have a known answer; the answers given by competent users provide useful signals for selecting the correct answers for these questions. Quizz actively tries to identify knowledgeable users on the Internet by running advertising campaigns, effectively leveraging “for free” the targeting capabilities of existing, publicly available, ad placement services. Quizz quantifies the contributions of the users using information theory and sends feedback to the advertising system about each user. The feedback allows the ad targeting mechanism to further optimize ad placement.

Our experiments, which involve over ten thousand users, confirm that we can crowdsource knowledge curation for niche and specialized topics, as the advertising network can automatically identify users with the desired expertise and interest in the given topic. We present controlled experiments that examine the effect of various incentive mechanisms, highlighting the need for having short-term rewards as goals, which incentivize the users to contribute. Finally, our cost-quality analysis indicates that the cost of our approach is below that of hiring workers through paid-crowdsourcing platforms, while offering the additional advantage of giving access to billions of potential users all over the planet, and being able to reach users with specialized expertise that is not typically available through existing labor marketplaces.
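
A toy sketch of the information-theoretic accounting (my own simplification, not Quizz's production formula): an answer from a user with estimated competence q reduces the entropy of a question's answer distribution, and users or traffic sources can then be compared by the bits they contribute.

```python
# A toy sketch of the information-theoretic accounting (my own simplification,
# not Quizz's production formula): an answer from a user with estimated
# competence q reduces the entropy of a question's answer distribution.
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior_after_answer(prior, answered_idx, q):
    """Bayes update of the distribution over candidate answers when a user of
    competence q (probability of answering correctly; errors spread uniformly
    over the other options) picks option `answered_idx`."""
    k = len(prior)
    likelihood = [q if i == answered_idx else (1 - q) / (k - 1) for i in range(k)]
    post = [p * l for p, l in zip(prior, likelihood)]
    z = sum(post)
    return [p / z for p in post]

prior = [0.25, 0.25, 0.25, 0.25]          # 4 candidate answers, no prior signal
for q in (0.25, 0.6, 0.9):                # competence of the answering user
    post = posterior_after_answer(prior, 0, q)
    print(f"q={q}: information gain = {entropy(prior) - entropy(post):.3f} bits")
```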


Link to publication page

Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation

Conference
  Chun How Tan, Eugene Agichtein, Evgeniy Gabrilovich, and Panos Ipeirotis

 Venue: Seventh ACM Conference on Web Search and Data Mining (WSDM 2014)

Feb 2014 | Refereed | 64/356 = 18% accepted

The largest publicly available knowledge repositories, such as Wikipedia and Freebase, owe their existence and growth to volunteer contributors around the globe. While the majority of contributions are correct, errors can still creep in, due to editors’ carelessness, misunderstanding of the schema, malice, or even lack of accepted ground truth. If left undetected, inaccuracies often degrade the experience of users and the performance of applications that rely on these knowledge repositories. We present a new method, CQUAL, for automatically predicting the quality of contributions submitted to a knowledge base. Significantly expanding upon previous work, our method holistically exploits a variety of signals, including the user’s domains of expertise as reflected in her prior contribution history, and the historical accuracy rates of different types of facts. In a large-scale human evaluation, our method exhibits precision of 91% at 80% recall. Our model verifies whether a contribution is correct immediately after it is submitted, significantly alleviating the need for post-submission human reviewing.


Link to publication page

Examining the Impact of Ranking on Consumer Behavior and Search Engine Revenue

Journal
  Anindya Ghose, Panagiotis G. Ipeirotis, and Beibei Li

 Venue: Management Science

Mar 2014 | Refereed

In this paper, we study the effects of three different kinds of search engine rankings on consumer behavior and search engine revenues: direct ranking effect, interaction effect between ranking and product ratings, and personalized ranking effect. We combine a hierarchical Bayesian model, estimated on approximately one million online sessions from Travelocity, with randomized experiments using a real-world hotel search engine application. Our archival data analysis and randomized experiments are consistent in demonstrating the following: (1) a consumer utility-based ranking mechanism can lead to a significant increase in overall search engine revenue. (2) Significant interplay occurs between search engine ranking and product ratings. An inferior position on the search engine affects “higher-class” hotels more adversely. On the other hand, hotels with a lower customer rating are more likely to benefit from being placed on the top of the screen. These findings illustrate that product search engines could benefit from directly incorporating signals from social media into their ranking algorithms. (3) Our randomized experiments also reveal that an “active” (wherein users can interact with and customize the ranking algorithm) personalized ranking system leads to higher clicks but lower purchase propensities and lower search engine revenue compared to a “passive” (wherein users cannot interact with the ranking algorithm) personalized ranking system. This result suggests that providing more information during the decision-making process may lead to fewer consumer purchases because of information overload. Therefore, product search engines should not adopt personalized ranking systems by default. Overall, our study unravels the economic impact of ranking and its interaction with social media on product search engines.


Link to publication page

Quality-Based Pricing for Crowdsourced Workers

Working paper
  Jing Wang, Panagiotis G. Ipeirotis, and Foster Provost

 Venue: NYU-CBA Working Paper CBA-13-06

Jun 2013 | Unrefereed

The emergence of online paid crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), presents huge opportunities to distribute tasks to human workers around the world, on-demand and at scale. In such settings, online workers can come and complete tasks posted by a company, and work as much or as little as they wish. Given this absolute freedom of choice, crowdsourcing eliminates the overhead of the hiring (and dismissal) process. However, this flexibility introduces a different set of inefficiencies: verifying the quality of every submitted piece of work is an expensive operation, which often requires the same level of effort as performing the task itself. Many research challenges emerge in this paid-crowdsourcing setting. How can we ensure that the submitted work is accurate? How can we estimate the quality of the workers, and the quality of the submitted results? How should we pay online workers whose work is of imperfect quality? We present a comprehensive scheme for managing the quality of crowdsourcing processes. First, we present an algorithm for estimating the quality of the participating workers and, by extension, of the generated data. We show how we can separate systematic worker biases from unrecoverable errors and generate an unbiased “worker quality” measurement that can be used to objectively rank workers according to their performance. Next, we describe a pricing scheme that identifies the fair payment level for a worker, adjusting the payment level according to the information contributed by each worker. Our pricing policy, which pays workers based on their expected quality, reservation wage, and expected lifetime, not only estimates the payment level but also accommodates measurement uncertainties and allows the workers to receive a fair wage, even in the presence of temporarily incorrect estimates of quality. Our experimental results demonstrate that the proposed pricing strategy performs better than the commonly adopted uniform-pricing strategy. We conclude the paper by describing strategies that build on our quality-control and pricing framework to compose crowdsourced tasks of increasing complexity, while still maintaining tight quality control of the process, even if we allow participants of unknown quality to join the process.
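
One simplified reading of the quality-based pricing idea is sketched below (all constants are assumptions and this is not the paper's policy): keep the cost per unit of quality-adjusted work roughly constant across workers, with a floor at the reservation wage.

```python
# A toy reading of quality-based pricing (all constants are assumed, and this
# is a simplification rather than the paper's policy): keep the cost per unit
# of quality-adjusted work constant across workers, and never pay an active
# worker below their reservation wage.
def payment_per_task(expected_quality, reference_quality=0.95,
                     reference_payment=0.10, reservation_wage_per_task=0.02):
    """expected_quality in (0, 1]: the bias-corrected quality estimate for the
    worker; the reference worker defines the market rate for near-perfect work."""
    quality_adjusted = reference_payment * expected_quality / reference_quality
    return max(quality_adjusted, reservation_wage_per_task)

for q in (0.95, 0.70, 0.40, 0.10):
    print(f"quality {q:.2f} -> ${payment_per_task(q):.3f} per task")
```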


Link to publication page

Repeated Labeling Using Multiple Noisy Labelers

Journal
  Panagiotis G. Ipeirotis, Foster Provost, Victor S. Sheng, and Jing Wang

 Venue: Data Mining and Knowledge Discovery, March 2014, Volume 28, Issue 2, pp 402-441

Mar 2014 | Refereed

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Amazon’s Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire. For certain label-quality/cost regimes, the benefit is substantial.
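
A minimal sketch of selective repeated labeling (a single Beta-posterior variance rule; the paper combines several notions of uncertainty): spend the next labeling budget on the items whose current label sets are least certain.

```python
# A minimal sketch of selective repeated labeling (a simple Beta-posterior
# variance rule; the paper combines several notions of uncertainty): spend the
# next labeling budget on the items whose current label sets are least certain.
def label_uncertainty(pos, neg, alpha=1.0, beta=1.0):
    """Posterior variance of the underlying positive-label rate under a
    Beta(alpha, beta) prior, used as a simple uncertainty score."""
    a, b = pos + alpha, neg + beta
    return (a * b) / ((a + b) ** 2 * (a + b + 1))

def items_to_relabel(label_counts, budget):
    """label_counts: {item: (positive_votes, negative_votes)}."""
    ranked = sorted(label_counts,
                    key=lambda i: label_uncertainty(*label_counts[i]),
                    reverse=True)
    return ranked[:budget]

label_counts = {"a": (5, 0), "b": (3, 2), "c": (1, 1), "d": (9, 8)}
print(items_to_relabel(label_counts, budget=2))   # the most contested items first
```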


Link to publication page

Have you Done Anything Like That? Predicting Performance Using Inter-category Reputation

Conference
  Marios Kokkodis and Panagiotis G. Ipeirotis

 Venue: Sixth International Conference on Web Search and Data Mining (WSDM 2013)

Feb 2013 | Refereed | 73/387=19% accepted

Online labor markets such as oDesk, Amazon Mechanical Turk, and TaskRabbit have been growing in importance over the last few years. In these markets, employers post tasks on which remote contractors work and deliver the product of their work. As in most online marketplaces, reputation mechanisms play a very important role in facilitating transactions, since they instill trust and are often predictive of the future satisfaction of the employer. However, labor markets are usually highly heterogeneous in terms of available task categories; in such scenarios, past performance may not be a representative signal of future performance. To account for this heterogeneity, in our work, we build models that predict the performance of a worker based on prior, category-specific feedback. Our models assume that each worker has a category-specific quality, which is latent and not directly observable; what is observable, though, is the set of feedback ratings of the worker and of other contractors with similar work histories. Based on this information, we build a multi-level, hierarchical scheme that deals effectively with the data sparseness, which is inherent in many cases of interest (i.e., contractors with relatively brief work histories). We evaluate our models on a large corpus of real transactional data from oDesk, an online labor market with hundreds of millions of dollars in transaction volume. Our results show an improved accuracy of up to 47% compared to the existing baseline.


Link to publication page

Facilitating Document Annotation using Content and Querying Value

Journal
  Eduardo J. Ruiz, Vagelis Hristidis, and Panagiotis G. Ipeirotis

 Venue: IEEE Transactions on Knowledge and Data Engineering (TKDE)

Feb 2014 | Refereed

A large number of organizations today generate and share textual descriptions of their products, services, and actions. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction of structured relations, they are often expensive and inaccurate, especially when operating on top of text that does not contain any instances of the targeted structured information. We present a novel alternative approach that facilitates the generation of structured metadata by identifying documents that are likely to contain information of interest, information that will subsequently be useful for querying the database. Our approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata when such information actually exists in the document, instead of naively prompting users to fill in forms with information that is not available in the document. As a major contribution of this paper, we present algorithms that identify structured attributes that are likely to appear within a document, by jointly utilizing the content of the text and the query workload. Our experimental evaluation shows that our approach generates superior results compared to approaches that rely only on the textual content or only on the query workload to identify attributes of interest.


Link to publication page

Bonus, Disclosure, and Choice: What Motivates the Creation of High-Quality Paid Reviews?

Conference
  Jing Wang, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Thirty Third International Conference on Information Systems (ICIS 2012)

Dec 2012 | Refereed

The emergence of online crowdsourcing sites has opened up new channels for third parties and companies to solicit paid reviews from people. In this paper, we investigate 1) how the introduction of monetary payments affects review quality, and 2) the impact of bonus rewards, sponsorship disclosure, and choice freedom on the quality of paid reviews. We conduct a 2×2×2 between-subjects experiment on Amazon Mechanical Turk. Our results indicate that there are no significant quality differences between paid and unpaid reviews. The quality of paid reviews improves with both the presence of additional performance-contingent rewards and the requirement to add disclosure text about material connections, and deteriorates with restrictions imposed on the set of products to be reviewed. These results have implications for websites and companies that are seeking legitimate reviews of their online products from paid workers.


Link to publication page

Search Less, Find More? Examining Limited Consumer Search with Social Media and Product Search Engines

Conference
  Anindya Ghose, Panagiotis G. Ipeirotis, and Beibei Li

 Venue: Thirty Third International Conference on Information Systems (ICIS 2012)

Dec 2012 | Refereed | Best Theme Paper Award

With the proliferation of social media, consumers’ cognitive costs during information-seeking can become non-trivial during an online shopping session. We propose a dynamic structural model of limited consumer search that combines an optimal stopping framework with an individual-level choice model. We estimate the parameters of the model using a dataset of approximately 1 million online search sessions resulting in bookings in 2117 U.S. hotels. The model allows us to estimate the monetary value of the search costs incurred by users of product search engines in a social media context. On average, searching an extra page on a search engine costs consumers $39.15, and examining an additional offer within the same page costs $6.24. A good recommendation saves consumers, on average, $9.38, whereas a bad one costs $18.54. Our policy experiment strongly supports this finding by showing that the quality of ranking can have a significant impact on consumers’ search efforts, and customized ranking recommendations tend to polarize the distribution of consumer search intensity. Our model-fit comparison demonstrates that the dynamic search model provides the highest overall predictive power compared to the baseline static models. Our dynamic model indicates that consumers have lower price sensitivity than a static model would have predicted, implying that consumers pay a lot of attention to non-price factors during an online hotel search.


Link to publication page

Designing Ranking Systems for Hotels on Travel Search Engines by Mining User-Generated and Crowd-Sourced Content

Journal
  Anindya Ghose, Panagiotis G. Ipeirotis, and Beibei Li

 Venue: Marketing Science, Vol. 31, No. 3, May/June 2012

May 2012 | Refereed

User-generated content on social media platforms and product search engines is changing the way consumers shop for goods online. However, current product search engines fail to effectively leverage information created across diverse social media platforms. Moreover, current ranking algorithms in these product search engines tend to induce consumers to focus on one single product characteristic dimension (e.g., price, star rating). This approach largely ignores consumers’ multidimensional preferences for products. In this paper, we propose to generate a ranking system that recommends products that provide, on average, the best value for the consumer’s money. The key idea is that products that provide a higher surplus should be ranked higher on the screen in response to consumer queries. We use a unique data set of U.S. hotel reservations made over a three-month period through Travelocity, which we supplement with data from various social media sources using techniques from text mining, image classification, social geotagging, human annotations, and geomapping. We propose a random coefficient hybrid structural model, taking into consideration the two sources of consumer heterogeneity that different travel occasions and different hotel characteristics introduce. Based on the estimates from the model, we infer the economic impact of various location and service characteristics of hotels. We then propose a new hotel ranking system based on the average utility gain a consumer receives from staying in a particular hotel. By doing so, we can provide customers with the “best-value” hotels early on. Our user studies, using ranking comparisons from several thousand users, validate the superiority of our ranking system relative to existing systems on several travel search engines. On a broader note, this paper illustrates how social media can be mined and incorporated into a demand estimation model in order to generate a new ranking system in product search engines. We thus highlight the tight linkages between user behavior on social media and search engines. Our interdisciplinary approach provides several insights for using machine learning techniques in economics and marketing research.


Link to publication page

Answering General Time-Sensitive Queries

Journal
  Wisam Dakka, Luis Gravano, and Panagiotis G. Ipeirotis

 Venue: IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 24, No. 2, February 2012

Feb 2012 | Refereed

Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for “recency” queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. We show that our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
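
The combination step can be sketched as a simple multiplicative blend (made-up intervals and weights, not the paper's exact scoring functions): re-rank topically relevant documents by how well their publication date matches the time intervals detected for the query.

```python
# A sketch of the combination step (a simple multiplicative blend with made-up
# numbers, not the paper's exact scoring functions): re-rank topically relevant
# documents by how well their publication date matches the query's intervals.
from datetime import date

def temporal_weight(pub_date, intervals, background=0.05):
    """intervals: list of (start, end, weight) detected for the query."""
    return max((w for s, e, w in intervals if s <= pub_date <= e),
               default=background)

def time_sensitive_score(topical_score, pub_date, intervals, lam=0.5):
    """Blend topical relevance with the temporal prior (lam balances the two)."""
    return topical_score ** (1 - lam) * temporal_weight(pub_date, intervals) ** lam

# Hypothetical query: "world cup final"; detected bursts around the tournaments.
intervals = [(date(2010, 6, 11), date(2010, 7, 11), 1.0),
             (date(2006, 6, 9), date(2006, 7, 9), 0.8)]
docs = [("match report, July 2010", 0.70, date(2010, 7, 11)),
        ("qualifier preview, 2009", 0.75, date(2009, 3, 2))]
for title, topical, pub in docs:
    print(title, round(time_sensitive_score(topical, pub, intervals), 3))
```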


Link to publication page

Content and Context: Identifying the Impact of Qualitative Information on Consumer Choice

Conference
  Sinan Aral, Panagiotis G. Ipeirotis, and Sean Taylor

 Venue: Proceedings of the International Conference on Information Systems (ICIS), 2011

Dec 2011 | Refereed

Managers and researchers alike suspect that the vast amounts of qualitative information in blogs, reviews, news stories, and experts’ advice influence consumer behavior. But, does qualitative information impact or rather reflect consumer choices? We argue that because message content and consumer choice are endogenous, non-random selection and conflation of awareness and persuasion complicate causal estimation of the impact of message content on outcomes. We apply Latent Dirichlet Allocation to characterize the topics of transcribed content from 2,397 stock recommendations provided by Jim Cramer on his show Mad Money. We demonstrate that selection bias and audience prior awareness create measurable biases in estimates of the impact of content on stock prices. Comparing recommendation content to prior news, we show that he is less persuasive when he uses more novel arguments. The technique we develop can be applied in a variety of settings where marketers can present different messages depending on what subjects know.


Link to publication page

Examining the Impact of Search Engine Ranking and Personalization on Consumer Behavior: Combining Bayesian Modeling with Randomized Field Experiments

Workshop
  Anindya Ghose, Panagiotis G. Ipeirotis, and Beibei Li

 Venue: Proceedings of the Workshop on Information Systems and Economics (WISE), 2011

Dec 2011 | Refereed

In this paper, we examine how different ranking and personalization mechanisms on product search engines influence consumer online search and purchase behavior. To investigate these effects, we combine archival data analysis with randomized field experiments. Our archival data analysis is based on a unique dataset containing approximately 1 million online sessions from Travelocity over a 3-month period. Using a hierarchical Bayesian model, we first jointly estimate the relationship among consumer click and purchase behavior, and search engine ranking decisions. To evaluate the causal effect of search engine interface on user behavior, we conduct randomized field experiments. The field experiments are based on a real-world hotel search engine application designed and built by us. By manipulating the default ranking method of search results, and by enabling or disabling a variety of personalization features on the hotel search engine website, we are able to empirically identify the causal impact of search engines on consumers’ online click and purchase behavior.

The archival data analysis and the randomized experiments are consistent in demonstrating that ranking has a significant effect on consumer click and purchase behavior. We find that hotels with a higher reputation for providing superior services are more adversely affected by an inferior screen position. In addition, a consumer utility-based ranking mechanism yields the highest click and purchase propensities in comparison to existing benchmark systems such as ranking based on price or customer ratings. Our randomized experiments on the impact of active vs. passive personalization mechanisms on user behavior indicate that although active personalization (wherein users can interact with the recommendation algorithm) can lead to a higher click-through rate compared to passive personalization, it leads to a lower conversion rate when consumers have a planned purchase beforehand. This finding suggests that active personalization strategies should not be adopted ubiquitously by product search engines. On a broader note, our inter-disciplinary approach provides a methodological framework for how econometric modeling, randomized field experiments, and IT-based artifacts can be integrated in the same study towards deriving causal relationships between variables of interest.


Link to publication page

Relevance-Based Retrieval on Hidden-Web Text Databases without Ranking Support

Journal
  Vagelis Hristidis, Yuheng Hu, and Panagiotis G. Ipeirotis

 Venue: IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 23, No. 10, October 2011

Oct 2011 | Refereed

Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user would typically prefer a ranking by relevance, measured by an information retrieval (IR) ranking function. A naive approach would be to submit a disjunctive query with all query keywords, retrieve all the returned matching documents, and then rerank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper, we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC data set show that we achieve order of magnitude improvement compared to the current baseline approaches.


Link to publication page

Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics

Journal
  Anindya Ghose and Panagiotis G. Ipeirotis

 Venue: IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 23, No. 10, October 2011

Oct 2011 | Refereed

With the rapid growth of the Internet, the ability of users to create and publish content has created active electronic communities that provide a wealth of product information. However, the high volume of reviews that are typically published for a single product makes it harder for individuals as well as manufacturers to locate the best reviews and understand the true underlying quality of a product. In this paper, we reexamine the impact of reviews on economic outcomes like product sales and see how different factors affect social outcomes such as their perceived usefulness. Our approach explores multiple aspects of review text, such as subjectivity levels, various measures of readability, and extent of spelling errors to identify important text-based features. In addition, we also examine multiple reviewer-level features such as average usefulness of past reviews and the self-disclosed identity measures of reviewers that are displayed next to a review. Our econometric analysis reveals that the extent of subjectivity, informativeness, readability, and linguistic correctness in reviews matters in influencing sales and perceived usefulness. Reviews that have a mixture of objective and highly subjective sentences are negatively associated with product sales, compared to reviews that tend to include only subjective or only objective information. However, such reviews are rated more informative (or helpful) by other users. By using Random Forest-based classifiers, we show that we can accurately predict the impact of reviews on sales and their perceived usefulness. We examine the relative importance of the three broad feature categories: “reviewer-related” features, “review subjectivity” features, and “review readability” features, and find that using any of the three feature sets results in statistically equivalent performance to using all available features. This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.
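
A minimal sketch of the predictive-modeling step (synthetic features and labels; the paper's feature set is much richer): train a Random Forest to predict whether a review will be rated helpful from subjectivity, readability, and reviewer-history features.

```python
# A minimal sketch of the predictive-modeling step (synthetic features and
# labels; the paper's feature set is much richer): predict whether a review
# will be rated helpful from subjectivity, readability, and reviewer features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(0, 1, n),      # average sentence subjectivity
    rng.uniform(0, 1, n),      # readability (normalized)
    rng.uniform(0, 0.2, n),    # spelling-error rate
    rng.uniform(0, 1, n),      # reviewer's average past helpfulness
])
# Synthetic ground truth: readable, low-error reviews by good reviewers are helpful.
y = (0.4 * X[:, 1] - 0.8 * X[:, 2] + 0.5 * X[:, 3] + rng.normal(0, 0.1, n)) > 0.4

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", round(clf.score(X_test, y_test), 3))
print("feature importances:", clf.feature_importances_.round(2))
```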


Link to publication page

What’s the Right Price? Pricing Tasks for Finishing on Time

Workshop
  Siamak Faridani, Björn Hartmann, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 3rd Human Computation Workshop (HCOMP), 2011

Aug 2011 | Refereed | 16/52=30% accepted

Many practitioners currently use rules of thumb to price tasks on online labor markets. Incorrect pricing leads to task starvation or inefficient use of capital. Formal pricing policies can address these challenges. In this paper we argue that a pricing policy can be based on the trade-off between price and desired completion time. We show how this duality can lead to a better pricing policy for tasks in online labor markets. This paper makes three contributions. First, we devise an algorithm for job pricing using a survival analysis model. We then show that worker arrivals can be modeled as a non-homogeneous Poisson process (NHPP). Finally, using an NHPP for worker arrivals and discrete choice models, we present an abstract mathematical model that captures the dynamics of the market when full market information is presented to the task requester. This model can be used to predict completion times and pricing policies for both public and private crowds.
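
Simulating worker arrivals as a non-homogeneous Poisson process can be sketched with the standard thinning algorithm (the diurnal rate curve below is made up, not the rate fitted in the paper):

```python
# A sketch of simulating worker arrivals as a non-homogeneous Poisson process
# via the standard thinning algorithm (the diurnal rate curve is an assumption,
# not the rate fitted in the paper).
import math
import random

def rate(t_hours):
    """Assumed arrival rate (workers/hour) with a mild diurnal pattern."""
    return 20 + 15 * math.sin(2 * math.pi * t_hours / 24)

def simulate_nhpp(rate, horizon_hours, rate_max, seed=0):
    random.seed(seed)
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(rate_max)         # candidate from a homogeneous process at rate_max
        if t > horizon_hours:
            return arrivals
        if random.random() < rate(t) / rate_max:  # accept with probability rate(t)/rate_max
            arrivals.append(t)

arrivals = simulate_nhpp(rate, horizon_hours=24, rate_max=35)
print(f"{len(arrivals)} simulated arrivals in 24h; first few: "
      f"{[round(a, 2) for a in arrivals[:5]]}")
```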


Link to publication page

Beat the Machine: Challenging Workers to Find the Unknown Unknowns

Workshop
  Josh Attenberg, Panagiotis G. Ipeirotis, and Foster Provost

 Venue: Proceedings of the 3rd Human Computation Workshop (HCOMP), 2011

Aug 2011 | Refereed | 16/52=30% accepted

We present techniques for gathering data that expose errors of automatic predictive models. In certain common settings, traditional methods for evaluating predictive models tend to miss rare-but-important errors; most importantly, rare cases for which the model is confident of its prediction (but wrong). In this paper we present a system that, in a game-like setting, asks humans to identify cases that will cause the predictive-model-based system to fail. Such techniques are valuable in discovering problematic cases that do not reveal themselves during the normal operation of the system, and may include cases that are rare but catastrophic. We describe the design of the system, including design iterations that did not quite work. In particular, the system incentivizes humans to provide examples that are difficult for the model to handle, by providing a reward proportional to the magnitude of the predictive model’s error. The humans are asked to “Beat the Machine” and find cases where the automatic model (“the Machine”) is wrong. Experiments show that the humans using Beat the Machine identify more errors than traditional techniques for discovering errors in predictive models, and indeed, they identify many more errors where the machine is confident it is correct. Further, the cases the humans identify seem to be not simply outliers, but coherent areas missed completely by the model. Beat the Machine identifies the “unknown unknowns.”


Link to publication page

Deriving the Pricing Power of Product Features by Mining Consumer Reviews

Journal
  Nikolay Archak, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Management Science, Vol. 57, No 8, August 2011

Aug 2011 | Refereed

Increasingly, user-generated product reviews serve as a valuable source of information for customers making product choices online. The existing literature typically incorporates the impact of product reviews on sales based on numeric variables representing the valence and volume of reviews. In this paper, we posit that the information embedded in product reviews cannot be captured by a single scalar value. Rather, we argue that product reviews are multifaceted, and hence the textual content of product reviews is an important determinant of consumers’ choices, over and above the valence and volume of reviews. To demonstrate this, we use text mining to incorporate review text in a consumer choice model by decomposing textual reviews into segments describing different product features. We estimate our model based on a unique data set from Amazon containing sales data and consumer review data for two different groups of products (digital cameras and camcorders) over a 15-month period. We alleviate the problems of data sparsity and of omitted variables by providing two experimental techniques: clustering rare textual opinions based on pointwise mutual information and using externally imposed review semantics. This paper demonstrates how textual data can be used to learn consumers’ relative preferences for different product features and also how text can be used for predictive modeling of future changes in sales.
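
The pointwise-mutual-information step can be sketched with toy counts (not the paper's corpus statistics): PMI measures how much more often an opinion word co-occurs with a product-feature word than chance would predict.

```python
# A small sketch of the pointwise mutual information used for clustering rare
# opinion phrases (toy counts, not the paper's corpus statistics): PMI measures
# how much more often an opinion word co-occurs with a feature word than chance.
import math

def pmi(count_xy, count_x, count_y, total):
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

total_reviews = 10_000
counts = {                 # (opinion word, feature word): co-occurrence count
    ("blurry", "lens"): 120,
    ("blurry", "battery"): 3,
    ("short", "battery"): 200,
}
word_counts = {"blurry": 250, "short": 400, "lens": 900, "battery": 1500}

for (opinion, feature), c in counts.items():
    score = pmi(c, word_counts[opinion], word_counts[feature], total_reviews)
    print(f"PMI({opinion}, {feature}) = {score:+.2f}")
```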


Link to publication page

The Need for Standardization in Crowdsourcing

Workshop
  Panagiotis G. Ipeirotis and John J. Horton

 Venue: Proceedings of the Workshop on Crowdsourcing and Human Computation at CHI, 2011

May 2011 | Refereed

Crowdsourcing has shown itself to be well-suited for the accomplishment of certain kinds of small tasks, yet many crowdsourceable tasks still require extensive structuring and managerial effort before using a crowd is feasible. We argue that this overhead could be substantially reduced via standardization. In the same way that task standardization enabled the mass production of physical goods, standardization of basic “building block” tasks would make crowdsourcing more scalable. Standardization would make it easier to set prices, spread best practices, build meaningful reputation systems and track quality. All of this would increase the demand for paid crowdsourcing-a development we argue is positive on both efficiency and welfare grounds. Standardization would also allow more complex processes to be built out of simpler tasks while still being able to predict quality, cost and time to completion. Realizing this vision will require interdisciplinary research effort as well as buy-in from online labor platforms.


Link to publication page

Towards a Theory Model for Product Search

Conference
  Beibei Li, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 20th International World-Wide Web Conference (WWW), 2011

Mar 2011 | Refereed | 81/658=12% accepted; Best Paper Award

With the growing pervasiveness of the Internet, online search for products and services is constantly increasing. Most product search engines are based on adaptations of theoretical models devised for information retrieval. However, the decision mechanism that underlies the process of buying a product is different than the process of locating relevant documents or objects.

We propose a theory model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest surplus, after the purchase. In a sense, the top ranked products are the “best value for money” for a specific user. Our approach builds on research on “demand estimation” from economics and presents a solid theoretical foundation on which further research can build. We build algorithms that take into account consumer demographics, heterogeneity of consumer preferences, and also account for the varying price of the products. We show how to achieve this without knowing the demographics or purchasing histories of individual consumers but by using aggregate demand data. We evaluate our work by applying the techniques to hotel search. Our extensive user studies, using more than 15,000 user-provided ranking comparisons, demonstrate an overwhelming preference for the rankings generated by our techniques, compared to a large number of existing strong state-of-the-art baselines.
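The core ranking idea, ordering products by the surplus they leave the consumer after purchase, can be sketched in a few lines. The part-worths below are invented for illustration; in the paper the corresponding weights come from demand estimation on aggregate data.

```python
from dataclasses import dataclass

@dataclass
class Hotel:
    name: str
    price: float          # price per night
    features: dict        # e.g., {"stars": 4, "beach": 1, "review_score": 4.3}

# Illustrative part-worths (dollar value per unit of each feature) for one consumer segment.
PART_WORTH = {"stars": 35.0, "beach": 35.0, "review_score": 20.0}

def surplus(h: Hotel) -> float:
    """Expected consumer surplus: imputed willingness to pay minus posted price."""
    willingness_to_pay = sum(PART_WORTH[k] * v for k, v in h.features.items())
    return willingness_to_pay - h.price

hotels = [
    Hotel("Downtown Inn", 120, {"stars": 3, "beach": 0, "review_score": 4.1}),
    Hotel("Beachfront Resort", 240, {"stars": 4, "beach": 1, "review_score": 4.4}),
    Hotel("Budget Stay", 70, {"stars": 2, "beach": 0, "review_score": 3.6}),
]

# "Best value for money" ranking: highest surplus first.
for h in sorted(hotels, key=surplus, reverse=True):
    print(f"{h.name:20s} surplus = ${surplus(h):7.2f}")
```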


Link to publication page

The Computer is the New Sewing Machine: Benefits and Perils of Crowdsourcing

Misc Presentation
  Praveen Paritosh, Panagiotis G. Ipeirotis, Matt Cooper, and Siddharth Suri

 Venue: Panel at the 20th International World-Wide Web Conference (WWW), 2011

Mar 2011 | Invited

There is increased participation by the developing world in the global manufacturing marketplace: the sewing machine in Bangladesh can be a means to support an entire family. Crowdsourcing for cognitive tasks consists of asking humans questions that are otherwise impossible for algorithms to answer, e.g., is this image pornographic, are these two addresses the same, what is the translation for this text in French? In the last five years, there has been an exponential growth in the size of the global cognitive marketplace: Amazon.com’s Mechanical Turk has an estimated 500,000 active workers in over 100 countries, and there are dozens of other companies in this space. This turns the computer into a modern-day sewing machine, where cognitive work of various levels of difficulty will pay anywhere from 5 to 50 dollars a day. Unlike outsourcing, which usually requires a college education, competence at these tasks might require a month or even less of training. At its best, this could be a powerful bootstrap for a billion people. At its worst, this can lead to unprecedented exploitation. In this panel, we discuss the technical, social and economic questions and implications that a global cognitive marketplace raises.


Link to publication page

A Demo Search Engine for Products

Workshop
  Beibei Li, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 20th International World-Wide Web Conference (WWW), 2011

Mar 2011 | Refereed

Most product search engines today build on models of relevance devised for information retrieval. However, the decision mechanism that underlies the process of buying a product is different than the process of locating relevant documents or objects. We propose a theory model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest surplus, after the purchase. We instantiate our research by building a demo search engine for hotels that takes into account heterogeneous consumer preferences, and also accounts for the varying hotel price. Moreover, we achieve this without explicitly asking for the preferences or purchasing histories of individual consumers, but by using aggregate demand data. This new ranking system is able to recommend to consumers the products with the “best value for money” in a privacy-preserving manner. The demo is accessible at http://nyuhotels.appspot.com/


Link to publication page

Managing Crowdsourced Human Computation: A Tutorial

Misc Presentation
  Panagiotis G. Ipeirotis and Praveen K. Paritosh

 Venue: Tutorial at the 20th International World-Wide Web Conference (WWW), 2011

Mar 2011 | Refereed

The tutorial covers an emerging topic of wide interest: crowdsourcing. Specifically, we cover areas of crowdsourcing related to managing structured and unstructured data in a web-related context. Many researchers and practitioners today see the great opportunity that becomes available through easily available crowdsourcing platforms. However, most newcomers face the same questions: How can we manage the (noisy) crowds to generate high-quality output? How can we estimate the quality of the contributors? How can we best structure the tasks? How can we get results quickly while minimizing the necessary resources? How should we set up the incentives? How should such crowdsourcing markets be set up? The presented material covers topics from a variety of fields, including computer science, statistics, economics, and psychology. Furthermore, the material will include real-life examples and case studies from years of experience in running and managing crowdsourcing applications in business settings.


Link to publication page

Estimating the Completion Time of Crowdsourced Tasks Using Survival Analysis Models

Workshop
  Jing Wang, Siamak Faridani, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (WSDM11-CSDM), 2011

Feb 2011 | Refereed

In order to seamlessly integrate a human computation component (e.g., Amazon Mechanical Turk) within a larger production system, we need to have some basic understanding of how long it takes to complete a task posted for completion in a crowdsourcing platform. We present an analysis of the completion time of tasks posted on Amazon Mechanical Turk, based on a dataset containing 165,368 HIT groups, with a total of 6,701,406 HITs, from 9,436 requesters, posted over a period of 15 months. We model the completion time as a stochastic process and build a statistical method for predicting the expected time for task completion. We use a survival analysis model based on Cox proportional hazards regression. We present the preliminary results of our work, showing how time-independent variables of posted tasks (e.g., type of the task, price of the HIT, day posted, etc.) affect completion time. We consider this a first step towards building a comprehensive optimization module that provides recommendations for pricing and posting time, in order to satisfy the constraints of the requester.
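A minimal version of the modeling step, a Cox proportional hazards regression of completion time on time-independent HIT attributes, can be sketched with the lifelines library. The toy data frame and column names below are illustrative assumptions, not the paper's dataset.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy HIT-group data: hours until the group completed, whether completion was
# observed (1) or censored (0), and time-independent covariates at posting time.
df = pd.DataFrame({
    "duration_hours": [4.0, 30.0, 12.0, 72.0, 6.5, 48.0, 18.0, 2.0],
    "completed":      [1,   1,    1,    0,    1,   1,    1,    1],
    "reward_usd":     [0.10, 0.02, 0.05, 0.01, 0.25, 0.03, 0.05, 0.50],
    "num_hits":       [50, 500, 120, 1000, 20, 300, 80, 10],
    "posted_weekend": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Cox proportional hazards regression: how covariates shift the completion hazard.
cph = CoxPHFitter()
cph.fit(df, duration_col="duration_hours", event_col="completed")
cph.print_summary()

# Predicted median completion time for a hypothetical new HIT group.
new_hit = pd.DataFrame({"reward_usd": [0.08], "num_hits": [200], "posted_weekend": [0]})
print(cph.predict_median(new_hit))
```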


Link to publication page

Managing Crowdsourcing Workers

Workshop
  Jing Wang, Panagiotis G. Ipeirotis, and Foster Provost

 Venue: Winter Conference on Business Intelligence, 2011

Jan 2011 | Unrefereed

The emergence of online crowdsourcing services, such as Amazon Mechanical Turk, presents huge opportunities to distribute micro-tasks at an unprecedented rate and scale. Unfortunately, the high verification cost and the unstable employment relationship give rise to opportunistic behaviors of workers, which in turn exposes the requesters to quality risks. Currently, most requesters rely on redundancy to identify the correct answers. However, existing techniques cannot separate the true (unrecoverable) error rates from the (recoverable) biases that some workers exhibit, which leads to incorrect assessments of worker quality. Furthermore, massive redundancy is expensive, increasing significantly the cost of crowdsourced solutions. In this paper, we present an algorithm that can easily separate the true error rates from the biases. Also, we describe how to seamlessly integrate the existence of “gold” data for learning the quality of workers. Next, we present an approach for actively testing worker quality in order to quickly identify spammers or malicious workers. Finally, we present experimental results to demonstrate the performance of our proposed algorithm.


Link to publication page

System, Method, Software Arrangement and Computer-Accessible Medium for Incorporating Qualitative and Quantitative Information into an Economic Model

Misc Presentation
 

 Venue: US Patent 7,848,979

Dec 2010

A system, method, software arrangement and computer-accessible medium can be provided for incorporating quantitative and qualitative information into an economic model. An exemplary method for analyzing qualitative information associated with a characteristic of at least one entity based on associated quantitative information includes: obtaining first information which contains at least in part qualitative information relating to at least one of the at least one entity; determining second information associated with at least one attribute of the characteristic obtained from the first information; obtaining third information which contains at least in part quantitative information associated with at least one of at least one entity; and establishing fourth information as a function of the second information and the third information to determine which of at least one attribute affects the characteristic. For example, an observable economic variable can be characterized using numerical and qualitative information associated with one or more of the entities. The influence of the quantitative and qualitative information on the observable economic variable for a given entity relative to other entities may be determined using statistical regressions.


Link to publication page

Analyzing the Amazon Mechanical Turk Marketplace

Journal
  Panagiotis G. Ipeirotis

 Venue: ACM XRDS (Crossroads), Vol. 17, No 2, Winter 2010

Dec 2010 | Refereed

Amazon Mechanical Turk (AMT) is a popular crowdsourcing marketplace, introduced by Amazon in 2005. The marketplace is named after an 18th century “automatic” chess-playing machine, which was handily beating humans in chess games. Of course, the robot was not using any artificial intelligence algorithms back then. The secret of the Mechanical Turk machine was a human operator, hidden inside the machine, who was the real intelligence source. AMT is also a marketplace for small tasks that cannot be easily automated. For example, humans can tell if two different descriptions correspond to the same product, can easily tag an image with descriptions of its content, or can easily transcribe with high quality an audio snippet-though all those tasks are extremely difficult for computers to do. Using Mechanical Turk, computers can use a programmable API to post tasks on the marketplace, which are then fulfilled by human users. This API-based interaction gives the impression that the task can be automatically fulfilled, hence the name. In the marketplace, employers are known as requesters and they post tasks, called human intelligence tasks, or HITs. The HITs are then picked up by online users, referred to as workers, who complete them in exchange for a small payment, typically a few cents per HIT. Since the concept of crowdsourcing is relatively new, many potential participants have questions about the AMT marketplace. For example, a common set of questions that pop up in an “introduction to crowdsourcing and AMT” session are the following:
• Who are the workers that complete these tasks?
• What type of tasks can be completed in the marketplace?
• How much does it cost?
• How fast can I get results back?
• How big is the AMT marketplace?
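The programmable, API-based interaction described above can be scripted with the AWS SDK. The sketch below is a hedged example using boto3 against the requester sandbox; it assumes AWS credentials are already configured, and the external task URL and HIT parameters are placeholders, not values from the article.

```python
import boto3

# Requester-side sketch: post a HIT programmatically via the MTurk API.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An ExternalQuestion points workers to a task page hosted by the requester.
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/label-image?id=42</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

response = mturk.create_hit(
    Title="Does this image contain a product?",
    Description="Look at one image and answer a yes/no question.",
    Keywords="image, labeling, quick",
    Reward="0.05",                      # a few cents per HIT, as described in the article
    MaxAssignments=3,                   # redundancy for quality control
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=86400,
    Question=external_question,
)
print("HIT group:", response["HIT"]["HITGroupId"])
```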


Link to publication page

Modeling Dependency in Prediction Markets

Working paper
  Nikolay Archak and Panagiotis G. Ipeirotis

 Venue: NYU Center for Digital Economy Research Working Paper No. CEDER-10-05

Dec 2010 | Unrefereed

In the last decade, prediction markets became popular forecasting tools in areas ranging from election results to movie revenues and Oscar nominations. One of the features that make prediction markets particularly attractive for decision support applications is that they can be used to answer what-if questions and estimate probabilities of complex events. The traditional approach to answering such questions involves running a combinatorial prediction market, which is not always possible. In this paper, we present an alternative, statistical approach to pricing complex claims, which is based on analyzing co-movements of prediction market prices for basis events. Experimental evaluation of our technique on a collection of 51 InTrade contracts representing the Democratic Party Nominee winning Electoral College Votes of a particular state shows that the approach outperforms traditional forecasting methods such as price and return regressions and can be used to extract meaningful business intelligence from raw price data.


Link to publication page

Running Experiments on Amazon Mechanical Turk

Journal
  Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis

 Venue: Judgment and Decision Making, Vol. 5, No. 5, August 2010

Aug 2010 | Refereed

Although Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited from online labor markets. We address these potential concerns by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of Mechanical Turk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools. We further discuss some additional benefits such as the possibility of longitudinal, cross-cultural, and prescreening designs, and offer some advice on how to best manage a common subject pool.


Link to publication page

Quality Management on Amazon Mechanical Turk

Workshop
  Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang

 Venue: Proceedings of the Second Human Computation Workshop (HCOMP), 2010

Jul 2010 | Refereed

Crowdsourcing services, such as Amazon Mechanical Turk, allow for easy distribution of small tasks to a large number of workers. Unfortunately, since manually verifying the quality of the submitted results is hard, malicious workers often take advantage of the verification difficulty and submit answers of low quality. Currently, most requesters rely on redundancy to identify the correct answers. However, redundancy is not a panacea. Massive redundancy is expensive, increasing significantly the cost of crowdsourced solutions. Therefore, we need techniques that will accurately estimate the quality of the workers, allowing for the rejection and blocking of the low-performing workers and spammers.

However, existing techniques cannot separate the true (unrecoverable) error rate from the (recoverable) biases that some workers exhibit. This lack of separation leads to incorrect assessments of a worker’s quality. We present algorithms that improve the existing state-of-the-art techniques, enabling the separation of bias and error. Our algorithm generates a scalar score representing the inherent quality of each worker. We illustrate how to incorporate cost-sensitive classification errors in the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring the quality of the workers. We present experimental results demonstrating the performance of the proposed algorithm under a variety of settings.
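The separation of recoverable bias from unrecoverable error builds on estimating a full confusion matrix per worker, in the spirit of Dawid and Skene. The sketch below implements only that expectation-maximization core on toy votes; the paper's algorithm goes further and collapses each confusion matrix into a single cost-sensitive scalar quality score.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """EM estimation of true classes and per-worker confusion matrices.

    labels: list of (example_id, worker_id, reported_class) triples.
    Returns (posterior over true classes, per-worker confusion matrices)."""
    examples = sorted({e for e, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    e_idx = {e: i for i, e in enumerate(examples)}
    w_idx = {w: i for i, w in enumerate(workers)}
    E, W, K = len(examples), len(workers), n_classes

    # Initialize posteriors with (soft) per-example majority votes.
    post = np.full((E, K), 1e-6)
    for e, w, c in labels:
        post[e_idx[e], c] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and worker confusion matrices pi[w, true, reported].
        prior = post.mean(axis=0)
        pi = np.full((W, K, K), 1e-6)
        for e, w, c in labels:
            pi[w_idx[w], :, c] += post[e_idx[e]]
        pi /= pi.sum(axis=2, keepdims=True)

        # E-step: recompute the posterior over true classes for each example.
        log_post = np.tile(np.log(prior), (E, 1))
        for e, w, c in labels:
            log_post[e_idx[e]] += np.log(pi[w_idx[w], :, c])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, pi

# Worker 2 always reports the opposite label: biased but fully recoverable.
votes = [(0, 0, 0), (0, 1, 0), (0, 2, 1),
         (1, 0, 1), (1, 1, 1), (1, 2, 0),
         (2, 0, 0), (2, 1, 0), (2, 2, 1),
         (3, 0, 1), (3, 1, 1), (3, 2, 0)]
posteriors, confusion = dawid_skene(votes, n_classes=2)
print(np.round(posteriors, 2))
print(np.round(confusion[2], 2))   # near anti-diagonal: systematic bias, not random error
```

In the toy data, worker 2's estimated confusion matrix is close to anti-diagonal, which flags a systematic (and correctable) bias rather than noise, exactly the distinction the paper argues a quality score must capture.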


Link to publication page

Demographics of Mechanical Turk

Working paper
  Panagiotis G. Ipeirotis

 Venue: NYU Center for Digital Economy Research Working Paper CeDER-10-01

Mar 2010 | Unrefereed

We present the results of a survey that collected information about the demographics of participants on Amazon Mechanical Turk, together with information about their level of activity and motivation for working on Amazon Mechanical Turk. We find that approximately 50% of the workers come from the United States and 40% come from India. Country of origin tends to change the motivating reasons for workers to participate in the marketplace. Significantly more workers from India participate on Mechanical Turk because the online marketplace is a primary source of income, while in the US most workers consider Mechanical Turk a secondary source of income. While money is a primary motivating reason for workers to participate in the marketplace, workers also cite a variety of other motivating reasons, including entertainment and education.


Link to publication page

Ranked Queries Over Sources with Boolean Query Interfaces without Ranking Support

Conference
  Vagelis Hristidis, Yuheng Hu, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), 2010

Mar 2010 | Refereed | ~20% accepted

Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only.

However, a user would typically prefer a ranking by relevance, measured by an Information Retrieval (IR) ranking function. The naive approach would be to submit a disjunctive query with all query keywords, retrieve the returned documents, and then re-rank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC dataset shows that we achieve an order-of-magnitude improvement compared to the current baseline approaches.
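The core idea, issuing conjunctive queries that surface high-scoring candidates before low-scoring ones, can be illustrated with a toy Boolean source. The sketch below uses "number of matched query terms" as a stand-in monotonic score; the paper's algorithms use a more careful query-generation and pruning strategy over real IR ranking functions.

```python
from itertools import combinations

# Toy "Boolean source": supports conjunctive keyword queries, returns unranked doc ids.
DOCS = {
    1: "cancer treatment radiation therapy",
    2: "radiation exposure study",
    3: "cancer radiation therapy outcomes",
    4: "therapy outcomes survey",
    5: "cancer screening guidelines",
}

def boolean_query(terms):
    """AND-semantics query against the source; the source itself does no ranking."""
    return [d for d, text in DOCS.items() if all(t in text.split() for t in terms)]

def ranked_top_k(query_terms, k=2):
    """Retrieve top-k docs under a monotone score (here: # of matched query terms)
    by issuing conjunctive queries from the most to the least specific, so that
    high-scoring candidates surface before low-scoring ones."""
    seen, results = set(), []
    for size in range(len(query_terms), 0, -1):
        for subset in combinations(query_terms, size):
            for d in boolean_query(subset):
                if d not in seen:
                    seen.add(d)
                    results.append((d, size))   # size = number of matched query terms
        if len(results) >= k:   # every unseen doc matches fewer than `size` terms: safe to stop
            return results[:k]
    return results[:k]

print(ranked_top_k(["cancer", "radiation", "therapy"], k=2))
```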


Link to publication page

Improving Product Search with Economic Theory

Misc Presentation
  Beibei Li, Panagiotis G. Ipeirotis, and Anindya Ghose

 Venue: Proceedings of the Workshops of 2010 IEEE 26th International Conference on Data Engineering (ICDE), 2010

Mar 2010 | Refereed

With the growing pervasiveness of the Internet, online search for commercial goods and services is constantly increasing, as more and more people search and purchase goods from the Internet. Most of the current algorithms for product search are based on adaptations of theoretical models devised for “classic” information retrieval. However, the decision mechanism that underlies the process of buying a product is different than the process of judging a document as relevant or not. So, applying theories of relevance for the task of product search may not be the best approach.

We propose a theory model for product search based on expected utility theory from economics. Specifically, we propose a ranking technique in which we rank highest the products that generate the highest consumer surplus after the purchase. In a sense, we rank highest the products that are the “best value for money” for a specific user. Our approach naturally builds on decades of research in the field of economics and presents a solid theoretical foundation on which further research can build. We instantiate our research by building a search engine for hotels, and show how we can build algorithms that naturally take into account consumer demographics, heterogeneity of consumer preferences, and also account for the varying price of the hotel rooms. Our extensive user studies demonstrate an overwhelming preference for the rankings generated by our techniques, compared to a large number of existing strong baselines.


Link to publication page

Designing Ranking Systems for Hotels on Travel Search Engines to Enhance User Experience

Conference
  Anindya Ghose, Panagiotis G. Ipeirotis, and Beibei Li

 Venue: Proceedings of the International Conference on Information Systems (ICIS), 2010

Jan 2010 | Refereed

Information seeking in an online shopping environment is different from classical relevance-based information retrieval. In this paper, we focus on understanding how humans seek information and make economic decisions, when interacting with an array of choices in an online shopping environment. Our study is instantiated on a unique dataset of US hotel reservations from Travelocity.com. Current travel search engines display results with only rudimentary interfaces by using a single ranking criterion. This largely ignores consumers’ multi-dimensional preferences and is constrained by human cognitive limitations towards multiple and diverse online data sources. This paper proposes to improve the search interface using an inter-disciplinary approach. It combines text mining, image classification, social geo-tagging and field experiments with structural modeling techniques, and designs an interface whereby each hotel of interest is ranked according to its average consumer surplus on the travel search site. This system would display the hotels with the “best value for money” for a given consumer at the top of the computer screen during the travel search process in response to a query. The final outcome based on several user studies shows that our proposed ranking system outperforms the existing baselines. This suggests that our inter-disciplinary approach has the potential to enhance user search experience for information seeking in the context of economic decision-making.


Link to publication page

Cramer’s Rule: How Information Content Moves Markets

Workshop
  Sinan Aral, Panagiotis G. Ipeirotis, and Sean J. Taylor

 Venue: Proceedings of the 20th Workshop on Information Systems and Economics (WISE), 2009

Sep 2009 | Refereed

When Jim Cramer offers investment advice on his CNBC show Mad Money, he influences market prices (Engelberg et al., 2009). By analyzing text from transcripts of the show, we explore the relationship between what Cramer says and the magnitude and direction of his price effect. We demonstrate that Cramer’s influence is more complex than simply drawing investor attention to particular stocks and is in fact related to the content of his recommendations. A cursory viewing of Mad Money reveals that Cramer generally provides no new information about stocks, but instead argues that they may be mispriced by investors with access to identical information. The puzzle of the Cramer effect is why Cramer’s advice, despite containing little new information about stock fundamentals, influences investors to alter their valuations and thus the stock price.

An intuitive explanation is that markets are informationally incomplete, that investors are not aware of all the securities they could trade, and that when Cramer recommends a stock, he simply draws attention to it. Had investors known about the stock, they would have incorporated this knowledge into their decisions and the stock would have been priced appropriately. Merton (1987) formalized this explanation in his “investor recognition hypothesis.” In his model, stocks with low investor recognition earn higher returns to compensate holders for being imperfectly diversified. Indeed, stocks with no media coverage earn higher returns when controlling for common risk factors (Fang and Peress, 2008), and increased investor attention to a particular Cramer recommendation (as measured by Nielsen television ratings) significantly increases the market’s response to Cramer’s advice (Engelberg et al., 2009). The story behind this hypothesis is that Cramer simply draws attention to stocks which lacked investor awareness and were therefore earning higher returns.

Another potential explanation for the Cramer effect is that markets are affected by noise traders who, unlike rational investors who only consider fundamentals, irrationally act on noise coming from media coverage, pundits, and their own generally uninformed research (DeLong et al., 1990). These noise traders are swayed by media content that expresses optimistic or pessimistic sentiment about stocks without providing any new information on fundamentals. There is some empirical evidence that media content affects stock prices. For example, Tetlock (Forthcoming) conducted a simple binary text analysis of a daily Wall Street Journal column and found, consistent with the theoretical predictions of DeLong et al. (1990), that pessimistic media content induces downward pressure on stock prices and that the price impact of this pressure reverses itself over time. A similar trend is evident in the price impact of Cramer’s recommendations. When he mentions a stock on his show, it initially undergoes a significant price change which reverses over the next 30 days (Engelberg et al., 2009). As Cramer rarely discusses obscure stocks, it could be that the magnitude and direction of his influence on the market is not simply attentional, but rather related to the content of what he says-essentially, that the content of his recommendations creates changes in sentiment that move the market.

To explore the source of Cramer’s price effect and to extend work on sentiment analysis beyond simple binary characterizations of positive and negative coverage, we constructed a model of Cramer’s influence on investor sentiment based on content features derived from Mad Money transcripts. Applying recent developments in generative text analysis (Blei et al., 2003), we estimated posterior probabilities that Cramer discussed specific topics in his recommendations and assessed the relative impact of these different topics on the magnitude and direction of Cramer’s influence on stock prices. Our analysis suggests that the topics of Cramer’s discourse explain a significant amount of the variance in the abnormal returns generated the day after he recommends a stock. The results imply that Cramer is more influential when he presents specific kinds of arguments or discusses particular rationales for investments, demonstrating the influence of topical information content on individual economic decisions and aggregate market outcomes.
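The modeling pipeline sketched above (estimate topic proportions from transcripts with a generative text model, then relate them to next-day abnormal returns) can be illustrated with scikit-learn. The transcript snippets and returns below are toy stand-ins, not data from the study.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

# Toy stand-ins for segments of Mad Money transcripts and the next-day abnormal
# returns of the recommended stocks.
transcripts = [
    "this company has strong earnings growth and a great balance sheet",
    "the stock is beaten down and oversold buy it on weakness",
    "management execution and earnings momentum make this a buy",
    "the chart looks terrible sell this stock before it falls further",
    "a speculative play on commodity prices with huge upside",
    "solid dividend safe balance sheet a defensive name to own",
]
abnormal_returns = np.array([0.021, 0.034, 0.018, -0.027, 0.041, 0.006])

# Estimate topic proportions for each transcript segment (generative text analysis).
counts = CountVectorizer(stop_words="english").fit_transform(transcripts)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_shares = lda.fit_transform(counts)        # rows sum to 1: P(topic | segment)

# Regress next-day abnormal returns on topic shares to see which kinds of
# arguments are associated with larger price moves.
reg = LinearRegression().fit(topic_shares, abnormal_returns)
print("per-topic effect on next-day abnormal return:", np.round(reg.coef_, 4))
print("R^2 on this toy sample:", round(reg.score(topic_shares, abnormal_returns), 3))
```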


Link to publication page

The Economic Impact of User-Generated Content on the Internet: Combining Text Mining with Demand Estimation in the Hotel Industry

Workshop
  Anindya Ghose, Panagiotis G. Ipeirotis, and Beibei Li

 Venue: Proceedings of the 20th Workshop on Information Systems and Economics (WISE), 2009

Sep 2009 | Refereed

Increasingly, user-generated product reviews, images and tags serve as a valuable source of information for customers making product choices online. An extant stream of work has looked at the economic impact of reviews. Typically, the impact of product reviews has been incorporated by numeric variables representing the valence and volume of reviews. In this paper, we posit that the information embedded in product reviews cannot be fully captured by a single scalar value. Rather, we argue that product reviews are multifaceted and, hence, the textual content of product reviews is an important determinant of consumers’ choices, over and above the valence and volume of reviews. Based on a unique dataset of hotel reservations available to us from Travelocity, we estimate demand for hotels using a two-step random coefficient based structural model. We use text mining techniques that allow us to incorporate textual information from user reviews in demand estimation models, by inferring the sentiments embedded in them, and supplement them with image classification techniques. The dataset contains complete information on transactions conducted over a 3-month period, from November to January 2009, for hotels in the US. We have data on user-generated content from three sources: (i) user-generated hotel reviews from two well-known travel search engines, Travelocity and Tripadvisor, (ii) tags generated by users identifying different locational attributes of hotels from Geonames.org, and (iii) user-contributed opinions on the most important hotel characteristics from Amazon Mechanical Turk. Moreover, since some location-based characteristics, such as proximity to the beach, are not directly measurable based on UGC, we use image classification techniques to infer such features from the satellite images of the area. These different data sources are then merged to create one comprehensive dataset that enables us to estimate the weight that consumers place on different hotel characteristics. We then propose to design a new hotel ranking and recommendation system based on the empirical estimates of consumer surplus from hotel transactions. By improving the recommendation strategy of travel search engines, such a system can raise the conversion rate for a particular hotel, hence increasing the return on investment for travel search engines.


Link to publication page

Modeling Volatility in Prediction Markets

Conference
  Nikolay Archak and Panagiotis G. Ipeirotis

 Venue: Proceedings for the 10th ACM Conference on Electronic Commerce (EC), 2009

Jul 2009 | Refereed | 40/158 = 25% accepted

There is significant experimental evidence that prediction markets are efficient mechanisms for aggregating information and are more accurate in forecasting events than traditional forecasting methods, such as polls. Interpretation of prediction market prices as probabilities has been extensively studied in the literature. However there is little research on the volatility of prediction market prices. Given that volatility is fundamental in estimating significance of price movements, it is important to have a better understanding of the volatility of the contract prices.

This paper presents a model of a prediction market with binary payoff on a competitive event involving two parties. In our model, each party has a latent underlying “ability” process that describes its ability to win and evolves as an Ito diffusion. We show that, if the prediction market for this event is efficient and unbiased, the price of the corresponding contract also follows a diffusion and its instantaneous volatility is a function of the current contract price and its time to expiration. In the experimental section, we validate our model on a set of InTrade prediction markets and show that our model is consistent with the observed volatility of contract returns. Our model also outperforms existing volatility models in predicting future contract volatility from historical price data. To demonstrate the practical value of our model, we apply it to pricing options on prediction market contracts, such as those recently introduced by InTrade. Other potential applications of this model include detection of significant market moves and improving forecast standard errors.
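A minimal simulation of this setup, with each party's latent ability taken to be a Brownian motion (a simple special case of an Ito diffusion), illustrates the main qualitative result: the one-step volatility of the contract price depends on the current price level and on the time remaining to expiration. All parameters below are illustrative assumptions, and the sketch does not reproduce the paper's closed-form volatility expression.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T, dt, n_paths, sigma = 1.0, 1.0 / 250, 50_000, 1.0
sig_d = np.sqrt(2.0) * sigma            # volatility of the *difference* of the two abilities

def contract_price(d, tte):
    """Binary-contract price = P(party A beats party B at expiration | current gap d)."""
    return norm.cdf(d / (sig_d * np.sqrt(tte)))

for tte in (0.75, 0.25, 0.05):           # time remaining until expiration
    t = T - tte
    # Simulate the ability gap up to time t, then one more step of length dt.
    d_now = rng.normal(0.0, sig_d * np.sqrt(t), size=n_paths)
    d_next = d_now + rng.normal(0.0, sig_d * np.sqrt(dt), size=n_paths)
    p_now = contract_price(d_now, tte)
    p_next = contract_price(d_next, tte - dt)
    dp = np.abs(p_next - p_now)
    for lo, hi in ((0.45, 0.55), (0.80, 0.90)):
        mask = (p_now >= lo) & (p_now < hi)
        print(f"tte={tte:.2f}, price in [{lo:.2f},{hi:.2f}): mean |dP| = {dp[mask].mean():.4f}")
```

Contracts trading near 0.5 close to expiration show the largest price moves, while contracts near the extremes move little, which is the kind of price- and horizon-dependence the model formalizes.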


Link to publication page

A Quality-Aware Optimizer for Information Extraction

Journal
  Alpa Jain and Panagiotis G. Ipeirotis

 Venue: ACM Transactions on Database Systems (TODS), Vol. 34, No. 1, April 2009

Apr 2009 | Refereed

A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and show how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts accurately the output quality and selects the fastest execution plan that satisfies the output quality restrictions.
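The ROC-based selection of extraction parameters can be sketched with scikit-learn: compute the ROC curve of the extractor's confidence scores against gold judgments on a sample, then pick the operating point that satisfies a quality constraint. The scores, labels, and false-positive-rate cap below are illustrative, and the confidence threshold stands in for whatever tuning knobs a real extraction system exposes.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Gold judgments for a sample of extracted tuples (1 = correct tuple) and the
# extraction system's confidence score for each.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0])
scores = np.array([0.91, 0.85, 0.72, 0.88, 0.35, 0.66, 0.58, 0.20,
                   0.95, 0.79, 0.44, 0.61, 0.70, 0.30, 0.83, 0.50])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Tune the "knob" (the confidence threshold) in a precision-oriented way: take the
# largest recall achievable while keeping the false-positive rate under a cap.
MAX_FPR = 0.25
ok = fpr <= MAX_FPR
best = np.argmax(tpr[ok])
print(f"threshold={thresholds[ok][best]:.2f}  "
      f"TPR={tpr[ok][best]:.2f}  FPR={fpr[ok][best]:.2f}")
```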


Link to publication page

Join Optimization of Information Extraction Output: Quality Matters!

Conference
  Alpa Jain, Panagiotis G. Ipeirotis, An-Hai Doan, and Luis Gravano

 Venue: Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE), 2009

Mar 2009 | Refereed | 93/554 = 16.7% accepted

Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors, and predicts the output quality-and, of course, the execution time-of the alternate execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.


Link to publication page

The Dimensions of Reputation in Electronic Markets

Working paper
  Anindya Ghose, Panagiotis G. Ipeirotis, and Arun Sundararajan

 Venue: NYU Center for Digital Economy Research Working Paper No. CeDER-06-02

Mar 2009 | Unrefereed

In this paper, we analyze how different dimensions of a seller’s reputation affect pricing power in electronic markets. Given the interplay between buyers’ trust and sellers’ pricing power, we use text mining techniques to identify and structure dimensions of importance from feedback posted on reputation systems. By aggregating and scoring these dimensions based on the sentiment they contain, we use them to estimate a series of econometric models associating reputation with price premiums. We find that different dimensions do indeed affect pricing power differentially, and that a negative reputation hurts more than a positive one helps on some dimensions but not on others. We provide evidence that sellers of identical products in electronic markets differentiate themselves based on a distinguishing dimension of strength, and that buyers vary in the relative importance they place on different fulfillment characteristics. We highlight the importance of textual reputation feedback further by demonstrating that it substantially improves the performance of a classifier we have trained to predict future sales. Our results also suggest that online sellers distinguish themselves on specific and varying fulfillment characteristics that resemble the unique selling points highlighted by successful brands. We conclude by providing explicit examples of IT artifacts (buyer and seller tools) that use our interdisciplinary approach to enhance buyer trust and seller efficiency in online environments. This paper is the first study that integrates econometric, text mining and predictive modeling techniques toward a more complete analysis of the information captured by reputation systems, and it presents new evidence of the importance of their effective and judicious design in online markets.


Link to publication page

The EconoMining Project at NYU: Studying the Economic Value of User-Generated Content on the Internet

Journal
  Anindya Ghose and Panagiotis G. Ipeirotis

 Venue: Journal of Revenue and Pricing Management, Vol. 8, No 2-3, March 2009

Mar 2009 | Refereed

An important use of the internet today is in providing a platform for consumers to disseminate information about products and services they buy, and share experiences about the merchants with whom they transact. Increasingly, online markets develop into social shopping channels, and facilitate the creation of online communities and social networks. To date, businesses, government organisations and customers have not fully incorporated such information in their decision-making and policy-formulation processes, either because the potential value of this intellectual capital or the appropriate techniques for measuring that value have not been identified. Increasingly, though, this publicly available digital content has concrete economic value that is often hidden beneath the surface. For example, online product reviews affect the buying behaviour of customers, as well as the volume of sales, positively or negatively. Similarly, user feedback on sites such as eBay and Amazon affects the reputation of online merchants and, in turn, their ability to sell their products and services. Our research on the EconoMining project studies the economic value of user-generated content in such online settings. In our research program, we combine established techniques from economics and marketing with text mining algorithms from computer science to measure the economic value of each text snippet. In this paper, we describe the foundational blocks of our techniques, and demonstrate the value of user-generated content in a variety of areas, including reputation systems and online product reviews, and we present examples of how such areas are of immediate importance to the travel industry.


Link to publication page

Searching Digital Libraries

Misc Presentation
  Panagiotis G. Ipeirotis

 Venue: Encyclopedia of Database Systems, Part 19, Springer, 2009

Jan 2009 | Invited

Searching digital libraries refers to searching and retrieving information from remote databases of digitized or digital objects. These databases may hold either the metadata for an object of interest (e.g., author and title), or a complete object such as a book or a video.


Link to publication page

Query by Document

Conference
  Yin Yang, Nilesh Bansal, Wisam Dakka, Panagiotis G. Ipeirotis, Nick Koudas, and Dimitris Papadias

 Venue: Proceedings of the Second ACM International Conference on Web Search and Data Mining (WSDM), 2009

Feb 2009 | Refereed | 29/170 = 17% accepted

We are experiencing an unprecedented increase of content contributed by users in forums such as blogs, social networking sites and microblogging services. Such an abundance of content complements content on web sites and traditional media forums such as newspapers, news and financial streams, and so on. Given such a plethora of information, there is a pressing need to cross-reference information across textual services. For example, we commonly read a news item and wonder if there are any blogs reporting related content or vice versa.

In this paper, we present techniques to automate the process of cross referencing online information content. We introduce methodologies to extract phrases from a given “query document” to be used as queries to search interfaces with the goal to retrieve content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia and introduce an algorithm called RelevanceRank for this purpose.

We discuss both these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon’s Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating retrieval of documents related to a query document.


Link to publication page

Taxonomy Design for Faceted Search

Misc Presentation
  Wisam Dakka, Panagiotis G. Ipeirotis, and Giovanni Maria Sacco

 Venue: The Information Retrieval Series, Vol. 25, 175-213, Springer, 2009

Jan 2009 | Invited

This chapter discusses the design of taxonomies to be used in dynamic taxonomy systems. Although the only actual requirement of dynamic taxonomies is a multidimensional classification, an organization by facets is normally used. The first section provides guidelines for the design of DT taxonomies, which include the automatic construction from structured data, and the retrofitting of traditional monodimensional taxonomies. The second section shows how a faceted taxonomy can be automatically extracted from the infobase itself when objects are textual or are described by textual captions or tags.


Link to publication page

Building Query Optimizers for Information Extraction: The SQoUT Project

Journal
  Alpa Jain, Panagiotis G. Ipeirotis, and Luis Gravano

 Venue: SIGMOD Record, Special Issue on "Managing Information Extraction," Vol. 37, No. 4, December 2008

Dec 2008 | Invited

Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT1 project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.


Link to publication page

Answering General Time Sensitive Queries

Workshop
  Wisam Dakka, Luis Gravano, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 2008 ACM CIKM International Conference on Information and Knowledge Management (CIKM), 2008

Oct 2008 | Refereed

Time is an important dimension of relevance for a large number of searches, such as over blogs and news archives. So far, research on searching over such collections has largely focused on locating topically similar documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the documents in a news archive is important and should be considered in conjunction with the topic similarity to derive the final document ranking. Earlier work has focused on improving retrieval for “recency” queries that target recent documents. We propose a more general framework for handling time-sensitive queries and we automatically identify the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that seamlessly integrate the temporal aspect into the overall ranking mechanism. We extensively evaluated our techniques using a variety of news article data sets, including TREC data as well as real web data analyzed using the Amazon Mechanical Turk. We examined several alternatives for detecting the important time intervals for a query over a news archive and for incorporating this information in the retrieval process. Our techniques are robust and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
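The scoring idea, blending topical similarity with a temporal prior over important intervals, can be sketched as follows. The articles, the single interval, and the mixing weight are illustrative; in the paper the important intervals are identified automatically from the archive.

```python
from datetime import date

# Toy articles: (id, publication date, topical similarity to the query).
articles = [
    ("a1", date(2004, 12, 27), 0.62),
    ("a2", date(2005, 6, 10),  0.71),
    ("a3", date(2004, 12, 30), 0.55),
    ("a4", date(2003, 2, 2),   0.80),
]

# Important time intervals for the (time-sensitive) query, with weights.
intervals = [((date(2004, 12, 26), date(2005, 1, 31)), 1.0)]

def temporal_score(d):
    """Weight of the most important interval containing the date, 0 if none."""
    return max((w for (lo, hi), w in intervals if lo <= d <= hi), default=0.0)

LAMBDA = 0.6   # illustrative mixing weight between topical and temporal relevance

def score(article):
    _, pub_date, topical = article
    return LAMBDA * topical + (1 - LAMBDA) * temporal_score(pub_date)

# Documents inside the important interval are promoted over merely topical ones.
for art in sorted(articles, key=score, reverse=True):
    print(art[0], round(score(art), 3))
```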


Link to publication page

Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

Conference
  Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Fourteenth ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2008

Aug 2008 | Refereed | 50/~500 < 10% accepted; Best Paper Award Runner Up

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon’s Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
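The basic effect studied above, that repeated labeling can (but does not always) improve label quality, is easy to reproduce in simulation for the simplest strategy, majority voting over independent labelers of fixed quality. The simulation below is illustrative and does not implement the paper's selective repeated-labeling strategies.

```python
import numpy as np

def majority_label_quality(p_correct, n_labels, n_items=100_000, seed=0):
    """Probability that the majority vote of `n_labels` independent labelers,
    each correct with probability p_correct, recovers the true (binary) label."""
    rng = np.random.default_rng(seed)
    votes = rng.random((n_items, n_labels)) < p_correct      # True = correct vote
    return float((votes.sum(axis=1) * 2 > n_labels).mean())  # strict majority

for p in (0.55, 0.7, 0.9):
    print(f"labeler quality {p}:",
          [round(majority_label_quality(p, n), 3) for n in (1, 3, 5, 11)])
```

When labeler quality is close to 0.5, the gains from extra labels accrue slowly, which is the regime where selective acquisition of additional labels matters most.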


Link to publication page

Stay Elsewhere? Improving Local Search for Hotels Using Econometric Modeling and Image Classification

Workshop
  Beibei Li, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Sixth International Workshop on the Web and Databases (WebDB), 2008

Jun 2008 | Refereed | 14/30 = 46% accepted

One of the common Web searches that have a strong local component is the search for hotel accommodation. Customers try to identify hotels that satisfy particular criteria, such as service, food quality, and so on. Unfortunately, today, the travel search engines provide only rudimentary ranking facilities, typically using a single ranking criterion such as distance from city center, number of stars, price per night, or, more recently, customer reviews. This approach has obvious shortcomings. First, it ignores the multidimensional preferences of the consumer and, second, it largely ignores characteristics related to the location of the hotel, for instance, proximity to the beach or proximity to a downtown shopping area. These location-based features represent important characteristics that influence the desirability of a particular hotel. However, currently there are no established metrics that can isolate the importance of the location characteristics of hotels. In our work, we use the fact that the overall desirability of the hotel is reflected in the price of the rooms; therefore, using hedonic regressions, an established technique from econometrics, we estimate the weight that consumers place on different hotel characteristics. Furthermore, since some location-based characteristics, such as proximity to the beach, are not directly measurable, we use image classification techniques to infer such features from the satellite images of the area. Our technique is validated on a unique panel dataset consisting of 9463 different hotels located in the United States, observed over a period of 5 months. The final outcome of our analysis allows us to compute the “residual value” of a hotel, which roughly corresponds to the “value for the money” of a particular hotel. By ranking the hotels using our “value for the money” approach, we generate rankings that are significantly superior to those of existing techniques. Our preliminary user studies show that users overwhelmingly favor the hotel rankings generated by our system.
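The hedonic-regression step can be sketched with statsmodels: regress log price on hotel characteristics and treat the residual as the "residual value" used for ranking. The toy data and the choice of characteristics below are illustrative; the paper additionally infers location features from satellite images.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy hotel panel: price per night and a few (location) characteristics.
df = pd.DataFrame({
    "hotel":      ["A", "B", "C", "D", "E", "F"],
    "price":      [150, 260, 95, 310, 180, 120],
    "stars":      [3, 4, 2, 5, 4, 3],
    "near_beach": [0, 1, 0, 1, 0, 1],
    "downtown":   [1, 1, 0, 1, 0, 0],
})

# Hedonic regression: log price on characteristics.  The coefficients estimate the
# implicit price consumers pay for each characteristic.
X = sm.add_constant(df[["stars", "near_beach", "downtown"]])
model = sm.OLS(np.log(df["price"]), X).fit()
print(model.params)

# "Residual value": hotels priced *below* what their characteristics predict offer
# better value for money, so rank by the most negative residual first.
df["residual"] = model.resid
print(df.sort_values("residual")[["hotel", "price", "residual"]])
```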


Link to publication page

Automatic Extraction of Useful Facet Hierarchies from Text Databases

Conference
  Wisam Dakka and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), 2008

Apr 2008 | Refereed | 75/617 = 12.1% accepted

Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Faceted interfaces represent a new powerful paradigm that proved to be a successful complement to keyword searching. Thus far, the identification of the facets was either a manual procedure or relied on a priori knowledge of the facets that can potentially appear in the underlying collection. In this paper, we present an unsupervised technique for automatic extraction of facets useful for browsing text databases. In particular, we observe, through a pilot study, that facet terms rarely appear in text documents, showing that we need external resources to identify useful facet terms. For this, we first identify important phrases in each document. Then, we expand each phrase with “context” phrases using external resources, such as WordNet and Wikipedia, causing facet terms to appear in the expanded database. Finally, we compare the term distributions in the original database and the expanded database to identify the terms that can be used to construct browsing facets. Our extensive user studies, using the Amazon Mechanical Turk service, show that our techniques produce facets with high precision and recall that are superior to existing approaches and help users locate interesting items faster.


Link to publication page

Classification-Aware Hidden-Web Text Database Selection

Journal
  Panagiotis G. Ipeirotis and Luis Gravano

 Venue: ACM Transactions on Information Systems (TOIS), Vol. 26, No. 2, March 2008

Mar 2008 | Invited

Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying. We present a novel “focused-probing” sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best categories for a query, and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. We describe how to modify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms.
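The shrinkage step can be sketched as a simple mixture of the sparse sampled summary with the category-level summary. The counts and the mixing weight below are illustrative; the paper decides adaptively, per query, whether applying shrinkage is beneficial.

```python
from collections import Counter

# Word counts from a small document sample of one hidden-web database, and the
# aggregate counts for its topical category (both illustrative).
sample_counts = Counter({"cancer": 40, "therapy": 25, "clinical": 15, "trial": 10})
category_counts = Counter({"cancer": 900, "therapy": 650, "clinical": 500,
                           "trial": 420, "metastasis": 180, "oncology": 310})

def shrunk_summary(sample, category, lam=0.7):
    """Shrinkage-smoothed content summary: mix the sparse sample estimate with the
    category-level estimate so that unseen-but-topical words get nonzero mass."""
    n_s, n_c = sum(sample.values()), sum(category.values())
    vocab = set(sample) | set(category)
    return {w: lam * sample[w] / n_s + (1 - lam) * category[w] / n_c for w in vocab}

summary = shrunk_summary(sample_counts, category_counts)
for w in ("cancer", "metastasis"):       # "metastasis" never appeared in the sample
    print(w, round(summary[w], 4))
```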


Link to publication page

Detecting Important Events Using Prediction Markets, Text Mining, and Volatility Modeling

Workshop
  George Tziralis and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Third Workshop on Prediction Markets, 2008

Jan 2008 | Refereed

The Impact of Information Disclosure on Stock Market Returns: The Sarbanes-Oxley Act and the Role of Media as an Information Intermediary

Workshop
  Karthik Balakrishnan, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Seventh Workshop on the Economics of Information Security (WEIS), 2008

Jan 2008 | Refereed

The Sarbanes-Oxley (SOX) Act of 2002 is one of the, if not the, most important pieces of legislation affecting corporations traded on the U.S. stock exchanges. While SOX does not explicitly address the issue of information security, the definition of internal control provided by the SEC, combined with the fact that the reporting systems in all firms required to comply with SOX are based on systems that promote information security and integrity does imply that more focus on information security is a necessary compliance requirement. Using a dataset on stock market abnormal returns that runs from the period 2000-2006 and consists of 300 firms, we aim to examine how the stock market reaction varies for 8-K filings and news media releases, and how this reaction has changed since the passage of the SOX Act. We hypothesize that the greater timeliness of the 8-K filings induced by SOX increases and accelerates the quality of their information disclosure and dissemination in the market. Further, we classify news articles into press- and firm-initiated articles and hypothesize that the press-initiated coverage of material events has increased in the post-SOX period. We find that the effect of firm-initiated media coverage had significant negative impact relative to press-initiated coverage on the measures of informativeness suggesting that media played a significant role during the scandal-ridden periods when the firms had poor information environment between 2002 and 2004. We also find that the timeliness of release of media articles determines the level of informativeness, suggesting that media is an information intermediary and its role acts as a substitute to the firm’s existing information disclosure environment.


Link to publication page

Towards a Query Optimizer for Text-Centric Tasks

Journal
  Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano

 Venue: ACM Transactions on Database Systems (TODS), Vol. 32, No. 4, November 2007

Nov 2007 | Refereed

Text is ubiquitous and, not surprisingly, many applications rely on textual data for a variety of tasks. For example, information extraction applications retrieve documents and extract structured relations from the unstructured text in the documents. Reputation management systems download Web pages to track the “buzz” around companies and products. Comparative shopping agents locate e-commerce Web sites and add the products offered in the pages to their own index.

To process a text-centric task over a text database (or the Web), we can retrieve the relevant database documents in different ways. One approach is to scan or crawl the database to retrieve its documents and process them as required by the task. While such an approach guarantees that we cover all documents that are potentially relevant for the task, this method might be unnecessarily expensive in terms of execution time. For example, consider the task of extracting information on disease outbreaks (e.g., the name of the disease, the location and date of the outbreak, and the number of affected people) as reported in news articles. This task does not require that we scan and process, say, the articles about sports in a newspaper archive. In fact, only a small fraction of the archive is of relevance to the task. For tasks such as this one, a natural alternative to crawling is to exploit a search engine index on the database to retrieve, via careful querying, the useful documents. In our example, we can use keywords that are strongly associated with disease outbreaks (e.g., World Health Organization, case fatality rate) and turn these keywords into queries to find news articles that are appropriate for the task.

The choice between a crawl- and a query-based execution strategy for a text-centric task is analogous to the choice between a scan- and an index-based execution plan for a selection query over a relation. Just as in the relational model, the choice of execution strategy can substantially affect the execution time of the task. In contrast to the relational world, however, this choice might also affect the quality of the output that is produced: while a crawl-based execution of a text-centric task guarantees that all documents are processed, a query-based execution might miss some relevant documents, hence producing potentially incomplete output, with less-than-perfect recall. The choice between crawl- and query-based execution plans can then have a substantial impact on both execution time and output recall. Nevertheless, this important choice is typically left to simplistic heuristics or plain intuition.

In this article, we introduce fundamental building blocks for the optimization of text-centric tasks. Towards this goal, we show how to rigorously analyze query- and crawl-based plans for a task in terms of both execution time and output recall. To analyze crawl-based plans, we apply techniques from statistics to model crawling as a document sampling process; to analyze query-based plans, we first abstract the querying process as a random walk on a querying graph, and then apply results from the theory of random graphs to discover relevant properties of the querying process. Our cost model reflects the fact that the performance of the execution plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model.
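To make the crawl-versus-query trade-off concrete, the sketch below compares a crawl-based plan against a query-based plan under a deliberately simplified cost model; the per-document and per-query costs, the documents-per-query estimate, and the recall target are hypothetical parameters, not values or formulas from the paper.

```python
# Minimal sketch (not the paper's cost model): compare a crawl-based plan,
# which processes every document, against a query-based plan that issues
# queries until an estimated recall target is met. All parameters are
# hypothetical and would normally be estimated from the database itself.

def crawl_cost(num_docs, t_retrieve, t_process):
    """Crawl plan: retrieve and process every document in the database."""
    return num_docs * (t_retrieve + t_process)

def query_cost(num_useful_docs, docs_per_query, useful_fraction,
               target_recall, t_query, t_retrieve, t_process):
    """Query plan: issue queries until the expected number of useful
    documents retrieved reaches target_recall * num_useful_docs."""
    needed_useful = target_recall * num_useful_docs
    useful_per_query = docs_per_query * useful_fraction
    num_queries = needed_useful / max(useful_per_query, 1e-9)
    retrieved = num_queries * docs_per_query
    return num_queries * t_query + retrieved * (t_retrieve + t_process)

if __name__ == "__main__":
    # Hypothetical database: 1M documents, of which only 10,000 are relevant.
    c = crawl_cost(num_docs=1_000_000, t_retrieve=0.05, t_process=0.2)
    q = query_cost(num_useful_docs=10_000, docs_per_query=50,
                   useful_fraction=0.3, target_recall=0.9,
                   t_query=1.0, t_retrieve=0.05, t_process=0.2)
    print(f"crawl plan: {c:,.0f}s  query plan: {q:,.0f}s")
```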


Link to publication page

Designing Novel Review Ranking Systems: Predicting Usefulness and Impact of Reviews

Conference
  Anindya Ghose and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Ninth International Conference on Electronic Commerce (ICEC), 2007

Aug 2007 | Refereed

With the rapid growth of the Internet, users’ ability to publish content has created active electronic communities that provide a wealth of product information. Consumers naturally gravitate to reading reviews in order to decide whether to buy a product. However, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. Similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. In this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. Our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. We show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. Our results can have several implications for the market design of online opinion forums.
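As a loose illustration of the consumer-oriented side of such a ranking mechanism, the sketch below scores reviews by a predicted-helpfulness regression over simple textual features; the feature set, the tiny training data, and the least-squares model are illustrative assumptions rather than the econometric specification used in the paper.

```python
# Minimal sketch (illustrative, not the paper's model): rank reviews by
# predicted helpfulness using a least-squares fit over simple text features,
# including a crude "subjectivity" proxy. Data and features are toy values.

import numpy as np

SUBJECTIVE_WORDS = {"love", "hate", "amazing", "terrible", "awful", "great"}

def features(review_text):
    words = review_text.lower().split()
    n = max(len(words), 1)
    subjectivity = sum(w.strip(".,!") in SUBJECTIVE_WORDS for w in words) / n
    return np.array([1.0, len(words), subjectivity])  # intercept, length, subjectivity

# Toy training set: (review text, observed helpfulness ratio).
train = [
    ("I love this camera, amazing pictures", 0.4),
    ("Battery lasted 2 hours in continuous video mode, lens is sharp at f/2.8", 0.9),
    ("terrible, awful, hate it", 0.2),
    ("The zoom covers 24-120mm and the autofocus locks quickly in low light", 0.85),
]
X = np.vstack([features(t) for t, _ in train])
y = np.array([h for _, h in train])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rank new reviews by predicted helpfulness (highest first).
new_reviews = ["great great great", "Shutter lag is about 0.3s and RAW files are 20MB"]
ranked = sorted(new_reviews, key=lambda r: -float(features(r) @ beta))
print(ranked)
```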


Link to publication page

Show me the Money! Deriving the Pricing Power of Product Features by Mining Consumer Reviews

Conference
  Nikolay Archak, Anindya Ghose, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the Thirteenth ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2007

Aug 2007 | Refereed | ~100/513 < 20% accepted

The increasing pervasiveness of the Internet has dramatically changed the way that consumers shop for goods. Consumer-generated product reviews have become a valuable source of information for customers, who read the reviews and decide whether to buy the product based on the information provided. In this paper, we use techniques that decompose the reviews into segments that evaluate the individual characteristics of a product (e.g., image quality and battery life for a digital camera). Then, as a major contribution of this paper, we adapt methods from the econometrics literature, specifically the hedonic regression concept, to estimate: (a) the weight that customers place on each individual product feature, (b) the implicit evaluation score that customers assign to each feature, and (c) how these evaluations affect the revenue for a given product. Towards this goal, we develop a novel hybrid technique combining text mining and econometrics that models consumer product reviews as elements in a tensor product of feature and evaluation spaces. We then impute the quantitative impact of consumer reviews on product demand as a linear functional from this tensor product space. We demonstrate how to use a low-dimension approximation of this functional to significantly reduce the number of model parameters, while still providing good experimental results. We evaluate our technique using a data set from Amazon.com consisting of sales data and the related consumer reviews posted over a 15-month period for 242 products. Our experimental evaluation shows that we can extract actionable business intelligence from the data and better understand the customer preferences and actions. We also show that the textual portion of the reviews can improve product sales prediction compared to a baseline technique that simply relies on numeric data.
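To make the hedonic-regression idea concrete, a stylized version of such a model (a generic form, not the exact specification estimated in the paper) relates a product's demand to the implicit evaluations of its features:

```latex
% Stylized hedonic-style demand equation: the (log) demand proxy of product j
% at time t is modeled as a weighted sum of the evaluation scores that the
% reviews assign to each product feature f, plus controls such as price.
% The weights w_f capture how much customers value each feature; x_{jft} is
% the review-derived evaluation of feature f. Generic illustration only.
\begin{equation}
\log(\mathit{demand}_{jt}) \;=\; \alpha
  \;+\; \sum_{f \in F} w_f \, x_{jft}
  \;+\; \gamma \, \log(\mathit{price}_{jt})
  \;+\; \varepsilon_{jt}
\end{equation}
```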


Link to publication page

Modeling and Managing Changes in Text Databases

Journal
  Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano

 Venue: ACM Transactions on Database Systems (TODS), Vol. 32, No. 3, August 2007

Aug 2007 | Refereed

Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not evolve over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Cox’s proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
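For context, the proportional-hazards model referenced above has the standard form below, where the "event" is a content change large enough to warrant a summary update; the covariate list in the comment is illustrative, not the paper's feature set.

```latex
% Cox proportional-hazards model: h_0(t) is the baseline hazard of a
% database's content summary becoming stale by time t, and x_i collects
% database-specific covariates (e.g., topic, size, observed change history;
% these example covariates are illustrative assumptions).
\begin{equation}
h(t \mid x_i) \;=\; h_0(t)\, \exp\!\left(\beta^{\top} x_i\right)
\end{equation}
```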


Link to publication page

Opinion Mining Using Econometrics: A Case Study on Reputation Systems

Conference
  Anindya Ghose, Panagiotis G. Ipeirotis, and Arun Sundararajan

 Venue: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), 2007

Jun 2007 | Refereed | 132/588 = 22% accepted

Deriving the polarity and strength of opinions is an important research topic, attracting significant attention over the last few years. In this work, to measure the strength and polarity of an opinion, we consider the economic context in which the opinion is evaluated, instead of using human annotators or linguistic resources. We rely on the fact that text in on-line systems influences the behavior of humans and this effect can be observed using some easy-to-measure economic variables, such as revenues or product prices. By reversing the logic, we infer the semantic orientation and strength of an opinion by tracing the changes in the associated economic variable. In effect, we use econometrics to identify the “economic value of text” and assign a “dollar value” to each opinion phrase, measuring sentiment effectively and without the need for manual labeling. We argue that by interpreting opinions using econometrics, we have the first objective, quantifiable, and context sensitive evaluation of opinions. We make the discussion concrete by presenting results on the reputation system of Amazon.com. We show that user feedback affects the pricing power of merchants and by measuring their pricing power we can infer the polarity and strength of the underlying feedback postings.


Link to publication page

Faceted Browsing over Large Databases of Text-Annotated Objects

Workshop
  Wisam Dakka, Panagiotis G. Ipeirotis, and Kenneth R. Wood

 Venue: Proceedings of the 23rd IEEE International Conference on Data Engineering, Demonstrations (ICDE), 2007

Apr 2007 | Refereed | 28/73 = 38% accepted

We demonstrate a fully working system for multifaceted browsing over large collections of text-annotated data, such as annotated images, that are stored in relational databases. Typically, such databases can be browsed across multiple facets (by topic, genre, location, and so on), and previous user studies showed that multifaceted interfaces substantially improve the ability of users to identify items of interest in the database. We demonstrate a scalable system that automatically generates multifaceted browsing hierarchies on top of a relational database that stores the underlying text-annotated objects. Our system supports a wide range of ranking alternatives for selecting and displaying the best facets and the best portions of the generated hierarchies, to facilitate browsing. We combine our ranking schemes with Rapid Serial Visual Presentation (RSVP), an advanced visualization technique that further enhances the browsing experience, and we demonstrate how to use prefetching techniques to overcome the latency issues that are inherent in browsing the contents of a relational database through multifaceted interfaces.


Link to publication page

Duplicate Record Detection: A Survey

Journal
  Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios

 Venue: IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 19, No. 1, January 2007

Jan 2007 | Refereed

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
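As a small, self-contained example of the field-matching step surveyed above, the sketch below flags likely duplicate records by combining a normalized edit distance on names with token-set overlap on addresses; the thresholds and weights are arbitrary illustrative choices, not recommendations from the survey.

```python
# Minimal sketch of field-level matching for duplicate record detection:
# normalized Levenshtein similarity for names plus Jaccard token overlap
# for addresses. Thresholds and weights are arbitrary illustrative choices.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def name_sim(a, b):
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def address_sim(a, b):
    ta = {t.strip(".,") for t in a.lower().split()}
    tb = {t.strip(".,") for t in b.lower().split()}
    return len(ta & tb) / max(len(ta | tb), 1)

def is_duplicate(rec1, rec2, threshold=0.75):
    score = 0.6 * name_sim(rec1["name"], rec2["name"]) + \
            0.4 * address_sim(rec1["address"], rec2["address"])
    return score >= threshold, score

r1 = {"name": "ACME Incorporated", "address": "125 W 42nd Street, New York, NY"}
r2 = {"name": "ACME Incorprated",  "address": "125 W 42nd Street New York NY"}
print(is_duplicate(r1, r2))  # -> (True, ~0.96): a typo-induced duplicate
```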


Link to publication page

Automatic Discovery of Useful Facet Terms

Workshop
  Wisam Dakka, Rishabh Dayal, and Panagiotis G. Ipeirotis

 Venue: Proceedings of the ACM SIGIR 2006 Workshop on Faceted Search, 2006

Aug 2006 | Refereed

Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Faceted interfaces represent a powerful new paradigm which has been proven to be a successful complement to keyword searching. Thus far, the generation of faceted interfaces relied either on manual identification of the facets, or on a priori knowledge of the facets that can potentially appear in the underlying database. In this paper, we present our ongoing research towards automatic identification of facets that can be used to browse a collection of free-text documents. We present some preliminary results on building facets on top of a news archive. The results are promising and suggest directions for future research.


Link to publication page

To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks

Conference
  Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano

 Venue: Proceedings of the 2006 ACM International Conference on Management of Data (SIGMOD), 2006

Jun 2006 | Refereed | 58/446=13% accepted; Best Paper Award

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.


Link to publication page

Designing Ranking Systems for Consumer Reviews: The Impact of Review Subjectivity on Product Sales and Review Quality

Workshop
  Anindya Ghose and Panagiotis G. Ipeirotis

 Venue: Proceedings of the 2006 Workshop on Information Technology and Systems (WITS), 2006

Jan 2006 | Refereed | 36/105 = 35% accepted

With the rapid growth of the Internet, users’ ability to publish content has created active electronic communities that provide a wealth of product information. Consumers naturally gravitate to reading reviews in order to decide whether to buy a product. However, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. Similarly, the manufacturer of a product wants to identify the reviews that influence the customer base, and examine the content of these reviews. In this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. Our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. We show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. Our results can have several implications for the market design of online opinion forums.


Link to publication page

Automatic Construction of Multifaceted Browsing Interfaces

Conference
  Wisam Dakka, Panagiotis G. Ipeirotis, and Kenneth R. Wood

 Venue: Proceedings of the 2005 ACM International Conference on Information and Knowledge Management (CIKM), 2005

Oct 2005 | Refereed | 76/425 = 18% accepted

Databases of text and text-annotated data constitute a significant fraction of the information available in electronic form. Searching and browsing are the typical ways that users locate items of interest in such databases. Interfaces that use multifaceted hierarchies represent a powerful new browsing paradigm which has been proven to be a successful complement to keyword searching. Thus far, multifaceted hierarchies have been created manually or semi-automatically, making it difficult to deploy multifaceted interfaces over a large number of databases. We present automatic and scalable methods for the creation of multifaceted interfaces. Our methods are integrated with traditional relational databases and can scale well for large databases. Furthermore, we present methods for selecting the best portions of the generated hierarchies when the screen space is not sufficient for displaying the whole hierarchy at once. We apply our technique to a range of large data sets, including annotated images, television programming schedules, and web pages. The results are promising and suggest directions for future research.


Link to publication page

Reputation Premiums in Electronic Peer-to-Peer Markets: Analyzing Textual Feedback and Network Structure

Workshop
  Anindya Ghose, Panagiotis G. Ipeirotis, and Arun Sundararajan

 Venue: Proceedings of the ACM SIGCOMM 2005 Third Workshop on Economics of Peer-to-Peer Systems, (P2PEcon), 2005

Aug 2005 | Refereed

Web-based systems that establish reputation are central to the viability of many electronic markets. We present theory that identifies the different dimensions of online reputation and characterizes their influence on the pricing power of sellers. We provide evidence that existing, numeric reputation scores conceal important seller-specific dimensions of reputation and we validate our theory further by proposing a new text mining technique that identifies and quantitatively evaluates further dimensions of importance in reputation profiles. We also suggest that the buyer-seller network contains critical reputation information that we can further exploit to improve the design of a reputation mechanism. Our experimental evaluation validates the predictions of our model using a new data set containing over 12,000 transactions for consumer software on Amazon.com’s online secondary marketplace. This paper is the first attempt to integrate econometric methods and text and link mining techniques towards a more complete analysis of the information captured by reputation systems, and it presents new evidence of the importance of their effective and judicious design.


Link to publication page

Modeling and Managing Content Changes in Text Databases

Conference
  Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano

 Venue: Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE), 2005

Apr 2005 | Refereed | 67/521 = 13% accepted; Best Paper Award

Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this paper, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Cox’s proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.


Link to publication page

When one Sample is not Enough: Improving Text Database Selection Using Shrinkage

Conference
  Panagiotis G. Ipeirotis and Luis Gravano

 Venue: Proceedings of the 2004 ACM International Conference on Management of Data (SIGMOD), 2004

Jun 2004 | Refereed | 69/431 = 16% accepted

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf’s law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of “shrinkage” (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their “unshrunk” counterparts. We also describe how to modify existing database selection algorithms to adaptively decide (at run-time) whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated “relevance judgments,” show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.


Link to publication page

Classifying and Searching Hidden-Web Text Databases

Misc Presentation
  Panagiotis G. Ipeirotis

 Venue: Ph.D. Dissertation, Columbia University (advisor: L. Gravano), September 2004

Jan 2004 | Unrefereed

The World-Wide Web continues to grow rapidly, which makes exploiting all available information a challenge. Search engines such as Google index an unprecedented amount of information, but still do not provide access to valuable content in text databases “hidden” behind search interfaces. For example, current search engines largely ignore the contents of the Library of Congress, the US Patent and Trademark database, newspaper archives, and many other valuable sources of information because their contents are not “crawlable.” However, users should be able to find the information that they need with as little effort as possible, regardless of whether this information is crawlable or not. As a significant step towards this goal, we have designed algorithms that support browsing and searching (the two dominant ways of finding information on the web) over “hidden-web” text databases.

To support browsing, we have developed QProber, a system that automatically categorizes hidden-web text databases in a classification scheme, according to their topical focus. QProber categorizes databases without retrieving any document. Instead, QProber uses just the number of matches generated from a small number of topically focused query probes. The query probes are automatically generated using state-of-the-art supervised machine learning techniques and are typically short. QProber’s classification approach is sometimes orders of magnitude faster than approaches that require document retrieval.

To support searching, we have developed crucial building blocks for constructing sophisticated metasearchers, which search over many text databases at once through a unified query interface. For scalability and effectiveness, it is crucial for a metasearcher to have a good database selection component and send queries only to databases with relevant content. Usually, database selection algorithms rely on statistics that characterize the contents of each database.

Unfortunately, many hidden-web text databases are completely autonomous and do not report any summaries of their contents. To build content summaries for such databases, we extract a small, topically focused document sample from each database during categorization and use it to build the respective content summaries. A potential problem with content summaries derived from document samples is that any reasonably small sample will suffer from data sparseness and will not contain many words that appear in the database. To enhance the sparse samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects first the best category for a query and then sends the query to the appropriate databases in the chosen category. The second database selection algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. The shrinkage-enhanced summaries characterize the database contents better than their “unshrunk” counterparts do, and in turn help produce significantly more relevant database selection decisions and overall search results.

Content summaries of static databases do not need to change over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. To understand how real-world databases change over time and how these changes propagate to the database content summaries, we studied how the content summaries of 152 real web databases changed every week, for a period of 52 weeks. Then, we used “survival analysis” techniques to examine which parameters can help predict when the content summaries need to be updated. Based on the results of this study, we designed algorithms that analyze various characteristics of the databases and their update history to predict when the content summaries need to be modified, thus avoiding overloading the databases unnecessarily.

In summary, this thesis presents building blocks that are critical to enable access to the often valuable contents of hidden-web text databases, hopefully approaching the goal of making access to these databases as easy and efficient as over regular web pages.


Link to publication page

Modeling Query-Based Access to Text Databases

Workshop
  Eugene Agichtein, Panagiotis G. Ipeirotis, and Luis Gravano

 Venue: Proceedings of the Sixth ACM SIGMOD International Workshop on the Web and Databases (WebDB), 2003

Jun 2003 | Refereed | 25% accepted

Searchable text databases abound on the web. Applications that require access to such databases often resort to querying to extract relevant documents for two main reasons. First, some text databases on the web are not “crawlable,” and hence the only way to retrieve their documents is via querying. Second, applications often require only a small fraction of a database’s contents, so retrieving relevant documents via querying is an attractive choice from an efficiency viewpoint, even for crawlable databases. Often an application’s query-based strategy starts with a small number of user-provided queries. Then, new queries are extracted, in an application-dependent way, from the documents in the initial query results, and the process iterates. The success of this common type of strategy relies on retrieved documents “contributing” new queries. If new documents fail to produce new queries, then the process might stall before all relevant documents are retrieved. In this paper, we develop a graph-based “reachability” metric that allows us to characterize when an application’s query-based strategy will successfully “reach” all documents that the application needs. We complement our metric with an efficient sampling-based technique that accurately estimates the reachability associated with a text database and an application’s query-based strategy. We report preliminary experiments backing the usefulness of our metric and the accuracy of the associated estimation technique over real text databases and for two applications.
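As a rough illustration of the reachability idea, the sketch below builds a tiny querying graph (documents contribute the new queries they contain) and computes the fraction of relevant documents reachable from a set of seed queries via breadth-first search; the toy graph and the query-extraction rule are illustrative assumptions, not the paper's estimation technique.

```python
# Minimal sketch of a querying-graph reachability computation: starting from
# seed queries, documents retrieved by a query contribute new queries, and we
# measure which relevant documents can ever be reached. The toy data and the
# keyword-based "query extraction" are illustrative assumptions.

from collections import deque

# query -> documents it retrieves (toy inverted index)
RETRIEVES = {
    "outbreak": {"d1", "d2"},
    "ebola": {"d2", "d3"},
    "quarantine": {"d3", "d4"},
    "football": {"d5"},
}
# document -> queries extracted from it (toy extraction rule)
EXTRACTS = {
    "d1": {"ebola"},
    "d2": {"quarantine"},
    "d3": set(),
    "d4": set(),
    "d5": {"football"},
}

def reachable_documents(seed_queries):
    """BFS over the bipartite query/document graph."""
    seen_docs, frontier, seen_queries = set(), deque(seed_queries), set(seed_queries)
    while frontier:
        q = frontier.popleft()
        for doc in RETRIEVES.get(q, set()):
            if doc not in seen_docs:
                seen_docs.add(doc)
                for new_q in EXTRACTS.get(doc, set()):
                    if new_q not in seen_queries:
                        seen_queries.add(new_q)
                        frontier.append(new_q)
    return seen_docs

relevant = {"d1", "d2", "d3", "d4"}          # documents the task actually needs
reached = reachable_documents({"outbreak"})  # start from one user-provided query
print("reachability:", len(reached & relevant) / len(relevant))  # 1.0 here
```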


Link to publication page

Text Joins in an RDBMS for Web Data Integration

Conference
  Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, and Divesh Srivastava

 Venue: Proceedings of the 12th International World-Wide Web Conference (WWW), 2003

May 2003 | Refereed | 13% accepted, more than 600 submissions

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity.

In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons.

Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.
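As a toy illustration of the similarity predicate behind such a text join (computed here in plain Python for clarity, without the sampling-based approximation or the in-RDBMS execution that the paper develops), consider:

```python
# Toy illustration of the cosine-similarity predicate underlying a text join:
# tf-idf weight each string's tokens, then report pairs whose cosine
# similarity exceeds a threshold. The strings and the threshold are
# illustrative; the paper's contribution is doing this approximately,
# at scale, inside an unmodified RDBMS.

import math
from collections import Counter

R1 = ["ACME Computer Corp.", "Widget Works Inc."]
R2 = ["ACME Computer Corporation", "Widget Works Incorporated", "Gadget Labs"]

def tokenize(s):
    return s.lower().replace(".", "").split()

def tfidf_vectors(strings, corpus):
    df = Counter(tok for s in corpus for tok in set(tokenize(s)))
    n = len(corpus)
    vecs = []
    for s in strings:
        tf = Counter(tokenize(s))
        v = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

corpus = R1 + R2
V1, V2 = tfidf_vectors(R1, corpus), tfidf_vectors(R2, corpus)
for s1, u in zip(R1, V1):
    for s2, v in zip(R2, V2):
        sim = cosine(u, v)
        if sim > 0.4:  # arbitrary illustrative threshold
            print(f"{s1!r} ~ {s2!r}  (cos = {sim:.2f})")
```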


Link to publication page

Text Joins for Data Cleansing and Integration in an RDBMS

Conference
  Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, and Divesh Srivastava

 Venue: Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003

Mar 2003 | Refereed

An organization’s data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching can be effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these “text joins” are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. In this paper, we propose an approximate, sampling based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.


Link to publication page

QProber: A System for Automatic Classification of Hidden-Web Databases

Journal
  Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami

 Venue: ACM Transactions on Information Systems (TOIS), Vol. 21, No. 1, January 2003

Jan 2003 | Refereed

The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web “crawlers.” Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
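To give a flavor of the probe-and-count idea, the sketch below classifies a database by issuing a few category-specific probe queries and looking only at the reported match counts; the probe keywords, the coverage/specificity-style thresholds, and the search-interface stub are all illustrative assumptions, not QProber's learned rules or tuning.

```python
# Minimal sketch of probe-and-count classification: for each category, issue
# a handful of probe queries and record only the number of matches the
# database reports, never the documents themselves. Probes, thresholds, and
# the stub below are illustrative assumptions.

# Hypothetical probes per category (in QProber these come from trained
# document classifiers; here they are hand-picked for illustration).
PROBES = {
    "Health":    ["cancer", "diabetes", "cardiology"],
    "Sports":    ["baseball", "goalkeeper", "marathon"],
    "Computers": ["compiler", "router", "database"],
}

def match_count(query):
    """Stub standing in for a database's search interface: returns only the
    number of matches for `query`."""
    fake_index = {"cancer": 1_313_266, "diabetes": 410_000, "cardiology": 95_000,
                  "baseball": 12, "marathon": 300, "database": 8_500}
    return fake_index.get(query, 0)

def classify(total_docs, coverage_threshold=0.05, specificity_threshold=0.4):
    """Assign categories whose probes match both a large enough share of the
    database (coverage) and a large enough share of all probe matches
    (specificity). Thresholds are arbitrary illustrative values."""
    counts = {cat: sum(match_count(q) for q in probes) for cat, probes in PROBES.items()}
    grand_total = sum(counts.values()) or 1
    labels = []
    for cat, c in counts.items():
        coverage = c / total_docs
        specificity = c / grand_total
        if coverage >= coverage_threshold and specificity >= specificity_threshold:
            labels.append(cat)
    return labels or ["Root"]  # fall back to the root of the hierarchy

print(classify(total_docs=2_000_000))  # -> ['Health'] for this toy database
```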


Link to publication page

Distributed Search over the Hidden-Web: Hierarchical Database Sampling and Selection

Conference
  Panagiotis G. Ipeirotis and Luis Gravano

 Venue: Proceedings of the 28th International Conference on Very Large Databases (VLDB), 2002

Aug 2002 | Refereed | 69/432 = 16% accepted

Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative” databases by using “focused query probes”, which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries.

Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.


Link to publication page

Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with the Open Archives Initiative

Conference
  Panagiotis G. Ipeirotis, Tom Barry, and Luis Gravano

 Venue: Proceedings of the Second ACM+IEEE Joint Conference on Digital Libraries (JCDL), 2002

Jul 2002 | Refereed | 33% accepted

SDARTS is a protocol and toolkit designed to facilitate metasearching. SDARTS combines two complementary existing protocols, SDLIP and STARTS, to define a uniform interface that collections should support for searching and exporting metasearch-related metadata. SDARTS also includes a toolkit with wrappers that are easily customized to make both local and remote document collections SDARTS compliant.

This paper describes two significant ways in which we have extended the SDARTS toolkit. First, we have added a tool that automatically builds rich content summaries for remote web collections by probing the collections with appropriate queries. These content summaries can then be used by a metasearcher to select over which collections to evaluate a given query. Second, we have enhanced the SDARTS toolkit so that all SDARTS-compliant collections export their metadata under the emerging Open Archives Initiative (OAI) protocol. Conversely, the SDARTS toolkit now also allows all OAI-compliant collections to be made SDARTS-compliant with minimal effort. As a result, we implemented a bridge between SDARTS and OAI, which will facilitate easy interoperability among a potentially large number of collections. The SDARTS toolkit, with all related documentation and source code, is publicly available at http://sdarts.cs.columbia.edu.


Link to publication page

Query- vs. Crawling-Based Classification of Searchable Web Databases

Journal
  Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami

 Venue: IEEE Data Engineering Bulletin, Vol. 25, No. 1, March 2002

Mar 2002 | Invited

The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web “crawlers.” Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.


Link to publication page

Using q-grams in a DBMS for Approximate String Processing

Journal
  Luis Gravano, Panagiotis G. Ipeirotis, Hosagrahar Visvesvaraya Jagadish, Nick Koudas, Shanmugauelayut Muthukrishnan, Lauri Pietarinen, and Divesh Srivastava

 Venue: IEEE Data Engineering Bulletin, Vol. 24, No. 4, December 2001

Dec 2001 | Invited

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.


Link to publication page

Approximate String Joins in a Database (Almost) for Free

Conference
  Luis Gravano, Panagiotis G. Ipeirotis, Hosagrahar Visvesvaraya Jagadish, Nick Koudas, Shanmugauelayut Muthukrishnan, and Divesh Srivastava

 Venue: Proceedings of the 27th International Conference on Very Large Databases (VLDB), 2001

Sep 2001 | Refereed | 59/339 = 17% accepted

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data, especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs).

In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
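To illustrate the q-gram machinery informally, the sketch below generates positional q-grams and applies a simple count filter before an exact edit-distance check; the q value, the padding character, and the filter bound follow standard q-gram reasoning and are used here only as an illustration, not as a transcription of the paper's SQL-based implementation.

```python
# Minimal sketch of q-gram-based approximate matching: generate positional
# q-grams of padded strings, prune candidate pairs with a count filter, and
# verify survivors with an exact edit-distance computation. The q value and
# the distance threshold below are illustrative choices.

def qgrams(s, q=3, pad="#"):
    """Positional q-grams of s, padded with q-1 pad characters on each side."""
    padded = pad * (q - 1) + s.lower() + pad * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def approx_match(s1, s2, k=2, q=3):
    """Return True if edit_distance(s1, s2) <= k, using a q-gram count filter
    (strings within edit distance k share at least
     max(|s1|, |s2|) - 1 - (k - 1) * q q-grams) before the exact check."""
    # The count filter only needs the grams themselves; the positions would
    # feed a stricter position filter, omitted in this sketch.
    g1 = [g for _, g in qgrams(s1, q)]
    g2 = [g for _, g in qgrams(s2, q)]
    shared = sum(min(g1.count(g), g2.count(g)) for g in set(g1))
    lower_bound = max(len(s1), len(s2)) - 1 - (k - 1) * q
    if shared < lower_bound:
        return False  # pruned without computing the edit distance
    return edit_distance(s1.lower(), s2.lower()) <= k

print(approx_match("Srivastava", "Shrivastava"))  # True: one insertion
print(approx_match("Srivastava", "Smith"))        # False: pruned by the filter
```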


Link to publication page

SDLIP + STARTS = SDARTS. A Protocol and Toolkit for Metasearching

Conference
  Noah Green, Panagiotis G. Ipeirotis, and Luis Gravano

 Venue: Proceedings of the First ACM+IEEE Joint Conference on Digital Libraries (JCDL), 2001

Jun 2001 | Refereed

In this paper we describe how we combined SDLIP and STARTS, two complementary protocols for searching over distributed document collections. The resulting protocol, which we call SDARTS, is simple yet expressive enough to enable building sophisticated metasearch engines. SDARTS can be viewed as an instantiation of SDLIP with metasearch-specific elements from STARTS. We also report on our experience building three SDARTS-compliant wrappers: for locally available plain-text document collections, for locally available XML document collections, and for external web-accessible collections. These wrappers were developed to be easily customizable for new collections. Our work was developed as part of Columbia University’s Digital Libraries Initiative-Phase 2 (DLI2) project, which involves the departments of Computer Science, Medical Informatics, and Electrical Engineering, the Columbia University libraries, and a large number of industrial partners. The main goal of the project is to provide personalized access to a distributed patient-care digital library.


Link to publication page

PERSIVAL Demo: Categorizing Hidden-Web Resources

Workshop
  Panagiotis G. Ipeirotis, Luis Gravano, and Mehran Sahami

 Venue: Proceedings of the First ACM+IEEE Joint Conference on Digital Libraries (JCDL), 2001

Jun 2001 | Refereed

The information available in electronic form continues to grow at an exponential rate and this trend is expected to continue. Although traditional search engines like AltaVista can address common information needs, they ignore the often valuable information that is “hidden” behind search interfaces, the so-called “hidden web.”

Automating the classification of “hidden web” resources is challenging, since the contents of these collections are available only by querying, not by traditional crawling. For example, consider the PubMed medical database from the National Library of Medicine, which stores medical bibliographic information and links to full-text journals accessible through the web. This database is accessible through a query interface. A query to PubMed with keyword “cancer” returns 1,313,266 matches, which are high-quality citations to medical articles, stored locally at the PubMed site. The contents of PubMed are not “crawlable” by traditional search engines. Thus, a query on AltaVista for all the pages in the PubMed site with keyword “cancer” returns only 16,380 matches. Hence, techniques that need to have the documents available for inspection are not applicable to analyze and classify the “hidden web” resources.

The ability to access these resources and organize them for subsequent use is a central component of the Digital Libraries Initiative - Phase 2 (DLI2) project at Columbia University. The project is named PERSIVAL and its main goal is to provide personalized access to a distributed patient care digital library with all kinds of collections. The manual inspection and classification of these resources is a non-scalable solution, so we developed a novel technique to automate this task.


Link to publication page

Probe, Count, and Classify: Categorizing Hidden-Web Databases

Conference
  Panagiotis G. Ipeirotis, Luis Gravano, and Mehran Sahami

 Venue: Proceedings of the 2001 ACM International Conference on Management of Data (SIGMOD), 2001

May 2001 | Refereed | 15% accepted

The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web “crawlers.” Recent studies have estimated the size of this “hidden web” to be 500 billion pages, while the size of the “crawlable” web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.


Link to publication page

Automatic Classification of Text Databases through Query Probing

Workshop
  Panagiotis G. Ipeirotis, Luis Gravano, and Mehran Sahami

 Venue: Proceedings of the Third International Workshop on the Web and Databases, (WebDB), 2000

May 2000 | Refereed | 29% accepted

Many text databases on the web are “hidden” behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier’s rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report some initial exploratory experiments that show that our approach is promising to automatically characterize the contents of text databases accessible on the web.


Link to publication page