Friday 19 October 2012

Comparing Distributed Indexing: To MapReduce or Not?

Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, particularly those built on the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.
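
To make the idea of MapReduce indexing concrete, here is a minimal single-process sketch of the inverted-index example outlined in the original MapReduce paper: the map step emits a (term, document ID) pair for each token, and the reduce step gathers the document IDs for each term into a sorted posting list. It is only an illustration, not the implementation evaluated in the paper. The function names, the whitespace tokeniser, the de-duplication of document IDs, and the in-memory shuffle are all assumptions made for this example, and no Hadoop API is used.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token occurrence."""
    # Assumption: lowercase whitespace tokenisation stands in for a real tokeniser.
    for token in text.lower().split():
        yield token, doc_id


def reduce_term(term, doc_ids):
    """Reduce step: build a sorted posting list for a single term."""
    # Assumption: repeated document IDs are collapsed to one entry.
    return term, sorted(set(doc_ids))


def build_index(corpus):
    """Run both steps, simulating the shuffle phase with an in-memory dict."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


# Hypothetical two-document corpus, keyed by document ID.
corpus = {1: "map reduce indexing", 2: "distributed indexing"}
index = build_index(corpus)
```

In a real cluster the grouping done by `build_index` is performed by the framework's shuffle and sort between the map and reduce phases, and emitting one pair per token makes that intermediate data large.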

Richard McCreadie, Craig Macdonald, and Iadh Ounis.
Comparing Distributed Indexing: To MapReduce or Not?
In Proceedings of LSDS-IR 2009.
Boston, Massachusetts, USA, 2009.

