Chapter 1. Lucene

1.1. Lucene technology

According to the project home page, "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform".

The project is hosted by Apache. It makes it possible to build scalable architectures based on distributed indexes by providing support for both indexing documents and searching the resulting indexes. It provides several kinds of index storage (in-memory, file-system based, database based).
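
To make the two operations mentioned above (indexing and searching) more concrete, here is a minimal sketch of an in-memory inverted index written in plain Java. It does not use Lucene's API: the class TinyInvertedIndex and its methods are hypothetical names chosen for illustration, and a real Lucene index adds analysis, scoring, and persistent storage on top of this basic idea.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lowercase term to the IDs of the documents
// that contain it. Illustration only; this is NOT Lucene's API.
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing: split the text into terms and record the document ID under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Searching: look up a single term and return the IDs of the matching documents.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Apache Lucene is a text search engine library");
        index.add(2, "Solr is a search server based on Lucene");
        System.out.println(index.search("lucene")); // documents 1 and 2
        System.out.println(index.search("solr"));   // document 2 only
    }
}
```

The sketch keeps everything in memory, which loosely corresponds to the in-memory kind of index; a file-system or database based index would store the same term-to-document mapping in a persistent form.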

Here is the list of Lucene-related projects:

  • Lucene Java (Java implementation),

  • Nutch,

  • Lucy (C implementation),

  • Solr,

  • Lucene.Net (.NET implementation),

  • Tika,

  • Mahout.

1.2. Related tools

1.2.1. Lucene Java

Lucene Java is the implementation of the Lucene technology in the Java language. Other languages such as C and .NET are supported by the Lucy and Lucene.Net projects, respectively.

Lucene Java makes it possible to interact with and search the index from Java code.

The tool is available at the URL http://lucene.apache.org/java/

1.2.2. Solrj

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat. Solrj is the Java client library used to communicate with a Solr server.

The tool is available at the URL http://lucene.apache.org/solr/
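
As an illustration of the HTTP API mentioned above, the following sketch builds the URL of a search request that asks Solr for a JSON response. It uses only the Java standard library, not Solrj. The class SolrQueryUrl, the base URL, and the core name "mycore" are hypothetical, and the sketch assumes a Solr layout in which each core exposes a "select" handler accepting the "q" and "wt" parameters; older installations may use a different path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds the URL of a Solr search request. Illustration only: the base URL
// and core name passed in are assumptions, not values defined by this document.
class SolrQueryUrl {

    // q  = the query string, URL-encoded so that characters like ':' and ' ' are safe
    // wt = the response writer, here JSON
    static String build(String baseUrl, String core, String query) {
        try {
            String q = URLEncoder.encode(query, "UTF-8");
            return baseUrl + "/" + core + "/select?q=" + q + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical local server and core name.
        System.out.println(build("http://localhost:8983/solr", "mycore", "title:lucene"));
    }
}
```

Sending an HTTP GET request to the resulting URL would return the matching documents; in a Java application, a client library such as Solrj would normally take care of building and sending such requests.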

1.2.3. Tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

The tool is available at the URL http://lucene.apache.org/tika/