May 10, 2009

Lucene.NET – Search engine library for applications & programs

Apache Lucene is a high-performance, highly scalable, full-featured text search engine library written in Java. Lucene.NET is the .NET port of that library: a search engine API for indexing and searching content.

Lucene.NET brings to your applications the same kind of indexing and searching capabilities that Google provides with its desktop search engine. Several websites use Lucene for search, including Wikipedia, CNET, Monster.com, Mayo Clinic and FedEx. You can use Lucene not only for web search but also to build programs that search emails or any other content on a machine.

Lucene also turns up in combination with other open source products. NHibernate Search is an extension to NHibernate that lets you use Lucene.NET, a full-text search engine, as your query engine instead of putting additional load on the database itself.

LINQ to Lucene is a custom LINQ provider for the Lucene information retrieval system, commonly referred to as a search engine.

The Lucene search engine uses an index (like a database table) that contains documents (like db rows), each made up of fields (like db columns). The Query object lets you construct complex queries by composing an object graph of query instances.
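As a rough illustration of such an object graph, the sketch below composes three TermQuery instances into one BooleanQuery. It assumes the Lucene.NET 2.x API used elsewhere in this post (later releases may name BooleanClause.Occur differently); the class name QueryGraphSketch and the terms "vendor", "draft" and "invoice" are made up for the example.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class QueryGraphSketch
{
    // Builds a query over the "contents" field:
    //   must contain "vendor", must not contain "draft",
    //   and documents that also mention "invoice" score higher.
    public static Query Build()
    {
        BooleanQuery query = new BooleanQuery();
        query.Add(new TermQuery(new Term("contents", "vendor")), BooleanClause.Occur.MUST);
        query.Add(new TermQuery(new Term("contents", "draft")), BooleanClause.Occur.MUST_NOT);
        query.Add(new TermQuery(new Term("contents", "invoice")), BooleanClause.Occur.SHOULD);
        return query;
    }
}
```

The returned Query can be passed to IndexSearcher.Search() exactly like a query produced by QueryParser.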

Search engines usually use the ‘vector space model’ for querying, in which documents and queries are represented as vectors. A SQL query engine evaluates a WHERE clause as a Boolean test, so a row either matches or it does not, whereas a search engine uses the vector model so that documents with a higher similarity to the query appear higher in the results.
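To make the ranking idea concrete, here is a minimal sketch of the vector space model using plain term counts and cosine similarity. It is not Lucene's actual scoring formula, which also weights terms by how rare they are, among other factors; the class VectorSpaceSketch and its methods are hypothetical names for this example only.

```csharp
using System;
using System.Collections.Generic;

public static class VectorSpaceSketch
{
    // Turn text into a term-frequency vector: term -> number of occurrences.
    public static Dictionary<string, int> TermFrequencies(string text)
    {
        var vector = new Dictionary<string, int>();
        char[] separators = { ' ', '\t', '\n', '\r', ',', '.', ';', ':', '!', '?' };
        string[] terms = text.ToLowerInvariant()
                             .Split(separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string term in terms)
        {
            int count;
            vector.TryGetValue(term, out count); // count stays 0 if the term is new
            vector[term] = count + 1;
        }
        return vector;
    }

    // Cosine of the angle between two term vectors:
    // close to 1.0 for the same mix of terms, 0.0 when no term is shared.
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        foreach (KeyValuePair<string, int> pair in a)
        {
            normA += pair.Value * pair.Value;
            int other;
            if (b.TryGetValue(pair.Key, out other)) dot += pair.Value * other;
        }
        foreach (KeyValuePair<string, int> pair in b) normB += pair.Value * pair.Value;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Score every document against the query and return them best match first.
    public static List<KeyValuePair<string, double>> Rank(string query, IEnumerable<string> documents)
    {
        Dictionary<string, int> queryVector = TermFrequencies(query);
        var scored = new List<KeyValuePair<string, double>>();
        foreach (string document in documents)
        {
            double score = Cosine(queryVector, TermFrequencies(document));
            scored.Add(new KeyValuePair<string, double>(document, score));
        }
        scored.Sort((x, y) => y.Value.CompareTo(x.Value));
        return scored;
    }
}
```

Ranking the query "vendor invoice" against the texts "holiday schedule", "vendor list for the vendor portal" and "vendor invoice" puts "vendor invoice" first and "holiday schedule" last. A Boolean WHERE clause, by contrast, would only say which rows match, not in what order.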

To use Lucene, the first step is to create an index of your target, which may include files, emails or web pages. When you create an index, Lucene generates three files (a CFS file, deletable and segments) in the specified location.

The following steps show how to use Lucene: steps 1 and 2 create the index, whereas steps 3 and 4 perform the actual search.

  1. Create Documents by adding Fields;
  2. Create an IndexWriter and add documents to it with AddDocument();
  3. Call QueryParser.Parse() to build a query from a string; and
  4. Create an IndexSearcher and pass the query to its Search() method.
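The four steps can be sketched end to end against an in-memory index, which is convenient for experimenting before touching the file system. This assumes the same Lucene.NET 2.x API as the rest of this post; the class FourStepSketch, its method and the use of RAMDirectory are choices made for this example only.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class FourStepSketch
{
    // texts maps a path-like name to the text to index;
    // returns the names of the documents that match queryText.
    public static List<string> IndexAndSearch(IDictionary<string, string> texts, string queryText)
    {
        RAMDirectory dir = new RAMDirectory();          // in-memory index, nothing written to disk
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Steps 1 and 2: create documents from fields and add them through an IndexWriter.
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        foreach (KeyValuePair<string, string> entry in texts)
        {
            Document doc = new Document();
            doc.Add(new Field("path", entry.Key, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.Add(new Field("contents", entry.Value, Field.Store.NO, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Close();                                 // flush the index before searching

        // Step 3: build a query from a string.
        Query query = new QueryParser("contents", analyzer).Parse(queryText);

        // Step 4: run the query through an IndexSearcher and collect the stored paths.
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.Search(query);
        List<string> paths = new List<string>();
        for (int i = 0; i < hits.Length(); i++)
        {
            paths.Add(hits.Doc(i).Get("path"));
        }
        searcher.Close();
        return paths;
    }
}
```

Swapping RAMDirectory for an FSDirectory gives the file-based variant that the rest of this post walks through piece by piece.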

In the following line of code I specify the web folder whose files will be indexed:

System.IO.DirectoryInfo docDir = new System.IO.DirectoryInfo(@"C:\Ajit\HomeDir\Web");

Next, specify the index directory as shown below (passing true as the second argument creates a new index, replacing any existing one):

Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, true);

You can then use an IndexWriter object to create the index as shown below:

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

While indexing (creating the table), you need to specify each document (row) and its fields (columns) as shown below:

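Document doc = new Document();
// fName is the System.IO.FileInfo being indexed; store its full path as one untokenized term
doc.Add(new Field("path", fName.FullName, Field.Store.YES, Field.Index.UN_TOKENIZED));
// store the last-modified time, rounded to the minute
doc.Add(new Field("modified", DateTools.TimeToString(fName.LastWriteTime.Ticks, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.UN_TOKENIZED));
// index the file contents from a reader (tokenized, not stored)
doc.Add(new Field("contents", new System.IO.StreamReader(fName.FullName, System.Text.Encoding.Default)));
writer.AddDocument(doc); // add the finished document to the index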

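Once you have created the index (call writer.Close() after adding the last document so that everything is written out), the next step is to use it in an actual search.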

// the file location of the index
string indexFileLocation = INDEX_DIR.FullName;
Lucene.Net.Store.Directory dir = Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
// use the index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(dir);
Lucene.Net.Analysis.Analyzer analyzer = new StandardAnalyzer();
Lucene.Net.QueryParsers.QueryParser q = new Lucene.Net.QueryParsers.QueryParser("contents", analyzer);
Lucene.Net.Search.Query query = q.Parse("vendor");
// execute the query
Lucene.Net.Search.Hits hits = searcher.Search(query);
// iterate over the results
for (int i = 0; i < hits.Length(); i++) {
    Document doc = hits.Doc(i);
    System.String path = doc.Get("path");
    if (path != null) {
        System.Console.Out.WriteLine(i + ". " + path);
    } else {
        System.String url = doc.Get("url");
        if (url != null) {
            System.Console.Out.WriteLine(i + ". " + url);
            System.Console.Out.WriteLine(" - " + doc.Get("title"));
        }
    }
}


Lucene References:

http://lucene.apache.org/java/docs/index.html

http://incubator.apache.org/lucene.net/