Search Mapped Objects in Lucene.Net

In my previous post (Lucene.Net Object Mapping) I introduced the Lucene.Net.ObjectMapping NuGet package. That post describes how the package can be used to map virtually any .Net object to a Lucene.Net Document and how to reconstruct the object from that same Document later. Now it’s time to look at the search side of things: how can you search for mapped objects in Lucene.Net?

You already know Searcher

The Searcher class in Lucene.Net can be used to run queries against an index and retrieve the Documents matching those queries. The Lucene.Net.ObjectMapping library comes with additional extensions to the Searcher class which help you search for mapped Documents. There’s a variety of extensions: some just return a TopDocs object with the number of results you’ve asked for, some allow sorting (a quick sketch of those follows below), and the most powerful ones let you specify a Collector to gather the results. Using a Collector makes it very easy to page over all the results for a query, which is usually what you want when showing search results. So let’s look at an example of searching for Documents that contain mapped .Net objects using a Collector. Let’s assume we’re building a blog engine, for which we want to index the posts.

public class BlogPost
{
    public Guid Id { get; set; }
    public DateTime Created { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public string[] Tags { get; set; }
}

// ... as before, you'd store your BlogPost objects like this:
luceneIndexWriter.AddDocument(thePost.ToDocument());
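
If you only need the top few matches without paging, the simpler extensions mentioned above are enough. The snippet below is just a sketch of what that could look like: the overloads shown mirror the standard Searcher.Search calls, but the exact signatures are an assumption on my part, so check the package for the real ones.

// Hypothetical sketch only: verify the actual overloads in the package.
using (Searcher searcher = new IndexSearcher(myIndexReader))
{
    // The 10 best matching BlogPost documents.
    TopDocs topDocs = searcher.Search<BlogPost>(new MatchAllDocsQuery(), 10);

    // Or, assuming a sorting overload exists, the 10 newest posts.
    TopDocs newest = searcher.Search<BlogPost>(
        new MatchAllDocsQuery(),
        10,
        new Sort(new SortField("Created", SortField.LONG, true)));
}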

Use a Collector for Paging

Creating a paged index of all your blog posts is very easy, really. You’ll need a Searcher and a Collector (the TopFieldCollector will do for now), and that’s about it. Let’s look at some code.

private const int PageSize = 10;

public BlogPost[] GetPostsForPage(int page)
{
    // Sanitize the 'page' before doing anything with it.
    if (page < 0)
    {
        page = 0;
    }

    int start = page * PageSize;
    int end = start + PageSize;

    using (Searcher searcher = new IndexSearcher(myIndexReader))
    {
        TopFieldCollector collector = TopFieldCollector.Create(
            // Let's sort descending by create date.
            new Sort(new SortField("Created", SortField.LONG, true)),
            end, // Need to get the hits until 'end'.
            false,  // fillFields: we don't need the sort field values filled in.
            false,  // trackDocScores: no need to track individual document scores.
            false,  // trackMaxScore: no need to track the maximum score either.
            false); // docsScoredInOrder: don't assume docs arrive in docId order.

        // Let's use the object mapping extensions for Search! This will
        // filter results to only those Documents which hold a BlogPost.
        searcher.Search<BlogPost>(new MatchAllDocsQuery(), collector);

        // At this point we know how many hits there are in total. So
        // let's check that the requested page is within range.
        if (start >= collector.TotalHits)
        {
            page = (collector.TotalHits - 1) / PageSize;
            start = page * PageSize;
            end = start + PageSize;
        }

        TopDocs docs = collector.TopDocs(start, PageSize);
        List<BlogPost> posts = new List<BlogPost>();

        foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
        {
            Document doc = searcher.Doc(scoreDoc.Doc);

            posts.Add(doc.ToObject<BlogPost>());
        }

        return posts.ToArray();
    }
}
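
The Collector-based extension works with any Lucene.Net query, not just MatchAllDocsQuery. For example, to page only over posts carrying a specific tag, you could swap in a TermQuery. Note that the field name "Tags" and the lowercased term are assumptions here: the actual field names and tokens depend on how the mapper and your analyzer index the BlogPost properties.

// Hypothetical: restrict the search to posts tagged "lucene". The field name
// "Tags" and the lowercase term assume the mapper indexes the Tags property
// under that name and that the analyzer lowercases the tokens.
Query tagQuery = new TermQuery(new Term("Tags", "lucene"));

// The rest of the paging logic stays exactly the same:
searcher.Search<BlogPost>(tagQuery, collector);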

That’s it, no magic, no tricks. Instead of returning a plain array with the results, you could return an object that holds some more meta information, for instance the total number of hits or the actual page you’re returning results for (see the sketch below). But the core logic remains the same.

You can also play around with different ways to sort the results. Keep in mind, though, that tokenized/analyzed fields in Lucene.Net are sorted based on their tokens, not on the original string value. To help address this, I’m thinking about extending the object mappers so you can specify not only that a field should be analyzed (because you want to search it), but also that a non-analyzed copy should be added for sorting purposes. That way you can search and sort on the same logical field. Keep in mind, though, that the index will grow, since the data is indexed twice: once tokenized/analyzed, and once as-is.
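
A minimal sketch of such a result wrapper could look like the following; the class and property names are purely illustrative.

// Illustrative only: a small wrapper carrying the results plus paging metadata.
public class SearchResultPage<T>
{
    public T[] Items { get; set; }
    public int TotalHits { get; set; } // e.g. collector.TotalHits
    public int Page { get; set; }      // the (possibly clamped) page actually returned
    public int PageSize { get; set; }
}

GetPostsForPage could then fill in TotalHits and Page from the collector and the clamped page number before mapping the documents back to BlogPost instances.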