Search Mapped Objects in Lucene.Net

In my previous post (Lucene.Net Object Mapping) I introduced the Lucene.Net.ObjectMapping NuGet package. The post describes how the package can be used to map virtually any .Net object to a Lucene.Net Document and how to reconstruct the object from that same Document later. Now it’s time to look at the search aspect of it, so how can you search mapped objects in Lucene.Net?

You already know Searcher

The Searcher class in Lucene.Net can be used to run queries on an index and retrieve documents matching that query. The Lucene.Net.ObjectMapping library comes with additional extensions to the Searcher class which help you search for Documents. There’s a variety of different extensions, some which just return a TopDocs object with the number of results you’ve specified, and some which allow sorting, but more powerful are the ones which require you to specify a Collector to gather the results. Using a Collector makes it very easy to support paging over all the results for a specific query, and after all that’s usually what you’d do today if you want to show search results. So let’s look at an example of searching for Documents that contain mapped .Net objects using a Collector. Let’s assume we’re building a blog engine, for which we want to index the posts.

public class BlogPost
{
    public Guid Id { get; set; }
    public DateTime Created { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public string[] Tags { get; set; }
}

// ... as before, you'd store your BlogPost objects like this:
luceneIndexWriter.AddDocument(thePost.ToDocument());

Use a Collector for Paging

Creating an paged index of all your blog posts is very easy, really. You’ll need a Searcher, a Collector (the TopFieldCollector will do for now) and that’s about it. Let’s look at some code.

private const int PageSize = 10;

public BlogPost[] GetPostsForPage(int page)
{
    // Sanitize the 'page' before doing anything with it.
    if (page < 0)
    {
        page = 0;
    }

    int start = page * PageSize;
    int end = start + PageSize;

    using (Searcher searcher = new IndexSearcher(myIndexReader))
    {
        TopFieldCollector collector = TopFieldCollector.Create(
            // Let's sort descending by create date.
            new Sort(new SortField("Created", SortField.LONG, true)),
            end, // Need to get the hits until 'end'.
            false,
            false,
            false,
            false);

        // Let's use the object mapping extensions for Search! This will
        // filter results to only those Documents which hold a BlogPost.
        searcher.Search<BlogPost>(new MatchAllDocsQuery(), collector);

        // At this point we know how many hits there are in total. So
        // let's check that the requested page is within range.
        if (start >= collector.TotalHits)
        {
            page = (collector.TotalHits - 1) / PageSize;
            start = page.Value * PageSize;
            end = start + PageSize;
        }

        TopDocs docs = collector.TopDocs(start, PageSize);
        List<BlogPost> posts = new List<BlogPost>();

        foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
        {
            Document doc = searcher.Doc(scoreDoc.Doc);

            posts.Add(doc.ToObject<BlogPost>());
        }

        return posts.ToArray();
    }
}

That’s it, no magic, no tricks. One thing you could do, instead of just returning a plain array with the results is to return an object which holds some more meta information, like for instance the number of total hits, or the actual page you’re returning results for. But the core logic remains the same. You can play around with different ways to sort the results. Keep in mind though that tokenized/analyzed fields in Lucene.Net are sorted based on the tokens, not based on the actual string value. To help address this, I’m thinking about extending the object mappers to allow to specify not only to analyze a field (because you want to search it), but also to add a non-analyzed copy of the field for sorting purposes. That way, you have the advantage of being able to search and sort on the same logical field in the end. Keep in mind though that the index will grow since the data is indexed twice: once tokenized/analyzed, once as-is.

Lucene.Net Object Mapping

Today I finally took some time to turn a little library I’ve used for a while now into a NuGet package, called Lucene.Net.ObjectMapping. At the same time, I also uploaded the code to GitHub. But let’s look at Lucene.Net Object Mapping in more detail.

How To Install

Since this is a NuGet package, installation is as simple as running the following command in the Package Manager Console

Install-Package Lucene.Net.ObjectMapping

Alternatively, you can just search for Lucene.Net.ObjectMapping in the package manager and you should find it.

How To Use It?

Using object mapping is as simple as calling two methods: ToDocument to convert an object into a document and ToObject to convert a Document (that was created with the ToDocument method) into the original object.

MyObject obj = ...;
Document doc = obj.ToDocument();
// Save the document to your Lucene.Net Index

// Later, load the document from the index again
Document docFromIndex = ...;
MyObject objFromDoc = docFromIndex.ToObject<MyObject>();

How does it work?

Under the covers, the library is JSON-serializing the object and stores the JSON in the actual Lucene.Net document. In addition, it stores some metadata like the actual and the static types of the object you stored, as well as the timestamp (ticks) of when the document was created. The type information is used when you search for documents that were created for a specific type. The static type is the type you pass in as the type parameter to ToDocument, whereas the actual type is the actual (dynamic) type of the object you’re passing in. Since all this information is stored in the document too, there are no issues re-creating objects from an class hierarchy too.
In addition to storing the object information itself, the library also indexes the individual properties of the object you’re storing, including nested properties. By default, it uses a mapper which works as follows.

  • Public properties/fields of objects are mapped to Lucene.Net fields with the same name; e.g. a property called “Id” is mapped to a field called “Id”.
  • Properties/fields that are arrays are mapped to multiple Lucene.Net fields, all with the same name (the name of the property that holds the array).
  • Nested properties/fields, i.e. objects from properties/fields, use the name of the property as a prefix for the properties/fields of the object.

Each field is created with the following mapping of field types:

  • Boolean properties are mapped to a numeric field (Int) with a value of 1 for true and 0 for false.
  • DateTime properties are mapped to a numeric field (Long) with the value being the Ticks property of the DateTime.
  • Float properties are mapped to a numeric field (Float) with the value being the float value.
  • Double and Decimal properties are mapped to a numeric field (Double) with the value being the double value.
  • Guid properties are mapped to string fields which are NOT_ANALYZED, i.e. you can search for the GUID as is.
  • Integer (also Long, Short, and Byte as well as their unsigned/signed counterparts) properties are mapped to a numeric field (Long) with the value being the integer value.
  • Null values are not mapped at all; thus, the absence of a field implies the corresponding property is null.
  • String properties are mapped to string fields which are ANALYZED.
  • TimeSpan properties are mapped to a numeric field (Long) with the value being the Ticks property of the TimeSpan.
  • Uri properties are mapped to string fields which are ANALYZED.

Example Mapping

Let’s look at a simple example of an object and its mapping to a Lucene.Net Document. Consider the following object model.

public class MyObject
{
    public int Id { get; set; }
    public string Name { get; set; }
    public ObjectMeta Meta { get; set; }
}

public class ObjectMeta
{
    public DateTime LastModified { get; set; }
    public string ModifiedBy { get; set; }
    public string[] Modifications { get; set; }
}

// Create an instance of MyObject
MyObject obj = new MyObject()
{
    Id = 1234,
    Name = "My Lucene.Net mapped Object",
    Meta = new ObjectMeta()
    {
        LastModified = DateTime.UtcNow,
        ModifiedBy = "the dude",
        Modifications = new string[] { "changed a", "removed b", "added c" },
    },
};

Document doc = obj.ToDocument();

The mapping rules called out above will add the following fields for searching to the document. Please note that I’m not calling out the fields needed for the internal workings of the Lucene.Net.ObjectMapping library.

Field Name Type Value
Id Numeric / Long 1234
Name String / ANALYZED My Lucene.Net mapped Object
Meta.LastModified Numeric / Long < the number of ticks at the current time >
Meta.ModifiedBy String / ANALYZED the dude
Meta.Modifications String / ANALYZED changed a
Meta.Modifications String / ANALYZED removed b
Meta.Modifications String / ANALYZED added c

The mapper is by no means complete. Ideas to extend it in the future exist, including functionality to

  • specify attributes on string properties (or properties mapped to string fields) to specify how to index the string (NO vs ANALYZED vs NOT_ANALYZED vs NOT_ANALYZED_NO_NORMS vs ANALYZED_NO_NORMS).
  • specify attributes on any properties to define how to map the field, e.g. by specifying a class which can map the field

I’ll talk a little more on how to use this all when searching for documents in your Lucene.Net index. But as a sneak preview: the library also provides extension methods to the Searcher class from Lucene.Net that you can use to specify an object type to filter your documents on.

Writing to Event Log — the right way

This one’s been on my mind for a long time. I know it’s very tempting to just use System.Diagnostics.EventLog.WriteEntry to write some string to the event log. But personally I never liked the fact that you write all that static text along with the variables like actual error messages etc. Why make your life harder analyzing events later on when there’s an easy way to fix that?

Instrumentation Manifests to the Rescue!

For a while now this has actually been quite easy, using instrumentation manifests. You can read more about it here: http://msdn.microsoft.com/en-us/library/windows/desktop/dd996930(v=vs.85).aspx. These manifests allow you to define events, templates for events, messages for events, even your own event channels (so you wouldn’t need to log into that crowded “Application” channel anymore) and a lot more. But let’s look at a little example.

<?xml version="1.0" encoding="utf-8"?>
<instrumentationManifest xsi:schemaLocation="http://schemas.microsoft.com/win/2004/08/events eventman.xsd" xmlns="http://schemas.microsoft.com/win/2004/08/events" xmlns:win="http://manifests.microsoft.com/win/2004/08/windows/events" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:trace="http://schemas.microsoft.com/win/2004/08/events/trace">
    <instrumentation>
        <events>
            <provider name="MyService" guid="{DDB3FC6E-6CC4-4871-9F27-88C1B1F19BBA}" symbol="TheEventLog"
                      message="$(string.MyService.ProviderMessage)"
                      resourceFileName="MyService.Events.dll"
                      messageFileName="MyService.Events.dll"
                      parameterFileName="MyService.Events.dll">
                <events>
                    <event symbol="ServiceStarted" version="0" channel="Application"
                           value="1000" level="win:Informational"
                           message="$(string.MyService.event.1000.message)" />
                    <event symbol="ServiceStopped" version="0" channel="Application"
                           value="1001" level="win:Informational"
                           message="$(string.MyService.event.1001.message)"/>
                    <event symbol="ServiceConfigurationError" version="0" channel="Application"
                           value="1002" level="win:Error" template="ServiceException"
                           message="$(string.MyService.event.1002.message)"/>
                    <event symbol="ServiceUnhandledException" version="0" channel="Application"
                           value="1003" level="win:Error" template="ServiceException"
                           message="$(string.MyService.event.1003.message)"/>
                </events>
                <levels/>
                <channels>
                    <importChannel name="Application" chid="Application"/>
                </channels>
                <templates>
                    <template tid="ServiceException">
                        <data name="Exception" inType="win:UnicodeString" outType="xs:string"/>
                    </template>
                </templates>
            </provider>
        </events>
    </instrumentation>
    <localization>
        <resources culture="en-US">
            <stringTable>
                <string id="level.Informational" value="Information"/>
                <string id="level.Error" value="Error"/>
                <string id="channel.Application" value="Application"/>

                <string id="MyService.ProviderMessage"
                        value="My Windows Service"/>

                <string id="MyService.event.1000.message"
                        value="My Windows Service has started."/>
                <string id="MyService.event.1001.message"
                        value="My Windows Service has stopped."/>
                <string id="MyService.event.1002.message"
                        value="My Windows Service encountered a problem with its configuration. Please fix these issues and start the service again.:%n%n%1"/>
                <string id="MyService.event.1003.message"
                        value="My Windows Service encountered an unhandled exception:%n%n%1"/>
            </stringTable>
        </resources>
    </localization>
</instrumentationManifest>

Let’s start at the top. Lines 5-9 define some basic information about this instrumentation provider, like a name, a unique ID and a symbol (which will come in handy later). We can also define a friendly name for events logged this way (i.e. the event source). Let’s ignore the three xyzFileName attributes for now. On lines 11-22 we’re defining four events, some of them informational (like “the service started” or “the service stopped”), some are errors (e.g. configuration errors, or unhandled exceptions). If we wanted to define our own channel, we’d do so between lines 25 and 27. For now we’re just re-using (i.e. importing) the pre-defined “Application” channel.

Event Templates

Event templates are particularly handy if you want to write parameters with your events. Lines 29-31 define a template which has exactly one parameter, which happens to be a unicode string. We’ll use it to store exceptions. We can define more than one parameter and there’s a lot of types to use, but I’ll let you explore those on your own. This template, as you can see, is referred to by the two events with IDs 1002 and 1003.

Resources

The localization gods are with us to. Our event and template definitions so far were abstract, no actual UI strings were contained. We can define those per language, as you can see starting line 37. In the resources element and its sub-elements, we define the actual strings we want to show, including any parameters. Parameters are numbered (1-based) and are referred to with %1, %2, %3 and so on. As you can see on lines 51 and 53, we’re defining the strings for the two error events with one parameter each (“%1”), to contain the exception message. If you want line breaks, you’ll achieve those with “%n”.

Compile, with some Sugar added

So now we have a fancy manifest, but what can we do with it? Well, eventually we want to log events using the definitions from this manifest, so let’s get to it. The Windows SDK comes with two very handy tools, MC.exe (the message compiler) and RC.exe (the resource compiler). We’ll use the first to compile the manifest — and generate some c# code as a side effect — then use the second to compile the output of the first into a resource which can be linked into an executable. The commands are as follows.

mc.exe -css MyService.Events manifest.man -r obj\Debug
rc.exe obj\Debug\manifest.rc

MC.exe was nice enough to generate a file called manifest.cs for us. That file contains some code that you can use to log every event you defined in the manifest. This is why it was so handy to define the events (and templates): depending on how many parameters an event’s template has, the generated methods will ask you to provide just as many (typed) values for those parameters. Isn’t that great?! You’ll also find the compiled manifest.res file in obj\Debug. You can link that into its own executable (or your main assembly too, if you wanted), as follows:

csc.exe /out:MyService.Events.dll /target:library /win32res:obj\Debug\manifest.res

And you have a satellite assembly which holds the manifest you’ve built! CSC will log a warning about missing source files (because you didn’t add any .cs files to be compiled) but so far that doesn’t hurt anyone. We could probably also use link.exe but so far the C# compiler does a nice enough job.

Use that generated Code

Remember the code that was generated for us by MC.exe? Let’s go ahead and use it.

// ...
TheEventLog.EventWriteServiceStarted();
// ...
TheEventLog.EventWriteServiceConfigurationError(exception.Message); // ... or log the entire exception, including stack traces.
// ...

Wasn’t that very easy?

Install the Event Provider

There’s still something missing though: we’ll need to install our instrumentation/event provider with the system. It’s similar to creating the event source (which in fact will happen automatically when installing the manifest). This will typically happen in your application’s/service’s installer, using a command line as follows. But before that, remember the xyzFileName attributes we talked about? These need to be updated to point to the full path of the MyService.Events.dll assembly we generated. Otherwise the following command is going to fail.

wevtutil.exe im path\to\my\manifest.man

From now on, when your app or service starts and logs those events, they’ll show up in the event viewer. For the two events we defined with parameters, the values of the parameters are essentially the only thing that’s stored along with the ID of the event. Likewise, they’ll be the only thing that’s going to be exported with the event — so the files with the exported events you’re going to ask your customers to send you are going to be a lot smaller and won’t contain the static part of the events you already know anyway!

To uninstall the manifest, just run this command:

wevtutil.exe um path\to\my\manifest.man

Both commands need to run elevated (particularly important to remember when writing your installer).

Next Steps

As a next step, you’ll probably want to add the manual steps of compiling the manifest linking into the satellite assembly to the project file as automated targets. I’ll likely write another post about that in the future too.

Summary

As you can see, writing a manifest, compiling it and using the generated code to write to the event log is quite easy. So no more excuses to write each event as one big string (which is can be a lot harder to analyze when they come back from your customers because you first need to parse the strings).

Gzip Encoding an HTTP POST Request Body

I was wondering how difficult it was to Gzip-compress the body of an HTTP POST request (or any HTTP request with a body, that is), for large request bodies. While the .Net HttpClient has supported compression of response bodies for a while, it appears that to this day there is no out-of-the-box support for encoding the body of a request. Setting aside for now that the server may not natively support Gzip-compressed request bodies, let’s look at what we need to do to support this on the client side.

Enter HttpMessageHandler

The HttpMessageHandler abstract base class and its derived classes are used by the HttpClient class to asynchronously send HTTP requests and receive the response from the server. But since we don’t actually want to send the message ourselves – just massage the body and headers a little bit before sending – we’ll derive a new class GzipCompressingHandler from DelegatingHandler so we can delegate sending (and receiving) to another handler and just focus on the transformation of the content. So here’s what that looks like.

public sealed class GzipCompressingHandler : DelegatingHandler
{
    public GzipCompressingHandler(HttpMessageHandler innerHandler)
    {
        if (null == innerHandler)
        {
            throw new ArgumentNullException("innerHandler");
        }

        InnerHandler = innerHandler;
    }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        HttpContent content = request.Content;

        if (request.Method == HttpMethod.Post)
        {
            // Wrap the original HttpContent in our custom GzipContent class.
            // If you want to compress only certain content, make the decision here!
            request.Content = new GzipContent(request.Content);
        }

        return base.SendAsync(request, cancellationToken);
    }
}

As you can see, all we’re doing is just wrapping the original HttpContent in our GzipContent class. So let’s get right to that.

Gzip-compressed HttpContent: GzipContent

We’re almost there, all we need to do is actually compressing the content and modify the request headers to indicate the new content encoding.

internal sealed class GzipContent : HttpContent
{
    private readonly HttpContent content;

    public GzipContent(HttpContent content)
    {
        this.content = content;

        // Keep the original content's headers ...
        foreach (KeyValuePair<string, IEnumerable<string>> header in content.Headers)
        {
            Headers.TryAddWithoutValidation(header.Key, header.Value);
        }

        // ... and let the server know we've Gzip-compressed the body of this request.
        Headers.ContentEncoding.Add("gzip");
    }

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        // Open a GZipStream that writes to the specified output stream.
        using (GZipStream gzip = new GZipStream(stream, CompressionMode.Compress, true))
        {
            // Copy all the input content to the GZip stream.
            await content.CopyToAsync(gzip);
        }
    }

    protected override bool TryComputeLength(out long length)
    {
        length = -1;
        return false;
    }
}

Easy, right? Of course you could add other supported compression algorithms, using more or less the same code (or even adding some abstraction for different compression algorithms), but this is basically all that’s required.

Summary

Using the HttpMessageHandler and its associated classes makes it extremely easy to apply transformations to all (or a well-defined subset) of HTTP requests you’re sending. In this case, we’re applying Gzip-compression to the bodies of all outgoing POST requests, but the logic to decide when to compress can be as customized as you want; you could even apply Gzip-compression only if the requested URI ends with “.gzip” or for certain content types.