New API for Confluence Extractor (Lucene Search) Plugins

Confluence plugins support a way to hook into the Lucene search system. These extractor plugins allow you to add additional key/value pairs to the lucene index, so you can search for them using a key:value style search syntax.

As of Confluence 5.2, however, the old Extractor system pretty much doesn't work. If you're wondering why your extractor module has stopped working, you'll need to update it to the new API. Perhaps the word "new" isn't quite right for an API that's a year old, but I only stumbled onto it last month, so that's how I'm thinking of it.

Here’s some Atlassian doc on how the new API works.

New Methods

One thing the above doc doesn't really go into is what you're supposed to accomplish in the extractText method you need to implement. In the original extractor API there was a single method, addFields, and it did two things: update the defaultSearchableText StringBuffer passed into the method, and add new Field objects containing your key/value pairs to the passed-in Document.
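For reference, the old single-method contract looked roughly like this. This is a sketch, not the real plugin code: the Field and Document classes here are minimal stand-ins for the Lucene types of the same names (in a real plugin they come from the Lucene jar), defined locally so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class OldExtractorSketch {
    // Minimal stand-ins for org.apache.lucene.document.Field/Document,
    // just enough to show the shape of the old addFields contract.
    static class Field {
        final String name, value;
        Field(String name, String value) { this.name = name; this.value = value; }
    }
    static class Document {
        final List<Field> fields = new ArrayList<>();
        void add(Field f) { fields.add(f); }
    }

    // Old API: one method did both jobs -- add fields to the Document
    // (for key:value searches) AND append their values to the default
    // searchable text (for plain-text searches).
    static void addFields(Document document, StringBuffer defaultSearchableText,
                          String key, String value) {
        document.add(new Field(key, value));
        defaultSearchableText.append(value).append(' ');
    }

    public static void main(String[] args) {
        Document doc = new Document();
        StringBuffer text = new StringBuffer();
        addFields(doc, text, "reviewer", "jsmith");
        System.out.println(doc.fields.size() + ":" + text.toString().trim());
    }
}
```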

The new Extractor2 API divides those two operations between two methods: extractFields and extractText. The example in the document above shows what to do in extractFields nicely, but its extractText example is relatively unilluminating. For what it's worth, when I refactored one of these extractors I simply appended the values from my key/value pairs to a new StringBuffer and returned that. This seems to work, or at least it doesn't hurt anything.
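The split ends up shaped roughly like this. Again a standalone sketch: FieldDescriptor here is a local stub for the Confluence class of that name (the real one also takes store/index options), and the field names "reviewer" and "status" are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;

public class Extractor2Sketch {
    // Stub standing in for the Confluence FieldDescriptor class;
    // the real constructor also takes Store and Index arguments.
    static class FieldDescriptor {
        final String name, value;
        FieldDescriptor(String name, String value) { this.name = name; this.value = value; }
    }

    // New API, job one: emit the key/value pairs that become
    // key:value searchable fields in the index.
    static Collection<FieldDescriptor> extractFields(String reviewer, String status) {
        Collection<FieldDescriptor> fields = new ArrayList<>();
        fields.add(new FieldDescriptor("reviewer", reviewer)); // e.g. reviewer:jsmith
        fields.add(new FieldDescriptor("status", status));
        return fields;
    }

    // New API, job two: the approach described above -- append the same
    // values to a fresh builder and return it as the searchable text.
    static StringBuilder extractText(Collection<FieldDescriptor> fields) {
        StringBuilder text = new StringBuilder();
        for (FieldDescriptor f : fields) {
            text.append(f.value).append(' ');
        }
        return text;
    }

    public static void main(String[] args) {
        Collection<FieldDescriptor> fields = extractFields("jsmith", "approved");
        System.out.println(extractText(fields).toString().trim());
    }
}
```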

That One Parameter

Another thing the doc doesn't explicitly note is what the one parameter (Object searchable) passed into both methods might contain. My plugin only needed to worry about certain kinds of pages, so I can't provide an exhaustive list, but I can say for certain that, in the context of pages, the passed-in Object is an instance of Page. So you should be able to safely test whether it's an instance of ContentEntityObject or Page (or something along those lines) and continue from there.
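That guard is a plain instanceof check, along these lines. The ContentEntityObject and Page classes here are local stubs mirroring the Confluence class names, so the sketch runs standalone; in a real plugin you would import the actual Confluence classes instead.

```java
public class SearchableCheckSketch {
    // Stubs mirroring the Confluence content classes of the same names
    // (ContentEntityObject is the common superclass, Page extends it).
    static class ContentEntityObject {}
    static class Page extends ContentEntityObject {
        final String title;
        Page(String title) { this.title = title; }
    }

    // Only handle the content types the extractor cares about, and
    // bail out cleanly (here, return an empty string) for anything else.
    static String titleIfPage(Object searchable) {
        if (searchable instanceof Page) {
            return ((Page) searchable).title;
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(titleIfPage(new Page("Release Notes")));
        System.out.println(titleIfPage("not a page").isEmpty());
    }
}
```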

Tokenized vs Analyzed

One other thing I noticed: in the old documentation, the enum passed in when creating a Field is Field.Index.TOKENIZED, but the new API's example uses FieldDescriptor.Index.ANALYZED. In case you're wondering, Lucene deprecated the "tokenized" terminology in favor of "analyzed," so those enums should be (at least roughly) equivalent for our purposes.
