Search Index Configuration File

Jackrabbit allows you to control which properties of a node are indexed and how much they affect the jcr:score value of that node in the result. You can also configure different analyzers on a property-by-property basis. The index configuration file instructs Lucene how to index the content of a workspace.

Summary

Apache Lucene is a Java-based indexing and search technology that also provides spellchecking, hit highlighting and advanced analysis/tokenization capabilities. This page is based on:

Indexing Configuration

The configuration parameter indexingConfiguration is not set by default. This means all properties of a node are indexed.

If you wish to configure the indexing behaviour, add the indexingConfiguration parameter to the SearchIndex element of either your repository configuration file or your workspace configuration file.

Whenever you change the indexing configuration, remember to recreate the index from scratch.
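As a rough sketch, the parameter might be declared as shown below. The class name and the path parameter are the usual Jackrabbit defaults. The file name indexing_configuration.xml is an assumption made for illustration, so substitute the actual location of your own file.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Directory where the Lucene index of the workspace is stored. -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- Assumed file name; point this at your own indexing configuration file. -->
  <param name="indexingConfiguration" value="/info/magnolia/jackrabbit/indexing_configuration.xml"/>
</SearchIndex>
```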

See https://wiki.apache.org/jackrabbit/IndexingConfiguration

Configuration files

The indexing configuration file should be located in the package info.magnolia.jackrabbit.

To optimize the index size, you can index only certain properties of a node type. Index rules are processed top down: the first matching rule is applied and all remaining rules are ignored.

As of Jackrabbit 2.0 you can also use the match-all regular expression (.*) for the namespace prefix part of a property name. This is currently the only regular expression supported for the prefix. Note that every namespace prefix used in the XML file must be declared in the configuration element.

With the nodeScopeIndex attribute set to false, the property is not added to the full-text index. It remains available to all searches except those that use contains(...) in SQL and SQL2 or jcr:contains(...) in XPath.
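The following sketch shows where the namespace declarations might go and how the match-all prefix can be used. The namespace URIs are the ones commonly registered for JCR and Magnolia and are assumptions here, so verify them against your repository. The rule itself is only illustrative.

```xml
<?xml version="1.0"?>
<!-- Every prefix used in the rules below is declared on the configuration element. -->
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:mgnl="http://www.magnolia.info/jcr/mgnl">
  <index-rule nodeType="nt:base">
    <!-- ".*" in the prefix position matches a property named "title" in any namespace. -->
    <property isRegexp="true">.*:title</property>
  </index-rule>
</configuration>
```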

Here we are applying an index rule against nodes of type nt:base. The rule also applies to nodes whose type extends nt:base. Since nt:base is the base node type of all primary node types, this rule applies everywhere.

<index-rule nodeType="nt:base">
  <property isRegexp="true" nodeScopeIndex="false">mgnl:.*</property> <!-- Exclude Magnolia metadata from the full-text index. -->
  <property isRegexp="true" nodeScopeIndex="false">jcr:.*</property> <!-- Exclude JCR metadata from the full-text index. -->
  <property isRegexp="true">.*:.*</property> <!-- Include all properties from any namespace, even the empty namespace. -->
</index-rule>

You may also add a condition to the index rule and have multiple rules with the same node type.

For example, let's say that we only want to boost page titles when the page has been marked with a priority property. Furthermore, let's assume we also have a requirement to provide three priority levels: low, medium, and high.

<!-- The default boost is 1.0, so it does not need to be specified. Any priority other than medium or high is treated as low. -->
<index-rule nodeType="mgnl:page"
            condition="@priority = 'medium'">
  <property boost="3.0">title</property>
</index-rule>
<index-rule nodeType="mgnl:page"
            condition="@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>

Finally, add a radio button to your page dialog for controlling page priority levels.

You may also reference properties in the condition that are not on the current node and/or specify the type of a node in the condition.
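The rules below sketch a few such conditions, following the forms described in the Jackrabbit indexing documentation. They are alternatives rather than a set meant to be used together. The node names news and settings and the boost values are hypothetical and serve only as illustration.

```xml
<!-- priority is read from any ancestor of the page. -->
<index-rule nodeType="mgnl:page"
            condition="ancestor::*/@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>

<!-- priority is read from the parent node, which must be named "news" (hypothetical name). -->
<index-rule nodeType="mgnl:page"
            condition="parent::news/@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>

<!-- priority is read from a child node named "settings" (hypothetical name). -->
<index-rule nodeType="mgnl:page"
            condition="settings/@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>

<!-- The node type is given in the condition. In Jackrabbit the type match is exact and does not include subtypes. -->
<index-rule nodeType="mgnl:page"
            condition="element(*, mgnl:area)/@priority = 'high'">
  <property boost="5.0">title</property>
</index-rule>
```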

You can configure a boost value on nodes and/or properties that match an index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 - 5.0) yield a higher score value, so the match appears more relevant.

Here we are applying a boost value of 3.0 to the title property on nodes of type mgnl:page.

<index-rule nodeType="mgnl:page">
  <property boost="3.0">title</property>
</index-rule>
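A boost can also be set on the index rule itself, in which case it applies to the whole node rather than to a single property. The snippet below is a minimal sketch, and the value 2.0 is chosen only for illustration.

```xml
<!-- Node-level boost: every matching mgnl:page node scores higher. -->
<index-rule nodeType="mgnl:page" boost="2.0">
  <!-- Only the properties listed in the rule are indexed for this node type. -->
  <property>title</property>
</index-rule>
```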

Sometimes it is useful to include the contents of descendant nodes in a single node, which makes it easier to search content that is scattered across multiple nodes.

Here we create an index aggregate on mgnl:page that includes the content of mgnl:area and mgnl:component nodes. This makes it easier to search page content that is stored in one of the page's area or component subnodes.

<aggregate primaryType="mgnl:page">
  <include primaryType="mgnl:area">*</include>
  <include primaryType="mgnl:component">*</include>
</aggregate>

With this configuration part, you define how a property should be analyzed.

For example, let's say I wanted to target properties which I know store German language content with a German language analyzer.

<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
  <property>text_de</property>
</analyzer>

Custom configuration file

You can create a custom indexing configuration for any workspace. Once created, the file is referenced in the workspace.xml file of the workspace you wish to target. Changes to this configuration require the workspace to be reindexed.

Examples of this are the website-specific configuration shown above and a DAM-specific configuration that uses node data aggregation. Since the Magnolia metadata is stored on the mgnl:asset node and the image metadata/data is stored on an mgnl:resource subnode, both can be aggregated into one Lucene document.
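A minimal sketch of such an aggregate, using only the node types named above, could look like this. It is an assumption about the general shape, and the configuration actually shipped with the DAM module may differ.

```xml
<!-- Fold the content of mgnl:resource child nodes into the Lucene document of the parent asset. -->
<aggregate primaryType="mgnl:asset">
  <include primaryType="mgnl:resource">*</include>
</aggregate>
```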
