Page properties
Download: Multiple submodules
Edition: EE Std
License: MLA
Issues: MGNLEESOLR
Maven site: Solr
Latest version: info.magnolia.solr:magnolia-solr-search-provider-parent

The Solr module (full name: Magnolia Solr Search Provider Module) allows you to use Apache Solr, a standalone enterprise-grade search server with a REST-like API, for indexing and crawling Magnolia content. It is especially useful if you need to manage assets in high volumes (100,000+ DAM assets).

For a brief overview of Solr's main features see the Solr search page. For module compatibility with Apache Solr and module release notes see the Solr module release notes page.


Module structure

The Solr module (parent) consists of five submodules. The first two, Content Indexer and Solr Search Provider, are required for the Solr search feature to function correctly.

artifactID | Description
magnolia-solr-search-provider-parent | Parent reactor.
magnolia-content-indexer | Indexes Magnolia workspaces. It can also crawl a published website.
magnolia-solr-search-provider | Provides templates for displaying Solr search results on a site and faceted search components.
magnolia-solr-workbench | Provides a Solr container for list, search and thumbnail views in content apps.
magnolia-solr-search-provider-bundle | A bundle containing the Content Indexer, Search Provider and Solr Workbench modules, together with third-party libraries and sample Solr configuration files managed-schema and solrconfig.xml.
magnolia-solr-uninstall | For the removal of the old, pre-2.0 version of the Content Indexer module and Solr module configuration. Version 2.0 of the Solr module didn't support migration from the older version. The older version had to be uninstalled first.

Solr uses the Lucene library for full-text indexing and provides faceted search, distributed search and index replication. You can use Solr to index content in an event-based or action-based fashion. Module versions 5.0 and later are compatible with Solr 5.3; older versions of the module are compatible with Solr 4.

Installing

Maven is the easiest way to install the modules. Add the following dependencies to your bundle:

...

Maven dependency
groupId: info.magnolia.solr
artifactId: magnolia-solr-search-provider

Maven dependency
groupId: info.magnolia.solr
artifactId: magnolia-solr-workbench

Pre-built jars are also available:

...

  • magnolia-content-indexer.jar
  • magnolia-solr-search-provider.jar
  • magnolia-solr-uninstall.jar
  • magnolia-solr-workbench.jar

Solr Search Provider bundle

The Content Indexer, Search Provider and Solr Workbench submodules are available in a bundle which also contains a sample configuration set and some third-party libraries.

...

Installing Apache Solr

Apache Solr is a standalone search server. You need the server in addition to the Magnolia Solr modules.

Download Apache Solr and extract the zip to your computer.

...

Installing Solr 5

Creating a Magnolia config set and configuring a schema and solrconfig

A schema file specifies what fields the Magnolia content can contain, how those fields are added to the index, and how they are queried. https://cwiki.apache.org/confluence/display/solr/Documents%2C+Fields%2C+and+Schema+Design

SolrRequestHandler is a Solr Plugin that defines the logic executed for any request. https://wiki.apache.org/solr/SolrRequestHandler

Create a new Magnolia config set by duplicating the $SOLR_HOME/server/solr/configsets/data_driven_schema_configs folder and naming the copy magnolia_data_driven_schema_configs ($SOLR_HOME/server/solr/configsets/magnolia_data_driven_schema_configs).

Download the Magnolia example configuration files (based on the Solr data_driven_schema_configs config set, https://cwiki.apache.org/confluence/display/solr/Config+Sets) and overwrite the default files in the newly created magnolia_data_driven_schema_configs/conf:

Starting Apache Solr and creating a new core based on the Magnolia config set

Go to $SOLR_HOME/bin, start the Solr server and create a new core called magnolia:

Code Block
languagebash
cd $SOLR_HOME/bin
./solr start
./solr create_core -c magnolia -d magnolia_data_driven_schema_configs

This type of startup works for testing and development purposes. For production installation see Taking Solr to Production.

Installing Solr 4

Configuring a schema and solrconfig

A schema file specifies what fields the Magnolia content can contain, how those fields are added to the index, and how they are queried. An ExtractingRequestHandler extracts searchable fields from Magnolia pages.

Download the configuration files and overwrite the default files in  $SOLR_HOME/example/solr/collection1/conf/ :

Code Block
solr/
  bin/
  contrib/
  dist/
  docs/
  example/
    solr/
      collection1/
        conf/
          schema.xml
          solrconfig.xml
  licenses/

Starting Apache Solr

Go to the example directory and start Solr.

Code Block
languagebash
cd $SOLR_HOME/example
java -jar start.jar

This type of startup works for testing and development purposes. For production installation see Taking Solr to Production.

What's new in Solr Search Provider module version 5.0.2

Warning

This version contains changes in solrconfig.xml and managed-schema. Read the notes below before updating to 5.0.2.

Fixed an issue where two indexers or crawlers overwrote each other's index entries when indexing the same content, for example when one indexer indexed the English translation and another indexed the German translation (MGNLEESOLR-102).

The problem was caused by using the JCR UUID (indexers) and the URL (crawlers) as the unique identifier for Solr index entries. Fixing it required changes in solrconfig.xml and managed-schema:

  • <uniqueKey> in managed-schema was changed to uuid.
  • The default value of the unique key field was changed to uuid in info.magnolia.search.solrsearchprovider.logic.providers.FacetedSolrSearchProvider.
  • solrconfig.xml now generates the uuid field from a combination of the type and id fields, using the Solr deduplication mechanism (https://wiki.apache.org/solr/Deduplication). For more details see the change in the code diff.
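The idea of deriving the unique key from type and id can be illustrated with a minimal sketch. It is not the module's or Solr's implementation: the class name, the choice of MD5 and the way the two values are combined are assumptions made only for illustration. The point is that two indexers with different type values no longer produce the same key for the same node.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Solr module or of Solr itself.
final class UuidSignatureSketch {

    // Derives a stable key from the type and id fields (MD5 is an assumption here).
    static String signature(String type, String id) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(type.getBytes(StandardCharsets.UTF_8));
            md5.update((byte) 0); // separator, so ("ab", "c") and ("a", "bc") differ
            md5.update(id.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String jcrId = "3f2a9c1e-0000-4000-8000-000000000001"; // made-up JCR identifier
        // Same node, two indexers with different types: two distinct keys.
        System.out.println("website_en -> " + signature("website_en", jcrId));
        System.out.println("website_de -> " + signature("website_de", jcrId));
    }
}
```

With the old scheme both indexers would have written under the same JCR UUID, so the later one replaced the earlier entry.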

Update to 5.0.2

Option 1:

If you don't plan to index the same content with two different indexers or crawlers, you don't need to update the solrconfig.xml and managed-schema of your Solr core. The only change you need to make is to add the uniqueKeyField property with the value id to your Solr search result page.

Option 2:

Use the new Solr module and its configuration files for your Solr core and for the $SOLR_HOME/server/solr/configsets/magnolia_data_driven_schema_configs config set.

All Solr indexes must be recreated because of the changes in the configuration files. The easiest way is probably to recreate the Solr core and then retrigger indexing in Magnolia:

  1. Use the new solrconfig.xml and managed-schema configuration files for the Magnolia config set in $SOLR_HOME/server/solr/configsets/magnolia_data_driven_schema_configs.
  2. Delete the magnolia core and create it again:

    Code Block
    languagebash
    cd $SOLR_HOME/bin
    ./solr delete -c magnolia
    ./solr create_core -c magnolia -d magnolia_data_driven_schema_configs
  3. Retrigger the indexers by setting their indexed property to false.

What's new in Solr Search Provider module version 5.0

Solr Search Provider module version 5.0 adds support for Solr 5 (officially tested with version 5.3.1).

Full changelog for version 5.0: https://jira.magnolia-cms.com/browse/MGNLEESOLR/fixforversion/18141

Because of the changes in the module, it is recommended to completely recreate the Solr indexes after upgrading to version 5.0.

API changes

org.apache.solr.client.solrj.SolrServer is deprecated and was replaced by org.apache.solr.client.solrj.SolrClient in the solr-solrj 5.x library. As a result, the info.magnolia.search.solrsearchprovider.MagnoliaSolrBridge#getSolrServer method was renamed to info.magnolia.search.solrsearchprovider.MagnoliaSolrBridge#getSolrClient.

What's new in Solr Search Provider module version 3.0

Solr Search Provider module version 3.0 delivers the following key fixes and enhancements:

...

MGNLEESOLR-66

...

The magnolia-solr-search-provider-theme module has been removed. MGNLEESOLR-66

...

MGNLEESOLR-64

...

MGNLEESOLR-77

...

MGNLEESOLR-61

...

MGNLEESOLR-70

...

MGNLEESOLR-72

Full changelog for version 3.0 https://jira.magnolia-cms.com/browse/MGNLEESOLR/fixforversion/17434

Because of the changes in the module, it is recommended to completely recreate the Solr indexes after upgrading to version 3.0.

Indexing Magnolia workspaces

The Content Indexer module is a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The content indexer also allows you to crawl external websites using JSoup and CSS selectors. You then define field mappings that determine which values are obtained from each node and indexed in the Solr index.

Indexer configuration

Configure an indexer in Configuration > /modules/content-indexer/config/indexers. Example configurations for indexing a website and DAM assets are provided. Duplicate one of the examples to index another site or workspace.

...

modules
  content-indexer
    config
      indexers
        websiteIndexer
          fieldMappings
            abstract: abstract
            author: author
            date: date
            teaserAbstract: mgnlmeta_teaserAbstract
            text: content
            title: title
          enabled: true
          indexed: false
          pull: false
          rootNode: /
          type: website
          workspace: website

Properties:

Property | Required/default | Description
enabled | required | true enables the indexer configuration. false disables the indexer configuration.
indexed | required | Indicates whether indexing was done. When Solr finishes indexing, content-indexer sets this property to true. You can set it to false to trigger re-indexing.
nodeType | optional, default is mgnl:page | JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to mgnl:asset.
pull | optional, default is false (push) | Pull URLs instead of pushing. When true, Solr uses Tika to extract information from a document, for instance a PDF. When false, the collected information is pushed as a Solr document.
assetProviderId | optional, default is jcr | If pull is set to true, specify an assetProviderId to obtain an asset correctly.
rootNode | required | Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.
type | required | Sets the type of the indexed content such as website or documents. When you search the index you can filter results by type.
workspace | required | Workspace to index.
fieldMappings | required | Field mappings define how fields in Magnolia content are mapped to Solr fields. Left side is Magnolia, right side is Solr.
  <Magnolia_field> | <Solr_field> | You can use the fields available in the schema. If a field does not exist in Solr's schema you can use a dynamic field mgnlmeta_*. For instance, if you have information nested in a deep leaf of your page stored with property specComponentAbstract, you can map this field with mgnlmeta_specComponentAbstract. The indexer contains a recursive call which explores the node's child leaves until it finds the property.
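The mgnlmeta_* fallback for field mappings can be sketched as a small naming rule. This is only an illustration: the class, the method and the set of schema field names are hypothetical and not part of the Content Indexer module, and a mapping can of course also rename a field explicitly (as text: content does in the example configuration).

```java
import java.util.Set;

// Hypothetical helper for choosing the right-hand side of a field mapping.
final class DynamicFieldSketch {

    static final String DYNAMIC_PREFIX = "mgnlmeta_";

    // Uses the schema field if it exists, otherwise falls back to the dynamic field.
    static String solrFieldFor(String magnoliaProperty, Set<String> schemaFields) {
        if (schemaFields.contains(magnoliaProperty)) {
            return magnoliaProperty;
        }
        return DYNAMIC_PREFIX + magnoliaProperty;
    }

    public static void main(String[] args) {
        // Assumed subset of fields declared in the Solr schema.
        Set<String> schemaFields = Set.of("title", "abstract", "author", "date", "content");
        System.out.println("title -> " + solrFieldFor("title", schemaFields));
        System.out.println("specComponentAbstract -> "
                + solrFieldFor("specComponentAbstract", schemaFields));
    }
}
```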

IndexService

The indexer uses an IndexService to handle the indexing of a node. A basic implementation is configured by default: info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService. You can define and configure your own IndexService for specific needs.

Implement the IndexService interface:

Code Block
languagejava
titleIndexService
public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

   private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

   @Override
   public boolean index(Node node, IndexerConfig config) {
      ...

Register the IndexService in the Content Indexer module configuration:

...

modules
  content-indexer
    config
      indexService
        class: info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService

Crawling a website

The crawler mechanism uses the Scheduler to crawl a site periodically.

From version 3.0, crawlers can also be connected to the activation process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand to the command chain of an activation command. By default this is done for the following activation and deactivation commands:

  • catalog: default, command: activation - configured under /modules/activation/commands/default/activate/activate
  • catalog: default, command: deactivate - configured under /modules/activation/commands/default/deactivate
  • catalog: default, command: personalizationActivation - configured under /modules/personalization-integration/commands/default/personalizationActivation

If you use a custom activation command and want to connect it to the crawler mechanism, you can use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task.

Example: Configuration to crawl bbc.com

...

bbc_com
  sites
    bbc
      url: http://www.bbc.co.uk/
  fieldMappings
    abstract: #story_continues_1
    keywords: meta[name=keywords] attr(0,content)
  depth: 2
  enabled: false
  nbrCrawlers: 2
  type: news

Properties:

Property | Required/default | Description
enabled | required | true enables the crawler. false disables the crawler. When a crawler is enabled, info.magnolia.module.indexer.CrawlerIndexerFactory automatically registers a new scheduler job for it.
depth | required | The maximum depth of a page, measured as the number of clicks from the root page. This should not be too high, ideally 2 or 3 at most.
nbrCrawlers | required | The maximum number of simultaneous crawler threads that crawl a site. 2 or 3 is enough.
crawlerClass | optional, since version 3.0, default value is info.magnolia.module.indexer.crawler.MgnlCrawler | Implementation of edu.uci.ics.crawler4j.crawler.WebCrawler that the crawler uses to crawl sites.
catalog | optional, since version 3.0, default value is content-indexer | Name of the catalog where the command resides.
command | optional, since version 3.0, default value is crawlerIndexer | Command used to instantiate and trigger the crawler.
activationOnly | optional, since version 3.0 | If set to true, the crawler is triggered only during activation. No scheduler job is registered for this crawler. (warning) The jcrItems property (see below) must also be configured for this feature to work.
delayAfterActivation | optional, since version 3.0, default value is 5s | Delay (in seconds) after which the crawler starts once activation is done.
cron | optional, default is every hour: 0 0 0/1 1/1 * ? * | A CRON expression that specifies how often the site is crawled. CronMaker is a useful tool for building expressions.
type | optional | Sets the type of the crawled content such as news. When you search the index you can filter results by type.
sites | required | List of sites to crawl. For each crawler you can define multiple sites to crawl.
  <site> | required | Name of the site.
    url | required | URL of the site.
fieldMappings | required | Field mappings define how fields parsed from the site pages are mapped to Solr fields. Left side is the Solr field, right side is the crawled site.
  <site_field> | required | You can use any CSS selector to target an element on the page. For example, #story_continues_1 targets an element by ID. You can also use custom syntax to get content from attributes. For example, meta keywords are extracted using meta[name=keywords] attr(0,content), which extracts the first value of the keywords meta element. If you don't specify anything after the CSS selector, the text contained in the element is indexed. meta[name=keywords] alone would return an empty string because a meta element does not contain any text; the keywords are in its attributes. To get the value of a specific attribute, specify attr(<index>,<Solr_field_name>). If you set index=-1, all attributes are extracted and separated by a semicolon (;).
jcrItems | optional, since version 3.0 | List of JCR items. If any of these items is activated, the crawler is triggered.
  <item_name> | optional, since version 3.0 | Name of the JCR item.
    workspace | required, since version 3.0 | Workspace where the JCR item is stored.
    path | required, since version 3.0 | Path of the JCR item.
siteAuthenticationConfig | optional, since version 5.0.2 | Authentication information that allows crawling a password-restricted area.
  username | required, since version 5.0.2 | Username used to log in to the restricted area.
  password | required, since version 5.0.2 | The user's password used to log in to the restricted area.
  loginUrl | required, since version 5.0.2 | URL of the page with the login form.

The bundle contains third-party libraries such as crawler4j and SolrJ:

  • magnolia-solr-search-provider-bundle (ZIP)

    Content of the bundle (in v. 5.2):
    Code Block
    languagebash
    magnolia-solr-search-provider-bundle-5.2.zip
    ├── commons-math3-3.6.1.jar
    ├── crawler4j-4.1.jar
    ├── je-5.0.73.jar
    ├── LICENSE.txt
    ├── lidalia-slf4j-ext-1.0.0.jar
    ├── magnolia-content-indexer-5.2.jar
    ├── magnolia-solr-search-provider-5.2.jar
    ├── magnolia-solr-workbench-5.2.jar
    ├── noggit-0.8.jar
    ├── NOTICE.txt
    ├── README.txt
    ├── sample-solr-config-files
    │   ├── managed-schema
    │   └── solrconfig.xml
    ├── solr-solrj-7.3.0.jar
    ├── stax2-api-3.1.4.jar
    ├── woodstox-core-asl-4.4.1.jar
    └── zookeeper-3.4.11.jar
    

Configuration

For information about installing the Apache Solr server and for further configuration details, see the following pages:

...

  usernameField | required, since version 5.0.2, default value is mgnlUserID | Name of the input field for entering the username in the login form.
  passwordField | ... | Name of the input field for entering the password in the login form.
  logoutUrlIdentifier | required, since version 5.0.2, default value is mgnlLogout | String that identifies the logout URL. The crawler does not crawl URLs that contain logoutUrlIdentifier, to avoid logging out.
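The selector plus attr(...) syntax used in crawler field mappings can be taken apart as in the sketch below. This is not the module's parser: the class name and the regular expression are assumptions, written only to show how a value such as meta[name=keywords] attr(0,content) splits into a CSS selector, an index and a name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the right-hand side of a crawler field mapping.
final class SiteFieldMappingSketch {

    // "<css selector> attr(<index>,<name>)"; the attr(...) suffix is optional.
    private static final Pattern ATTR_SUFFIX =
            Pattern.compile("^(.*?)\\s+attr\\(\\s*(-?\\d+)\\s*,\\s*([^)]+?)\\s*\\)$");

    final String selector;
    final Integer index; // null when no attr(...) suffix is present
    final String name;   // null when no attr(...) suffix is present

    private SiteFieldMappingSketch(String selector, Integer index, String name) {
        this.selector = selector;
        this.index = index;
        this.name = name;
    }

    static SiteFieldMappingSketch parse(String mapping) {
        String trimmed = mapping.trim();
        Matcher m = ATTR_SUFFIX.matcher(trimmed);
        if (m.matches()) {
            return new SiteFieldMappingSketch(
                    m.group(1).trim(), Integer.valueOf(m.group(2)), m.group(3));
        }
        // No suffix: the element's text content would be indexed.
        return new SiteFieldMappingSketch(trimmed, null, null);
    }

    public static void main(String[] args) {
        SiteFieldMappingSketch keywords = parse("meta[name=keywords] attr(0,content)");
        System.out.println(keywords.selector + " | " + keywords.index + " | " + keywords.name);
        SiteFieldMappingSketch story = parse("#story_continues_1");
        System.out.println(story.selector + " | " + story.index + " | " + story.name);
    }
}
```

An index of -1 parses the same way; how it is interpreted (all values joined by a semicolon) is up to the crawler.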

Providing a Solr search

The Solr Search Provider module contains templates to display search results on the site. It also provides faceted search components for refining the results further. The faceted search gets related facets from the search context. Suggestions and available fields are available in Freemarker context.

Configuring the Solr server base URL

Configure the Solr server address in Configuration > /modules/solr-search-provider/config/solrConfig@baseURL. The baseURL should be http://<domain_name>:<port>/solr/<solr_core_name>. If the Solr server was installed as described in Installing Solr 5, the baseURL is http://localhost:8983/solr/magnolia.

See HttpSolrClient Javadoc for other properties.

...

solr-search-provider
  config
    solrConfig
      allowCompression: false
      baseURL: http://localhost:8983/solr/magnolia
      connectionTimeout: 100
      followRedirects: false
      maxConnectionsPerHost: 100
      maxRetries: 0
      maxTotalConnections: 100
      soTimeout: 1000
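A quick way to sanity-check a baseURL value before saving it is sketched below. The helper is hypothetical and not part of the module; it assumes Solr is served under the /solr context path, as in the pattern given above, and simply checks that a core name follows it.

```java
import java.net.URI;

// Hypothetical helper for building and checking the baseURL value.
final class SolrBaseUrlSketch {

    // Builds http://<domain_name>:<port>/solr/<solr_core_name>.
    static String baseUrl(String host, int port, String core) {
        return "http://" + host + ":" + port + "/solr/" + core;
    }

    // Returns the core name, or null if the path is not of the form /solr/<core>.
    static String coreName(String baseUrl) {
        String path = URI.create(baseUrl).getPath();
        if (path == null) {
            return null;
        }
        String[] parts = path.split("/");
        if (parts.length == 3 && parts[0].isEmpty()
                && "solr".equals(parts[1]) && !parts[2].isEmpty()) {
            return parts[2];
        }
        return null;
    }

    public static void main(String[] args) {
        String url = baseUrl("localhost", 8983, "magnolia");
        System.out.println(url + " -> core " + coreName(url));
        // A URL without a core name is a common misconfiguration.
        System.out.println("http://localhost:8983/solr -> core "
                + coreName("http://localhost:8983/solr"));
    }
}
```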

Creating a search results page

Create a search results page using one of the available templates. Which template you use depends on the type of project you have and the modules that are installed.

Module | Template | Configuration
mte | mteSolrSearchResult | /modules/solr-search-provider/templates/mteSolrSearchResult
standard-templating-kit | solrSearchResult | /modules/solr-search-provider/templates/solrSearchResult

To try it in the demo travel site:

  1. Make the template available in the site definition.
  2. Create a page which uses the template.
  3. Edit the home page properties.
  4. Select your Solr results page in the Search Page field.

Search result settings


Url domain filtering

You can filter results by URL domain in the Filter url prefix field.

Field boosting for relevance

The example query title^100 abstract^0.1 will boost the rank for matches in the title field 1000 times more than equivalent matches in the abstract.


The query will give the following results:


If instead you boost the abstract over the title you would get the following results for the same search. The returned snippets are now primarily from page titles.

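The field^boost notation in the boosting example can be unpacked with a short sketch. The parser below is hypothetical and is not how the module or Solr reads the value; it only splits a string such as title^100 abstract^0.1 into per-field factors so the relative weight of two fields can be compared. Treating a field without a ^ suffix as boost 1.0 is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a space-separated list of field^boost entries.
final class BoostSketch {

    static Map<String, Double> parse(String boosts) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String token : boosts.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input
            }
            int caret = token.indexOf('^');
            if (caret < 0) {
                result.put(token, 1.0); // no explicit boost given
            } else {
                result.put(token.substring(0, caret),
                        Double.parseDouble(token.substring(caret + 1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = parse("title^100 abstract^0.1");
        double ratio = boosts.get("title") / boosts.get("abstract");
        System.out.println("boosts = " + boosts);
        System.out.printf("title is weighted about %.0f times more than abstract%n", ratio);
    }
}
```

Swapping the two factors (title^0.1 abstract^100) inverts the ratio, which corresponds to the second case described above.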

Filtering search results

Positive filtering: Return only results where the keyword conference is present.


Negative filtering: Don't return results where the keyword conference is present.


You can add more filters by separating them by spaces.

Autocomplete search bar

The autocomplete search bar provides suggestions while you type into the search field. jQuery UI Autocomplete widget and info.magnolia.search.solrsearchprovider.logic.servlets.SearchServlet are used for this functionality.

How to configure it

  1. Go to http://jqueryui.com/download and download the jQuery UI JavaScript for the Autocomplete widget and its required dependencies.
  2. In the downloaded archive, find jquery-ui.js (or jquery-ui.min.js) and jquery.js and add them to Magnolia resources.
  3. Add the jQuery JavaScript libraries to the search results page:

    <script src="path to jquery.js" type="text/javascript"></script>
    <script src="path to jquery-ui.js" type="text/javascript"></script>

  4. Add this small script to the search results page:

    Code Block
    languagejs
    var jq = jQuery.noConflict();
    jq(document).ready(function () {
        jq("#searchbar, #nav-search, #search").autocomplete({
            open: function () {
                jq(this).autocomplete('widget').css('z-index', 999);
            },
            source: function (request, response) {
                jq.get("${contextPath}/searchservlet/", {search: request.term.toLowerCase(), queryType: "SUGGEST", fields: "collation", fq: "*"},
                    function (data) {
                        response(data);
                    }, "json"
                );
            },
            minLength: 2
        });
    });

More information about the autocomplete feature

For more information, see this series of blog posts:

(warning) The linked blog posts may use old versions of Solr and of the Magnolia Solr module, so some of the examples may not work with the latest version. They are still worth reading.

Other features

  • Pagination
  • Faceting on all fields
  • Ranged faceting
  • Similar search
  • Localized search

  • Suggestions