Indexing and crawling a website with Solr

This page describes how to configure the Content Indexer submodule of the Magnolia Solr module to index Magnolia workspaces and crawl a website. The Solr module lets you use Apache Solr, a standalone enterprise-grade search server with a REST-like API, to index and crawl Magnolia content. It is especially useful if you need to manage assets in high volumes (100,000+ DAM assets).

Configuring Solr clients

From version 5.2 the Solr module supports multiple Solr servers/cores. You can configure a client for every server/core under Configuration > /modules/solr-search-provider/config/solrClientConfigs. It's recommended to have one client named default. This default client is used when no specific client is defined for the indexer, crawler, or search result page template.

If you need to have more servers/cores, duplicate the default client and change the baseURL property to point to another server/core.

Node name	Value
solr-search-provider
  config
    solrClientConfigs
      default
        allowCompression	false
        baseURL	http://localhost:8983/solr/magnolia
        connectionTimeout	100
        soTimeout	1000

The value of the baseURL property must conform to the following syntax:

<protocol>://<domain_name>:<port>/solr/<solr_core_name>

If the Solr server is installed as described in Installing Apache Solr, then the value is http://localhost:8983/solr/magnolia. For a description of the other properties see the HttpSolrClient.Builder Javadoc and Using SolrJ - Common Configuration Options.
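
The baseURL syntax can also be checked mechanically. The following is a minimal sketch, not part of the Solr module: the class name SolrBaseUrl and its methods build and isValid are hypothetical, and it relies only on java.net.URI from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not part of the Magnolia Solr module) that composes and
// checks a baseURL of the form <protocol>://<domain_name>:<port>/solr/<solr_core_name>.
final class SolrBaseUrl {

    private SolrBaseUrl() {
    }

    // Composes a baseURL from its four parts.
    static String build(String protocol, String domainName, int port, String coreName) {
        return protocol + "://" + domainName + ":" + port + "/solr/" + coreName;
    }

    // Returns true if the value has a protocol, a host, an explicit port,
    // and a path of exactly /solr/<core>.
    static boolean isValid(String baseUrl) {
        try {
            URI uri = new URI(baseUrl);
            String path = uri.getPath();
            return uri.getScheme() != null
                    && uri.getHost() != null
                    && uri.getPort() != -1
                    && path != null
                    && path.matches("/solr/[^/]+");
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String url = build("http", "localhost", 8983, "magnolia");
        System.out.println(url + " valid: " + isValid(url));
    }
}
```

A check like this is stricter than Solr itself needs to be. It rejects a URL without an explicit port, for example, because the syntax above always includes one.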

Indexing Magnolia workspaces

The Content Indexer module is both a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The Content Indexer also lets you crawl external websites using JSoup and CSS selectors. You then define field mappings, which determine the values that are obtained for each node and stored in the Solr index.

IndexService

Both the indexer and the crawler use the IndexService to handle the indexing of content. A basic implementation is configured by default: info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService. You can define and configure your own IndexService for specific needs.

Implement the IndexService interface:

public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

   private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

   @Override
   public boolean index(Node node, IndexerConfig config) {
      ...
   }
}

To configure an indexing service globally, register the IndexService in the configuration of the Content Indexer module. To use a custom indexing service for a single indexer only, set the indexServiceClass property of that indexer (described under Indexer configuration below). The global registration looks like this:

Node name	Value
modules
  content-indexer
    config
      indexServiceClass	info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService
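
The relationship between the global indexServiceClass and an indexer-specific one can be pictured as a simple fallback. The sketch below only illustrates that rule and is not the module's actual code: the class IndexServiceResolver and the package com.example are hypothetical.

```java
// Hypothetical sketch of choosing between an indexer's own indexServiceClass
// and the globally configured one. It is not the module's actual code.
final class IndexServiceResolver {

    private final String globalServiceClass;

    IndexServiceResolver(String globalServiceClass) {
        this.globalServiceClass = globalServiceClass;
    }

    // Returns the indexer-specific class name when one is set,
    // otherwise the globally configured class name.
    String resolve(String indexerServiceClass) {
        if (indexerServiceClass == null || indexerServiceClass.trim().isEmpty()) {
            return globalServiceClass;
        }
        return indexerServiceClass.trim();
    }

    public static void main(String[] args) {
        IndexServiceResolver resolver = new IndexServiceResolver(
                "info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService");
        // Indexer without its own indexServiceClass: the global one is used.
        System.out.println(resolver.resolve(null));
        // Indexer with a custom class (com.example is a made-up package).
        System.out.println(resolver.resolve("com.example.I18nIndexerService"));
    }
}
```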

Indexer configuration

You can configure an indexer in Configuration > /modules/content-indexer/config/indexers. See an example configuration for indexing assets and folders in the DAM workspace, or the example configuration below for indexing content in the website workspace:

Node name	Value
modules
  content-indexer
    config
      indexers
        websiteIndexer
          clients
            default	default
          fieldMappings
            abstract	abstract
            author	author
            date	date
            teaserAbstract	mgnlmeta_teaserAbstract
            text	content
            title	title
          enabled	true
          indexed	false
          pull	false
          rootNode	/
          type	website
          workspace	website

Properties:

Property	Required / default	Description
`enabled`	required	`true` enables the indexer configuration. `false` disables it.
`indexed`	required	Indicates whether indexing was done. When Solr finishes indexing, the Content Indexer sets this property to `true`. Set it to `false` to trigger re-indexing.
`nodeType`	optional, default is `mgnl:page`	JCR node type to index. For example, to index assets in the Magnolia DAM, set this to `mgnl:asset`.
`pull`	optional, default is `false` (push)	Pull URLs instead of pushing. When `true`, Solr uses Tika to extract information from a document, for instance a PDF. When `false`, the collected information is pushed as a Solr document.
`assetProviderId`	optional, default is `jcr`	If `pull` is set to `true`, specify an `assetProviderId` to obtain an asset correctly.
`rootNode`	required	Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.
`type`	required	Sets the type of the indexed content, such as `website` or `documents`. When you search the index, you can filter results by type.
`workspace`	required	Workspace to index.
`indexServiceClass`	optional	(Solr module version 5.2+) Custom IndexService used by this indexer. If not defined, the global one is used.
`fieldMappings`	required	Field mappings define how fields in Magnolia content are mapped to Solr fields. The left side is Magnolia, the right side is Solr.
  `<Magnolia_field>`	`<Solr_field>`	You can use the fields available in the schema. If a field does not exist in Solr's schema, you can use a dynamic field `mgnlmeta_*`. For instance, if you have information nested in a deep leaf of your page and stored in the property `specComponentAbstract`, you can map this field to `mgnlmeta_specComponentAbstract`. The indexer contains a recursive call that explores the node's child leaves until it finds the property.
`clients`	optional, default is `default`	(Solr module version 5.2+) Solr clients used by this indexer. Lets you index content into multiple Solr instances.
  `<client-name>`	required	Name of the client.
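
To illustrate how field mappings and the recursive property lookup fit together, here is a minimal sketch. It is not the indexer's real implementation: ContentNode is a simplified stand-in for a JCR node, all class and method names are hypothetical, and the search order (the node itself first, then its children depth-first) is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of field mappings with a recursive property lookup.
final class FieldMappingSketch {

    // Simplified stand-in for a JCR node: a name, properties, and child nodes.
    static final class ContentNode {
        final String name;
        final Map<String, String> properties = new LinkedHashMap<>();
        final List<ContentNode> children = new ArrayList<>();

        ContentNode(String name) {
            this.name = name;
        }

        ContentNode property(String key, String value) {
            properties.put(key, value);
            return this;
        }

        ContentNode child(ContentNode node) {
            children.add(node);
            return this;
        }
    }

    // Looks on the node itself first, then searches its descendants depth-first.
    // Returns null when no node in the subtree has the property.
    static String findProperty(ContentNode node, String propertyName) {
        String value = node.properties.get(propertyName);
        if (value != null) {
            return value;
        }
        for (ContentNode child : node.children) {
            String found = findProperty(child, propertyName);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Builds a "Solr field -> value" map from "Magnolia field -> Solr field" mappings.
    // Mappings whose Magnolia property is not found are skipped.
    static Map<String, String> toSolrFields(ContentNode page, Map<String, String> fieldMappings) {
        Map<String, String> solrFields = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : fieldMappings.entrySet()) {
            String value = findProperty(page, mapping.getKey());
            if (value != null) {
                solrFields.put(mapping.getValue(), value);
            }
        }
        return solrFields;
    }

    // Sample page with a property nested two levels below the page node.
    static ContentNode samplePage() {
        ContentNode component = new ContentNode("spec")
                .property("specComponentAbstract", "Deep abstract");
        ContentNode area = new ContentNode("main").child(component);
        return new ContentNode("article").property("title", "Hello").child(area);
    }

    // Sample mappings: left side is Magnolia, right side is Solr.
    static Map<String, String> sampleMappings() {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("title", "title");
        mappings.put("author", "author");
        mappings.put("specComponentAbstract", "mgnlmeta_specComponentAbstract");
        return mappings;
    }

    public static void main(String[] args) {
        System.out.println(toSolrFields(samplePage(), sampleMappings()));
    }
}
```

In this sample, `title` is found on the page node, `specComponentAbstract` is found two levels down and stored under the dynamic field `mgnlmeta_specComponentAbstract`, and `author` is skipped because no node carries it.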

Crawling a website

The crawler mechanism uses the Scheduler to crawl a site periodically.

From version 3.0, crawlers can also be connected to the activation process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand to the command chain that contains the activation command. By default, this is done for the following commands:

If you are using the Publishing module:
- catalog: default, command: publish - configured under /modules/publishing-core/commands/default/publish
- catalog: default, command: unpublish - configured under /modules/publishing-core/commands/default/unpublish

If you are using the Activation module:
- catalog: default, command: activation - configured under /modules/activation/commands/default/activate/activate
- catalog: default, command: deactivate - configured under /modules/activation/commands/default/deactivate

If you are using the personalization-integration module:
- catalog: default, command: personalizationActivation - configured under /modules/personalization-integration/commands/default/personalizationActivation

If you are using a custom activation command and want to connect it to the crawler mechanism, you can use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task.

Example: Configuration to crawl www.bbc.co.uk

Node name	Value
bbc_co_uk
clients
default	default
sites
bbc
url	http://www.bbc.co.uk/
fieldMappings
abstract	#story_continues_1
keywords	meta[name=keywords] attr(0,content)
depth	2
enabled	false
nbrCrawlers	2
type	news
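
The depth property in the example above is not defined on this page. The sketch below assumes it limits how many link hops from the start URL the crawler follows, and shows that idea as a breadth-first traversal over an in-memory link graph. It is not the module's crawler: the class DepthLimitedCrawl, its methods, and the example.com URLs are all hypothetical, and no network access is involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a depth-limited crawl, assuming "depth" means
// the maximum number of link hops from the start URL.
final class DepthLimitedCrawl {

    // Returns the pages reachable from startUrl within maxDepth hops,
    // in breadth-first order. Each page is visited at most once.
    static List<String> crawl(Map<String, List<String>> links, String startUrl, int maxDepth) {
        Map<String, Integer> depthByUrl = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthByUrl.put(startUrl, 0);
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthByUrl.get(url);
            if (depth == maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            for (String target : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (!depthByUrl.containsKey(target)) {
                    depthByUrl.put(target, depth + 1);
                    queue.add(target);
                }
            }
        }
        return new ArrayList<>(depthByUrl.keySet());
    }

    // Made-up link graph: each key is a page, each value lists the pages it links to.
    static Map<String, List<String>> sampleLinks() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("http://example.com/",
                Arrays.asList("http://example.com/news", "http://example.com/sport"));
        links.put("http://example.com/news",
                Arrays.asList("http://example.com/news/story"));
        links.put("http://example.com/news/story",
                Arrays.asList("http://example.com/news/story/related"));
        return links;
    }

    public static void main(String[] args) {
        for (String url : crawl(sampleLinks(), "http://example.com/", 2)) {
            System.out.println(url);
        }
    }
}
```

With a depth of 2, the traversal reaches the start page, the two pages it links to, and the story page, but stops before the related page, which is three hops away.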