...

If you need more servers or cores, duplicate the `default` client and change its `baseURL` property to point to the other server or core.

Node name	Value
solr-search-provider	
  config	
    solrClientConfigs	
      default	
        allowCompression	false
        baseURL	http://localhost:8983/solr/magnolia
        connectionTimeout	100
        soTimeout	1000
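
The duplication step can be pictured with a minimal sketch that models the client configurations as a plain Python dictionary. This is only an illustration of "copy the default client, change `baseURL`"; it is not how Magnolia stores or reads the configuration. The helper `add_client`, the client name `products` and the second URL are hypothetical placeholders.

```python
import copy

# Values of the default client, taken from the configuration table.
solr_client_configs = {
    "default": {
        "allowCompression": False,
        "baseURL": "http://localhost:8983/solr/magnolia",
        "connectionTimeout": 100,
        "soTimeout": 1000,
    }
}


def add_client(configs, new_name, base_url, template="default"):
    """Copy the template client under a new name and point it at another server/core."""
    if new_name in configs:
        raise ValueError(f"client {new_name!r} already exists")
    client = copy.deepcopy(configs[template])  # keep timeouts and compression settings
    client["baseURL"] = base_url               # only the target server/core changes
    configs[new_name] = client
    return client


# Hypothetical second core: both the name and the URL are placeholders.
add_client(solr_client_configs, "products", "http://localhost:8983/solr/products")
```

The new client inherits every other setting from `default`, so only the URL has to be reviewed after duplicating.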

The value entered for the `baseURL` property must conform to the following syntax:

...

To configure an indexing service globally, register the `IndexService` in the configuration of the Content Indexer module. For a custom indexing service, use the `indexServiceClass` property (see above in the properties table):

Node name	Value
modules	
  content-indexer	
    config	
      indexServiceClass	info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService

Indexer configuration

You can configure an indexer in Configuration > /modules/content-indexer/config/indexers. See the example configuration for indexing assets and folders in the DAM workspace, or the example configuration below for indexing content in the website workspace:

Node name	Value
modules	
  content-indexer	
    config	
      indexers	
        websiteIndexer	
          clients	
            default	
          fieldMappings	
            abstract	abstract
            author	author
            date	date
            teaserAbstract	mgnlmeta_teaserAbstract
            text	content
            title	title
          enabled	true
          indexed	false
          pull	false
          rootNode	/
          type	website
          workspace	website

Properties:

`enabled`	required. `true` enables the indexer configuration; `false` disables it.
`indexed`	required. Indicates whether indexing has been done. When Solr finishes indexing, the content-indexer module sets this property to `true`. Set it to `false` to trigger re-indexing.
`nodeType`	optional, default is `mgnl:page`. JCR node type to index. For example, to index assets in the Magnolia DAM, set this to `mgnl:asset`.
`pull`	optional, default is `false` (push). Pull URLs instead of pushing. When `true`, Solr uses Tika to extract information from a document, for instance a PDF. When `false`, the collected information is pushed to Solr as a Solr document.
`assetProviderId`	optional, default is `jcr`. If `pull` is set to `true`, specify an `assetProviderId` so that the asset can be obtained correctly.
`rootNode`	required. Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.
`type`	required. Sets the type of the indexed content, such as `website` or `documents`. When you search the index, you can filter results by type.
`workspace`	required. Workspace to index.
`indexServiceClass`	optional (Solr module version 5.2+). Custom `IndexService` used by this indexer. If not defined, the global one is used.
`fieldMappings`	required. Field mappings define how fields in Magnolia content are mapped to Solr fields. The left side is the Magnolia field, the right side is the Solr field.
  `<Magnolia_field>`	The value is `<Solr_field>`. You can use the fields available in the Solr schema. If a field does not exist in Solr's schema, you can use a dynamic field `mgnlmeta_*`. For instance, if you have information nested in a deep leaf of your page and stored in the property `specComponentAbstract`, you can map this field to `mgnlmeta_specComponentAbstract`. The indexer recursively explores the node's children until it finds the property.
`clients`	optional, default is `default` (Solr module version 5.2+). Solr clients used by this indexer. Allows content to be indexed in multiple Solr instances.
  `<client-name>`	required. Name of the client.
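
To make the `fieldMappings` behaviour more concrete, here is a minimal sketch of mapping Magnolia properties to Solr fields with a recursive lookup through child nodes. It is not the module's actual implementation: the node model (a dictionary with `properties` and `children`), the depth-first search order and the function names `find_property` and `build_solr_document` are all assumptions made for illustration.

```python
from typing import Any, Optional

# Hypothetical stand-in for a JCR node: {"properties": {...}, "children": [...]}.
Node = dict


def find_property(node: Node, name: str) -> Optional[Any]:
    """Return the first value of `name` found on the node or, depth-first, on its descendants."""
    properties = node.get("properties", {})
    if name in properties:
        return properties[name]
    for child in node.get("children", []):
        value = find_property(child, name)
        if value is not None:
            return value
    return None


def build_solr_document(node: Node, field_mappings: dict) -> dict:
    """Map Magnolia properties (left side) to Solr fields (right side)."""
    document = {}
    for magnolia_field, solr_field in field_mappings.items():
        value = find_property(node, magnolia_field)
        if value is not None:  # properties that are not found are simply skipped
            document[solr_field] = value
    return document


# A page whose `specComponentAbstract` property sits two levels below the page node.
page = {
    "properties": {"title": "Specifications"},
    "children": [
        {
            "properties": {},
            "children": [
                {"properties": {"specComponentAbstract": "Technical details"}, "children": []}
            ],
        }
    ],
}

mappings = {
    "title": "title",
    # No such field in the Solr schema, so the dynamic field mgnlmeta_* is used.
    "specComponentAbstract": "mgnlmeta_specComponentAbstract",
}

solr_document = build_solr_document(page, mappings)
```

In this sketch `title` is read from the page node itself, while `specComponentAbstract` is only found after descending into the nested component.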

...

Example: Configuration to crawl www.bbc.co.uk

Node name	Value
bbc_co_uk	
  clients	
    default	
  sites	
    bbc	
      url	http://www.bbc.co.uk/
  fieldMappings	
    abstract	#story_continues_1
    keywords	meta[name=keywords] attr(0,content)
  depth	2
  enabled	false
  nbrCrawlers	2
  type	

Properties:

`enabled`	required. `true` enables the crawler; `false` disables it. When a crawler is enabled, `info.magnolia.module.indexer.CrawlerIndexerFactory` automatically registers a new scheduler job for it.
`depth`	required. Maximum depth of a page, measured as the distance in clicks from the root page. This should not be too high, ideally 2 or 3.
`nbrCrawlers`	required. Maximum number of simultaneous crawler threads that crawl a site. 2 or 3 is enough.
`crawlerClass`	optional, since version 3.0, default is `info.magnolia.module.indexer.crawler.MgnlCrawler`. Implementation of `edu.uci.ics.crawler4j.crawler.WebCrawler` that the crawler uses to crawl sites.
`catalog`	optional, since version 3.0, default is `content-indexer`. Name of the catalog where the command resides.
`command`	optional, since version 3.0, default is `crawlerIndexer`. Command used to instantiate and trigger the crawler.
`activationOnly`	optional, since version 3.0. If set to `true`, the crawler is triggered only during activation and no scheduler job is registered for it. The `jcrItems` property (see below) must also be configured for this feature to work.
`delayAfterActivation`	optional, since version 3.0, default is 5s. Delay (in seconds) after which the crawler starts once activation is done.
`cron`	optional, default is every hour (`0 0 0/1 1/1 * ? *`). A CRON expression that specifies how often the site is crawled. CronMaker is a useful tool for building expressions.
`type`	optional. Sets the type of the crawled content, such as `news`. When you search the index, you can filter results by type.
`indexServiceClass`	optional, since version 5.2. Custom `IndexService` used by this crawler. If not defined, the global one is used.
`clients`	optional, since version 5.2, default is the `default` client. Solr clients used by this crawler. Allows content to be indexed in multiple Solr instances.
  `<client-name>`	required. Name of the client.
`fieldMappings`	required. Field mappings define how fields parsed from the site pages are mapped to Solr fields. The left side is the Solr field, the right side is the crawled site.
  `<site_field>`	required. You can use any CSS selector to target an element on the page. For example, `#story_continues_1` targets an element by ID. You can also use custom syntax to get content inside attributes. For example, meta keywords are extracted using `meta[name=keywords] attr(0,content)`. This extracts the first value of the keywords meta element. If you don't specify anything after the CSS selector, the text contained in the element is indexed. `meta[name=keywords]` on its own would return an empty string because a meta element does not contain any text; the keywords are in the attributes. To get the value of a specific attribute, specify `attr(<index>,<Solr_field_name>)`. If you set `index=-1`, all attributes are extracted and separated by a semicolon `;`.
`jcrItems`	optional, since version 3.0. List of JCR items. If any of these items is activated, the crawler is triggered.
  `<item_name>`	optional, since version 3.0. Name of the JCR item.
    `workspace`	required, since version 3.0. Workspace where the JCR item is stored.
    `path`	required, since version 3.0. Path of the JCR item.
`siteAuthenticationConfig`	optional, since version 5.0.2. Authentication information that allows crawling a password-restricted area.
  `username`	required, since version 5.0.2. Username used to log in to the restricted area.
  `password`	required, since version 5.0.2. Password used to log in to the restricted area.
  `loginUrl`	required, since version 5.0.2. URL of the page with the login form.
  `usernameField`	required, since version 5.0.2, default is `mgnlUserID`. Name of the input field for the username in the login form.
  `passwordField`	required, since version 5.0.2, default is `mgnlUserPSWD`. Name of the input field for the password in the login form.
  `logoutUrlIdentifier`	required, since version 5.0.2, default is `mgnlLogout`. String that identifies the logout URL. The crawler skips URLs that contain `logoutUrlIdentifier` to avoid logging out.
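
The crawler's field mapping syntax and the logout URL rule can be sketched as follows. This is not the crawler's own parser: the regular expression and the names `FieldSelector`, `parse_field_mapping` and `should_visit` are hypothetical, and the second `attr(...)` argument is treated as an attribute name on the assumption that `content` in the example above names the meta element's attribute.

```python
import re
from typing import NamedTuple, Optional


class FieldSelector(NamedTuple):
    css_selector: str
    attr_index: Optional[int]  # None means "index the element's text"
    attr_name: Optional[str]


# Optional trailing "attr(<index>,<name>)" after the CSS selector.
_ATTR_SUFFIX = re.compile(r"\s+attr\(\s*(-?\d+)\s*,\s*([^)\s]+)\s*\)\s*$")


def parse_field_mapping(value: str) -> FieldSelector:
    """Split a fieldMappings value into its CSS selector and optional attr(...) part."""
    match = _ATTR_SUFFIX.search(value)
    if match is None:
        return FieldSelector(value.strip(), None, None)
    selector = value[: match.start()].strip()
    return FieldSelector(selector, int(match.group(1)), match.group(2))


def should_visit(url: str, logout_url_identifier: str = "mgnlLogout") -> bool:
    """Skip any URL containing the logout identifier so the crawler stays logged in."""
    return logout_url_identifier not in url


keywords = parse_field_mapping("meta[name=keywords] attr(0,content)")
abstract = parse_field_mapping("#story_continues_1")
```

Here `abstract` has no `attr(...)` part, which corresponds to indexing the text contained in the element, while `keywords` carries the index `0` and the name `content`.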

...
