enabled

required

true enables the crawler. false disables the crawler.

When a crawler is enabled, info.magnolia.module.indexer.CrawlerIndexerFactory automatically registers a new scheduler job for it.

depth

required

The maximum depth of a page, measured as the number of clicks from the root page. Keep this value low, ideally 2 or 3.
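To make the meaning of depth concrete, here is a minimal sketch of a depth-limited breadth-first traversal. It is not the module's crawler; the LINKS graph and the function name are hypothetical, and it assumes the root page counts as depth 0.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for a real site.
LINKS = {
    "/": ["/news", "/about"],
    "/news": ["/news/2024", "/"],
    "/news/2024": ["/news/2024/item-1"],
    "/about": [],
    "/news/2024/item-1": [],
}


def pages_within_depth(links, root, max_depth):
    """Return {page: depth} for every page reachable within max_depth clicks."""
    depths = {root: 0}  # assumption: the root page itself is depth 0
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(page, []):
            if target not in depths:  # first visit in BFS is the shortest click path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths
```

With max_depth set to 2, the page /news/2024/item-1 (three clicks from the root) is not visited. Each extra level can multiply the number of pages fetched, which is why a small value is recommended.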

nbrCrawlers

required

The max number of simultaneous crawler threads that crawl a site. 2 or 3 is enough.

crawlerClass

optional, since version 3.0, default value is info.magnolia.module.indexer.crawler.MgnlCrawler

Implementation of edu.uci.ics.crawler4j.crawler.WebCrawler that the crawler uses to crawl sites.

catalog

optional, since version 3.0, default value is content-indexer

Name of the catalog where the command resides.

command

optional, since version 3.0, default value is crawlerIndexer

Command which is used to instantiate and trigger the Crawler.

activationOnly

optional, since version 3.0

If set to true, the crawler is triggered only during activation. No scheduler job is registered for this crawler.

(warning) The jcrItems property (see below) must also be configured for this feature to work.

delayAfterActivation

optional, since version 3.0, default value is 5s

Defines the delay, in seconds, between the end of activation and the start of the crawl.

cron

optional, default value is 0 0 0/1 1/1 * ? * (every hour)

A CRON expression that specifies how often the site will be crawled. CronMaker is a useful tool for building expressions.
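The default expression is easier to read once it is split into named fields. The sketch below assumes the Quartz-style field order (seconds first, optional year last), which matches the seven-field default shown above; the function name is hypothetical and the code only labels fields, it does not validate or schedule anything.

```python
# Assumed field order: Quartz-style cron, with an optional trailing year field.
QUARTZ_FIELDS = (
    "seconds",
    "minutes",
    "hours",
    "day_of_month",
    "month",
    "day_of_week",
    "year",
)


def describe_cron(expression):
    """Map each whitespace-separated part of a cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 fields, got %d" % len(parts))
    # zip stops at the shorter sequence, so a 6-field expression simply has no year.
    return dict(zip(QUARTZ_FIELDS, parts))


default_schedule = describe_cron("0 0 0/1 1/1 * ? *")
```

Read this way, the default fires at second 0 and minute 0 of every hour (0/1), on every day of the month (1/1), in every month and every year.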

type

optional

Sets the type of the crawled content such as news. When you search the index you can filter results by type.

sites

required

List of sites to crawl. For each crawler you can define multiple sites to crawl.

<site>

required

Name of the site.

url

required

URL of the site.

fieldMappings

required

Field mappings define how fields parsed from the site pages are mapped to Solr fields. The left side is the Solr field, the right side is the field on the crawled site.

<site_field>

required

You can use any CSS selector to target an element on the page. For example, #story_continues_1 targets an element by ID.

You can also use custom syntax to get content inside attributes. For example, meta keywords are extracted using meta[name=keywords] attr(0,content). This extracts the first value of the keywords meta element. If you don't specify anything after the CSS selector, the text contained in the element is indexed. meta[name=keywords] on its own would return an empty string because a meta element does not contain any text; the keywords are in its attributes. To get the value of a specific attribute, specify attr(<index>,<Solr_field_name>). If you set the index to -1, all attributes are extracted and separated by a semicolon (;).
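As an illustration of how such a mapping value breaks down, the following sketch splits it into the CSS selector, the index, and the second attr argument (content in the example above). It is not the module's own parser; the regular expression and the function name are assumptions made for this example.

```python
import re

# Hypothetical pattern: "<css selector> attr(<index>,<name>)", where the attr part is optional.
_ATTR_SUFFIX = re.compile(
    r"^(?P<selector>.*?)\s+attr\(\s*(?P<index>-?\d+)\s*,\s*(?P<name>[^)\s]+)\s*\)\s*$"
)


def parse_site_field(value):
    """Split a site field value into (selector, index, name).

    index and name are None when no attr(...) suffix is present, which
    corresponds to indexing the text contained in the selected element.
    """
    match = _ATTR_SUFFIX.match(value)
    if match is None:
        return value.strip(), None, None
    return (
        match.group("selector").strip(),
        int(match.group("index")),
        match.group("name"),
    )


keywords = parse_site_field("meta[name=keywords] attr(0,content)")
story = parse_site_field("#story_continues_1")
```

Here keywords carries the selector meta[name=keywords] together with index 0 and the name content, while story carries only a selector, so the element's text would be used.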

jcrItems

optional, since version 3.0

List of JCR items. If any of these items is activated, the crawler is triggered.

<item_name>

optional, since version 3.0

Name of the JCR item.

workspace

required, since version 3.0

Workspace where the JCR item is stored.

path

required, since version 3.0

Path of the JCR item.
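The interplay of jcrItems and delayAfterActivation can be sketched as a simple lookup. This is only an illustration: the item name homePage, the workspace and path values, and the function name are hypothetical, and the sketch assumes an exact match on workspace and path. Whether the module also reacts to activation of descendant nodes is not stated here.

```python
# Hypothetical jcrItems configuration: item name -> workspace and path.
JCR_ITEMS = {
    "homePage": {"workspace": "website", "path": "/home"},
}


def crawl_delay_after_activation(workspace, path, jcr_items, delay_seconds=5):
    """Return the delay in seconds before the crawl starts, or None if no item matches.

    Assumption: an item matches only when both workspace and path are equal.
    """
    for item in jcr_items.values():
        if item["workspace"] == workspace and item["path"] == path:
            return delay_seconds
    return None


delay = crawl_delay_after_activation("website", "/home", JCR_ITEMS)
```

In this sketch, activating /home in the website workspace schedules a crawl after the default 5 seconds, while activating any other item returns None and triggers nothing.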

siteAuthenticationConfig

optional, since version 5.0.2

Authentication information that allows the crawler to crawl a password-restricted area.

username

required, since version 5.0.2

Username used to log in to the restricted area.

password

required, since version 5.0.2

Password used to log in to the restricted area.

loginUrl

required, since version 5.0.2

URL of the page that contains the login form.

usernameField

required, since version 5.0.2, default value is mgnlUserID

Name of the input field for the username in the login form.

passwordField

required, since version 5.0.2, default value is mgnlUserPSWD

Name of the input field for the password in the login form.

logoutUrlIdentifier

required, since version 5.0.2, default value is mgnlLogout

String that identifies the logout URL. To avoid logging itself out, the crawler does not follow URLs that contain logoutUrlIdentifier.
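The authentication properties can be pictured with a short sketch: the configured field names become the keys of the login form data, and the logout identifier acts as a URL filter. The AUTH_CONFIG values, the example URLs, and both function names are hypothetical, and the code makes no claim about how the module actually submits the form.

```python
from urllib.parse import urlencode

# Hypothetical siteAuthenticationConfig values; the field names use the documented defaults.
AUTH_CONFIG = {
    "username": "crawler",
    "password": "s3cret",
    "loginUrl": "https://example.com/login",
    "usernameField": "mgnlUserID",
    "passwordField": "mgnlUserPSWD",
    "logoutUrlIdentifier": "mgnlLogout",
}


def login_form_body(config):
    """Build a URL-encoded form body keyed by the configured input field names."""
    return urlencode(
        {
            config["usernameField"]: config["username"],
            config["passwordField"]: config["password"],
        }
    )


def should_crawl(url, config):
    """Skip any URL containing the logout identifier so the session stays logged in."""
    return config["logoutUrlIdentifier"] not in url


body = login_form_body(AUTH_CONFIG)
```

A plain substring check is the simplest reading of "contains"; it means a link such as https://example.com/members?mgnlLogout=true is skipped, while ordinary pages in the restricted area are still crawled.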
