Magnolia 5.6 reached end of life on June 25, 2020. This branch is no longer supported; see the End-of-life policy.
...
| required When a crawler is enabled |
| required Maximum depth of a page, measured in clicks from the root page. Keep this value low, ideally 2 or 3. |
| required Maximum number of crawler threads that crawl a site simultaneously. 2 or 3 is enough. |
| optional, since version 3.0, default value is info.magnolia.module.indexer.crawler.MgnlCrawler Implementation of edu.uci.ics.crawler4j.crawler.WebCrawler that the crawler uses to crawl sites. |
| optional, since version 3.0, default value is content-indexer Name of the catalog where the command resides. |
| optional, since version 3.0, default value is crawlerIndexer Command which is used to instantiate and trigger the Crawler. |
| optional, since version 3.0 If set to true, the crawler is triggered only during activation. No scheduler job is registered for this crawler. The |
| optional, since version 3.0, default value is 5s Delay (in seconds) between the end of activation and the start of the crawler. |
| optional, default is every hour A CRON expression that specifies how often the site will be crawled. CronMaker is a useful tool for building expressions. |
| optional Sets the type of the crawled content such as |
| required List of sites to crawl. For each crawler you can define multiple sites to crawl. |
| required Name of the site. |
| required URL of the site. |
| required Field mappings define how fields parsed from the site pages are mapped to Solr fields. The left side is the Solr field; the right side is the field parsed from the crawled site. |
| required You can use any CSS selector to target an element on the page. For example, You can also use custom syntax to get content inside attributes. For example, meta keywords are extracted using |
| optional, since version 3.0 List of JCR items. If any of these items is activated, the crawler is triggered. |
| optional, since version 3.0 Name of the JCR item. |
| required, since version 3.0 Workspace where the JCR item is stored. |
| required, since version 3.0 Path of the JCR item. |
| optional, since version 5.0.2 Authentication information that allows the crawler to access a password-restricted area. |
| required, since version 5.0.2 Username used to log in to the restricted area. |
| required, since version 5.0.2 Password used to log in to the restricted area. |
| required, since version 5.0.2 URL of the page with the login form. |
| required, since version 5.0.2, default value is mgnlUserID Name of the input field for the username in the login form. |
| required, since version 5.0.2, default value is mgnlUserPSWD Name of the input field for the password in the login form. |
| required, since version 5.0.2, default value is mgnlLogout String that identifies the logout URL. To avoid logging itself out, the crawler skips URLs that contain the logoutUrlIdentifier. |
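
To illustrate how the maximum depth and the logout identifier described in the table interact, here is a minimal Python sketch. It is not the module's implementation: the `LINKS` graph, the `crawl` function and its parameter names are hypothetical, and the sketch assumes the root page counts as depth 0. It walks an in-memory link graph breadth-first, stops following links at the configured depth, and skips any URL containing the logout identifier.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real site pages.
LINKS = {
    "/": ["/news", "/about", "/home?mgnlLogout=true"],
    "/news": ["/news/2020", "/"],
    "/news/2020": ["/news/2020/june"],
    "/news/2020/june": [],
    "/about": [],
}


def crawl(root, max_depth, logout_identifier="mgnlLogout"):
    """Breadth-first walk limited to max_depth clicks from root.

    Assumption: the root page is depth 0. URLs containing
    logout_identifier are never visited, so the session stays logged in.
    """
    visited = [root]
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if logout_identifier in link or link in seen:
                continue  # skip logout URLs and pages already queued
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    print(crawl("/", 2))
```

With a depth of 2, the page `/news/2020/june` (three clicks from the root) is never reached, which is why a low depth keeps crawls short.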
...