The content of this page has been updated for Magnolia CMS 5.6.x and now includes details on how to configure your Jackrabbit DataStore properly, which the previous version did not cover. The previous version is still available through Page History (choose v. 19 in the list).

Scenario

-- Updated use case to demonstrate clustering in Magnolia CMS 5.6.x. Because the Forum module has been deprecated, we switch to the Contacts app and its workspace, which is easier to follow.

We want the two public instances to share the contacts, which are stored in the contacts workspace. Otherwise, we want to keep the content independent.

The Magnolia demo bundle already includes the demo Contacts module with all of its related sample content, app, and configurations.

See: 

A Note on Clustering

-- Thanks to Bradley Andersen for the information provided in this section

We can either cluster, or not cluster. Setting up clustering is harder, but, if we do not cluster, we need to deal with:

  • Synchronization
  • Transactional Activation
  • Sticky Sessions (think PUR module)
  • More things to back up
  • Etc.

On the other hand, clustering introduces some problems:

  • If you use PostgreSQL, the journal can grow to the point it shuts down the DB server
  • It introduces a single point of failure
  • You can't do a rolling update if you only have one DB
  • It does not scale well. A good rule of thumb seems to be that one DB connection is open per JCR workspace. An OOTB configuration has about 30 JCR workspaces, so with more than, say, 4 publics we end up with too many simultaneous DB connections.
  • Each cluster node needs its own (private) file system and search index.

Note that certain things should naturally be clustered (unless we want to create a service to reverse-publish from a public to the author, and then the author to the other publics):

  • User generated content such as comments written by site visitors
  • Public User Accounts
  • Forum Posts

A potential solution for all these issues is Amazon Aurora.

A potential solution to the single point of failure problem is to create a redundant, second Jackrabbit cluster for the content store.

Before setup

Please note that anyone who wants to use clustering has to meet the Jackrabbit requirements below (original link here):

Clustering in Jackrabbit works as follows: content is shared between all cluster nodes.

  • That means all Jackrabbit cluster nodes need access to the same persistent storage (persistence manager, data store, and repository file system).
  • The persistence manager must be clusterable (eg. central database that allows for concurrent access, see PersistenceManagerFAQ); any DataStore (file or DB) is clusterable by its very nature, as they store content by unique hash ids.
  • However, each cluster node needs its own (private) repository directory, including repository.xml file, workspace FileSystem and Search index.
  • Every change made by one cluster node is reported in a journal, which can be either file based or written to some database.

What shall we do

We will use a MySQL database, which supports concurrent access, for our persistence manager. We will also need a shared folder (either NFS or local file system) as our DataStore location. In the end, all clustered Contacts content will be stored in MySQL, and its related binary objects (contact images in this case) will be stored in the shared folder.

Sample MySQL script to create the 'magnolia_cluster' database, create an 'admin' user with the password 'admin' on 'localhost', and grant that user all privileges on the created DB:

CREATE USER 'admin'@'localhost' IDENTIFIED BY 'admin';
CREATE SCHEMA `magnolia_cluster` DEFAULT CHARACTER SET utf8 ;
GRANT ALL PRIVILEGES ON magnolia_cluster.* TO 'admin'@'localhost' WITH GRANT OPTION;

We will configure a clustered repository by turning the Magnolia-provided WEB-INF/config/default/repositories.xml file into a clustered one, and by copying WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-search.xml to WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml to hold its configuration. Note that we keep the existing repository for non-clustered content. This means two repositories will be running at the same time when the Magnolia instance starts.

It is possible to use H2 file system persistence storage for non-cluster repository / content while configuring MySQL database persistence storage for clustered content.


An overview of steps

  1. Configure Magnolia author and public system wide properties in WEB-INF/config/default/magnolia.properties
  2. Configure author and public Jackrabbit repositories in /WEB-INF/config/default/repositories_cluster.xml, which is a copy of the Magnolia-provided /WEB-INF/config/default/repositories.xml
  3. Configure your cluster details in WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml

Magnolia properties

-- For a complete list of all configuration items, see Configuration management.

As mentioned in the prerequisites above, "Each cluster node must have its own repository configuration." So we will use this property to set its repository location:

magnolia.repositories.cluster=${magnolia.home}/repositories_cluster

As with the "magnolia.repositories.jackrabbit.config" configuration item, you also need to provide the location of the cluster configuration file:

magnolia.repositories.jackrabbit.cluster.config=WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml

The following property identifies the instance as a cluster master node. During installation and update, Magnolia bootstraps content only into master nodes. This ensures that other (replica) nodes installed later don't override already bootstrapped content. The default is false. Note that I'm setting it to true on our author instance for demonstration purposes; where to put your master node depends on your own scenario.

magnolia.repositories.jackrabbit.cluster.master=true

repositories.xml

Note that the position of your Repository definition tag in 'repositories.xml' will determine the initialization order of Magnolia CMS repositories. It is recommended to place the clustered repository after the default one so that Magnolia CMS related configurations are initialized first.


  1. add a new repository configuration in .../WEB-INF/config/default/repositories.xml

        <!-- magnolia non-default repository -->
        <Repository name="magnoliacluster" provider="info.magnolia.jackrabbit.ProviderImpl" loadOnStartup="true">
            <param name="configFile" value="${magnolia.repositories.jackrabbit.cluster.config}" />
            <param name="repositoryHome" value="${magnolia.repositories.cluster}" />
            <!-- the default node types are loaded automatically
                <param name="customNodeTypes" value="WEB-INF/config/repo-conf/nodetypes/magnolia_nodetypes.xml" />
            -->
            <param name="contextFactoryClass" value="org.apache.jackrabbit.core.jndi.provider.DummyInitialContextFactory" />
            <param name="providerURL" value="localhost" />
            <param name="bindName" value="cluster-${magnolia.webapp}" />
            <!-- since forum module has been deprecated, we switch to contacts module for demonstration. -->
            <!-- <workspace name="forum" />  -->
            <workspace name="contacts" />
        </Repository>
    
  2. add a mapping to the clustered repository for the workspace to tell the system that this workspace lives in a different repository (the clustered one)

        <RepositoryMapping>
            <Map name="website" repositoryName="magnolia" workspaceName="website" />
            ...
            <!-- since forum module has been deprecated, we switch to contacts module for demonstration. -->
            <!-- <Map name="forum" repositoryName="magnoliacluster" workspaceName="forum" /> -->
            <Map name="contacts" repositoryName="magnoliacluster" workspaceName="contacts" />
        </RepositoryMapping>
    
  3. We already set magnolia.repositories.jackrabbit.cluster.config in magnolia.properties to WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml; however, you can point it to any location in the file system by using an absolute path.

Jackrabbit configuration file

see: http://wiki.apache.org/jackrabbit/Clustering

  1. make a copy of the non-clustering configuration file (jackrabbit-bundle-mysql-search.xml, copied to jackrabbit-bundle-mysql-cluster.xml in this case)
  2. make sure that both the instances use the same underlying database (MySQL magnolia_cluster schema in this case)
    1. Sample MySQL datasource configuration

      <DataSources>
        <DataSource name="magnolia_cluster">
          <param name="driver" value="com.mysql.jdbc.Driver" />
          <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster" />
          <param name="user" value="admin" />
          <param name="password" value="admin" />
          <param name="databaseType" value="mysql"/>
          <param name="validationQuery" value="select 1"/>
        </DataSource>
      </DataSources>
  3. add the cluster configuration to the configuration file

      <Cluster syncDelay="2000" id="mclu1">
        <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
          <param name="revision" value="${rep.home}/revision"/>
          <param name="driver" value="com.mysql.jdbc.Driver"/>
          <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster"/>
          <param name="user" value="admin"/>
          <param name="password" value="admin"/>
          <param name="databaseType" value="mysql"/>
          <param name="schemaObjectPrefix" value="JOURNAL_"/>
        </Journal>
      </Cluster>
  4. Configure the DataStore using your shared folder. This section is important for sharing binary objects among your clustered instances. Note that you can also use a database data store by configuring org.apache.jackrabbit.core.data.db.DbDataStore in the section below. Refer to the Jackrabbit DataStore documentation for more details on limitations, garbage collection, and how it works.

      <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
        <param name="path" value="YOUR_SHARED_CLUSTERED_LOCATION"/>
        <param name="minRecordLength" value="1024"/>
      </DataStore>

Note that 'magnolia.repositories.cluster=${magnolia.home}/repositories_cluster' must point to a different physical location on each of your author and public instances, because Jackrabbit clustering requires that 'each cluster node needs its own (private) repository directory'. However, 'YOUR_SHARED_CLUSTERED_LOCATION' in the FileDataStore configuration must point to the same location on all your instances so that they share their binary data objects. Please don't mix up these two locations; otherwise you will run into trouble when starting the instances.
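If a shared folder is not available, the database-backed alternative mentioned in step 4 could look roughly like the sketch below. The parameter names follow the Jackrabbit DataStore documentation. The connection values simply reuse the magnolia_cluster schema and admin credentials from this page, and maxConnections, copyWhenReading and the empty prefixes are assumed values that you should adjust for your environment.

```xml
<!-- Sketch only: binaries go into the shared MySQL schema instead of a shared folder -->
<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="driver" value="com.mysql.jdbc.Driver"/>
  <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster"/>
  <param name="user" value="admin"/>
  <param name="password" value="admin"/>
  <param name="databaseType" value="mysql"/>
  <!-- same threshold as in the FileDataStore sample -->
  <param name="minRecordLength" value="1024"/>
  <!-- assumed values, not taken from this page -->
  <param name="maxConnections" value="3"/>
  <param name="copyWhenReading" value="true"/>
  <param name="tablePrefix" value=""/>
  <param name="schemaObjectPrefix" value=""/>
</DataStore>
```

As with the FileDataStore, every clustered instance has to point to the same database so that all nodes see the same binaries.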

Set the cluster id

The cluster id identifies the instance and is used to write changes to the journal as well as to load changes from the journal. Make sure this is a unique value and is not shared with the other nodes in the cluster.

The cluster id can be defined either in the properties file (the most convenient way) or in the persistence manager file, in the cluster configuration (both ways are used in the attached files):

  <Cluster id="mclu1" syncDelay="2000">
   ....
  </Cluster>

Setting the cluster id in the properties file saves you from maintaining two persistence manager files that differ only in this one value.

  1. set magnolia.clusterid property in the magnolia.properties file
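As a sketch of the properties-based approach: each instance sets its own value, for example magnolia.clusterid=public1 on one public and magnolia.clusterid=public2 on the other (both names are made-up placeholders), and the shared cluster configuration file then leaves out the id attribute. This assumes that Magnolia passes the property on to Jackrabbit as the node id, so verify the behaviour against your Magnolia version.

```xml
<!-- id attribute left out on purpose; it is assumed to come from magnolia.clusterid -->
<Cluster syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <!-- journal parameters unchanged from the cluster configuration sample -->
  </Journal>
</Cluster>
```

With this layout the same jackrabbit-bundle-mysql-cluster.xml can be deployed to every instance, and only magnolia.properties differs per node.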

Sync Delay

By default, cluster nodes read the journal and update their state every 5 seconds (5000 milliseconds). To use a different value, set the syncDelay attribute in the cluster configuration. syncDelay="2000" means that state is synchronized every 2000 milliseconds.

Subscribers

Make sure that the content is not activated to both the clustered instances.

  • only one subscriber should have a subscription to the clustered workspace(s) in /server/activation/subscribers/xxx/subscriptions

Warning: loading of workspace configuration

Once a workspace has been created, a copy of the Jackrabbit configuration is saved to the workspace folder (workspace.xml):

  • changing the original jackrabbit configuration file won't have any effect
  • changes have to be made in the workspace.xml

Verify your setup

Bring up your instances. Note that the author is the cluster master in this case, so it needs to be installed first.

Then open the Contacts app, for example http://localhost:8080/magnoliaAuthor/.magnolia/admincentral#app:contacts:browser;/:treeview:

Create a test contact and upload an image for it.

Remember to save your changes.

Switch to another instance, open the Contacts app there as well (for example http://localhost:8180/magnoliaPublic/.magnolia/admincentral#app:contacts:browser;/:treeview: ) and make sure the contact you created is there (after syncDelay=2000 milliseconds).

Clean up your Journal

This is important to prevent your database from becoming overloaded or hanging. Thank you, Jordie Diepeveen, for reminding us about this.

The journal can potentially become very large. By default, old revisions are not removed.

We recommend turning on the janitor functionality for clusters:

https://wiki.apache.org/jackrabbit/Clustering#Removing_Old_Revisions

<Cluster ....>
    <Journal ...>
        ......
        <param name="janitorEnabled" value="true" />
    </Journal>
</Cluster>
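The janitor's schedule can also be tuned. The sketch below uses the janitorSleep and janitorFirstRunHourOfDay parameters described on the Jackrabbit clustering page linked above. The values shown are believed to be the defaults (one run per day, first run at 3 am); treat them as assumptions and check them against your Jackrabbit version.

```xml
<Cluster id="mclu1" syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <!-- connection parameters as in the cluster configuration sample earlier on this page -->
    <param name="janitorEnabled" value="true"/>
    <!-- seconds between clean-up runs (assumed: 86400 = 24 hours) -->
    <param name="janitorSleep" value="86400"/>
    <!-- hour of day (0-23) of the first run (assumed: 3) -->
    <param name="janitorFirstRunHourOfDay" value="3"/>
  </Journal>
</Cluster>
```

Keep in mind that the janitor only removes revisions that every registered cluster node has already processed, so a node that has been removed from the cluster for good can keep old revisions from being cleaned up until its entry is removed from the local revisions table.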

Refer to the recommendations in Magnolia Clustering - Cleaning the Jackrabbit journal and Apache Jackrabbit Clustering - Removing Old Revisions for more details.

Have a good day!

16 Comments

  1. With this scenario, there is a "ItemExistsException" on installation.

    • You start public1, it installs everything in its own repo and few thing in the shared forum repo.
    • You start public2, it also triggers its installation, everything is installed in its own repo, but then when the commenting module starts its installation, an ItemExistsException occurs. Commenting wants to bootstrap something in the shared forum repo, but it is already there because it was installed by public1.

    To proceed with the installation of public2, the solution I found is, once public 1 is up and running, I delete everything from the forum workspace and start the installation of public 2. Then the commenting module coming with public 2 will re-bootstrap the deleted items and there is no issue.

    Maybe there is a better way to handle this issue. If someone has a better idea, please share!

     

    But anyway i think it's a conceptual issue (not of commenting, but) of Clustering Magnolia. When many instances share a repo, it's important to handle carefully the concurrent write behavior.

     

     

  2. After configuring the cluster, starting the public instance after the author instance will throw some exceptions:

    ERROR org.apache.jackrabbit.core.query.lucene.SearchIndex: Unable to read revision '7'.

    It looks like this behavior is expected since the two instances are not "synchronized". Next startups should be fine.

    Also, don't forget to configure the public instance in order to NOT bootstrap the samples managed by the shared repository.

    1. Finally someone reply (tongue)

    2. Does this issue still exist in the 5.5.x Core version? I was trying it and I still see the error even after I set cluster.master to true for instance 1 and cluster.master to false for instance 2 of my public facing Magnolia.

    3. Nicolas Barbé and Mohan Sundararajan set 'forceConsistencyCheck' under SearchIndex to 'true' to eliminate the error in your clustered config file ('jackrabbit-bundle-mysql-cluster.xml' for instance)

      <SearchIndex class="info.magnolia.jackrabbit.lucene.SearchIndex">
        ...
        <param name="forceConsistencyCheck" value="true" />
        ...
      </SearchIndex>

      Remember to switch it off after first startup to save resources and boost startup performance.

  3. Scenario: 1 Author instance , 2 public instances

    In case of "ItemExistsException" as Samuel Schmitt's comment we can also pass through this issue by the following.

    Step 1: Start author instance (example: trainingTemplatingAuthor) after starting is completed we will open FORUM app and delete  "pagecomment" forum (path: http://localhost:8080/trainingTemplatingAuthor/.magnolia/admincentral#app:forum:browser;/pagecomments:null:)

    Step 2: Stop Tomcat server and deploy the public instance (example: trainingTemplatingPublic1) after the server completed starting we do the same as Step 1 above.

    Step 3: Stop Tomcat server again and deploy the public instance (example: trainingTemplatingPublic2)

    Completed.

     

    Regards,

  4. I would really like to see an example of setting up a cluster with derby for local development to change a workspace between author and public for e.g. questions, users, etc.

    1. As far as I know, if you use Derby as an embedded database, you will not be able to use its workspaces in a cluster. "Apache Derby doesn't support concurrent access in the embedded mode." see https://wiki.apache.org/jackrabbit/Clustering

      However, H2 is probably another embedded db you should look into, but I have never tested it personally.

        1. Tried it with different setups, but never got clustering working with H2 and file-system only db's (so no mysql storage).

  5. The journal can potentially become very large. By default, old revisions are not removed.

    We recommend turning on the janitor functionality for clusters:

    https://wiki.apache.org/jackrabbit/Clustering#Removing_Old_Revisions

    <Cluster ....>
    	<Journal ...>
    		......
    		<param name="janitorEnabled" value="true" />
    	</Journal>
    </Cluster>
    1. Activating the Janitor function on the journal has quite a few drawbacks that you must be aware of, these are also listed in Jordie's link. I feel that it would be good to add some more details on those drawbacks and how to handle them as part of this article.

      Jordie Diepeveen, can you describe how you are dealing with the listed drawbacks?

  6. After the creation of a cluster of 2 publics, just one of the nodes starts successfully, the others do not start due to the following error:

    Caused by: javax.jcr.RepositoryException: The repository home /magnoliaPublic/repositories_cluster/magnolia appears to be in use since the file 
    named .lock is locked by another process.
    at org.apache.jackrabbit.core.util.RepositoryLock.tryLock(RepositoryLock.java:166) ~[jackrabbit-core-2.12.4.jar:2.12.4]

    Both nodes are deployed on the same machine and both point to the same repositories folder.

    I supposed that cluster configuration would solve this error.

    Is additional configuration necessary? Or can only the workspace folder be shared?

    Thanks in advance

    1. Hello-

      Clustered nodes cannot share the same repositories folder. Each node maintains its own index.

      HTH

  7. Should the FileSystem(configured in jackrabbit xml) be shared among clusters or local to each node?

      <DataSources>
        <DataSource name="magnolia">
            <param name="driver" value="javax.naming.InitialContext"/>
            <param name="url" value="java:comp/env/jdbc/MagnoliaPublic"/>
            <param name="databaseType" value="mysql"/>
        </DataSource>
      </DataSources>

      <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
         <param name="path" value="${rep.home}/repository" />
      </FileSystem>

    https://jackrabbit.apache.org/archive/wiki/JCR/Clustering_115513377.html

    The Jackrabbit documentation refers to having the global FileSystem shared.

  8. Applied the steps with an H2 as the non-clustered db and a postgres as the clustered one.

    Magnolia version used 6.1.

    Works like a charm (smile)