The content of this page has been updated to comply with Magnolia CMS 5.6.x and to add details on how to configure the Jackrabbit Data Store properly, which the previous version did not cover. The previous version is still available through Page History: choose v. 19 in the list.
Scenario
-- Updated use case to demonstrate clustering in Magnolia CMS 5.6.x. Since the Forum module has been deprecated, we switch to the Contacts app and its workspace, which makes the example easier to follow.
We want the two public instances to share the contacts, which are stored in the contacts workspace. Otherwise we want to keep the content independent.
The Magnolia demo bundle already includes the demo Contacts module with all of its related sample content, app, and configuration.
See:
A Note on Clustering
-- Thanks to Bradley Andersen for the information provided in this section
We can either cluster, or not cluster. Setting up clustering is harder, but, if we do not cluster, we need to deal with:
- Synchronization
- Transactional Activation
- Sticky Sessions (think PUR module)
- More things to back up
- Etc.
On the other hand, clustering introduces some problems:
- If you use PostgreSQL, the journal can grow to the point it shuts down the DB server
- It introduces a single point of failure
- You can't do a rolling update if you only have one DB
- Does not scale - a good rule of thumb seems to be: one DB connection per JCR workspace is open. In an OOTB configuration, there are about 30 JCR workspaces. If we're above, say, 4 publics, we actually have too many simultaneous DB connections.
- Each cluster node needs its own (private) file system and search index.
Note that certain things should naturally be clustered (unless we want to create a service to reverse-publish from a public to the author, and then the author to the other publics):
- User generated content such as comments written by site visitors
- Public User Accounts
- Forum Posts
A potential solution for all these issues is Amazon Aurora.
A potential solution to the single point of failure problem is: create a redundant, second Jackrabbit cluster to avoid single point of failure in the content store.
Before setup
Please note that anyone who wants to use clustering has to meet the Jackrabbit requirements below (original link here):
Clustering in Jackrabbit works as follows: content is shared between all cluster nodes.
- That means all Jackrabbit cluster nodes need access to the same persistent storage (persistence manager, data store, and repository file system).
- The persistence manager must be clusterable (eg. central database that allows for concurrent access, see PersistenceManagerFAQ); any DataStore (file or DB) is clusterable by its very nature, as they store content by unique hash ids.
- However, each cluster node needs its own (private) repository directory, including repository.xml file, workspace FileSystem and Search index.
- Every change made by one cluster node is reported in a journal, which can be either file based or written to some database.
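For illustration, the file-based journal mentioned in the last requirement could be configured roughly as follows. This is a minimal sketch based on the Jackrabbit clustering documentation, not the setup used on this page (which uses a database journal). The id value and the directory path are placeholders, and the directory is assumed to be a shared location that every cluster node can read and write.

```xml
<!-- Sketch only: file-based journal variant. "node1" and the directory path are placeholders. -->
<Cluster id="node1" syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
    <!-- private to each node: remembers the last revision this node has processed -->
    <param name="revision" value="${rep.home}/revision.log" />
    <!-- shared by all nodes: the journal itself -->
    <param name="directory" value="/path/to/shared/journal" />
  </Journal>
</Cluster>
```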
What shall we do
We will use a MySQL database, which supports concurrent access, for our persistence manager. We will also need a shared folder (either NFS or local file system) as our DataStore location. In the end, all clustered Contacts content will be stored in MySQL, and its related binary objects (contact images in this case) will be stored in the shared folder.
Sample MySQL script to create the 'magnolia_cluster' database, create the 'admin' user with password 'admin' on 'localhost', and grant that user all permissions on the created database:
CREATE USER 'admin'@'localhost' IDENTIFIED BY 'admin';
CREATE SCHEMA `magnolia_cluster` DEFAULT CHARACTER SET utf8;
GRANT ALL PRIVILEGES ON magnolia_cluster.* TO 'admin'@'localhost' WITH GRANT OPTION;
We will configure a clustered repository by editing the Magnolia-provided WEB-INF/config/default/repositories.xml file and copying WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-search.xml to WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml for its configuration. Note that we keep the existing repository for non-clustered content. This means two repositories will be running at the same time when the Magnolia instance starts.
It is possible to use H2 file-based persistence for the non-clustered repository and content while using MySQL database persistence for the clustered content.
An overview of steps
- Configure Magnolia author and public system wide properties in WEB-INF/config/default/magnolia.properties
- Configure author and public Jackrabbit repositories in /WEB-INF/config/default/repositories_cluster.xml, which is a copy of the Magnolia-provided /WEB-INF/config/default/repositories.xml
- Configure your cluster details in WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml
Magnolia properties
-- See Configuration management for a complete list of all configuration items.
As mentioned in the prerequisites above, each cluster node needs its own (private) repository directory, so we use this property to set the location of the clustered repository:
magnolia.repositories.cluster=${magnolia.home}/repositories_cluster
Just as with the "magnolia.repositories.jackrabbit.config" configuration item, you are also expected to provide the location of the cluster configuration file:
magnolia.repositories.jackrabbit.cluster.config=WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml
The following property identifies the instance as a cluster master node. During installation and update, Magnolia bootstraps content only into master nodes. This ensures that other (replica) nodes installed later don't override already bootstrapped content. The default is false. Note that I'm setting it to true in our author instance for demonstration purposes; you will have to decide where to put your master node based on your own scenario.
magnolia.repositories.jackrabbit.cluster.master=true
repositories.xml
Note that the position of your Repository definition tag in 'repositories.xml' will determine the initialization order of the Magnolia CMS repositories. We recommend placing the clustered repository after the default one so that the Magnolia CMS configuration is initialized first.
Add a new repository configuration in .../WEB-INF/config/default/repositories.xml
<!-- magnolia non-default repository -->
<Repository name="magnoliacluster" provider="info.magnolia.jackrabbit.ProviderImpl" loadOnStartup="true">
  <param name="configFile" value="${magnolia.repositories.jackrabbit.cluster.config}" />
  <param name="repositoryHome" value="${magnolia.repositories.cluster}" />
  <!-- the default node types are loaded automatically
  <param name="customNodeTypes" value="WEB-INF/config/repo-conf/nodetypes/magnolia_nodetypes.xml" />
  -->
  <param name="contextFactoryClass" value="org.apache.jackrabbit.core.jndi.provider.DummyInitialContextFactory" />
  <param name="providerURL" value="localhost" />
  <param name="bindName" value="cluster-${magnolia.webapp}" />
  <!-- since forum module has been deprecated, we switch to contacts module for demonstration. -->
  <!-- <workspace name="forum" /> -->
  <workspace name="contacts" />
</Repository>
add a mapping to the clustered repository for the workspace to tell the system that this workspace lives in a different repository (the clustered one)
<RepositoryMapping>
  <Map name="website" repositoryName="magnolia" workspaceName="website" />
  ...
  <!-- since forum module has been deprecated, we switch to contacts module for demonstration. -->
  <!-- <Map name="forum" repositoryName="magnoliacluster" workspaceName="forum" /> -->
  <Map name="contacts" repositoryName="magnoliacluster" workspaceName="contacts" />
</RepositoryMapping>
- We already set magnolia.repositories.jackrabbit.cluster.config in magnolia.properties to WEB-INF/config/repo-conf/jackrabbit-bundle-mysql-cluster.xml. However, you can use any folder in the file system by giving an absolute path.
Jackrabbit configuration file
see: http://wiki.apache.org/jackrabbit/Clustering
- make a copy of the non-clustering configuration file (jackrabbit-bundle-mysql-search.xml, copied to jackrabbit-bundle-mysql-cluster.xml in this case)
- make sure that both the instances use the same underlying database (MySQL magnolia_cluster schema in this case)
Sample MySQL datasource configuration
<DataSources>
  <DataSource name="magnolia_cluster">
    <param name="driver" value="com.mysql.jdbc.Driver" />
    <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster" />
    <param name="user" value="admin" />
    <param name="password" value="admin" />
    <param name="databaseType" value="mysql"/>
    <param name="validationQuery" value="select 1"/>
  </DataSource>
</DataSources>
add the cluster configuration to the configuration file
<Cluster syncDelay="2000" id="mclu1">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="revision" value="${rep.home}/revision"/>
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster"/>
    <param name="user" value="admin"/>
    <param name="password" value="admin"/>
    <param name="databaseType" value="mysql"/>
    <param name="schemaObjectPrefix" value="JOURNAL_"/>
  </Journal>
</Cluster>
Configure the DataStore to use your shared folder. This section is important for sharing binary objects among your clustered instances. Note that you can also use a database data store by configuring org.apache.jackrabbit.core.data.db.DbDataStore in the section below. Refer to the Jackrabbit DataStore documentation for more details on limitations, garbage collection, and how it works.
<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
  <param name="path" value="YOUR_SHARED_CLUSTERED_LOCATION"/>
  <param name="minRecordLength" value="1024"/>
</DataStore>
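If you prefer the database data store mentioned above, the DataStore element might look roughly like the sketch below. It reuses the MySQL connection values from this page. The maxConnections and copyWhenReading values are taken from the example in the Jackrabbit DataStore documentation and are assumptions here, not tuned recommendations. This variant was not tested as part of this tutorial.

```xml
<!-- Sketch only: database-backed data store instead of FileDataStore -->
<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster"/>
  <param name="user" value="admin"/>
  <param name="password" value="admin"/>
  <param name="databaseType" value="mysql"/>
  <param name="driver" value="com.mysql.jdbc.Driver"/>
  <!-- binaries smaller than this many bytes are kept inline instead of in the data store -->
  <param name="minRecordLength" value="1024"/>
  <!-- assumed values, copied from the Jackrabbit documentation example -->
  <param name="maxConnections" value="3"/>
  <param name="copyWhenReading" value="true"/>
</DataStore>
```

With this variant the binaries live in the same shared database as the clustered content, so no shared folder is needed.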
Note that 'magnolia.repositories.cluster=${magnolia.home}/repositories_cluster' must point to a different physical location on each of your author and public instances, because of the Jackrabbit clustering requirement that 'each cluster node needs its own (private) repository directory'. In contrast, 'YOUR_SHARED_CLUSTERED_LOCATION' in the FileDataStore configuration must point to the same location on all your instances so that they share their binary data objects. Don't mix up these two settings, otherwise you will run into trouble when starting the instances.
Set the cluster id
The cluster id identifies the instance and is used to write changes to the journal as well as to load changes from the journal. Make sure this is a unique value and is not shared with the other nodes in the cluster.
Cluster id can be defined either in the properties file (most convenient way) or in the persistence manager in the cluster configuration (both ways are used in the attached files):
<Cluster id="mclu1" syncDelay="2000"> .... </Cluster>
Setting the cluster id in the properties file saves you from maintaining two different persistence manager files that differ only in this one value.
- set magnolia.clusterid property in the magnolia.properties file
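As a sketch of the properties-file approach: each instance would carry a line such as magnolia.clusterid=public1 in its magnolia.properties (the value public1 is hypothetical), and the Cluster element in the shared Jackrabbit configuration file would then omit the id attribute. This assumes the property is picked up when the attribute is absent; check it against your Magnolia version before relying on it.

```xml
<!-- Sketch only: id attribute omitted, assumed to be supplied per instance by magnolia.clusterid -->
<Cluster syncDelay="2000">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="revision" value="${rep.home}/revision"/>
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster"/>
    <param name="user" value="admin"/>
    <param name="password" value="admin"/>
    <param name="databaseType" value="mysql"/>
    <param name="schemaObjectPrefix" value="JOURNAL_"/>
  </Journal>
</Cluster>
```

The same file can then be deployed unchanged to every node, with only magnolia.properties differing between instances.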
Sync Delay
By default, cluster nodes read the journal and update their state every 5 seconds (5000 milliseconds). To use a different value, set the syncDelay attribute in the cluster configuration. syncDelay="2000" means that state is synchronized every 2000 milliseconds.
Subscribers
Make sure that clustered content is not activated to more than one of the clustered instances.
- only one subscriber should have a subscription to the clustered workspace(s) in /server/activation/subscribers/xxx/subscriptions
Warning: loading of workspace configuration
Once a workspace has been created a copy of jackrabbit configuration is saved to the workspace folder (workspace.xml)
- changing the original jackrabbit configuration file won't have any effect
- changes have to be made in the workspace.xml
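To locate the workspace.xml files, look at the Workspaces element of the Jackrabbit configuration file. The sketch below shows the usual shape of that section in a Jackrabbit configuration; the rootPath value is an assumption, so check your own jackrabbit-bundle-mysql-cluster.xml for the actual value. The Workspace element acts as a template that is copied to rootPath/(workspace name)/workspace.xml when a workspace is first created.

```xml
<!-- Sketch only: the rootPath value is an assumption, verify it in your own configuration file -->
<Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default" />

<!-- Template: copied once per workspace into workspace.xml, then no longer consulted for that workspace -->
<Workspace name="${wsp.name}">
  <!-- FileSystem, PersistenceManager and SearchIndex definitions go here -->
</Workspace>
```

Under these assumptions, the copy for the contacts workspace would sit under the clustered repository home set by magnolia.repositories.cluster, in workspaces/contacts/workspace.xml.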
Verify your setup
Bring up your instances. Note that the author instance is the cluster master in this case, so it needs to be installed first.
Then open the Contacts app (for example http://localhost:8080/magnoliaAuthor/.magnolia/admincentral#app:contacts:browser;/:treeview:)
Create a test contact and upload an image for it.
Remember to save your changes.
Switch to another instance, open the Contacts app there as well (for example http://localhost:8180/magnoliaPublic/.magnolia/admincentral#app:contacts:browser;/:treeview: ) and make sure that the contact you created is there (after the syncDelay of 2000 milliseconds).
Clean up your Journal
This is important to prevent your database from becoming overloaded or hanging. Thank you, Jordie Diepeveen, for reminding us about this.
The journal can potentially become very large. By default, old revisions are not removed.
We recommend turning on the janitor functionality for clusters:
https://wiki.apache.org/jackrabbit/Clustering#Removing_Old_Revisions
<Cluster ....>
  <Journal ...>
    ......
    <param name="janitorEnabled" value="true" />
  </Journal>
</Cluster>
Refer to the Magnolia Clustering - Cleaning the Jackrabbit journal and Apache Jackrabbit Clustering - Removing Old Revisions recommendations for more details.
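For reference, the Jackrabbit clustering documentation lists two further janitor parameters next to janitorEnabled. The sketch below adds them to the journal configuration used on this page, with the values that documentation gives as defaults; treat them as a starting point rather than a recommendation.

```xml
<!-- Sketch only: janitor parameters with the defaults stated in the Jackrabbit clustering documentation -->
<Cluster syncDelay="2000" id="mclu1">
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="revision" value="${rep.home}/revision"/>
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://localhost:3306/magnolia_cluster"/>
    <param name="user" value="admin"/>
    <param name="password" value="admin"/>
    <param name="databaseType" value="mysql"/>
    <param name="schemaObjectPrefix" value="JOURNAL_"/>
    <!-- enable the clean-up thread (off by default) -->
    <param name="janitorEnabled" value="true"/>
    <!-- seconds between clean-up runs: 86400 = once a day -->
    <param name="janitorSleep" value="86400"/>
    <!-- hour of day of the first run: 3 = 03:00 -->
    <param name="janitorFirstRunHourOfDay" value="3"/>
  </Journal>
</Cluster>
```

The same documentation notes that enabling the janitor makes it harder to add new cluster nodes later, and that every node should have written its local revision to the database before the first clean-up run, so plan the first run accordingly.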
Have a good day!
16 Comments
Samuel Schmitt
With this scenario, there is a "ItemExistsException" on installation.
To proceed with the installation of public2, the solution I found is, once public 1 is up and running, I delete everything from the forum workspace and start the installation of public 2. Then the commenting module coming with public 2 will re-bootstrap the deleted items and there is no issue.
Maybe there is a better way to handle this issue. If someone has a better idea, thanks for sharing!
But anyway I think it's a conceptual issue (not of commenting, but) of clustering Magnolia. When many instances share a repo, it's important to handle concurrent writes carefully.
Magnolia International
After configuring the cluster, starting the public instance after the author instance will throw some exceptions:
ERROR org.apache.jackrabbit.core.query.lucene.SearchIndex: Unable to read revision '7'.
It looks like this behavior is expected since the two instances are not "synchronized". Next startups should be fine.
Also, don't forget to configure the public instance in order to NOT bootstrap the samples managed by the shared repository.
Samuel Schmitt
Finally someone reply
Mohan Sundararajan
Does this issue still exist in the 5.5.x Core version? I was trying it and I still see the error even after I set cluster.master to true for instance 1 and cluster.master to false for instance 2 of my public-facing Magnolia.
Viet Nguyen
Nicolas Barbé and Mohan Sundararajan set 'forceConsistencyCheck' under SearchIndex to 'true' to eliminate the error in your clustered config file ('jackrabbit-bundle-mysql-cluster.xml' for instance)
Remember to switch it off after first startup to save resources and boost startup performance.
Training Participants - FullStack Developer
Scenario: 1 Author instance , 2 public instances
In case of "ItemExistsException" as Samuel Schmitt's comment we can also pass through this issue by the following.
Step 1: Start author instance (example: trainingTemplatingAuthor) after starting is completed we will open FORUM app and delete "pagecomment" forum (path: http://localhost:8080/trainingTemplatingAuthor/.magnolia/admincentral#app:forum:browser;/pagecomments:null:)
Step 2: Stop Tomcat server and deploy the public instance (example: trainingTemplatingPublic1) after the server completed starting we do the same as Step 1 above.
Step 3: Stop Tomcat server again and deploy the public instance (example: trainingTemplatingPublic2)
Completed.
Regards,
Jordie Diepeveen
I would really like to see an example of setting up a cluster with Derby for local development to change a workspace between author and public for e.g. questions, users, etc.
Magnolia International
As far as I know, if you use Derby as an embedded database, you will not be able to use its workspaces in a cluster. "Apache Derby doesn't support concurrent access in the embedded mode." see https://wiki.apache.org/jackrabbit/Clustering
However, H2 is probably another embedded db you should look into, but I have never tested it personally.
Bradley Andersen
looks like in some way it is possible with H2: http://h2database.com/html/advanced.html#clustering
Jordie Diepeveen
Tried it with different setups, but never got clustering working with H2 and file-system only db's (so no mysql storage).
Jordie Diepeveen
The journal can potentially become very large. By default, old revisions are not removed.
We recommend turning on the janitor functionality for clusters:
https://wiki.apache.org/jackrabbit/Clustering#Removing_Old_Revisions
Niels Hardeman
Activating the Janitor function on the journal has quite a few drawbacks that you must be aware of, these are also listed in Jordie's link. I feel that it would be good to add some more details on those drawbacks and how to handle them as part of this article.
Jordie Diepeveen, can you describe how you are dealing with the listed drawbacks?
Joaquin Alfaro
After the creation of a cluster of 2 publics, just one of the nodes starts successfully, the others do not start due to the following error:
Both nodes are deployed on the same machine and both point to the same repositories folder.
I supposed that the cluster configuration would solve this error.
Is additional configuration necessary, or can only the workspace folder be shared?
Thanks in advance
Richard Gange
Hello-
Clustered nodes cannot share the same repositories folder. Each node maintains its own index.
HTH
Malathy Sampath
Should the FileSystem(configured in jackrabbit xml) be shared among clusters or local to each node?
<DataSources>
<DataSource name="magnolia">
<param name="driver" value="javax.naming.InitialContext"/>
<param name="url" value="java:comp/env/jdbc/MagnoliaPublic"/>
<param name="databaseType" value="mysql"/>
</DataSource>
</DataSources>
<FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
<param name="path" value="${rep.home}/repository" />
</FileSystem>
https://jackrabbit.apache.org/archive/wiki/JCR/Clustering_115513377.html
The jackrabbit documentation refers to have Global filesystem shared.
Konstantinos Christodoulou
Applied the steps with an H2 as the non-clustered db and a postgres as the clustered one.
Magnolia version used 6.1.
Works like a charm