Draft for 5.2, 5.3

Proposal for cache improvements.

 

The locking was solved in 4.3; the other improvements are still pending.

There are four main areas to study and develop in order to get a simpler and faster caching module:

  1. separate client caching from server-side caching
  2. remove byte arrays and use streams to write to and read from cache elements
  3. synchronize read / write operations at cache element level, not at global cache level
  4. add a global voter

Separate client caching from server-side caching

Split the cache filter into two filters (cacheable resources are the resources that are not bypassed):

  • Headers filter, with a manager (the simple implementation being an in-memory ConcurrentHashMap table) to
    • store response headers
    • apply max-age and Expires (or no-cache), or whatever else (for example ETags)
    • check request headers in order to send SC_NOT_MODIFIED back to the client
  • Content filter, with a manager (the simple implementation being filesystem based) to
    • cache resources by streaming the response (multiplexing streams, see the sketch after this list) to an OutputStream taken from the cache element
    • check for SC_NOT_MODIFIED using the cache element's creation date
    • stream from cache using an InputStream
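
For illustration, the content filter's write path could use a small "tee" stream that writes everything both to the client and to an OutputStream obtained from the cache element. This is only a sketch under the assumptions above; MultiplexOutputStream is a hypothetical name and the cache-element API is not defined here.

import java.io.IOException;
import java.io.OutputStream;

// Writes everything to two targets: the real servlet response and the
// OutputStream of the cache element being filled.
class MultiplexOutputStream extends OutputStream {

    private final OutputStream client;
    private final OutputStream cache;

    MultiplexOutputStream(OutputStream client, OutputStream cache) {
        this.client = client;
        this.cache = cache;
    }

    public void write(int b) throws IOException {
        client.write(b);
        cache.write(b);
    }

    public void write(byte[] b, int off, int len) throws IOException {
        client.write(b, off, len);
        cache.write(b, off, len);
    }

    public void flush() throws IOException {
        client.flush();
        cache.flush();
    }

    public void close() throws IOException {
        client.close();
        cache.close();
    }
}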

Memory consumption optimization

Optimize memory usage by removing the use of byte arrays, both when writing to the cache and when reading from it.
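
Concretely, reading from the cache would then be a plain buffered copy from the cache element's stream to the response, instead of materializing the whole entry as a byte array first. A minimal sketch (class and parameter names are illustrative only):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamedCacheReader {

    // Copies a cached entry to the response with a fixed-size buffer,
    // so memory usage no longer depends on the size of the entry.
    static void copy(InputStream cachedContent, OutputStream response) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = cachedContent.read(buffer)) != -1) {
            response.write(buffer, 0, read);
        }
    }
}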

Cache locking

Use java.util.concurrent.locks.ReentrantReadWriteLock (its ReadLock and WriteLock) to do per-element resource locking.
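
A minimal sketch of what this could look like, assuming one lock per cache key kept in a ConcurrentHashMap (class and method names are illustrative only):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// One ReentrantReadWriteLock per cache key: concurrent readers of the same
// element do not block each other, and a writer only blocks access to that element.
class PerElementLocks {

    private final ConcurrentMap<Object, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<Object, ReentrantReadWriteLock>();

    private ReentrantReadWriteLock lockFor(Object key) {
        ReentrantReadWriteLock lock = locks.get(key);
        if (lock == null) {
            ReentrantReadWriteLock newLock = new ReentrantReadWriteLock();
            ReentrantReadWriteLock existing = locks.putIfAbsent(key, newLock);
            lock = (existing != null) ? existing : newLock;
        }
        return lock;
    }

    void lockForReading(Object key) { lockFor(key).readLock().lock(); }
    void unlockAfterReading(Object key) { lockFor(key).readLock().unlock(); }
    void lockForWriting(Object key) { lockFor(key).writeLock().lock(); }
    void unlockAfterWriting(Object key) { lockFor(key).writeLock().unlock(); }
}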

Remove boilerplate and hide locking concerns

We could think of adding a "getOrCache" method (unconvincing name, to be debated) on the Cache interface, whose implementation could look something like the following (not taking any locking/synchronization issues into account, so this code might not be accurate):

Object getOrCache(Object key, Callback c) {
  Object cached = get(key);
  if (cached == null) {
    // not cached yet: let the callback generate the value, then cache it
    Object value = c.generateCacheValue();
    put(key, value);
    return value;
  } else {
    return cached;
  }
}

interface Callback {
  Object generateCacheValue();
}

... where the Callback interface would be responsible for "generating" the cache value; this new method could then be called like this:

cache.getOrCache(key, new Callback() {
  public Object generateCacheValue() {
    return retrieveValueFromSomeRemoteService();
  }
});

See this diff (this class) for an example of an implementation.

TBH, I wouldn't be surprised if more recent versions of EhCache and other cache libraries had such a construct natively. It seems clean and elegant enough to be used in many cases. Not sure it would work for our page caching, but it most likely would for many other situations where we want to cache "stuff" (I'm using this in the external-indexing module, for example).

edit: looking at the EhCache 2.5 API, it could perhaps indeed be implemented with Cache.putIfAbsent, by passing a subclass of Element whose getValue (or getObjectValue, not sure which) would be lazy. It also has a SelfPopulatingCache class, which might be worth looking into.
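
For reference, a rough sketch of what the SelfPopulatingCache route could look like with the Ehcache 2.x API (the cache name "stuff" and retrieveValueFromSomeRemoteService() are placeholders, as in the example above):

import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;
import net.sf.ehcache.constructs.blocking.CacheEntryFactory;
import net.sf.ehcache.constructs.blocking.SelfPopulatingCache;

class SelfPopulatingExample {

    // Decorates an existing cache; get() builds missing entries via the factory
    // and blocks other threads requesting the same key while doing so.
    private final SelfPopulatingCache cache = new SelfPopulatingCache(
            CacheManager.getInstance().getEhcache("stuff"),
            new CacheEntryFactory() {
                public Object createEntry(Object key) throws Exception {
                    return retrieveValueFromSomeRemoteService(key);
                }
            });

    Object getOrCache(Object key) {
        Element element = cache.get(key); // created on demand if missing
        return element.getObjectValue();
    }

    Object retrieveValueFromSomeRemoteService(Object key) {
        return "value for " + key; // stand-in for the real lookup
    }
}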

Additionally, if feasible, using generics for keys and values in the cache would avoid casting in some cases
(there's possibly going to be an unchecked cast at some point when retrieving the cache instance, but eh).
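
For illustration, a generified version of the interfaces sketched above might look like this (hypothetical; it mostly moves the casting problem to the point where the cache instance is obtained):

// Hypothetical generic variants of the Cache and Callback sketches above.
interface Callback<V> {
    V generateCacheValue();
}

interface Cache<K, V> {
    V get(K key);
    void put(K key, V value);

    // no cast needed at the call site anymore
    V getOrCache(K key, Callback<V> callback);
}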

Caching other objects

See Concept - Cache arbitrary objects.

Global cache voter

MGNLCACHE-37

I.e.

public class AllInOneCacheVoter extends AbstractBoolVoter
{

    private String allowedExtensions;

    private String deniedExtensions;

    private String allowedRequestContentTypes;

    private String deniedRequestContentTypes;

    private String allowedResponseContentTypes;

    private String deniedResponseContentTypes;

    private boolean allowRequestWithParameters = false;

    private boolean allowAdmin = false;

    private boolean allowAuthenticated = false;

    private boolean allowDocroot = true;

    private boolean allowDotResources = true;

    private boolean allowDotMagnolia = false;

    private VoterSet voters = new VoterSet();

    // called by Content2Bean
    public void init()
    {
        if (StringUtils.isNotBlank(allowedExtensions) || StringUtils.isNotBlank(deniedExtensions))
        {
            ExtensionVoter voter = new ExtensionVoter();
            voter.setAllow(allowedExtensions);
            voter.setDeny(deniedExtensions);
            voter.setNot(true);
            voters.addVoter(voter);
        }
        // ... create the other voters similarly and add them to voters
    }

    /**
     * {@inheritDoc}
     */
    @Override
    protected boolean boolVote(Object value)
    {
        return voters.vote(value) == 0;
    }

    // ... getters and setters ...
}
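
Configured via Content2Bean, such a voter could translate into a flat node structure along these lines (node name and values are just examples; the package is still to be decided):

  • allInOneCacheVoter
    • class = AllInOneCacheVoter
    • allowedExtensions = html,css,js
    • deniedExtensions = jsp
    • allowRequestWithParameters = false
    • allowAdmin = false
    • allowAuthenticated = false
    • allowDocroot = true
    • allowDotResources = true
    • allowDotMagnolia = false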

6 Comments

  1. Some thoughts about caching:

    • I think that Google Gears resource caching is intended to be used for offline applications. For caching AdminCentral resources, I think the best approach is to set response headers correctly for client-side caching, and maybe to add the current Magnolia version to resource URLs (e.g. /auth/.resources/4.3-m1/admin-css/admin-all.css, /.magnolia/4.3-m1/pages/javascript.js, ...) together with a virtual URI mapping to forward the requests
    • decoupling server-side caching and client-side caching: the idea is to keep the current cache filter as a starting point for server-side caching (storing and serving resources), and to create a new filter with all the logic and configuration to work with request/response headers for client-side caching
    • simplified cache configuration: I think a first step is to change some voters into flat properties of a new global voter with a simple structure. E.g.:
      • simpleCacheConfigurationVoter
        • class = info.magnolia.voting.voters.SimpleCacheConfigurationVoter
        • allowedExtensions = html,js,css,xml
        • cacheRequestWithParameters = false
        • cacheOnAdmin = false
        • cacheAuthenticated = false
        • urls (contentnode with sub-voters)
          • ...
    • streaming from cache: what about restoring the old simple cache mechanism, and
      • adding some more synchronization stuff
      • adding a first-level in-memory (hashtable) cache
      • adding a simple hit counter to keep the most accessed resources in memory
    • decoupling: having two filters seems to be all right, but maybe the cache filter could also cache the headers (careful, these must be dynamic), as they might be set based on the served content. I am dreaming here about caching-related data in the page properties.
    • simplified cache configuration: don't forget the enabled flag (wink)
  2. About the simplified cache configuration: I have just seen MAGNOLIA-2557 (linked at the top of the page... oops) in which Philipp made the same proposal.

  3. Feel free to edit the page directly.

    Streaming from cache: have a void stream(Object key, OutputStream out) method instead of the current get(Object key). This would allow an implementation to "find" the source based on some condition (if size > X, serve from cache, otherwise serve from fs - or, to talk implementation details, we'd probably have a different CachedEntry which, instead of holding the content, would have a pointer to the fs or some other streamable source).
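
    For illustration, such a streaming API could look roughly like this (hypothetical names; a file-backed entry is shown, an in-memory one would implement the same contract):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Hypothetical streaming-oriented API, as discussed above: instead of get(key)
    // returning the content, the cache writes it directly to the caller's stream.
    interface StreamingCache {
        void put(Object key, CachedEntry entry);
        void stream(Object key, OutputStream out) throws IOException;
    }

    interface CachedEntry {
        void writeTo(OutputStream out) throws IOException;
    }

    // Large entries only keep a pointer to a file (or another streamable source)
    // instead of holding the content itself.
    class FileBackedEntry implements CachedEntry {

        private final File file;

        FileBackedEntry(File file) {
            this.file = file;
        }

        public void writeTo(OutputStream out) throws IOException {
            InputStream in = new FileInputStream(file);
            try {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            } finally {
                in.close();
            }
        }
    }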

    re:configuration: sounds good and similar to what we had in mind indeed. We'll also need to think about update tasks (yay)

    re:decoupling - one reason that might speak against it is that they (I think?) share (or should share?) the configuration. The current cache caches all HTTP headers too, so I'm not sure how that would help? What is the actual problem, other than the complexity (of both the configuration and of the strategy and executors system...)?

    re:google gears: yes, scratch that, I'd just noted this down here a while back when I found out about Gears, but it's not very relevant for us at the moment.

    To reiterate, client headers and streaming are two issues independent from the configuration one, and can/should be solved independently.

  4. I think we now have the pieces together and should rename this concept page and update its content. Then we'll hold a short meeting to finalize the decisions needed for 4.3.

  5. Thanks for updating the page. I totally agree that the mentioned issues must be solved, but I am afraid that rewriting the cache completely would be too drastic a measure. Some short answers follow.

    • client- vs. server-side caching: in many cases the caching itself and the setting of response headers are related
      • use stored information: the cache entry can contain additional information (today we store the headers, the UUID of the content, ...)
      • two configurations: we can have two configurations and two filter instances (server-side, client-side) without rewriting anything
      • dynamic headers: the executor could set some headers dynamically based on the headers present in the cached request (max-age, ...)
      • executors for bypassed requests can be defined as well
    • streaming: I propose to store in the entry, among other data like headers and the UUID, also a pointer to a file (in case the content is bigger than a threshold). We already agreed to add a stream() method to the current API. We should not drop the advantages of having compound keys and entries.
    • synchronization: the current implementation uses per-key locking, so this is not an issue at all
    • simplified configuration: the voter is nice, but I am missing a list of allow/deny URLs and a global enabled flag

    It is possible that the current implementation needs improvement or is too complex, but this can be solved without rewriting everything. Today's solution has laid the groundwork for some future solutions, such as using content information for setting headers. I am thinking about using the template configuration or page properties to decide on caching strategies. Another thing we have is a UUID-to-cache-key mapping, which paves the way for isolated cache flushing (based on a linked graph).