Coherence Tips #1 – Be careful using <scheme-ref> and <autostart> in your cache config

This tip is about server-side Coherence cache configuration files and why you need to be careful about using particular elements in scheme configurations. I came across these issues when reviewing a configuration file for a project recently and realised that it is not something that everyone instinctively knows when editing their configuration files.

In your Coherence server-side cache configuration files you need to be particularly careful about where you use the <scheme-ref> and <autostart> configuration elements. I have seen these used in the wrong place in configuration files, which can lead to inconsistent configuration when members start their services.

To get quickly to the point, let's look at a simple example of the problem using the <thread-count> element. If you are using DefaultCacheServer on your server-side nodes then DefaultCacheServer will start any service that has the <autostart> element set to true. It does this by iterating over the schemes in the configuration file in the order they are declared, that is, top to bottom.

So, say you have the following schemes in your cache configuration file:

<caching-schemes>
    <distributed-scheme>
        <scheme-name>scheme-C</scheme-name>
        <scheme-ref>scheme-A</scheme-ref>
        <thread-count>1</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>

    <distributed-scheme>
        <scheme-name>scheme-B</scheme-name>
        <service-name>DistributedServiceB</service-name>
        <thread-count>10</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>

    <distributed-scheme>
        <scheme-name>scheme-A</scheme-name>
        <service-name>DistributedServiceA</service-name>
        <thread-count>10</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>
</caching-schemes>

In the configuration above scheme-C references scheme-A. Scheme-C has autostart set to true and thread-count set to 1. When DefaultCacheServer starts it will start the services in the order they are declared, i.e. scheme-C, scheme-B then scheme-A. As scheme-C references scheme-A its effective configuration becomes this…

    <distributed-scheme>
        <scheme-name>scheme-C</scheme-name>
        <service-name>DistributedServiceA</service-name>
        <scheme-ref>scheme-A</scheme-ref>
        <thread-count>1</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>

…so it will cause DistributedServiceA to start with a thread count of 1. When DefaultCacheServer gets to scheme-A it will see that the service declared, DistributedServiceA, is already running, so it will not restart it. So the question is: what did we intend the thread count of DistributedServiceA to be, 1 or 10?

If we re-ordered the scheme definitions so that scheme-A was declared before scheme-C…

<caching-schemes>
    <distributed-scheme>
        <scheme-name>scheme-B</scheme-name>
        <service-name>DistributedServiceB</service-name>
        <thread-count>10</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>

    <distributed-scheme>
        <scheme-name>scheme-A</scheme-name>
        <service-name>DistributedServiceA</service-name>
        <thread-count>10</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>

    <distributed-scheme>
        <scheme-name>scheme-C</scheme-name>
        <scheme-ref>scheme-A</scheme-ref>
        <thread-count>1</thread-count>
        <autostart>true</autostart>
    </distributed-scheme>
</caching-schemes>

…then we would get DistributedServiceA started with 10 threads. But even then, I have seen cases where during cluster start-up a race condition occurs between DefaultCacheServer in one JVM starting the services and code already running on another cluster member that is further along in its start-up. In this case DefaultCacheServer has joined the cluster and is about to start the autostart services (that is, scheme-A and scheme-C are not yet started) when it receives a message from another node for a cache mapped to scheme-C. The ConfigurableCacheFactory on the starting node will attempt to load and start scheme-C so as to service the incoming request, causing DistributedServiceA to start using the scheme-C configuration with a thread count of 1, whereas the other cluster members have DistributedServiceA with a thread count of 10. The first time I saw this happen in a real system it took a while to realise what the cause was. The symptom was obvious enough: the cluster ran very slowly because all the requests hitting the node with only a single thread were queuing up.
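
For reference, the race described above needs nothing more than an ordinary cache mapping onto scheme-C. A minimal sketch is shown here; the cache name pattern orders-* is purely an assumption for illustration and is not taken from the configuration I reviewed. Any request for a matching cache that arrives before the autostart services are up will resolve scheme-C first, and with it the thread count of 1.

<caching-scheme-mapping>
    <!-- hypothetical cache name pattern, for illustration only -->
    <cache-mapping>
        <cache-name>orders-*</cache-name>
        <scheme-name>scheme-C</scheme-name>
    </cache-mapping>
</caching-scheme-mapping>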

If you are not using DefaultCacheServer to auto-start services then the schemes will be used in the order that your code requests them, for example the order in which you ask for caches, which could lead to all sorts of inconsistent configurations for services.

It might seem obvious to some, but it only makes sense to use certain configuration elements on schemes that have the <service-name> element. There are a number of configuration elements that define settings for the service as a whole, as opposed to those for the specific cache being mapped to the service. In our example we used <thread-count> but there are a number of others depending on the type of scheme. A way to see this is to look in the tangosol-coherence.xml file in the coherence.jar file. Inside the services section are the default settings for various elements for each service type, all of which can also be set in your cache configuration file. All of these elements configure the service as a whole, so if you include them with different values in multiple schemes that all map to the same service name you risk starting the services inconsistently or with a different configuration than the one you intended.

Although it is particularly important to be careful of this in configurations that use <scheme-ref> tags, that is, those that have a lot of scheme inheritance and overriding of values, you also need to be careful in schemes that have neither a <scheme-ref> nor a <service-name> element, as these will be overriding the default Coherence configuration for the corresponding service. For example, every <distributed-scheme> without a <service-name> will ultimately map to the same DistributedCache service. To be safe it might be best to make sure that your schemes that do not reference other schemes have a <service-name> element so they do not default to the base Coherence service for that scheme.
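
As a sketch of what a safer version of the earlier example might look like, the configuration below keeps the service-wide <thread-count> only on scheme-A, the scheme that declares the <service-name>, and leaves scheme-C to override only cache-level settings. The <backing-map-scheme> contents and the <high-units> value are assumptions added for illustration; they are not part of the original example. With this layout, whichever scheme is resolved first, DistributedServiceA should start with 10 threads.

<caching-schemes>
    <distributed-scheme>
        <scheme-name>scheme-A</scheme-name>
        <service-name>DistributedServiceA</service-name>
        <!-- service-level setting: declared once, on the scheme that names the service -->
        <thread-count>10</thread-count>
        <backing-map-scheme>
            <local-scheme/>
        </backing-map-scheme>
        <autostart>true</autostart>
    </distributed-scheme>

    <distributed-scheme>
        <scheme-name>scheme-C</scheme-name>
        <scheme-ref>scheme-A</scheme-ref>
        <!-- cache-level override only; no thread-count or other service-level elements here -->
        <backing-map-scheme>
            <local-scheme>
                <!-- illustrative value, not from the original example -->
                <high-units>1000</high-units>
            </local-scheme>
        </backing-map-scheme>
    </distributed-scheme>
</caching-schemes>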

If you use the Coherence Incubator Commons to allow configuration files to be combined together then things can get more complex as you have a number of files to check.

Since the release of Coherence 3.7 there has been the option to validate the configuration against a schema, but this will not help here: the XML is not actually invalid, it is the meaning of your configuration that is wrong.

Maybe someone should write a nice Coherence configuration visualiser for my IDE. Maybe I’ll do it one day if I ever get time. That would be for IntelliJ though, sorry Eclipse users.

1 Response

  1. Hi Jonathan,

    I agree, scheme composition can definitely get you in trouble if you are not careful.

    The rule I follow is to treat referenced schemes as abstract: they don’t define a service name and they are never autostarted (or started at all), but are simply used to configure the backing map, serializer, and such. No caches should ever map to them either.

    On the other hand, all the schemes that my caches map to *must* have service name defined and autostart set to true, and they *typically* reference one of the abstract template schemes in order to inherit backing map and serializer configuration, for example.

    That approach keeps me out of trouble most of the time ;-)

    Cheers,
    Aleks
