Elastic rewrites code after three-hour cache crash

Elastic has re-written part of its HA and failover architecture after a three-hour interruption of service and a 20-minute blackout last week.

The host of Elastic Cloud has replaced its TreeCache with a service called TreeWatcher, which it reckons is an API-compatible re-write and drop-in replacement.

According to Elastic, the only difference between the two is how the tree is refreshed.

The re-write followed the three-hour service degradation and outage of a service used by major telcos, corporations and academic computing institutions.

According to Elastic, the way it was written saw TreeCache mishandle a high number of CPU requests in the service's AWS eu-west-1 (Ireland) region.

TreeCache is an important part of the Elastic failover architecture. It provides a mirror that keeps proxy servers in touch with an Apache ZooKeeper data store in the event of a disconnect.

The proxy servers provide smart routing behind the Elastic Cloud load balancers and in front of an allocation layer.

However, the problem seems to have been inconsistent connections: as sessions were lost and re-established, TreeCache tried to re-connect to ZooKeeper and, in so doing, overwhelmed the service.

“Proxies acting as ZooKeeper clients using TreeCache started experiencing resource starvation due to refresh requests piling up and leading to out of memory conditions in that layer, resulting in a death spiral that lead to a complete outage,” Elastic said in a report.

TreeWatcher will no longer refresh the tree when a connection is re-established and the existing session is resumed, so it avoids flooding ZooKeeper with a fresh wave of requests on every reconnect.
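To make the difference concrete, here is a minimal Python sketch of the idea, not Elastic's code. The `TreeMirror` class, its `refresh_on_resume` flag and the `simulate` helper are hypothetical names invented for illustration, and the "refresh" is just a counter standing in for re-reading the tree from the data store. It assumes a resumed session can be recognised by an unchanged session ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreeMirror:
    """Hypothetical local mirror of a remote tree (illustrative only)."""
    refresh_on_resume: bool            # True mimics the old refresh-on-every-reconnect behaviour
    session_id: Optional[int] = None
    refresh_count: int = 0

    def on_connected(self, session_id: int) -> None:
        """Called whenever the connection to the data store is (re-)established."""
        resumed = session_id == self.session_id
        self.session_id = session_id
        if resumed and not self.refresh_on_resume:
            # Same session resumed: assume the mirror is still valid and skip the refresh.
            return
        self._refresh_tree()

    def _refresh_tree(self) -> None:
        # Stand-in for re-reading every node from the data store.
        self.refresh_count += 1


def simulate(refresh_on_resume: bool, reconnects: int) -> int:
    """Count full refreshes over a flaky link that keeps resuming the same session."""
    mirror = TreeMirror(refresh_on_resume=refresh_on_resume)
    mirror.on_connected(session_id=1)      # first connection always loads the tree
    for _ in range(reconnects):
        mirror.on_connected(session_id=1)  # reconnect, same session
    return mirror.refresh_count


if __name__ == "__main__":
    print("refresh on every reconnect:", simulate(True, 100))
    print("skip refresh on resumed session:", simulate(False, 100))
```

In this toy model, a flapping connection multiplies refresh work only in the first mode; a genuinely new session still triggers a full refresh in both.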

The mea culpa came just a day after Elastic announced an update of its software stack to v6.6.