Elastic rewrites code after three-hour cache crash

By Gavin Clarke

January 31, 2019

Elastic has rewritten part of its HA and failover architecture after a three-hour interruption of service and a 20-minute blackout last week.

The host of Elastic Cloud has replaced its TreeCache with a service called TreeWatcher, which it describes as an API-compatible rewrite and drop-in replacement.

According to Elastic, the only difference between the two is how the tree is refreshed.

The rewrite followed the three-hour degradation and outage of a service used by major telcos, corporations and academic computing institutions.

According to Elastic, the way TreeCache was written caused it to mishandle a high number of CPU requests in the service's AWS eu-west-1 (Ireland) region.

TreeCache is an important part of the Elastic failover architecture. It provides a mirror that keeps proxy servers in touch with an Apache ZooKeeper data store in the event of a disconnect.

The proxy servers provide smart routing behind the Elastic Cloud load balancers and in front of an allocation layer.

However, the problem seems to have been unstable connections: as sessions were lost and re-established, TreeCache tried to reconnect to ZooKeeper and, in so doing, overwhelmed the service.

“Proxies acting as ZooKeeper clients using TreeCache started experiencing resource starvation due to refresh requests piling up and leading to out of memory conditions in that layer, resulting in a death spiral that lead to a complete outage,” Elastic said in a report.

TreeWatcher will no longer refresh the tree when a connection is re-established and the existing session is resumed, so it does not flood ZooKeeper with new requests for new sessions.
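
The difference in refresh behaviour can be illustrated with a small simulation. This is a rough sketch only, not Elastic's code: `FakeStore`, `TreeMirror`, `ConnectionEvent` and `simulate` are hypothetical names invented for illustration, no real ZooKeeper client is involved, and the assumption that a full refresh still happens when a session is actually lost is ours rather than something Elastic's report confirms.

```python
from enum import Enum, auto


class ConnectionEvent(Enum):
    """Hypothetical connection events, loosely modelled on a ZooKeeper client."""
    RESUMED_SAME_SESSION = auto()  # connection dropped, but the session survived
    SESSION_LOST = auto()          # session expired and a new one was created


class FakeStore:
    """Stand-in for the ZooKeeper data store; counts full-tree reads."""

    def __init__(self, tree):
        self.tree = dict(tree)
        self.full_reads = 0

    def read_tree(self):
        self.full_reads += 1
        return dict(self.tree)


class TreeMirror:
    """Local mirror of the tree, as held by a proxy (illustrative only)."""

    def __init__(self, store, refresh_on_resume):
        self.store = store
        # True mimics the old behaviour described above; False mimics the new one.
        self.refresh_on_resume = refresh_on_resume
        self.cache = store.read_tree()  # initial population of the mirror

    def on_event(self, event):
        if event is ConnectionEvent.SESSION_LOST:
            # Assumption: a lost session always needs a full refresh.
            self.cache = self.store.read_tree()
        elif event is ConnectionEvent.RESUMED_SAME_SESSION and self.refresh_on_resume:
            # Old behaviour: every reconnect triggers a full refresh.
            self.cache = self.store.read_tree()
        # New behaviour: a resumed session keeps the existing mirror.


def simulate(refresh_on_resume, flaps):
    """Return how many full-tree reads hit the store after `flaps` reconnects."""
    store = FakeStore({"/proxies/a": "up"})
    mirror = TreeMirror(store, refresh_on_resume)
    for _ in range(flaps):
        mirror.on_event(ConnectionEvent.RESUMED_SAME_SESSION)
    return store.full_reads
```

Under these assumptions, a proxy whose connection flaps 50 times issues 51 full-tree reads with refresh-on-resume enabled and a single read with it disabled. Multiplied across many proxies, that is the kind of request pile-up the report describes.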

The mea culpa came just a day after Elastic announced an update of its software stack to v6.6.
