Decreasing data warehouse downtime

SPONSORED FEATURE: When data warehouses were first developed, people used them for decision support – the sort of decisions which were made in boardrooms every month or quarter. Today, they’re made every few milliseconds, which significantly blurs the line between data warehouses and operational systems. In fact, the two are increasingly becoming the same thing, which means our tolerance for data warehouse downtime is decreasing. So how can we minimize it?

This month, Amazon Redshift launched a high availability solution that spans multiple AWS Availability Zones (AZ) in a single AWS region and could help do just that. Developed for the company’s RA3 Redshift clusters, it promises to dramatically reduce the downtime risk for mission-critical workloads on Redshift. We spoke to Saurav Das, Senior Product Manager for Amazon Redshift, to find out how it works.

The mission-critical challenge for data warehouses

There is a risk of outage with every workload. What changes is the client’s tolerance to that risk, based on factors including their size, their use case, and other issues like regulatory liabilities. Many of these workloads are business-critical, says Das; an outage of up to an hour might be an irritation for them, but it won’t cause business operations to fail. The less risk-tolerant workloads are mission critical, he says, explaining that these must recover in tens of seconds rather than tens of minutes to keep operations intact.

In the past, these mission-critical workloads were mainly transactional. An ambulance dispatch system that takes calls and routes available vehicles to an emergency might fall into this category. Analytical workloads were typically less time critical. A financial company might want to crunch the numbers for a business intelligence report overnight before the morning bell rings, but that’s hardly mission-critical.

That’s changing, Das says. “What we are seeing now is with the explosion of data, customers are using more and more data and they want these systems to be available 24×7,” he says. That ambulance dispatch application might now rely on analysing historical and real-time traffic data, along with real-time gasoline levels in individual vehicles. It might use these to determine the best vehicle to attend an emergency and the best route to take so that urgent medical care can be delivered in a timely manner. It might even take historical incident data into account to predict the likely volume and location of emergencies later that evening. That makes analytical systems indispensable.

Amazon Redshift is a cloud-based data warehouse service for analytics workloads that went into general availability ten years ago and serves millions of analytics requests each day. Increasingly, these requests come from clients who would be critically impacted rather than simply inconvenienced if an outage occurred.

Amazon Redshift stores its data in Redshift Managed Storage (RMS), which is backed by Amazon S3 and designed to be highly durable, ensuring zero data loss. Redshift also provides a range of recovery capabilities for failures within an AZ, including automatic backups to recover the data warehouse and automatic remediation of various infrastructure failures, which happens behind the scenes with no customer interaction.

If an entire availability zone goes down, customers can enable Redshift cluster relocation to move their cluster to another AZ without any application changes. Ideally this takes just a few minutes, but it is a best-effort method, subject to capacity constraints that can extend recovery time.
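Relocation is a cluster-level setting that can be turned on programmatically as well as in the console. A minimal boto3 sketch, assuming an existing RA3 cluster (the identifier is a placeholder, and the parameter name should be checked against the current Redshift API reference):

    import boto3

    redshift = boto3.client("redshift")

    # Turn on availability zone relocation for an existing RA3 cluster.
    # The cluster identifier is a placeholder for illustration only.
    redshift.modify_cluster(
        ClusterIdentifier="analytics-ra3-cluster",
        AvailabilityZoneRelocation=True,
    )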

A new high-availability offering

This is where Amazon’s new high availability offering – Multi-AZ deployments – comes into play. Launched this month, it brings mission-critical fail-over capabilities to Redshift clusters.

“Customers with mission-critical workloads are sensitive to infrastructure outages within an AZ. Although these are rare, they do happen, and these customers want protection,” explains Das. “This solution would protect them and quickly recover them from an infrastructure outage within the AZ.”

Amazon has developed a high-availability service that provisions a Redshift RA3 cluster simultaneously in two AZs. This allows the system to fail over automatically without any capacity constraints, says the company, because capacity is already provisioned in the other AZ.

Pre-launch tests found that Redshift Multi-AZ deployments reduce recovery time to under 60 seconds in the unlikely case of an AZ failure, says Das. Amazon Redshift currently provides three nines (99.9%) availability, which translates to roughly 43 minutes of downtime each month, he adds. With Multi-AZ deployments, however, Amazon Redshift offers an order of magnitude greater availability at four nines (99.99%), which caps downtime at around four and a half minutes per month. That all happens with no user intervention.
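Those downtime budgets follow directly from the availability percentages. A quick back-of-the-envelope check in Python, using an average month of roughly 30.44 days:

    # Downtime budget per month for a given availability target.
    minutes_per_month = 365.25 / 12 * 24 * 60   # ~43,830 minutes

    for label, availability in [("99.9%  (three nines)", 0.999),
                                ("99.99% (four nines) ", 0.9999)]:
        budget = minutes_per_month * (1 - availability)
        print(f"{label}: up to {budget:.1f} minutes of downtime per month")

    # 99.9%  (three nines): up to 43.8 minutes of downtime per month
    # 99.99% (four nines) : up to 4.4 minutes of downtime per month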

How did Amazon make this work in the cloud? The hardest part of the setup was ‘heartbeat’ detection, says Das. This detection system checks the infrastructure, collecting data points that tell it whether everything is working properly. When a problem is detected, Redshift Multi-AZ automatically triggers a failover to restore availability. As part of building Multi-AZ, the core detection algorithm for Amazon Redshift was enhanced and subjected to extensive stress and scale testing to support the faster recovery times that mission-critical customer deployments require.
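Amazon hasn't published the internals of that detection algorithm, but the general pattern is familiar: a monitor collects periodic health signals and triggers a failover once enough consecutive checks fail. A purely illustrative sketch of the technique, not Redshift's actual implementation, with hypothetical check_health and trigger_failover stand-ins:

    import time

    # Illustrative heartbeat monitor, not Redshift's implementation.
    # check_health() and trigger_failover() are hypothetical stand-ins
    # for real infrastructure probes and failover actions.
    HEARTBEAT_INTERVAL_S = 5    # how often to probe the primary AZ
    MAX_MISSED_BEATS = 3        # consecutive failures before failing over

    def monitor(check_health, trigger_failover):
        missed = 0
        while True:
            if check_health():           # probe compute, storage and network
                missed = 0
            else:
                missed += 1
                if missed >= MAX_MISSED_BEATS:
                    trigger_failover()   # promote compute in the standby AZ
                    return
            time.sleep(HEARTBEAT_INTERVAL_S)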

More capacity, higher throughput

Another benefit of Multi-AZ deployments is the added throughput that comes with doubling the compute, explains Das. “Often in high availability systems, you have a primary environment and a secondary that is standby usually sitting there not doing anything,” he says. This secondary system is provisioned purely to provide higher availability and is only activated when disaster strikes.

The upside of this active-passive architecture is fast fail-over, because a ‘hot’ machine is already standing by. The downside is that the customer is paying for extra capacity that goes unused almost all the time.

Redshift Multi-AZ delivers higher availability and manages compute resources in both AZs as a single data warehouse that sits behind a single endpoint. Queries are routed in a round-robin fashion to compute resources in both AZs, so each compute resource does half of the work.
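Conceptually, round-robin routing simply alternates incoming queries between the two pools of compute. A toy sketch of the idea, with hypothetical pool names and a stand-in query runner rather than the real Redshift router:

    from itertools import cycle

    # Toy illustration of round-robin query routing across compute in two AZs.
    # Pool names and run_query() are placeholders, not Redshift internals.
    compute_pools = cycle(["az-1-compute", "az-2-compute"])

    def run_query(target, sql):
        # Stand-in for dispatching the query to the chosen compute pool.
        print(f"running on {target}: {sql}")

    def route(sql):
        run_query(next(compute_pools), sql)  # alternate AZs; each does ~half the work

    route("SELECT count(*) FROM orders")   # -> az-1-compute
    route("SELECT count(*) FROM returns")  # -> az-2-compute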

“All this hardware gets used,” Das affirms. “So, you’re not only getting high availability, but also higher throughput.”

Customers choosing the new option should do so primarily for the high-availability benefits, Das says, but the higher throughput is a welcome bonus. Companies will pay the higher compute costs that come with running two clusters in different AZs, but they won’t need to pay for additional storage, as it is shared. That’s because RA3 clusters store their data in Redshift Managed Storage (RMS), allowing customers to scale and pay for compute and storage independently.

RMS uses a write-through protocol, meaning that every write is committed to S3 storage, and data is automatically replicated across all AWS Availability Zones within an AWS region. This works at a regional level, although data wouldn’t be available in the event of a region-wide outage. To guard against that, you’d need to use Redshift’s cross-region snapshot copy to replicate snapshots of your cluster to another AWS region. But for companies working in a single region, this is still a big win.
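Cross-region snapshot copy is enabled per cluster. A minimal boto3 sketch, assuming a cluster in us-east-1 copying its snapshots to us-west-2 (identifiers and regions are placeholders):

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Copy this cluster's automated snapshots to a second region for
    # disaster recovery. Identifiers and regions are placeholders.
    redshift.enable_snapshot_copy(
        ClusterIdentifier="analytics-ra3-cluster",
        DestinationRegion="us-west-2",
        RetentionPeriod=7,   # days to keep copied snapshots in the target region
    )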

The advantage of this separate compute and storage cost model becomes increasingly apparent as data sets grow (and they can scale up to petabytes in size under RMS). Paying only once for storage makes this solution more cost-efficient as the data warehouse becomes larger, says Amazon.
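A rough illustration makes the point. The hourly and per-terabyte rates below are hypothetical placeholders, not AWS pricing; the shape of the result is what matters: compute is paid twice, storage once, so the premium for Multi-AZ shrinks as the data grows.

    # Hypothetical rates purely for illustration; not AWS pricing.
    COMPUTE_PER_NODE_HOUR = 3.0    # placeholder $ per node-hour
    STORAGE_PER_TB_MONTH = 24.0    # placeholder $ per TB-month
    HOURS_PER_MONTH = 730

    def monthly_cost(nodes, tb, multi_az):
        compute = nodes * COMPUTE_PER_NODE_HOUR * HOURS_PER_MONTH
        compute *= 2 if multi_az else 1       # compute doubles with Multi-AZ
        storage = tb * STORAGE_PER_TB_MONTH   # RMS storage is paid only once
        return compute + storage

    for tb in (10, 100, 1000):
        single = monthly_cost(4, tb, multi_az=False)
        multi = monthly_cost(4, tb, multi_az=True)
        print(f"{tb:>5} TB: Multi-AZ costs {multi / single:.2f}x the single-AZ bill")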

Use cases and applications

This high-availability, high-throughput option is excellent for high-concurrency, high-read workloads, explains Das. “One example is dashboard-type workloads where there are a lot of ad hoc queries that spike within a certain period of time and need to execute really quickly,” he says. “In that case, this extra throughput matters because all queries are getting executed simultaneously.”

He gives fraud detection in financial applications as another likely candidate. “Their solution needs to run at all times,” he adds. “It just doesn’t go down because they’re trying to detect fraud and it’s constantly churning data.” Other applications could include fleet management, where critical deliveries (or emergency services visits) must be made as quickly and efficiently as possible.

Customers can enable the new high-availability feature via the console or via an AWS API, in three ways. The first simply involves selecting the Multi-AZ option when you create a new RA3 cluster. In the second, you convert an existing RA3 cluster from single AZ to Multi-AZ by selecting the Multi-AZ option. Finally, you can restore an existing snapshot from an RA3 cluster or Redshift Serverless as you normally would, but do so as a Multi-AZ cluster, transforming it on the fly.
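A boto3 sketch of those three paths might look like the following. All identifiers and credentials are placeholders, and the Multi-AZ parameter name is an assumption that should be checked against the current Redshift API reference:

    import boto3

    redshift = boto3.client("redshift")

    # 1. Create a new RA3 cluster as Multi-AZ from the start.
    redshift.create_cluster(
        ClusterIdentifier="analytics-ra3-multiaz",
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
        MasterUsername="admin",
        MasterUserPassword="Placeholder-Passw0rd",   # placeholder credential
        MultiAZ=True,                                # assumed parameter name
    )

    # 2. Convert an existing single-AZ RA3 cluster to Multi-AZ.
    redshift.modify_cluster(
        ClusterIdentifier="analytics-ra3-cluster",
        MultiAZ=True,
    )

    # 3. Restore an existing snapshot as a Multi-AZ cluster.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier="analytics-ra3-restored",
        SnapshotIdentifier="analytics-snapshot-2023-01-01",
        MultiAZ=True,
    )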

This new capability promises to give customers the best of both worlds: high availability and high throughput, with a corresponding increase in compute costs but not storage costs. AWS has already been working with customers who have piloted the service in practice, with impressive results. As data warehouses take on more mission-critical operations, the company is working hard to keep customers ahead of the game.

Sponsored by AWS.