SPONSORED FEATURE: When data warehouses were first developed, people used them for decision support – the sort of decisions made in boardrooms every month or quarter. Today, decisions are made every few milliseconds, which blurs the line between data warehouses and operational systems. In fact, the two are increasingly becoming the same thing, which means our tolerance for data warehouse downtime is decreasing. So how can we minimize it?
This month, Amazon Redshift launched a high-availability solution that spans multiple AWS Availability Zones (AZs) within a single AWS region and could help do just that. Developed for the company’s RA3 Redshift clusters, it promises to dramatically reduce downtime risk for mission-critical workloads on Redshift. We spoke to Saurav Das, Senior Product Manager for Amazon Redshift, to find out how it works.
The mission-critical challenge for data warehouses
There is a risk of outage with every workload. What changes is the client’s tolerance for that risk, based on factors including their size, their use case, and issues such as regulatory liabilities. Many of these workloads are business-critical, says Das; an outage of up to an hour might be an irritation, but it won’t cause business operations to fail. The less risk-tolerant workloads are mission-critical, he says, explaining that these must recover in tens of seconds rather than tens of minutes to keep operations intact.
In the past, these mission-critical workloads were mainly transactional. An ambulance dispatch system that takes calls and routes available vehicles to an emergency might fall into this category. Analytical workloads were typically less time critical. A financial company might want to crunch the numbers for a business intelligence report overnight before the morning bell rings, but that’s hardly mission-critical.
That’s changing, Das says. “What we are seeing now is with the explosion of data, customers are using more and more data and they want these systems to be available 24×7,” he says. That ambulance dispatch application might now rely on analyzing historical and real-time traffic data, along with real-time gasoline levels in individual vehicles. It might use these to determine the best vehicle to attend an emergency and the best route to take so that urgent medical care can be delivered in a timely manner. It might even take historical incident data into account to predict the likely volume and location of emergencies later that evening. That makes analytical systems indispensable.
Amazon Redshift is a cloud-based data warehouse service for analytics workloads that went into general availability ten years ago and serves millions of analytics requests each day. Increasingly, these requests come from clients who would be critically impacted, rather than simply inconvenienced, if an outage occurred.
Amazon Redshift stores its data in Redshift Managed Storage (RMS), which is backed by Amazon S3 and designed to be highly durable, ensuring zero data loss. Redshift also provides many recovery capabilities for failures within an AZ, including automatic backups to recover the data warehouse and auto-remediation for various infrastructure failures that happen behind the scenes with no customer interaction.
If an entire Availability Zone goes down, customers can enable Redshift cluster relocation to move their cluster to another AZ without any application changes. Ideally this takes just a few minutes, but it is a best-effort process, subject to capacity constraints that can extend recovery time.
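Relocation is an opt-in setting on the cluster, and can be switched on programmatically as well as from the console. Below is a minimal sketch using boto3, assuming the AvailabilityZoneRelocation parameter of the ModifyCluster call; the cluster identifier and region are placeholders.

```python
import boto3

# Hedged sketch: enable best-effort cluster relocation on an existing RA3 cluster.
# "my-ra3-cluster" and the region are placeholders.
redshift = boto3.client("redshift", region_name="us-east-1")

redshift.modify_cluster(
    ClusterIdentifier="my-ra3-cluster",
    AvailabilityZoneRelocation=True,  # allow Redshift to move the cluster to another AZ if needed
)
```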
A new high-availability offering
This is where Amazon’s new high-availability offering – Multi-AZ deployments – comes into play. Launched this month, it brings mission-critical failover capabilities to Redshift clusters.
“Customers with mission-critical workloads are sensitive to infrastructure outages within an AZ. Although these are rare, they do happen, and these customers want protection,” explains Das. “This solution would protect them and quickly recover them from an infrastructure outage within the AZ.”
Amazon has developed a high-availability service that provisions a Redshift RA3 cluster simultaneously in two AZs. This allows the system to fail over automatically without any capacity constraints, says the company, because capacity is already provisioned in the other AZ.
Pre-launch tests found that Redshift Multi-AZ deployments reduce recovery time to under 60 seconds in the unlikely case of an AZ failure, says Das. Amazon Redshift currently provides three nines (99.9%) availability, which translates to no more than 43 minutes of downtime each month, he adds. With Multi-AZ deployments, however, Amazon Redshift offers four nines (99.99%) availability – an order of magnitude less downtime, or roughly four and a half minutes per month at most. That all happens with no user intervention.
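Those downtime budgets follow directly from the availability percentages. A quick back-of-the-envelope check, assuming a 30-day month:

```python
# Downtime budget per month for a given availability target (30-day month assumed).
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (0.999, 0.9999):
    budget = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} availability -> up to {budget:.1f} minutes of downtime per month")

# 99.90% availability -> up to 43.2 minutes of downtime per month
# 99.99% availability -> up to 4.3 minutes of downtime per month
```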
How did Amazon make this work in the cloud? The hardest part of the setup was ‘heartbeat’ detection, says Das. This detection system checks the infrastructure, collecting data points that tell it whether everything is working properly. When a problem is detected, Redshift Multi-AZ automatically triggers a failover to restore availability. As part of building Multi-AZ, the core detection algorithm for Amazon Redshift was enhanced and put through extensive stress and scale testing to support the faster recovery times that mission-critical customer deployments require.
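Amazon doesn’t publish the internals of that detection system, but the general pattern is familiar: collect periodic health signals and trigger a failover once enough consecutive checks fail. A purely illustrative sketch of that pattern – not Redshift’s actual implementation, and with check_health and trigger_failover as hypothetical stand-ins – might look like this:

```python
import time

# Generic heartbeat monitor for illustration only -- not Redshift's internal code.
FAILURE_THRESHOLD = 3       # consecutive missed heartbeats before failing over
CHECK_INTERVAL_SECONDS = 5  # how often to poll health signals

def monitor(check_health, trigger_failover):
    misses = 0
    while True:
        if check_health():          # e.g. poll status endpoints, ping compute nodes
            misses = 0
        else:
            misses += 1
            if misses >= FAILURE_THRESHOLD:
                trigger_failover()  # promote the capacity in the healthy AZ
                misses = 0
        time.sleep(CHECK_INTERVAL_SECONDS)
```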
More capacity, higher throughput
Another benefit of Multi-AZ deployments is the added throughput that comes from doubling compute, explains Das. “Often in high availability systems, you have a primary environment and a secondary that is standby usually sitting there not doing anything,” he says. This secondary system is provisioned purely to provide higher availability and is only activated when disaster strikes.
The upside of this active-passive architecture is fast failover, because a ‘hot’ machine is already standing by. The downside is that the customer pays for extra capacity that goes unused almost all of the time.
Redshift Multi-AZ delivers higher availability and manages compute resources in both AZs as a single data warehouse that sits behind a single endpoint. Queries are routed in a round-robin fashion to compute resources in both AZs, so each compute resource does half of the work.
“All this hardware gets used,” Das affirms. “So, you’re not only getting high availability, but also higher throughput.”
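From an application’s point of view, none of this routing is visible: you connect to one endpoint exactly as you would with a single-AZ cluster, and Redshift decides which AZ’s compute serves each query. A minimal sketch using the open-source redshift_connector driver, where the host, database and credentials are placeholders:

```python
import redshift_connector

# Connect to the single Multi-AZ cluster endpoint; all connection details below are placeholders.
conn = redshift_connector.connect(
    host="my-multiaz-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="example-password",
)

cursor = conn.cursor()
cursor.execute("SELECT current_timestamp;")  # served by compute in whichever AZ Redshift picks
print(cursor.fetchone())
conn.close()
```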
Customers choosing the new option should do so primarily for the high-availability benefits, Das says, but the higher throughput is a welcome bonus. Companies will pay the higher compute costs that come with running compute in two AZs, but they won’t need to pay for additional storage, because it is shared: RA3 clusters keep their data in Redshift Managed Storage, allowing customers to scale and pay for compute and storage independently.
RMS uses a write-through approach, meaning that once data is written it is committed to S3 and automatically replicated across the Availability Zones within an AWS region. That protection is regional, though: the data wouldn’t be available if an entire region failed. At that point, you’d need Redshift’s cross-region snapshot copy to replicate snapshots of your cluster to another AWS region. But for companies working in a single region, this is still a big win.
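For teams that do want a copy in a second region, cross-region snapshot copy is enabled per cluster. A hedged boto3 sketch, assuming the EnableSnapshotCopy call; the cluster identifier, regions and retention period are placeholders:

```python
import boto3

# Replicate automated snapshots of a cluster to a second region for disaster recovery.
redshift = boto3.client("redshift", region_name="us-east-1")

redshift.enable_snapshot_copy(
    ClusterIdentifier="my-ra3-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep copied automated snapshots in the destination region
)
```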
The advantage of this separate compute and storage cost model becomes increasingly apparent as data sets grow (and they can scale to petabytes under RMS). Paying only once for storage makes the solution more cost efficient as the data warehouse becomes larger, says Amazon.
Use cases and applications
This high-availability, high-throughput option is excellent for high-concurrency, high-read workloads, explains Das. “One example is dashboard-type workloads where there are a lot of ad hoc queries that spike within a certain period of time and need to execute really quickly,” he says. “In that case, this extra throughput matters because all queries are getting executed simultaneously.”
He gives fraud detection in financial applications as another likely candidate. “Their solution needs to run at all times,” he adds. “It just doesn’t go down because they’re trying to detect fraud and it’s constantly churning data.” Other applications could include fleet management, where critical deliveries (or emergency services visits) must be made as quickly and efficiently as possible.
Customers can enable the new high-availability feature via the console or the AWS API, in three ways. The first simply involves selecting the Multi-AZ option when creating a new RA3 cluster. The second converts an existing RA3 cluster from single-AZ to Multi-AZ by selecting the Multi-AZ option. Finally, you can restore an existing snapshot from an RA3 cluster or from Redshift Serverless as you normally would, but as a Multi-AZ cluster, transforming it on the fly.
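Through the API, those three paths map to a single flag. A hedged boto3 sketch, assuming the MultiAZ parameter on the CreateCluster, ModifyCluster and RestoreFromClusterSnapshot calls; all identifiers, node counts and credentials are placeholders:

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# 1. Create a new RA3 cluster as Multi-AZ from the start.
redshift.create_cluster(
    ClusterIdentifier="analytics-ha",
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="example-password",
    MultiAZ=True,
)

# 2. Convert an existing single-AZ RA3 cluster to Multi-AZ.
redshift.modify_cluster(
    ClusterIdentifier="existing-ra3-cluster",
    MultiAZ=True,
)

# 3. Restore a snapshot into a new Multi-AZ cluster.
redshift.restore_from_cluster_snapshot(
    ClusterIdentifier="restored-ha-cluster",
    SnapshotIdentifier="my-snapshot",
    MultiAZ=True,
)
```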
This new capability promises to give customers the best of both worlds: high availability and high throughput, with an increase in compute costs but not storage costs. AWS has already been working with customers who have piloted the service, with impressive results. As data warehouses become indispensable to mission-critical operations, the company is working hard to keep customers ahead of the game.
Sponsored by AWS.