Curbing the cost of cloud analytics and data warehousing

Sponsored Feature: Information wants to be free. It also tends to be expensive – especially when you have to process it. As we struggle to squeeze more insights out of our data, the race is on to reduce the computational cost involved. And the cloud is helping to increase the performance of the gargantuan data warehouses needed to drive new insights, without breaking the bank.

Inflation has clearly bitten hard for some companies procuring their own data storage and processing infrastructure. But while server and storage component prices fluctuate, the volume of data that companies must store and process to remain competitive continues to expand. In 2018, IDC and Seagate collaborated on a report, Data Age 2025, which estimated that the total amount of data created, captured, and replicated each year would grow from 33 zettabytes at the time to around 100 zettabytes this year, before soaring by another 75 percent to reach 175 zettabytes in 2025.

Focusing on price performance

With prices and data volumes on their way up, companies must put more attention on efficiency to help manage the challenge of enterprise data storage and processing. That is why AWS is constantly striving to improve the performance of its data warehousing and analytics service, Amazon Redshift. We spoke to Stefan Gromoll, Performance Engineering manager at Amazon, who spends a lot of his time focused on improving Redshift’s performance for customers.

“We are laser-focused on continuously improving Redshift’s price performance, which means delivering the best performance we can for every dollar you spend,” he says. Price performance should be consistent and predictable, so that customers can keep the cost of large-scale data processing under control. If Amazon can deliver better price performance than both on-premises data warehousing vendors and rival cloud data warehouses, while consistently improving the price performance of its own service, then the performance team is doing its job.

Decoupling storage and compute

Price encompasses both the cost of storing data and the cost of processing it. “However, for most customers, at least anecdotally, compute costs are the dominant factor,” says Gromoll.

Amazon Redshift has been focused on price performance since its introduction. But it was the debut of RA3 nodes and managed storage in December 2019 that allowed Redshift users to decouple storage and compute, scaling each as needed. These nodes use large, high-performance SSDs for local caching alongside Redshift Managed Storage to deliver improved price performance.

Decoupling compute from storage enabled customers to boost their data processing capabilities without paying for storage they didn't need. The Redshift performance team concentrates on improving the performance of the compute and storage infrastructure, with improvements designed to show up on customers' bills rather than just on benchmarks.

Industry benchmarks are one yardstick. “We want to know where we stand with these official benchmarks because we know people run them out there,” he says. But the team's real focus is on improving real-world data warehouse performance in the areas that matter most to customers.

Gromoll and his team regularly examine performance telemetry from the Redshift fleet to find common performance optimization opportunities. Then, they work on ways to squeeze more data processing performance out of Redshift for the same cost.

A string of performance enhancements

One recent example was string vectorization, which applies a general performance-enhancing technique to string processing. With small registers, a single CPU core could perform only one arithmetic operation per clock cycle. Companies achieved parallelism by spreading groups of calculations across multiple cores, but that still left performance untapped at the single-core level. As register sizes grew, a single register could hold multiple numbers, and vectorization – single instruction, multiple data (SIMD) processing – uses that capability to apply one instruction to several values in a single clock cycle.
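
As a rough illustration of the principle (this is not Redshift's internal code), here is the difference between a scalar loop and a vectorized operation in Python, where NumPy hands the work to SIMD-capable native loops:

```python
import time

import numpy as np

values = np.random.rand(1_000_000)

# Scalar-style loop: one multiply per iteration, plus interpreter overhead
start = time.perf_counter()
scalar_result = [v * 1.07 for v in values]
scalar_time = time.perf_counter() - start

# Vectorized: a single call processes the whole array, letting the CPU
# apply one instruction to multiple values per clock cycle
start = time.perf_counter()
vector_result = values * 1.07
vector_time = time.perf_counter() - start

print(f"scalar: {scalar_time:.3f}s, vectorized: {vector_time:.3f}s")
```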

Redshift customers store much of their data as strings rather than integers or floating point numbers. “We realized that there was a lot of opportunity to provide benefits to our customers by optimizing the string performance,” Gromoll recalls.

Amazon engineers developed a new way to manage compressed string data on disk. Vectorizing the algorithms that read string compression encodings enabled CPU-efficient scans over compressed, dictionary-encoded string columns. For queries that process large amounts of string data, this sped up string handling by as much as sixty times, Gromoll says.
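
A simplified sketch of the idea (the dictionary and data here are invented, and this is not Redshift's actual on-disk format): strings are dictionary-encoded as small integer codes, so a filter can be evaluated as a single vectorized integer comparison over the compressed column, with no per-row string decompression:

```python
import numpy as np

# A dictionary-encoded string column: each distinct string is stored once,
# and the column itself holds only compact integer codes
dictionary = np.array(["DE", "FR", "UK", "US"])
codes = np.random.randint(0, len(dictionary), size=10_000_000).astype(np.uint8)

# Predicate: WHERE country = 'UK'. Translate the string to its code once,
# then scan the compressed column with one vectorized comparison
target = np.where(dictionary == "UK")[0][0]
matches = codes == target
print(f"{matches.sum()} rows match")
```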

Improving response times

Mobile gaming company Playrix began using Amazon Redshift Serverless in 2022 in a bid to sharpen the marketing analytics it uses to increase game sales. The company, which has 85 million daily active users, must analyze tens of petabytes of data to understand how players interact with its games. Its EC2-hosted PostgreSQL database had served it well but was struggling to keep up. After Playrix switched to Redshift alongside AWS Fargate, the serverless container service that slurps data from its partner systems, it saw improved response times for queries on massive amounts of historical data and cut its monthly costs by 20 percent.

Parallelism plays an important part in Redshift at the cluster level too. The data warehouse has a feature known as Concurrency Scaling, which automatically adds and removes compute to meet volatile demand for read and write queries. The automatic scaling feature reduces or eliminates queued queries, speeding up data processing for large workloads while avoiding bottlenecks. Customers are only charged for the extra compute that their queries use.
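
On a provisioned cluster, Concurrency Scaling is controlled through the cluster's parameter group. A hedged sketch using boto3 – the parameter group name is hypothetical, and the value should be checked against current Redshift documentation:

```python
import boto3

redshift = boto3.client("redshift")

# Cap how many transient clusters Concurrency Scaling may spin up;
# "analytics-params" is a hypothetical parameter group name
redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",
    Parameters=[{
        "ParameterName": "max_concurrency_scaling_clusters",
        "ParameterValue": "4",
    }],
)
```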

Concurrency Scaling was another key benefit of Playrix's Redshift migration. Gaming analytics workloads are volatile, but Playrix uses Concurrency Scaling to service spiky SQL queries from its internal users, scaling quickly while keeping costs low.
Today, Playrix processes and stores up to 5TB of real-time streaming data from its marketing partners in its Amazon Redshift data warehouse. It applies machine learning to this data, helping it to predict revenue and customer lifetime value.

Automating workload management

Users can dictate which queries a concurrency-scaling cluster handles using Redshift’s Auto Workload Manager (AutoWLM). This is the automated version of Amazon’s workload management scheduler: instead of relying on manual configuration, it automatically decides how many queries should execute at the same time and what resources (e.g. memory) to give each query.

That’s important because customers submit many concurrent queries, often in the thousands. “When you get these 1000 users who all issue queries at the same time, AutoWLM decides exactly how to execute all of those 1000 queries in a way that maximizes throughput,” Gromoll explains. The system also continually learns from query patterns to refine this optimization over time, adapting its query routing as warehouse usage evolves.
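
On provisioned clusters, this behavior is expressed through the wlm_json_configuration parameter. A minimal, hedged sketch of a single auto-managed queue, again with a hypothetical parameter group name:

```python
import json

import boto3

redshift = boto3.client("redshift")

# A single auto_wlm queue hands Redshift full control over query
# concurrency and per-query memory allocation
wlm_config = [{"auto_wlm": True}]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```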

Automatic workload management is enabled out of the box in Redshift. Amazon made a serverless option for Redshift available in July 2022, joining its provisioned deployment options.

Pay-as-you-go capacity is readily available, but users can gain cost efficiencies by planning ahead with reserved instances. Playrix reserved instances to boost its Redshift price performance, but it also uses EC2 spot instances. These are ephemeral instances that Amazon can reclaim at short notice, and they are therefore priced very low. Astute customers use them opportunistically to dispatch short-lived workloads.
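
A sketch of how a short-lived batch job might be launched on spot capacity with boto3 – the AMI ID and instance type are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Request spot capacity for an interruptible, short-lived batch job;
# the AMI ID below is a hypothetical placeholder
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="c5.xlarge",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # One-time request: acceptable for work that can be retried
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```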

Strategies like this have paid off handsomely for Playrix, which boasts a thousand percent speed boost in its analytics queries since moving to Redshift, for the same cost as its standalone EC2-based PostgreSQL implementation.

“We have invested very heavily to make Redshift performance linear,” says Gromoll. He points out that both provisioned and Serverless deployments can be scaled in relatively small increments, which gives customers a lot of flexibility to keep costs under control: Redshift provisioned warehouses can be expanded or shrunk by as little as a single compute node, while Redshift Serverless uses even more granular Redshift Processing Units (RPUs). “So you can really dial in exactly the performance and cost you want without having to pay for more compute than you need – you don’t have to double your warehouse if you just need a little bit more compute,” Gromoll adds.
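
For example, nudging a Serverless workgroup's capacity up by one small increment might look like this with boto3 (the workgroup name is hypothetical):

```python
import boto3

serverless = boto3.client("redshift-serverless")

# Raise the workgroup's base capacity by a small RPU increment rather
# than doubling a cluster; "analytics-wg" is a hypothetical name
serverless.update_workgroup(
    workgroupName="analytics-wg",
    baseCapacity=40,  # measured in RPUs; valid increments are documented by AWS
)
```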

Price performance in action

The impact of these price performance improvements grows with the volume of data that customers are processing in Redshift – and there are some very large users indeed. Security and governance features that deliver comprehensive identity management with granular authorization controls – such as role-based access control, row-level security, and dynamic data masking – come at no additional cost, which further drives home the savings and helps with price performance.

Nasdaq, the financial exchange and clearing house which hosts almost 4,000 listed companies globally, moved to Redshift in 2014 to power its business analytics. Today it ingests billions of financial records nightly, crunching some four terabytes of data after market close. The challenge is getting that data into the system for processing in the first place.

As market volatility pushed data loads ever higher, Nasdaq worked with AWS to reinvent its Redshift-based data warehousing operation. It moved its data lake into the Amazon S3 managed storage layer and adopted Amazon Redshift Spectrum.

Redshift Spectrum allows the exchange to query its massive data lake directly in S3, eliminating the need to extract, transform, and load data into Redshift separately. The new architecture also decoupled storage and compute, letting the company devote its compute nodes entirely to query processing and slashing query processing times by a third.
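
A hedged sketch of what querying an external Spectrum table might look like through the Redshift Data API – the cluster, database, user, schema, and table names are all hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

# The external table lives in S3; Spectrum scans it in place, so no
# separate ETL load into Redshift is required
sql = """
    SELECT trade_date, COUNT(*) AS trades
    FROM spectrum_schema.trades
    WHERE trade_date = '2024-01-02'
    GROUP BY trade_date;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="markets",
    DbUser="analyst",
    Sql=sql,
)
print(resp["Id"])  # statement ID, used to poll for results
```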

The new architecture enabled Nasdaq to grow its nightly record volume from 30 billion to 70 billion and beyond, while hitting the 90 percent mark for data load completion five hours sooner than before the change. That means the data is ready for analytics jobs as early as an hour after the market closes.

Automating manual tasks

Another class of features within Redshift that helps to boost price performance and eliminate administrative overhead is “autonomics”. These features help companies across different verticals to do more with Redshift without having to spend more on staff.

“We know that our customers don’t want to have to manually tune their database to get the best performance from it,” Gromoll explains. “So we have invested heavily in autonomics over the last couple of years, which allow the database to self-tune to deliver the best price performance.”

One example of autonomics at work is automatically optimizing how data is stored and distributed in the data warehouse. Redshift autonomics can detect when better performance can be delivered by distributing data differently and automatically sends data to the appropriate nodes to improve query performance. Locating the data appropriately before a query runs means less data shuffling during query execution.

In the past, database administrators had to manually assign the distribution keys used to allocate that data, but now it happens automatically. “Customers can load their data and start working with it,” Gromoll says. “Redshift will automatically learn from their workload and redistribute data optimally to deliver the best price performance.”
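
A hedged sketch using Amazon's redshift_connector Python driver (connection details and the table are invented): DISTSTYLE AUTO hands the distribution decision to Redshift, both for new tables and for existing ones:

```python
import redshift_connector

# Hypothetical connection details
conn = redshift_connector.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)
cur = conn.cursor()

# With DISTSTYLE AUTO, Redshift chooses - and later revises - the
# distribution strategy itself as it observes the workload
cur.execute("""
    CREATE TABLE sales (
        sale_id BIGINT,
        user_id BIGINT,
        amount  DECIMAL(12,2)
    ) DISTSTYLE AUTO;
""")

# Existing tables can be switched over too
cur.execute("ALTER TABLE sales ALTER DISTSTYLE AUTO;")
conn.commit()
```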

In the future, Gromoll sees even bigger opportunities in autonomics. “Database teams can focus on generating insights from their data rather than administering their data warehouse,” he posits. His team also spends its time identifying focused performance improvements that might not seem like much on their own, but which add up to big savings when applied together over millions of queries. As data loads increase, the team continues to look everywhere it can for price-performance gains that Amazon can pass on to its customers.

Sponsored by AWS.