Employing your cloud data warehouse to scale up AI/ML

SPONSORED FEATURE: Canadian hockey player Wayne Gretzky famously said: “I skate to where the puck is going, not where it has been”. That quote graces countless inspirational business blog posts, but they all miss a vital caveat: you cannot work out where the puck will be unless you know where it was, and in what direction it was heading.

That is where predictive analytics comes in.

Predictive analytics mines historical and current data for patterns that can identify and quantify future trends. The more data you have, the more accurate your predictions can be, but only if you have the capacity to crunch those numbers. That is why AI and predictive analytics go hand-in-hand. Machine learning – a foundational technology in predictive analytics – is a proven technique for recognizing patterns in vast amounts of data without requiring a PhD in data science.
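The core idea can be sketched in a few lines: fit a trend to historical observations, then extrapolate it one step forward. This is a toy illustration of the principle, not a production technique:

```python
# Toy predictive analytics: fit a straight-line trend to a historical
# series and extrapolate one period ahead (ordinary least squares).

def predict_next(history):
    """Fit y = a*x + b to the observed points and return the forecast
    for the next time step."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    # Slope and intercept from the standard least-squares formulas.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    den = sum((x - mean_x) ** 2 for x in xs)
    a = num / den
    b = mean_y - a * mean_x
    return a * n + b

sales = [100, 110, 120, 130]   # a perfectly linear history
print(predict_next(sales))     # -> 140.0
```

Real predictive models juggle many noisy variables at once, which is exactly where machine learning takes over from hand-rolled formulas like this one.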

How AI changes data warehouse design

Customers want tools that can be used to unlock the latent value of their data warehouses quickly, without the steep learning curve associated with new programming paradigms and APIs. Amazon’s re:Invent conference in November demonstrated the company’s commitment to AI as a tool for opening up new functionality and efficiency in data warehousing.

Data warehousing’s relevance to AI starts with having access to the right data and preparing it for rich analytics and ML use cases. Without having that data in place – consolidated from different sources or easily accessed – any ML model is likely to fall short.

Machine learning workloads are often processed outside the data warehouse that feeds them their data. That data must be pre-processed – cleaned, normalized, and deduplicated – before it is loaded into the pipeline. Data teams must prepare semi-structured and structured data alike, applying these basic hygiene measures while also summarizing the data based on key attributes.
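A minimal sketch of that hygiene step might look like the following, where records are normalized, exact duplicates dropped, and spend summarized by region. All field names here are illustrative, not taken from any real schema:

```python
# Minimal pre-processing sketch: clean, deduplicate, and summarize raw
# records before they enter an ML pipeline. Field names are illustrative.
from collections import defaultdict

raw = [
    {"customer": " Alice ", "region": "EMEA", "spend": "120.50"},
    {"customer": "alice",   "region": "emea", "spend": "120.50"},  # duplicate
    {"customer": "Bob",     "region": "AMER", "spend": "80.00"},
]

def clean(record):
    # Normalize casing/whitespace and coerce numeric strings to floats.
    return {
        "customer": record["customer"].strip().lower(),
        "region": record["region"].strip().upper(),
        "spend": float(record["spend"]),
    }

seen, cleaned = set(), []
for rec in map(clean, raw):
    key = (rec["customer"], rec["region"], rec["spend"])
    if key not in seen:          # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append(rec)

# Summarize spend by region - a typical key-attribute aggregation.
totals = defaultdict(float)
for rec in cleaned:
    totals[rec["region"]] += rec["spend"]
print(dict(totals))
```

At warehouse scale this logic would run as SQL or a distributed job rather than in-memory Python, but the clean-dedupe-aggregate shape is the same.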

The need to pre-process and integrate data across AI ecosystems is becoming even more important in the era of generative AI. The field is moving at a breakneck pace, thanks largely to fine-tuned large language models (LLMs) gaining traction for enterprise use. Companies can choose from an increasing number of foundation models – some of which are now open source – and train them further with their own data to suit their specific use cases.

Multiple data sources must be pre-processed for these AI-driven data pipelines, in many formats ranging from streaming time series data to unstructured data from spreadsheets or social media feeds. This requires specific data science and scripting skills that are in short supply.

A lack of skilled personnel was a problem for over half (55 percent) of the IT and line-of-business decision-makers at the 2,000 organizations IDC quizzed in its AI StrategiesView 2021 Survey. The same proportion pointed to a lack of appropriate tools, and to the cost of the computing resources needed to handle training and inference, the company said in its report, Scaling AI/ML Initiatives: The Critical Role of Data.

It would be better from a technology and skills perspective to ingest data more simply or for the data warehouses to access it in place.

Simplifying the use of ML with Redshift

A well-tooled data warehouse can help to cut through this complexity by accessing data from various sources, either by ingesting it or using it in place. It can clean and deduplicate this data, aggregating it where necessary so that it is ready for processing in the AI pipeline.

“Building and managing these ETL [extraction, transformation and loading] data pipelines has been a traditional pain point for our customers,” explained Swami Sivasubramanian, Vice President of Data and AI at AWS. “One of the ways we are helping our customers create a more integrated data foundation is our ongoing commitment to a zero-ETL future.”

Amazon has already laid the groundwork for better AI pipeline integration in Redshift ML, which it launched into general availability in May 2021. This service improves ease of use and helps businesses to build and run machine learning functions on information stored in its managed data warehouse service, manipulating machine learning models using SQL.

Instead of having to export data or learn programming languages commonly used for machine learning, such as Python or R, users can stick to good old SQL: they create the model directly within Redshift without needing an external ML platform, then use it within the database for prediction and classification workloads.

Redshift ML customers use the CREATE MODEL statement to set things rolling. It creates an abstract model from underlying relational data. Customers provide training data with a SQL query or table name and the column they want to predict, and Redshift internally invokes Amazon’s SageMaker Autopilot machine learning tool to generate simple models. The tool automatically explores different model options based on the column or field that the user selects. Users can also use the product to make iterative improvements to those models for more complex predictions. SageMaker then trains the model, allowing Redshift to install it as a user-defined function.
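The shape of that workflow can be sketched as follows, with the CREATE MODEL statement built as a SQL string the way an application might before submitting it over a Redshift connection. The table, column, function, role, and bucket names here are all hypothetical:

```python
# Sketch of a Redshift ML workflow: compose the CREATE MODEL statement
# and a follow-up prediction query as SQL strings. All identifiers are
# hypothetical; submit the SQL via your usual Redshift client.

def create_model_sql(model, table, target, function, iam_role, s3_bucket):
    return (
        f"CREATE MODEL {model}\n"
        f"FROM (SELECT * FROM {table})\n"
        f"TARGET {target}\n"
        f"FUNCTION {function}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )

ddl = create_model_sql(
    model="customer_churn",
    table="customer_activity",
    target="churned",             # the column to predict
    function="predict_churn",     # installed as a SQL function after training
    iam_role="arn:aws:iam::123456789012:role/RedshiftML",
    s3_bucket="my-redshift-ml-bucket",
)
print(ddl)

# Once SageMaker has trained the model, predictions are plain SQL:
query = "SELECT customer_id, predict_churn(age, tenure, spend) FROM customer_activity;"
```

The point of the design is visible in the last line: after training, the model is just another SQL function callable inside the warehouse.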

Scaling data warehouse queries with AI

Having eased the data integration journey to support AI workloads in Redshift, Amazon has focused on using AI to enhance the performance and cost efficiency of data warehousing queries. This gives customers dealing with rising data volumes more confidence in the scalability and availability of their data infrastructure.

Amazon announced late last year that it was using AI to improve scaling capacity in Redshift Serverless. It trains the AI model on factors including query complexity and frequency, along with the size of the target dataset. This query analysis enables Redshift to plan query execution, which might include scheduling large, non-critical queries on their own capacity to avoid impacting other queries.

Customers can decide whether they want to emphasize cost efficiency or performance improvement using these behind-the-scenes AI enhancements, and Amazon takes care of the rest. The optimizations happen as part of the standard Redshift Serverless service.

Supercharging Redshift queries with Amazon Q

Getting data queued up in AI pipelines and optimizing queries are two critical challenges for Redshift users. A third is extracting the maximum business value from that data through intuitive query interfaces. Providing AI tools and services to help customers on that journey was a focal point for Amazon last year.

A key component in more intuitive data analysis is Amazon Q, a generative AI work assistant designed to help business users in their everyday tasks using natural conversational interfaces.

Amazon Q stands out for two reasons: first, it enhances the power of generative AI by connecting LLMs to companies’ data, offering precision insights that are highly relevant to their tasks. Second, Amazon designed Amazon Q for deep integration into a range of other services to enhance their native interfaces and functionality.

Amazon Q capabilities have been embedded directly into the company’s Amazon Redshift Query Editor, enabling developers and data scientists to generate SQL queries based on natural language. The AI is context aware, tailoring its code to the database and schema that the user is querying, and also enables users to refine the SQL using a conversational interface, providing feedback and follow-up instructions.

Amazon Q has also made its way into Amazon’s QuickSight business intelligence tool. Launched in 2021, QuickSight Q adds natural language query capabilities that allow users to ask business questions intuitively and receive accurate answers with relevant visualizations. This makes querying easier for users – especially business users that are not skilled in SQL.

Amazon has also enhanced QuickSight Q’s intuitive query and response capabilities by including Amazon Q generative AI functionality directly within the business intelligence tool. These new capabilities extend the natural language processing a step further by making commands even more intuitive and more functional. For example, a visual authoring capability allows users to build visuals based on Redshift data in QuickSight Q using natural language, tuning and formatting them iteratively by following up with other commands. Users can also build calculations using natural language without having to learn or look up the specific syntax. A new feature called stories enables users to build entire narratives around the insights from their data through natural-language conversations with the tool.

Bring your own LLM to Redshift

Amazon Q doesn’t rely on a single LLM. Instead, it uses Amazon Bedrock, an AWS service that exposes foundation models from Amazon and other providers via an API. Launched in preview in April 2023 and made generally available that September, Bedrock lets customers use a variety of LLMs, from Amazon’s own Titan models through to alternatives from companies including Anthropic, Meta, and Stability AI.

Customers can also use Bedrock to achieve their desired outcomes by creating and manipulating their own fine-tuned models in the cloud. Bedrock takes data from multiple sources including Redshift, enabling customers to use Redshift data to fuel LLM-powered scenarios with enterprise-specific data for personalized applications ranging from text generation to search.

Amazon’s commitment to bring-your-own LLMs extends into Redshift. In November, the company announced the ability to integrate pre-trained open-source LLMs via Amazon SageMaker JumpStart, its hub for getting started quickly with ready-made machine learning models. Customers can use this integration to extract more insights from their Redshift data, for example by summarizing feedback on their products or performing sentiment analysis. A community article detailing how customers can invoke models in Bedrock by using Lambda functions can be found here.
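In the spirit of that community article, a Lambda-style handler for invoking a Bedrock model might look roughly like this. The model ID and request body format are assumptions for illustration – each hosted model has its own documented request schema – and the actual AWS call is left commented out so the sketch stands alone:

```python
# Hedged sketch of calling a Bedrock model from a Lambda handler.
# The model ID and prompt schema are assumptions; check the request
# format documented for the specific model you use.
import json

MODEL_ID = "anthropic.claude-v2"   # hypothetical choice of hosted model

def build_request(prompt, max_tokens=256):
    # Bedrock's InvokeModel API takes a model-specific JSON body.
    return {
        "modelId": MODEL_ID,
        "contentType": "application/json",
        "body": json.dumps({"prompt": prompt,
                            "max_tokens_to_sample": max_tokens}),
    }

def handler(event, context):
    # In a real Lambda you would do something like:
    #   client = boto3.client("bedrock-runtime")
    #   response = client.invoke_model(**build_request(event["prompt"]))
    #   return json.loads(response["body"].read())
    return build_request(event["prompt"])   # sketch: return the request only

print(handler({"prompt": "Summarize this product feedback: ..."}, None))
```

Wiring a function like this between Redshift and Bedrock is what lets warehouse queries hand text off to an LLM for summarization or sentiment analysis.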

Redshift ML in action

Job search site Jobcase uses Redshift to store job seeker and job interaction data from millions of users. The company used to export data from Redshift into S3 to produce job recommendations for around 10 million members from a pool of up to 30 million listings each day. It would then train its models and run inference on virtual machines using that data, in a process that took around seven hours.

Jobcase switched to Redshift ML to scale up its analytics while cutting out those constant data transfers, training and applying machine learning models directly in Redshift using SQL. This enables it to perform up to a billion predictions each day. This also reduced the cost of external machine learning frameworks and compute, the company said. The biggest benefit came from the speed and sophistication of the models, though. Jobcase explained that the more efficient in-warehouse inference enabled it to expand from simple, linear models to more sophisticated ones. This contributed to an increased engagement rate of up to 10 percent with its job recommendations.

AI – and especially generative AI – promises to unlock the power of enterprise data. To realize its full potential, companies must get it to the right place in the right state, optimize its processing in the data warehouse, and then extract insights from it more intuitively. This requires a robust AI-enhanced tech stack from infrastructure through to AI-based services like Amazon Q. We fully expect more AI-focused integration between Redshift and other Amazon products like Bedrock in the future.

Sponsored by AWS.