Today, data is the currency of many technology companies. We are producing more data than ever through our day-to-day activities, and with that comes a need to store and analyze that data and to gain meaningful insights from it.
In the modern world of cloud computing, there are a number of different tools that we can use to store and analyze massive amounts of data.
We are going to explore Amazon Athena vs Amazon RedShift and, more specifically, compare Amazon RedShift Spectrum vs Amazon Athena on features, performance, management and cost to help you decide which tool you need for a particular job.
First of all, let’s look at a general summary of Amazon RedShift Spectrum vs Amazon Athena.
What is Amazon RedShift?
Amazon RedShift is a fully managed data warehousing solution within the AWS cloud. It is designed to deliver fast query performance on very large data sets.
Within Amazon RedShift, the user sets up clusters, which are groups of servers (nodes) that work together to run Amazon RedShift. Each cluster runs an Amazon RedShift engine and holds one or more databases. Users are then able to quickly run complicated SQL queries to analyze and gain insights into the data that is stored.
Amazon RedShift Serverless is also in preview at the time of writing. It offers the same functionality as Amazon RedShift without the need to configure any clusters.
Amazon RedShift Spectrum is the feature of Amazon RedShift that queries data directly in Amazon S3 without first loading it into the cluster, which makes it the natural point of comparison between Amazon RedShift and Amazon Athena.
What is Amazon Athena?
Amazon Athena allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) using standard SQL. Amazon Athena is also completely serverless, meaning there is no infrastructure to set up or manage. It can analyze unstructured, semi-structured, and structured data in Amazon S3.
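As a rough illustration of the serverless model, the sketch below builds the parameters for a single Athena query from Python. The table name `access_logs`, the database `weblogs` and the results bucket are hypothetical placeholders. The commented-out call assumes the boto3 library's `start_query_execution` method on an Athena client; it is left commented so the snippet runs without AWS credentials.

```python
def build_athena_request(sql, database, output_location):
    """Return keyword arguments for an Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# All names below are hypothetical examples, not values from the article.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_location="s3://example-athena-results/",
)
print(request)

# With AWS credentials configured, the request could be submitted like this:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**request)
# print(response["QueryExecutionId"])
```

Note that nothing in the request refers to a cluster or server: the query text, a database name and a place to write results are all that is needed.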
Now that we have a brief overview of both Amazon RedShift and Amazon Athena, let’s look at a basic breakdown of each service's features and how they compare.
| Feature | Amazon Athena | Amazon RedShift |
|---|---|---|
| Data formats and partitioning | Serializer/Deserializer (SerDe) libraries parse data from different formats such as CSV, JSON, TSV, and Apache logs. Tables can be partitioned by any key, with up to 20,000 partitions per table. Object identifier types are not supported. | Direct partitioning is not supported by default. Instead, Amazon RedShift uses predefined distribution keys to optimize tables for parallel processing. Choose the distribution key carefully, as a poor choice here can dramatically impact query performance. |
| Primary Key Constraint | Key not required. Duplicates exist only if they are already present in the S3 datasets. | Key not required, so the same data can be loaded multiple times. If a key is needed, it must be declared before data is loaded into the warehouse. |
| Data types | Supports complex data types like arrays and maps | Does not support arrays, maps or object identifier types |
| UDF (User Defined Functions) | Does not support any User Defined Functions | Supports user defined functions |
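To make the Athena partitioning point more concrete, here is a minimal sketch that lays out data under Hive-style `key=value` prefixes in S3 and builds the matching `ALTER TABLE ... ADD PARTITION` statement. The bucket, table name and the choice of `year`/`month`/`day` as partition keys are assumptions made for illustration; any key could be used.

```python
from datetime import date


def partition_prefix(base, day):
    """Build a Hive-style S3 prefix such as .../year=2024/month=01/day=15/."""
    return (
        f"{base.rstrip('/')}/year={day.year}"
        f"/month={day.month:02d}/day={day.day:02d}/"
    )


def add_partition_ddl(table, base, day):
    """Build the DDL statement that registers one partition with the table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{day.year}', month='{day.month:02d}', "
        f"day='{day.day:02d}') "
        f"LOCATION '{partition_prefix(base, day)}'"
    )


# Hypothetical table and bucket names.
print(add_partition_ddl("access_logs", "s3://example-logs/access", date(2024, 1, 15)))
```

A query that filters on the partition keys only needs to read the objects under the matching prefixes, which is why a sensible partition key matters.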
The next criterion we will use to compare Amazon RedShift Spectrum vs Amazon Athena is performance. It is important to know which solution performs better (and in which scenario) when deciding which product to use.
| Criterion | Amazon Athena | Amazon RedShift |
|---|---|---|
| Query engine | Uses Presto, with Apache HiveQL (HQL) for DDL, and is generally slower than Amazon RedShift on complex queries | Based on PostgreSQL and generally faster than Amazon Athena once data is loaded |
| Query speed | Simple read: faster than Amazon RedShift. Aggregated: slightly faster than Amazon RedShift. Join query: slower than Amazon RedShift due to its simpler focus. | Simple read: slower than Amazon Athena. Aggregated: slightly slower than Amazon Athena. Join query: faster than Amazon Athena due to the ability to easily handle traditional joins and relational workloads. |
| Time to start up | Near-instant start-up with no initial setup. It is serverless, so you can start querying data immediately. | The cluster has to be set up first (unless using Amazon RedShift Serverless), and you must load the data into tables manually. |
Thirdly, we will compare how each service is managed, and how easy it is to manage security, upgrades and querying tables.
| Area | Amazon Athena | Amazon RedShift |
|---|---|---|
| Security | Tightly integrates with IAM, and users must have S3 access to query any buckets. Amazon Athena can also securely query encrypted data. | Access is given through a security group for your clusters, within a VPC. KMS (Key Management Service) is available to encrypt your data at rest. |
| Upgrades | Any upgrading is strictly tied to S3, as Amazon Athena sits on top of it. | You can easily upgrade or scale node by node. |
| Querying tables | Amazon Athena uses Presto to query S3, and you can partition your tables to increase performance. You can also convert your data into a columnar format, which can increase performance. | Once you have loaded your data (which can be time consuming), queries are faster than in Amazon Athena. |
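The effect of partitioning and columnar storage on Athena can be approximated with some simple arithmetic. The sketch below is a deliberately crude model, not how Athena actually plans queries: it assumes all partitions are the same size, all columns take the same space, and compression is ignored. The function name and the example figures are hypothetical.

```python
def estimated_bytes_scanned(total_bytes, partitions_read, partitions_total,
                            columns_read, columns_total):
    """Rough estimate of data scanned after partition pruning and
    column selection, under the equal-size assumptions stated above."""
    return (total_bytes * partitions_read * columns_read
            / (partitions_total * columns_total))


# A 1 TB table with 365 daily partitions and 20 columns.
full_scan = estimated_bytes_scanned(10**12, 365, 365, 20, 20)
pruned = estimated_bytes_scanned(10**12, 7, 365, 3, 20)

print(f"Full scan:                {full_scan / 10**9:.1f} GB")
print(f"One week, three columns:  {pruned / 10**9:.1f} GB")
```

Under these assumptions, reading one week of data and three columns touches a small fraction of the table, which is the intuition behind both optimizations.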
Now let's take a closer look at how each service is priced.
| | Amazon Athena | Amazon RedShift |
|---|---|---|
| Pricing | Amazon Athena charges per terabyte of data scanned during query execution. Each query is billed for a minimum of 10 megabytes, and there is no charge for failed queries. | Clusters are priced according to their node type and number of nodes. Both dense compute nodes and dense storage nodes are billed at an hourly rate. The price is predictable and there is no penalty for running extra queries, but the fixed compute (SSD) and storage (HDD) capacity may increase overall cost. |
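The two pricing models can be compared with a short calculation. The sketch below is only an illustration: the per-terabyte price and hourly node rate are made-up placeholders rather than actual AWS rates, decimal units are assumed, and the 730-hour month is an approximation. Only the per-terabyte billing, the 10 megabyte minimum and the hourly per-node billing come from the comparison above.

```python
MB = 10**6   # decimal units assumed for simplicity
TB = 10**12


def athena_query_cost(bytes_scanned, price_per_tb, minimum_bytes=10 * MB):
    """Cost of one successful query, billed per terabyte scanned
    with a minimum billable amount per query."""
    billable_bytes = max(bytes_scanned, minimum_bytes)
    return billable_bytes / TB * price_per_tb


def redshift_monthly_cost(node_count, hourly_rate_per_node, hours=730):
    """Cost of a provisioned cluster that runs for the whole month."""
    return node_count * hourly_rate_per_node * hours


# Hypothetical placeholder prices, not actual AWS rates.
PRICE_PER_TB = 5.00
HOURLY_RATE = 0.25

per_query = athena_query_cost(200 * 10**9, PRICE_PER_TB)  # query scanning 200 GB
cluster = redshift_monthly_cost(node_count=2, hourly_rate_per_node=HOURLY_RATE)

print(f"Athena cost per query:  {per_query:.2f}")
print(f"Cluster cost per month: {cluster:.2f}")
print(f"Break-even at roughly {cluster / per_query:.0f} such queries per month")
```

The break-even figure captures the trade-off: occasional queries favor paying per terabyte scanned, while a steady, heavy query load can make a fixed hourly cluster cheaper.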
Ultimately, the choice between Amazon RedShift and Amazon Athena depends on your individual needs. In addition to providing different functions, Amazon RedShift and Athena take different approaches to managing your data and gaining value from it. Amazon RedShift requires infrastructure management and data preparation, while Athena gets straight to querying data directly from Amazon S3.
Amazon Athena is noteworthy because it doesn’t require initial setup (it is serverless), so it’s perfect for simple querying.
In general, Athena is best for running queries quickly and conveniently without setting up a complex infrastructure.
Amazon RedShift excels at performing complex joins, inner queries, and aggregation. Its architecture copes well with growing data, and adding more nodes to a cluster is simple.
Overall, Amazon RedShift is the best option for performing high-performance queries involving large datasets.