Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use – simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro.
While Amazon Athena is ideal for quick, ad-hoc querying and integrates with Amazon QuickSight for easy visualization, it can also handle complex analysis, including large joins, window functions, and arrays.
Amazon Athena uses a managed Data Catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
With Amazon Athena, you don’t have to worry about managing or tuning clusters to get fast performance.
Athena is optimized for fast performance with Amazon S3.
Athena automatically executes queries in parallel, so that you get query results in seconds, even on large datasets.
Most results are delivered within seconds.
With Athena, there’s no need for complex ETL jobs to prepare data for analysis.
This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.
You can also use Glue’s fully managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.
Query services like Amazon Athena, data warehouses like Amazon Redshift, and sophisticated data processing frameworks like Amazon EMR, all address different needs and use cases.
Amazon Redshift provides the fastest query performance for enterprise reporting and business intelligence workloads, particularly those involving extremely complex SQL with multiple joins and sub-queries.
Amazon EMR makes it simple and cost effective to run highly distributed processing frameworks such as Hadoop, Spark, and Presto when compared to on-premises deployments. Amazon EMR is flexible – you can run custom applications and code, and define specific compute, memory, storage, and application parameters to optimize your analytic requirements.
Amazon Athena provides the easiest way to run ad-hoc queries for data in S3 without the need to setup or manage any servers.
The table below shows the primary use case and situations for using a few AWS query and analytics services:
|AWS Service||Primary Use Case||When to use|
|Amazon Athena||Query||Run interactive queries against data directly in Amazon S3 without worrying about formatting data or managing infrastructure. Can use with other services such as Amazon RedShift|
|Amazon RedShift||Data Warehouse||Pull data from many sources, format and organize it, store it, and support complex, high speed queries that produce business reports.|
|Amazon EMR||Data Processing||Highly distributed processing frameworks such as Hadoop, Spark, and Presto. Run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data.|
|AWS Glue||ETL Service||Transform and move data to various destinations. Used to prepare and load data for analytics. Data source can be S3, RedShift or another database. Glue Data Catalog can be queried by Athena, EMR and RedShift Spectrum|
Best practices for performance with Athena:
- Partition your data – Partition the table into parts and keeps the related data together based on column values such as date, country, region, etc. Athena supports Hive partitioning.
- Bucket your data – Partition your data is to bucket the data within a single partition.
- Use Compression – AWS recommend using either Apache Parquet or Apache ORC.
- Optimize file sizes – Queries run more efficiently when reading data can be parallelized and when blocks of data can be read sequentially.
- Optimize columnar data store generation – Apache Parquet and Apache ORC are popular columnar data stores.
- Optimize ORDER BY – The ORDER BY clause returns the results of a query in sort order.
- Optimize GROUP BY – The GROUP BY operator distributes rows based on the GROUP BY columns to worker nodes, which hold the GROUP BY values in memory.
- Use approximate functions – For exploring large datasets, a common use case is to find the count of distinct values for a certain column using COUNT(DISTINCT column).
- Only include the columns that you need – When running your queries, limit the final SELECT statement to only the columns that you need instead of selecting all columns.
With Amazon Athena, you pay only for the queries that you run.
You are charged based on the amount of data scanned by each query.
You can get significant cost savings and performance gains by compressing, partitioning, or converting your data to a columnar format, because each of those operations reduces the amount of data that Athena needs to scan to execute a query.