AWS Analytics


Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.

Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis.

This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
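
As a sketch of that workflow (the database, table, and results-bucket names below are illustrative assumptions), a query can be submitted with boto3 and polled until the results land in S3:

```python
import time
import boto3

athena = boto3.client("athena")

# Submit a standard SQL query against data already in S3.
# Database, table, and output bucket names are illustrative assumptions.
response = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS events FROM sales_db.events GROUP BY region",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes; Athena writes results to the S3 output location.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```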

Best practices for performance with Athena:

  • Partition your data – Partitioning divides the table into parts and keeps related data together based on column values such as date, country, or region. Athena supports Hive-style partitioning.
  • Bucket your data – Another way to partition your data is to bucket the data within a single partition.
  • Use compression – AWS recommends using either Apache Parquet or Apache ORC, both of which compress data by default (a DDL sketch combining partitioning, bucketing, and Parquet follows this list).
  • Optimize file sizes – Queries run more efficiently when reading data can be parallelized and when blocks of data can be read sequentially.
  • Optimize columnar data store generation – Apache Parquet and Apache ORC are popular columnar data stores.
  • Optimize ORDER BY – The ORDER BY clause returns the results of a query in sort order; because the sort is performed on a single worker, use ORDER BY with a LIMIT clause where possible.
  • Optimize GROUP BY – The GROUP BY operator distributes rows based on the GROUP BY columns to worker nodes, which hold the GROUP BY values in memory.
  • Use approximate functions – For exploring large datasets, a common use case is to find the count of distinct values for a certain column using COUNT(DISTINCT column); where an approximation is acceptable, approx_distinct(column) is much faster.
  • Only include the columns that you need – When running your queries, limit the final SELECT statement to only the columns that you need instead of selecting all columns.
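
To make the first three practices concrete, here is a minimal DDL sketch (table, bucket, and column names are assumptions) that creates a partitioned, bucketed, Parquet-backed table; it could be submitted through start_query_execution exactly like the query above:

```python
# Illustrative Athena CTAS combining partitioning, bucketing, and a columnar
# format. Table, bucket, and column names are assumptions for the example.
# Note: partition columns must come last in the SELECT list.
CREATE_TABLE_SQL = """
CREATE TABLE sales_db.events_curated
WITH (
    format = 'PARQUET',
    external_location = 's3://my-data-lake/events-curated/',
    partitioned_by = ARRAY['event_date'],
    bucketed_by = ARRAY['user_id'],
    bucket_count = 16
) AS
SELECT user_id, event_type, region, event_date
FROM sales_db.events
"""
```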

Amazon Redshift

Amazon Redshift is the fastest and most widely used cloud data warehouse.

Redshift is integrated with your data lake and offers up to 3x better price performance than any other data warehouse.

Amazon EMR

Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.

Amazon EMR makes it easy to set up, operate, and scale your big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters.

With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark.

You can run workloads on Amazon EC2 instances, on Amazon Elastic Kubernetes Service (EKS) clusters, or on-premises using EMR on AWS Outposts.

An EMR cluster runs in a single Availability Zone within a VPC.

Uses Amazon EC2 for compute.
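
A minimal cluster-creation sketch with boto3 (the cluster name, release label, subnet, instance types, and log bucket are assumptions, and the default EMR service roles are assumed to already exist):

```python
import boto3

emr = boto3.client("emr")

# Launch a small Spark/Hive cluster; all names and sizes here are illustrative.
response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # one subnet = one AZ
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",
)
print(response["JobFlowId"])
```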

Amazon Elasticsearch Service

Amazon Elasticsearch Service is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost effectively at scale.

You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.

The service provides support for open source Elasticsearch APIs, managed Kibana, integration with Logstash and other AWS services, and built-in alerting and SQL querying. Amazon Elasticsearch Service lets you pay only for what you use – there are no upfront costs or usage requirements.

With Amazon Elasticsearch Service, you get the ELK stack you need, without the operational overhead.

Amazon Kinesis Data Streams

Multiple applications can consume the same records from a shard (with SQS, a message is processed by a single consumer and then deleted).

Records are ordered within a shard.

Records are retained in the stream for 24 hours by default, configurable up to 7 days.

Kinesis Producers:

  • AWS SDK.
  • Kinesis Producer Library (KPL).
  • Kinesis Agent.

Kinesis Consumers:

  • AWS SDK.
  • Lambda.
  • Kinesis Client Library (KCL).
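
A minimal AWS SDK producer and consumer sketch (the stream name and payload are assumptions; the KPL, KCL, and Lambda options above add batching, checkpointing, and scaling on top of these raw calls):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: records with the same partition key land on the same shard,
# which is what gives per-shard ordering.
kinesis.put_record(
    StreamName="clickstream",  # illustrative stream name
    Data=json.dumps({"page": "/home"}).encode("utf-8"),
    PartitionKey="user-42",
)

# Consumer (classic, polling): read one shard from the start of its retained data.
shards = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shards[0]["ShardId"],
    ShardIteratorType="TRIM_HORIZON",  # oldest record still within the retention window
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```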

Kinesis vs SQS

Ingestion, analytics, monitoring, app clicks use cases = Kinesis.

Decoupling, worker pools, asynchronous use cases = SQS.

SQS has one producer group and one consumer group per queue.

SQS is designed for decoupling / asynchronous communication.

Messages in SQS are deleted once consumed; there is no persistence or replay.

Kinesis is for large scale data ingestion.

Limits

Producer – 1 MB/s or 1,000 records/s (writes) per shard.

A ProvisionedThroughputExceededException error is returned if the limit is exceeded.

Consumer (classic) – 2 MB/s read per shard shared across all consumers, and 5 GetRecords API calls per second per shard across all consumers.

Consumer (enhanced fan-out) – 2 MB/s read per shard, per enhanced consumer (push model; no polling API calls).

Latency is around 200 ms for classic consumers and around 70 ms for enhanced fan-out.

Kinesis Data Streams is real-time; Kinesis Data Firehose is near real-time.
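
A sketch of handling that throttling error on the producer side (the stream name and backoff policy are assumptions):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, retries: int = 5) -> None:
    """Retry writes that exceed the 1 MB/s / 1,000 records/s per-shard limit."""
    for attempt in range(retries):
        try:
            kinesis.put_record(
                StreamName="clickstream",  # illustrative stream name
                Data=data,
                PartitionKey=partition_key,
            )
            return
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(2 ** attempt * 0.1)  # exponential backoff before retrying
    raise RuntimeError("shard throughput limit still exceeded after retries")
```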

Amazon Kinesis Data Firehose

Fully managed service to load data into data lakes, data stores, and analytics services.

Automatic scaling, fully serverless and resilient.

Near real-time delivery (~60 seconds); Kinesis Data Streams, by contrast, is real-time (~200 ms).

Supports transformation of data on the fly using AWS Lambda.

Billed based on data volume.

Destinations:

  • Amazon Redshift (via an intermediate S3 bucket).
  • Elasticsearch.
  • Amazon S3.
  • Splunk.
  • Datadog.
  • MongoDB.
  • New Relic.
  • HTTP endpoint.

Can receive data directly from producers via the Firehose API, or use a Kinesis Data Stream as its source.

Buffers incoming data and delivers it when either the buffer size (minimum 1 MB) fills or the buffer interval (minimum 60 seconds) elapses, whichever comes first.
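
As a sketch of the on-the-fly transformation mentioned above, a Firehose transformation Lambda receives base64-encoded records and must return each one with a recordId, a result status, and re-encoded data (the uppercase transform here is an illustrative assumption):

```python
import base64

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: transform each buffered record."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # illustrative transform
        output.append({
            "recordId": record["recordId"],  # must echo the incoming record id
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```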

Amazon Kinesis Data Analytics

Provides real-time SQL processing for streaming data.

Provides analytics for data coming in from Kinesis Data Streams and Kinesis Data Firehose.

Destinations can be Kinesis Data Streams, Kinesis Data Firehose, or AWS Lambda.
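
A sketch of the kind of SQL such an application runs (the column names and windowing are assumptions; "SOURCE_SQL_STREAM_001" is the conventional default name of the in-application input stream), shown as the application-code string it would be deployed with:

```python
# Illustrative Kinesis Data Analytics SQL: a pump that aggregates the input
# stream over one-minute tumbling windows. Column names are assumptions.
APPLICATION_CODE = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker VARCHAR(4), avg_price DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM ticker, AVG(price)
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY ticker, STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""
```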

Amazon QuickSight

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.

QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights.

QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.

QuickSight is serverless and can automatically scale to tens of thousands of users without any infrastructure to manage or capacity to plan for.

It is also the first BI service to offer pay-per-session pricing, where you pay only when your users access their dashboards or reports, making it cost-effective for large-scale deployments.

AWS Glue

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

Data integration is the process of preparing and combining data for analytics, machine learning, and application development.

It involves multiple tasks, such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes.

These tasks are often handled by different types of users, each of whom uses different products.

Glue Crawlers

You can use a crawler to populate the AWS Glue Data Catalog with tables.

This is the primary method used by most AWS Glue users.

A crawler can crawl multiple data stores in a single run.

Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.
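
A minimal crawler sketch with boto3 (the crawler, role, database, and S3 path names are assumptions; the role needs Glue permissions and read access to the S3 path):

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes table definitions
# into the Data Catalog. All names here are illustrative assumptions.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/events/"}]},
)

# Run it; on completion, the catalog tables can be used as ETL sources and targets.
glue.start_crawler(Name="sales-data-crawler")
```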