General Amazon Kinesis Concepts
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information.
A collection of services for processing streams of data.
Data is processed in “shards” – with each shard able to ingest 1000 records per second.
There is a default limit of 500 shards per region, but you can request an increase (there is no hard upper limit on the number of shards).
A record consists of a partition key, sequence number, and data blob (up to 1 MB).
Transient data store – default retention of 24 hours, but can be configured for up to 7 days.
The Kinesis family includes Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics and Kinesis Video Streams (out of scope).
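The routing behind shards and partition keys can be sketched in Python: Kinesis takes the MD5 hash of a record's partition key as a 128-bit integer and routes the record to the shard whose hash key range contains it. The shard ranges below are illustrative, not read from a real stream.

```python
import hashlib

def shard_for_partition_key(partition_key, shard_ranges):
    """Return the index of the shard whose hash key range contains the
    MD5 hash of the partition key (how Kinesis routes records to shards)."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for i, (start, end) in enumerate(shard_ranges):
        if start <= hash_key <= end:
            return i
    raise ValueError("hash key not covered by any shard range")

# Two shards splitting the 2^128 hash key space evenly (illustrative).
MAX_HASH = 2**128 - 1
ranges = [(0, MAX_HASH // 2), (MAX_HASH // 2 + 1, MAX_HASH)]
```

Records with the same partition key always hash to the same shard, which is why ordering is only guaranteed per shard.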
Examples of streaming data:
- Purchases from online stores.
- Stock prices.
- Game data (statistics and results as the gamer plays).
- Social network data.
- Geospatial data (think Uber).
- IoT sensor data.
The following Kinesis services are in scope for the exam:
- Kinesis Streams.
- Kinesis Firehose.
- Kinesis Analytics.
Kinesis Data Streams
Producers send data to Kinesis, data is stored in Shards for 24 hours (by default, up to 7 days).
Consumers then take the data and process it – data can then be saved into another AWS service.
One shard provides a capacity of 1MB/sec data input and 2MB/sec data output.
One shard can support up to 1000 PUT records per second.
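A producer writing at these rates usually batches records; the PutRecords API accepts up to 500 records (and roughly 5 MB) per call. A minimal batching sketch — the stream name in the commented boto3 call is a placeholder:

```python
def batch_records(records, max_records=500, max_bytes=5 * 1024 * 1024):
    """Split records into batches that respect the PutRecords limits
    (up to 500 records / ~5 MB per request)."""
    batches, current, current_bytes = [], [], 0
    for rec in records:
        size = len(rec["Data"]) + len(rec["PartitionKey"].encode("utf-8"))
        if current and (len(current) >= max_records or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

records = [{"Data": b"x" * 100, "PartitionKey": f"key-{i}"} for i in range(1200)]
batches = batch_records(records)
# Each batch could then be sent with boto3, e.g.:
#   boto3.client("kinesis").put_records(StreamName="my-stream", Records=batch)
```

1,200 small records come out as three batches (500, 500, 200), since the record-count limit binds before the byte limit here.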
The following diagram shows producers placing records in a stream and then consumers processing records from the stream. There are multiple options for destinations.
The total capacity of the stream is the sum of the capacities of its shards.
Kinesis Data Streams supports resharding, which lets you adjust the number of shards in your stream to adapt to changes in the rate of data flow through the stream.
There are two types of resharding operations: shard split and shard merge.
- In a shard split, you divide a single shard into two shards.
- In a shard merge, you combine two shards into a single shard.
Splitting increases the number of shards in your stream and therefore increases the data capacity of the stream.
Splitting increases the cost of your stream (you pay per-shard).
Merging reduces the number of shards in your stream and therefore decreases the data capacity—and cost—of the stream.
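A minimal sketch of what the SplitShard and MergeShards operations do to a shard's hash key range (a pure-Python simulation, not a real API call):

```python
def split_range(hash_range, new_starting_hash_key=None):
    """Split one shard's hash key range into two adjacent ranges,
    mirroring SplitShard (defaults to splitting at the midpoint)."""
    start, end = hash_range
    if new_starting_hash_key is None:
        new_starting_hash_key = (start + end) // 2 + 1
    assert start < new_starting_hash_key <= end
    return (start, new_starting_hash_key - 1), (new_starting_hash_key, end)

def merge_ranges(a, b):
    """Merge two adjacent hash key ranges into one, mirroring MergeShards."""
    (a_start, a_end), (b_start, b_end) = sorted([a, b])
    assert a_end + 1 == b_start, "shards must be adjacent to merge"
    return (a_start, b_end)

MAX_HASH = 2**128 - 1
left, right = split_range((0, MAX_HASH))
```

Merging the two halves back together recovers the original range, which matches the constraint that only adjacent shards can be merged.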
Kinesis Data Firehose
Producers send data to Firehose.
There are no shards; scaling is completely automated (elastic).
Firehose sends data to another AWS service for storage; the data can optionally be transformed using AWS Lambda.
Data sent to Redshift must go to S3 first.
Fully managed service.
Near real-time (minimum 60 seconds buffering latency).
Loads data into Redshift, S3, Elasticsearch, or Splunk.
Supports many data formats (format conversion incurs an additional charge).
You pay for the amount of data going through Firehose.
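Firehose's buffering behaviour, flushing a batch when either a size hint or an interval hint is reached, can be modelled with a toy buffer; the hint values and record sizes below are illustrative only:

```python
class DeliveryBuffer:
    """Toy model of Firehose buffering: flush when either the size hint
    or the interval hint is reached (illustrative, not the real service)."""

    def __init__(self, size_hint_bytes, interval_seconds):
        self.size_hint = size_hint_bytes
        self.interval = interval_seconds
        self.buffer, self.buffered_bytes, self.first_at = [], 0, None

    def add(self, record, now):
        """Buffer a record; return a flushed batch if a hint was hit, else None."""
        if self.first_at is None:
            self.first_at = now
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.size_hint or now - self.first_at >= self.interval:
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.buffered_bytes, self.first_at = 0, None
        return batch
```

With a 60-second interval hint, a trickle of small records is still delivered within about a minute, which is where the near-real-time latency above comes from.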
Kinesis Data Analytics
Use SQL queries to analyze data within Kinesis (Streams and Firehose).
Results are then stored in S3, Redshift, or an Elasticsearch cluster.
Use for real-time analytics on Kinesis streams using SQL.
Provides auto scaling.
Fully managed service.
You pay for the actual consumption rate.
Can create streams out of the real-time queries.
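As a rough Python analogue of the kind of tumbling-window SQL aggregation Kinesis Data Analytics runs over a stream (the event data below is made up):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping windows
    and count per key - the Python analogue of a GROUP BY over a
    tumbling window in a Kinesis Analytics SQL query."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(1, "buy"), (5, "buy"), (12, "sell"), (61, "buy")]
print(tumbling_window_counts(events, 60))
# → {0: {'buy': 2, 'sell': 1}, 60: {'buy': 1}}
```

Each window's aggregate can itself be emitted as a new stream record, which is the "streams out of real-time queries" idea above.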
Kinesis Client Library
Kinesis Client Library is a Java library that helps read records from a Kinesis Stream with distributed applications sharing the read workload.
The KCL is different from the Kinesis Data Streams API that is available in the AWS SDKs.
- The Kinesis Data Streams API helps you manage many aspects of Kinesis Data Streams (including creating streams, resharding, and putting and getting records).
- The KCL provides a layer of abstraction specifically for processing data in a consumer role.
The KCL acts as an intermediary between your record processing logic and Kinesis Data Streams.
When you start a KCL application, it calls the KCL to instantiate a worker. The KCL performs the following tasks:
- Connects to the stream.
- Enumerates the shards.
- Coordinates shard associations with other workers (if any).
- Instantiates a record processor for every shard it manages.
- Pulls data records from the stream.
- Pushes the records to the corresponding record processor.
- Checkpoints processed records.
- Balances shard-worker associations when the worker instance count changes.
- Balances shard-worker associations when shards are split or merged.
The KCL ensures that for every shard there is a record processor.
Manages the number of record processors relative to the number of shards & consumers.
If you have only one consumer instance, the KCL creates all the record processors on that single instance.
Each shard is processed by exactly one KCL worker and has exactly one corresponding record processor, so you never need multiple instances to process one shard.
However, one worker can process any number of shards, so it's fine if the number of shards exceeds the number of instances.
If you have two consumer instances, the KCL load balances and creates half the processors on one instance and half on the other.
Scaling out consumers:
- With KCL, generally you should ensure that the number of instances does not exceed the number of shards (except for failure or standby purposes).
- Each shard can be read by only one KCL instance.
- You never need multiple instances to handle the processing of one shard.
- However, one worker can process multiple shards.
- 4 shards = max 4 KCL instances.
- 6 shards = max 6 KCL instances.
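The shard-to-worker balancing described above can be sketched as a simple round-robin assignment (a deliberate simplification of the KCL's DynamoDB-based lease balancing):

```python
def assign_shards(shard_ids, worker_ids):
    """Evenly distribute shards across workers, one worker per shard,
    the way the KCL balances leases (simplified round-robin sketch)."""
    if not worker_ids:
        raise ValueError("need at least one worker")
    assignment = {w: [] for w in worker_ids}
    for i, shard in enumerate(sorted(shard_ids)):
        worker = worker_ids[i % len(worker_ids)]
        assignment[worker].append(shard)
    return assignment

shards = ["shard-0", "shard-1", "shard-2", "shard-3"]
print(assign_shards(shards, ["worker-A"]))               # one worker owns all 4 shards
print(assign_shards(shards, ["worker-A", "worker-B"]))   # 2 shards each
```

With more workers than shards, the extra workers simply receive no shards, which is why instances beyond the shard count sit idle except as standbys.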
Progress is checkpointed into DynamoDB (IAM access required).
KCL can run on EC2, Elastic Beanstalk, and on-premises servers.
Records are read in order at the shard level.
Control access / authorization using IAM policies.
Encryption in flight using HTTPS endpoints.
Encryption at rest using KMS.
Possible to encrypt / decrypt data on the client side.
VPC endpoints are available to access Kinesis from within a VPC.
SQS vs SNS vs Kinesis
SQS:
- Consumers pull data.
- Data is deleted after being consumed.
- Can have as many workers (consumers) as you need.
- No need to provision throughput.
- No ordering guarantee (except with FIFO queues).
- Individual message delay.
SNS:
- Push data to many subscribers.
- Up to 10,000,000 subscribers.
- Data is not persisted (lost if not delivered).
- Up to 100,000 topics.
- No need to provision throughput.
- Integrates with SQS for fan-out architecture pattern.
Kinesis:
- Consumers pull data.
- As many consumers as you need.
- Possible to replay data.
- Meant for real-time big data, analytics, and ETL.
- Ordering at the shard level.
- Data expires after X days.
- Must provision throughput.