AWS Series Big Data

Post Date : 2023-12-31T19:06:05+07:00

Modified Date : 2023-12-31T19:06:05+07:00

Category: systemdesign aws

Tags: aws

The 3Vs of Big Data


  • Volume: Ranges from terabytes to petabytes of data
  • Variety: Includes data from a wide range of sources and formats
  • Velocity: Data needs to be collected, stored, processed, and analyzed within a short period of time

Amazon Redshift

  • Fully managed, petabyte-scale data warehouse service in the cloud
  • It’s a very large relational database traditionally used in big data applications.

Fun fact: the name reportedly reflects a desire to have people leave Oracle databases and use this AWS service instead!

Overview and Uses

  • Size: Incredibly big. It can hold up to 16 PB of data, so you don’t have to split up your large datasets.
  • Relational: This database is relational. You use standard SQL and business intelligence (BI) tools to interact with it.
  • Engine: Based on the PostgreSQL engine; however, it is NOT used for OLTP workloads
  • Usage: It is not meant to be a replacement for standard RDS databases
  • High Performance: Marketed as delivering up to 10x the performance of other data warehouses offered in the cloud
  • Columnar: Data is stored by column instead of by row, which allows for efficient parallel queries
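
To make the columnar point concrete, here is a minimal sketch in plain Python. It is not Redshift code, and the toy table and its column names are made up for illustration. The idea it shows: an aggregate over one column only needs that column's values in a column-based layout, while a row-based layout keeps every field of a record together.

```python
# Illustrative only: a toy "sales" table with hypothetical columns.
rows = [
    {"order_id": 1, "region": "us-east", "amount": 120.0},
    {"order_id": 2, "region": "eu-west", "amount": 80.0},
    {"order_id": 3, "region": "us-east", "amount": 50.0},
]


def total_amount_row_store(table):
    # Row layout: records are stored whole, so a scan walks every record
    # even though only "amount" is needed for the result.
    return sum(row["amount"] for row in table)


def to_column_store(table):
    # Column layout: one list per column, values kept in row order.
    return {name: [row[name] for row in table] for name in table[0]}


def total_amount_column_store(columns):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(columns["amount"])


columns = to_column_store(rows)
row_total = total_amount_row_store(rows)
column_total = total_amount_column_store(columns)
```

Both functions return the same total; the difference is how much data has to be read to get there, which is what matters at warehouse scale.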

High Availability, Snapshot, and Disaster Recovery

  • Redshift now supports Multi-AZ deployments! It only spans two AZs at this time
  • Snapshots are incremental and point-in-time. They can be automated or manual, and are always stored in Amazon S3 (you cannot manage the bucket)
  • No conversions from Single-AZ to Multi-AZ (or vice versa)
  • Leverage large batch inserts
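
As a rough illustration of batch inserts, the sketch below groups rows into multi-row INSERT statements instead of issuing one statement per row. The helper name, table name, and batch size are hypothetical, and the `%s` placeholder style is an assumption borrowed from common PostgreSQL drivers; adapt it to whatever client you actually use.

```python
def _build_insert(table, column_list, row_placeholder, batch):
    # One statement carrying len(batch) rows, plus a flat parameter list.
    sql = (
        f"INSERT INTO {table} ({column_list}) VALUES "
        + ", ".join([row_placeholder] * len(batch))
    )
    params = [value for row in batch for value in row]
    return sql, params


def batched_inserts(table, columns, rows, batch_size=1000):
    """Yield (sql, params) pairs, each inserting up to batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    column_list = ", ".join(columns)
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    batch = []
    for row in rows:
        batch.append(tuple(row))
        if len(batch) == batch_size:
            yield _build_insert(table, column_list, row_placeholder, batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield _build_insert(table, column_list, row_placeholder, batch)


statements = list(
    batched_inserts("events", ["id", "name"], [(1, "a"), (2, "b"), (3, "c")], batch_size=2)
)
```

Each yielded pair can be passed to a driver's `execute(sql, params)` call. For very large loads, the COPY command is generally the preferred route.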

Redshift Spectrum

  • Efficiently query and retrieve data from Amazon S3 without having to load it into Amazon Redshift tables
  • Massive parallelism allows this to run extremely fast against large datasets. It uses Redshift servers that are independent of your cluster

Enhanced VPC Routing

  • All COPY and UNLOAD traffic between your cluster and your data repositories is forced to go through your VPC
  • Enables you to use VPC features: VPC Endpoints, VPC Flow Logs, …

Processing Data with EMR(Elastic MapReduce)

What is ETL?

  • Extract, Transform, Load

  • EMR is an AWS service that helps with ETL processing
  • A managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi, and Presto
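
The three ETL stages can be sketched on a tiny scale with only the Python standard library. This is a toy, single-machine illustration of the pattern, not EMR code; the CSV contents, the `payments` table, and the validation rule are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the last row is deliberately malformed.
RAW_CSV = """user_id,country,amount
1,vn,10.50
2,US,3.25
3,vn,not_a_number
"""


def extract(text):
    # Extract: read raw records from the source (here, an in-memory CSV).
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    # Transform: normalize country codes and drop rows with invalid amounts.
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue
        cleaned.append((int(record["user_id"]), record["country"].upper(), amount))
    return cleaned


def load(conn, rows):
    # Load: write the cleaned rows into the target store.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
```

On a platform like EMR the same three stages run in parallel across a cluster, typically with tools such as Spark or Hive in place of these hand-written functions.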

EMR Storage

There are 3 different types of storage options within EMR

  • Hadoop Distributed File System (HDFS): a distributed, scalable file system for Hadoop that distributes stored data across instances.

Streaming Data with Kinesis

  • Allows you to ingest, process, and analyze real-time streaming data. You can think of it as a huge data highway connected to your AWS account.

There are 2 major versions of Kinesis:

Data Streams

  • Purpose: Real-time streaming for ingesting data
  • Speed: Real time
  • Difficulty: You’re responsible for creating the consumers and scaling the stream.
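
Scaling a stream largely comes down to how records spread across its shards. Shards are not covered in these notes, so treat the following as background: Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch below assumes the shards split the key space evenly, which is a simplification; the function name and keys are hypothetical.

```python
import hashlib

HASH_KEY_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_partition_key(partition_key, shard_count):
    """Return the index of the shard an evenly split stream would use."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_KEY_SPACE // shard_count
    # The last shard absorbs any remainder left by the integer division.
    return min(hash_key // range_size, shard_count - 1)


# Count how many of these hypothetical device IDs land on each of 4 shards.
counts = [0, 0, 0, 0]
for device in range(1000):
    counts[shard_for_partition_key(f"device-{device}", 4)] += 1
```

A skewed `counts` list would hint at a poor partition key choice: one hot shard limits throughput no matter how many shards the stream has.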


Data Firehose

  • Purpose: Data transfer tool to get information to S3, Redshift, Elasticsearch, or Splunk
  • Speed: Near real time (within 60s)
  • Difficulty: Plug and play with AWS architecture
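
The "near real time" behavior comes from buffering: records are collected and delivered in batches once the buffer is large enough or old enough. The class below is a simplified sketch of that idea, not the Firehose API. The class name, the record-count threshold (standing in for a size limit), and the injectable clock are assumptions made for illustration.

```python
import time


class BufferedDelivery:
    """Collect records and deliver them in batches by count or by age."""

    def __init__(self, deliver, max_records=500, max_age_seconds=60.0, clock=time.monotonic):
        self._deliver = deliver          # callable that receives a list of records
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer = []
        self._oldest = None              # arrival time of the oldest buffered record

    def put(self, record):
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(record)
        self.flush_if_due()

    def flush_if_due(self):
        # Deliver when either threshold is reached; otherwise keep buffering.
        if not self._buffer:
            return
        too_many = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._oldest >= self._max_age
        if too_many or too_old:
            self._deliver(list(self._buffer))
            self._buffer.clear()
            self._oldest = None
```

In a real service the age check would run on a timer rather than only when called; here `flush_if_due()` has to be invoked explicitly, which keeps the sketch easy to follow.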


Kinesis Data Analytics

  • Analyze data using standard SQL
  • Easy: It is simple to tie Data Analytics into your Kinesis pipeline. It’s directly supported by Data Streams and Data Firehose
  • It is a fully managed, real-time, serverless service that automatically handles scaling and provisioning of the needed resources.
  • You pay only for the resources you consume as your data passes through.

Kinesis and SQS

  • SQS is a message broker that is simple to use and doesn’t require much configuration. It doesn’t offer real-time message delivery
  • Kinesis is a bit more complicated to configure and it’s mostly used in big data applications. It does provide real-time communication.

Amazon Athena

  • Athena is an interactive query service that makes it easy to analyze data in S3 using SQL. This allows you to directly query data in your S3 bucket without loading it into a database.
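
A query can be submitted from code with the `start_query_execution` call on the boto3 Athena client. The sketch below is a minimal example under stated assumptions: the database `example_db`, the table `access_logs`, and the results bucket are hypothetical, and the table is assumed to be already defined over files in S3. `boto3` is imported lazily so the request builder can be tried without AWS credentials.

```python
def build_athena_request(sql, database, output_location):
    # Parameters accepted by the Athena StartQueryExecution API.
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    # Needs boto3 installed and AWS credentials configured.
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    # The query runs asynchronously; this ID is used to poll for results.
    return response["QueryExecutionId"]


request = build_athena_request(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
```

Athena writes the result set to the output location in S3, so nothing has to be loaded into a database before or after the query.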

AWS Glue

  • AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data. It allows you to perform ETL (extract, transform, load) workloads without managing the underlying servers.


Visualizing Data with Amazon QuickSight

  • Fully managed serverless business intelligence (BI) data visualization service.
  • It allows you to create dashboards to share with users
  • SPICE: Robust in-memory engine to perform advanced calculations
  • Offers Column-Level Security (CLS)
  • Priced on a per-session and per-user basis

Moving Transformed Data Using Amazon Data Pipeline

  • Managed AWS service for ETL workflows that automates the movement and transformation of your data
  • Data-driven workflows: you can create dependencies between tasks and activities.
  • Storage Integrations: DynamoDB, RDS, Redshift, and S3
  • Compute Integrations: EC2, EMR
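
The notion of dependencies between tasks and activities can be illustrated with a small dependency graph. This sketch uses Python's standard `graphlib` module (Python 3.9+), not the Data Pipeline API, and the activity names are hypothetical. It derives an order in which every activity runs only after the activities it depends on.

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities that must finish first.
dependencies = {
    "copy_from_rds": set(),
    "transform_on_emr": {"copy_from_rds"},
    "load_into_redshift": {"transform_on_emr"},
    "archive_to_s3": {"transform_on_emr"},
}


def execution_order(deps):
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

The two activities that depend only on `transform_on_emr` have no ordering between them, so a scheduler is free to run them in parallel.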


Amazon Managed Streaming for Apache Kafka(Amazon MSK)

  • Fully managed service for running data streaming applications that leverage Apache Kafka
  • Automatic recovery
  • Detection and replacement of unhealthy nodes
  • Integration with KMS for SSE requirements
  • Encryption at rest by default
  • TLS 1.2 for encryption in transit between brokers in a cluster
  • Broker logs can be delivered to Amazon CloudWatch, Amazon S3, or Amazon Kinesis Data Firehose. API calls are logged to CloudTrail.


Analyzing Data with Amazon OpenSearch

  • OpenSearch is a managed service allowing you to run search and analytics engines for various use cases
  • It is the successor to Amazon Elasticsearch Service

