AWS Series Big Data
Post Date : 2023-12-31T19:06:05+07:00
Modified Date : 2023-12-31T19:06:05+07:00
Category: systemdesign aws
Tags: aws
The 3Vs of Big Data
Volume
- Ranges from terabytes to petabytes of data
Variety
- Includes data from a wide range of sources and formats
Velocity
- Data needs to be collected, stored, processed, and analyzed within a short period of time
Amazon Redshift
- Fully managed, petabyte-scale data warehouse service in cloud
- It’s a very large relational database traditionally used in big data applications.
Fun fact: it received its name from the desire to have people shift away from Oracle databases and leverage this AWS service instead!
Overview and Uses
- Size: incredibly big - it can hold up to 16 PB of data. You don’t have to split up your large datasets.
- Relational: This database is relational. You use your standard SQL and business intelligence (BI) tools to interact with it.
- Based on the PostgreSQL engine type; however, it is NOT used for OLTP workloads
- Usage: it is not meant to be a replacement for standard RDS databases
- High Performance: 10x the performance of other data warehouses offered in the cloud
- Columnar: Storage of data is column-based instead of row-based, which allows for efficient parallel queries
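The columnar point above can be made concrete with a toy sketch (plain Python, not Redshift itself, with invented data): aggregating one column in a columnar layout reads only that column’s block, while a row layout forces a scan over every field of every record.

```python
# Illustrative sketch: row-based vs column-based storage of the same records.
rows = [
    {"id": 1, "region": "us-east-1", "sales": 100},
    {"id": 2, "region": "eu-west-1", "sales": 250},
    {"id": 3, "region": "us-east-1", "sales": 175},
]

# Row-based layout: all columns interleaved, one record at a time.
row_store = [list(r.values()) for r in rows]

# Columnar layout: one contiguous block per column.
col_store = {key: [r[key] for r in rows] for key in rows[0]}

# SUM(sales) against the columnar layout touches a single block...
total_columnar = sum(col_store["sales"])

# ...while the row layout must walk every record and pick out one field.
total_row = sum(record[2] for record in row_store)

print(total_columnar)  # 525
```

The same asymmetry is why analytic queries (aggregates over few columns, many rows) favor columnar engines like Redshift.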
High Availability, Snapshot, and Disaster Recovery
- Redshift now supports Multi-AZ deployments! It only spans two AZs at this time
- Snapshots are incremental and point-in-time. They can be automated or manual. Always contained in Amazon S3 (you cannot manage the bucket)
- No conversions from Single-AZ to Multi-AZ (or vice versa)
- Leverage large batch inserts
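As a sketch of the batch-insert advice, the helper below renders one multi-row INSERT statement instead of many single-row ones (the `events` table and its columns are hypothetical; for truly large loads, COPY from S3 is the usual recommendation):

```python
# Sketch: build a single multi-row INSERT rather than row-by-row statements.
def batch_insert_sql(table, columns, rows):
    """Render one INSERT covering every row in `rows`."""
    values = ", ".join(
        "(" + ", ".join(repr(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values};"

sql = batch_insert_sql(
    "events",                      # hypothetical table
    ["user_id", "action"],
    [(1, "click"), (2, "view"), (3, "click")],
)
print(sql)
# INSERT INTO events (user_id, action) VALUES (1, 'click'), (2, 'view'), (3, 'click');
```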
Redshift Spectrum
- Efficiently query and retrieve data from Amazon S3 without having to load the data into Amazon Redshift tables
- Massive parallelism allows this to run extremely fast against large datasets. Uses Redshift servers that are independent of your clusters
Enhanced VPC Routing
- All COPY and UNLOAD traffic between your cluster and your data repositories is forced to go through your VPC
- Enables you to use VPC features: VPC Endpoints, VPC Flow Logs, …
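Enhanced VPC Routing is a per-cluster setting. A minimal sketch of turning it on with boto3 is below; the cluster identifier is hypothetical, and since the real call needs AWS credentials, only the request parameters are built and the call itself is left commented out.

```python
# Sketch: request parameters for enabling Enhanced VPC Routing on an
# existing Redshift cluster. Cluster name is hypothetical.
params = {
    "ClusterIdentifier": "my-redshift-cluster",  # hypothetical
    "EnhancedVpcRouting": True,
}

# With credentials configured, the actual call would be:
# import boto3
# boto3.client("redshift").modify_cluster(**params)
```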
Processing Data with EMR (Elastic MapReduce)
What is ETL?
- Extract, Transform, Load
EMR
- AWS Service to help with ETL processing
- A managed big data platform that allows you to process vast amounts of data using open-source tools, such as Spark, Hive, HBase, Flink, Hudi, and Presto
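The ETL shape itself is simple; EMR’s value is running it at scale with tools like Spark. A toy single-machine illustration with invented data:

```python
# Toy ETL: extract raw lines, transform (clean + reshape), load into a store.
raw_lines = ["alice,3", "bob,not_a_number", "carol,7"]        # Extract

def transform(line):
    """Uppercase the name, parse the score, drop malformed records."""
    name, score = line.split(",")
    return (name.upper(), int(score)) if score.isdigit() else None

cleaned = [rec for rec in map(transform, raw_lines) if rec]   # Transform

warehouse = dict(cleaned)                                     # Load
print(warehouse)  # {'ALICE': 3, 'CAROL': 7}
```

An EMR job runs this same extract/transform/load pattern, but distributed across a cluster over terabytes of input.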
EMR Storage
There are 3 different types of storage options within EMR:
- Hadoop Distributed File System (HDFS): a distributed, scalable file system for Hadoop that spreads stored data across instances.
- EMR File System (EMRFS): an implementation of HDFS that lets clusters store data directly in Amazon S3.
- Local file system: locally attached (instance store) disks, suitable only for temporary data.
Streaming Data with Kinesis
- Allows you to ingest, process, and analyze real-time streaming data. You can think of it as a huge data highway connected to your AWS account.
There are 2 major versions of Kinesis:
Data Streams
- Purpose: Real-time streaming for ingesting data
- Speed: Real time
- Difficulty: You’re responsible for creating consumers and scaling the stream.
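A producer sketch for Data Streams is below. The stream name and payload are hypothetical, and the real `put_record` call (commented out) needs boto3 and AWS credentials, so only the request is assembled here.

```python
import json

# Sketch: one record destined for a Kinesis Data Stream.
record = {
    "StreamName": "clickstream",  # hypothetical stream
    "Data": json.dumps({"user": 42, "event": "click"}).encode(),
    "PartitionKey": "user-42",    # determines which shard gets the record
}

# With credentials configured, the actual call would be:
# import boto3
# boto3.client("kinesis").put_record(**record)
```

The partition key matters: records sharing a key land on the same shard in order, so choosing a well-distributed key is how you spread load across shards.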
Data Firehose
- Purpose: Data transfer tool to get information to S3, Redshift, Elasticsearch, or Splunk
- Speed: Near real time (within 60s)
- Difficulty: Plug and play with AWS architecture
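The "plug and play" point shows in the producer side too: with Firehose you just hand records to the delivery stream and it handles delivery to S3/Redshift/etc. with no consumer code of your own. A hedged sketch (hypothetical names; the commented call needs boto3 and credentials):

```python
# Sketch: a batch of records for a Firehose delivery stream.
batch = {
    "DeliveryStreamName": "logs-to-s3",  # hypothetical delivery stream
    "Records": [{"Data": (line + "\n").encode()} for line in ("log1", "log2")],
}

# With credentials configured, the actual call would be:
# import boto3
# boto3.client("firehose").put_record_batch(**batch)
```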
Kinesis Data Analytics
- Analyze data using standard SQL
- Easy: simple to tie Data Analytics into your Kinesis Pipeline. It’s directly supported by Data Streams and Data Firehose
- This is a fully managed, real-time, serverless service that automatically handles scaling and provisioning of needed resources.
- You only pay for the amount of resources you consume as your data passes through.
Kinesis and SQS
- SQS is a message broker that is simple to use and doesn’t require much configuration. It doesn’t offer real-time message delivery.
- Kinesis is a bit more complicated to configure and is mostly used in big data applications. It does provide real-time communication.
Amazon Athena
- Athena is an interactive query service that makes it easy to analyze data in S3 using SQL. It allows you to directly query data in your S3 bucket without loading it into a database.
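A hedged sketch of what an Athena query submission looks like: the database, table, and results bucket are all hypothetical, and since the real call needs boto3 and credentials, only the request parameters are built here.

```python
# Sketch: parameters for one Athena query over data sitting in S3.
query = {
    "QueryString": "SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    "QueryExecutionContext": {"Database": "weblogs"},           # hypothetical
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}

# With credentials configured, the actual call would be:
# import boto3
# boto3.client("athena").start_query_execution(**query)
```

Athena runs asynchronously: submitting returns a query execution ID, and results land in the configured S3 output location.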
AWS Glue
- AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data. It allows you to perform ETL (extract, transform, load) workloads without managing underlying servers.
Visualizing Data with Amazon QuickSight
- Fully managed serverless business intelligence (BI) data visualization service.
- It allows you to build dashboards that can be shared with users
- SPICE: Robust in-memory engine to perform advanced calculations
- Offer Column-Level Security(CLS)
- Priced on a per-session and per-user basis
Moving Transformed Data Using AWS Data Pipeline
- Managed AWS service for ETL workflows that automates the movement and transformation of your data
- Data-driven workflows: you can create dependencies between tasks and activities.
- Storage Integrations: DynamoDB, RDS, Redshift, and S3
- Compute Integrations: EC2, EMR
Amazon Managed Streaming for Apache Kafka(Amazon MSK)
- Fully managed service for running data stream applications that leverage Apache Kafka
- Automatic Recovery
- Detects and replaces unhealthy nodes
- Integration with KMS for SSE requirements
- Encryption at rest by default
- TLS 1.2 for encryption in transit between brokers in a cluster
- Delivers broker logs to Amazon CloudWatch, Amazon S3, or Amazon Kinesis. API calls are logged to CloudTrail.
Analyzing Data with Amazon OpenSearch
- OpenSearch is a managed service that allows you to run search and analytics engines for various use cases
- It is the successor to Amazon Elasticsearch Service