AWS Glue and PySpark Guide

Posted in AWS, AWS Glue (1 comment)

In this post, I have penned down AWS Glue and PySpark functionality that can be helpful when designing an AWS pipeline and writing AWS Glue PySpark scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing. While […]

Filtering using Event Patterns – EventBridge

Posted in AWS, EventBridge

Amazon EventBridge, as the name suggests, is a serverless pub/sub service that lets applications connect via an “event bus”. It helps build loosely coupled, distributed, event-driven architectures. EventBridge was formerly called CloudWatch Events. In this blog, I will give an example of setting a filter-based event pattern in Amazon EventBridge to send an SNS notification. […]
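
To make the idea of a filter-based event pattern concrete, here is a minimal sketch in plain Python. The pattern values (the S3 source, the "Object Created" detail type, and the bucket name) are illustrative assumptions, not taken from the post, and the `matches` helper is a hypothetical, simplified stand-in for EventBridge's own matching: it only handles exact-value matching on nested keys.

```python
import json

# Hypothetical event pattern: the source, detail-type, and bucket name
# below are illustrative assumptions, not values from the original post.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["my-example-bucket"]}},
}


def matches(pattern, event):
    """Simplified exact-value matching (a subset of what EventBridge does).

    Every key in the pattern must exist in the event. Nested dicts are
    matched recursively; a leaf list holds the allowed values for that key.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# A hypothetical incoming event, trimmed to the fields the pattern inspects.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "my-example-bucket"}, "object": {"key": "a.csv"}},
}
other_event = dict(sample_event, source="aws.ec2")

# The JSON string is what would be supplied as the rule's event pattern.
pattern_json = json.dumps(event_pattern, indent=2)
```

Only events that pass the pattern reach the rule's target (an SNS topic in the post's example); everything else on the bus is ignored by that rule.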

AWS Athena – DML Queries

Posted in AWS, AWS Athena

You can learn something new every day, and today I learned that AWS Athena supports INSERT INTO queries. Let's create a table based on marvel_superheroes using the CTAS command – Creating the table partitioned by “year” failed with: HIVE_COLUMN_ORDER_MISMATCH: Partition keys must be the last columns in the table and in the same order as the […]
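
The column-order rule behind that error can be sketched as a small helper that always places partition columns last in the SELECT list of a CTAS statement. This is only a sketch: `build_ctas` is a hypothetical helper, and the column names (`name`, `alignment`) and the target table name are assumptions, since the excerpt does not show the actual schema of marvel_superheroes.

```python
def build_ctas(new_table, source_table, columns, partition_cols, fmt="PARQUET"):
    """Build an Athena CTAS statement with partition columns moved to the end.

    Athena requires partition keys to be the last columns of the SELECT list,
    in the same order as they appear in partitioned_by.
    """
    non_partition = [c for c in columns if c not in partition_cols]
    select_list = ", ".join(non_partition + list(partition_cols))
    partitions = ", ".join(f"'{c}'" for c in partition_cols)
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = '{fmt}', partitioned_by = ARRAY[{partitions}]) "
        f"AS SELECT {select_list} FROM {source_table}"
    )


# Hypothetical columns: "year" is listed in the middle on purpose to show
# that the helper reorders it to the end.
sql = build_ctas(
    new_table="marvel_superheroes_by_year",
    source_table="marvel_superheroes",
    columns=["name", "year", "alignment"],
    partition_cols=["year"],
)
print(sql)
```

Selecting the columns explicitly, rather than with `SELECT *`, is what lets the partition key be moved to the end regardless of where it sits in the source table.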

Aurora MySQL – Export data to S3

Posted in AWS, AWS Aurora

Using SELECT INTO OUTFILE S3, you can query data from an Aurora MySQL DB cluster and save it directly into text files stored in an S3 bucket. 1. Create an IAM policy for S3. { "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:DeleteObject", "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::bucket-name", […]
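
Once the IAM setup is in place, the export itself is a single SQL statement. The sketch below only assembles that statement as a string; `build_export_sql` is a hypothetical helper, and the `employees` table and the `exports/employees` prefix are assumptions for illustration. Running it against a cluster would still need a MySQL client and the cluster's S3 role configuration, which are not shown here.

```python
def build_export_sql(select_sql, s3_uri, overwrite=True):
    """Assemble an Aurora MySQL SELECT ... INTO OUTFILE S3 statement.

    The output is comma-separated text, one row per line. OVERWRITE ON
    replaces existing files under the same S3 prefix.
    """
    return (
        f"{select_sql} "
        f"INTO OUTFILE S3 '{s3_uri}' "
        "FIELDS TERMINATED BY ',' "
        "LINES TERMINATED BY '\\n' "
        f"OVERWRITE {'ON' if overwrite else 'OFF'}"
    )


# Hypothetical table and prefix; "bucket-name" is the placeholder bucket
# used in the IAM policy above.
sql = build_export_sql(
    "SELECT * FROM employees",
    "s3://bucket-name/exports/employees",
)
print(sql)
```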

AWS Glue – Querying Nested JSON with Relationalize Transform

Posted in AWS, AWS Glue (4 comments)

AWS Glue has a Relationalize transform that can convert nested JSON into columns that you can then write to S3 or import into relational databases. As an example – Initial Schema: >>> df.printSchema() root |-- Id: string (nullable = true) |-- LastUpdated: long (nullable = true) |-- LastUpdatedBy: string (nullable = true) |-- Properties: struct (nullable […]
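
To illustrate what flattening a nested struct into columns means, here is a small plain-Python sketch. It is not the Glue API: `flatten` is a hypothetical helper that mimics only the struct-to-dotted-column part of the idea, and it does not handle arrays. The fields inside `Properties` (`Name`, `Location`) and all sample values are assumptions, since the schema in the excerpt is truncated.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted column names.

    Illustrative only: this handles nested structs, not arrays.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the struct, carrying the parent name as prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical record shaped like the schema above; the contents of
# "Properties" are made up for illustration.
record = {
    "Id": "abc-123",
    "LastUpdated": 1600000000,
    "LastUpdatedBy": "etl-user",
    "Properties": {"Name": "example", "Location": "us-east-1"},
}

flat = flatten(record)
print(sorted(flat))
```

After flattening, every value sits in a top-level column, which is the shape a relational table or a flat file in S3 expects.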