AWS Glue and PySpark Guide

Posted in AWS, AWS Glue

In this post, I have noted down AWS Glue and PySpark functionality that can be helpful when designing an AWS pipeline and writing AWS Glue PySpark scripts. AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing. While […]

Athena: Extracting data from JSON

Posted in AWS, AWS Athena

Suppose you have a table in Athena and one of its columns contains JSON data. How can you extract the individual keys? In this example, the table has a column "fixedproperties" which contains JSON data. How can you display the data in the format below? select json_extract(fixedproperties, '$.objectId') as object_id, json_extract(fixedproperties, '$.custId') as cust_id, json_extract(fixedproperties, '$.score') as score […]
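
To make it easier to see what the three paths in the query select, here is a minimal Python sketch that pulls the same top-level keys out of one JSON string. It only mimics the idea locally and is not Athena's implementation; the sample value and the helper name extract_keys are made up for illustration, and only the key names come from the query above.

```python
import json

# Hypothetical value of the "fixedproperties" column for a single row;
# only the key names (objectId, custId, score) come from the query.
fixedproperties = '{"objectId": "obj-1", "custId": "c-42", "score": 7}'


def extract_keys(raw_json, keys):
    """Return the requested top-level keys from a JSON string (None if a key is absent)."""
    parsed = json.loads(raw_json)
    return {key: parsed.get(key) for key in keys}


row = extract_keys(fixedproperties, ["objectId", "custId", "score"])
print(row)
```

In Athena itself, json_extract returns a JSON-typed value, so string results keep their quotes; json_extract_scalar returns a plain varchar if that is what you need.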

Merge json files using Pandas

Posted in Coding, Pandas

Quick demo for merging multiple JSON files using Pandas: import pandas as pd import glob import json file_list = glob.glob("*.json") >>> file_list ['b.json', 'c.json', 'a.json'] Use enumerate to assign a counter to the files. allFilesDict = {v:k for v, k in enumerate(file_list, 1)} >>> allFilesDict {1: 'b.json', 2: 'c.json', 3: 'a.json'} Append the data into a list […]
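
Since the excerpt is cut off before the merge step, here is one possible way to finish the job, not necessarily the approach the full post takes. It assumes each file holds a list of flat JSON objects; the helper merge_json_files, the source_file column, and the sample data are all hypothetical.

```python
import glob
import json
import os
import tempfile

import pandas as pd


def merge_json_files(pattern):
    """Read every JSON file matching `pattern` and stack the records into one DataFrame."""
    frames = []
    # sorted() makes the order deterministic; glob.glob returns files in arbitrary order
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            records = json.load(fh)  # assumption: each file is a list of flat JSON objects
        frame = pd.DataFrame(records)
        frame["source_file"] = os.path.basename(path)  # remember where each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample data, written to a temporary directory for the demo
tmp_dir = tempfile.mkdtemp()
samples = {
    "a.json": [{"id": 1, "score": 10}],
    "b.json": [{"id": 2, "score": 20}, {"id": 3, "score": 30}],
}
for name, rows in samples.items():
    with open(os.path.join(tmp_dir, name), "w") as fh:
        json.dump(rows, fh)

merged = merge_json_files(os.path.join(tmp_dir, "*.json"))
print(merged)
```

Sorting the file list is a deliberate choice here: as the file_list output in the excerpt shows, glob does not return names in alphabetical order.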