In this blog post I will walk through the steps required to set up an AWS Glue job that scans a DynamoDB table in another account. In my setup, I scan the DynamoDB table in Account A (us-west-2), perform Glue transformations in Account B (us-east-1) and write the output to S3 in Account B.
Account A – DynamoDB table
Account B – AWS Glue Job, S3 Bucket
1. Create an IAM role in Account B (us-east-1) with Glue as the trusted entity, attach the AWSGlueServiceRole managed policy, and attach another policy allowing the sts and s3 actions below. Make a note of the role ARN.
arn:aws:iam::xxxxxxxxxxxx:role/ddb-to-s3-glue-role
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sts:AssumeRole",
                "sts:GetAccessKeyInfo",
                "sts:GetSessionToken",
                "sts:TagSession"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
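If you prefer to script this step rather than use the console, the same policy can be built and attached inline with boto3. This is a minimal sketch; the role name, policy name, and bucket name are placeholders you would replace with your own:

```python
import json

def glue_role_policy(bucket_name):
    """Build the inline policy from step 1: STS actions plus
    read/write access to the target S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "sts:AssumeRole",
                    "sts:GetAccessKeyInfo",
                    "sts:GetSessionToken",
                    "sts:TagSession",
                ],
                "Effect": "Allow",
                "Resource": "*",
            },
            {
                "Action": [
                    "s3:DeleteObject",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:PutObject",
                ],
                "Effect": "Allow",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
        ],
    }

def attach_policy(iam_client, role_name, bucket_name):
    """Attach the policy inline to the Glue role.
    Requires AWS credentials with IAM permissions to run;
    role_name is a placeholder."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="ddb-to-s3-glue-inline",
        PolicyDocument=json.dumps(glue_role_policy(bucket_name)),
    )
```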
2. Create an IAM role in Account A (us-west-2), where the DynamoDB table resides, allowing the "Scan" action.
arn:aws:iam::xxxxxxxxxxxx:role/scan-ddb-role
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:Scan"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-west-2:xxxxxxxxxxxx:table/service-statement"
            ]
        }
    ]
}
Make sure to modify the trust relationship so that the role created in step 1 is allowed to assume this role, and add any access conditions you need.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxxxxx:role/ddb-to-s3-glue-role"
            },
            "Action": "sts:AssumeRole",
            "Condition": {}
        }
    ]
}
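Once the trust relationship is in place, you can sanity-check it outside Glue by assuming the role with boto3. The helper below is a sketch: `extract_credential_kwargs` reshapes an `assume_role` response into the keyword arguments `boto3.resource` expects, and `assume_scan_role` (which needs real AWS credentials to run) ties it together. The role ARN and session name are placeholders.

```python
def extract_credential_kwargs(assume_role_response):
    """Map an STS assume_role response onto the credential keyword
    arguments that boto3 clients and resources accept."""
    creds = assume_role_response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }

def assume_scan_role(role_arn, region_name="us-west-2"):
    """Assume the cross-account role and return a DynamoDB resource
    bound to the temporary credentials (requires AWS access)."""
    import boto3  # imported here so the helper above stays AWS-free
    sts = boto3.client("sts", region_name=region_name)
    response = sts.assume_role(RoleArn=role_arn,
                               RoleSessionName="trust-check")
    return boto3.resource("dynamodb", region_name=region_name,
                          **extract_credential_kwargs(response))
```

If the trust policy is wrong, the `assume_role` call fails with an AccessDenied error, which is a much faster feedback loop than waiting on a full Glue job run.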
3. Create AWS Glue job in Account B (us-east-1). Choose the IAM role created in step 1 as the role to be assumed by the job.
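The job can also be created through the Glue `create_job` API, which takes the step-1 role ARN in its `Role` field. A minimal sketch; the job name, script location, and Glue version are illustrative assumptions, not values from my setup:

```python
def glue_job_definition(name, role_arn, script_location):
    """Build the kwargs for glue_client.create_job(**kwargs).
    Everything except Role is an illustrative placeholder."""
    return {
        "Name": name,
        "Role": role_arn,  # the role created in step 1
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",
    }

# Usage (requires AWS access):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# glue.create_job(**glue_job_definition(
#     "ddb-to-s3",
#     "arn:aws:iam::xxxxxxxxxxxx:role/ddb-to-s3-glue-role",
#     "s3://your-bucket-name/scripts/ddb_to_s3.py"))
```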
4. Create Glue script to scan the table.
import sys
import boto3
from time import time
from datetime import datetime
import pyspark.sql.types as t
import pyspark.sql.functions as f
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from boto3.dynamodb.conditions import Key

# Instantiate the STS client
sts_client = boto3.client('sts', region_name='us-west-2')

# Assume the role created in step 2
assumed_role_object = sts_client.assume_role(
    RoleArn="arn:aws:iam::xxxxxxxxxxxx:role/scan-ddb-role",
    RoleSessionName="AssumeRoleSession1")

# Retrieve the temporary credentials
credentials = assumed_role_object['Credentials']

# Instantiate the DynamoDB resource with the assumed-role credentials
dynamodb_resource = boto3.resource(
    'dynamodb',
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
    region_name='us-west-2'
)

# Function to scan the DynamoDB table
def scan_table(table_name, filter_key=None, filter_value=None):
    """
    Perform a scan operation on the table. Can specify filter_key
    (column name) and its value to be filtered. Follows
    LastEvaluatedKey to fetch all pages of results.
    Returns a list of items.
    """
    table = dynamodb_resource.Table(table_name)
    scan_kwargs = {}
    if filter_key and filter_value:
        scan_kwargs['FilterExpression'] = Key(filter_key).eq(filter_value)
    response = table.scan(**scan_kwargs)
    items = response['Items']
    # Keep scanning until every page has been consumed. The filter
    # expression must be re-applied on every page, not just the first.
    while response.get('LastEvaluatedKey'):
        response = table.scan(
            ExclusiveStartKey=response['LastEvaluatedKey'],
            **scan_kwargs)
        items += response['Items']
    return items

# Variable to store the result of the scan. Returns a list of dictionaries.
table_items = scan_table(table_name='service-statement')
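From here the scanned items can be turned into a Spark DataFrame and written to S3 in Account B. One gotcha worth calling out: boto3 returns DynamoDB numbers as `decimal.Decimal`, which Spark does not infer cleanly, so it helps to normalize the items first. A sketch, with the Glue write shown in comments since the frame name and S3 path are assumptions:

```python
from decimal import Decimal

def normalize_item(obj):
    """Recursively convert decimal.Decimal values (as returned by
    boto3 for DynamoDB numbers) into int/float so Spark can infer
    a schema from the items."""
    if isinstance(obj, Decimal):
        return int(obj) if obj % 1 == 0 else float(obj)
    if isinstance(obj, dict):
        return {k: normalize_item(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_item(v) for v in obj]
    return obj

# In the Glue job (sketch; names and path are placeholders):
# rows = [normalize_item(item) for item in table_items]
# df = spark.createDataFrame(rows)
# dyf = DynamicFrame.fromDF(df, glueContext, "ddb_items")
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf, connection_type="s3",
#     connection_options={"path": "s3://your-bucket-name/ddb-export/"},
#     format="parquet")
```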
Reference –
https://martinapugliese.github.io/interacting-with-a-dynamodb-via-boto3/