
Implementing Glue ETL job with Job Bookmarks

AWS Glue is a fully managed ETL service that loads large datasets from various sources for analytics and data processing using Apache Spark ETL jobs.

In this post, I will discuss how the AWS Glue Job Bookmarks feature is used in the following architecture.

[Figure: architecture diagram]

AWS Glue Job Bookmarks let Glue maintain state information about an ETL job, so that a job rerun on a scheduled interval processes only new data and does not reprocess old data. In a nutshell, AWS Glue jobs use job bookmarks to process only the data that has arrived since the last job run, avoiding duplicate processing.

In the above architecture, Kinesis Data Firehose streams event data to an S3 bucket, referred to here as the raw data store, delivering whenever the configured buffer size or buffer interval is reached, whichever comes first. Suppose the buffer interval is set to 900 seconds and that condition is satisfied first: Firehose then delivers data to S3 every 15 minutes, writing it under the S3 destination prefix. The S3 destination prefix in Firehose is optional and configurable. For example, to write the data in hourly partitions, you can set the following prefix under the Amazon S3 destination:

event/year=!{timestamp:yyyy}/month=!{timestamp:MM}/date=!{timestamp:dd}/hour=!{timestamp:HH}/
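
To see what that prefix resolves to, here is a minimal Python sketch that expands the !{timestamp:...} expressions for a given time. It is only an illustration: render_prefix is a hypothetical helper, it handles just the four patterns used above, and it assumes the timestamp is evaluated in UTC. Firehose performs the real expansion itself at delivery time.

```python
import re
from datetime import datetime, timezone

# Java-style patterns used in the prefix above, mapped to strftime codes.
_PATTERNS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def render_prefix(prefix, when):
    """Expand !{timestamp:<pattern>} expressions for the datetime `when`."""
    def expand(match):
        return when.strftime(_PATTERNS[match.group(1)])

    return re.sub(r"!\{timestamp:([A-Za-z]+)\}", expand, prefix)


prefix = ("event/year=!{timestamp:yyyy}/month=!{timestamp:MM}/"
          "date=!{timestamp:dd}/hour=!{timestamp:HH}/")
print(render_prefix(prefix, datetime(2020, 4, 17, 3, 15, tzinfo=timezone.utc)))
# event/year=2020/month=04/date=17/hour=03/
```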

The AWS Glue ETL job is started by a Glue trigger. Because the job bookmark is enabled, the job processes only the incremental data since the last successful run. For S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to determine which objects need to be processed. If input source data has been modified since the last job run, those files are reprocessed when the job runs again.

You can enable the job bookmark feature either while creating the job or later, under “Advanced properties”.
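
Outside the console, the same setting is expressed as the job argument --job-bookmark-option. The sketch below only builds that argument; bookmark_arguments is a hypothetical helper, and how the arguments reach Glue (console, CLI, or SDK) depends on your setup.

```python
# Values AWS Glue accepts for the --job-bookmark-option job argument.
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",
    "disable": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def bookmark_arguments(mode="enable"):
    """Return the job argument that sets the bookmark mode (hypothetical helper)."""
    return {"--job-bookmark-option": BOOKMARK_OPTIONS[mode]}


print(bookmark_arguments())
# {'--job-bookmark-option': 'job-bookmark-enable'}
```

With boto3, for example, a dictionary like this could be passed as DefaultArguments to create_job or as Arguments to start_job_run.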

During job execution, Glue skips a partition if it is empty or if the creation timestamp of its objects is older than the timestamp of the last successful job run, as captured by the job bookmark.
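
The skipping rule can be pictured with a simplified model. This is not Glue's actual implementation, which is internal to the service; select_new_objects and the sample paths and timestamps are made up purely to illustrate the idea of comparing object timestamps against the last successful run.

```python
from datetime import datetime, timezone


def select_new_objects(partitions, last_run):
    """Keep only objects modified after last_run; drop partitions left empty.

    partitions maps a partition path to {object_key: last_modified}.
    """
    selected = {}
    for path, objects in partitions.items():
        fresh = [key for key, modified in objects.items() if modified > last_run]
        if fresh:  # empty or already-processed partitions are skipped
            selected[path] = sorted(fresh)
    return selected


utc = timezone.utc
last_run = datetime(2020, 4, 17, 3, 0, tzinfo=utc)
partitions = {
    "raw/hour=02": {"a.json": datetime(2020, 4, 17, 2, 45, tzinfo=utc)},
    "raw/hour=03": {"b.json": datetime(2020, 4, 17, 3, 15, tzinfo=utc)},
    "raw/hour=04": {},
}
print(select_new_objects(partitions, last_run))
# {'raw/hour=03': ['b.json']}
```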

The Glue ETL job outputs a manifest file containing a list of processed files per path. The manifest file is stored in the temporary location specified for the job. The default location is s3://aws-glue-temporary-<account_id>-<region>/admin/partitionlisting/<job_name>/<job_run_id>/datasource<n>.input-files.json, and its content looks like this:

[
    {
        "path": "s3://bucket-name/raw/year=2020/month=04/day=15/hour=02",
        "files": []
    },
    {
        "path": "s3://bucket-name/raw/year=2020/month=04/day=17/hour=03",
        "files": [
            "s3://bucket-name/raw/year=2020/month=04/day=17/hour=03/run-1587094849920-part-r-00000",
            "s3://bucket-name/raw/year=2020/month=04/day=17/hour=03/run-1587094849920-part-r-00001",
            "s3://bucket-name/raw/year=2020/month=04/day=17/hour=03/run-1587094849920-part-r-00002",
            "s3://bucket-name/raw/year=2020/month=04/day=17/hour=03/run-1587094849920-part-r-00003"
        ]
    },
    {
        "path": "s3://bucket-name/raw/year=2020/month=04/day=16/hour=21",
        "files": []
    }
]
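
A manifest like this is easy to inspect with a few lines of Python. The sketch below assumes the manifest has already been downloaded and read into a string; processed_files is a hypothetical helper, and the sample is a shortened copy of the manifest above.

```python
import json


def processed_files(manifest_text):
    """Return {path: [files]} for the paths that actually had files processed."""
    entries = json.loads(manifest_text)
    return {entry["path"]: entry["files"] for entry in entries if entry["files"]}


manifest = """
[
  {"path": "s3://bucket-name/raw/year=2020/month=04/day=15/hour=02", "files": []},
  {"path": "s3://bucket-name/raw/year=2020/month=04/day=17/hour=03",
   "files": ["s3://bucket-name/raw/year=2020/month=04/day=17/hour=03/run-1587094849920-part-r-00000"]}
]
"""
for path, files in processed_files(manifest).items():
    print(path, len(files))
# s3://bucket-name/raw/year=2020/month=04/day=17/hour=03 1
```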

Some backfilling scenarios can be supported by using “Rewind Job Bookmark” to return to any previous job run, so that the subsequent job run reprocesses data only from the bookmarked job run.
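
A rewind can also be scripted. The sketch below wraps the Glue reset_job_bookmark call and assumes that passing a RunId corresponds to the console's rewind behavior; rewind_bookmark and the job name in the comment are hypothetical, and glue_client is expected to be a boto3 Glue client created elsewhere.

```python
def rewind_bookmark(glue_client, job_name, run_id=None):
    """Reset the bookmark for job_name, optionally back to the run run_id.

    glue_client is expected to be a boto3 Glue client,
    e.g. boto3.client("glue"). Without run_id the bookmark is fully reset.
    """
    params = {"JobName": job_name}
    if run_id is not None:
        params["RunId"] = run_id
    return glue_client.reset_job_bookmark(**params)


# Example call (hypothetical job name, placeholder run id):
# rewind_bookmark(boto3.client("glue"), "events-etl", run_id="<job_run_id>")
```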

For the job bookmark feature to work as expected with an S3 input source, make sure of the following:

  1. Enable the job bookmark parameter for the Glue job.
  2. Include the transformation context parameter (transformation_ctx) when creating the DynamicFrame. AWS Glue uses transformation_ctx as the key under which the bookmark state is indexed.
  3. Call job.commit() at the end of the Glue ETL script, as it updates the state of the job bookmark.
  4. Review the Glue job logs or the manifest file to ensure that the AWS Glue job is not reprocessing data it already processed in an earlier run.
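
Put together, steps 2 and 3 might look like the following PySpark script skeleton. Treat it as a sketch under assumptions rather than the exact job used here: the S3 paths, the JSON and Parquet formats, the transformation_ctx names, and the s3_source_options helper are all illustrative. The awsglue imports sit inside main because that library is only available in the Glue job runtime.

```python
import sys


def s3_source_options(path):
    """connection_options for an S3 source (hypothetical helper and path)."""
    return {"paths": [path], "recurse": True}


def main(argv):
    # awsglue and pyspark are provided by the Glue runtime, so import lazily.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # initializes the job and its bookmark

    # Step 2: transformation_ctx keys the bookmark state for this source.
    raw = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=s3_source_options("s3://bucket-name/raw/"),
        format="json",
        transformation_ctx="raw_events",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=raw,
        connection_type="s3",
        connection_options={"path": "s3://bucket-name/processed/"},
        format="parquet",
        transformation_ctx="processed_sink",
    )

    # Step 3: commit so the bookmark advances past the data just processed.
    job.commit()


# Glue runs the script top to bottom, so a real job script would end with:
# main(sys.argv)
```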

To conclude, depending on the use case, the AWS Glue job bookmark feature can be very useful: it helps process incremental data from S3 and relational database systems, and it also supports data backfilling scenarios.

Reference:

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
