
How to efficiently read large amounts of data from Athena with AWS Wrangler


Recently I needed to write a job that reads a huge quantity of data from AWS Athena, performs some transformations on it and writes it back to S3. The problem was that the job kept running out of memory, even when using the chunksize parameter of awswrangler.athena.read_sql_query.

After thinking for a while about how to do it, I came across a function called zip_longest from itertools that would ultimately help me achieve the task.
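
In case zip_longest is unfamiliar, here is a quick, self-contained illustration (unrelated to the Athena job itself): it pairs up items from several iterables and, once the shorter one is exhausted, substitutes the given fillvalue instead of stopping.

from itertools import zip_longest

short = [1, 2]
long = [10, 20, 30]

# zip() would stop after two pairs; zip_longest keeps going and
# pads the exhausted iterable with the fillvalue
for a, b in zip_longest(short, long, fillvalue=0):
    print(a, b)
# 1 10
# 2 20
# 0 30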

So, let's start.

Reading data from Athena

First, I needed to read data from two tables that should be merged together after performing some transformations on them.

import awswrangler as wr

# chunksize=True makes read_sql_query return an iterator of DataFrames
# instead of loading the whole result set into memory at once.
# Each table has its own query (defined elsewhere).
table1 = wr.athena.read_sql_query(
    sql=sql_query_table1,
    database=database,
    chunksize=True,
    params=params,
    categories=categories
)

table2 = wr.athena.read_sql_query(
    sql=sql_query_table2,
    database=database,
    chunksize=True,
    params=params,
    categories=categories
)

Working with generators

With chunksize=True, each call returns a generator of DataFrames. The two generators can yield a different number of chunks, and one of them, or both, could even be empty; that's where zip_longest comes into play:

from itertools import zip_longest
import pandas as pd

# Pair the chunks from both tables; when one generator is exhausted,
# substitute an empty DataFrame so the iteration can continue
zip_longest(table1, table2, fillvalue=pd.DataFrame())

Doing this, I was able to iterate over each pair of chunks, even when one of the generators was empty, and perform the transformations on them. Since I was now working chunk by chunk, I could operate on those huge tables without problems. Putting it all together, the loop looks roughly like the sketch below.
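
The following is a minimal sketch of that chunk-by-chunk loop under a few assumptions of mine: the merge_chunks function and the s3://my-bucket/output/ path are hypothetical placeholders, and the write step uses wr.s3.to_parquet in dataset/append mode so each processed chunk is flushed to S3 as soon as it is ready.

import awswrangler as wr
import pandas as pd
from itertools import zip_longest

# Hypothetical transformation/merge step; replace with the real logic
def merge_chunks(chunk1: pd.DataFrame, chunk2: pd.DataFrame) -> pd.DataFrame:
    return pd.concat([chunk1, chunk2], ignore_index=True)

for chunk1, chunk2 in zip_longest(table1, table2, fillvalue=pd.DataFrame()):
    result = merge_chunks(chunk1, chunk2)
    if result.empty:
        continue
    # Write each processed chunk back to S3 right away, so only one
    # chunk per table needs to be held in memory at any time
    wr.s3.to_parquet(
        df=result,
        path="s3://my-bucket/output/",  # hypothetical output location
        dataset=True,
        mode="append",
    )

This way the memory footprint stays bounded by the size of a single chunk from each table, regardless of how large the full result sets are.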

Author: Alejandro Alcalde

Date: 2021-02-27 Sat 00:00

Emacs 26.3 (Org mode 9.3.6)
