How to efficiently read large amounts data from athena with AWS Wrangler

Reading data from athena
Working with generators

Recently I needed to write a job that read a huge quantity of data from AWS Athena, perform some transformations on it and write it back to S3. The problem was even using the parameter chunksize from awswrangler.athena.read_sql_query was not being executed successfuly due to running out of memory.

After thinking for a while how I could do it, I came across a function called zip_longets from itertools that would ultimatelly help me achieve the task.

So, lets start.

Reading data from athena

First, I needed to read data from two tables that should be merged together after performing some transformations on them.

table1 = wr.athena.read_sql_query(
    sql=sql_query,
    database=database,
    chunksize=True,
    params=params,
    categories=categories
)

table2 = wr.athena.read_sql_query(
    sql=sql_query,
    database=database,
    chunksize=True,
    params=params,
    categories=categories
)

Working with generators

This returns two generators, one of them, or both, could be empty, that's where zip_longest comes into play:

zip_longest(table1, table2, fillvalue=pd.DataFrame())

Doing this, I was able to iterate over each chunk of data, even if one of the generators is empty, and perform some tasks on them. Since now I am working chunk by chunk, I was able to operate on those huge tables without problems.

How to efficiently read large amounts data from athena with AWS Wrangler

Table of Contents

Reading data from athena

Working with generators