# How to efficiently read large amounts of data from Athena with AWS Wrangler

Recently I needed to write a job that reads a huge quantity of data from AWS Athena, performs some transformations on it, and writes it back to S3. The problem was that the job kept running out of memory, even when using the chunksize parameter of awswrangler.athena.read_sql_query.

After thinking for a while about how I could do it, I came across a function called zip_longest from itertools that would ultimately help me achieve the task.

So, let's start.

First, I needed to read data from two tables that should be merged together after performing some transformations on them.

```python
import awswrangler as wr

table1 = wr.athena.read_sql_query(
    sql=sql_query,
    database=database,
    chunksize=True,
    params=params,
    categories=categories,
)

table2 = wr.athena.read_sql_query(
    sql=sql_query,
    database=database,
    chunksize=True,
    params=params,
    categories=categories,
)
```


## Working with generators

This returns two generators of DataFrames. They may yield different numbers of chunks, and one of them, or even both, could be empty; that's where zip_longest comes into play:

```python
from itertools import zip_longest

import pandas as pd

zip_longest(table1, table2, fillvalue=pd.DataFrame())
```
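To see how the padding works, here is a minimal, self-contained illustration; plain lists stand in for the chunk generators, and a string stands in for the empty-DataFrame fill value:

```python
from itertools import zip_longest

# Two "tables" that yield a different number of chunks.
a = ["chunk1", "chunk2", "chunk3"]
b = ["chunkA"]

# zip_longest keeps iterating until the LONGEST iterable is
# exhausted, padding the shorter one with fillvalue.
pairs = list(zip_longest(a, b, fillvalue="EMPTY"))
print(pairs)
# → [('chunk1', 'chunkA'), ('chunk2', 'EMPTY'), ('chunk3', 'EMPTY')]
```

The plain zip builtin would stop after the first pair here, silently dropping chunk2 and chunk3; zip_longest is what guarantees every chunk of both tables gets processed.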


Doing this, I was able to iterate over the chunks pairwise, even when one generator was exhausted before the other, and perform my transformations on them. Since I was now working chunk by chunk, I could operate on those huge tables without memory problems.
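Putting the pieces together, the whole loop can be sketched as below. This is a toy, self-contained version: two in-memory generators stand in for the Athena results, the merge on an assumed "id" column is a hypothetical transformation (the original post does not show the actual one), and pd.concat stands in for writing each processed chunk back to S3.

```python
from itertools import zip_longest

import pandas as pd

# Hypothetical stand-ins for the Athena generators: chunked tables
# of different lengths, as read_sql_query(chunksize=True) would
# yield them, one DataFrame per chunk.
table1 = (pd.DataFrame({"id": [i], "a": [i * 2]}) for i in range(3))
table2 = (pd.DataFrame({"id": [i], "b": [i * 10]}) for i in range(2))

processed = []
for chunk1, chunk2 in zip_longest(table1, table2, fillvalue=pd.DataFrame()):
    # Only one chunk of each table is held in memory per iteration.
    # A fill value is an empty DataFrame with no columns, so it is
    # passed through rather than merged.
    if chunk1.empty:
        merged = chunk2
    elif chunk2.empty:
        merged = chunk1
    else:
        merged = chunk1.merge(chunk2, on="id", how="outer")
    # In the real job each merged chunk would be written to S3 here,
    # e.g. with wr.s3.to_parquet; we just collect it for the demo.
    processed.append(merged)

result = pd.concat(processed, ignore_index=True)
print(result)
```

With the toy data above, table1 yields three chunks and table2 only two, so the last iteration receives an empty fill value for chunk2 and the loop still completes without losing table1's third chunk.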

Date: 2021-02-27 Sat 00:00
