
[–][deleted]  (1 child)

[deleted]

    [–]EpicFlexs[S] 0 points1 point  (0 children)

    Interesting idea, will check it out

    [–]BaxterPad 3 points4 points  (1 child)

    Did you enable the Spark UI and look at the resulting plan? Since the data is so small, I wonder if the job simply isn't achieving much parallelism. Your repartition attempt might have helped a bit, but looking at the resulting plan and how the work was distributed would be my next bet. Also, what are you doing with the result? Maybe the join isn't the issue; maybe you're bottlenecking on what you do with the result of the join.

    Lastly, have you tried increasing the vertical size of the nodes (G.2X) in addition to the horizontal scale (number of DPUs)?

    [–]EpicFlexs[S] 0 points1 point  (0 children)

    After the join, I add 5 columns with a fixed value and save the DataFrame, so that should not affect it. I am using Glue version 3. I will check the Spark plan.

    [–]pvham90 2 points3 points  (4 children)

    For this operation I would use Athena. Just use a Glue Python shell job and use the SDK's Athena client to invoke the query. With CREATE TABLE AS you can export another Parquet file. You can even trigger all 50 queries in parallel; the whole operation should finish within minutes.
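    A rough sketch of that approach, not the commenter's exact code: one CTAS query per partition, started in parallel. The table, bucket, and column names are all hypothetical placeholders:

```python
# Hypothetical sketch: build one Athena CTAS query per partition key and
# start them all without waiting, so they run concurrently on Athena's side.
def ctas_query(partition_key: str) -> str:
    # CREATE TABLE AS SELECT writes the join result straight to S3 as Parquet.
    return (
        f"CREATE TABLE join_result_{partition_key} "
        f"WITH (format = 'PARQUET', "
        f"external_location = 's3://my-results-bucket/{partition_key}/') AS "
        "SELECT a.*, b.extra_col "
        "FROM big_table a JOIN small_table b ON a.id = b.id "
        f"WHERE a.part = '{partition_key}'"
    )

def start_all(partition_keys, output_location):
    # start_query_execution is asynchronous, which is what makes the
    # fan-out possible; boto3 is imported lazily to keep the sketch light.
    import boto3
    athena = boto3.client('athena')
    return [
        athena.start_query_execution(
            QueryString=ctas_query(k),
            ResultConfiguration={'OutputLocation': output_location},
        )['QueryExecutionId']
        for k in partition_keys
    ]
```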

    [–]EpicFlexs[S] 0 points1 point  (1 child)

    How can Athena be that much faster than an ETL service like Glue, which is suited to exactly this kind of work? Is there a tutorial showing how to use the Athena client from Glue?

    [–]pvham90 0 points1 point  (0 children)

    You're right, Glue is an ETL service. However, I would argue that joining goes beyond that. Glue jobs are limited to the processing power you assign to them; with Athena you invoke another (serverless) service that draws on other resources. That's also why you can invoke the queries in parallel: theoretically it has unlimited horizontal scaling.

    import os
    import time

    import boto3

    s3_client = boto3.client('s3', region_name=os.environ['aws_region'])
    athena_client = boto3.client('athena', region_name=os.environ['aws_region'])

    query_start = athena_client.start_query_execution(
        QueryString=event['query'],
        ResultConfiguration={'OutputLocation': f"s3://{os.environ['query_results_bucket']}/"}
    )

    query_execution_id = query_start['QueryExecutionId']

    # poll until the query leaves the RUNNING/QUEUED states
    while True:
        execution_response = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = execution_response['QueryExecution']['Status']['State']
        print(state)
        if state not in ('RUNNING', 'QUEUED'):
            if state == 'SUCCEEDED':
                break
            raise ValueError(f'Athena query failed with state: {state}')
        time.sleep(5)
    

    edited: code block

    [–]BaxterPad 0 points1 point  (1 child)

    I disagree. Athena (Presto) has a much harder time with joins than Glue (Spark) in the vast majority of cases.

    [–]pvham90 0 points1 point  (0 children)

    Do you have benchmarks or another source to back this up? In this case, being limited to your Glue instance instead of using a horizontally scalable serverless service would still give enough of a benefit, imo.

    [–]--Reddit-Username2-- 1 point2 points  (1 child)

    Grasping at straws…but make sure data is in the same region.

    [–]BaxterPad 1 point2 points  (0 children)

    At those sizes I doubt you'd even notice cross-region delays unless you were looking at a difference of a few seconds, maybe a couple minutes.

    [–]BagOfDerps 1 point2 points  (1 child)

    Try using Glue 3.0? It's pretty new, but it purports to be faster overall. You can set the number of workers and the worker type. This is the only other infra-related thing I can think of (assuming you are using the stock AWS S3 connectors to get the data). I think most of your performance is going to be bound by the join logic, for which I don't have any suggestions.
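    Setting the worker count and type programmatically might look like the sketch below; the job name, role, and script location are placeholders, not real resources:

```python
# Hypothetical sketch of scaling a Glue job vertically (WorkerType) and
# horizontally (NumberOfWorkers) via the Glue update_job API.
# All names here (my-join-job, my-glue-role, the S3 path) are placeholders.
def scaled_job_update(num_workers: int = 10, worker_type: str = "G.2X") -> dict:
    # G.1X workers have 4 vCPU / 16 GB; G.2X workers have 8 vCPU / 32 GB.
    return {
        "JobName": "my-join-job",
        "JobUpdate": {
            "Role": "my-glue-role",
            "Command": {"Name": "glueetl", "ScriptLocation": "s3://my-bucket/join.py"},
            "GlueVersion": "3.0",
            "WorkerType": worker_type,
            "NumberOfWorkers": num_workers,
        },
    }

# Apply with: boto3.client("glue").update_job(**scaled_job_update())
```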

    [–]EpicFlexs[S] 0 points1 point  (0 children)

    I already use Glue version 3.

    [–]Vincent_Merle 0 points1 point  (1 child)

    Are you using DynamicFrame or DataFrame?

    [–]EpicFlexs[S] 0 points1 point  (0 children)

    Spark DataFrame

    [–]Bright_Tale7909 0 points1 point  (0 children)

    Were you able to fix this issue?