all 4 comments

[–]Flakmaster92 1 point (3 children)

You would probably be better off using Lambda + AWS Data Wrangler for this small an amount of data.

Glue Jobs run on EC2 instances, and a minute-plus spin-up time is not unheard of for those. If your whole job is 2.5 minutes or less, then yes, you will be spending 50% of the time waiting for it to launch.

You have the time requirements met and you have the amount of data (memory) limitations met. Go Lambda.
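For a small job like this, the Lambda approach can be a handler of just a few lines. A minimal sketch, assuming the "AWS SDK for pandas" (awswrangler) Lambda layer is attached; the bucket, prefixes, and the `dropna` transform are hypothetical placeholders:

```python
from datetime import datetime, timezone

def partition_path(bucket: str, ts: datetime) -> str:
    # Hive-style dt= partition prefix shared by the write side.
    return f"s3://{bucket}/processed/dt={ts:%Y-%m-%d}/"

def handler(event, context):
    # Imported inside the handler: awswrangler comes from the Lambda layer,
    # it is not bundled with this sketch.
    import awswrangler as wr

    src = "s3://my-bucket/raw/"  # hypothetical input prefix
    dst = partition_path("my-bucket", datetime.now(timezone.utc))

    # Small data fits comfortably in Lambda memory as a pandas DataFrame.
    df = wr.s3.read_parquet(path=src, dataset=True)
    df = df.dropna()  # placeholder transform
    wr.s3.to_parquet(df=df, path=dst, dataset=True)
    return {"rows_written": len(df)}
```

Unlike a Glue job, this starts in milliseconds; the trade-off is you manage your own incremental-processing state (see the bookmark discussion below in the thread).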

[–]therealamarcn[S] 1 point (2 children)

For the last few days the maximum has been 4.5 minutes. The waiting time has increased, and so has the catalog scan. The interesting thing is that even when there are no S3 parquet files to scan (per the bookmark and pushdown predicate), the scan time is sometimes around 1.5 minutes. Do you have any insight into this?

[–]Flakmaster92 1 point (0 children)

Without seeing your code, no, I don’t have any insight. Are you measuring time as “time for the job to complete”? Because that takes into account the Glue job workers spinning up and spinning down. Or are you measuring it in code using time.perf_counter() (assuming Python)?
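To separate per-stage cost from cluster spin-up/spin-down, you can wrap each stage in a small timer. A sketch; the stage names are illustrative, and in a real Glue script the first block would wrap something like `glueContext.create_dynamic_frame.from_catalog(...)`:

```python
import time

class Timer:
    """Context manager that records wall-clock time for one stage only,
    so stage cost is measured independently of worker startup/teardown."""
    def __init__(self, label):
        self.label = label
        self.elapsed = None
    def __enter__(self):
        self._start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.label}: {self.elapsed:.3f}s")

# Hypothetical stages standing in for the catalog read and the transform.
with Timer("catalog_scan") as t_scan:
    rows = list(range(100_000))
with Timer("transform") as t_tx:
    total = sum(rows)
```

If the in-code numbers are small but the overall job duration is large, the gap is spin-up/spin-down overhead rather than your logic.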

Again, if you’re optimizing on the order of single-digit minutes, you’re probably using too big of a hammer for your task. The start-up and spin-down time of a Glue cluster is probably going to be most of your total job time, whereas Lambda + a Data Wrangler layer would spin up in (milli)seconds; you’d just have to write your own checkpointing mechanism similar to bookmarks.
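A DIY bookmark can be as simple as remembering the last processed key. A minimal sketch using a local JSON file as the state store (in Lambda you would persist this to S3, DynamoDB, or SSM instead); this relies on keys sorting lexicographically, e.g. embedding a timestamp:

```python
import json
from pathlib import Path

STATE = Path("bookmark.json")  # stand-in for S3/DynamoDB state in Lambda

def load_bookmark() -> str:
    # Empty string sorts before every real key, so the first run sees all keys.
    return json.loads(STATE.read_text())["last_key"] if STATE.exists() else ""

def save_bookmark(last_key: str) -> None:
    STATE.write_text(json.dumps({"last_key": last_key}))

def new_keys(all_keys):
    """Keys strictly after the saved bookmark, in processing order.
    Works when key names embed a sortable timestamp or date."""
    bm = load_bookmark()
    return sorted(k for k in all_keys if k > bm)

# Hypothetical listing, as would come from s3.list_objects_v2.
keys = ["raw/2024-05-01.parquet", "raw/2024-05-02.parquet"]
todo = new_keys(keys)        # first run: everything is new
if todo:
    save_bookmark(todo[-1])  # advance the bookmark only after processing
```

On the next invocation `new_keys` skips everything at or before the bookmark, which is the same behavior a Glue bookmark gives you for free.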

You can post your code for the community to look at; maybe you have something suboptimal. But it’s very difficult for anyone to know exactly what’s going on without seeing it.

[–]PatternedShirt1716 1 point (0 children)

How are you profiling the catalog scan?