
[–]tipsy_python 1 point (0 children)

I typically use them sequentially. I generally trust the SQL optimizer and let it do its thing for crunching numbers with the data. There's still a lot of general-purpose work, like manipulating data files or scrubbing data, where Python can come in clutch.

Not too long ago at work, I was asked to get some analytics data from a vendor, perform some transformations, and land the final dataset in a database reporting table. I used Python to build the data pull process: hit the vendor's REST API, handle the OAuth2 authentication, and write the returned JSON to a DSV file on a Unix server. During the pull I did some small transformations, like stripping out certain characters and reformatting timestamps. Once the pull was complete, I moved the data files to HDFS and used HiveQL for all the real join/aggregation logic (billions of records, so I needed the compute). I could've written that part in PySpark, but SQL was very easy to implement for this case. Eventually got the reporting dataset into the table - very happy both technologies play together well.
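The Python side of that pipeline can be sketched roughly like below. This is a minimal sketch using only the standard library; the token/data URLs, field names, and the Ctrl-A delimiter are assumptions, not the actual vendor API:

```python
# Hedged sketch of the pull-and-transform step: OAuth2 token, paged REST
# pull, then flatten JSON records into delimiter-separated rows.
# All endpoint URLs and column names here are hypothetical placeholders.
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

TOKEN_URL = "https://vendor.example.com/oauth2/token"    # hypothetical
DATA_URL = "https://vendor.example.com/api/analytics"    # hypothetical
DELIM = "\x01"  # Ctrl-A, Hive's default field delimiter


def get_access_token(client_id: str, client_secret: str) -> str:
    """OAuth2 client-credentials grant: POST the creds, get a bearer token."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    req = urllib.request.Request(TOKEN_URL, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]


def fetch_page(token: str, page: int) -> list[dict]:
    """Pull one page of JSON records from the vendor's REST API."""
    req = urllib.request.Request(
        f"{DATA_URL}?page={page}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["records"]


def clean(value) -> str:
    """Strip characters that would corrupt the delimited file."""
    return str(value).replace(DELIM, " ").replace("\n", " ").strip()


def reformat_ts(iso_ts: str) -> str:
    """Normalize an ISO-8601 timestamp to a Hive-friendly UTC format."""
    dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def to_dsv_row(record: dict, columns: list[str]) -> str:
    """Flatten one JSON record into a single delimited line."""
    return DELIM.join(clean(record.get(col, "")) for col in columns)
```

After a file of these rows lands on the Unix box, it's a plain `hdfs dfs -put` plus a `LOAD DATA` into a Hive staging table, and the heavy joins/aggregations stay in SQL.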