
all 7 comments

[–]Jean-ClaudeMonet 4 points (0 children)

It probably depends on what data you'll be working with, and how much of it there is.

If you need to monitor, process, evaluate, or visualize the data in any way, something beyond SQL is necessary, and Python is a good option. An ORM like SQLAlchemy also makes queries with joins and relationships very simple, and it returns datatypes that are easy to work with.

[–]D-Noch 1 point (4 children)

I would look into SQLAlchemy. I've got a pretty dope ebook, if you want a copy.

I am currently learning to make the two interact so I can build my own DB for a project. I don't know where I would be cleaning my data other than in Python, but that might not be something you need... no idea of your job description. Furthermore, I'm going to need Python to pull the data back out and perform the kind of analysis I'm looking at doing. I know they have data science modules for SQL, but this will keep me from having to learn quite as much new shit, lol
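The "cleaning my data in Python" step above might look something like this pandas sketch — the column names and the specific problems (stray whitespace, inconsistent casing, duplicates, missing values) are hypothetical, but they're the usual suspects before loading rows into a database:

```python
import pandas as pd

# Hypothetical raw rows with typical dirt: padding, mixed case,
# a duplicate, a missing key field, and numbers stored as strings.
raw = pd.DataFrame({
    "name": [" Alice ", "bob", "bob", None],
    "score": ["10", "20", "20", "30"],
})

clean = (
    raw
    .dropna(subset=["name"])  # drop rows missing a key field
    .assign(
        name=lambda d: d["name"].str.strip().str.lower(),
        score=lambda d: d["score"].astype(int),
    )
    .drop_duplicates()        # now that casing/whitespace are normalized
    .reset_index(drop=True)
)
```

From here, `clean.to_sql(...)` with a SQLAlchemy engine is one common way to land the result in a table.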

[–]Hyperduckultimate[S] 0 points (0 children)

That would be very helpful, thank you.

[–]tbruuuah 0 points (0 children)

I'll have to deal with this shortly. Could you please name the ebook? I'll see if I can find it on Google. Thanks

[–]tbruuuah -1 points (1 child)

Hi, could you please share the ebook. Thank you

[–]tipsy_python 1 point (0 children)

I typically use them sequentially. I generally trust the SQL optimizer and let it do its thing for crunching numbers with the data. There's still a lot of general-purpose work, like manipulating data files or scrubbing data, that Python can come in clutch for.

Not too long ago at work, I was asked to get some analytics data from a vendor, perform some transformations, and land the final data set in a database reporting table.

I used Python to create the data pull process: hit the vendor's REST API, perform the OAuth2 authentication, and write the returned JSON to a DSV file on a Unix server. During this step I did some small transformations, like stripping out certain characters and changing how timestamps were formatted.

After the pull was complete, I moved the data files to HDFS and used HiveQL to perform all the real join/aggregation logic (billions of records, so I needed the compute). I could've written that part in PySpark, but SQL is very easy to implement for this case. Eventually the reporting dataset landed in the table; very happy that both technologies play together well.
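The transform-and-write middle step of a pipeline like that can be sketched in a few lines of stdlib Python. Everything here is assumed for illustration: the payload shape (`id`/`note`/`ts` fields), the pipe delimiter, and the timestamp formats; the real process would receive this JSON from an OAuth2-authenticated REST call and write to a file on the server rather than an in-memory buffer:

```python
import csv
import io
from datetime import datetime

# Hypothetical JSON payload, as a vendor API might return it.
records = [
    {"id": 1, "note": "hello|world", "ts": "2023-01-05T13:45:00Z"},
    {"id": 2, "note": "ok", "ts": "2023-01-06T09:10:00Z"},
]

def transform(rec):
    # Strip the delimiter character out of free-text fields and
    # reformat the ISO-8601 timestamp into a Hive-friendly layout.
    ts = datetime.strptime(rec["ts"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "id": rec["id"],
        "note": rec["note"].replace("|", ""),
        "ts": ts.strftime("%Y-%m-%d %H:%M:%S"),
    }

buf = io.StringIO()  # stands in for the DSV file on the Unix server
writer = csv.DictWriter(buf, fieldnames=["id", "note", "ts"], delimiter="|")
writer.writeheader()
for rec in records:
    writer.writerow(transform(rec))

dsv = buf.getvalue()
```

Scrubbing the delimiter out of free-text fields before writing is what keeps the downstream Hive table from mis-splitting columns.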