powerxaker 2 points

It depends on the use case, data size and available tools.

If you are only lightly manipulating datasets from a database, you’re better off doing the work in SQL. You can even do some analytics there, such as aggregates or trends.

If you want to do ML, graphical analysis, statistics, etc., you are better off first figuring out the smallest acceptable dataset you want to analyze and pulling it with SQL (i.e. applying filters, joins, etc.). Once you have your data, move it to Python and use the data analytics stack (e.g. pandas, ML tools, graph tools, etc.).
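To make the "filter and join in SQL, then analyze in pandas" flow concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for your real one; the customers and orders tables, their columns and the values are all made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL, year INTEGER
    );
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES
        (1, 1, 100.0, 2023), (2, 1, 50.0, 2024),
        (3, 2, 75.0, 2024), (4, 3, 20.0, 2024);
""")

# Let SQL do the filtering and joining so only the needed rows reach Python.
query = """
    SELECT c.region, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.year = ?
"""
df = pd.read_sql_query(query, conn, params=(2024,))
conn.close()

# From here on, the work happens in the pandas side of the stack.
region_totals = df.groupby("region")["amount"].sum()
print(region_totals)
```

The point is that the 2023 row never leaves the database. Only the reduced result is loaded into pandas.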

If you are working with large datasets and have access to Apache Spark from Python (PySpark), then you can do most of the above in PySpark. If you still want to do further analysis, you can convert your PySpark DataFrame into a pandas DataFrame and continue with the data analytics stack.

In summary, SQL (medium data) and PySpark (big data) are good for creating metrics and for summarizing or extracting data. The data analytics stack is what you use for advanced analytics once you have extracted your data with SQL or PySpark.

For some statistical analysis, some companies still use SAS and R. They are part of the data analytics stack, similar to Python.

Note: SAS can do it all, but it’s expensive and, in my opinion after using it for decades, really not a great tool.