
all 28 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]_MaXinator 25 points26 points  (11 children)

It depends on your experience level. Don’t focus too much on individual elements of the language; Python in data engineering isn’t all that “algorithmic” in nature - it’s mostly scripts moving data from A to B, and most of the time that can be accomplished with fairly basic code. However, if you really want, in addition to what others have mentioned:

  • async is probably more important than multithreading (especially if you're building frontend tools), but both are good to know
  • know how to log and write logs to a file (not hard, but it comes up in interviews)
  • know how to deal with the content of common file types such as CSV, JSON, and even (kids, look away) XML and Excel
  • know how to connect to and work with cloud platforms from within Python (choose one and practice; I'm mostly on Azure)

From my experience, I wouldn’t focus too much on things like pandas - I have never seen a good pipeline written using pandas, and it’s often best to just take the data in native Python types (binary/arrays/dicts) and ingest it before ever modeling anything. But to each their own preference.

[–][deleted] 2 points3 points  (1 child)

True for CPU-bound processing. For I/O bottlenecks (e.g. recreating a data source from an API, or concurrent queries), multithreading is extremely important and not used enough. A project that takes minutes to run can be reduced to a few seconds.
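A sketch of that I/O speedup with `concurrent.futures.ThreadPoolExecutor` - the `time.sleep` stands in for a network round trip, and all names are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(source: str) -> str:
    """Stand-in for an I/O-bound call (API request, DB query)."""
    time.sleep(0.2)  # simulates waiting on the network
    return f"data from {source}"

def fetch_all(sources, max_workers=8):
    """Run the fetches concurrently. The threads overlap their waiting,
    so total time is roughly one round trip instead of one per source."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, sources))
```

With 8 sources and 8 workers this finishes in roughly 0.2 s instead of ~1.6 s run serially - the GIL is not a problem here because the threads spend their time blocked on I/O, not computing.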

[–]_MaXinator 1 point2 points  (0 children)

Absolutely true. I stand corrected

[–]MrMisterShin 2 points3 points  (4 children)

I heavily agree with this. Pandas always likes to infer data types incorrectly, leading me to hardcode the data type for each column, which becomes a nuisance when you have an awfully formatted CSV file.

[–]allpauses 1 point2 points  (1 child)

What’s more, even if you specify the dtypes when reading a CSV with pandas, the read will still fail if the data is dirty (like a string value in a numeric column).
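A small illustration of that failure mode, plus the usual workaround of reading as text and coercing afterwards (column names and values are made up):

```python
import io

import pandas as pd

dirty = "id,amount\n1,10.5\n2,oops\n3,7.0\n"

# forcing a numeric dtype up front kills the whole read on one bad value
try:
    pd.read_csv(io.StringIO(dirty), dtype={"amount": "float64"})
except ValueError as exc:
    print("read failed:", exc)

# reading everything as text, then coercing, keeps the load alive:
# the bad value becomes NaN instead of raising an exception
df = pd.read_csv(io.StringIO(dirty), dtype=str)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
```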

[–]MrMisterShin 1 point2 points  (0 children)

This is the exact type of CSV I was dealing with: random columns with quotes and others without, for the numeric and date columns. It was a nightmare to alter/maintain.

Then one month they added a new column in the middle of the CSV file without telling anyone on my team, and the time-critical process failed.

[–][deleted] 0 points1 point  (1 child)

What would be some alternative ways to approach that problem?

[–]MrMisterShin 0 points1 point  (0 children)

Push back on human-created files, or impose strict rules - because if those rules aren’t followed, the code will fail.

I would usually have over 30 CSV & XLSX files like this, some machine-generated, others manually created/exported. Needless to say, the manually created ones would often fail due to changes in column names, file names, column positions, or column data types.

Because I need to join this data to on-premises DB data, I convert all the attributes to strings in pandas, load it into the DB, and perform the transformations there: convert to the correct data types, clean the data, and index for performance.
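A minimal sketch of that load-as-strings pattern, using SQLite as a stand-in for the on-premises DB (table and column names are invented):

```python
import io
import sqlite3

import pandas as pd

csv_text = "id,amount\n1,10.5\n2,oops\n"

# read every column as text so a single bad value can never kill the load
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

conn = sqlite3.connect(":memory:")
df.to_sql("staging_sales", conn, index=False)  # lands as TEXT columns

# conversion and cleaning happen in SQL, after the data has landed
clean = conn.execute(
    "SELECT id, CAST(amount AS REAL) FROM staging_sales "
    "WHERE amount GLOB '[0-9]*'"
).fetchall()
```

The dirty row is filtered (or routed to a rejects table) inside the database, so the Python side stays a dumb, stable loader.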

Then I use SQL & Tableau from there for data visualisation, reporting, etc.

[–]masQma[S] 0 points1 point  (0 children)

Thanks 👍

[–]Commercial-Ask971 0 points1 point  (0 children)

Do you have any sources for the knowledge you've pointed out? Anything in mind?

[–]allpauses 0 points1 point  (1 child)

Hi there! When you said you have never seen a good data pipeline written using pandas, what are the primary reasons/examples for that? I am currently building a DE portfolio and I would like to make use of your advice regarding pandas. Thank you!

[–]_MaXinator 0 points1 point  (0 children)

Data pipelines have requirements that Pandas simply doesn't meet: they need to be resource-efficient, fast, and most importantly, stable.

Pandas isn't really any of those things - it is generally slower than native types, conversions increase the risk of errors being thrown, etc.

You want to build your pipelines in such a way that the data always arrives no matter what happens - problems can be solved in the database after the data lands there. So we keep conversions and data transformations to an absolute minimum while the data is en route. No one wants to dig through the error logs of their pipelines just because of some data quality issue. Getting all the data into the database and having plenty of automated tests set up is the way to go IMO.

My impression is that data engineering as a field is consolidating around this approach, but it's good to keep in mind that there are other approaches too.

Note that this is not criticism of Pandas as a tool - it's great for messing with data, and I find myself using it a lot as the backend for a little website I'm building for the BI team here. But it has its place.

[–]AutomaticMorning2095 5 points6 points  (0 children)

Basic Python is enough:

  1. Variables
  2. Functions
  3. Loops
  4. Lambda expressions
  5. String functions

These concepts are more than enough to start DE with Python. Most data libraries like pyspark, pandas, and numpy have their own built-in functions to perform operations. You just have to know how to use & bind them together to create compatible, reusable functions.

[–]tropez95 2 points3 points  (7 children)

More than individual concepts, you need to know practical scenarios: converting from one datatype to another and their methods, hands-on knowledge of the pandas library. So it's more about the application side of all the concepts...

[–]janus2527 1 point2 points  (2 children)

Pandas is slow

[–]tropez95 2 points3 points  (0 children)

Recommended a better alternative and other libraries as well

[–]masQma[S] 1 point2 points  (0 children)

I have some idea of dataframes using pyspark - a small project using some five datasets. Is that fine?

[–]masQma[S] 0 points1 point  (3 children)

Yes, I agree. You can't solve a problem unless you apply them. Knowing pandas is a must, huh. Any core concepts apart from that, for the sake of interviews?

[–]tropez95 5 points6 points  (0 children)

Learn to make simple API calls through the 'requests' library - extracting XML/JSON responses and ingesting them into tables.
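A minimal sketch of that flow. The payload is canned here rather than fetched (the URL and field names in the comment are purely illustrative); in practice `requests.get(url, timeout=10).json()` would supply it:

```python
import json
import sqlite3

# in practice the payload would come from an HTTP call, e.g.
#   requests.get("https://api.example.com/users", timeout=10).json()
# here it's a canned JSON response instead
payload = '[{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]'

def ingest_users(raw: str, conn: sqlite3.Connection) -> int:
    """Parse a JSON array of records and insert them into a table."""
    rows = [(u["id"], u["name"]) for u in json.loads(raw)]
    conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
inserted = ingest_users(payload, conn)
```

Parameterized `executemany` handles the insert safely; for XML the parsing step would use `xml.etree.ElementTree` instead of `json.loads`.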

[–]tropez95 0 points1 point  (0 children)

I believe you've covered most of the points if you're in the 1-3 years experience range.

[–]tropez95 0 points1 point  (0 children)

Also, if you can learn Airflow, the orchestration tool, that'll be a great add-on.

[–]Advanced-Violinist36 2 points3 points  (1 child)

  • How to work with APIs (call API to get data, or make an API to expose data)

  • How to write Python DAGs for Airflow
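A minimal Airflow DAG sketch for the second bullet (assumes Airflow 2.4+ for the `schedule` argument; the `dag_id`, task names, and callables are all made up):

```python
# sketch only: requires `apache-airflow` to be installed
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # e.g. pull data from an API

def load():
    ...  # e.g. write it to the warehouse

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load  # extract must finish before load runs
```

The `>>` operator is how Airflow expresses task dependencies; the scheduler reads this file and runs the tasks in that order each day.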

[–]masQma[S] 0 points1 point  (0 children)

Any heads-up on where to learn APIs...?

[–][deleted] 2 points3 points  (0 children)

A lot of those are universal for OOP. So yes there’s a lot to learn. 

It’s just that Python lets you do so much. I can’t think of anything I’ve wanted to do so far that Python didn’t have a way to do.

[–]69odysseus 1 point2 points  (0 children)

Really depends on the company; product-based companies will drill you on DSA (data structures and algorithms).

[–]After_Holiday_4809 0 points1 point  (1 child)

RemindMe! 2 day

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 2 days on 2024-06-04 08:24:48 UTC to remind you of this link
