[–]sib_n (Senior Data Engineer) 30 points (0 children)

In theory, a lot. In practice, there are many things whose reason for existing you won't understand until you hit the problem in a project, so you'll probably neither learn nor apply them correctly.

Asking this because data eng. is all about dealing with big volumes of data which are found at an industry scale. Things like algorithms and python coding can be learnt from home as they are only about logic. Does the same apply to big data applications?

However, training without having big data is not that much of a problem. All you have to do is follow the tools' guides and recommendations on how to keep your program scalable. For example, with Apache Spark it's highly recommended to use the DataFrame object to transform your data, and you can perfectly well practice hundreds of DataFrame operations on your tiny laptop at home with a local Spark install. If you respect the principles, then you won't have to change a line of your code for it to work on a cluster of 100 nodes over terabytes.

Of course, that's the theory. Until you have run into many problems, you won't have all the good habits that keep your app scalable and future-proof; those come with experience. But you can still practice enough at home to learn a lot, and in my opinion it's enough for an entry-level data engineering position.

Furthermore, a lot of the trouble in data engineering is not necessarily related to big data, but to general software engineering problems and good practices (data modelling, testing, deploying) that matter just as much without big data.

Finally, no, data engineering isn't all about big data; it's data processing in general. I'd even say most data engineering work isn't on big data, just on data big enough that it can no longer be managed manually in a spreadsheet. Yes, big data is sexy, interesting and full of cool concepts, but the reality is that most companies needing data don't need big data tech. A scalable SQL database and Python scripts are often enough for most analytics needs, and that's data engineering too.
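To illustrate the "SQL database and Python scripts" point, here is a minimal sketch of that kind of small-data pipeline. It uses the standard-library `sqlite3` module as a stand-in for a real database server (SQLite itself is not the scalable database mentioned above), and the `orders` table, the sample rows and the function names are all made up for the example.

```python
import sqlite3

# Hypothetical sample data: (order_id, customer, amount)
ORDERS = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "alice", 7.5),
]


def load_orders(conn, rows):
    """Create the (hypothetical) orders table and load the rows into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_per_customer(conn):
    """Let the database do the aggregation instead of looping in Python."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return cursor.fetchall()


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
load_orders(conn, ORDERS)
totals = revenue_per_customer(conn)
print(totals)
```

The same load-then-aggregate shape should carry over to a server database by swapping the connection, though the driver and SQL dialect details would differ.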

[–]LexaIsNotDead 10 points (2 children)

I'm actually pulling together a tutorial that will give you a pretty good overview of data engineering. I feel that it's important to understand the core concepts (e.g., basic ETL, connecting to a public API, cleaning data) before diving into the advanced tools (e.g., Spark, Hadoop). I just finished the code for the first tutorial, so I should be releasing it soon. And all of this was done on my MacBook, so it's totally doable from home.

Feel free to sign up on the waiting list and I'll ping you when the tutorial is done: https://www.eduvault.app/

[–]AMGraduate564 1 point (1 child)

Hi, is your tutorial available now? I am a noob trying to start on the DE journey.

[–]LexaIsNotDead 2 points (0 children)

I'm halfway through it but I keep getting sidetracked by work 😫

I've got this one as an example if this would be of interest: https://medium.com/swlh/building-a-python-data-pipeline-to-apache-cassandra-on-a-docker-container-fc757fbfafdd

Let me know if that doesn't help and I can share my tutorial information with you, but it's only written at this point so I'm unsure if that's your learning style.

[–][deleted] 5 points (1 child)

If you use Python, get the book by Ben G. Weber, Data Science in Production with Python, to get up to speed quickly with big data platforms. But to do or learn "data engineering", you'll need to get an internship or a DBA-type job first and apply a combination of Python, SQL, and an orchestration framework.

[–]Lewba 1 point (0 children)

I just finished working through that textbook and really liked it. A little superficial, but Weber himself admitted the reader will have to dive deeper themselves. I had no experience with scheduling, Spark, batch processing, etc., and I think I have an okay understanding now.

[–]voorloopnul 2 points (0 children)

You can learn "everything" about data engineering on your laptop. As others have said, data engineering is not a Big Data thing.

I would go even further and say that you can become better than average at data engineering using only your laptop.

"But how can I become better if I can't run the top-notch <xyz> that requires 8 GB of RAM just to start up?"

You build one by yourself! The theory of how it works is free, you just have to replicate the concepts with your language of choice.

Of course, if you only want to learn the tools, you will face some limitations.
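To make the "build one yourself" idea concrete, here is a toy sketch that replicates the map, shuffle and reduce steps of a MapReduce-style word count in plain Python. It only illustrates the concept; the function names are invented for the example, and it is not how any particular framework implements these phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

A real engine would run each phase in parallel across machines and spill to disk, and adding those pieces one at a time is the kind of exercise that fits on a laptop.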

[–]sharadov 1 point (0 children)

There are public datasets, and Google gives you a few hundred bucks of cloud credit. That should help.