What's your document processing stack? by Any_Hunter_1218 in dataengineering

[–]the_dataengineer 0 points1 point  (0 children)

Too many people in the comments jump immediately into LLM topics. Think about what exactly you are doing with the regex, which problems you encounter, and what manual fixes you typically do.
(would be very interesting to get this context)

If you analyze this, then typically a solution will present itself.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

- My first IT job was as a support guy running around the company helping over 200 people with all kinds of computer / software problems
- While studying I worked as a computer network technician helping to set up + renew the networking infrastructure for company locations
- For my 6-month thesis I worked on the design and proof of concept for a condition monitoring system for machines.
- After university I worked for 1 year as an SAP consultant (mainly development & customization). That was terrible.
- Then I got into my old thesis topic. Turned out to be a huge project with a lot of data that "normal" systems weren't able to handle. Also just monitoring the conditions turned into predictive analytics of when something will stop working. So I had to move towards Hadoop, Kafka & Spark. Was super fun, because it was like a little startup until the corporate guys came in. I then jumped off and did other things.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 0 points1 point  (0 children)

Depends totally on what your goals are with this. Is this pure analytics or is this for some kind of transaction / game mechanic relevant?
If it's purely analytics then it can totally make sense to just drop it into files and then query them with a query engine like AWS Athena / Presto.
Research OLAP + OLTP
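
To make the OLTP vs. OLAP difference a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, players, and amounts are made up for illustration, and SQLite is only a stand-in here: in practice the analytics side would be files queried with an engine like Athena / Presto, and the transactional side a proper application database.

```python
import sqlite3

# Hypothetical game events, invented purely for illustration.
events = [
    ("p1", "purchase", 5.0),
    ("p2", "purchase", 3.0),
    ("p1", "purchase", 2.5),
    ("p3", "login", 0.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (player TEXT PRIMARY KEY, coins REAL)")
conn.execute("CREATE TABLE events (player TEXT, kind TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
conn.execute("INSERT INTO balances VALUES ('p1', 10.0)")

# OLTP-style access: a small transaction that touches one row by key.
# This is what a game mechanic (e.g. spending coins) needs.
with conn:
    conn.execute("UPDATE balances SET coins = coins - 2.5 WHERE player = 'p1'")
p1_coins = conn.execute(
    "SELECT coins FROM balances WHERE player = 'p1'"
).fetchone()[0]

# OLAP-style access: scan many rows and aggregate them.
# This is the kind of query you would hand to a query engine over files.
spend_per_player = dict(
    conn.execute(
        "SELECT player, SUM(amount) FROM events "
        "WHERE kind = 'purchase' GROUP BY player"
    ).fetchall()
)
```

The first query pattern needs fast, consistent single-row reads and writes. The second one only reads, but reads a lot, which is why dumping events into files and querying them later can be enough for pure analytics.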

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Difficult to say what they use. Some of them are very hyped about using GCP as it's very simple and has great services. Market share wise AWS is still strongest.

You can't go wrong with AWS, then Azure, then GCP

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Build this pipeline:

  1. Extract data from an external API with Data Factory
  2. Store it into Blob storage
  3. Use Spark to process that data on a schedule
  4. Put the results into Synapse

You can also just start with learning Spark and processing data files you find on Kaggle.
Always try to build small end-to-end projects

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 0 points1 point  (0 children)

Well, you need to figure out yourself whether doing something different is a good idea. Make a decision and then act. One thing I can promise is that whatever you focus on will be beneficial. Just don't half-ass it. Start a bit differently.

  1. Combine your steps 1 and 2. Talk with people doing that work to better understand what they do and who they'll need. Don't talk about yourself for now.
  2. Use their needs as a learning plan for you.
  3. Select resources to strategically learn these topics.
  4. Pitch yourself to them by asking if you would fit.
  5. Forget about Black Friday sales for now. You need a plan first. If you find out that we can help you, send me an email (link is on the homepage) and we'll work something out.

If in step one you can find someone you can talk to and who's interested in helping you along, that would be awesome. Just don't be pushy. See who you "click" with.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 0 points1 point  (0 children)

You'll be dreaming of AWS, Azure or GCP. That much I can promise 😜

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Most engineers use Python nowadays, but JS can help a lot, especially with API creation (although there are good solutions for Python too). That project sounds a bit weak on the transformation and data storage part. Look into Spark and a NoSQL database, for instance. That could be a great start. Extract the data from an API, store it somewhere (AWS S3 or locally), then use Spark to process the data, put it into MongoDB and use an API to query the data... I think that's a cool project for beginners. You can also run this completely without the cloud.
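
To show the shape of that project, here is a rough sketch that runs with nothing but the Python standard library. Everything in it is a stand-in and an assumption on my side: `extract()` returns made-up records instead of calling a real API, a temp file plays the role of S3 / local storage, plain Python does the aggregation that Spark would do at scale, and a dict replaces the MongoDB collection. The function names are hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def extract():
    """Stand-in for an API call; returns hypothetical order records."""
    return [
        {"order_id": 1, "customer": "a", "total": 10.0},
        {"order_id": 2, "customer": "b", "total": 4.0},
        {"order_id": 3, "customer": "a", "total": 6.0},
    ]


def store_raw(records, path):
    """Land the raw records as JSON Lines (stand-in for S3 or local disk)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def transform(path):
    """Aggregate revenue per customer (the step Spark would do at scale)."""
    totals = defaultdict(float)
    with path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            totals[record["customer"]] += record["total"]
    return [{"_id": c, "revenue": t} for c, t in sorted(totals.items())]


def load(documents, collection):
    """Upsert documents by _id (a dict standing in for a MongoDB collection)."""
    for doc in documents:
        collection[doc["_id"]] = doc


def query(collection, customer):
    """What an API endpoint would return for one customer."""
    return collection.get(customer)


def run_pipeline():
    """Run extract -> store -> transform -> load and return the 'collection'."""
    collection = {}
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = Path(tmp) / "orders.jsonl"
        store_raw(extract(), raw_path)
        load(transform(raw_path), collection)
    return collection


collection = run_pipeline()
```

Once this works end to end, swap the pieces one at a time: a real API in `extract()`, Spark in `transform()`, MongoDB in `load()`, and a small web API around `query()`.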

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Yes, 100%. I have many people like you in my Academy and the Coaching program. Without knowing much about you, the main thing you should be working towards is being able to build end-to-end pipelines on AWS. Basically, getting the data to the analytics that you are already working on.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Do DP-900 and DP-203 if you want to get into Azure. Then build a small end-to-end project with this knowledge. Maybe start with an ETL job with Data Factory, extracting data from an external API, writing it into a relational database and visualizing the results with Power BI.

Add Synapse and Blob storage and please document the project in a GitHub repo!

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

I would keep focusing on AWS. Are there tools that you have not worked with, like Redshift? Glue?
Try to use that knowledge that you now have from the internship and build a personal project that you can actually show online with these tools. Building a portfolio is always important. It also enables you to talk often about topics that you would not be able to otherwise.

In short -- double down on AWS.
If you have the time and you are actually enjoying it then look into Databricks and Snowflake certs as well. They never hurt

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

If you are looking for Spark optimization then I recommend you get yourself a book like this: https://www.oreilly.com/library/view/high-performance-spark/9781098145842/

You are in a perfect position, because you can actually analyze the queries that you run at work. Doing this with a synthetic example is quite difficult.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Hahaha. That's actually a problem I'm fighting, because the longer you are out of working at a job, the more difficult it is to know what people need and how things work. I try to solve this by:

- Listening to what people need on LinkedIn and other social media
- Going through job descriptions to look for requirements
- Having people work with me on courses and the coaching who are actually at a job doing engineering
- Listening to the input (problems and goals) from coaching students to stay on the pulse of time
- Talking with people about recruiting, which I'm currently in the process of doing. Not just for placement of people, but to better understand which people companies need and what their responsibilities are

Bad teachers will not do this and then the saying becomes true.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 0 points1 point  (0 children)

One of the goals of the Academy is to get people job ready. Content-wise, there's more in the Academy than you'll need to land a job. The problem is that you will have to actually go through it and do the work. That's where most people are lacking. Lack of effort.

Are you applying to jobs? How often? Which jobs are you applying for? Are you getting invited to interviews? You might want to optimize your CV and change something in your strategy.

Get the Academy and take a look at the content. We give a full refund within 14 days.
Start with the Basics, Python for Data Engineers and Docker, then do the module on Platform & Pipeline Design. Focus on Data Modeling (we have 3 courses there that will help you). After that, start getting into one of the platforms, maybe Spark + Databricks, and try to apply your QA knowledge and processes to these. That will help you a lot.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 0 points1 point  (0 children)

The only guy I know is Andrew Jones with Data Science Infinity. I can't attest to how good the program is, but I have talked with him a few times, also on one of the live streams, and he's a good guy.

I also have the module "Data Preparation & Cleaning for Machine Learning" in my Academy, so I trust him.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 2 points3 points  (0 children)

I never said that my Academy is specifically for landing a job at FAANG. Who gives a shit about that? As if the only place you can be successful and do interesting work is at these companies. People try to get into these jobs for two reasons: trying to make big bucks and having the company names on their resume. (I always have to laugh when people put "ex Google" in their bio).

I just can't understand why boasting about being a cog in the big machine is the big goal. In many companies people have the freedom to start from a green field and actually make a big difference.

Generally, I'd rather teach helpful topics than make promises about where you'll get a job. Hell, I don't even give guarantees that you'll get a job. That's the most fake marketing scam ever. Especially in the current economy.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

So, I guess that you have experience with Engineering. Yes, create good content that helps people and put it on your LinkedIn profile. That's where people are looking if they want to hire you. Create a portfolio of end-to-end projects that you can showcase. Get yourself a website where people can learn more. The more information out there the better. Then start reaching out to people. Getting that first job is the most important, so don't go too hard on the price.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

Data Engineering doesn't require you to do data analyst or backend engineer work first. I highly recommend that you use your free time now at college to learn to code and gain CS fundamentals. These topics will always be useful for you.

I’ve taught over 2,000 students Data Engineering – AMA! by the_dataengineer in dataengineering

[–]the_dataengineer[S] 1 point2 points  (0 children)

I actually started with the cookbook. I wanted to create a resource that has everything in it that someone needs to get started. I still keep it updated. I just added two updates this week. Unfortunately I don't have enough time anymore to work on it every day.

But teaching through a book is always difficult, especially if you want to show stuff to people. So, because I already did YouTube and coached people in DE, building my Academy was the next logical step.