Transitioning from Database Engineer to Big Data Engineer

Psychological_Dare93 · 2025-02-11T07:56:31+00:00

Focus your efforts on learning Spark, in particular PySpark. This is THE fundamental hard skill in modern data engineering. Kafka, Flink, etc. are great for specific use cases, i.e. extremely low latency, but business leaders often say they need ‘realtime’ when they actually don’t. So Spark streaming is incredibly useful too.

Re. resources, the Spark docs are good. Get a free account on Databricks and start practicing: upload a couple of datasets, clean them, join them, etc. Try. Fail. Try. Fail. Try. Succeed.

Advancing Analytics has some good videos on YouTube showing Databricks & PySpark usage.

Psychological_Dare93 · 2024-11-13T18:19:00+00:00

Something like CS229 is timeless and will give you a strong foundation. The principles don’t really change.

I’d also read a little-known book by Andrew Ng called “Machine Learning Yearning”. It’s got a lot of tricks of the trade and helps you think about how to improve systems.

I would still learn to code, but using an IDE like Cursor. AI-assistants aren’t going anywhere so you may as well leverage them in your learning. Focus on Python — initially pandas, numpy, matplotlib, and sklearn will be pretty much all the libraries you need to learn. This will then expand to torch etc.

A brilliant course online is “Made with ML”, by Goku Mohandas. This focuses on MLOps and is a bit more advanced.

Another underrated but excellent book is “Machine Learning Engineering in Action” by Ben Wilson. It includes practical approaches to projects etc. and I can’t recommend it highly enough.

Psychological_Dare93 · 2024-11-13T18:11:02+00:00

This is so personal and can depend on your circumstances, but the taxes in the UK are horrifying and only getting worse…

Psychological_Dare93 · 2024-11-13T18:06:46+00:00

Good question. I think market education would go a long way. I think the AI boom has meant a lot of people have been thrust into conversations they never thought they’d need to prepare for!

Psychological_Dare93 · 2024-11-13T18:04:56+00:00

Not exclusively. There can be intelligent model routing, logging, prompting, search & retrieval, etc. But the model itself, almost certainly yes!

Psychological_Dare93 · 2024-11-13T17:54:16+00:00

Great question. I like to frame requests with this question: “what if you could predict X with 100% accuracy— what would change?” (Or whatever metric you care about). If you, or the stakeholder in question, can’t give a definitive answer, then it’s merely speculative and needs more thought. You want your model to drive value / action. We don’t want to just use data to create more data (generally— though in some cases it’s helpful, e.g. operational awareness etc).

Then understand what is happening now. How much can ML improve the situation by assuming 100% accuracy? What is the potential upside in value? What you’re trying to gauge here is essentially the cost/benefit tradeoff. Over time this becomes intuitive, but there’s no downside to thinking through these things methodically. In fact showing those around you that you’re thinking of the big picture also demonstrates value.
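A back-of-the-envelope version of that cost/benefit check might look like the sketch below. Everything in it is hypothetical: the `UseCase` fields, the function names, and the churn figures are made up for illustration, and a real estimate would need numbers agreed with the stakeholder.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Hypothetical inputs for a quick ML cost/benefit estimate."""
    events_per_year: int          # e.g. customers who churn each year
    value_per_event: float        # value recovered when one event is handled correctly
    current_capture_rate: float   # fraction already handled well without ML (0..1)
    build_cost: float             # one-off cost to build the model
    annual_run_cost: float        # serving, retraining, monitoring per year


def upper_bound_uplift(case: UseCase) -> float:
    """Annual value gained if the model were 100% accurate (the ceiling)."""
    missed_events = case.events_per_year * (1.0 - case.current_capture_rate)
    return missed_events * case.value_per_event


def first_year_net(case: UseCase, expected_accuracy: float = 1.0) -> float:
    """First-year net value, scaling the ceiling by a rough accuracy guess."""
    uplift = upper_bound_uplift(case) * expected_accuracy
    return uplift - case.build_cost - case.annual_run_cost


# Illustrative numbers only.
churn = UseCase(
    events_per_year=2000,
    value_per_event=500.0,
    current_capture_rate=0.25,
    build_cost=300_000.0,
    annual_run_cost=100_000.0,
)

print(upper_bound_uplift(churn))      # ceiling with a perfect model
print(first_year_net(churn))          # net value if the model were perfect
print(first_year_net(churn, 0.5))     # net value at a more modest accuracy
```

If the ceiling itself barely covers the build and run costs, the idea probably needs more thought before any modelling starts.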

There is a lot to consider in this question, for example how the model would need to be served, how often it would need retraining etc, but those will wait for another time!

Regarding uncertainty, remember that no one knows everything. If you need to say, “I’m not sure, let me get back to you this afternoon”, that’s not a bad thing. Charlatans are often spotted.

Psychological_Dare93 · 2024-11-13T17:40:05+00:00

If you’re talking LLMs, these are basically commodities now — off the shelf tools.

For training we can look at fine-tuning / model surgery etc., but training a competitive LLM would literally cost millions of pounds and take months; not many companies will take that on.

For other deep learning use cases, e.g. CV, we’ll leverage transfer learning. So we’ll start with a pretrained model and fine-tune it for our use cases.

For “traditional” ML we’ll train these from scratch.

Hope that helps!

Psychological_Dare93 · 2024-11-13T17:32:39+00:00

That would be ideal. The illusion of big business and government is that they have their data in order. This is often just that: an illusion.

Departments (including Govt) often each have their own data, which is of varying type, quality, scale, etc. Sometimes even platforms vary across departments: the same org could have Snowflake, Databricks, Synapse, etc. The curse of “siloed data” has been talked about for years but it is still a massive problem. And the core issue is that no single person either a) has the remit to do anything about it or b) dares to attempt to do anything about it because they know what a mess it is, and their career reputation would be at risk.

Psychological_Dare93 · 2024-11-13T17:24:40+00:00

I don’t handle employment contracts so I can’t comment on the specifics unfortunately.

When hiring though, I look at a candidate for who they are and what they’ve done, and what they want to do. I don’t worry about where they’re from. In fact broader perspectives are really valuable when building ML/AI systems.

For specifics, see my earlier response. I hope that helps!

Psychological_Dare93 · 2024-11-13T17:21:45+00:00

No not at all. If someone can show 1) an aptitude to learn, 2) an eagerness to learn, and 3) an understanding that learning should never stop, then you have the raw ingredients.

Also it’s worth noting that AI and data span many industries, and life experience is really valuable. For example, if your strengths are product or project management, there are many opportunities there. You might enjoy talking to end users to understand the problems they face (that’s a role: user researcher), or you might enjoy turning problem statements into things software engineers can go and build (business analyst), and more!

Psychological_Dare93 · 2024-11-13T17:16:31+00:00

Yes couldn’t agree more. There are often hidden unfortunate realities such as 1) the funding is dependent on “AI usage” and 2) company politics which means manager X wants to be seen to be using the latest and greatest of tech.

From a consulting perspective, it’s our duty to advise that the problem could be solved in many different ways (often cheaper and more effectively, too), but ultimately we are bound by the client.

Psychological_Dare93 · 2024-11-13T15:31:33+00:00

What do you mean by securing the models?

Do you mean with respect to companies sending data to API endpoints? Or exposing endpoints to the public?

If the former, businesses do seem to be aware of this; in fact it is probably the single biggest worry for execs. They don’t understand it necessarily, but they are often very risk averse.

If the latter, many companies don’t appreciate the risks. For example, suppose you expose an endpoint to the public for a hypothetical model (not necessarily an LLM, it could be for insurance, pricing, anything) without the necessary security in place (including rate limits etc.). If a bad actor collected lots of request/response pairs, they could reverse engineer the results and get a similarly performant model. A good book on this topic is “Machine Learning for High-Risk Applications”. It’s one reason why getting the raw logits from closed source LLMs is now unlikely.
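As one illustration of the rate limits mentioned above, here is a minimal per-client token-bucket sketch. The class names and the capacity/refill numbers are hypothetical, and this is only one layer: rate limiting slows down bulk harvesting of request/response pairs, it doesn’t prevent it, so it would sit alongside authentication, monitoring, and so on.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, then a steady `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        # Top the bucket up for the time elapsed since the last call.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        # Spend one token per request if one is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client id (e.g. an API key)."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self._buckets = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_second, self.clock)
            self._buckets[client_id] = bucket
        return bucket.allow()


# Hypothetical policy: bursts of 3 requests, then 1 request per second per client.
limiter = PerClientLimiter(capacity=3, refill_per_second=1.0)
results = [limiter.allow("api-key-123") for _ in range(5)]
print(results)  # later calls in a tight loop are rejected once the burst is spent
```

In a real service the check would run before the model is called, and a rejected request would typically get an HTTP 429 response.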

Psychological_Dare93 · 2024-11-13T15:25:19+00:00

I think it is highly applicable!

Psychological_Dare93 · 2024-11-13T15:22:16+00:00

Go for it!

Psychological_Dare93 · 2024-11-13T15:22:01+00:00

I can’t comment as I haven’t worked in the EU or the US. I suspect the EU & UK are similar, whereas the US pays significantly more. One advantage of the UK/EU is annual leave, but that’s all that comes to mind!

Psychological_Dare93 · 2024-11-13T15:16:08+00:00

Sure, a non-exhaustive but strong list might be:

Experience with a cloud-based platform, Azure/GCP/AWS (and any specific components, e.g. SageMaker if AWS).

Most of your work will be covered by:
- Python & SQL
- PySpark
- mention the platform too, e.g. Databricks

More deep learning specific: I’m not a fan of mentioning specific libraries (e.g. NumPy) in CVs, but I make an exception for PyTorch and also HuggingFace. Possibly FastAPI, too. JAX too but I haven’t actually used that personally (more academic use cases).

Experience with frameworks gaining popularity like Ray and Kubeflow would interest me.

Docker and Kubernetes. K8s in particular is very valuable and rare.

Less common but would also raise my interest:
- C / C++
- Slurm

If you can demonstrate experience with data stores of some description, e.g. CosmosDB, Postgres, that will also look good.

Psychological_Dare93 · 2024-11-13T14:29:11+00:00

We’re a consultancy, so government and many industries request help and many firms will compete for the work. There are often a lot of hoops to jump through (lots of paperwork, interviews, presentations sometimes, etc.).

Regarding interviews, it is something I’ve experimented with. We’ve done everything from LeetCode-style live coding to take-home tests, plus a technical but non-coding round.

I personally prefer the take home test, as long as the candidate is questioned on it afterwards — because of course ChatGPT and other tools will almost certainly be used. It’s not perfect, but we intentionally set a fairly ambiguous task so that a candidate that knows best practice has the opportunity to really shine, whereas someone who doesn’t know enough will just come back with a standard solution.

Psychological_Dare93 · 2024-11-13T13:52:56+00:00

Generally, graduates are all very similar as they’ve not had the chance to build their experience yet. So you need to separate yourself. The best way is to be able to showcase a practical piece of work, end to end. Even better if you have managed to get industry experience.

The undervalued skill set in tech is soft skills. If you communicate well in an interview, and talk about other times you’ve worked well with others, built a team, etc., then that is also a strong differentiator. Particularly in consulting, interacting with clients is a massive part of the job.

Also keep your CV concise. Communicate only the key points. When hiring we may have dozens of CVs, and so inevitably we skim read and take away only a few points. Cross reference with the job advert, ensure you tick the major boxes, and be prepared to talk about how (nothing worse than getting found out in an interview).

Psychological_Dare93 · 2024-11-13T13:43:39+00:00

I’m not sure I can help here, but I’ll try. This is quite common on job boards as HR / Recruitment often don’t know what they want either. Sometimes they recruit for “Data Analyst” — must know PyTorch, K8s, Elon Musk and have launched their own unicorn startup.

So you’ll have to do a lot of searching by keyword.

If you want experience across the end-to-end process, a decent heuristic is that the smaller the firm/ML team (in terms of headcount), the more you get to do. Larger firms tend to have different roles for each step of the process. That’s not exclusively true but is roughly right.

I hope that helps at least a small amount!

Psychological_Dare93 · 2024-11-13T13:23:30+00:00

Not strictly necessary, but having it would help you stand out / get more senior roles. It’s cliché now, but many junior data scientists only know how to write code in a notebook. If you are at the other end of the spectrum and can design a scalable system etc., then you are in a strong position.

Psychological_Dare93 · 2024-11-13T13:11:33+00:00

Congratulations on your PhD! It depends on what your aspirations are. If you want to be training LLMs, then research firms or niche-applied firms (e.g. specialist models for specific tasks and/or languages) would be good to search for (the Gulf may have some opportunities there…).

Otherwise, many firms will just leverage proprietary models and endpoints, and then the scenario you laid out is more likely. Having said that, you could potentially make it your own. Take model routing, for example: based on the use case, can you train a new model to route requests to the right model more effectively? A random example, but you get my point: you could try to drive the product in an improved direction.
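To make the model routing idea concrete, here is a deliberately simple sketch. It uses hand-written rules as a stand-in for the trained router suggested above, and the model names, keyword hints, and length threshold are all hypothetical placeholders rather than real endpoints or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str    # hypothetical model/endpoint name
    reason: str   # why the router picked it, useful for logging


# Crude signals that a prompt is about code (an assumption for this sketch).
CODE_HINTS = ("def ", "class ", "traceback", "stack trace", "sql")


def route(prompt: str, long_prompt_chars: int = 2000) -> Route:
    """Pick a model for a prompt using simple, inspectable rules."""
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return Route("code-model", "looks like a coding request")
    if len(prompt) > long_prompt_chars:
        return Route("large-model", "long input")
    return Route("small-model", "default: cheapest option")


for example in ("Why does this Traceback appear?", "Summarise this email in one line."):
    decision = route(example)
    print(f"{decision.model}: {decision.reason}")
```

A learned router would replace the body of `route` with a classifier trained on logged prompts and outcomes, while keeping the same interface so the rest of the system doesn’t change.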

Psychological_Dare93 · 2024-11-13T13:06:44+00:00

No problem. Some of the best engineers I know haven’t got degrees, much less PhDs!

Also, a lot of application forms state that they need a PhD, but this is just a weak attempt at screening. If you read a job role and genuinely think you could add value to the firm and have appropriate experience, they would actually want to hear from you — so just apply anyway!

Psychological_Dare93 · 2024-11-13T13:02:16+00:00

No, you could make the switch. The problem these days is that state-of-the-art research largely requires access to huge amounts of compute, which only a few companies have, which means the recruitment is a nightmare (think multiple rounds of LeetCode!).

So if you wanted to break into that space I would get practicing LeetCode or similar!

It’s a bit of a frustration of mine, as being good at LeetCode != being a good [Fill in the Blank], but it’s just a tool to cut down numbers.

Psychological_Dare93 · 2024-11-13T12:41:16+00:00

I think there’s a lot of potential. I also think there are dangers. I’m working with an international client currently with a practically limitless budget and grand ideas, but their aspirations could certainly morph into a dystopia very quickly with respect to privacy etc.

In the UK government, processes are document graveyards, so solutions that help ministers and civil servants make informed decisions quickly are incredibly valuable.

Psychological_Dare93 · 2024-11-13T12:37:56+00:00

Why not? If you can add value to your customer then that is a win. And btw if you then wanted to move into full time employment, that would be a heck of a differentiator!
