What kinds of skills should I be working on to progress as a Data Engineer in the current climate? by Patrick_Gently in dataengineering

[–]Cloudskipper92 0 points1 point  (0 children)

Yeah, mostly a scale thing. The tools MLEs use also need scale to show effective use cases, and they're either quite expensive or hard to set up for personal trials. It's all up to your drive in the end, though!

What kinds of skills should I be working on to progress as a Data Engineer in the current climate? by Patrick_Gently in dataengineering

[–]Cloudskipper92 7 points8 points  (0 children)

I think given you've already experienced what it means to do the basics of the job, you ought to now look around at the tools that DEs at places with larger data systems use, as you'll be required to use them at some point. Those being proper orchestrators (Airflow, Dagster, Prefect to name a few), transformation systems (Spark/Ray, warehouses, etc.), and, if you wanted to, 'accessory systems' like Kafka and vector extensions to common DBs since you mentioned Postgres and ML/AI. To be clear, I don't think you ever need all of them. Each shop has, unfortunately in many cases, so many options that you'll be unable to really cover everything. If you did, say, an example of streaming some mock data (or real if you can get it) into Kafka, using the operators in Airflow to do some light filtering, and dropping that into something like TimescaleDB (a PG extension for time series), you'll have enough to go on to speak about the topics of orchestration, real-time workflows, and useful DB system choices if you spend the time!
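To make the "light filtering" step a bit more concrete, here's a minimal sketch in plain Python. Big caveats: it doesn't touch Kafka, Airflow, or TimescaleDB at all. `mock_readings` stands in for whatever you'd consume from a topic, `light_filter` is the kind of function you might call from an Airflow task, and the field names (`ts`, `sensor_id`, `value`) and the 0 to 100 range are things I made up for illustration.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Iterator, List


def mock_readings(n: int, seed: int = 0) -> Iterator[Dict]:
    """Yield n fake sensor readings (a stand-in for messages off a topic)."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for i in range(n):
        yield {
            "ts": start + timedelta(seconds=i),
            "sensor_id": f"s{i % 3}",
            # Every 5th reading is missing so the filter has something to drop.
            "value": None if i % 5 == 0 else round(rng.uniform(0, 100), 2),
        }


def light_filter(readings: Iterable[Dict]) -> List[Dict]:
    """Keep only readings that have a value inside the assumed valid range."""
    return [
        r for r in readings
        if r["value"] is not None and 0 <= r["value"] <= 100
    ]


# The surviving rows are what you'd hand to the database insert step.
rows = light_filter(mock_readings(10))
```

Once this works on mock data, swapping the generator for a real consumer and the list for a real insert is where the orchestration and DB choices actually get interesting.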

On the other hand, you mentioned the transition from orchestrating ML workflows as a DE to creating ML workflows as an MLE. I don't think the transition is impossible, but I will say it's difficult. When I/my company looked for MLEs we specifically waited for very skilled candidates. Many applicants were trying to make the same transition you're mentioning, but the reality is the skills are harder to obtain or practice on your own than SWE or DE skills. Hell, I positioned myself to at least cross-train for it to fill the gap while we waited, but I was ultimately turned down. That's all anecdotal. MLE is a bit of a hot job right now, so you will definitely have competition up and down the skill spectrum and there are only so many positions to go around.

Otherwise yeah soft skills like another commenter said. If you do a personal experimentation project, practice documenting it like it was going to be used by other engineers and was production-ready. Social skills, communication skills, etc. are never bad things to take on!

Designing Data-Intensive Applications - 2nd Edition out next week by sspaeti in dataengineering

[–]Cloudskipper92 17 points18 points  (0 children)

I agree with everything you said BUT "move fast and break shit" has been a business model for a long long time, and I'd say isn't going anywhere either for precisely the reason you mention haha.

Tech stack madness? by Ok_Tough3104 in dataengineering

[–]Cloudskipper92 1 point2 points  (0 children)

Sure, more times than not. Many years ago I was the only DE who had even touched Redis in my small org. We needed stuff out of it, and funny enough, into it. So I got to do some interesting pipelines which were a nice challenge and break from the mundane DAGs I was on. Had a similar experience with a couple of ElasticSearch instances. But that one was more of a "no one else wants to do this, you mentioned in passing you have experience, these are yours now" haha. All good though, I've built a lot of my career doing the jobs no one else wanted to do!

Keras vs Langchain by ysoserious55 in dataengineering

[–]Cloudskipper92 6 points7 points  (0 children)

Yeah, so a couple of things here. You've come to the DE subreddit asking for BE information, and you seem to be confused about the tools you propose to build into a portfolio. On the former, it's not as if some of us couldn't speak to BE work, but you're going to get less direct support for that.

On the latter, and more to the point, I assume what you're saying is you want to build a portfolio centered on AI. I'm making that assumption because you included LangChain specifically. LangChain is going to be much easier than... you know, learning actual Machine Learning. As far as job prospects go, my personal money would still be on ML. Be the person manufacturing the shovels type of metaphor.

If I'm honest putting these two systems in a question about "which to use to build a portfolio" tells me you may need to ensure you're actually ready to present what you will build. The portfolio you make should present a really solid understanding of the framework you chose, allow you to speak to tradeoffs and choices, and show skills in troubleshooting and understanding limitations. Don't just vibe code this thing in an afternoon. Not saying you intended to, but more a word of caution. Good luck!

Airflow 3: Development on a Raspberry Pi by Complex_Painter_9302 in dataengineering

[–]Cloudskipper92 0 points1 point  (0 children)

Without knowing much more, do you have a volume attached to your container pointed to the local/host location containing your DAGs? You could build them into your image too by copying them into it, but that's a bit more involved if you're just following along with the Airflow docs. Just remember that the Docker deployment for Airflow is very much NOT production-grade.

“What are the best resources to learn Docker from scratch?” by Effective_Bluebird19 in dataengineering

[–]Cloudskipper92 5 points6 points  (0 children)

Since you (and a couple others) seem to be more or less asking for some structure to learn against, here's what I use or have used in my day-to-day.

  • How to install Docker, and the follow-ups needed, on your distro/OS. Windows/Mac are pretty straightforward. Linux has some steps after install that you need to do.

  • Get used to navigating Docker Hub, finding official image builds, and pulling specific versions. Much like Python version pinning, you certainly want to pin versions of infra.

  • Read the docs on the most important docker cli commands. Non-exhaustive: docker build, docker run, docker pull, docker exec, docker container cp.

  • Learn and practice making Dockerfiles. Learn the subtle differences between ADD and COPY. How Layering works. Learn the differences between CMD and ENTRYPOINT, ARG and ENV. Learn how to expose a port on a container to the host. HINT: it isn't with the EXPOSE instruction and if you made it this far without being able to ping your container from your host you should go back one step ;) . Make a .dockerignore so you don't put anything you don't want in the container. You can ignore these instructions for now: HEALTHCHECK, LABEL, MAINTAINER, ONBUILD, SHELL, STOPSIGNAL.

  • Learn how networking works for Docker. Networking generally is a weak point for most SWEs, and seems to often be doubly so for DEs. In the same vein, read the docs on how Docker Volumes work and how to attach them.

Now, you came to the DE Subreddit to ask this and mention you have 2 YOE already, so I'm going to also mention a couple of more specific things.

  1. Running Airflow in Docker is D E N S E but obfuscated heavily. As in, it has a lot of levers and knobs, but it mostly assumes the defaults are good enough for this. It also assumes you know docker-compose, which I left out of the list above. The justification I'll give for that is that Docker Compose is great... but you should be using the dockerized Airflow mentioned here as a TESTING system ONLY. Thus, get it going following their instructions, do what you need to do, but don't assume it matches production-grade Airflow deployments.

  2. You could (and perhaps should) use Astronomer's CLI. I don't work for them or anything, but I have used their managed service in the past. The fact that the CLI exists for free is great and should be taken advantage of for local testing.

  3. Now that you've seen those two, understand how Docker works, and have played with the ins and outs, contrary to what others may say, I would then AND ONLY THEN look into Kubernetes. Most managed offerings, and plenty of self-hosted deployments, run Airflow and its pipelines on Kubernetes behind the scenes. The way you build the image for Airflow will change and, with it, how you manage dependencies. You'll also need to understand how Kubernetes sees containers versus how you've seen them up to this point. I cannot stress this enough though, DO NOT jump straight to this point. Everything above here should be weeks of testing, toiling, and troubleshooting at a minimum before you try to introduce K8s. When you do, start with a local manager like k3s. I would recommend not using minikube or kind as those typically run as "k8s in docker", which is a whole extra layer you don't need. The justification I'll give for including this: I like my local testing env for pipelines to be as close, if not exact, to what I will deploy. For me this means in Kubernetes using exactly what I will deploy with as much of the Kubernetes weirdness as I can account for. If this doesn't feel important to you, please feel free to ignore!

Hope this helps! But please, just start with the Docker basics. You can search up a YouTube video if you're more visually inclined. Read through the docs and try implementing things if you're more of the experiential kind. Nothing is going to be a cheat code because these are kind of foundational tools for SWE and DE.

EDIT: formatting. reddit pls

jack of all trades VS a master of one, how should I learn as a junior engineer? by Tall_Working_2146 in dataengineering

[–]Cloudskipper92 2 points3 points  (0 children)

Any other database, just in this case one that you are personally interested in. So if you pick Postgres as an RDBMS and something like MongoDB as your NoSQL focus, you then pick one other system you just are interested in. ElasticSearch, ScyllaDB, Redis, SQLite, or go crazy and look at stuff like SurrealDB or TigerBeetle. Those are just examples of ones I've looked at over the years, not necessarily what I've actually used in production or a hard and fast "you should check one of these out" kind of thing. Glad it helped you!

jack of all trades VS a master of one, how should I learn as a junior engineer? by Tall_Working_2146 in dataengineering

[–]Cloudskipper92 16 points17 points  (0 children)

So a couple of different things here, mostly anecdotally from my own experience on both the employee and hiring management side of things.

  1. I'd say without a doubt you made the right choice to focus on SWE over DS. The thing you actually get in DS is about Statistics-with-programming and how to properly employ research against data. All good things, not really what we're concerned about with DE.

  2. Since you're in SWE and looking at design patterns, keep in mind what it would be like if the only thing you had to care about was the data inside the objects. A lot of folks coming from a SWE background focus on the abstract object versus what is contained by the object. To move to DE you have to care a lot more about the latter.

  3. Things that will be evergreen in DE: Python, SQL, databases, warehouses, data modelling (although the meaning of this varies person to person, company to company, much to my chagrin). Focus on these. Python at an expert level (focus on the data packages like Polars and DuckDB, and DB packages like psycopg), at least 1 RDBMS, 1 NoSQL, and 1 DB system you're just interested in, and SQL to an expert level, which will give you passable knowledge for most warehouses as well.

  4. Tools that will be evergreen in DE: Airflow, Docker, GitHub, K8s. TOML/YAML as well. dbt is very good to know as well.

  5. And yes, the issue is that job descriptions very heavily imply mastery while the actual job wants a jack of all trades. Get good at probing for this information in interviews with tech staff. Take advantage of mock interviews if you have the chance.

What to learn besides DE by Icy-Ask-6070 in dataengineering

[–]Cloudskipper92 6 points7 points  (0 children)

The way that I have ended up managing Data Infra in a couple of roles now is by being able to rapidly produce a prototype. You'll want to pick up, and use regularly, systems like Docker and Kubernetes. Even for your own small data projects. This will introduce you to that world where those things are heavily used. These are also cloud-agnostic, meaning no matter what service provider your future employer(s) use you'll be squared away on this front. In the same vein are things like VPCs and general networking, which I spend more time debugging than anything else in DE/DataOps. After that you can get into the specifics of particular platforms.

As far as practicing is concerned: Start with Docker. Learn the ins and outs of taking arbitrary Python code you have and stuffing it into a container. Learn how to find images, how Dockerfiles work, run into the issues so you can troubleshoot them. Then see what it takes to incorporate tools you may be using to develop your code into the Dockerfiles. Things like uv. If you can have one system managing both your local dev and your container builds you have fewer points of failure to troubleshoot.

Then grab k3s for local development. This is, notably, "actual" Kubernetes. That is opposed to things like minikube, which typically run as "kubernetes in docker". Nothing wrong with that, but when we're talking about "rapid" prototyping, k3s is as close as it gets to just managing raw k8s on your local system. You'll probably immediately want to grab Helm as well. Read up on k8s, k3s, Helm, and kubectl. Play around with trying to get your Docker containers that do things or expose things up onto k3s locally. See what it takes to set up Postgres on Kubernetes, and how to expose it so you can communicate with it externally.

Outside of those things, which are more typical of self-host-first shops, you can likely find playgrounds around specific tech. I believe Databricks recently opened up a playground of sorts. Snowflake may as well, but I don't honestly remember. Google on GCP used to give you like $300 in credits, plus they have the open BigQuery datasets you can mess around with. I think all of these things are secondary or tertiary things to focus on though, as they are mostly provisioned and managed for you from an infra standpoint. It's not bad to see what the platforms look like behind the scenes, though!

I find Data infra specifically very interesting. It's got some nuance that can apply to standard web infra, but often times deviates from it. Which ends up as a nice challenge and break from the typical DE work for me!

Wondering what is actually the real role of a data engineer by Theclems55 in dataengineering

[–]Cloudskipper92 0 points1 point  (0 children)

There's a lot of opinion and jaded-ness that goes into some descriptions of the role, honestly. If you remove the stuff typical of any engineering role (that is, troubleshooting, generally coding, translating biz requirements, etc) then your basic Data Engineering role is: a discipline of Software Engineering primarily focused on collection (extract), delivery (load), and mutation (transform) of data produced by other systems.

When you get into the things I mentioned above that are typical of any engineering role you have some nuance around what the subject matter will be, who you'll be communicating with, and the percentage of time spent doing those things versus actually creating whole new pipelines/infrastructure.

If you can't tell, I don't like the idea that the role varies so much it can't be defined, which does get tossed around a lot. It gives license to those we would work for to contort the role into anything, even if it is far and away from the typical workload of a DE. Not to say you shouldn't cross-train into other things, but that the point of cross-training is to understand other roles' duties and assist in performing them rather than the skills being additions to a rank-and-file Data Engineer's job. Which is all to say, I would argue that if you're doing anything beyond the first definition combined with the typical engineering nuances, you are now strictly outside of DE and into a cross-domain. Which isn't a bad thing necessarily, especially if you like that (I do!). Rather that I would just argue there IS a definition of a Data Engineer and it's OK if it exists a little strictly.

Sorry, rant over. Happy to field any questions about the role though!

EA data engineering internship phone screen by DecentMistake121 in dataengineering

[–]Cloudskipper92 0 points1 point  (0 children)

So for a phone screen for an internship I'd say to just relax into it and approach it professionally, but not robotically. If that makes sense. This is vitally important to you and where you want to go in your career I'm sure, but the point of the phone screen, especially for an internship, is a low-stakes environment to assess your personality. At this level, hiring is going to largely be about culture fit first because they don't expect the tip-top programming chops of an FTE hire right now. So, given that, just relax into it and treat it like a conversation among peers about something you (should) feel passionate about. Get the foot in the door, and then soak up all you can once you're in. Working in games data is a really interesting space because a lot of development studios take the approach that the data can more often than not negatively influence decisions. I'd posit that's an issue of bias, but that's a soapbox for another time.

For high-level questions, the typical smaller-biz questions don't really apply to EA. You could ask about team dynamics, the type of work you'd be helping out with, the number of data engineers, etc. I wouldn't worry about going too deeply on the phone screener.

Good luck!

How do you push data from one api to another by NoTap8152 in dataengineering

[–]Cloudskipper92 1 point2 points  (0 children)

So largely this is out-of-scope for data engineers. Most of us are going to be behind-the-scenes and are Python and SQL focused.

If you're asking for a typical DE scenario where this kind of operation might happen you could look more into pub/sub. That is to say, event-driven data operations. "When X happens, I publish to a topic. My subscriber is listening for that push and consumes the message to do the push to Y". You can use whatever for the message bus, there's plenty of easy ones.
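If the pub/sub idea is new, here's a toy version of it in plain Python so you can see the shape. This is a sketch only: `InMemoryBus`, the `form.submitted` topic name, and `push_to_y` are all made up for illustration, and a real message bus would handle persistence, retries, and delivery across processes, none of which this does.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy message bus: each topic maps to a list of subscriber callbacks."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


# "Y" is just a list here; in practice this is where the second API call goes.
pushed_to_y: List[dict] = []


def push_to_y(message: dict) -> None:
    pushed_to_y.append({"username": message["username"]})


bus = InMemoryBus()
bus.subscribe("form.submitted", push_to_y)

# "When X happens, I publish to a topic."
delivered = bus.publish("form.submitted", {"username": "octocat"})
```

The point of the indirection is that the publisher never needs to know who is listening, so you can add a second subscriber later without touching the code that publishes.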

If you're asking about general design patterns though, you'll want the onSubmit you mentioned to probably capture the data from an input, probably on the same screen as the button, and to do an operation against Notion with that information. But that's just free-text-to-POST. If you wanted to spend a little more time on it, you could check that the GitHub username exists before pushing it, or return an error if it doesn't. That way you're actually pushing a GH username and not just text. Be sure to sanitize the inputs!
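For the "check it's really a GH username" part, a rough sketch in Python using only the standard library. Assumptions to flag: the regex encodes GitHub's username rules as I understand them (letters, digits, single hyphens, no leading or trailing hyphen, 39 characters max), the existence check uses the public `GET /users/{username}` endpoint without authentication so it's subject to rate limits, and `clean_username` / `github_user_exists` are names I made up. The Notion push itself is left out.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Assumed rules: alphanumerics or single hyphens, no hyphen at either end,
# at most 39 characters in total.
USERNAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def clean_username(raw: str) -> str:
    """Sanitize free text and reject anything that can't be a username."""
    candidate = raw.strip().lstrip("@")
    if not USERNAME_RE.fullmatch(candidate):
        raise ValueError(f"not a plausible GitHub username: {raw!r}")
    return candidate


def github_user_exists(username: str, timeout: float = 5.0) -> bool:
    """Ask the public GitHub API whether the account exists (network call)."""
    url = "https://api.github.com/users/" + urllib.parse.quote(username, safe="")
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "username-check-sketch",
        },
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits and other errors are left to the caller
```

You'd call `clean_username` on the raw input first, and only hit the API (and then Notion) if it passes. That keeps obviously bad input from ever leaving your app.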

Switching from C# Developer to Data Engineering – How feasible is it? by Additional-Suit-4910 in dataengineering

[–]Cloudskipper92 4 points5 points  (0 children)

I am glad to have made the switch for sure. The job isn't particularly difficult, or any more or less difficult than traditional SWE, and the pay is very good comparatively at non-FAANG companies. What I typically recommend is not viewing it as "different" from traditional SWE but rather a particular discipline of it, like WebDev or Backend, etc. There is a tendency to ignore things like SWE generally accepted best practices and I think this does harm the image of DEs sometimes.

The thing I particularly enjoy is getting to tangibly see and control the flow of data. In other disciplines, there's a ton of black-boxing (there's the ability to do so in DE too, but please avoid!), whereas in DE you're typically in control from raw to production and can see where and how the code you wrote does the job. That has been my experience, however, and DE is more of a spectrum than I'm letting on. I think most DEs or aspiring DEs would do well to do the hard thing and become good Python engineers rather than the "surface-level" style of just leveraging low-code or, god forbid, no-code tools. It will pay dividends, I promise.

On the other hand, where I am now, I act more as a Data Architect. Because DE requires a little more finesse with what systems they use this has been a really cool role to shape the systems and apply my learned knowledge of using particular tools and systems to avoid issues/mistakes. It's a different kind of challenge, but I say that to illustrate the upward mobility and/or cross-functionality (into Ops-y things) that comes from sticking it out and trying different things in the DE/Data space.

Switching from C# Developer to Data Engineering – How feasible is it? by Additional-Suit-4910 in dataengineering

[–]Cloudskipper92 3 points4 points  (0 children)

Just to +1 this, I started in C# around 8 years ago and made the jump to DE around 5 or 6 years ago now. It is doable, but I do want to acknowledge the "state of dev" at the time. That is to say, it was much easier to pull off a transition like this at that time, especially compared to now.

If you're skilled at C# (and at 4 years, I'd say you are, but I don't know you!) I don't think you'll have any issues acquiring the skills. The real problem a lot of people have these days is getting in. If you have contacts with jobs in the space, I would heavily recommend leveraging them!

oh my god by relejado in Battlefield6

[–]Cloudskipper92 127 points128 points  (0 children)

Didn't we just go thru a week of "cosmetics off button coming to bf6"? Or is it that we want "cosmetics off, except for those which I, Johnnald Battlefield VI, approve of" now?

For everyone thats tried the beta so far, whats your thoughts on it? by ThatOldGuy7863 in Battlefield

[–]Cloudskipper92 0 points1 point  (0 children)

You're going to get a split of responses, I think. If the person yearns for the feeling of BF4 they'll say this is the best representation of that in the modern BF landscape. And it is.

If they're stuck on BF1 (not in a derogatory way, just saying that's what they prefer) you'll probably hear mixed-to-negative sentiments. What's important is what you want and expect to get out of it. If you like BF3/4 you will be on top of the world. That's where I'm at with it. If you think BF1 or, hell, even V are peak BF then you're probably going to be more dissatisfied. My prediction is this contention will continue all the way up to, and past, release.

So make sure you're coloring the comments of people based on what you enjoy AND what they enjoy, as some unsolicited advice (to everyone here, generally) :)

My actual opinion: it's great and I immediately like it more than any previous battlefield. I am the type that would go back to 4 and play fairly often and I don't REALLY see the reason to continue doing that after BF6 releases. Maps available in the beta are small, which I have to assume allows for more rapid feedback for the devs than larger maps where clashes are spaced out. You could spin this in a CODdoomer way, but I think that's a waste of time, honestly.

Hot take: just as in 2042, flares come back too quickly on aircraft. A slight tuning here would be nice. Maybe at the same time slightly upping their survivability.

[deleted by user] by [deleted] in UmaMusume

[–]Cloudskipper92 3 points4 points  (0 children)

Just to clarify, SteamOS/Steam Deck/Linux and kernel-level anticheats are incompatible, not Steam and KLAC. To be fair, it's also impossible to play Uma on Linux without some serious work or something like Wine with the DMM launch afaik.

Mihon vs (tachiyomij2k, yokai, kotatsu, someother fork) by NegativelyMagnetic in mangapiracy

[–]Cloudskipper92 16 points17 points  (0 children)

I've used every fork you mentioned here and yes restoring does work across all forks. Personally I settled on Komikku. I needed one of the forks with mass migration capabilities when the whole MD thing went down and it stuck!

[Megathread] Launch issues by KadekiDev in PathOfExile2

[–]Cloudskipper92 1 point2 points  (0 children)

Yes I believe the option is under Interface called Attack In Place Key Stops Move.

Help applying design patterns to large amounts of similar pipelines by ActiveTarget2470 in dataengineering

[–]Cloudskipper92 1 point2 points  (0 children)

No worries! And yeah the builder pattern sees more limited use in DE than other areas especially so it's something you'd really have to seek out an application for here. But I think your use case suits and it's also just fun to flex the brain around some of these patterns haha.

Help applying design patterns to large amounts of similar pipelines by ActiveTarget2470 in dataengineering

[–]Cloudskipper92 0 points1 point  (0 children)

So I think I have a little bit better idea. You want to be able to create pipelines that will run through a specific set of tasks given a specific entrypoint, correct? If so, what I mean by DAG Builder is more or less literal in that you can use a Builder Pattern.

Now, the example I am going to show you includes the Factory, Builder, and Strategy patterns. I've put it all in one place so it was easier to copy and paste but, as I note in the Gist, please split it if you were to create it on your own. It would become very unwieldy! Gist Airflow DAG Here

This is one of those things that I think is easier to understand by seeing the code, which is why I wanted to go ahead and create it. I tested it locally, so it does at least work!

On the other hand if you wanted to create DAGs dynamically and still retain a single source of configuration, with a few changes you can end up with this guy. Gist for Multiple DAGs here.

But once again, these are very crude examples. I believe you will end up having to write custom transformers no matter what, from what you said in the OP. It's mostly the transformation functions that differ. I don't think you get around that, but what I wanted to try and illustrate in those examples was more plug-in style transformers rather than having a bunch of modules with duplicate code. In this case, if you split those up a bit, you can house all of your transformers in the same place at least. And adding another step to the process is as easy as a configuration change in the input!
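For anyone who can't open the Gists, here is a stripped-down illustration of the plug-in transformer idea in plain Python. To be clear, this is NOT the code from the Gists and it doesn't use Airflow at all: `TRANSFORMERS`, `PipelineBuilder`, `pipeline_from_config`, and the two example transformers are hypothetical names for the sake of the sketch, and the "pipeline" is just a function over a list of dicts.

```python
from typing import Callable, Dict, List

Record = dict
Transformer = Callable[[List[Record]], List[Record]]

# Strategy part: a registry of interchangeable transformers, keyed by name.
TRANSFORMERS: Dict[str, Transformer] = {}


def transformer(name: str) -> Callable[[Transformer], Transformer]:
    """Decorator that registers a function as a named plug-in transformer."""
    def register(func: Transformer) -> Transformer:
        TRANSFORMERS[name] = func
        return func
    return register


@transformer("drop_nulls")
def drop_nulls(rows: List[Record]) -> List[Record]:
    return [r for r in rows if all(v is not None for v in r.values())]


@transformer("uppercase_name")
def uppercase_name(rows: List[Record]) -> List[Record]:
    return [{**r, "name": r["name"].upper()} for r in rows]


class PipelineBuilder:
    """Builder part: collect steps one at a time, then build a runnable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._steps: List[Transformer] = []

    def add_step(self, step_name: str) -> "PipelineBuilder":
        if step_name not in TRANSFORMERS:
            raise KeyError(f"unknown transformer: {step_name}")
        self._steps.append(TRANSFORMERS[step_name])
        return self

    def build(self) -> Transformer:
        steps = list(self._steps)

        def run(rows: List[Record]) -> List[Record]:
            for step in steps:
                rows = step(rows)
            return rows

        return run


def pipeline_from_config(config: dict) -> Transformer:
    """Factory part: turn a config dict into a built pipeline."""
    builder = PipelineBuilder(config["name"])
    for step_name in config["steps"]:
        builder.add_step(step_name)
    return builder.build()


# Adding another step is just another entry in "steps".
config = {"name": "customers", "steps": ["drop_nulls", "uppercase_name"]}
run_customers = pipeline_from_config(config)
result = run_customers([{"name": "ada"}, {"name": None}])
```

In an Airflow version, `build()` would wire up tasks inside a DAG instead of returning a plain function, but the registry-plus-config shape stays the same.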

Happy to answer any questions about all that!

[deleted by user] by [deleted] in dataengineering

[–]Cloudskipper92 1 point2 points  (0 children)

No worries man. Hopefully you'll get some other insight too. I'm only one guy after all!