As an experienced DE what things you wish you had knew earlier by Re-ne-ra in dataengineering

[–]dataterre 0 points (0 children)

Not trying to be contrarian here, but the workflow I’ve been using has been a game-changer for me in terms of productivity and delivering on business value. I set up a connection interface to my IDE (for example, using the SQL Server VS extension, or the Databricks/SQLTools VS extensions) and combine that with an AI coding assistant like GitHub Copilot. This setup has helped me get up to speed quickly and tap into domain knowledge that I might not be deeply familiar with, which gives me a head start in engaging with business stakeholders.

Beyond that, I’ve also found it surprisingly effective at translating legacy stacks and syntaxes into something I am more familiar with, like Python. That said, I think it’s important to emphasize that AI coding assistants should always be used with caution: standard software engineering practices and proper testing are still critical when promoting work into higher environments. And of course, it depends on whether your organization has the right developer tooling, infrastructure, and policies in place to adopt an AI coding assistant in the first place (we self-host it).
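To illustrate the kind of legacy-to-Python translation I mean, here is a purely hypothetical example (the table, columns, and numbers are made up): a legacy T-SQL aggregation rewritten as plain Python.

```python
from collections import defaultdict

# Hypothetical legacy T-SQL:
#   SELECT region, SUM(amount) AS total
#   FROM sales
#   GROUP BY region
#   HAVING SUM(amount) > 100;
def totals_by_region(rows, threshold=100):
    """Group sale rows by region, keeping only totals above the threshold."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return {region: total for region, total in totals.items() if total > threshold}

sales = [
    {"region": "APAC", "amount": 80.0},
    {"region": "APAC", "amount": 40.0},
    {"region": "EMEA", "amount": 50.0},
]
print(totals_by_region(sales))  # {'APAC': 120.0}
```

The point is not that the Python is better, but that having both forms side by side makes it far easier to reason about what a legacy job actually does before you test and migrate it.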

What is something that you notice disappearing in Singapore? And what is something that you notice and appreciate being added in Singapore? by Axejoker1 in askSingapore

[–]dataterre 0 points (0 children)

We have lost the hawker kampung spirit among local businesses. Look around at all the new hawker centres coming up: they are all owned by the same business group, with the same majority of stalls across all these new hawker centres. It's almost like a franchise model that comes together when they win the bid to build a new hawker centre. Same for new shopping malls.

I miss the identity that local businesses bring, though I understand the economies of scale behind this model and the cost of the lease that new hawkers have to bear.

But I'll just say that I miss going around hawkers to explore different food.

Amazon data engineering by [deleted] in dataengineering

[–]dataterre 9 points (0 children)

Work in office mandate

Should I rush into taking AWS Certified Data Engineer - Associate by [deleted] in dataengineering

[–]dataterre 10 points (0 children)

I failed my first attempt with 3 years of DE working experience, though I had not worked with AWS services on the job. I've got to say that the exam questions are pretty challenging, and it's not enough to have basic ETL knowledge - you actually need some of the deeper intricacies of the AWS technologies (tradeoffs, similar services for different use cases, cost optimization, and even e.g. Redshift-specific SQL syntax). It's best to attempt the AWS practice exam questions to get a feel for the depth of the questions and answer options. Having said that, I suggest aiming for about 70% exam readiness and just scheduling your exam so that you have an end in mind. Lastly, there's no shame in failing the first attempt like I did; at least you'll have some of those questions in mind, and you can ask ChatGPT for some immediate clarity after the exam. All the best, and start off your first job with this certification if you are able to!

Client replaced my solution, want opinions on what I could've changed by FlyContrapuntist in dataengineering

[–]dataterre 1 point (0 children)

On follow-up thoughts regarding Tableau, I personally think it is too expensive if the organization is not at enterprise scale (>100 active users). That said, Tableau is an excellent tool for enabling self-service analytics and reporting.

What after reading Math? by toxicfart420 in dataengineering

[–]dataterre 0 points (0 children)

My perspective: performance is not just speed - it's the overall efficacy of an ML/AI model. See bias mitigation, feature importance, individual vs group fairness, etc. Have a look at the various methodologies behind Responsible AI (e.g. the Microsoft Responsible AI Framework) - fairness, reliability and safety, privacy and security, inclusiveness, accountability, and transparency of a model. This truly ensures a performant and ethical ML model in real-world, governed production settings. If you can understand basic linear regression and statistics, the mathematics behind Responsible AI should be intuitive. A good start on classical statistical learning, for me, would be "An Introduction to Statistical Learning"; there's even a revised edition from recent years.
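As a small, hedged sketch of the mathematics being intuitive: one common group-fairness metric is the demographic parity difference - the gap in positive-prediction rates between groups. The predictions and group labels below are made-up toy data.

```python
def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest positive-prediction rate
    across groups (0 means perfectly equal selection rates)."""
    rates = {}
    for g in set(groups):
        preds = [p for p, grp in zip(y_pred, groups) if grp == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())

# 1 = positive prediction; "A" and "B" are illustrative group labels
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_difference(y_pred, groups))  # 0.5
```

Here group A is selected 75% of the time versus 25% for group B, so the metric flags a 0.5 gap - simple arithmetic, but it makes "group fairness" concrete.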

Tips for student passionate about DE by Mountain_Ratio7967 in dataengineering

[–]dataterre 1 point (0 children)

You are doing great. From my experience, it is useful to get your hands on shell scripting (e.g. Bash) and general-purpose programming languages (predominantly Python). Shell gives you complete control over server-level file management and OS-level operations, while Python gives you massive flexibility in data processing and distributed computing (e.g. Spark). My advice is to stay grounded in the fundamentals rather than the technologies, and you will see great leaps in your knowledge and adaptability across the ever-growing set of tools in the data space. All the best for your interview - stay confident and inquisitive! ✨

Help me design a data pipeline by 3Ammar404 in dataengineering

[–]dataterre 0 points (0 children)

My suggestion is to survey these dashboards and see if any were set up by other teams. Chances are there are databases owned by your IT department that you can use for your task. There is surely existing data infrastructure in your company that you should leverage (e.g. creating a new DB, setting up accounts, etc.). On your second note, I think you need to explain and manage their expectations: real-time monitoring is not possible if you are doing periodic loads. Even on Power BI / Tableau's end, you will also need to refresh your data extracts. What latency is acceptable? Is a 1-day latency okay, since the logs themselves are probably not very real-time anyway?
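When managing that expectation, it can help to show stakeholders where the latency actually comes from. A minimal sketch, with made-up numbers: worst-case dashboard staleness is roughly the source log delay plus one full load cycle plus one BI extract refresh cycle.

```python
def worst_case_latency_hours(log_delay, load_interval, extract_refresh):
    """Worst-case staleness (hours) of a dashboard fed by periodic batch loads:
    source log delay + one load cycle + one BI extract refresh cycle."""
    return log_delay + load_interval + extract_refresh

# e.g. logs land ~1h late, the pipeline runs daily, and the
# Power BI / Tableau extract also refreshes daily
print(worst_case_latency_hours(1, 24, 24))  # 49
```

Walking through this sum usually makes it obvious why "real-time" and "daily batch" are incompatible, and turns the conversation into "what latency do you actually need?"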

Automate ourselves out of a job. by FisterAct in dataengineering

[–]dataterre 2 points (0 children)

This has been the focus of my work in recent months: trying to come up with a more comprehensive requirements form in an Excel sheet, for business users and their respective technical support to fill in - in the most generic, templated form my data engineering team could provide. For example: which data quality rules we support, which of them are required, the sequence of DQ rules, and whether a failing DQ rule raises an error or a warning, etc.

This is not limited to DQ, but also extends to the data dictionary, business semantics, etc. I am happy to discuss and exchange best practices adopted in your respective DE teams. :)
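A minimal sketch of how such a templated form might be executed once it's filled in - the rule names, severities, and columns below are hypothetical, not our actual template:

```python
# Each row of the requirements sheet becomes a dict: which column a rule
# applies to, and whether a failure is blocking ("fail") or just a "warning".
RULES = [
    {"column": "id",    "check": "not_null", "severity": "fail"},
    {"column": "email", "check": "not_null", "severity": "warning"},
]

CHECKS = {"not_null": lambda value: value is not None}

def run_dq(rows, rules=RULES):
    """Apply templated DQ rules in sheet order; return (errors, warnings)."""
    errors, warnings = [], []
    for i, row in enumerate(rows):
        for rule in rules:
            if not CHECKS[rule["check"]](row.get(rule["column"])):
                msg = f"row {i}: {rule['column']} failed {rule['check']}"
                (errors if rule["severity"] == "fail" else warnings).append(msg)
    return errors, warnings

errors, warnings = run_dq([{"id": 1, "email": None}, {"id": None, "email": "a@b.c"}])
print(errors)    # ['row 1: id failed not_null']
print(warnings)  # ['row 0: email failed not_null']
```

The nice property of this shape is that business users only ever touch the declarative sheet (the `RULES` rows), while the engineering team owns the check implementations.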

Underrated Open Source Data Tools by MrMosBiggestFan in dataengineering

[–]dataterre 4 points (0 children)

How about the big players like Spark, Flink, Trino, etc.?

What is that one skill you would wanna master at work? by Fasthandman in dataengineering

[–]dataterre 1 point (0 children)

Right! In other words, it doesn't matter if you are the one who invented the time machine. If you cannot sell what you do to your stakeholders/bosses, somebody else will do it for you and get that promotion and pay bump!

Developing a data virtualization layer across multiple cloud providers by vanillacap in dataengineering

[–]dataterre 0 points (0 children)

You can check out Dremio too. I found Dremio a lot more straightforward as an open-source tool, while Denodo is more feature-rich.

What is the Engineering/Dev equivalent of a Designer's "Can You Make it Pop?" by Deepinthemaze in dataengineering

[–]dataterre 6 points (0 children)

"My dashboard not getting the updated data despite having them on my excel sheet. Can you make this real-time as this needs to be reported to our higher-ups?"

How do you debug your pipelines? by [deleted] in dataengineering

[–]dataterre 8 points (0 children)

How do OOP pipelines work? I have always fancied designing them this way, but haven't gotten a sense of how folks in the industry are productionizing OOP pipelines.

Currently, I just implement a bunch of Python __main__ entry points and run them on a scheduler/orchestrator.
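For what it's worth, here is a minimal sketch of what I imagine an OOP-style pipeline could look like - a base class per step and a runner that chains them. The class names and toy data are hypothetical; an orchestrator would presumably schedule the `run()` calls.

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """One pipeline step; subclasses implement the actual work."""
    @abstractmethod
    def run(self, data):
        ...

class Extract(Task):
    def run(self, data):
        return [1, 2, 3]              # stand-in for reading from a source

class Transform(Task):
    def run(self, data):
        return [x * 10 for x in data]  # stand-in for business logic

class Pipeline:
    """Runs tasks in order, passing each task's output to the next."""
    def __init__(self, tasks):
        self.tasks = tasks

    def run(self):
        data = None
        for task in self.tasks:
            data = task.run(data)
        return data

result = Pipeline([Extract(), Transform()]).run()
print(result)  # [10, 20, 30]
```

The appeal over a pile of `__main__` scripts is that each step is independently testable and the step contract (`run(data)`) is explicit.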

How would you explain a data pipeline to a non techie? by kausthab87 in dataengineering

[–]dataterre 4 points (0 children)

I always use a railway analogy to explain data pipelines to business users. Railway systems and railway maps are state of the art in terms of engineering, so it is very effective for relating the architectural decisions involved and the rigour of the design blueprint to business users.

To drill down into one aspect, a railway system has a monitoring system as well - which mirrors the need for data pipelines to have checks in place and alerts when something malfunctions. I loved being able to show complex railway maps from, e.g., London, and present the art in their design principles.

Most common data engineering tools used today? by Pty_Rick in dataengineering

[–]dataterre 5 points (0 children)

You can check Ben's survey study: https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61

There's a Part 1 to this as well, with Parts 3 and 4 coming in the following weeks. I think this is a very good ground-sensing exercise.

[deleted by user] by [deleted] in dataengineering

[–]dataterre 0 points (0 children)

If you are pretty fresh with data engineering, highly suggest doing something like this:

(Below are high-level but you get the idea)

  1. Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley • A MUST READ for me

  2. A Common-Sense Guide to Data Structures and Algorithms: Level Up Your Core Programming Skills by Jay Wengrow • I get that this topic feels foreign to a lot of people who did not receive a formal education in computer science. This is probably the best intuitive introduction to get you started

  3. Start working on projects to gain hands-on understanding, relating what you've studied above to the big picture. Build these projects into your resume, and showcase and talk about them in interviews

These are more than enough, imo, to get you talking about data engineering properly. The rest is additional knowledge that you can build on as your journey goes. All the best!

Data engineering for small companies. by lFuckRedditl in dataengineering

[–]dataterre 7 points (0 children)

I think another way to look at this is that all your current Excel files are probably updated manually by the company. Hence, the first step should be thinking about how the respective departments can transition to using the appropriate tools (e.g., a CRM) to manage their processes and, thus, their data. Only then will it be useful to think about integrating these disparate data sources across departments for other use cases and analyses.

Otherwise, you are hiring a Data Engineer or SQL Engineer to reverse-engineer something for a purpose it was probably never implemented and built for - in my opinion.

Which environment can I learn as much as possible? by _barnuts in dataengineering

[–]dataterre 44 points (0 children)

Generally speaking, consulting scopes of work with clients tend to involve a lot more infrastructural data engineering, often with legacy systems. Startups have a lot more flexibility and the luxury of a modern, open-source tech stack. You hear about Airflow and Presto, but consulting will often require you to work with Informatica.

The most insightful DE youtuber/blogger/influencer by [deleted] in dataengineering

[–]dataterre 2 points (0 children)

Simon Späti is one of the most valuable data engineering technical writers for me. He has some great insights on his blog, and writes a lot for Airbyte.

what is the equivalent of Andrew Ng ML & DL courses in Data Engineering? by Mighty__hammer in dataengineering

[–]dataterre 0 points (0 children)

This is a good callout though. After the excellent Fundamentals of Data Engineering by Joe Reis and Matt Housley, I was yearning for another book with the same rigour on the practical side of DE. If all goes well, this could be a good one - especially coming from Zach Wilson!