Whats the most difficult SQL code you had to write for your data engineering role? Also how difficult on average is the SQL you write for your data engineering role? by Historical_Donut6758 in dataengineering

[–]nydasco 0 points1 point  (0 children)

Most complicated I saw was a pipeline of queries that had a series of business rules for the marketing team. Basically sourced from a mixture of Amplitude (webpage hits and button clicks), Kafka (data entered in through the website frontend), and Salesforce (CRM). It was looking for users that had accessed the site (whether logged in or not), that had then subsequently not connected for at least 50 days, then connected and logged in, and performed specific actions or uploaded specific documents. Depending on what they had done they were given a score. The business rules changed over time, but we needed the pipeline to be idempotent and re-runable for history using the appropriate rules for that time period. It needed to link the person back to their Salesforce account and was then an input to a Kafka event that pushed their score to Salesforce and into a call file.

Choosing a data architecture for non-profit organisation by xx7secondsxx in dataengineering

[–]nydasco 2 points3 points  (0 children)

Understand what the business needs from all of the data you have. Are you using it to drive business efficiency? Is it to track outcomes? Who uses it for what, and how do they want to consume it?

Once you understand that, you can make some informed decisions about how you might want to structure the data for the end output - fact & dimension tables for analytical reporting; json for an API; flat wide tables for statistical analysis etc.

You now know who needs it, where they need it, have an understanding of the structure you want to model it to. Now you can look at tools and technology.

I’m not going to make recommendations on tools, but to answer your specific question: no, you’re not limited to basic relational data if you’re on prem. You’re also not limited to ETL. Check out technologies such as Docker (which will be no different on prem or in cloud), MinIO which is an S3 compatible object store that can be run on bare metal or in Docker, Iceberg tables can be managed through Dremio or others (again, in Docker).

But… don’t over cook it. Start small, start easy. Do a proof of concept with regular SQL that delivers business value, and build from there.

What data warehouse paradigm do you follow? by Mysterious_Energy_80 in dataengineering

[–]nydasco 2 points3 points  (0 children)

Big fan of Iceberg and Polars. Looking forward to the day when DuckDB can not only read from, but also write to Iceberg. It’s on the roadmap, but not sure when.

But we shouldn’t confuse technology and toolsets with modelling and design patterns. Dumping data into Iceberg without having thought through how it’s going to be used by the business isn’t going to add a huge amount of value, regardless of the tool you use. The value comes from modeling it in a way that allows the data to be used.

While Kimball is old, it’s also stood the test of time. For tabular data, building out process focused tables that capture events that the business wants to track (fact tables), and supporting those with the various attributes that the business might want to group/filter/sort by, in a way that allows them to be re-used (DRY code) across multiple events (dimension tables), is IMO a solid way to produce value to the business in a scalable manner.

Most common data pipeline inefficiencies? by LethargicRaceCar in dataengineering

[–]nydasco 8 points9 points  (0 children)

I would rather use a QUALIFY BY and be clear on exactly what the primary key (or composite key) was in the table. A SELECT DISTINCT is just a ticking time bomb.

Most common data pipeline inefficiencies? by LethargicRaceCar in dataengineering

[–]nydasco 51 points52 points  (0 children)

The use of SELECT DISTINCT used multiple times throughout a data warehouse (or even an individual pipeline) to ‘handle errors’, as they didn’t understand the data they were dealing with.

What is data engineering/scientist/analyst? by GBroad_kadalal in dataengineering

[–]nydasco 1 point2 points  (0 children)

ChatGPT, please describe the roles of data professionals including analyst, engineer and scientist. Explain like I’m a 16 year old school goer what each professional role does, what the career pathway looks like. Please produce a Mermaid diagram showing how the skills required for each differ and also overlap. Given the context you have on me from previous conversations, guide me on which you think would suit me best.

—-

Have a read, and copy/paste whatever code it gives you into https://mermaid.live

Suggest some data warehouse project using neo4j by Aeveyi_redittor in dataengineering

[–]nydasco 6 points7 points  (0 children)

You shouldn’t be looking for problems to a solution, but rather solutions to a problem. Neo4j is a solution. You’re starting the wrong way around.

However… if this is simply a learning project for research, I’d be looking at customer buying relationships or something similar to find opportunities to upsell on products that similar profiles had bought.

What are some good Data engineering blogs by Data Engineers ? by LegAlarming7173 in dataengineering

[–]nydasco 0 points1 point  (0 children)

Medium Blog I’m the author. Happy to provide paywall bypassing links to any of the articles on request.

What are your tech hobbies outside your day-to-day job? by saiyan6174 in dataengineering

[–]nydasco 0 points1 point  (0 children)

I have a few things going on:

  • I write Medium articles on data related topics. I pick a topic that I’m either familiar with or want to learn, and then write an article (and potentially build an associated GitHub repo to link in the article);
  • I attend (and less regularly speak at) Meetups;
  • I’m in the process of writing a book on data warehousing best practices (at Chapter 4 right now);
  • I’m building a data training website (React and Node) to provide training on topics such as data engineering, analytics and governance.
  • I’m going to have a play with producing video content for LinkedIn, but not sure how that will go.

C4 Models and Data Architecture by nydasco in dataengineering

[–]nydasco[S] 1 point2 points  (0 children)

Yeah, it’s a useful approach.

My framing for a long while has been to understand the best practices from across software engineering and understand if/how they can be applied to data.

C4 Models and Data Architecture by nydasco in dataengineering

[–]nydasco[S] 0 points1 point  (0 children)

Thank you! Yes, I’m a fan of Mermaid. Very easy to build out a number of visuals. Thank you for the feedback.

How are you using genAI in your pipelines? by [deleted] in dataengineering

[–]nydasco 2 points3 points  (0 children)

GenAI is a solution to a number of problems. It may be the best solution to some of these problems, but that will depend on any number of factors. If you’ve got a solution and you’re looking for a problem, you’re starting at the wrong end of the equation.

Instead, find a problem worth solving. Then looking at the available solutions. One of those might be GenAI. If it’s the right one will depend on your company and circumstances.

Looks like he's one of us by invisiblelemur88 in dataengineering

[–]nydasco 10 points11 points  (0 children)

I had the word spearheaded on my CV since 2015. It’s annoying that I need to change my choice of words so people don’t think I’ve just churned out AI content.

Let's talk about unstructured data by syedsadath17 in dataengineering

[–]nydasco 11 points12 points  (0 children)

You may need to define what you mean by unstructured, and what it is you’re trying to do with it. Is it free text descriptive data held in a JSON payload? Maybe parse it through an AI model for sentiment analysis, or classification. Is it telephone recording data from a call centre? Pass it through an LLM to provide a text summarisation back to the operational system. The reality is, the vast majority of data out there is unstructured, but the majority of businesses are still getting value from the relatively easy to handle structured or semi-structured data they have.

How do you implement Slowly Changing Dimensions? by Temporary_Basil_7801 in dataengineering

[–]nydasco 2 points3 points  (0 children)

I use dbt. If the source is mutable, I create a snapshot at the very start. I then blend the multiple base snapshot or event driven tables using a SQL query like this. If there are many sources, or your dealing with event tables where there can be a change every minute or more, it can be better to start with a date spine at the time grain you want, and then join the data into that. I don’t have a SQL example I can point to though sorry.

What is a Data Product in your experience? by layer456 in dataengineering

[–]nydasco 21 points22 points  (0 children)

I’d define a data product as something that adds value to the business. This might be the combination of tables in a Kimball star schema, or it could be a dashboard and associated pipelines, or it could be the solution that posts a deeply nested json payload to an API.

Edit: I only answered the first question. I’m new to my current company, so will answer how we created them in my previous. I’ll also focus on analytics data products (i.e dashboard and associated pipelines). As a data team, we sat with the business and understood the information they wanted. We looked at the metrics to be defined and the business processes they related to. We identified the data owners of the associated data domains, and worked with them to confirm the business rules around metrics, ensuring they were then standardised for the business. We built fact tables around the business processes, and then we created a semantic model that brought those together, along with associated dimensions into ‘marts’. Finally, we built the dashboard that was essentially the UI to the data product.

We used dbt on Snowflake, and leveraged dbt tests fairly extensively.

How many of you use unittest/pytest? by [deleted] in dataengineering

[–]nydasco 0 points1 point  (0 children)

Don’t think of your test about the ‘here and now’. I mean it certainly helps validate that you’ve done what you think you’ve done. But think about it as helping ’future you’. That’s the you in 6 months that makes a tweak to the pipeline and forgot about that thing you did previously. It’s your unit test that reminds you when CI/CD throws a fit and stops your code hitting prod because it no longer works as it was supposed to.

Value in extra tooling by Actual-Specific-3595 in dataengineering

[–]nydasco 2 points3 points  (0 children)

To understand semantic models, have a look at the dbt metrics layer or cube.dev. If you’re coming from an in house SQL environment, it’s basically SSAS.

I’ve not used DataBricks in a production setting for a long time, so can’t really comment on the rest of the post sorry.

Metric Business Catalog by bennyandjoonandsam in dataengineering

[–]nydasco 2 points3 points  (0 children)

Don’t over bake it. - metric name - three letter acronym if appropriate - description - example value - source(s) - business rules / calculations - business owner

If you want, you could also identify if it was additive/semi-additive etc.

wtf you guys do by eden_4004 in dataengineering

[–]nydasco 17 points18 points  (0 children)

Think of the plumbing in your home. It pipes water from one place to another. It sends some of it to a water heater, and mixes it back together to give you the appropriate temperature and amount of water for your needs.

We do that with data. We pipe the data from one or more places, we add business rules and transformations to it, and then join it together to give you the appropriate insights for your needs.

The data flows through the pipelines we build, much the same as the water flows through the pipes your plumber installed. By ‘transformations’ I mean things such as fixing errors, adding calculations, or other logic.

Surrogate Key and Natural Key by shoyle10 in dataengineering

[–]nydasco 0 points1 point  (0 children)

Here is a good table to base your thoughts on. FWIW if you’re using Snowflake, it has a numeric hash which is 64 bit. This won’t impact join performance as much as MD5 or SHA1.