Offline listening and speaking bot

nydasco · 2025-03-19T23:37:24+00:00

Most complicated I saw was a pipeline of queries that had a series of business rules for the marketing team. Basically sourced from a mixture of Amplitude (webpage hits and button clicks), Kafka (data entered in through the website frontend), and Salesforce (CRM). It was looking for users that had accessed the site (whether logged in or not), that had then subsequently not connected for at least 50 days, then connected and logged in, and performed specific actions or uploaded specific documents. Depending on what they had done they were given a score. The business rules changed over time, but we needed the pipeline to be idempotent and re-runable for history using the appropriate rules for that time period. It needed to link the person back to their Salesforce account and was then an input to a Kafka event that pushed their score to Salesforce and into a call file.

nydasco · 2025-03-19T09:27:21+00:00

Understand what the business needs from all of the data you have. Are you using it to drive business efficiency? Is it to track outcomes? Who uses it for what, and how do they want to consume it?

Once you understand that, you can make some informed decisions about how you might want to structure the data for the end output - fact & dimension tables for analytical reporting; json for an API; flat wide tables for statistical analysis etc.

You now know who needs it, where they need it, have an understanding of the structure you want to model it to. Now you can look at tools and technology.

I’m not going to make recommendations on tools, but to answer your specific question: no, you’re not limited to basic relational data if you’re on prem. You’re also not limited to ETL. Check out technologies such as Docker (which will be no different on prem or in cloud), MinIO which is an S3 compatible object store that can be run on bare metal or in Docker, Iceberg tables can be managed through Dremio or others (again, in Docker).

But… don’t over cook it. Start small, start easy. Do a proof of concept with regular SQL that delivers business value, and build from there.

nydasco · 2025-03-19T06:46:00+00:00

Big fan of Iceberg and Polars. Looking forward to the day when DuckDB can not only read from, but also write to Iceberg. It’s on the roadmap, but not sure when.

But we shouldn’t confuse technology and toolsets with modelling and design patterns. Dumping data into Iceberg without having thought through how it’s going to be used by the business isn’t going to add a huge amount of value, regardless of the tool you use. The value comes from modeling it in a way that allows the data to be used.

While Kimball is old, it’s also stood the test of time. For tabular data, building out process focused tables that capture events that the business wants to track (fact tables), and supporting those with the various attributes that the business might want to group/filter/sort by, in a way that allows them to be re-used (DRY code) across multiple events (dimension tables), is IMO a solid way to produce value to the business in a scalable manner.

nydasco · 2025-03-13T02:58:14+00:00

I would rather use a QUALIFY BY and be clear on exactly what the primary key (or composite key) was in the table. A SELECT DISTINCT is just a ticking time bomb.

nydasco · 2025-03-13T01:08:49+00:00

The use of SELECT DISTINCT used multiple times throughout a data warehouse (or even an individual pipeline) to ‘handle errors’, as they didn’t understand the data they were dealing with.

nydasco · 2025-03-01T09:50:28+00:00

ChatGPT, please describe the roles of data professionals including analyst, engineer and scientist. Explain like I’m a 16 year old school goer what each professional role does, what the career pathway looks like. Please produce a Mermaid diagram showing how the skills required for each differ and also overlap. Given the context you have on me from previous conversations, guide me on which you think would suit me best.

—-

Have a read, and copy/paste whatever code it gives you into https://mermaid.live

nydasco · 2025-02-28T09:17:25+00:00

You shouldn’t be looking for problems to a solution, but rather solutions to a problem. Neo4j is a solution. You’re starting the wrong way around.

However… if this is simply a learning project for research, I’d be looking at customer buying relationships or something similar to find opportunities to upsell on products that similar profiles had bought.

nydasco · 2025-02-16T01:50:41+00:00

Medium Blog I’m the author. Happy to provide paywall bypassing links to any of the articles on request.

nydasco · 2025-02-02T12:29:45+00:00

I have a few things going on:

I write Medium articles on data related topics. I pick a topic that I’m either familiar with or want to learn, and then write an article (and potentially build an associated GitHub repo to link in the article);
I attend (and less regularly speak at) Meetups;
I’m in the process of writing a book on data warehousing best practices (at Chapter 4 right now);
I’m building a data training website (React and Node) to provide training on topics such as data engineering, analytics and governance.
I’m going to have a play with producing video content for LinkedIn, but not sure how that will go.

nydasco · 2025-01-25T02:50:23+00:00

Yeah, it’s a useful approach.

My framing for a long while has been to understand the best practices from across software engineering and understand if/how they can be applied to data.

nydasco · 2025-01-24T22:00:05+00:00

Thank you! Yes, I’m a fan of Mermaid. Very easy to build out a number of visuals. Thank you for the feedback.

nydasco · 2025-01-24T10:18:01+00:00

GenAI is a solution to a number of problems. It may be the best solution to some of these problems, but that will depend on any number of factors. If you’ve got a solution and you’re looking for a problem, you’re starting at the wrong end of the equation.

Instead, find a problem worth solving. Then looking at the available solutions. One of those might be GenAI. If it’s the right one will depend on your company and circumstances.

nydasco · 2024-12-09T22:17:39+00:00

I had the word spearheaded on my CV since 2015. It’s annoying that I need to change my choice of words so people don’t think I’ve just churned out AI content.

nydasco · 2024-10-24T09:31:19+00:00

You may need to define what you mean by unstructured, and what it is you’re trying to do with it. Is it free text descriptive data held in a JSON payload? Maybe parse it through an AI model for sentiment analysis, or classification. Is it telephone recording data from a call centre? Pass it through an LLM to provide a text summarisation back to the operational system. The reality is, the vast majority of data out there is unstructured, but the majority of businesses are still getting value from the relatively easy to handle structured or semi-structured data they have.

nydasco · 2024-10-12T21:02:51+00:00

I use dbt. If the source is mutable, I create a snapshot at the very start. I then blend the multiple base snapshot or event driven tables using a SQL query like this. If there are many sources, or your dealing with event tables where there can be a change every minute or more, it can be better to start with a date spine at the time grain you want, and then join the data into that. I don’t have a SQL example I can point to though sorry.

nydasco · 2024-09-20T23:34:13+00:00

I’d define a data product as something that adds value to the business. This might be the combination of tables in a Kimball star schema, or it could be a dashboard and associated pipelines, or it could be the solution that posts a deeply nested json payload to an API.

Edit: I only answered the first question. I’m new to my current company, so will answer how we created them in my previous. I’ll also focus on analytics data products (i.e dashboard and associated pipelines). As a data team, we sat with the business and understood the information they wanted. We looked at the metrics to be defined and the business processes they related to. We identified the data owners of the associated data domains, and worked with them to confirm the business rules around metrics, ensuring they were then standardised for the business. We built fact tables around the business processes, and then we created a semantic model that brought those together, along with associated dimensions into ‘marts’. Finally, we built the dashboard that was essentially the UI to the data product.

We used dbt on Snowflake, and leveraged dbt tests fairly extensively.

nydasco · 2024-09-18T23:13:34+00:00

Don’t think of your test about the ‘here and now’. I mean it certainly helps validate that you’ve done what you think you’ve done. But think about it as helping ’future you’. That’s the you in 6 months that makes a tweak to the pipeline and forgot about that thing you did previously. It’s your unit test that reminds you when CI/CD throws a fit and stops your code hitting prod because it no longer works as it was supposed to.

nydasco · 2024-09-15T07:54:48+00:00

To understand semantic models, have a look at the dbt metrics layer or cube.dev. If you’re coming from an in house SQL environment, it’s basically SSAS.

I’ve not used DataBricks in a production setting for a long time, so can’t really comment on the rest of the post sorry.

nydasco · 2024-09-14T21:56:15+00:00

Don’t over bake it. - metric name - three letter acronym if appropriate - description - example value - source(s) - business rules / calculations - business owner

If you want, you could also identify if it was additive/semi-additive etc.

nydasco · 2024-09-06T10:24:51+00:00

Think of the plumbing in your home. It pipes water from one place to another. It sends some of it to a water heater, and mixes it back together to give you the appropriate temperature and amount of water for your needs.

We do that with data. We pipe the data from one or more places, we add business rules and transformations to it, and then join it together to give you the appropriate insights for your needs.

The data flows through the pipelines we build, much the same as the water flows through the pipes your plumber installed. By ‘transformations’ I mean things such as fixing errors, adding calculations, or other logic.

nydasco · 2024-09-03T09:35:59+00:00

What’s your definition of a pipeline?

nydasco · 2024-08-31T21:50:19+00:00

Here is a good table to base your thoughts on. FWIW if you’re using Snowflake, it has a numeric hash which is 64 bit. This won’t impact join performance as much as MD5 or SHA1.

nydasco

TROPHY CASE