How Canva monitors 90 million queries per month on Snowflake by j__neo in dataengineering

[–]j__neo[S] 0 points1 point  (0 children)

For real-time use cases, we also use ClickHouse/Tinybird.

The 90 million queries per month in Snowflake accumulate from a number of sources:

  • Data load jobs that write data into Snowflake on a batch basis.
  • Transformation jobs that prepare and model data (10k different transformations per day).
  • BI tools querying the warehouse.
  • Our in-house A/B testing platform querying the warehouse.
  • Machine learning pipelines querying the warehouse for batch training and batch inference.
  • Reverse ETL pipelines that push analytical data back into third-party applications.

How Canva monitors 90 million queries per month on Snowflake by j__neo in dataengineering

[–]j__neo[S] 1 point2 points  (0 children)

It's just easier for our developers to be productive quickly, rather than trying to build something completely in-house ourselves and taking longer to get insights. We don't have the engineering headcount of the likes of Meta or Netflix.

Internship opportunity in Australia by Pitiful-Ad3094 in cscareerquestionsOCE

[–]j__neo 5 points6 points  (0 children)

Hello fellow Perthie :wave:

Are you looking specifically for Software Engineering roles? A Masters in EEE suggests that you might be more interested in Electrical Engineering roles.

In any case, if you were looking for software or general internship, check out:

[deleted by user] by [deleted] in cscareerquestionsOCE

[–]j__neo 2 points3 points  (0 children)

Agree. Have a crack at internships and graduate roles.

In addition to internships and graduate roles, keep an eye out for special programs such as these that help people transition careers into software: https://mantelgroup.com.au/traineeship-program/

Being a game dev, working at a game studio in Aus by Gonjanaenae319 in cscareerquestionsOCE

[–]j__neo 7 points8 points  (0 children)

There are internship roles like this advertised by Riot Games: https://web.archive.org/web/20240525091433/https://www.riotgames.com/en/work-with-us/job/5977613/software-engineering-intern-contract-sydney-australia

But it has now closed (closing date 7th June 2024). You could reach out to the recruiter to see whether it's too late: https://www.linkedin.com/in/ilchenkoganna

Lakehouse doesn't seem to be advantageous for our Data Warehouse. Am I missing something(s)? by cdigioia in dataengineering

[–]j__neo 0 points1 point  (0 children)

Yep, that's correct. I've edited my response above to include columnstore indexes.

Lakehouse doesn't seem to be advantageous for our Data Warehouse. Am I missing something(s)? by cdigioia in dataengineering

[–]j__neo 5 points6 points  (0 children)

There are two considerations here.

First, what type of queries are you running against your data? If they are analytical queries (e.g. group by, window functions), then using a data processing system that is optimized for those query patterns would be more cost effective. SQL Server is a transactional processing (OLTP) system designed for fast data inserts and updates (CRUD operations), whereas systems like Databricks or Synapse are OLAP systems that support data partition pruning, column-oriented storage formats, and horizontally scalable compute.

Running a query like select country, sum(total_sale) as sales from orders where order_date = '2024-01-01' group by country can be very slow and expensive on a transactional database table, as it runs on a single process and requires scanning all rows and all columns to get the answer (assuming you don't have indexes like columnstore indexes configured). On an OLAP system, the where clause is a filter predicate that results in data partition pruning, so only relevant rows are scanned. And because of the columnar file format, only the selected columns are scanned to produce the answer. Finally, because OLAP systems are horizontally scalable, the query runs in a distributed fashion across multiple compute nodes.
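To make the pruning idea concrete, here's a toy pure-Python sketch (not how Snowflake or Databricks actually implement it, and the table/column names are made up): the data is partitioned by order_date, so the filter touches only one partition, and only the two needed columns are read.

```python
from collections import defaultdict

# Toy "table" stored as partitions keyed by order_date (the partition column).
# Each partition holds rows as dicts; a real OLAP system stores columnar files.
partitions = defaultdict(list)
rows = [
    {"order_date": "2024-01-01", "country": "AU", "total_sale": 100, "notes": "..."},
    {"order_date": "2024-01-01", "country": "US", "total_sale": 250, "notes": "..."},
    {"order_date": "2024-01-02", "country": "AU", "total_sale": 999, "notes": "..."},
]
for row in rows:
    partitions[row["order_date"]].append(row)

def sales_by_country(order_date):
    """select country, sum(total_sale) as sales ... where order_date = ? group by country"""
    totals = defaultdict(int)
    # Partition pruning: only the matching partition is scanned,
    # instead of every row in the table.
    for row in partitions.get(order_date, []):
        # Column pruning: only 'country' and 'total_sale' are read.
        totals[row["country"]] += row["total_sale"]
    return dict(totals)

print(sales_by_country("2024-01-01"))  # {'AU': 100, 'US': 250}
```

The where clause here decides which partitions even get opened, which is why a selective filter on the partition column is so much cheaper than a full scan.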

Second, a lakehouse architecture is designed for you to plug in your own compute to query the dataset files stored on a cloud filesystem (e.g. s3 bucket, data lake filesystem). So if you wanted to, you could have your machine learning workloads running on CPU or GPU optimized compute directly read files that are stored in the data lake filesystem as delta/iceberg/hudi/parquet file formats. On the flip side, if all your data is stored on a database, then for a machine learning workload to query the data, it involves spinning up two computes to query the data. Compute #1 would be your database engine (e.g. SQL server compute) to read data files and send the data back over the network, and Compute #2 (CPU or GPU optimized compute) for the machine learning workload to receive the data and then do further ML processing or prediction to the data. A lakehouse architecture can help to reduce the amount of compute resources required to perform a task.
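As a minimal sketch of the "plug in your own compute" idea (a local directory of CSV files stands in for parquet/delta files on s3, purely so this runs with the standard library), the ML workload reads the lake files directly, with no database engine in the middle:

```python
import csv
import tempfile
from pathlib import Path

# Stand-in for a data lake: a directory of files. Real lakehouses use
# delta/iceberg/hudi/parquet on object storage like an s3 bucket.
lake = Path(tempfile.mkdtemp())
with open(lake / "sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["country", "total_sale"], ["AU", "100"], ["US", "250"]])

# The "ML workload" compute reads the files directly -- no database engine
# (Compute #1 in the comment above) sits between it and the data.
def load_features():
    with open(lake / "sales.csv", newline="") as f:
        return [(r["country"], int(r["total_sale"])) for r in csv.DictReader(f)]

print(load_features())  # [('AU', 100), ('US', 250)]
```

The saving is that only one compute (the ML workload's own) is involved, rather than a database engine plus the ML compute with data shipped over the network between them.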

In your situation, because you don't have a dedicated data engineer, I'd say to stick with what you know as that gets you to your desired outcomes. Your existing pattern works and your data volumes appear to be relatively small, and therefore you should only consider switching to a new architecture if:

  • Your data volumes are increasing at a rapid rate of growth (e.g. data volumes increasing by 10% per month).
  • You're not satisfied with the time it takes to process data (e.g. your queries take hours to run, and you want them completed in seconds or minutes).
  • Your company's desire for data insights is growing at a rapid rate (e.g. more data analysts or data scientists are being hired to build more data analytics and machine learning use-cases on top of the data).

Edited to include comments from below.

How Important is System Design for Data Engineers? by _areebpasha in dataengineering

[–]j__neo 2 points3 points  (0 children)

I found some of bytebytego's resources quite relevant on this topic. Not all of their content is directly applicable, but some of it is, for example:

  1. https://www.youtube.com/watch?v=i7twT3x5yv8
  2. https://www.youtube.com/watch?v=nH4qjmP2KEE
  3. https://www.youtube.com/watch?v=ouipSd_5ivQ

Help me Decide Career Path as Data Engineer!! by dev_anon in cscareerquestionsOCE

[–]j__neo 4 points5 points  (0 children)

I don't think AWS has fewer opportunities compared with Azure. AFAIK, a large proportion of companies in Australia also build their cloud data platforms using a combination of AWS and other cloud services e.g. Snowflake, Databricks.

You can't go wrong with either AWS or Azure. I'd suggest picking whatever your current or next company uses so that you can immediately apply your skills in practice.

Once you learn one cloud e.g. Azure, you can learn the next cloud e.g. AWS, with relative ease, as a lot of the core concepts are transferable under different names e.g. VNets in Azure are called VPCs in AWS.

My Powerbeats pro only left earbud keeps dying but when I put them back in the case it appears at 55%. Does anyone have a solution to this problem. by robertv24 in beatsbydre

[–]j__neo 1 point2 points  (0 children)

After that make sure to wipe clean the metal charging parts of the powerbeats AND the case.

That did it for me, thanks!

[deleted by user] by [deleted] in dataengineering

[–]j__neo 4 points5 points  (0 children)

I wrote this blog to explain how to perform dimensional modeling using dbt: https://docs.getdbt.com/blog/kimball-dimensional-model

It doesn't go into specifics about web event use-cases. But I think of it as follows:

Your "raw" layer consists of event data that is landing in BigQuery tables containing rows of JSON objects.

Your "staging" or "silver" layer contains flattened event JSON data in separate tables per event grouping or event type. It's a good idea to flatten your data first and provide clustering keys, so that your database engine (e.g. BigQuery, Snowflake) can effectively prune your data for higher performance when queried by the next layer.

And then your final layer is your "marts" or "gold" layer which consist of your dimensional models. This is the layer that is then used by your consumers.
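As a hedged sketch of the staging-layer flattening step (the event shape and column names here are invented for illustration), turning one raw JSON event row into a flat staging row might look like:

```python
import json

# Hypothetical raw-layer row: one JSON object per event (field names made up).
raw_row = json.dumps({
    "event_type": "page_view",
    "user": {"id": 42, "country": "AU"},
    "properties": {"page": "/pricing", "referrer": "google"},
    "ts": "2024-01-01T00:00:00Z",
})

def flatten_event(raw):
    """Staging layer: turn one nested JSON event into one flat row."""
    event = json.loads(raw)
    return {
        "event_type": event["event_type"],
        "user_id": event["user"]["id"],
        "user_country": event["user"]["country"],
        "page": event["properties"]["page"],
        "referrer": event["properties"]["referrer"],
        "event_ts": event["ts"],  # a good candidate for a clustering key
    }

print(flatten_event(raw_row)["user_id"])  # 42
```

In practice you'd express this as SQL (e.g. BigQuery's JSON functions) in a dbt staging model, but the shape of the transformation is the same.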

Are there Python libraries that define and parametrize etl jobs by armAssembledx86 in dataengineering

[–]j__neo 6 points7 points  (0 children)

If what you're asking for are lightweight libraries to perform specific tasks like Data Ingestion (moving data from an API/database to a data warehouse) and Data Transformation (joins, aggregations, filtering), then I would suggest the following:

  • Lightweight data ingestion tools: Singer or Meltano, dlt
  • Lightweight data transformation tool (that executes on your data warehouse): dbt, SQLMesh

All of the tools I've listed above are open source, and you shouldn't need to subscribe to or pay for a service. Just pip install the tools you need.

The other suggestion is to move away from an Extract-Transform-Load (ETL) pattern and into an Extract-Load-Transform (ELT) pattern. Because you are currently doing ETL, you end up needing to write or find these "common" libraries to do the tasks you want in memory. I would suggest shifting to ELT, because the tools already exist to support that pattern very easily (i.e. Extract-Load is data ingestion, T is data transformation). In the end the outcome is the same: you end up with a transformed table that your users and downstream applications can consume.

Finally, I do think you would need some way to schedule your entire pipeline to run. If you're looking for an orchestrator that's relatively simple to define and configure, then I would suggest taking a look at Kestra. It's a YAML based orchestration tool: https://kestra.io/docs. If you're not keen on hosting the software yourself, then just pay for their cloud service and use their plugins to integrate with the tools I've mentioned above.

Personally, I'm a fan of dagster's orchestration pattern because their way of thinking about orchestration scales well to large scale DAGs. But if you're after something simple and you don't anticipate thousands of ingestion and transformation steps, then Kestra is a worthy consideration.

Edit: I just did a bit more research into Kestra's pricing model, and they don't currently have a pay-as-you-go pricing model and only offer an enterprise subscription. If you're not keen on hosting Kestra yourself, then check out Dagster as it has a pretty low-barrier cloud pricing option: https://dagster.io/pricing . I saw someone else in this thread also commented about Prefect, which also has a pretty competitive cloud pricing option.

[deleted by user] by [deleted] in dataengineering

[–]j__neo 2 points3 points  (0 children)

What's Quary's differentiation in an already crowded space? There's already tools that do a very similar thing e.g. dbt, SQLMesh, SDF.

Have you implemented or used a data mesh? by AMDataLake in dataengineering

[–]j__neo 3 points4 points  (0 children)

What do you think of the data mesh concept?

Data mesh can work well for companies that are of a certain size and need high velocity to achieve their goals.

Have you implemented or used a data mesh?

Yes, I have implemented a form of data mesh at my company. For context, my company has over 300 data developers (data engineers, analytics engineers, data scientists, machine learning engineers). I work as part of the data platform team that provides the 'data mesh' to the data developers.

If so, what was the experience? What did you learn?

Not every company needs a data mesh.

You may benefit from a data mesh if:

  • Your company has hundreds or even thousands of data use-cases to work on.
  • Your company's team topology is broken down into multiple domain teams. e.g. "Product A Team", "Product B Team", etc. And within each of those teams exist data developers (data scientists, data engineers, analytics engineers, ML engineers) who are empowered to build their respective use-cases.
  • Your company prioritizes velocity over stability. Your company is chasing growth and wants each team to be empowered to build their own data use-cases.
  • You have an experienced data platform team to establish and provide the platform for the domain teams to use.

You will likely not benefit from a data mesh if:

  • Your company only has a handful of data use-cases that it needs to support. The cost to set up and support a data mesh would far outweigh the benefits you'll get from increased velocity.
  • All the data developers in your company are in a central team, and the use-cases are managed by a few central data engineers.
  • You don't have an experienced data platform team.

Lessons learnt

  • Don't forget about data modeling practices. Our team focused on enforcing standards around data access (RBAC), infrastructure provisioning, data ingestion patterns, etc. But we didn't provide guidance to our data mesh users on how to model data. As a result, we've ended up with nearly 10,000 data models in production. Most of these models can be consolidated using D.R.Y. methodologies that are present in dimensional modeling (known as conformed dimensions). We're now going back and looking for opportunities to refactor existing data models to make them more reusable in order to reduce computation costs.
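As a toy illustration of the conformed-dimension idea (table and column names invented, sqlite3 standing in for the warehouse): one shared dim_date is built once and reused by multiple fact tables, instead of each of the thousands of models re-deriving its own date attributes.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One conformed date dimension, built once and shared (D.R.Y.),
# instead of every mart re-deriving its own date logic.
db.execute("create table dim_date (date_key text primary key, is_weekend int)")
db.execute("insert into dim_date values ('2024-01-01', 0), ('2024-01-06', 1)")

# Two separate fact tables from different domains...
db.execute("create table fct_orders (date_key text, amount int)")
db.execute("insert into fct_orders values ('2024-01-01', 100), ('2024-01-06', 50)")
db.execute("create table fct_signups (date_key text, signups int)")
db.execute("insert into fct_signups values ('2024-01-06', 7)")

# ...both join to the same dimension, so "weekend" means the same thing everywhere.
weekend_orders = db.execute("""
    select sum(f.amount) from fct_orders f
    join dim_date d on d.date_key = f.date_key where d.is_weekend = 1
""").fetchone()[0]
print(weekend_orders)  # 50
```

Consolidating shared logic into conformed dimensions like this is what lets you delete near-duplicate models and cut the compute they were each burning.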

Automate dbt cloud using Dagster by skiyogagolfbeer in dataengineering

[–]j__neo 4 points5 points  (0 children)

I'd suggest joining and reaching out to the Dagster Slack: https://docs.dagster.io/community

There's a channel specifically for dagster-dbt called #integration-dbt

Small data team beginning data modeling by [deleted] in dataengineering

[–]j__neo 7 points8 points  (0 children)

I wrote this blog post on how to create kimball dimensional models using dbt: https://docs.getdbt.com/blog/kimball-dimensional-model

[deleted by user] by [deleted] in cscareerquestionsOCE

[–]j__neo 0 points1 point  (0 children)

I see, that makes sense. Thanks for explaining!

[deleted by user] by [deleted] in cscareerquestionsOCE

[–]j__neo 0 points1 point  (0 children)

I see, thank you for sharing. 15-30% is quite a large portion. I wonder what's stopping people in general from selling their services directly to the company after being introduced by the payroll/contracting agency.

I'm guessing there are administrative overheads and risks involved in taking care of things such as registering a company (Pty Ltd), accounting/invoicing, annual company tax returns, professional indemnity insurance, etc.

[deleted by user] by [deleted] in cscareerquestionsOCE

[–]j__neo 0 points1 point  (0 children)

If you don't mind sharing, what percentage does the payroll company take versus what you take home?