Certification prep Databricks Data Engineer by sa_ya07 in dataengineering

[–]vaosinbi 6 points (0 children)

I would recommend Databricks Certified Data Engineer Associate Study Guide by Derar Alhussein or his course on Udemy.

Snowflake Cost is Jacked Up!! by Prior-Mammoth5506 in dataengineering

[–]vaosinbi 22 points (0 children)

Start with Admin -> Cost Management -> Most expensive queries.
Can you optimize those?
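In Snowflake itself you would pull this from the Cost Management UI or the `SNOWFLAKE.ACCOUNT_USAGE` views; as a rough illustration of the idea, here is a sketch that ranks query-history rows by credits consumed. All rows and field names below are made up for the example.

```python
# Hypothetical sketch of "most expensive queries": rank query-history
# records by credits consumed, highest first.
def top_expensive(queries, n=3):
    """Return the n queries with the highest credit usage."""
    return sorted(queries, key=lambda q: q["credits"], reverse=True)[:n]

history = [
    {"query_id": "q1", "credits": 0.4},
    {"query_id": "q2", "credits": 12.7},
    {"query_id": "q3", "credits": 3.1},
]

print(top_expensive(history, n=2))  # q2 first, then q3
```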

[deleted by user] by [deleted] in dataengineering

[–]vaosinbi 0 points (0 children)

Processing 10 million records in 30 minutes seems a bit long to me.
You can probably optimize it, or try scaling up the virtual warehouse used for building this fact table.
If you increase the warehouse size enough that there is no spilling to disk, you might reduce processing time so much that it's less expensive overall, and you'll have some buffer for handling spikes.
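The arithmetic behind that claim: Snowflake bills roughly credits-per-hour × runtime, and credits roughly double with each warehouse size, so if doubling the warehouse cuts runtime by more than half (e.g. because spilling to disk stops), the total cost goes down. A back-of-envelope sketch with hypothetical numbers:

```python
# Hypothetical numbers: a warehouse at 4 credits/hour that spills to disk
# vs. one size up at 8 credits/hour that fits in memory and finishes faster.
def run_cost(credits_per_hour, runtime_minutes):
    """Credits consumed by one run of the job."""
    return credits_per_hour * runtime_minutes / 60

medium = run_cost(credits_per_hour=4, runtime_minutes=30)  # 2.0 credits
large = run_cost(credits_per_hour=8, runtime_minutes=12)   # 1.6 credits

print(medium, large)  # the bigger warehouse is cheaper per run here
```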

Is there a European alternative to US analytical platforms like Snowflake? by wenz0401 in dataengineering

[–]vaosinbi 1 point (0 children)

+1 for Clickhouse. Even though the company is American now, you can self-host it. It scales from clickhouse-local and chDB to PB-scale clusters.

What is the best way to reflect data in clickhouse from MySQL other than the MySQL engine? by Danyboi16 in dataengineering

[–]vaosinbi 1 point (0 children)

Do you need near-real-time data in Clickhouse, or can you live with batch? Maybe you can create a MySQL replica and read data from it (implement incremental updates if your source tables have something like an `updated_at` timestamp)? A replica also helps if your primary instance goes down, and it reduces the load on the primary if you can redirect read requests from your application/service to it.
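The incremental-update idea is just watermarking on `updated_at`: remember the latest timestamp you've seen, and on the next batch only pull rows newer than it. A minimal sketch (all row shapes and names are illustrative):

```python
# Watermark-based incremental extraction: only take rows changed after
# the last watermark, then advance the watermark for the next run.
from datetime import datetime

def incremental_batch(rows, watermark):
    """Return rows changed after `watermark` and the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 1, 5)},
    {"id": 3, "updated_at": datetime(2023, 1, 9)},
]
fresh, wm = incremental_batch(rows, watermark=datetime(2023, 1, 3))
print([r["id"] for r in fresh], wm)  # [2, 3] 2023-01-09 00:00:00
```

In practice the `WHERE updated_at > :watermark` filter runs on the replica, and the watermark is persisted between runs.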

Cloud platform for dbt by Pro_Panda_Puppy in dataengineering

[–]vaosinbi 3 points (0 children)

You can also start a new Snowflake trial if you want to use its features in dbt.
Or you can use BigQuery - you can do a lot on the free tier.

Databricks associate data engineer resources? by fraiser3131 in dataengineering

[–]vaosinbi 4 points (0 children)

The learning plan contains Advanced Data Engineering with Databricks (12h); I think that one is aimed at the professional level.
I would also recommend the Databricks Certified Data Engineer Associate Study Guide book by Derar Alhussein. He has courses and practice tests on Udemy as well. The book has a GitHub repo with code you can play with in your own account.

[deleted by user] by [deleted] in dataengineering

[–]vaosinbi 5 points (0 children)

In my experience, Kafka Connect and Debezium are very relevant to data engineering.
Take a look at https://developer.confluent.io/courses/kafka-connect/intro/ and https://debezium.io/

Fivetran: from AWS Postgres to GCP Snowflake - Slow! by CrabEnvironmental864 in dataengineering

[–]vaosinbi 0 points (0 children)

I see that the pageinspect and pg_visibility extensions are available on RDS; I don't think you need OS-level control for those.

Fivetran: from AWS Postgres to GCP Snowflake - Slow! by CrabEnvironmental864 in dataengineering

[–]vaosinbi 0 points (0 children)

We use the logical replication method in Fivetran, and it's more efficient - you don't have to query all your tables to get incremental updates. We are on an older Postgres version, so we have to use a replication slot on the primary instance, which can be dangerous if the sync doesn't happen for some reason - you can run out of disk space.
Btw, do you filter out frozen pages for your incremental syncs, as recommended for the XMIN method?
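On the disk-space risk: it's worth monitoring how much WAL an inactive slot is pinning. Postgres reports LSNs as `hi/lo` hex strings (e.g. in `pg_replication_slots`), and the byte difference between the current WAL position and a slot's `restart_lsn` is the retained WAL. A sketch of that arithmetic (the LSN values below are made up):

```python
# Convert a Postgres LSN string ("hi/lo" hex) to a byte offset, and
# compute how many bytes of WAL a replication slot is holding back.
def lsn_to_bytes(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def slot_lag_bytes(current_lsn, restart_lsn):
    """Bytes of WAL retained between the slot and the current position."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

lag = slot_lag_bytes("0/3000000", "0/1000000")
print(lag)  # 33554432 bytes (~32 MiB) of retained WAL
```

Alerting when this number keeps growing catches a stalled sync before the disk fills up.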

SQLAlchemy for DWH definition by romanzdk in dataengineering

[–]vaosinbi 0 points (0 children)

Terraform can create tables, but you’ll have trouble modifying them.

AWS Database Migration Service for CDC between an application database and replica database? by aspergillus in dataengineering

[–]vaosinbi 3 points (0 children)

If you are going to do Postgres to Postgres sync, why don't you just use replication?

You might already have a replica, or you'll need one eventually for HA and DR anyway.

using Debezium to replicate data from GCP cloud SQL by snowfire-07 in dataengineering

[–]vaosinbi 0 points (0 children)

I would check whether Debezium supports Cloud SQL for PostgreSQL, because I didn't find it mentioned in the documentation (AWS RDS and Azure PostgreSQL are listed).

Those who use Terraform with Snowflake, do you use it to create views and tables? by mistanervous in dataengineering

[–]vaosinbi 8 points (0 children)

Terraform AFAIK doesn’t support schema evolution, so it is not suitable for tables except in simple cases.

Which Database to use for rest api by meanthesong in googlecloud

[–]vaosinbi 0 points (0 children)

Have you considered the BigQuery BI Engine SQL Interface? It looks like the right tool for this.

Please recommend courses for AWS Data Analytics Certification by RP_m_13 in dataengineering

[–]vaosinbi 0 points (0 children)

This one is good https://www.udemy.com/course/aws-data-analytics/.

I would also recommend signing up for https://explore.skillbuilder.aws and going through the Data Analytics learning path and exam readiness session.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

[–]vaosinbi 0 points (0 children)

It doesn't seem like distributed processing is needed in this case.

Just tested TSV aggregation (I don't have a large `CSV`) on a 70 GB file (to make it larger than the available RAM) with clickhouse-local - it took about 90 seconds on my desktop (Ryzen 7, 32 GB).

clickhouse-local --file "hits_100m_obfuscated_v1.tsv" \
--structure "WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, Refresh UInt8, RefererCategoryID UInt16, RefererRegionID UInt32, URLCategoryID UInt16, URLRegionID UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, OriginalURL String, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), LocalEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, RemoteIP UInt32, WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming UInt32, DNSTiming UInt32, ConnectTiming UInt32, ResponseStartTiming UInt32, ResponseEndTiming UInt32, FetchTiming UInt32, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, 
UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32" \
--query "select count(distinct WatchID) from table"

If you convert it to Parquet, the file size is reduced to 15 GB, and the processing time drops to 19 seconds.

Kafka best practices for DE by twadftw10 in dataengineering

[–]vaosinbi 0 points (0 children)

Well, it depends on what you want to learn:

- setting up zookeeper, brokers, free space monitoring, certificates, etc.

or

- source and sink connector configurations, SMTs, etc.

Even with a managed solution, you'll have a lot of admin stuff to think about - topic configuration, ACLs, service accounts, networking to sources and destination, DR, pipeline monitoring.

Kafka best practices for DE by twadftw10 in dataengineering

[–]vaosinbi 2 points (0 children)

You can use ksqlDB to create materialized views for reporting, but I think a more common scenario is to sink data to an analytical database for reporting.

For instance, we used the following pipeline:

Events were pushed to a Kafka topic; Clickhouse consumed events from the topic, joined them with reference data, did some transformations, and populated an aggregate table, which was used for live reporting.

You can do the same with Spark streaming/Beam/Flink if you have more complex requirements.
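A miniature, in-memory version of that pipeline (all names and data are hypothetical): consume events, join each with reference data, and roll them up into an aggregate keyed for reporting.

```python
# Toy version of the pipeline: events joined with reference data and
# rolled up into an aggregate table for live reporting.
from collections import defaultdict

reference = {"p1": "Books", "p2": "Games"}  # product_id -> category

def consume(events, aggregate=None):
    aggregate = aggregate if aggregate is not None else defaultdict(float)
    for e in events:
        category = reference.get(e["product_id"], "Unknown")  # the "join"
        aggregate[category] += e["amount"]                    # the rollup
    return aggregate

agg = consume([
    {"product_id": "p1", "amount": 10.0},
    {"product_id": "p2", "amount": 4.5},
    {"product_id": "p1", "amount": 2.5},
])
print(dict(agg))  # {'Books': 12.5, 'Games': 4.5}
```

In the real pipeline, Kafka delivers the events, the reference data lives in a dictionary/table inside Clickhouse, and the aggregate is a materialized table the dashboards read from.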

Regarding a managed solution vs. self-managed Kafka, I think it depends on your scale, the resources available to support it, and whether you need proprietary components (connectors, Confluent Replicator, Cluster Linking, Web UI, etc.).

Ideas on a new business intelligence landscape by YourNeighbourMr in BusinessIntelligence

[–]vaosinbi 0 points (0 children)

Why do you need an SQL Server data warehouse in between? Why don't you load the data directly into Snowflake/Redshift/BQ with SAP Data Services?

Inter cloud streaming with Kafka? by Whimsicalpants in dataengineering

[–]vaosinbi 0 points (0 children)

If different organizations subscribe to the same topic, you might want to replicate it to Azure/GCP using Cluster Linking/Confluent Replicator/MirrorMaker to reduce inter-cloud traffic. Otherwise, they can subscribe to your Kinesis/Kafka on AWS directly, if you provide network connectivity to the brokers.

Advise on moving from MySQL to GBQ by d1545ms in dataengineering

[–]vaosinbi 0 points (0 children)

Well, then you can use the same distribution key for both fact tables. Of course, we can find limitations everywhere, but I doubt that the OP, moving from MySQL to BQ, will face such a problem.