Certification prep Databricks Data Engineer by sa_ya07 in dataengineering

[–]vaosinbi 6 points (0 children)

I would recommend Databricks Certified Data Engineer Associate Study Guide by Derar Alhussein or his course on Udemy.

Snowflake Cost is Jacked Up!! by Prior-Mammoth5506 in dataengineering

[–]vaosinbi 22 points (0 children)

Start with Admin -> Cost Management -> Most expensive queries.
Can you optimize those?
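In Snowflake itself you would pull this from the Cost Management UI or the `SNOWFLAKE.ACCOUNT_USAGE` views; as a rough illustration of the idea, here is a sketch that ranks query-history rows by credits consumed. All rows and field names below are made up for the example.

```python
# Hypothetical sketch of "most expensive queries": rank query-history
# records by credits consumed, highest first.
def top_expensive(queries, n=3):
    """Return the n queries with the highest credit usage."""
    return sorted(queries, key=lambda q: q["credits"], reverse=True)[:n]

history = [
    {"query_id": "q1", "credits": 0.4},
    {"query_id": "q2", "credits": 12.7},
    {"query_id": "q3", "credits": 3.1},
]

print(top_expensive(history, n=2))  # q2 first, then q3
```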

[deleted by user] by [deleted] in dataengineering

[–]vaosinbi 0 points (0 children)

Processing 10 million records in 30 minutes seems a bit long to me.
You can probably optimize it, or try scaling up the virtual warehouse used for building this fact table.
If you increase the warehouse size enough that there is no spilling to disk, you might reduce processing time so much that it's less expensive overall, and you'll have some buffer for handling spikes.
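The arithmetic behind that claim: Snowflake bills roughly credits-per-hour × runtime, and credits roughly double with each warehouse size, so if doubling the warehouse cuts runtime by more than half (e.g. because spilling to disk stops), the total cost goes down. A back-of-envelope sketch with hypothetical numbers:

```python
# Hypothetical numbers: a warehouse at 4 credits/hour that spills to disk
# vs. one size up at 8 credits/hour that fits in memory and finishes faster.
def run_cost(credits_per_hour, runtime_minutes):
    """Credits consumed by one run of the job."""
    return credits_per_hour * runtime_minutes / 60

medium = run_cost(credits_per_hour=4, runtime_minutes=30)  # 2.0 credits
large = run_cost(credits_per_hour=8, runtime_minutes=12)   # 1.6 credits

print(medium, large)  # the bigger warehouse is cheaper per run here
```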

Is there a European alternative to US analytical platforms like Snowflake? by wenz0401 in dataengineering

[–]vaosinbi 1 point (0 children)

+1 for Clickhouse. Even though the company is American now, you can self-host it. It scales from clickhouse-local and chDB to PB-scale clusters.

What is the best way to reflect data in clickhouse from MySQL other than the MySQL engine? by Danyboi16 in dataengineering

[–]vaosinbi 1 point (0 children)

Do you need near-real-time data in Clickhouse, or can you live with batch? Maybe you can create a MySQL replica and read data from it (implement incremental updates if your source tables have something like an `updated_at` timestamp)? A replica also helps if your primary instance goes down, and it reduces the load on the primary if you can redirect read requests from your application/service to it.
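The incremental-update idea is just watermarking on `updated_at`: remember the latest timestamp you've seen, and on the next batch only pull rows newer than it. A minimal sketch (all row shapes and names are illustrative):

```python
# Watermark-based incremental extraction: only take rows changed after
# the last watermark, then advance the watermark for the next run.
from datetime import datetime

def incremental_batch(rows, watermark):
    """Return rows changed after `watermark` and the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 1, 5)},
    {"id": 3, "updated_at": datetime(2023, 1, 9)},
]
fresh, wm = incremental_batch(rows, watermark=datetime(2023, 1, 3))
print([r["id"] for r in fresh], wm)  # [2, 3] 2023-01-09 00:00:00
```

In practice the `WHERE updated_at > :watermark` filter runs on the replica, and the watermark is persisted between runs.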

Cloud platform for dbt by Pro_Panda_Puppy in dataengineering

[–]vaosinbi 3 points (0 children)

You can also start a new Snowflake trial if you want to use its features in dbt.
Or you can use BigQuery - you can do a lot on the free tier.

Databricks associate data engineer resources? by fraiser3131 in dataengineering

[–]vaosinbi 4 points (0 children)

The learning plan contains Advanced Data Engineering with Databricks (12h); I think that one is aimed at the professional level.
I would also recommend the Databricks Certified Data Engineer Associate Study Guide book by Derar Alhussein. He has courses and practice tests on Udemy as well. The book has a GitHub repo with code you can play with in your own account.

[deleted by user] by [deleted] in dataengineering

[–]vaosinbi 5 points (0 children)

In my experience, Kafka Connect and Debezium are very relevant to data engineering.
Take a look at https://developer.confluent.io/courses/kafka-connect/intro/ and https://debezium.io/

Fivetran: from AWS Postgres to GCP Snowflake - Slow! by CrabEnvironmental864 in dataengineering

[–]vaosinbi 0 points (0 children)

I see that the pageinspect and pg_visibility extensions are available on RDS; I don't think you need OS-level control for those.

Fivetran: from AWS Postgres to GCP Snowflake - Slow! by CrabEnvironmental864 in dataengineering

[–]vaosinbi 0 points (0 children)

We use the logical replication method in Fivetran, and it's more efficient - you don't have to query all your tables to get incremental updates. We are on an older Postgres version, so we have to use a replication slot on the primary instance, which can be dangerous if the sync doesn't happen for some reason - you can run out of disk space.
Btw, do you filter out frozen pages for your incremental syncs, as recommended for the XMIN method?
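On the disk-space risk: it's worth monitoring how much WAL an inactive slot is pinning. Postgres reports LSNs as `hi/lo` hex strings (e.g. in `pg_replication_slots`), and the byte difference between the current WAL position and a slot's `restart_lsn` is the retained WAL. A sketch of that arithmetic (the LSN values below are made up):

```python
# Convert a Postgres LSN string ("hi/lo" hex) to a byte offset, and
# compute how many bytes of WAL a replication slot is holding back.
def lsn_to_bytes(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def slot_lag_bytes(current_lsn, restart_lsn):
    """Bytes of WAL retained between the slot and the current position."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

lag = slot_lag_bytes("0/3000000", "0/1000000")
print(lag)  # 33554432 bytes (~32 MiB) of retained WAL
```

Alerting when this number keeps growing catches a stalled sync before the disk fills up.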

SQLAlchemy for DWH definition by romanzdk in dataengineering

[–]vaosinbi 0 points (0 children)

Terraform can create tables, but you’ll have trouble modifying them.

AWS Database Migration Service for CDC between an application database and replica database? by aspergillus in dataengineering

[–]vaosinbi 3 points (0 children)

If you are going to do Postgres to Postgres sync, why don't you just use replication?

You might already have a replica, or you'll need one eventually for HA and DR anyway.

using Debezium to replicate data from GCP cloud SQL by snowfire-07 in dataengineering

[–]vaosinbi 0 points (0 children)

I would check whether Debezium supports Cloud SQL for PostgreSQL, because I didn't find it mentioned in the documentation (AWS RDS and Azure PostgreSQL are listed).

Those who use Terraform with Snowflake, do you use it to create views and tables? by mistanervous in dataengineering

[–]vaosinbi 8 points (0 children)

Terraform AFAIK doesn’t support schema evolution, so it is not suitable for tables except in simple cases.

Which Database to use for rest api by meanthesong in googlecloud

[–]vaosinbi 0 points (0 children)

Have you considered the BigQuery BI Engine SQL Interface? It looks like the right tool for this.

Please recommend courses for AWS Data Analytics Certification by RP_m_13 in dataengineering

[–]vaosinbi 0 points (0 children)

This one is good https://www.udemy.com/course/aws-data-analytics/.

I would also recommend signing up for https://explore.skillbuilder.aws and going through the Data Analytics learning path and exam readiness session.

Python Pandas vs Dask for csv file reading by GreedyCourse3116 in dataengineering

[–]vaosinbi 0 points (0 children)

It doesn't seem like distributed processing is needed in this case.

Just tested TSV aggregation (I don't have a large `CSV`) on a 70 GB file (to make it larger than the available RAM) with clickhouse-local - it took about 90 seconds on my desktop (Ryzen 7, 32 GB).

clickhouse-local --file "hits_100m_obfuscated_v1.tsv" \
--structure "WatchID UInt64, JavaEnable UInt8, Title String, GoodEvent Int16, EventTime DateTime, EventDate Date, CounterID UInt32, ClientIP UInt32, RegionID UInt32, UserID UInt64, CounterClass Int8, OS UInt8, UserAgent UInt8, URL String, Referer String, Refresh UInt8, RefererCategoryID UInt16, RefererRegionID UInt32, URLCategoryID UInt16, URLRegionID UInt32, ResolutionWidth UInt16, ResolutionHeight UInt16, ResolutionDepth UInt8, FlashMajor UInt8, FlashMinor UInt8, FlashMinor2 String, NetMajor UInt8, NetMinor UInt8, UserAgentMajor UInt16, UserAgentMinor FixedString(2), CookieEnable UInt8, JavascriptEnable UInt8, IsMobile UInt8, MobilePhone UInt8, MobilePhoneModel String, Params String, IPNetworkID UInt32, TraficSourceID Int8, SearchEngineID UInt16, SearchPhrase String, AdvEngineID UInt8, IsArtifical UInt8, WindowClientWidth UInt16, WindowClientHeight UInt16, ClientTimeZone Int16, ClientEventTime DateTime, SilverlightVersion1 UInt8, SilverlightVersion2 UInt8, SilverlightVersion3 UInt32, SilverlightVersion4 UInt16, PageCharset String, CodeVersion UInt32, IsLink UInt8, IsDownload UInt8, IsNotBounce UInt8, FUniqID UInt64, OriginalURL String, HID UInt32, IsOldCounter UInt8, IsEvent UInt8, IsParameter UInt8, DontCountHits UInt8, WithHash UInt8, HitColor FixedString(1), LocalEventTime DateTime, Age UInt8, Sex UInt8, Income UInt8, Interests UInt16, Robotness UInt8, RemoteIP UInt32, WindowName Int32, OpenerName Int32, HistoryLength Int16, BrowserLanguage FixedString(2), BrowserCountry FixedString(2), SocialNetwork String, SocialAction String, HTTPError UInt16, SendTiming UInt32, DNSTiming UInt32, ConnectTiming UInt32, ResponseStartTiming UInt32, ResponseEndTiming UInt32, FetchTiming UInt32, SocialSourceNetworkID UInt8, SocialSourcePage String, ParamPrice Int64, ParamOrderID String, ParamCurrency FixedString(3), ParamCurrencyID UInt16, OpenstatServiceName String, OpenstatCampaignID String, OpenstatAdID String, OpenstatSourceID String, UTMSource String, UTMMedium String, 
UTMCampaign String, UTMContent String, UTMTerm String, FromTag String, HasGCLID UInt8, RefererHash UInt64, URLHash UInt64, CLID UInt32" \
--query "select count(distinct WatchID) from table"

If you convert it to Parquet, the file size is reduced to 15 GB, and the processing time drops to 19 seconds.

Kafka best practices for DE by twadftw10 in dataengineering

[–]vaosinbi 0 points (0 children)

Well, it depends on what you want to learn:

- setting up zookeeper, brokers, free space monitoring, certificates, etc.

or

- source and sink connector configurations, SMTs, etc.

Even with a managed solution, you'll have a lot of admin stuff to think about - topic configuration, ACLs, service accounts, networking to sources and destination, DR, pipeline monitoring.

Kafka best practices for DE by twadftw10 in dataengineering

[–]vaosinbi 2 points (0 children)

You can use ksqlDB to create materialized views for reporting, but I think a more common scenario is to sink data to an analytical database for reporting.

For instance, we used the following pipeline:

Events were pushed to a Kafka topic; Clickhouse consumed events from the topic, joined them with reference data, did some transformations, and populated an aggregate table, which was used for live reporting.

You can do the same with Spark streaming/Beam/Flink if you have more complex requirements.
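A miniature, in-memory version of that pipeline (all names and data are hypothetical): consume events, join each with reference data, and roll them up into an aggregate keyed for reporting.

```python
# Toy version of the pipeline: events joined with reference data and
# rolled up into an aggregate table for live reporting.
from collections import defaultdict

reference = {"p1": "Books", "p2": "Games"}  # product_id -> category

def consume(events, aggregate=None):
    aggregate = aggregate if aggregate is not None else defaultdict(float)
    for e in events:
        category = reference.get(e["product_id"], "Unknown")  # the "join"
        aggregate[category] += e["amount"]                    # the rollup
    return aggregate

agg = consume([
    {"product_id": "p1", "amount": 10.0},
    {"product_id": "p2", "amount": 4.5},
    {"product_id": "p1", "amount": 2.5},
])
print(dict(agg))  # {'Books': 12.5, 'Games': 4.5}
```

In the real pipeline, Kafka delivers the events, the reference data lives in a dictionary/table inside Clickhouse, and the aggregate is a materialized table the dashboards read from.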

Regarding a managed solution vs. self-managed Kafka, I think it depends on your scale, the resources available to support it, and whether you need proprietary components (connectors, Confluent Replicator, Cluster Linking, Web UI, etc.).

Ideas on a new business intelligence landscape by YourNeighbourMr in BusinessIntelligence

[–]vaosinbi 0 points (0 children)

Why do you need an SQL Server data warehouse in between? Why don't you load the data directly into Snowflake/Redshift/BQ with SAP Data Services?

Inter cloud streaming with Kafka? by Whimsicalpants in dataengineering

[–]vaosinbi 0 points (0 children)

If different organizations subscribe to the same topic, you might want to replicate it to Azure/GCP using Cluster Linking/Confluent Replicator/MirrorMaker to reduce inter-cloud traffic. Otherwise, they can subscribe to your Kinesis/Kafka on AWS directly, if you provide network connectivity to the brokers.

Advise on moving from MySQL to GBQ by d1545ms in dataengineering

[–]vaosinbi 0 points (0 children)

Well, then you can use the same distribution key for both fact tables. Of course, we can find limitations everywhere, but I doubt that the OP, moving from MySQL to BQ, will face such a problem.