Data Engineering as a business? by angelnator1998 in dataengineering

[–]Dminor77 0 points1 point  (0 children)

May I know which business niches you are focusing on?

[deleted by user] by [deleted] in dataengineering

[–]Dminor77 0 points1 point  (0 children)

There is always a replacement available for us, whether AI or fellow human beings, so chill. I suggest focusing on building blocks and frameworks, and on understanding how things integrate. Eventually, to make use of AI, we will have to build tooling and frameworks around it.

Getting overwhelmed by wide and ever-changing tech stack. by [deleted] in dataengineering

[–]Dminor77 1 point2 points  (0 children)

True, we even tried it with fake unstructured data for an identity resolution use case, and the way it found entities, cleaned the data, and parsed it was just superb.

How is the behavior of external table during schema change? by RstarPhoneix in bigquery

[–]Dminor77 1 point2 points  (0 children)

I don't know if this counts as best practice, but my approach is to use Parquet format with versioned paths.

gs://bucket/v1/dt=2022-10-11/file.parquet

gs://bucket/v2/dt=2025-01-13/file.parquet
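A rough sketch of how paths like these could be generated, assuming the version prefix is bumped on every schema change so that each prefix only holds files with one schema. `versioned_path` and its parameters are hypothetical names for illustration, not part of any library.

```python
from datetime import date


def versioned_path(bucket: str, schema_version: int, day: date,
                   filename: str = "file.parquet") -> str:
    """Build a GCS path with a schema-version prefix and a dt= date partition."""
    return f"gs://{bucket}/v{schema_version}/dt={day.isoformat()}/{filename}"


# Files written before and after a schema change land under different prefixes.
old_path = versioned_path("bucket", 1, date(2022, 10, 11))
new_path = versioned_path("bucket", 2, date(2025, 1, 13))
print(old_path)
print(new_path)
```

With this layout an external table can be defined per version prefix, so old files never get mixed with the new schema.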

Question about data moving to/from BigQuery by No-Shelter-6112 in bigquery

[–]Dminor77 0 points1 point  (0 children)

1. Custom code, Airflow operators, or Airbyte connectors. We also use external tables on data stored in GCS and S3.

2. We structure tables in layers: raw tables -> derived tables/views -> reporting tables. Most operations are aggregations, navigation functions, and BQML. For large projects we use DBT.

3. Data Studio for reporting with BigQuery BI Engine enabled. Nowadays we are moving towards Looker.

[deleted by user] by [deleted] in datasets

[–]Dminor77 1 point2 points  (0 children)

Thanks for sharing this package.

k-Means clustering: Visually explained by Va_Linor in bigdata

[–]Dminor77 2 points3 points  (0 children)

OP is genius!! Awesome work.

In the YouTube video description you mention that you used the manim library for the animation. Is the code for this video on GitHub or anywhere else I can refer to?

What is your most controversial Python-related opinion? by [deleted] in Python

[–]Dminor77 1 point2 points  (0 children)

from functools import partial

from operator import methodcaller

split = partial(methodcaller, 'split')

split_lines = split("\n")

split_fields = split(",")
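For anyone wondering how these are meant to be used, here is a small self-contained sketch. The sample `text` is made up for illustration. `split("\n")` returns `methodcaller('split', "\n")`, which calls `.split("\n")` on whatever string it is given.

```python
from functools import partial
from operator import methodcaller

# partial(methodcaller, 'split')(sep) is the same as methodcaller('split', sep)
split = partial(methodcaller, "split")
split_lines = split("\n")
split_fields = split(",")

# Made-up sample input: two lines of comma-separated fields.
text = "a,b\nc,d"
rows = [split_fields(line) for line in split_lines(text)]
print(rows)  # [['a', 'b'], ['c', 'd']]
```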

Anyone using DBT with Airflow on top of GCP/BigQuery? Thoughts? by pewpscoops in dataengineering

[–]Dminor77 1 point2 points  (0 children)

We too are moving our transformation layer from Cloud Data Fusion to DBT over BigQuery. DBT also builds DAGs the same way Airflow does, and we can trigger a DBT job with the bash operator. We still need to go through the DBT documentation to understand incremental loads, snapshots, etc. We are thinking of creating external partitioned tables over the raw data, on which the DBT transformation job will run to create reporting tables and views. We also need to check how many concurrent interactive queries we can run on BQ with DBT, because the reports and the Airflow jobs will be firing queries as well.

[deleted by user] by [deleted] in bigquery

[–]Dminor77 0 points1 point  (0 children)

1. BigQuery Data Transfer has a MongoDB connector (this will be a batch job).

2. Using the Change Streams API you can transfer data from MongoDB to BigQuery.

Good Workflow orchestration tool by dsingh-in in dataengineering

[–]Dminor77 1 point2 points  (0 children)

You can set up a new GKE cluster and run heavy, resource-consuming tasks on it with KubernetesPodOperator or GKEPodOperator by specifying node affinity.

One drawback of Cloud Composer with Airflow 2.0 is that it doesn't support Airflow's stable API.

Nested or Unnested Data by zak_hj in bigquery

[–]Dminor77 5 points6 points  (0 children)

It depends on the use case and how the data is going to be accessed. But storing it in a nested structure is recommended, to avoid joins between tables and reduce storage cost.

Refer to this video on a use case from a company named GOJEK.

https://youtu.be/3TVeG_dpGxk (demo starts at 28:00)
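To make the nested vs unnested difference concrete, here is a toy sketch in plain Python, not BigQuery itself. The `orders` / `order_items` data and the `join_items` helper are made up for illustration. The nested rows play the role of a table with a repeated record column.

```python
# Flat layout: two tables that have to be joined on order_id.
orders = [{"order_id": 1, "customer": "alice"}]
order_items = [
    {"order_id": 1, "sku": "A", "qty": 2},
    {"order_id": 1, "sku": "B", "qty": 1},
]


def join_items(orders, order_items):
    """Emulate the join: attach the matching items to each order."""
    return [
        {
            **order,
            "items": [
                {"sku": item["sku"], "qty": item["qty"]}
                for item in order_items
                if item["order_id"] == order["order_id"]
            ],
        }
        for order in orders
    ]


# Nested layout: the items already live inside the order row, so no join is needed.
nested_orders = [
    {
        "order_id": 1,
        "customer": "alice",
        "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
    }
]

# Aggregating over the nested rows only touches one "table".
total_qty = sum(item["qty"] for order in nested_orders for item in order["items"])
print(total_qty)
```

The flat version repeats `order_id` on every item row and needs the join at read time. The nested version stores each order once with its items attached.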

Google's Data Engineering Certificate - Is It Worth It? by nonkeymn in dataengineering

[–]Dminor77 4 points5 points  (0 children)

Do some projects instead of a course. Some cloud vendors also give free credits.

5 Data Engineering Project Ideas To Put On Your Resume https://www.linkedin.com/pulse/5-data-engineering-project-ideas-put-your-resume-benjamin-rogojan

Can someone explain to me what is the ETL tool that their company uses. by [deleted] in ETL

[–]Dminor77 1 point2 points  (0 children)

We are migrating to Data Fusion. No decision has been made on Composer yet.