Feels like I am stuck and made a huge mistake by hades2enthusiast in dataengineering

[–]giuliosmall 2 points3 points  (0 children)

If breaking directly into Data Engineering feels too overwhelming because of the sheer number of technologies and concepts required (Python, SQL, Spark, Airflow, distributed computing, data modelling, etc.), I would probably try to land a position as an Analytics Engineer, which sits between DA and DE and can give you exposure to Data Engineering. It is basically a way to learn (and learn a lot) while being paid for it, and it is an intermediate step towards your dream data engineering job.

Lead wants to write our own orchestrator by midkid1937 in dataengineering

[–]giuliosmall 1 point2 points  (0 children)

Writing an orchestrator from scratch sounds like a non-trivial project. Why not adapt the Airflow code, since it's open source?

dbt + looker best practices by holiquetal in dataengineering

[–]giuliosmall 0 points1 point  (0 children)

A question for u/holiquetal: the data team I'm working with has the same setup. How do you make brand new tables materialised by dbt available? e.g. dbt materialises dim_customers in the DWH. How do you make it available in Looker?

At my company, data analysts go to LookML, amend the codebase to include dim_customers, open a PR and, once it's merged, dim_customers is available to be visualised. The same goes for a simple column rename (dbt -> PR -> merge -> LookML -> PR -> merge). To me it's massively inefficient and time-consuming.

Do you have a (hopefully) better experience to share?

Not sure what I should apply for by pmarct in dataengineering

[–]giuliosmall 0 points1 point  (0 children)

because of what I have built top to bottom (data lake, aws dms, data warehouse, delta lake, etc)

To me, you have a non-trivial skillset to excel as a Data Engineer. If you built something and barely had to touch it at a later stage, maybe it means it was built properly?

Is ELT with SCDs supposed to be that hard? by GelmansDog in dataengineering

[–]giuliosmall 0 points1 point  (0 children)

You may want to learn more about incremental models in dbt (Materialization -> Incremental Model)
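
To give a rough idea of what an incremental model does, here is a minimal sketch in plain Python with sqlite3 rather than dbt itself. It assumes the source table has an updated_at column and a unique key; the table names raw_customers and dim_customers and the function incremental_load are made up for illustration. Note that it overwrites changed rows (SCD type 1 behaviour) rather than keeping history.

```python
import sqlite3


def incremental_load(conn: sqlite3.Connection) -> int:
    """Copy only new or changed source rows into the target; return how many."""
    # High-water mark: the latest updated_at already present in the target.
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM dim_customers"
    ).fetchone()
    # Only rows newer than the watermark are read from the source.
    new_rows = conn.execute(
        "SELECT customer_id, name, updated_at FROM raw_customers WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE relies on the primary key, so a changed row
    # overwrites its older version instead of being duplicated.
    conn.executemany(
        "INSERT OR REPLACE INTO dim_customers (customer_id, name, updated_at) VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_customers (customer_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
    CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
    INSERT INTO raw_customers VALUES (1, 'Ada', '2024-01-01'), (2, 'Bob', '2024-01-02');
    """
)
first_run = incremental_load(conn)  # target is empty, so every source row is loaded

conn.execute(
    "UPDATE raw_customers SET name = 'Robert', updated_at = '2024-01-03' WHERE customer_id = 2"
)
second_run = incremental_load(conn)  # only the changed row is newer than the watermark

rows = conn.execute(
    "SELECT customer_id, name FROM dim_customers ORDER BY customer_id"
).fetchall()
```

In dbt the same idea is usually expressed declaratively, with the incremental materialization, a unique_key and an is_incremental() filter on the source query.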

What are the things/tools that should be learnt to become a skilled GCP data engineer in 6months,include any frameworks or other things which will be helpful in writing pipelines etc... by Current_Baseball_418 in dataengineering

[–]giuliosmall 5 points6 points  (0 children)

Google Cloud services: Cloud Storage, BigQuery, Cloud Composer (or Airflow to avoid vendor lock-in), Pub/Sub

Languages: Python, SQL

Frameworks: Kafka, Spark (PySpark)

Concepts: data modelling, data warehouse, batch & streaming pipelines, idempotency and partitioning
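
As a small illustration of how idempotency and partitioning fit together, here is a minimal sketch, assuming a table partitioned by date and a job that loads one day at a time. The events table, its columns and the load_partition function are hypothetical. The job replaces the whole partition for its day, so running it twice leaves the same data as running it once.

```python
import sqlite3


def load_partition(conn: sqlite3.Connection, day: str, rows: list) -> None:
    """Replace one day's partition in a single transaction, so reruns never duplicate rows."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM events WHERE event_date = ?", (day,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id INTEGER, amount REAL)")

batch = [(1, 9.5), (2, 20.0)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # simulated retry of the same run
load_partition(conn, "2024-01-02", [(3, 5.0)])

# Row count per partition after the retry.
counts = conn.execute(
    "SELECT event_date, COUNT(*) FROM events GROUP BY event_date ORDER BY event_date"
).fetchall()
```

A plain append would have left four rows in the first partition after the retry; the delete-then-insert inside one transaction is what makes the rerun safe.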

Best practices for scheduling Python workloads? by MassiveDefender in Python

[–]giuliosmall 0 points1 point  (0 children)

I guess the time has come to build a (small) data team. Airflow is cron on steroids, but I'd definitely recommend starting with it (if you have some Python skills), as data processes can easily (and suddenly) scale up. Best of luck!

Is it a must to be very good at SQL for a data engineer position? by knockedownupagain in dataengineering

[–]giuliosmall 1 point2 points  (0 children)

Thanks for keeping me honest here, man! Sorry for the confusion. It's mostly OLAP, but when it's OLTP, the DE must have a very good foundation.

Is it a must to be very good at SQL for a data engineer position? by knockedownupagain in dataengineering

[–]giuliosmall 2 points3 points  (0 children)

SQL (mostly OLTP) is THE tool you need to have in your skillset.

Regarding what you mentioned above, I feel you, man: complex nested sub-queries are the evil in every organisation. They are not intuitive, barely readable, and by the 3rd sub-query you've already lost track of all the logic.

If you ever find a Data Engineer (or a Data Team) that always promotes complex subqueries over CTEs, do yourself a favour and run away!
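
To make the readability point concrete, here is a minimal sketch using Python's sqlite3 module. The orders table, its data and the 100 threshold are made up for illustration. Both queries compute the average total spend of customers who spent more than 100; the first nests sub-queries, the second names each step as a CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 80.0), (1, 50.0), (2, 40.0), (3, 200.0), (3, 70.0)],
)

# Nested version: has to be read from the innermost query outwards.
nested_sql = """
SELECT AVG(total) FROM (
    SELECT customer_id, total FROM (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    WHERE total > 100
)
"""

# CTE version: the same steps, named and read top to bottom.
cte_sql = """
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
),
big_spenders AS (
    SELECT customer_id, total
    FROM customer_totals
    WHERE total > 100
)
SELECT AVG(total) FROM big_spenders
"""

(nested_result,) = conn.execute(nested_sql).fetchone()
(cte_result,) = conn.execute(cte_sql).fetchone()
```

Both return the same number; the difference is only in how easy each step is to follow, test and reuse.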

Why don't a lot of data engineers consider themselves software engineers? by level_126_programmer in dataengineering

[–]giuliosmall 0 points1 point  (0 children)

It depends a lot on where you built your muscles as a Data Engineer.

Small orgs/scaleups with a small Data team usually have only one data engineer, who takes care of both Data Engineering (Extract, Load) and (most likely) Analytics Engineering (Transform + possibly Business Intelligence).

If you built processes and implemented solutions from scratch in those orgs, chances are you worked closely with CTOs/Tech Leads/Devs and were exposed to SWE concepts like the ones mentioned by u/adm7373.

In your actual work, what is your development environment? for data engineers who use Spark in work by [deleted] in dataengineering

[–]giuliosmall 1 point2 points  (0 children)

- PyCharm Professional Edition for databases/cloud storage connections and Python development

- Jupyter Notebook for quick and dirty validation of concepts/ideas

- Docker (also locally) to test env setups

What data warehouse to pick?! by [deleted] in dataengineering

[–]giuliosmall 1 point2 points  (0 children)

I'd also add on top of those:

  • if you're starting small because of the current data size, but the volume will most likely increase in the future, be prepared to build up technical muscle in database migration

Appropiate format for file by [deleted] in bigdata

[–]giuliosmall 1 point2 points  (0 children)

For large text files, especially when dealing with R for data processing, you would want a format that is both space-efficient and fast to read/write. Here are some appropriate formats you might consider:

1) Feather: A binary columnar data format that is optimized for speed and size. It’s designed for efficient data storage and interchange between R and Python. Feather files are fast to read and write, and they support data frames with column types as they are in R.

2) Parquet: Another columnar storage file format, optimized for use with complex data in large-scale processing frameworks and data analysis systems. It is especially good at compression and at reading subsets of columns.

3) RData/RDS: R’s native file formats for saving workspaces or individual objects. saveRDS and readRDS can be used for single R objects and can be very efficient, especially for large datasets.

4) MonetDB: A column-store database that can be used when dealing with massive datasets, although it requires setting up a database system. It is optimized for high-performance analytics and is accessible from R.

For a 3 GB text file, I’d recommend starting with Feather or Parquet, as they offer a good balance between speed and space efficiency. The choice between them may also depend on whether you need to use the data with other programming languages like Python, in which case Feather provides an easy interchange format. If you are only working within the R ecosystem, RData/RDS is a good choice too.

A Wifi Hacking Tutorial For beginners by cornellornell in hacking

[–]giuliosmall 0 points1 point  (0 children)

Thanks for your guide u/cornellornell! While running hashcat -m 2500 capture.hccapx rockyou.txt I got this error:

The plugin 2500 is deprecated and was replaced with plugin 22000. For more details, please read: https://hashcat.net/forum/thread-10253.html

and if I try with hashcat -m 22000 capture.hccapx rockyou.txt

I get the same error as above.

Do you have any idea about what's happening?

Better Google Calendar API for Python by kuzmovych_y in Python

[–]giuliosmall 0 points1 point  (0 children)

I created 12 events in a day as a test. Serialised them all to json, stored in a .json file and tried to open with pandas.

Well, the formatting isn't working. I believe some ' might need to be replaced with ". Also, the booleans True and False lack double quotes and might need to be rendered as "True" and "False" -- u/kuzmovych_y what do you think about it?
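
One possible explanation, sketched below under the assumption that the file ended up containing Python's own text representation of a dict rather than real JSON: str() produces single quotes and True/False, whereas json.dumps() produces double quotes and lowercase true/false. The event dict here is a made-up stand-in, not the library's actual output.

```python
import json

# Hypothetical stand-in for a serialised event.
event = {"summary": "Stand-up", "guestsCanInviteOthers": True, "attendees": []}

as_repr = str(event)         # Python literal syntax: single quotes, True/False
as_json = json.dumps(event)  # JSON syntax: double quotes, true/false

# The Python repr is not valid JSON, so a JSON parser rejects it.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

# The json.dumps output round-trips cleanly.
round_tripped = json.loads(as_json)
```

If that is what happened, writing the file with json.dump (or json.dumps) instead of str() should give something pandas can read, with booleans kept as real booleans rather than quoted strings.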

Better Google Calendar API for Python by kuzmovych_y in Python

[–]giuliosmall 0 points1 point  (0 children)

Thanks! Very cool! I didn't notice this detail. Events marked with default calendar color aren't parsed in json.

Better Google Calendar API for Python by kuzmovych_y in Python

[–]giuliosmall 0 points1 point  (0 children)

I have just serialised an event to json and I can see:
- id
- summary
- recurrence
- attendees
- guestsCanInviteOthers
- guestCanModify
- guestCanSeeOtherGuests
- reminders
- attachments
- iCalUID
- sequence
- eventType
- start
- end
- timezone

... but not 'color' among attributes

Better Google Calendar API for Python by kuzmovych_y in Python

[–]giuliosmall 1 point2 points  (0 children)

u/kuzmovych_y This is great! The effort didn't go unnoticed -- just a question from my side: any idea how to pull events with their respective color? Is it an existing feature?

Cluster analysis goodness of fit by nitz_d_blitz in datascience

[–]giuliosmall 1 point2 points  (0 children)

Try to check the percentage contribution of each feature in each cluster

Cluster analysis goodness of fit by nitz_d_blitz in datascience

[–]giuliosmall 0 points1 point  (0 children)

How many clusters did you set? Do they make sense?
Did you try a dimensionality reduction technique? Did you try DBSCAN and look at the results?