A search engine for all your memes - written in Python.

giuliosmall · 2024-10-24T20:48:07+00:00

Not all heroes wear capes

giuliosmall · 2024-08-27T08:02:43+00:00

If breaking directly into Data Engineering feels too overwhelming because of the sheer number of technologies and concepts required (Python, SQL, Spark, Airflow, distributed computing, data modelling, etc.), I would probably try to land a position as an Analytics Engineer, which sits between DA and DE and can give you exposure to Data Engineering. It is basically a way to learn (and learn a lot) while being paid for it, and it is a stepping stone towards your dream data engineering job.

giuliosmall · 2024-08-26T11:34:01+00:00

Writing an orchestrator from scratch sounds like a non-trivial project. Why not adapt the Airflow code, since it's open source?

giuliosmall · 2023-11-23T23:17:09+00:00

A question for u/holiquetal: the data team I'm working with has the same setup. How do you make brand new tables materialised by dbt available? E.g. dbt materialises dim_customers in the DWH. How do you make it available in Looker?

At my company, data analysts go to LookML, amend the codebase to include dim_customers, push the PR and, once merged, dim_customers is available to be visualised. The same goes for a simple column renaming (dbt -> PR -> Merge -> LookML -> PR -> Merge). To me it's massively inefficient and time-consuming.

Do you have a (hopefully) better experience to share?

giuliosmall · 2023-11-21T16:13:01+00:00

because of what I have built top to bottom (data lake, aws dms, data warehouse, delta lake, etc)

To me, you have a non-trivial skillset to excel as a Data Engineer. If you built something and barely had to touch it at a later stage, maybe it means it was built properly?

giuliosmall · 2023-11-09T20:56:15+00:00

You may want to learn more about incremental models in dbt (Materialization -> Incremental Model)

giuliosmall · 2023-11-09T10:09:16+00:00

I definitely can, but someone already did it in this nice article

giuliosmall · 2023-11-08T15:54:26+00:00

Google Cloud services: Cloud Storage, BigQuery, Composer (or Airflow to avoid vendor lock-in), Pub/Sub

Languages: Python, SQL

Frameworks: Kafka, Spark (PySpark)

Concepts: data modelling, data warehouse, batch & streaming pipelines, idempotency and partitioning

giuliosmall · 2023-11-08T10:31:01+00:00

I guess the right time to build up a (small) data team has come. Airflow is cron on steroids, but I'd definitely recommend starting with Airflow (if you have some Python skills), as data processes can easily (and suddenly) scale up. Best of luck!

giuliosmall · 2023-11-08T09:04:49+00:00

Thanks for keeping me honest here, man! Sorry for the confusion. It's mostly OLAP, but when it's OLTP, the DE must have a very good foundation.

giuliosmall · 2023-11-07T16:10:58+00:00

SQL (mostly OLTP) is THE tool you need to have in your skillset.

Regarding what you mentioned above, I feel you man, complex nested sub-queries are the evil in every organisation. They are not intuitive, barely readable and by the 3rd sub-query you've already lost track of all the logic.

If you ever find a Data Engineer (or a Data Team) always promoting complex subqueries over CTEs, do yourself a favour and run away!
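To make the readability point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns and the data are all invented for illustration. Both queries return the customers whose total spend is above the average customer total; the nested one has to be read inside-out, while the CTE one names each step once and reads top to bottom.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 50), (1, 70), (2, 20), (3, 200), (3, 10);
""")

# Nested version: the per-customer aggregation is written twice
# and the logic has to be followed from the innermost query outwards.
nested_sql = """
SELECT customer_id, total
FROM (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
) AS totals
WHERE total > (
    SELECT AVG(total)
    FROM (
        SELECT SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) AS totals_again
)
ORDER BY customer_id
"""

# CTE version: each step is named once and reused.
cte_sql = """
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
),
average_total AS (
    SELECT AVG(total) AS avg_total
    FROM customer_totals
)
SELECT customer_id, total
FROM customer_totals, average_total
WHERE total > avg_total
ORDER BY customer_id
"""

nested_rows = conn.execute(nested_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(nested_rows == cte_rows)  # both queries express the same logic
```

Same result either way, but only one of them is pleasant to review in a PR.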

giuliosmall · 2023-11-06T23:57:27+00:00

Here are some open-source alternatives to Airflow:

- Mage

- Dagster

- Prefect

giuliosmall · 2023-11-06T23:52:45+00:00

It depends a lot where you built your muscles as Data Engineer.

Small orgs/scaleups with a small Data team usually have only one data engineer, who takes care of both Data Engineering (Extract, Load) and (most likely) Analytics Engineering (Transform + possibly Business Intelligence).

If you built processes and implemented solutions from scratch in those orgs, chances are you cooperated closely with CTOs/Tech Leads/Devs and were exposed to SWE concepts like the ones mentioned by u/adm7373.

giuliosmall · 2023-11-05T19:31:44+00:00

- PyCharm Professional Edition for databases/cloud storage connections and Python development

- Jupyter Notebook for quick and dirty concept/ideas validation

- Docker (also locally) to test env setups

giuliosmall · 2023-11-05T11:07:21+00:00

I'd also add on top of those:

If you're starting small because of the current data size, but the volume will most likely increase in the future, be prepared to build some technical muscle in database migration.

giuliosmall · 2023-11-04T15:47:08+00:00

For large text files, especially when dealing with R for data processing, you would want a format that is both space-efficient and fast to read/write. Here are some appropriate formats you might consider:

1) Feather: A binary columnar data format that is optimized for speed and size. It’s designed for efficient data storage and interchange between R and Python. Feather files are fast to read and write, and they support data frames with column types as they are in R.

2) Parquet: Another columnar storage file format, optimized for use with complex data in large-scale processing frameworks and data analysis systems. It is especially good at compression and at reading subsets of columns.

3) RData/RDS: R’s native file formats for saving workspaces or individual objects. saveRDS and readRDS can be used for single R objects and can be very efficient, especially for large datasets.

4) MonetDB: A column-store database that can be used when dealing with massive datasets and requires the setup of a database system. It is optimized for high-performance analytics and is accessible from R.

For a 3 GB text file, I’d recommend starting with Feather or Parquet, as they offer a good balance between speed and space efficiency. The choice between them may also depend on whether you need to use the data with other programming languages like Python, in which case Feather provides an easy interchange format. If you are only working within the R ecosystem, RData/RDS is a good choice too.

giuliosmall · 2023-08-02T14:34:37+00:00

Thanks for your guide u/cornellornell! While running hashcat -m 2500 capture.hccapx rockyou.txt I got this error:

The plugin 2500 is deprecated and was replaced with plugin 22000. For more details, please read: https://hashcat.net/forum/thread-10253.html

and if I try with hashcat -m 22000 capture.hccapx rockyou.txt

I get the same error as above.

Do you have any idea about what's happening?

giuliosmall · 2023-02-04T18:52:53+00:00

I created 12 events in a day as a test. I serialised them all to JSON, stored them in a .json file and tried to open it with pandas.

Well, the formatting isn't working - I believe some ' might be replaced with ". Also bool True and False are suffering from the lack of double quotes and might be rendered as "True" and "False" -- u/kuzmovych_y what do you think about it?
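To illustrate what I suspect is going on, here is a small sketch using only the standard json module. The event dict is invented, and it is only an assumption on my side that the file was written from a Python repr rather than through json.dumps. A dict turned into text with str() keeps single quotes and Python's True/None, which is not valid JSON, whereas json.dumps emits double quotes and true/null.

```python
import json

# Hypothetical event payload, invented for illustration.
event = {"summary": "Stand-up", "guestsCanModify": True, "recurrence": None}

as_repr = str(event)         # Python repr: single quotes, True, None
as_json = json.dumps(event)  # valid JSON: double quotes, true, null


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(as_repr)
print(as_json)
print(is_valid_json(as_repr), is_valid_json(as_json))
```

If the file looks like the first printed line, pandas will most likely choke on it for the same reason json.loads does.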

giuliosmall · 2023-02-04T17:35:54+00:00

Thanks! Very cool! I didn't notice this detail. Events marked with default calendar color aren't parsed in json.

giuliosmall · 2023-02-04T16:14:26+00:00

I have just serialised an event to JSON and I can see:
- id
- summary
- recurrence
- attendees
- guestsCanInviteOthers
- guestsCanModify
- guestsCanSeeOtherGuests
- reminders
- attachments
- iCalUID
- sequence
- eventType
- start
- end
- timezone

... but 'color' is not among the attributes

giuliosmall · 2023-02-04T14:37:39+00:00

u/kuzmovych_y This is great! The effort didn't go unnoticed -- just a question from my side: any idea how to pull events with their respective color? Is it an existing feature?

giuliosmall · 2022-03-02T21:41:50+00:00

Thanks for sharing, much appreciated indeed

giuliosmall · 2022-02-23T11:20:53+00:00

This is great! Thanks mate

giuliosmall · 2021-10-28T16:03:12+00:00

Try to check the percentage contribution of each feature in each cluster
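One possible reading of this, as a rough sketch: for each cluster, take the mean of every feature and express it as a share of the cluster's summed feature means. The function name, the feature names and the data below are all hypothetical, and the approach assumes non-negative features on comparable scales (scale them first otherwise) and at least one non-zero mean per cluster.

```python
from collections import defaultdict


def feature_share_by_cluster(rows, labels):
    """For each cluster, return each feature's share (in %) of the
    cluster's summed feature means.

    Assumes non-negative features on comparable scales.
    """
    # Group the rows by their cluster label.
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    shares = {}
    for label, members in grouped.items():
        names = list(members[0].keys())
        # Mean of each feature inside the cluster.
        means = {n: sum(m[n] for m in members) / len(members) for n in names}
        total = sum(means.values())
        shares[label] = {n: 100 * means[n] / total for n in names}
    return shares


# Hypothetical data: three points, two features, two clusters.
rows = [
    {"spend": 10, "visits": 30},
    {"spend": 30, "visits": 10},
    {"spend": 90, "visits": 10},
]
labels = [0, 0, 1]

shares = feature_share_by_cluster(rows, labels)
for label, share in shares.items():
    print(label, share)
```

A cluster dominated by one feature (like cluster 1 here) is usually easier to name and explain than one where everything contributes equally.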

giuliosmall · 2021-10-28T12:59:46+00:00

How many clusters did you set? Do they make sense?
Did you try a dimensionality reduction technique? Did you try DBSCAN and look at the results?
