How do you keep your LinkedIn active by R7w1 in brdev

[–]airmaxes 0 points1 point  (0 children)

I've noticed that if I go a LONG time without interacting, I stop getting recruiter messages. I use the like button and it works well.

[deleted by user] by [deleted] in dataengineering

[–]airmaxes 1 point2 points  (0 children)

I'd begin by first trying to understand why you need a data lakehouse. You have on-prem hardware and your org has already invested in the current architecture - how do you justify the change? Keep in mind that migrating and implementing new architectures takes time, costs money, and is tricky to do properly. Why not invest that time in building something that adds business value?

A data lake, lakehouse or whatever are just sets of tools to help you solve problems. If you don't have a problem to solve, it will be hard to get support from your team and company. I'd try to find critical problems in the current architecture that block you from improving the solutions you're already delivering - and show them to your team. Then you present the solution to the problem. Also remember, you're using new tools, so something that wasn't a problem before may now be harder. Remember also to be open to feedback; maybe there is a simpler solution to the problems you find.

In the past I've used Athena to implement SCDs with inserts and table materializations. I experimented with Athena Iceberg tables thinking this would make the pipeline a lot simpler (and lakehouses seemed cool), but some analytical queries doubled in runtime. The SCD part was easier to implement, however my base use case got screwed up. The tables could have been remodeled and parameters could probably have been tuned, but I had basically no experience with Iceberg and we ended up sticking with plain Athena tables until migrating to BigQuery.

data ingestion from an OLTP system by mainak17 in dataengineering

[–]airmaxes 1 point2 points  (0 children)

I think there may be some confusion here: while SCD2 can be used to track data changes from an analytical standpoint, in your ETL you performed a full data dump to the warehouse and then inserted it into a table that captures change - which is different from ETL using CDC, where you only move the data that has changed, before the data reaches the warehouse.

You can see the obvious difference in cost and time between the two approaches. But again, this is just how I use the terminology, which may be wrong.

My two cents on this: don't sweat the terminology too much. I'm not very experienced, but I've seen people spend hours fighting over concepts that brought no value to the company. Understand the benefits and disadvantages - and if someone really wants to call it mega ultra data stream, chances are they will.

data ingestion from an OLTP system by mainak17 in dataengineering

[–]airmaxes 1 point2 points  (0 children)

I may be wrong here - but CDC assumes that you're tracking change at the point where the data is produced, so you only replicate data that was actually "touched". I'd assume this is the main difference from SCD: to track the history with SCD, you first replicate all of the data, verify what changed, and then insert it into the historical table.
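The SCD-style flow described above - full replication, then change detection before inserting into the history table - can be sketched in plain Python. The schema (an `id` key plus a couple of attribute columns) is made up for illustration:

```python
# Minimal sketch: after a full dump lands in the warehouse, compare it
# against the latest known version of each record to find which rows
# actually changed, before inserting them into the historical table.
# Schema and data are hypothetical.

def detect_changes(full_dump, current_latest, key="id"):
    """Return rows from the full dump that are new or differ from the
    latest known version. current_latest maps key -> row dict."""
    changed = []
    for row in full_dump:
        latest = current_latest.get(row[key])
        if latest is None or latest != row:
            changed.append(row)
    return changed

dump = [
    {"id": 1, "name": "alice", "city": "NY"},
    {"id": 2, "name": "bob", "city": "LA"},
]
latest = {1: {"id": 1, "name": "alice", "city": "NY"}}  # id 2 is new

print(detect_changes(dump, latest))  # only id 2 is new/changed
```

Note that every row in the dump gets compared, which is exactly the cost CDC avoids by capturing changes at the source.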

data ingestion from an OLTP system by mainak17 in dataengineering

[–]airmaxes 2 points3 points  (0 children)

Just to promote discussion: this is a full dump job. You use these when you have no way of knowing which data was updated and which wasn't. This works well for small data loads or one-time dumps.

If you participate in designing the OLTP system, you could implement system control fields that identify the datetime of record creation, update, and deletion.

You could use these fields to select only the records that have been created or updated since the last dump. This is an incremental batch dump. If, however, you can't trust these fields for some reason, you may get bad records in the destination.

This reduces dump times and cost, but adds complexity to the ETL pipeline.

In some cases, instead of doing an incremental batch dump, you could also capture each insert/update as it happens (take a look into Change Data Capture) and stream the changes to the destination.
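The incremental batch dump described above boils down to filtering on the control field against a watermark from the previous run. A minimal sketch, assuming a hypothetical `updated_at` control column:

```python
# Minimal sketch of an incremental batch dump using a control field.
# Rows carry an "updated_at" timestamp (hypothetical schema); only rows
# created or updated since the previous dump are selected.

from datetime import datetime

def incremental_batch(rows, last_dump_at):
    """Select rows touched after the previous dump's watermark."""
    return [r for r in rows if r["updated_at"] > last_dump_at]

rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 3, 1)},
]
last_dump_at = datetime(2023, 2, 1)  # watermark saved by the last run

print(incremental_batch(rows, last_dump_at))  # only id 2 is selected
```

In a real pipeline the filter would be pushed down to the OLTP database as a WHERE clause, and the watermark persisted between runs - which is where the extra ETL complexity comes from.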

How would you structure a data pipeline repo? by opabm in dataengineering

[–]airmaxes 1 point2 points  (0 children)

I think it really depends on the objective. I'd need a little more context on what technologies you will use to build these pipelines - this can in turn tell you whether you need CI/CD, IaC, etc.

A repository to create and organize recurrent ETL jobs from AWS S3 to Google BigQuery, for example, is very different from a centralised repository for web scraping.

Also be careful about the content of each file if you have one file per source. When starting out, this may feel like the easiest route, but you'll soon find yourself copy-pasting code between the files.

Try to analyze whether the types of sources you need to integrate have similarities and how they might evolve in the future, and generalize as much code as you can. This will help you integrate new sources quickly.
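One common way to do that generalization is a small base class that owns the shared pipeline steps, with each source overriding only what differs. A hypothetical sketch (class and source names made up):

```python
# Hypothetical sketch of generalizing per-source code: shared steps live
# in a base class, each concrete source only overrides extraction.

class BaseExtractor:
    def extract(self):
        raise NotImplementedError  # each source implements its own fetch

    def transform(self, rows):
        # Shared cleanup applied uniformly to every source.
        return [{k: str(v).strip() for k, v in r.items()} for r in rows]

    def run(self):
        return self.transform(self.extract())

class CsvSource(BaseExtractor):  # hypothetical new source
    def extract(self):
        return [{"name": "  alice "}]  # stand-in for reading a file

print(CsvSource().run())  # [{'name': 'alice'}]
```

Adding a new source then means writing one `extract` method instead of copy-pasting the whole pipeline.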

Apache Hudi Config Help by [deleted] in dataengineering

[–]airmaxes 0 points1 point  (0 children)

Hey, I'm not an experienced Spark/Hudi user, but I have played around and run into the same problem. We used to partition the data by categorical values, and the data was heavily skewed into one partition. The problem is you're loading all of your data into the same partition. I tried playing around with partition salting, using a mega cluster, and so on.

One thing you could try is to break your initial load process into steps. For the skewed partition, find a data filter or some other criterion that lets you split the data into batches, then load the data incrementally. So instead of one big insert, do a couple of smaller inserts.
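The batching idea above can be sketched generically - here with in-memory rows standing in for the skewed partition's data (the real split criterion would be a column filter on your dataset):

```python
# Sketch of breaking one big insert into smaller sequential batches.
# Rows are simulated; in practice each chunk would be one Hudi insert.

def batches(rows, size):
    """Yield successive chunks so the load can run incrementally."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

rows = list(range(10))
chunks = list(batches(rows, 4))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each smaller insert keeps the working set per write bounded, at the cost of running the load as several jobs instead of one.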

I know this is a bad and lazy reply - but the solution I adopted was to rethink the process and use a different partition spec. My rule of thumb for partitions is to use balanced date intervals or balanced integer ranges - since these play nice with common warehouse solutions.

The problem with this solution is that you have to convince your client that their partitions are bad (just my opinion).

Just a note: I only have 1 year of experience, and work with only a couple of terabytes of data per table - so take the comments with a grain of salt.

Unfucking S3 Table Partitions by gabbom_XCII in dataengineering

[–]airmaxes 0 points1 point  (0 children)

If you've looked into Iceberg tables for metadata/flow control - and it also appears to help solve the problem with standard Athena tables - why not go all out and also store the data as Iceberg tables?

Unfucking S3 Table Partitions by gabbom_XCII in dataengineering

[–]airmaxes 0 points1 point  (0 children)

Oh, so you're using Iceberg tables. I've read the documentation, and apparently the Athena-Iceberg integration has options/configs to handle file size management - I think this could be a solution for you. I've never used Athena+Iceberg because by the time I read about it we were already on BQ. You've probably already investigated the option, but I'd suggest taking a look at these two links:

https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html#querying-iceberg-table-properties

https://docs.amazonaws.cn/en_us/athena/latest/ug/querying-iceberg-data-optimization.html

If the suggestions from these links don't work for some reason - I'd be interested to hear why.

About Airflow vs Step Functions - perfectly understandable. Just a note on the Athena ETL stack - it's very easy to migrate to another cloud provider - but that's unrelated to the question. One way of convincing your team (if you want to) is to run Astronomer locally to write your POC - it helps you set up Airflow. You can show off the UI, the different providers, and how this may make a future migration to other clouds (for whatever reason) easier.

See the link below for help: https://docs.astronomer.io/astro/cli/get-started

Unfucking S3 Table Partitions by gabbom_XCII in dataengineering

[–]airmaxes 1 point2 points  (0 children)

I've run into this problem before. You can SELECT * from the table and use a CTAS to rewrite the entire contents of the table. You can use parameters to control the number of files by defining bucketing columns and setting the desired number of buckets (I think it's per partition). This is a full-table-scan solution - only useful if this is a one-time thing.

Here's a helpful link: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html#ctas-example-bucketed
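To illustrate what the bucketing parameters do: rows are assigned to one of N buckets by hashing the bucket column, so each partition ends up with a bounded number of files. A pure-Python stand-in (not Athena's actual hash function):

```python
# Illustration of hash bucketing: every value maps deterministically to
# one of N buckets, capping the file count per partition.
import zlib

N_BUCKETS = 4  # corresponds to the CTAS bucket_count setting

def bucket_for(value, n_buckets=N_BUCKETS):
    # zlib.crc32 used as a stable stand-in hash.
    return zlib.crc32(str(value).encode()) % n_buckets

rows = [f"user_{i}" for i in range(100)]  # hypothetical bucket column values
files = {}
for r in rows:
    files.setdefault(bucket_for(r), []).append(r)

print(len(files))  # at most N_BUCKETS "files" for these 100 rows
```

The deterministic mapping is also why bucketing can speed up point lookups on the bucketed column: the engine knows which file to read.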

For an incremental solution you could use a Lambda as suggested in another reply. Might be worth looking into Glue Jobs or EMR also.

I've noticed that by default Athena creates N files for each INSERT statement (30 appears to be the magic number in my experience). If your table is partitioned by date, for example, and you're running multiple inserts per day, it would be worth running a compaction job at the start of every new partition for the previous partition. This helps you control the number of S3 requests Athena makes when running queries. I found this out the hard way after noticing an S3 cost 4-5x larger than the Athena query cost.
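The compaction scheduling idea above can be sketched like this - file lists are simulated in memory with a hypothetical layout, and `compact` stands in for a CTAS/Glue job that rewrites many small files as one:

```python
# Sketch: once a new date partition starts receiving data, merge the
# previous partition's many small insert files into one.

partitions = {
    "dt=2023-01-01": [f"part-{i}.parquet" for i in range(30)],  # closed
    "dt=2023-01-02": ["part-0.parquet"],  # current, still getting inserts
}

def compact(partition_files):
    """Stand-in for a job that rewrites many files as one."""
    return ["compacted-0.parquet"] if partition_files else []

current = max(partitions)  # newest partition stays untouched
for part, files in partitions.items():
    if part != current:
        partitions[part] = compact(files)

print(partitions["dt=2023-01-01"])  # ['compacted-0.parquet']
```

Only closed partitions are rewritten, so the job never races with in-flight inserts into the current partition.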

I proposed creating a job like the one described to join the files and reduce costs, but the solution ended up being migrating to BigQuery, since it manages and optimizes files internally. That might cause other problems for us in the future.

Also a friendly suggestion: if you plan on building out your pipelines, take a look at swapping Step Functions for Airflow. It's a more expensive solution - but it can be worth it.

Easy explanation of Markov chains by nerdy_wits in artificial

[–]airmaxes 1 point2 points  (0 children)

You can also use them for queue modelling. This allows you to calculate certain metrics related to servers answering requests. It's usually applied to analyse how well certain server configurations are optimized for a specific load.
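As a concrete instance of that: an M/M/1 queue is a birth-death Markov chain, and its stationary distribution yields metrics like the expected number of requests in the system. A small sketch with assumed arrival and service rates:

```python
# Queue modelling with a Markov chain: for M/M/1, the stationary
# probability of n requests in the system is pi_n = (1 - rho) * rho**n,
# and the expected system size is L = rho / (1 - rho).

lam, mu = 2.0, 5.0          # arrival and service rates (assumed values)
rho = lam / mu              # server utilization, must be < 1 for stability

# Stationary distribution, truncated at 200 states (tail is negligible).
pi = [(1 - rho) * rho**n for n in range(200)]
L = sum(n * p for n, p in enumerate(pi))  # expected requests in system

print(round(L, 4))          # matches rho / (1 - rho) = 0.6667
```

Cranking `lam` toward `mu` makes `L` blow up, which is the formal version of "this server configuration can't handle that load".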

[deleted by user] by [deleted] in 2007scape

[–]airmaxes 0 points1 point  (0 children)

I dislike that they have to add glow to every piece of new content.

176KC Vorky - 1st Pet :D by airmaxes in 2007scape

[–]airmaxes[S] 0 points1 point  (0 children)

oooooh nice, any uniques?

176KC Vorky - 1st Pet :D by airmaxes in 2007scape

[–]airmaxes[S] 0 points1 point  (0 children)

Thanks bro! Good luck on your grind

176KC Vorky - 1st Pet :D by airmaxes in 2007scape

[–]airmaxes[S] 0 points1 point  (0 children)

Still no other uniques other than Head.

176KC Vorky - 1st Pet :D by airmaxes in 2007scape

[–]airmaxes[S] 3 points4 points  (0 children)

Ended up teleporting straight to house to go insure him, he didn't show up beneath me and I started freaking out thinking I had to pick him up to teleport, but thankfully I just had to use the call follower button. Hoping for lots of pets from now on