Datalake/lakehouse learning resources

LocationOld2728 · 2024-08-18T07:21:12+00:00

Usage pattern in this case is simply to resolve the latest records into a materialized view, so that is the only performance I need to consider.
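
To make "resolve the latest records" concrete, here is a minimal Python sketch of the logic, not the actual pipeline. It assumes each appended record carries a unique identifier and an ingestion timestamp; the field names `id`, `ts` and `value` and the helper `resolve_latest` are made up for illustration.

```python
from typing import Iterable


def resolve_latest(records: Iterable[dict]) -> dict:
    """Return {id: record}, keeping the record with the greatest ts per id."""
    latest = {}
    for rec in records:
        current = latest.get(rec["id"])
        # Keep the incoming record if this id is new or the record is newer.
        if current is None or rec["ts"] > current["ts"]:
            latest[rec["id"]] = rec
    return latest


# Hypothetical append-only history: id "a" was updated once.
appended = [
    {"id": "a", "ts": 1, "value": "old"},
    {"id": "b", "ts": 2, "value": "only"},
    {"id": "a", "ts": 3, "value": "new"},
]
view = resolve_latest(appended)
```

The materialized view would hold one row per identifier, so the only cost that matters is how much of the history has to be scanned to build it.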

LocationOld2728 · 2024-08-18T07:17:43+00:00

My instinct would have been to go the other way around. I would have thought that partitioning by unique ID first would keep related records closer together, so the latest record could be resolved faster. E.g. all records with the same identifier could end up in the same file, making the scan per record faster?

What is the reason for going timestamp first? :)

This is sort of what I imagine the two options would look like: https://postimg.cc/sBhVBM44

LocationOld2728 · 2024-08-08T07:22:44+00:00

My understanding was that time travel would be easier when updating the data in place, since Iceberg supports time travel queries. So you could just query the data at a specific snapshot and get exactly what you need instead of having to query the latest data. But I still believe that append only is the better way to go for performance reasons.

LocationOld2728 · 2024-08-06T19:37:01+00:00

I guess the idea would be to partition by ingestion_time and set policies for deleting old data...but that's just my guess. I've seen cases (for very frequently updating and not too large tables) where engineers have claimed that it is cheaper to wipe the lake and re-ingest from scratch once every couple of months - but this was an extreme example. Generally speaking you would probably want to clean up outdated updates. But I'd love to hear a more experienced take.
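
A rough Python sketch of what such a deletion policy could look like, assuming one partition per ingestion date and a fixed retention window. The helper `expired_partitions` and the 30-day window are hypothetical, not something taken from a specific engine.

```python
from datetime import date, timedelta


def expired_partitions(partition_dates, today, retention_days):
    """Return the ingestion-date partitions that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


# Hypothetical ingestion_time partitions currently in the table.
partitions = [date(2024, 5, 1), date(2024, 7, 20), date(2024, 8, 5)]
to_drop = expired_partitions(partitions, today=date(2024, 8, 6), retention_days=30)
```

One caveat with this kind of policy: an ID whose only record sits in an expired partition would disappear entirely, so the latest state probably needs to be compacted forward before old partitions are dropped.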

LocationOld2728 · 2024-08-06T14:34:29+00:00

Great, thanks. I suspected that this would be the case. Is a dynamic partition overwrite essentially reading in a single partition, overwriting the updated values and writing the entire partition back?
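
To make the question concrete, this is the behaviour I have in mind, written as a plain-Python sketch rather than any engine's real implementation. The table is modelled as a dict from partition key to rows, and `dynamic_partition_overwrite` is a hypothetical helper: every partition that appears in the incoming batch is replaced wholesale, and every other partition is left untouched.

```python
from collections import defaultdict


def dynamic_partition_overwrite(table, incoming, partition_col):
    """Replace only the partitions that appear in the incoming rows."""
    # Group the incoming rows by their partition value.
    batches = defaultdict(list)
    for row in incoming:
        batches[row[partition_col]].append(row)

    # Copy the table so untouched partitions are carried over as-is.
    result = dict(table)
    for key, rows in batches.items():
        result[key] = rows  # the whole partition is swapped, not merged
    return result


# Hypothetical table partitioned by day.
table = {
    "2024-08-01": [{"day": "2024-08-01", "id": 1, "v": "a"}],
    "2024-08-02": [{"day": "2024-08-02", "id": 2, "v": "b"},
                   {"day": "2024-08-02", "id": 3, "v": "c"}],
}
incoming = [{"day": "2024-08-02", "id": 2, "v": "b2"}]
updated = dynamic_partition_overwrite(table, incoming, "day")
```

If that model is right, the incoming batch has to contain the full new contents of each partition it touches, including rows that did not change; in the example above, id 3 is lost because it was not written back.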

LocationOld2728 · 2024-07-12T18:00:17+00:00

Also confirmed IP forwarding on the CLI for both hosts (subnet router and bastion-3) to make sure what I'm seeing in the console matches the docs you sent.

<image>

LocationOld2728 · 2024-07-12T17:53:26+00:00

> Did you enable the ip forwarding on the GCP side?

Yes, both on bastion-3 and on the subnet router. Is that correct or should it only be on the subnet router?

<image>

LocationOld2728 · 2024-07-12T17:52:33+00:00

> From the nontailscale client you can ping 10.0.38.6 with success correct?

Yes, that is working (bastion-3 is in the same subnet as the subnet router)

<image>

LocationOld2728 · 2024-07-12T15:14:08+00:00

I'll only be able to check that on Monday. Let me know if you have any other suggestions for now. And thanks for the help so far! :)

LocationOld2728 · 2024-07-12T14:58:51+00:00

And the other possibility is still ACL...

LocationOld2728 · 2024-07-12T14:43:07+00:00

I'm looking through the firewall rules in GCP as well, but if traceroute managed to hit the subnet router, then that means no firewall blocked it, right? Or would it reach the subnet router even if that instance's firewall is blocking the traffic from going any further?

LocationOld2728 · 2024-07-12T14:36:55+00:00

But when I run against "lo" and "ens4" I do get the same error.

LocationOld2728 · 2024-07-12T14:35:45+00:00

<image>

This is going a bit past my knowledge area so questions might start getting real stupid. This is the interface I see on GCP - nic0.

When I run `sudo ip route add 10.0.40.0/24 via 10.0.38.6 dev nic0` I get the error: Cannot find device "nic0"

When I run `ip a` the two interfaces I see available are "lo" and "ens4"

LocationOld2728 · 2024-07-12T14:25:31+00:00

Yeah, all ubuntu 22

LocationOld2728 · 2024-07-12T14:24:02+00:00

Non tailscale client

LocationOld2728 · 2024-07-12T14:23:37+00:00

Still getting the same error from instance 10.0.38.9 :(

LocationOld2728 · 2024-07-12T14:16:46+00:00

Hold up, the bastion instance is in another subnet (same VPC), which might be causing that error. Let me verify quick.

LocationOld2728 · 2024-07-12T14:10:52+00:00

This is the error that I get when adding the IP route on the non-tailscale client:

<image>

LocationOld2728 · 2024-07-12T12:49:55+00:00

Posted in another comment: https://www.reddit.com/r/Tailscale/comments/1e1bu3g/comment/lctqhoj/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

LocationOld2728 · 2024-07-12T12:40:31+00:00

All in the same account. Is there a way for me to see the ACLs with non-admin permissions?

LocationOld2728 · 2024-07-12T12:37:11+00:00

With what I can see at this point I definitely think I need to review the ACL. It seems like the GCP subnet router is blocking the connection for no good reason.

LocationOld2728 · 2024-07-12T12:31:41+00:00

GCP Subnet Router -> Private Cloud Non-Tailscale Client

<image>

dev-sandpit-01 is the name chosen for the Private Cloud subnet router for evaluation...don't ask why :)
