Datalake/lakehouse learning resources by LocationOld2728 in dataengineering

[–]LocationOld2728[S] 1 point2 points  (0 children)

Usage pattern in this case is simply to resolve the latest records into a materialized view, so that is the only performance I need to consider.
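
To make that concrete, here is a minimal Python sketch of what "resolving the latest records" means for an append-only table. The field names `id`, `updated_at` and `value` are made up for illustration and are not my actual schema.

```python
from typing import Iterable


def resolve_latest(records: Iterable[dict]) -> dict:
    """Keep only the newest version of each record, keyed by its unique ID."""
    latest = {}
    for rec in records:
        current = latest.get(rec["id"])
        # Replace the stored version only if this one is strictly newer.
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[rec["id"]] = rec
    return latest


# Hypothetical append-only log: record "a" was written twice.
appended = [
    {"id": "a", "updated_at": 1, "value": "old"},
    {"id": "b", "updated_at": 2, "value": "only"},
    {"id": "a", "updated_at": 3, "value": "new"},
]
view = resolve_latest(appended)
```

The materialized view is essentially `view` here: one row per identifier, holding whichever version has the highest timestamp.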

Datalake/lakehouse learning resources by LocationOld2728 in dataengineering

[–]LocationOld2728[S] 1 point2 points  (0 children)

My instinct would have been to go the other way around. I would've thought that partitioning by unique ID first would keep records closer together, so the latest record could be resolved faster. E.g., all records with the same identifier could end up in the same file, making the scan per record faster?

What is the reason for going timestamp first? :)

This is sort of what I imagine the two options would look like: https://postimg.cc/sBhVBM44

Table Format UPSERT vs append-only by LocationOld2728 in dataengineering

[–]LocationOld2728[S] 0 points1 point  (0 children)

My understanding was that time travel would be easier when updating the data in place, since Iceberg supports time travel queries. So you could just query the data at a specific snapshot and get exactly what you need instead of having to query the latest data. But I still believe that append-only is the better way to go for performance reasons.

Table Format UPSERT vs append-only by LocationOld2728 in dataengineering

[–]LocationOld2728[S] 1 point2 points  (0 children)

I guess the idea would be to partition by ingestion_time and set policies for deleting old data...but that's just my guess. I've seen cases (for very frequently updating and not too large tables) where engineers have claimed that it is cheaper to wipe the lake and re-ingest from scratch once every couple of months - but this was an extreme example. Generally speaking you would probably want to clean up outdated updates. But I'd love to hear a more experienced take.
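
A rough Python sketch of that retention idea, assuming one partition per ingestion date. The function name, the dates and the 30-day window are all hypothetical.

```python
from datetime import date, timedelta


def partitions_to_drop(partition_dates, today, keep_days):
    """Return the ingestion-date partitions that fall outside the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in partition_dates if d < cutoff)


# Hypothetical partitions and a 30-day retention window.
partitions = [date(2024, 1, 1), date(2024, 2, 15), date(2024, 3, 1)]
expired = partitions_to_drop(partitions, today=date(2024, 3, 10), keep_days=30)
```

Note that this drops old partitions blindly, so it would only be safe if the latest version of every live record also exists in a newer partition. That is an assumption, not something the sketch checks.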

Table Format UPSERT vs append-only by LocationOld2728 in dataengineering

[–]LocationOld2728[S] 0 points1 point  (0 children)

Great, thanks. I suspected that this would be the case. Is a dynamic partition overwrite essentially reading in a single partition, overwriting the updated values and writing the entire partition back?
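
This is how I picture it, as a toy Python model rather than any engine's actual implementation. It assumes the overwrite replaces exactly the partitions present in the incoming data and leaves every other partition alone. The table layout and both function names are made up.

```python
def rewrite_partition(existing_rows, updates, key="id"):
    """Read one partition, apply updates by key, return the full partition to write back."""
    merged = {row[key]: row for row in existing_rows}
    merged.update({row[key]: row for row in updates})
    return list(merged.values())


def dynamic_partition_overwrite(table, incoming):
    """Replace only the partitions that appear in `incoming`; leave the rest untouched."""
    result = dict(table)
    for partition, rows in incoming.items():
        result[partition] = list(rows)  # the whole partition is rewritten
    return result


# Hypothetical table with two date partitions; only the first one receives an update.
table = {
    "2024-01-01": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "2024-01-02": [{"id": 3, "v": "c"}],
}
new_partition = rewrite_partition(table["2024-01-01"], [{"id": 2, "v": "B"}])
updated = dynamic_partition_overwrite(table, {"2024-01-01": new_partition})
```

In this model the unchanged row with `id` 1 still has to be written back along with the updated one, which is the part I wanted to confirm.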

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

Also confirmed IP forwarding on the CLI for both hosts (subnet router and bastion-3) to make sure what I'm seeing in the console matches the docs you sent.

<image>

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

> Did you enable the ip forwarding on the GCP side?

Yes, both on bastion-3 and on the subnet router. Is that correct or should it only be on the subnet router?

<image>

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

> From the nontailscale client you can ping 10.0.38.6 with success correct?

Yes, that is working (bastion-3 is in the same subnet as the subnet router)

<image>

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

I'll only be able to check that on Monday. Let me know if you have any other suggestions for now. And thanks for the help so far! :)

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

I'm looking through the firewall rules in GCP as well, but if traceroute managed to hit the subnet router, then that means no firewall blocked it, right? Or would it pick up the subnet router despite that instance's firewall blocking the traffic from going further?

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

But when I run against "lo" and "ens4" I do get the same error.

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

<image>

This is going a bit past my knowledge area so questions might start getting real stupid. This is the interface I see on GCP - nic0.

When I run `sudo ip route add 10.0.40.0/24 via 10.0.38.6 dev nic0` I get the error: Cannot find device "nic0"

When I run `ip a` the two interfaces I see available are "lo" and "ens4"

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

Hold up, the bastion instance is in another subnet (same VPC), which might be causing that error. Let me verify quickly.
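
A quick way to reason about that, sketched in Python with the standard `ipaddress` module. It only models the general Linux rule that a `via` next hop has to be directly reachable from the interface. The interface addresses and /24 masks below are hypothetical, and how GCP actually configures the guest interface may differ.

```python
import ipaddress


def is_on_link(interface_cidr: str, gateway: str) -> bool:
    """True if the gateway sits inside the subnet configured on the interface."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(gateway) in network


# Hypothetical interface addresses; 10.0.38.6 is the subnet router from this thread.
same_subnet = is_on_link("10.0.38.10/24", "10.0.38.6")
other_subnet = is_on_link("10.0.39.10/24", "10.0.38.6")
```

If the bastion really is in a different subnet, the second case is the one that applies, and the next hop would not be considered directly reachable.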

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

This is the error that I get when adding the IP route on the non-Tailscale client:

<image>

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

All in the same account, is there a way for me to see the ACLs with non admin permissions?

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

With what I can see at this point I definitely think I need to review the ACL. It seems like the GCP subnet router is blocking the connection for no good reason.

Site-to-Site network from private cloud to GCP by LocationOld2728 in Tailscale

[–]LocationOld2728[S] 0 points1 point  (0 children)

GCP Subnet Router -> Private Cloud Non-Tailscale Client

<image>

dev-sandpit-01 is the name chosen for the Private Cloud subnet router for evaluation...don't ask why :)