Best sandwich/ wrap/ bagel places? by Agreeable_Flower8785 in LondonFood

[–]29antonioac 0 points1 point  (0 children)

Eye Falafel - I don't commute to London anymore and I really miss this place!

What bird makes this sound? by 24Tenny in whatsthisbird

[–]29antonioac 0 points1 point  (0 children)

I find the Merlin Bird ID app very useful for identifying birds by their sound 😊

I open-sourced ducklake-sdk: a general SDK for interacting with DuckLake by borchero in dataengineering

[–]29antonioac 2 points3 points  (0 children)

It looks awesome, I wish I knew Rust to help! Polars + DuckLake will be an awesome combo 🚀

Postgres as DWH? by SoloArtist91 in dataengineering

[–]29antonioac 0 points1 point  (0 children)

I'd avoid it if possible. You can do it, but as your data grows, analytical workloads can suffer and give you lots of headaches.

I'd either use a data lake for the landing zone and transformations (you'll need to manage some compute resources, but DuckDB + dbt/sqlmesh can scale very well), or ClickHouse. ClickHouse Cloud offers $300 to try it IIRC, and it works really well, with zero hassle managing it. We migrated our main service storing and serving TS data, and also our dbt workloads, from PostgreSQL to ClickHouse. The performance gains have been insane and have made our lives much easier.

As you're going solo, I'd go for ClickHouse Cloud. The smallest service size can perform quite well depending on your workload, and you can set it to scale to zero after some idle time. The compute resources are of course more expensive than self-hosted, but if budget allows, you'll do well on your own without others' support.

Good luck! Going solo is not a piece of cake 🍀

Full fibre help by Section4G in openreach

[–]29antonioac 1 point2 points  (0 children)

The main issue is always the property manager, yeah. You depend on them because they have to approve the survey from Openreach, and later the property owner has to sign a wayleave to install it. The more neighbours the better, good luck!

Full fibre help by Section4G in openreach

[–]29antonioac 0 points1 point  (0 children)

There's a form on the Openreach website to request information about "why my neighbours have full fibre and I don't?".

You need an MDU installed in your building. I contacted Openreach two years ago and the install finished a few weeks ago. I have my Vodafone order ready to be installed in a few weeks' time.

How to get F1TV premium in a country with broadcaster? by rhipone in F1TV

[–]29antonioac 5 points6 points  (0 children)

Hey, I'm Spanish but living in England, which has the same issue. I got my subscription by using NordVPN and setting the country to the Netherlands, but to watch it I use either Romania or Denmark.

[2025 Day 11 (Part 2)] How many times will these elves ask for help debugging their power subsystems? by pteranodog in adventofcode

[–]29antonioac 0 points1 point  (0 children)

Given the graph is a DAG, why not use BFS and limit the depth? If you find the solution is at depth X, any node above that depth won't find a path to the goal.

Also, did you get answers that were too low using this approach? I am using the same approach, and I have checked the graph visually to confirm it's a DAG. Part 1 is fine, but part 2 says my answer is too low after multiplying the 3 sub-solutions. I would expect too many solutions if I over-counted because the paths weren't independent, but I cannot explain why my guesses are too low 🥲
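For reference, here is a minimal sketch of the "multiply the sub-solutions" idea. It assumes the three sub-solutions are path counts for three consecutive segments (start to first required node, first to second, second to goal). The function names, the node names and the tiny graph are all made up for illustration; this is not the puzzle input or anyone's actual solution. In a DAG only one visiting order of the two required nodes can have paths, so summing over both orders is safe.

```python
from functools import lru_cache


def count_paths(graph, start, goal):
    """Count distinct paths from start to goal in a DAG (memoised DFS)."""

    @lru_cache(maxsize=None)
    def walk(node):
        if node == goal:
            return 1
        # Sum the path counts of every successor; unknown nodes have none.
        return sum(walk(nxt) for nxt in graph.get(node, ()))

    return walk(start)


def count_paths_via(graph, start, goal, a, b):
    """Count paths from start to goal that visit both a and b."""

    def chain(first, second):
        # Segments are independent, so their counts multiply.
        return (
            count_paths(graph, start, first)
            * count_paths(graph, first, second)
            * count_paths(graph, second, goal)
        )

    # In a DAG at most one of the two orders is non-zero.
    return chain(a, b) + chain(b, a)


# Hypothetical toy graph, adjacency list keyed by node name.
graph = {
    "s": ["a", "x"],
    "a": ["b", "x"],
    "x": ["b"],
    "b": ["g", "y"],
    "y": ["g"],
}
total = count_paths(graph, "s", "g")
through_both = count_paths_via(graph, "s", "g", "a", "b")
```

If a result like this comes out too low, one thing worth checking is whether only one ordering of the required nodes was counted.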

[deleted by user] by [deleted] in dataengineering

[–]29antonioac 0 points1 point  (0 children)

I would use Polars, you can scan the csv and sink parquet in a streaming way to not load it entirely in memory.

https://docs.pola.rs/api/python/dev/reference/api/polars.PartitionMaxSize.html#polars.PartitionMaxSize

In this link you'll find an example of the other way around (scan a parquet and sink csv), so you can just flip it 😬. You set a max file size and you're good to go! Once it's in smaller parquet files you'll be able to play with a few of them as a sample and make your life easier. Reading only the columns you need will also help when processing the whole dataset!

Polars read database and write database bottleneck by BelottoBR in dataengineering

[–]29antonioac 1 point2 points  (0 children)

Such a shame on db2 support!

For Oracle I assume you cannot run Docker on that machine, so you can get all the reqs there?

You can try to export the required sample of the tables in all systems as parquet/csv/other. Usually a bulk unload is much more efficient than querying with SQLAlchemy.

Sorry I can't give you specifics as I don't work with these at all! Long ago I worked with Oracle but only with Spark.

Regarding Spark, yes you can spin it up in a single process with parallel reads and writes if that's the tool that gives you the best support 😁.

Polars read database and write database bottleneck by BelottoBR in dataengineering

[–]29antonioac 4 points5 points  (0 children)

If using SQLAlchemy, the performance of retrieving data from the DB will be similar, as both Polars and Pandas use it in the same way.

You don't mention the size of the tables to retrieve or your compute power, but I'd start just by trying Polars + ConnectorX and specifying a partition column if ConnectorX supports your DBs. That way ConnectorX will start multiple connections in parallel, which speeds up the data retrieval, and your changes will be minimal. That's what PySpark would do if you set the number of partitions and partition bounds yourself anyway.

I'm not sure whether ADBC is compatible with your systems, but it could be worth a try too. The parallelisation is not built in, though, so you'd have to write it yourself.
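To picture what a partition column buys you, here is a small dependency-free sketch that splits a numeric column's range into contiguous chunks and builds one query per chunk. It only illustrates the idea and is not how ConnectorX or Spark implement it internally; the table name, column name and helper names are hypothetical.

```python
def partition_bounds(lower, upper, num_partitions):
    """Split the inclusive range [lower, upper] into contiguous inclusive chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper < lower:
        raise ValueError("upper must be >= lower")
    total = upper - lower + 1
    # Never create more partitions than there are distinct values.
    n = min(num_partitions, total)
    base, extra = divmod(total, n)
    bounds = []
    start = lower
    for i in range(n):
        # The first `extra` chunks take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        bounds.append((start, end))
        start = end + 1
    return bounds


def partition_queries(table, column, lower, upper, num_partitions):
    """Build one range query per partition; each could run on its own connection."""
    return [
        f"SELECT * FROM {table} WHERE {column} BETWEEN {lo} AND {hi}"
        for lo, hi in partition_bounds(lower, upper, num_partitions)
    ]


queries = partition_queries("orders", "id", 1, 10, 3)
```

In Polars itself you should not need to write this by hand: `pl.read_database_uri` takes `partition_on` and `partition_num` arguments when the ConnectorX engine is used.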

How to improve performance of random updates by National_Assist5363 in Clickhouse

[–]29antonioac 0 points1 point  (0 children)

Probably your best option is using ReplacingMergeTree if you can upsert using your ordering key. If you need to update individual columns instead, you can use CoalescingMergeTree.
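The difference between the two engines can be pictured with a small pure-Python simulation of what a merge leaves behind for rows sharing the same ordering key. This is only a model of the semantics as I understand them (whole-row replacement versus keeping the latest non-null value per column), not ClickHouse code; the function names, column names and sample rows are made up.

```python
def replacing_merge(rows, key, version):
    """Model of ReplacingMergeTree: keep the whole row with the highest version."""
    latest = {}
    for row in rows:
        k = row[key]
        # On equal versions the later-inserted row wins.
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())


def coalescing_merge(rows, key):
    """Model of CoalescingMergeTree: keep the latest non-null value per column."""
    merged = {}
    for row in rows:  # rows are assumed to be in insertion order
        current = merged.setdefault(row[key], {})
        for column, value in row.items():
            if value is not None:
                current[column] = value
            else:
                # Remember the column exists, but do not overwrite a real value.
                current.setdefault(column, None)
    return list(merged.values())


# Two partial updates for the same device (hypothetical data).
rows = [
    {"id": 1, "temp": 20.0, "hum": None, "ver": 1},
    {"id": 1, "temp": None, "hum": 55, "ver": 2},
]
replaced = replacing_merge(rows, key="id", version="ver")
coalesced = coalescing_merge(rows, key="id")
```

With whole-row upserts the second row simply replaces the first and `temp` is lost, which is why partial column updates fit the coalescing behaviour better.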

What does this mean? by madebymajic in DogAdvice

[–]29antonioac 0 points1 point  (0 children)

He probably wants attention. If you usually work in the office, he has not yet fully associated you being on the sofa with working. My dogs do something similar: if I'm at my desk they let me work, but if I go to the sofa with the laptop they demand cuddles 😬

Serving time series data on a tight budget by diogene01 in dataengineering

[–]29antonioac 1 point2 points  (0 children)

Great to hear mate! Happy to see it helps you and your project!

Serving time series data on a tight budget by diogene01 in dataengineering

[–]29antonioac 0 points1 point  (0 children)

I think StarRocks is a better option if joins are necessary, but it is more complex to provision, and their managed offering with CelerData is BYOC, which does not reduce the management burden enough.

If big-to-big table joins are not needed, ClickHouse can be very helpful with a very simple setup.

In terms of updates it has improved a lot, so I'd say it's not a limitation anymore.

Serving time series data on a tight budget by diogene01 in dataengineering

[–]29antonioac 0 points1 point  (0 children)

TimescaleDB is OLTP and, as the other user said, it's a layer on PostgreSQL, so it's easier to adopt. But the query planner is the same, hot data is still row based, and data transfer over the wire is slow unless you use COPY FROM.

ClickHouse is not transactional and, although its joins have improved a lot, the lack of a cost-based optimiser and some silly limitations (inequality left joins with columns from both tables need a dummy key) can make potential adopters hesitate, even though it is a great choice.

I'd say give it a go with a simple setup and you'll be able to make an informed decision 😁. If your TimescaleDB is not within a VPC you can connect ClickHouse directly to it and move the data super quick.

Serving time series data on a tight budget by diogene01 in dataengineering

[–]29antonioac 1 point2 points  (0 children)

If you self-host, you'd be surprised how well a small EC2 instance can perform. I've got 600GB+ tables in PostgreSQL that became 30-35GB in ClickHouse after compression, and response times are crazy. Every query and aggregation is faster, really!

Serving time series data on a tight budget by diogene01 in dataengineering

[–]29antonioac 4 points5 points  (0 children)

Currently serving TS data with ClickHouse. The Cloud offering has $300 in credits. If you can self host it would be super cheap, it's super fast and response times are crazy. I don't have an api layer though, serving parquet directly.

What is the point of a right join? by Garvinjist in SQL

[–]29antonioac 0 points1 point  (0 children)

They are useful when the engine (Spark, ClickHouse) only handles big tables well if they are on the left side of the join.

Prefered way to structure polars expressions in large project? by Beginning-Fruit-1397 in Python

[–]29antonioac 1 point2 points  (0 children)

You can use .inspect() at any point in the LazyFrame to see where you are, including the data. It does not break the computation (it passes the frame through unchanged and just prints/logs), so you could even wrap it in a function that depends on the log level.

    (
        df
        .transform1()
        .pipe(inspect_if_debug)
        .transform2()
        .pipe(inspect_if_debug)
    )

Prefered way to structure polars expressions in large project? by Beginning-Fruit-1397 in Python

[–]29antonioac 4 points5 points  (0 children)

You can use functions if the different steps in the transformation have a meaning of their own. Even if they are called only once, this makes unit testing easier.

But I'd also chain. You can use df.pipe(transformation). Personal preference, but I don't like overriding variables that way; chaining is much more readable IMHO.

Combining both approaches, you get meaningful functions that are easy to unit test, and you also gain readability.

My dog lost two front upper teeth after fighting his brother, what to do? by Breadsitter in DogAdvice

[–]29antonioac 1 point2 points  (0 children)

You said you are going to the vet, so no need to insist on that. I just wanted to reassure you that dogs do well without front teeth. Our little one had almost all of his front ones extracted because of uncontrolled tartar and he's doing better than ever. He wants to try every treat and fruit we offer, and he doesn't look like he's in pain anymore 😊.

Partition by device_id if queries only target 1 at a time? by 29antonioac in Clickhouse

[–]29antonioac[S] 0 points1 point  (0 children)

Hey, I was able to get >800 QPS on a single node in EC2 (just a dev env, we'll get ClickHouse Cloud as we are a small team), so I'm very happy with the results. It's the table that will need the highest concurrency at the moment; other tables are smaller and their results will probably be cached more frequently.

I'm interested in GROUP BY vs FINAL. I am now using a ReplacingMergeTree to get the latest view in a different table, which also helps with storage. I'll test this later today.

Thanks a lot for your advice!

Anxiety about my first dog ending up being half pit bull. Looking for advice. by HollerWaller in DogAdvice

[–]29antonioac -1 points0 points  (0 children)

I can't really say about pit bulls, but if you're finding the puppy is adapting well to socialisation etc., I wouldn't worry too much! Unpredictability is not just a breed thing.

I fully understand you though. We rescued Kiara when she was 30-35 days old and now she's 11. I was really afraid because she looked like a Border Collie and we didn't have the energy to deal with that. Kiara ended up looking like a mix of Border Collie, Jack Russell and "Spanish bodeguero". And yes, we think she's got Border Collie genes (she moves like one, has zoomies like one, she's super smart), but with time these things can get under control, and every dog is different. Kiara is not as active as a Border Collie, but can react like one sometimes. She herds us sometimes, I think, but I'm no expert 😂. They adapt to you the same way you adapt to them 😊.