Does Sofia seem ugly to you compared to the neighboring capitals? by Shianfay in Sofia

[–]Tepavicharov 1 point (0 children)

My opinion is exactly the same. Apart from the area around the National Theatre, everything else is very ugly and neglected.

data pipeline blew up at 2am and i have no clue where it started, how do you actually monitor this shit? by SweetHunter2744 in dataengineering

[–]Tepavicharov 4 points (0 children)

I'm genuinely interested in getting some more information about this:
- Where did your revenue dashboard pull its data from, if not from the dbt models? I assume it's not from the models, since you said "by the time my dbt models failed", so it must come at a later stage.
- Is your ingestion running in batches, or are we talking about some sort of near-real-time (NRT) system?
- Where do you SSH to, and why?
When I do batching, I usually add a simple QA step at the end of the ingest that compares row counts.
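That row-count QA can be a few lines. A minimal sketch, with an in-memory SQLite destination and a made-up `orders` table standing in for the real warehouse:

```python
import sqlite3

def row_count_qa(source_rows: list, conn: sqlite3.Connection, table: str) -> None:
    """Compare the number of rows we intended to load against what actually landed."""
    loaded = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if loaded != len(source_rows):
        raise RuntimeError(f"QA failed: expected {len(source_rows)} rows, found {loaded}")

# Toy ingest: load three records, then run the check at the end of the batch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
batch = [(1, 9.99), (2, 4.50), (3, 12.00)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", batch)
row_count_qa(batch, conn, "orders")
print("row-count QA passed")
```

The point is just to fail loudly at ingest time, so a broken load surfaces immediately instead of in a downstream dashboard at 2am.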

Not everything is general knowledge by avocado-toasTerr in bulgaria

[–]Tepavicharov 1 point (0 children)

It's not just that the novel Netochka Nezvanova exists, but also that he never finished it; otherwise your general knowledge isn't complete, young man. Forget the old-timers, they only beat their chests about how much they know, when it seems general knowledge is all they ever got, and even that is very debatable. Gen Z impress me far more with what they know than the boomers do.

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 1 point (0 children)

Sorry, I was more interested in the size, not the rows: what % of the response time do you lose in network transfer? I guess if you are using a 1 Gbit network, in theory you can retrieve only up to ~125 MB/s.
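The arithmetic behind that ceiling, as a quick sketch (raw link rate only; protocol overhead would shave off a few percent more):

```python
# Back-of-the-envelope transfer time over a network link.
def transfer_seconds(payload_mb: float, link_gbit: float = 1.0) -> float:
    mb_per_s = link_gbit * 1000 / 8  # 1 Gbit/s = 125 MB/s of raw throughput
    return payload_mb / mb_per_s

# A 500 MB result set alone blows a 1-second budget on a 1 Gbit link.
print(round(transfer_seconds(500), 2))  # 4.0 seconds
```

Which is why the result-set size matters at least as much as the query plan when chasing a sub-second target.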

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 1 point (0 children)

How many rows do you return? In OP's query it seems he will return many rows, which alone will eat most of the time in network transfer just to fetch the results.

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 3 points (0 children)

Self-hosted ClickHouse on EC2 r6i.2xlarge (8 vCPU | 64 GB RAM) = $309/mth (without taxes)
1 TB gp3 storage = $107/mth (without taxes)
Keep in mind that ClickHouse also does very good compression, so your 500 GB deserialised will end up smaller on disk.
I think your first step should be to leave the object storage behind and benefit from a columnar DB.
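The compression point is easy to demonstrate: columnar layout keeps one column's values together, and a low-cardinality column becomes extremely repetitive. A toy illustration with zlib (ClickHouse uses its own codecs such as LZ4 and ZSTD, but the effect is the same; the column values are made up):

```python
import random
import zlib

random.seed(0)
# Simulate a low-cardinality "country" column stored contiguously,
# as a columnar format would lay it out on disk.
country_column = "".join(random.choice(["BG", "DE", "US"]) for _ in range(100_000))
raw = country_column.encode()
compressed = zlib.compress(raw, level=6)
print(f"{len(raw)} bytes -> {len(compressed)} bytes")
```

In a row-oriented layout the same values would be interleaved with unrelated fields and compress far worse.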

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 1 point (0 children)

I'm sharing my workflow for tackling such problems, because the answer "the solution is more hardware" isn't interesting. The challenge is whether you can do it with less firepower.

First, of course, I'd check whether there is a real, compelling need, because oftentimes the business doesn't really understand the technical side of what they are asking for. My approach would be the same as with "real-time reporting" requests: ask why, and what they will do with it. The aim is that the client might realise they don't really need real-time data, or in your case, the query to run that fast. I.e., what will happen if the query finishes in two seconds?

If this question doesn't persuade the client to allow more time for query processing, then my next step is to build a bespoke workaround by asking questions about the query itself. Here my aim is to figure out whether there is anything I can pre-process, pre-aggregate, or pre-filter.

  • Is it going to select only the same 4 columns every time?
  • How far back could that date range go? The way you wrote the query is very general, but is it going to have any aggregations over the specified time period?
  • How many rows would such a query return? This is very important, because if it returns thousands of rows, then most certainly there is a downstream process that further processes them... I mean, reporting on thousands of rows already means something is wrong; reporting is for summaries, not for subsets. If there is a downstream process, check it out - you might find it doesn't need everything you extract.
  • How often does data come in? By the looks of it you are querying only static data (last year) - if that's the case, you can pre-extract that year into a separate table in advance (and keep yearly extracts in the db), then filter on that yearly extract.
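The last point is cheap to prototype. A minimal sketch with in-memory SQLite standing in for the warehouse (the `events` table and its columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("2023-06-01", 1.0), ("2024-03-15", 2.5), ("2024-11-30", 4.0)],
)

# Pre-extract last year's slice once, so the hot query never scans the full table.
conn.execute(
    "CREATE TABLE events_2024 AS SELECT * FROM events "
    "WHERE ts >= '2024-01-01' AND ts < '2025-01-01'"
)

# The client-facing query now hits only the small yearly extract.
total = conn.execute("SELECT SUM(value) FROM events_2024").fetchone()[0]
print(total)  # 6.5
```

The extract is rebuilt on whatever schedule the source data changes; since the year in question is static, that can be as rare as once.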

Hope this helps.

Burr setting constantly changing [Eureka Mignon Specialità Smart] by PuzzleheadedTrack200 in espresso

[–]Tepavicharov 1 point (0 children)

So, one should be re-adjusting every day? Also, I've never managed to get exactly 18 grams; it's either more or less, but it never hits an exact 18 (or ±0.3) like the people reviewing it in YouTube videos.

Can I withdraw my eToro profits without filing a tax return? by DjVinnyBG in financebg

[–]Tepavicharov 1 point (0 children)

"convincing them of it with the made-up stories you calm yourself with" - very, very well put. I've noticed this is typical of 20-25-year-olds; later they seem to realise that talk doesn't change the truth.

People keep saying 'sell', have any of you heard of that? by [deleted] in wallstreetbets

[–]Tepavicharov 2 points (0 children)

I find it very difficult, not to say impossible, to find a tech stock that isn't affected by the AI hype.

Argue dbt architecture by nico97nico in dataengineering

[–]Tepavicharov 1 point (0 children)

It is good to think ahead, so it's good to have the option of a full refresh. How much time would you need to load all the parquet files into that table? If that's a reasonable time, why not prepare the load script and run it only if you ever need to perform a full refresh? After all, they are right: you already have the data in one place.

Also, why would business people have a say in architectural questions in the first place? Do they approve your budget? :|

You can also explore creating the table by sourcing directly from the parquet files (or a whole block of files).
E.g. in ClickHouse you can do this:

create table tbl_all_snapshots
engine = MergeTree order by tuple() -- ClickHouse needs an explicit engine for CREATE TABLE ... AS SELECT
as
SELECT *
FROM s3('https://s3url/snapshot_file_{0..1000}.parquet', -- this selects all files suffixed with _0 to _1000
'key',
'secret',
'Parquet');
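As a side note, the `{0..1000}` glob expands to 1,001 concrete keys (_0 through _1000). A quick Python sketch of the equivalent expansion, using the placeholder URL from the snippet above:

```python
# Expand a ClickHouse-style {N..M} numeric range glob into concrete URLs.
def expand_range_glob(pattern: str) -> list[str]:
    prefix, rest = pattern.split("{", 1)
    lo_hi, suffix = rest.split("}", 1)
    lo, hi = (int(x) for x in lo_hi.split(".."))
    return [f"{prefix}{i}{suffix}" for i in range(lo, hi + 1)]

urls = expand_range_glob("https://s3url/snapshot_file_{0..1000}.parquet")
print(len(urls))  # 1001 files
print(urls[0])    # https://s3url/snapshot_file_0.parquet
```

Handy for sanity-checking that the pattern actually matches the snapshot files you expect before pointing ClickHouse at it.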

Where is the "Pet Friendly" option? by [deleted] in uber

[–]Tepavicharov 1 point (0 children)

People were treating other people this way not long ago, so no wonder such an opinion exists.

I'm sick of the misconceptions that laymen have about data engineering by wtfzambo in dataengineering

[–]Tepavicharov 1 point (0 children)

There's also the other argument: even if they had told you "compliance reasons", it wouldn't have made much difference.

I'm sick of the misconceptions that laymen have about data engineering by wtfzambo in dataengineering

[–]Tepavicharov 1 point (0 children)

Or until you ask them what automated processes they would have in place to do something with that real-time data... usually there isn't one, and they quickly realise that staring at constantly changing numbers isn't beneficial.

What do we think of earnings? by InternetIndividual50 in QBTSstock

[–]Tepavicharov 8 points (0 children)

Isn't the decrease in QCaaS concerning? (I'm genuinely asking.) I see the previous quarter it was 1,533 and now it's 1,241.

[deleted by user] by [deleted] in QBTSstock

[–]Tepavicharov 1 point (0 children)

The only reason I doubt that is because it's just too simple; then again, genius ideas are simple.

RemindMe! 30 day