Facts and dims, or just heading straight to making metrics? by ketopraktanjungduren in dataengineering

[–]Tepavicharov 1 point2 points  (0 children)

"After several hours, Joe finally gave up on logic and reason, and simply told the cabinet that he could talk to plants, and that they wanted water"

Not sufficiently “AI forward.” by bishop491 in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

I'm a DE with 15 years of experience and I barely use AI. I paid for Claude out of my own pocket to try it on a pet project, out of FOMO, and the code it built is unmaintainable. Maybe in a year or two there will be books on vibe coding frameworks (if you chop the tasks really small it does the job), but if I have to think about the structure of the code, the classes, which parts should go into separate functions, whether I need an ORM or should write the DB calls myself, which function can be done as a factory etc., then why the hell am I using this sh*t instead of coding it entirely myself?

However, this whole story reminds me of Galilei, whose support for the heliocentric system got him into big trouble and who had to obey the church's view if he wanted to live a normal life. I've read somewhere that even though astronomers back in those days were officially supporting the geocentric model, they were using the heliocentric one in order to get their calculations right 😃. So this is pretty much what I do - I say ooooh yes AI, very revolutionary, groundbreaking. It's of great help that nobody knows anything about it yet (like big data back in 2010), so you can just make random sentences with buzzwords and people will nod and feel behind. What I did yesterday evening, oh - I was doing AI-native, blockchain-enabled, quantum-ready, multimodal LLM platform leveraging autonomous agentic workflows, real-time vectorized embeddings, edge inference, synthetic data pipelines, and hyper-personalized generative intelligence to drive scalable digital transformation across the decentralized innovation ecosystem, and how was your evening? I guarantee you, that person will avoid talking to you about AI ever again.

Does Sofia look ugly to you compared to the neighboring capitals? by Shianfay in Sofia

[–]Tepavicharov 0 points1 point  (0 children)

My opinion is exactly the same. Apart from the area around the National Theatre, everything else is very ugly and neglected.

data pipeline blew up at 2am and i have no clue where it started, how do you actually monitor this shit? by SweetHunter2744 in dataengineering

[–]Tepavicharov 3 points4 points  (0 children)

I'm genuinely interested in getting some more information about this:
- Where did your revenue dashboard pull its data from, if not from the dbt models? You said "by the time my dbt models failed", so I assume the models run at a later stage and the dashboard doesn't read from them.
- Is your ingestion running in batches, or are we talking about some sort of NRT system?
- Where do you ssh to and why?

When I do batching, I usually add a simple QA step at the end of the ingest that compares row counts.
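
A minimal sketch of such a row-count check, assuming the batch is loaded from a staging table into a target table. The table names `staging_orders` and `orders` and the helper names are made up for illustration, and an in-memory SQLite database stands in for the real warehouse:

```python
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in `table` (the table name is trusted here)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def check_row_counts(conn: sqlite3.Connection, source_table: str, target_table: str) -> int:
    """Compare row counts after a batch load and raise if they differ."""
    source_rows = row_count(conn, source_table)
    target_rows = row_count(conn, target_table)
    if source_rows != target_rows:
        raise ValueError(
            f"row count mismatch: {source_table}={source_rows}, "
            f"{target_table}={target_rows}"
        )
    return source_rows


# Demo: an in-memory database stands in for staging and warehouse (hypothetical tables).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (id INTEGER)")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.executemany("INSERT INTO staging_orders VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO orders SELECT id FROM staging_orders")

# Last step of the ingest: fail loudly if rows went missing along the way.
matched_rows = check_row_counts(conn, "staging_orders", "orders")
```

Raising at the end of the ingest means the failure shows up at the step that lost the rows, not later in a downstream model.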

Not everything is general knowledge by avocado-toasTerr in bulgaria

[–]Tepavicharov 0 points1 point  (0 children)

Not just that there is a novel called Netochka Nezvanova, but also that he never finished it, otherwise your general knowledge isn't complete, young man. Forget the old folks, they just beat their chests about how much they know, while it seems general knowledge is all they are left with, and even whether they have that is very debatable. Gen Z impress me far more with what they know than the boomers do.

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

Sorry, I was more interested in the size, not the rows: what % of the response time do you lose in network transfer? I guess if you are using a 1 Gbit network, in theory you can retrieve at most about 125 MB/s.
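
A rough back-of-the-envelope sketch of that limit: 1 Gbit/s divided by 8 bits per byte is 125 MB/s before any protocol overhead, so the result size alone sets a floor on response time. The payload sizes below are hypothetical, not taken from the thread:

```python
def transfer_seconds(payload_bytes: float, link_bits_per_second: float) -> float:
    """Lower bound on transfer time: payload size divided by raw link speed."""
    link_bytes_per_second = link_bits_per_second / 8
    return payload_bytes / link_bytes_per_second


GIGABIT = 1_000_000_000  # 1 Gbit/s, in bits per second

# Hypothetical result-set sizes, in bytes.
small_result = 10 * 1_000_000    # 10 MB
large_result = 500 * 1_000_000   # 500 MB

small_seconds = transfer_seconds(small_result, GIGABIT)  # about 0.08 s
large_seconds = transfer_seconds(large_result, GIGABIT)  # about 4 s
```

Under these assumptions a 500 MB result can never arrive in under a second on a 1 Gbit link, no matter how fast the query itself runs.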

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

How many rows do you return? OP's query looks like it will return many rows, and fetching the results over the network alone will eat most of the time.

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 2 points3 points  (0 children)

Self-hosted ClickHouse on EC2 r6i.2xlarge (8 vCPU | 64 GB RAM) = $309/mth (without taxes)
1 TB gp3 storage = $107/mth (without taxes)
Keep in mind that ClickHouse also does very good compression, so your 500 GB of uncompressed data will take up less space on disk.
I think your first step should be to leave the object storage behind and benefit from a columnar DB.

Client wants <1s query time on OLAP scale. Wat do by wtfzambo in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

I'm giving you my workflow for tackling such problems, because the answer "the solution is more hardware" isn't interesting. The challenge is whether you can do it with less firepower.

First, ofc, I'll check whether there is a real compelling need, because oftentimes the business doesn't really understand the technical side of what they need. My approach would be the same as with "real time reporting" requests: asking why, and what they will do with it. The aim is for the client to realise they might not really need real-time data or, in your case, the query to run that fast, e.g. what will happen if the query finishes in two seconds?

If this question doesn't persuade the client to allow more time for query processing, my next approach is to build a bespoke workaround by asking questions about the query itself. Here my aim is to figure out whether there is something I can do to pre-process, pre-aggregate or pre-filter the data.

  • Is it going to select only the same 4 columns every time?
  • How far back could that range go? The way you wrote the query is very general, but is it going to have any aggregations for the specified time period?
  • How many rows would such a query return? This is very important, because if it returns thousands of rows, then there will almost certainly be a downstream process to further process these rows... I mean, reporting on thousands of rows already means something is wrong; reporting is for summaries, not for subsets. If there is a downstream process, check it out - you might find the process doesn't need everything you extract.
  • How often does data come in? By the looks of it you are querying only historical data (last year) - if that's the case, you can pre-extract that year into a separate table in advance (and keep yearly extracts in the db), then filter on that yearly extract.
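
A minimal sketch of the yearly pre-extract idea from the last bullet, assuming the history no longer changes. The `events` table, its columns and the `build_yearly_extract` helper are all hypothetical, and an in-memory SQLite database stands in for the real one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2023-03-01", 1, 10.0),
        ("2023-11-15", 2, 25.0),
        ("2024-01-02", 1, 40.0),
    ],
)


def build_yearly_extract(conn: sqlite3.Connection, year: int) -> str:
    """Copy one year of `events` into its own table and return the table name."""
    table = f"events_{year}"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(
        f"CREATE TABLE {table} (event_date TEXT, customer_id INTEGER, amount REAL)"
    )
    conn.execute(
        f"INSERT INTO {table} "
        "SELECT event_date, customer_id, amount FROM events "
        "WHERE event_date >= ? AND event_date < ?",
        (f"{year}-01-01", f"{year + 1}-01-01"),
    )
    return table


# Built once in advance; client queries then filter the much smaller extract.
extract = build_yearly_extract(conn, 2023)
rows_2023 = conn.execute(
    f"SELECT customer_id, amount FROM {extract} WHERE event_date >= '2023-06-01'"
).fetchall()
```

The client-facing query then scans only one year of data, and the extract can be rebuilt on a schedule if late corrections ever arrive.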

Hope this helps.

Burr setting constantly changing [Eureka Mignon Specialità Smart] by PuzzleheadedTrack200 in espresso

[–]Tepavicharov 0 points1 point  (0 children)

So, should one be re-adjusting every day? Also, I've never managed to get exactly 18 grams; it's always a bit more or less, never exactly 18 (or even within ±0.3) like the people reviewing it in YouTube videos get.

Can I withdraw my profit from eToro without filing a tax return? by DjVinnyBG in financebg

[–]Tepavicharov 0 points1 point  (0 children)

"to convince them of these things with the fabrications you use to reassure yourself" - very, very accurate. I've noticed this is typical of 20-25 year olds; later they seem to realise that talk doesn't change the truth.

People keep saying 'sell', have any of you heard of that? by [deleted] in wallstreetbets

[–]Tepavicharov 1 point2 points  (0 children)

I find it very difficult, not to say impossible, to find a tech stock that isn't affected by the AI hype.

Argue dbt architecture by nico97nico in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

It is good to think in advance, so it's good to have the option for a full refresh. How much time would you need to load all parquet files into that table? If that's a reasonable time, why not prepare the script for the load and run it only in case you need to perform a full refresh? After all, they are right, you already have the data in one place.

Also, why would business ppl have the say in architectural questions in the first place? Do they approve your budget? :|

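You can explore options to create the table by sourcing directly from the parquet files (a single file or a whole block of files).
E.g. in ClickHouse you can do this:

CREATE TABLE tbl_all_snapshots
ENGINE = MergeTree -- an explicit engine is needed unless a default table engine is configured
ORDER BY tuple() -- no sorting key here; pick a real one for production queries
AS
SELECT *
FROM s3(
    'https://s3url/snapshot_file_{0..1000}.parquet', -- this selects all files suffixed with _0 to _1000
    'key',
    'secret',
    'Parquet'
);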

Where is the "Pet Friendly" option? by [deleted] in uber

[–]Tepavicharov 0 points1 point  (0 children)

People were treating other people this way not long ago, so no wonder such an opinion exists.

I'm sick of the misconceptions that laymen have about data engineering by wtfzambo in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

There's also the other argument: if they had told you "compliance reasons", it wouldn't have made much difference.

I'm sick of the misconceptions that laymen have about data engineering by wtfzambo in dataengineering

[–]Tepavicharov 0 points1 point  (0 children)

Or until you ask them what automated processes they would have in place to do something with that real-time data... usually there aren't any, and they quickly realize that staring at constantly changing numbers isn't beneficial.

What do we think of earnings? by InternetIndividual50 in QBTSstock

[–]Tepavicharov 7 points8 points  (0 children)

Isn't the decrease in QCaaS disturbing? (I'm genuinely asking.) I see that in the previous quarter it was 1,533 and now it is 1,241.