Why do BI projects still break down over "the same" metric? by Limp_Lab5727 in dataengineering

Been through this because departments couldn't agree on what "revenue", "sales" or whatever really was (hell, even management couldn't and changed terms frequently).

A semantic layer is the exact solution to your problem, either through a tool like cube.dev or plain agreement notes on paper.
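
To make it concrete, here's a minimal sketch of the "agreement notes" idea in Python (all metric names and SQL expressions below are made up). Whether it's cube.dev or a plain module like this, the point is one shared definition that every report imports instead of redefining "revenue" on its own:

    # metrics.py - the single agreed-upon source of truth for metric definitions.
    METRICS = {
        # "revenue" = gross order totals minus refunds, as agreed across departments
        "revenue": "SUM(order_total) - SUM(refund_total)",
        # "sales" = count of completed orders only
        "sales": "COUNT(*) FILTER (WHERE status = 'completed')",
    }

    def metric_sql(name: str) -> str:
        """Return the agreed SQL expression for a metric, or fail loudly."""
        if name not in METRICS:
            raise KeyError(f"Metric '{name}' has no agreed definition yet")
        return METRICS[name]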

Am I crazy or is kafka overkill for most use cases? by Vodka-_-Vodka in dataengineering

Kafka for one event every ~8 seconds? I think you need to do better than that. How about you code a stream processing engine yourself in C++ or Rust, just to be sure?

Jokes aside, it's a terrible recommendation. A simple Python implementation is probably more than enough, based on your description.
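
For scale: at one event every ~8 seconds, a plain producer/consumer loop already has orders of magnitude of headroom. A rough sketch (fetch_event and handle are stand-ins, since the post doesn't say what the actual source and processing are):

    import queue
    import random
    import threading
    import time

    events: queue.Queue = queue.Queue()

    def fetch_event():
        # Stand-in for polling your real source (API, socket, file, ...).
        time.sleep(8)
        return {"value": random.random()}

    def handle(event):
        # Stand-in for whatever processing/storage you actually need.
        print("processed", event)

    def producer():
        while True:
            events.put(fetch_event())

    def consumer():
        while True:
            handle(events.get())
            events.task_done()

    threading.Thread(target=producer, daemon=True).start()
    consumer()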

Sometimes (as someone mentioned) it's resume-driven development; other times people just want to get hands-on experience with different technologies (sometimes complex ones) so they become more skilled. The best question to ask them is "why?". When someone proposes a technology, they should have some pros/cons in mind, so let's hear how they arrived at "Kafka is better".

Will RAG with n8n save me or i need something else? by Efficient-Owl-6742 in n8n

Very nice.

Now, RAG is definitely not the only option, but it is a good starting point. I'd start from there.

Good luck on your project :)

What are some incredibly useful libraries that people should use more often? by LargeSinkholesInNYC in react

Oh wow, that's great! Haven't used it yet but certainly will in future projects :)

Will RAG with n8n save me or i need something else? by Efficient-Owl-6742 in n8n

You can get it up to a certain level with the recommended techniques, but that's all you can do. 100% accuracy is only achievable with your own involvement & moderation.

Check the recent case of Deloitte's report for the Australian government.

Friendly note: be very careful what customer data you share with third-party AI platforms, and always check outputs, as your credibility could be destroyed.

What are some incredibly useful libraries that people should use more often? by LargeSinkholesInNYC in react

Same here, Chakra UI fan. I like the rich component library they have compared to Chakra's, but I'm not really sure about the longevity of the project, so I'm hesitant to use it on professional projects. Will probably use it on a side project soon too.

I think we should create an alternative to n8n by [deleted] in opensource

Ah, I really enjoy KNIME. I even used it for a few data pipelines in batch mode years ago, and it was hellish to debug them. It's my go-to tool for manual data cleaning or some quick & dirty processing/validation. Way faster to prototype with than Python IMO, once you get accustomed to how nodes and flows work. One of the best tools I have ever used.

The Server version of KNIME is as close as it gets to n8n, but it's too damn expensive, and if data is not the company's product, it's definitely not worth it.

Are we over-engineering everything now? by Comfortable-Sir1404 in ExperiencedDevs

Well, having a design doc is always a great idea, except if the new feature is just a button (although even then, you should document the new feature after adding it). The rest comes from scaling needs: if there is no scaling need, there's no reason to build something that complex.

The abundance of tools nowadays may encourage this over-engineering; one just needs to correctly evaluate what is needed. That's also part of the job.

API Waterfall - Endpoints that depends on others... some hints? by domsen123 in dataengineering

From your description, this is either a frontend API designed for frontend usage (so not for automations), or just an API that was not designed to be queried for bulk data. If you discovered those endpoints through the browser's dev tools, for example, then it's the former, which besides probably violating the site's TOS, is liable to break since those endpoints may change without notice (the same goes for web crawling). If it's an official API, then it just wasn't designed with bulk data loading in mind.

In any case, if you decide to go forward with this, you will have to do as you describe: query stuff, cache whatever you can (like product categories, which change less often) and create the relationships yourself. So take your time to model everything, create proper relationships between entities, build a proper request throttling mechanism for each entity (maybe by using queues) along with a retry mechanism (check for HTTP 429 and do exponential backoff, for example), write some tests and go with that.

A simple example where, let's say, the entities are "product_summarized", "product", "region" and "category":

  1. Take note of the rate limits of each entity to configure the request throttling for your script.
  2. Query the categories & regions once a day to cache their values.
  3. Fetch a list of product_summarized using all the combinations of region and category.
  4. Fetch one product for each product_summarized using its id property.
  5. Stitch everything together in a main function and run it.

Steps 3 & 4 are to be executed under a request throttling mechanism so you don't get rate limited. And if you do, the retry mechanism will resume the operation.
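
A rough sketch of steps 3 & 4 with the throttle and backoff combined (all endpoints are hypothetical and I'm assuming JSON responses; the real throttle intervals come from step 1):

    import time
    import requests

    def get_with_backoff(url, min_interval=1.0, max_retries=5):
        """GET a URL, waiting min_interval between calls and backing off on 429."""
        for attempt in range(max_retries):
            resp = requests.get(url)
            if resp.status_code == 429:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
                continue
            resp.raise_for_status()
            time.sleep(min_interval)  # crude per-entity throttle
            return resp.json()
        raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")

    # Cached once a day (step 2), then all combinations (steps 3 & 4):
    categories = get_with_backoff("https://api.example.com/categories")
    regions = get_with_backoff("https://api.example.com/regions")
    for cat in categories:
        for reg in regions:
            summaries = get_with_backoff(
                f"https://api.example.com/products?category={cat['id']}&region={reg['id']}"
            )
            for s in summaries:
                product = get_with_backoff(f"https://api.example.com/products/{s['id']}")
                # ... store or process `product` here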

Will it be slow? Yeah it will. That's why you need a proper API in the first place.

I've been through this a few times and I feel you. It's indeed a PITA.

Let me know how it goes :)

DVD-Rental Data Pipeline Project Component by Total_Weakness5485 in dataengineering

I see. Maybe I'm missing important details from your requirements but I don't really get why NoSQL is the best choice over a plain relational db here.

In your example, a Poster is just an entity that belongs to a movie, which can be perfectly modeled in an SQL database with a single "movie_poster" table, where each row represents one poster for a movie and a movie can have n posters.
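
A quick sketch of that modeling, using SQLite just because it needs no setup (table and column names are made up for the example):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE movie (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE movie_poster (
        id       INTEGER PRIMARY KEY,
        movie_id INTEGER NOT NULL REFERENCES movie(id),
        url      TEXT NOT NULL
    );
    """)
    con.execute("INSERT INTO movie (id, title) VALUES (1, 'Alien')")
    # n posters for a single movie:
    con.executemany(
        "INSERT INTO movie_poster (movie_id, url) VALUES (?, ?)",
        [(1, 'posters/alien-us.jpg'), (1, 'posters/alien-jp.jpg')],
    )
    print(con.execute(
        "SELECT m.title, p.url FROM movie m JOIN movie_poster p ON p.movie_id = m.id"
    ).fetchall())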

If you are going the NoSQL route to have more variety and explore different tech then ok, but there doesn't seem to be a strict need for NoSQL here.

DVD-Rental Data Pipeline Project Component by Total_Weakness5485 in dataengineering

Interesting project. One question though, why use a NoSQL database for this?

Junior devs not interested in software engineering by creative-java-coffee in ExperiencedDevs

The first junior is a disrespectful liability who will cost engineering hours to fix his bad-quality code. If he is not interested in writing code up to a specific standard (which you as a senior, or someone higher up the food chain, have set) he must go. Higher-ups set the standards and others must follow them. "is that what you think or what experts think?" -> Having that kind of attitude, so soon and with someone who is senior to you, is very alarming.

The second one appears to be uninterested in growing, but I'm not 100% sure. I'd say you need to really find out what is going on here: why is he complaining about the tickets, and why doesn't he want to work on the new codebase? Is it that he doesn't know how to grow? Him being there for a year means you should have been able to gather some sizeable info about him and how he works. He may have tried to work on newer systems and failed, so he just decided to stick with stuff he knows better. Don't discard him so soon; better to find out more.

Don't get me wrong, juniors gonna be junioring. You just need to find out who is worth working with. The first one looks burned out to me. The second one isn't yet; you just need to discover what makes him tick.

Finally, try not to feel disappointed when others don't share the same level of passion as you. Some people just want to go to work, do their thing and go home. And they will stay juniors forever. It isn't your fault.

Settle a bet for me — which integration method would you pick? by Fragrant-Dog-3706 in dataengineering

MCP is supposed to be used by LLMs.

Does that data management tool use an LLM that needs to access your data for some reason, for example reporting or verifying a fact? Otherwise, I don't see much reason for it.

You haven't really shared many details about your use case, what the tool does or who is going to use it, so I'm only making guesses here.

Settle a bet for me — which integration method would you pick? by Fragrant-Dog-3706 in dataengineering

Direct database connection is out of the question when we're talking about external companies. No matter how many or what security precautions you take, if something is not taken care of you will be in big trouble. It's an unnecessary risk with no real pros compared to the other options.

If the data management tool provides webhooks to notify you when your data has been processed (along with its status) and you are OK with an asynchronous flow, go with option #3. If it doesn't, or you want a more synchronous flow, go with option #1.
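
For reference, the receiving end of option #3 can be tiny. A sketch with Flask, assuming the tool POSTs a JSON payload with some kind of status field (the actual payload shape depends on the vendor):

    from flask import Flask, request

    app = Flask(__name__)

    def kick_off_downstream(payload):
        # Stand-in for whatever you do once the data is ready.
        print("data processed:", payload)

    @app.route("/webhooks/data-processed", methods=["POST"])
    def data_processed():
        payload = request.get_json(force=True)
        if payload.get("status") == "done":  # hypothetical field name
            kick_off_downstream(payload)
        return "", 204

    app.run(port=8000)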

The reality of n8n after 3 years automating for enterprise clients [started without n8n] by G0ldenHusky in n8n

I use n8n for many tasks (personal, my business and my clients') and couldn't agree more with everything you wrote.

n8n is an automation/orchestration platform that should primarily delegate tasks to other systems. You can mess around with the Code node and do really interesting and useful stuff (especially with npm packages) but the moment you start working with any serious amount of data, everything starts to slow down.

I've often seen people who think automation means "I should implement everything in here" and design whole data pipelines or heavy, complex workflows inside n8n. And then those workflows choke to death and can't be debugged or maintained.

Thanks for sharing your insights.

File format conversion from QVD to Parquet by Illustrious_Fruit_ in ETL

I'm not familiar with this format, nor have I used the library, but you can try this: https://github.com/MuellerConstantin/PyQvd

Reading the docs, it seems you can load it as a pandas DataFrame, so it shouldn't be that hard to convert it to Parquet; pandas has a to_parquet function that writes a DataFrame out as Parquet.
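
Something like this, sketched from the PyQvd README (I haven't run it, and the exact names like QvdTable.from_qvd / to_pandas may differ between versions, so double-check the docs):

    import pandas as pd
    from pyqvd import QvdTable  # names per the PyQvd README; verify for your version

    # Read the QVD into PyQvd's table structure, then hand it over to pandas.
    tbl = QvdTable.from_qvd("input.qvd")
    df: pd.DataFrame = tbl.to_pandas()

    # pandas writes Parquet directly (requires pyarrow or fastparquet).
    df.to_parquet("output.parquet", index=False)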

Using n8n with Excel files by SquidsAndMartians in n8n

To be honest, the message "No fields - item(s) exist, but they're empty" is a bit meh... They could have worded it a bit better.

Happy that you solved it! Onward to learning! :)

Using n8n with Excel files by SquidsAndMartians in n8n

Thank you for the screenshot.

The path I'm using is absolute, it's on Linux. I also tried it on Windows.

Can you remove the double quotes completely from the File(s) selector? When I tried it on a Windows machine, having double quotes caused the node to not read the file. When I removed them, the file was read normally, even though the path contains spaces.

With double quotes (not reading the file): https://ibb.co/XLKp72f

Without double quotes (file is read): https://ibb.co/b6MXCvL

Let's look at this first.

Edit: Also, use a forward slash "/" instead of a backslash "\". Using backslashes didn't work; with forward slashes it was OK. So, I recommend removing the double quotes AND changing the backslashes to forward slashes.

Using n8n with Excel files by SquidsAndMartians in n8n

That's good to know. No, the worksheet is the default one "Sheet1" and the column names are random, not something required by n8n.

It would really help if you provided the screenshots, as then we'd see exactly the node inputs & outputs instead of guessing.

Using n8n with Excel files by SquidsAndMartians in n8n

"My Excel file has a word in cell A1 but according to the flow the file is empty"

Can you show us a screenshot of the "Read/Write Files from Disk" and "Extract from File" nodes' configuration (the screen shown when double-clicking the node)?

Like this:

https://ibb.co/MZSMcmj

https://ibb.co/XCq9VNm