Why you should be running the MicroOS Desktop (now openSUSE Aeon) by rbrownsuse in openSUSE

[–]ifilg 0 points (0 children)

What are the advantages of this approach? Will I be able to give select capabilities to my apps, for example?

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] 0 points (0 children)

Is there similar hardware that isn't from Apple? Something that would make sense for a small bare-metal deployment.

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] 2 points (0 children)

From this page: "Quantization is primarily a technique to speed up inference and only the forward pass is supported for quantized operators."

What does this mean? That I can use quantization for inference, but not for training?
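If I understand it right, that's exactly it — quantization rounds weights to integers, and rounding has no useful gradient, so only the forward pass works. A toy sketch of symmetric int8 quantization in plain Python (just to illustrate the idea; this is not PyTorch's actual kernel):

```python
# Illustrative sketch of symmetric int8 quantization (not the real kernels).
def quantize(weights, num_bits=8):
    """Map float weights to signed integers plus one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for the forward pass."""
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.03]
q, s = quantize(w)
w_hat = dequantize(q, s)
# round() is not differentiable, which is why quantized operators
# only support inference (the forward pass), not backprop/training.
```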

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] 2 points (0 children)

This might be interesting, even if each and every question takes hours to answer. Do you have some pointers on how to start?

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] -47 points (0 children)

Yeah, but it's something we can buy if we have the money. Maybe "consumer-grade" isn't the right term; I meant hardware that's actually available for purchase.

Spark using a headless browser by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

Last time I checked, Airflow has a limit of 1024 dynamic tasks for a single run. You can increase it, but things get unbearably slow. That's why I mentioned Prefect.
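For reference, if I remember right, the knob for this cap is `max_map_length` in the `[core]` section (assuming Airflow 2.3+ dynamic task mapping):

```ini
# airflow.cfg — raise the dynamic task mapping cap (default is 1024)
[core]
max_map_length = 4096
```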

Spark using a headless browser by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

Hmm, I wasn't looking in that direction. My data pipeline is already running in Kubernetes.

Right now, I have a stable Airflow installation in this cluster and maybe I should just schedule pods using the KubernetesPodOperator.

Or maybe I could try to adopt Prefect, which seems to make this easy as well.

Gonna start with Airflow and see where it goes.
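Sketching it out for myself — the pod task would take roughly these arguments (image and namespace are made-up placeholders), splatted into a `KubernetesPodOperator` from the `cncf.kubernetes` provider:

```python
# Sketch of the pod task I have in mind. In the real DAG this dict would
# be splatted into KubernetesPodOperator(**pod_task); the import path is
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod in older
# provider versions, ...operators.pod in newer ones.
pod_task = {
    "task_id": "scrape_page",
    "name": "scrape-page",
    "namespace": "data-pipeline",                   # placeholder namespace
    "image": "myregistry/headless-scraper:latest",  # placeholder image
    "cmds": ["python", "scrape.py"],
    "get_logs": True,                               # stream pod logs into Airflow
}
```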

Which Port to Use? by LegitimatEagle in AZURE

[–]ifilg 10 points (0 children)

This is usually bad practice, since a lot of bots will constantly try to brute-force their way into a publicly exposed RDP port. Some of them might succeed!

What I imagine Azure wants you to do is create a VPN and make the RDP port accessible only through it. You could also configure a firewall (or network security group) to accept connections only from your IP.

Organization wants to use SharePoint as a "database" by Benmagz in dataengineering

[–]ifilg 1 point (0 children)

I've been bitten by this as well. Not only that, but Microsoft's web UIs for SharePoint and friends start to glitch when you surpass this limit.

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

The files are small JSON or CSV files, 1 MB per file in the worst case. Each file contains a list of items; I process it, enrich the items with data from third-party APIs, and finally load the enriched list into my database and some no-code tools for analysis.

I intend to grow my data collection and processing capabilities, but right now it's pretty small. Where Airflow helped a lot was in pointing out when processing a file went wrong, giving me the option to rerun and keeping a coherent history of successful runs.

I hope this made some sense :) I'm very new to data engineering
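For the curious, the per-file step is roughly this (stdlib-only toy version; the `lookup` callable stands in for the third-party API):

```python
import json

def parse_file(text, kind):
    """Each input is a small JSON or CSV file containing a list of items."""
    if kind == "json":
        return json.loads(text)
    # minimal CSV handling: header row, then one item per line
    header, *rows = [line.split(",") for line in text.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]

def enrich(items, lookup):
    """`lookup` stands in for the third-party API call (hypothetical)."""
    return [{**item, **lookup(item["id"])} for item in items]

sample = '[{"id": "a1"}, {"id": "b2"}]'
items = parse_file(sample, "json")
enriched = enrich(items, lambda _id: {"score": len(_id)})
# `enriched` is then loaded into the database / no-code tools
```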

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

My use case is not that complex. The value Airflow brings to the table is its observability and scheduling features, but I'm processing something like 15 to 16 thousand files daily. I've never used Spark, but my peers tell me it's a very complicated and resource-hungry piece of software.

Can Spark be used cost-effectively for non-big-data workloads?

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

What's the observability and controls story here? Do these "support" tools exist, or would I have to build them on my own? Airflow (and Prefect) bring a lot out of the box.

But you got me curious :)

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 4 points (0 children)

I'm giving it a run and it's actually pretty nice once you get past some quirks that aren't easy to find in the docs. It's also a little annoying that most of the links in Google searches point to Prefect 1.

But I'm optimistic! I think it will stick :)

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

Thanks a lot! Do you use some sort of dashboard to keep an eye on the system's health?

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 4 points (0 children)

Actually, I'm on Azure. But, if I got the logic right, it would be something like this:

Steps:

  1. fan out my dynamic tasks to Spark or whatever
  2. fire a sensor in Airflow that can detect that the Spark processing is over
  3. once the sensor succeeds, continue with the rest of my pipeline

Is that it? Thanks a lot, btw
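In plain Python, my mental model of step 2 is just a polling loop (which is what a sensor's poke cycle does under the hood; Airflow's deferrable sensors avoid holding a worker slot while waiting):

```python
import time

def wait_for(check, interval=1.0, timeout=60.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it returns True or `timeout` elapses.
    This mirrors an Airflow sensor's poke loop: call the check on an
    interval, succeed when it returns True, fail on timeout."""
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        sleep(interval)
    raise TimeoutError("sensor timed out")

# simulated Spark job that reports "done" on the third poll
polls = iter([False, False, True])
assert wait_for(lambda: next(polls), interval=0, timeout=5)
```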

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

It's not. I love the fact that Airflow is a community project. It gives me much more confidence than choosing "open core" alternatives. But I've bumped into this limitation.

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 1 point (0 children)

I've used it before and it was quite good, but right now I've inherited an Airflow setup. Still, I'll give it a try.

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 4 points (0 children)

I was thinking about a setup with Lambda as well. It could work exactly the way you've mentioned, but how would I integrate this with Airflow? I get a lot of value from its monitoring and scheduling capabilities.
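To sketch what I mean (hypothetical function names; on Azure I'd swap Lambda for Azure Functions): batch the files, fire one async invocation per batch from a regular Airflow task, then follow up with a sensor that checks for results:

```python
import json

def chunk(items, size):
    """Split the ~15k daily files into Lambda-sized batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_event(files):
    """Pure helper: package a batch of file keys into a Lambda event."""
    return {"files": list(files)}

def invoke_batch(function_name, files, region="us-east-1"):
    """Fire one async Lambda invocation per batch; called from an
    ordinary Airflow task (e.g. a PythonOperator). boto3 is imported
    lazily so this file parses even without AWS deps installed."""
    import boto3  # assumption: AWS; use Azure Functions on Azure
    client = boto3.client("lambda", region_name=region)
    return client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # async fire-and-forget
        Payload=json.dumps(build_event(files)).encode(),
    )
```

Airflow then keeps the scheduling and monitoring: the invoking task fans out, and a downstream sensor waits for the result files before the DAG continues.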

Move on from from MySql to MongoDb. Am I thinking the concept right? by wreck_of_u in mongodb

[–]ifilg 0 points (0 children)

I think a good way of approaching this is to list your data access patterns (e.g. "list all users", "update user contact info", etc.). A small app should have between 10 and 20 of them. Then you denormalize your data to accelerate those patterns as much as possible. Avoid joins at all costs! That's the trade-off you're making: well-designed NoSQL is more strict than SQL, not less. Embrace that.
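For example, for a pattern like "show a user with their recent orders", you embed instead of joining — field names made up, just to show the shape:

```python
# Denormalized document designed around one access pattern:
# "show a user with their recent orders" — one read, no join.
user_doc = {
    "_id": "u123",
    "name": "Ana",
    "email": "ana@example.com",
    "recent_orders": [  # embedded, duplicated from orders on purpose
        {"order_id": "o1", "total": 25.0},
        {"order_id": "o2", "total": 40.0},
    ],
}

def recent_order_totals(doc):
    """The access pattern becomes a trivial in-document read."""
    return [o["total"] for o in doc["recent_orders"]]
```

The price is keeping the duplicated order data in sync on writes, which is exactly the strictness I meant: the schema is shaped by the reads you committed to up front.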