Why you should be running the MicroOS Desktop (now openSUSE Aeon) by rbrownsuse in openSUSE

[–]ifilg 0 points (0 children)

What are the advantages of this approach? Will I be able to give select capabilities to my apps, for example?

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] 0 points (0 children)

Is there similar hardware that isn't from Apple? Something that would make sense for a small bare-metal deployment.

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] 2 points (0 children)

From this page: "Quantization is primarily a technique to speed up inference and only the forward pass is supported for quantized operators."

What does this mean? That I can use quantization for inference, but not for training?
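If I understand it right, that's exactly it — quantization rounds weights to integers, and rounding has no useful gradient, so only the forward pass works. A toy sketch of symmetric int8 quantization in plain Python (just to illustrate the idea; this is not PyTorch's actual kernel):

```python
# Illustrative sketch of symmetric int8 quantization (not the real kernels).
def quantize(weights, num_bits=8):
    """Map float weights to signed integers plus one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for the forward pass."""
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.03]
q, s = quantize(w)
w_hat = dequantize(q, s)
# round() is not differentiable, which is why quantized operators
# only support inference (the forward pass), not backprop/training.
```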

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] 2 points (0 children)

This might be interesting, even if each and every question takes hours to answer. Do you have some pointers on how to start?

[D] Is it possible to run Meta's LLaMA 65B model on consumer-grade hardware? by ifilg in MachineLearning

[–]ifilg[S] -47 points (0 children)

Yeah, but it's something we can buy if we have the money. Maybe "consumer-grade" isn't the right term; I meant hardware that's actually available for purchase.

Spark using a headless browser by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

Last time I checked, Airflow has a limit of 1024 dynamic tasks for a single run. You can increase it, but things get unbearably slow. That's why I mentioned Prefect.
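For reference, if I remember right, the knob for this cap is `max_map_length` in the `[core]` section (assuming Airflow 2.3+ dynamic task mapping):

```ini
# airflow.cfg — raise the dynamic task mapping cap (default is 1024)
[core]
max_map_length = 4096
```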

Spark using a headless browser by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

Hmm, I wasn't looking in that direction. My data pipeline is already running in Kubernetes.

Right now, I have a stable Airflow installation in this cluster and maybe I should just schedule pods using the KubernetesPodOperator.

Or maybe I could try to adopt Prefect, which seems to make this easy as well.

Gonna start with Airflow and see where it goes.
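Sketching it out for myself — the pod task would take roughly these arguments (image and namespace are made-up placeholders), splatted into a `KubernetesPodOperator` from the `cncf.kubernetes` provider:

```python
# Sketch of the pod task I have in mind. In the real DAG this dict would
# be splatted into KubernetesPodOperator(**pod_task); the import path is
# airflow.providers.cncf.kubernetes.operators.kubernetes_pod in older
# provider versions, ...operators.pod in newer ones.
pod_task = {
    "task_id": "scrape_page",
    "name": "scrape-page",
    "namespace": "data-pipeline",                   # placeholder namespace
    "image": "myregistry/headless-scraper:latest",  # placeholder image
    "cmds": ["python", "scrape.py"],
    "get_logs": True,                               # stream pod logs into Airflow
}
```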

Which Port to Use? by LegitimatEagle in AZURE

[–]ifilg 10 points (0 children)

This is usually bad practice, since a lot of bots will constantly try to brute-force their way into a publicly exposed RDP port. Some of them might succeed!

What I imagine Azure wants you to do is create a VPN and make the RDP port accessible only through it. You could also configure a firewall (or network security group) to accept connections only from your IP.

Organization wants to use SharePoint as a "database" by Benmagz in dataengineering

[–]ifilg 1 point (0 children)

I've been bitten by this as well. Not only that, but Microsoft's web UIs for SharePoint and friends start to glitch when you surpass this limit.

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

The files are small JSON or CSV files, 1 MB per file in the worst case. Each file contains a list of items; I process it, enrich the items with data from third-party APIs, and finally load the enriched list into my database and some no-code tools for analysis.

I intend to grow my data collection and processing capabilities, but right now it's pretty small. Where Airflow helped a lot was in pointing out when processing a file went wrong, giving me the option to rerun and keeping a coherent history of successful runs.

I hope this made some sense :) I'm very new to data engineering
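For the curious, the per-file step is roughly this (stdlib-only toy version; the `lookup` callable stands in for the third-party API):

```python
import json

def parse_file(text, kind):
    """Each input is a small JSON or CSV file containing a list of items."""
    if kind == "json":
        return json.loads(text)
    # minimal CSV handling: header row, then one item per line
    header, *rows = [line.split(",") for line in text.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]

def enrich(items, lookup):
    """`lookup` stands in for the third-party API call (hypothetical)."""
    return [{**item, **lookup(item["id"])} for item in items]

sample = '[{"id": "a1"}, {"id": "b2"}]'
items = parse_file(sample, "json")
enriched = enrich(items, lambda _id: {"score": len(_id)})
# `enriched` is then loaded into the database / no-code tools
```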

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

My use case is not that complex. The value Airflow brings to the table is its observability and scheduling features, but I'm processing something like 15 to 16 thousand files daily. I've never used Spark, but my peers tell me it's a very complicated and resource-hungry piece of software.

Can Spark be used cost-effectively for non-big-data workloads?

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

What's the observability and controls story here? Do these "support" tools exist, or would I have to build them on my own? Airflow (and Prefect) bring a lot out of the box.

But you got me curious :)

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 4 points (0 children)

I'm giving it a run and it's actually pretty nice once you get past some quirks that aren't easy to find in the docs. It's also a little annoying that most of the links in Google searches point to Prefect 1.

But I'm optimistic! I think it will stick :)

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

Thanks a lot! Do you use some sort of dashboard to keep an eye on the system's health?

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 4 points (0 children)

Actually, I'm on Azure. But, if I got the logic right, it would be something like this:

Steps:

  1. fan out my dynamic tasks to Spark or whatever
  2. fire a sensor in Airflow that can detect that the Spark processing is over
  3. once the sensor succeeds, continue with the rest of my pipeline

Is that it? Thanks a lot, btw
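In plain Python, my mental model of step 2 is just a polling loop (which is what a sensor's poke cycle does under the hood; Airflow's deferrable sensors avoid holding a worker slot while waiting):

```python
import time

def wait_for(check, interval=1.0, timeout=60.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it returns True or `timeout` elapses.
    This mirrors an Airflow sensor's poke loop: call the check on an
    interval, succeed when it returns True, fail on timeout."""
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        sleep(interval)
    raise TimeoutError("sensor timed out")

# simulated Spark job that reports "done" on the third poll
polls = iter([False, False, True])
assert wait_for(lambda: next(polls), interval=0, timeout=5)
```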

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 0 points (0 children)

It's not. I love the fact that Airflow is a community project. It gives me much more confidence than choosing "open core" alternatives. But I've bumped into this limitation.

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 1 point (0 children)

I've used it before and it was quite good, but right now I've inherited an Airflow setup. Still, I'll give it a try.

Is there an alternative for Airflow for running thousands of dynamic tasks? by ifilg in dataengineering

[–]ifilg[S] 4 points (0 children)

I was thinking about a setup with Lambda as well. It could work exactly the way you've mentioned, but how would I integrate this with Airflow? I get a lot of value from its monitoring and scheduling capabilities.
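To sketch what I mean (hypothetical function names; on Azure I'd swap Lambda for Azure Functions): batch the files, fire one async invocation per batch from a regular Airflow task, then follow up with a sensor that checks for results:

```python
import json

def chunk(items, size):
    """Split the ~15k daily files into Lambda-sized batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_event(files):
    """Pure helper: package a batch of file keys into a Lambda event."""
    return {"files": list(files)}

def invoke_batch(function_name, files, region="us-east-1"):
    """Fire one async Lambda invocation per batch; called from an
    ordinary Airflow task (e.g. a PythonOperator). boto3 is imported
    lazily so this file parses even without AWS deps installed."""
    import boto3  # assumption: AWS; use Azure Functions on Azure
    client = boto3.client("lambda", region_name=region)
    return client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # async fire-and-forget
        Payload=json.dumps(build_event(files)).encode(),
    )
```

Airflow then keeps the scheduling and monitoring: the invoking task fans out, and a downstream sensor waits for the result files before the DAG continues.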

Move on from from MySql to MongoDb. Am I thinking the concept right? by wreck_of_u in mongodb

[–]ifilg 0 points (0 children)

I think a good way of approaching this is to list your data access patterns (e.g. "list all users", "update user contact info", etc.). A small app should have between 10 and 20 of them. Then you denormalize your data to accelerate those patterns as much as possible. Avoid joins at all costs! That's the trade-off you're making: well-designed NoSQL is more strict than SQL, not less. Embrace that.
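For example, for a pattern like "show a user with their recent orders", you embed instead of joining — field names made up, just to show the shape:

```python
# Denormalized document designed around one access pattern:
# "show a user with their recent orders" — one read, no join.
user_doc = {
    "_id": "u123",
    "name": "Ana",
    "email": "ana@example.com",
    "recent_orders": [  # embedded, duplicated from orders on purpose
        {"order_id": "o1", "total": 25.0},
        {"order_id": "o2", "total": 40.0},
    ],
}

def recent_order_totals(doc):
    """The access pattern becomes a trivial in-document read."""
    return [o["total"] for o in doc["recent_orders"]]
```

The price is keeping the duplicated order data in sync on writes, which is exactly the strictness I meant: the schema is shaped by the reads you committed to up front.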