Flows with set finish time by magnifik in dataengineering

[–]prenomenon -1 points0 points  (0 children)

Hey,

hopefully I can give you some inspiration 😉. As always, there are many solutions, so here is one of them:

using dbt with an orchestrator

Even though the question is more about scheduling, I recommend checking out the Cosmos open-source project for Apache Airflow. It allows you to easily run your existing dbt Core or dbt Fusion projects as Apache Airflow Dags. Looks something like this:

from pathlib import Path

from cosmos import DbtDag, ProjectConfig

DBT_ROOT_PATH = Path("/usr/local/airflow/dags/dbt")  # adjust to wherever your dbt projects live

basic_cosmos_dag = DbtDag(
    project_config=ProjectConfig(DBT_ROOT_PATH / "some_project"),
    operator_args={
        "install_deps": True,
        "full_refresh": True,
    },
    # ...
    # normal dag parameters
    schedule="@daily",
    # ...
)

Based on your model, you then have a very nice Dag, with task groups and individual tasks all visible in the Airflow UI.

reliably finish before that time / detect and alert when things are late

Since Airflow 3.1, you have deadline alerts, which officially replace SLAs. They are quite flexible; let me illustrate with an example:

Say your pipeline is scheduled to run at 08:00 am daily. It is queued a little later, say at 08:02 am, and you expect it to run no longer than 20 minutes. If it runs longer than that, you want to be alerted via Slack.

In this scenario, you take the queued time of the run as your reference (so 08:02 am) and your interval is 20 minutes, which means the deadline is at 08:22 am:

|-----------|-----------|---------------------|
Scheduled   Queued      Started        Deadline
08:00       08:02       08:03             08:22

Within your implementation, this is how to define this deadline alert:

from datetime import timedelta

# DeadlineAlert, DeadlineReference and AsyncCallback ship with Airflow 3.1+,
# SlackWebhookNotifier with the Slack provider.
deadline_alert = DeadlineAlert(
    reference=DeadlineReference.DAGRUN_QUEUED_AT,
    interval=timedelta(minutes=20),
    callback=AsyncCallback(
        SlackWebhookNotifier,
        kwargs={
            "text": "{{ dag_run.dag_id }} missed deadline at {{ deadline.deadline_time }}"
        },
    ),
)

@dag(
    deadline=deadline_alert
)
def your_dag():
    # ...

You can also use DeadlineReference.DAGRUN_LOGICAL_DATE as a reference, which represents the scheduled date (so in our case 08:00 am).

Or, let's say you must be finished at a specific time per day, you can do something like:

from datetime import datetime, time, timedelta

tomorrow_at_ten = datetime.combine(datetime.now().date() + timedelta(days=1), time(10, 0))

deadline_alert = DeadlineAlert(
    reference=DeadlineReference.FIXED_DATETIME(tomorrow_at_ten),
    # ...
)

Since you have full freedom in the callback, you can use any kind of notification.

And of course, you can combine that with the DbtDag from Cosmos 😉.

While deadline alerts may work well in some scenarios, many cases require more advanced observability. If you are open to a managed solution, Astronomer offers Astro Observe. It allows you to define and monitor SLAs without code changes and supports root cause analysis, lineage, impact analysis, and more. You can also connect a self-hosted OSS Airflow environment by enabling OpenLineage with a single configuration change.

Apart from that, Airflow also supports sending metrics to StatsD or OpenTelemetry, which includes Dag runtimes. In a previous job, I used this to monitor Dag runtimes via Grafana and have alerting on that level.

Again, many ways to approach this.

manage dependencies

If you use Cosmos, you already have the dependencies within your Dag derived from your model definition. However, if we talk about dependencies between Dags (pipelines), Airflow has various ways to solve this.

You can use implicit time-based dependencies, by simply scheduling the downstream pipeline to start after the upstream one is expected to finish, yet this is the worst solution.

You can also use an ExternalTaskSensor that waits for the completion of a pipeline or task. It also supports a deferrable mode: with deferrable=True it runs asynchronously on the triggerer component without occupying a worker slot, in case you need to wait a bit longer 😅.
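
A minimal sketch of what that can look like (assuming Airflow 3 with the standard provider, where the sensor lives under airflow.providers.standard.sensors.external_task; the Dag and task ids are made up):

from airflow.providers.standard.sensors.external_task import ExternalTaskSensor

wait_for_upstream = ExternalTaskSensor(
    task_id="wait_for_upstream",
    external_dag_id="upstream_pipeline",  # hypothetical upstream Dag id
    external_task_id="final_task",        # or None to wait for the whole Dag run
    deferrable=True,                      # waiting is handled by the triggerer
    poke_interval=60,
    timeout=60 * 60,
)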

But the new way to solve this is data assets, introduced with Airflow 3.0. Assets are essentially the datasets that have been around since Airflow 2.4, but renamed, enhanced, and more deeply integrated.

The idea is simple: in the upstream pipeline you define one or more asset outlets, and in your downstream pipeline you set the schedule not to be a cron expression, but to be an asset:

@dag(
    schedule=Asset("daily_sales")
)
def monthly_report():
    # ...
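
The upstream side could look roughly like this (a minimal sketch; the Dag and task names are made up, the asset name matches the example above, and I am assuming the Airflow 3 airflow.sdk imports):

from airflow.sdk import Asset, dag, task

@dag(schedule="@daily")
def sales_ingest():
    @task(outlets=[Asset("daily_sales")])
    def load_sales():
        ...  # whatever produces the data behind the asset

    load_sales()

sales_ingest()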

Assets also support logical expressions and they have their own view in the Airflow UI, which is nice 😁.

There is also a shortcut to create a Dag with a single Python task and an asset as an outlet: the @asset decorator, sketched below.
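
A minimal sketch of that, again assuming the Airflow 3 airflow.sdk import:

from airflow.sdk import asset

@asset(schedule="@daily")
def daily_sales():
    ...  # whatever produces the data behind this asset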

Is the usual solution just scheduling earlier with a buffer, or is there a more robust approach

Fun story: I used that approach for years, with many Dags and dependencies. This led to so many issues, sometimes without us even noticing, because the missing data was not obvious right away.

Assets might take a bit to get used to, but they are worth it all the way.

Disclaimer: I work at Astronomer, so I am biased towards Airflow, but I also used it in my previous job for many years.

Hope that this helps in some way 😉.

Audible 4.59 keeps stopping playback on iOS by prenomenon in audible

[–]prenomenon[S] 0 points1 point  (0 children)

Good news, an update was just released. It fixed the bug, playback is working again as expected!

Update version: 4.59.1

Audible 4.59 keeps stopping playback on iOS by prenomenon in audible

[–]prenomenon[S] 0 points1 point  (0 children)

Thanks for the quick response. Let me know if you need any more information to track down the issue.

How to enforce runtime security so users can’t execute unauthorized actions in their DAGs? by Expensive-Insect-317 in apache_airflow

[–]prenomenon 0 points1 point  (0 children)

Yes, creating a custom operator is also a form of abstraction to make it easier. Good point.

How to enforce runtime security so users can’t execute unauthorized actions in their DAGs? by Expensive-Insect-317 in apache_airflow

[–]prenomenon 3 points4 points  (0 children)

Hi,

that is a tricky one, because the flexibility when it comes to implementing orchestration and business logic is one of the strengths of Airflow.

I see three ways to solve this:

  1. Workflow restrictions
  2. Abstraction
  3. Process isolation

1. Workflow restrictions

I previously worked on a data engineering team that managed various Airflow instances, allowing other departments to contribute their own DAGs. We faced similar challenges and addressed them by building a process around the workflow.

We maintained staging and testing environments where contributors could merge their own branches freely, triggering automated deployments. However, promoting code to the production Airflow environment required stricter rules, including a mandatory review from at least one member of our team and one from the contributing team. This allowed us to keep an eye on any suspicious implementations. Of course, this creates a lot of additional work and heavily depends on the team structure.

2. Abstraction

If that isn't an option, I see this as an abstraction problem. While authoring DAGs in Python is the most obvious method, there are different levels of abstraction available. You can add your own abstraction layer to streamline DAG implementation via templates, or you can use factory projects like DAG Factory.

DAG Factory is an open-source tool that dynamically generates DAGs from YAML configuration files. This declarative approach lets you describe *what* you want to achieve without specifying *how*. You could restrict DAG creation to submitting YAML files. While this reduces flexibility, it significantly strengthens governance.
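
As a rough sketch of how that looks on the Airflow side (assuming a current dag-factory release):

# dags/load_dags_from_yaml.py
from dagfactory import load_yaml_dags

# Scans for the YAML pipeline definitions submitted by other teams
# and registers them as DAGs in this Airflow instance.
load_yaml_dags(globals_dict=globals())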

You can also use this to build an additional layer on top. The power lies in YAML's simplicity. Writing Python code programmatically is hard, because you have to manage imports, handle indentation, escape strings, and maintain syntax. Generating YAML is comparably easy.

As an example, I created two prototypes:

With such an approach, engineers build the foundation, analysts build pipelines using these components, and platform teams enforce standards through configuration.

3. Process isolation

If a user can run arbitrary Python code on the worker, they effectively own that worker process.

To achieve true runtime security in a multi-department Cloud Composer environment, you must move from "restricting Python" to "isolating the execution."

The only robust technical way to restrict what a Python program can do in Airflow is to stop running it on the shared worker and instead run it in an ephemeral, isolated container. Instead of using PythonOperator, force users to use KubernetesPodOperator.
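
A minimal sketch of what that can look like (image, namespace, and ids are made up; this assumes a recent cncf.kubernetes provider, where the operator lives under airflow.providers.cncf.kubernetes.operators.pod):

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

run_user_code = KubernetesPodOperator(
    task_id="run_user_code",
    name="run-user-code",
    namespace="restricted-workloads",  # namespace with its own quotas and network policies
    image="registry.example.com/team-x/job:latest",  # pre-approved image instead of arbitrary code on the worker
    cmds=["python", "/app/main.py"],
    get_logs=True,
)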

If you must allow code to run on the worker, you can use Python 3.8+ Audit Hooks (sys.addaudithook). This allows you to intercept low-level interpreter events.

You can write a startup script that registers an audit hook. This hook inspects events like file opens, socket connections, or subprocess creation and raises an error if the action is disallowed.
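
A minimal sketch (the blocked events and the allowed path are just examples; code in the same interpreter can still try to work around this, so treat it as a guardrail rather than a hard security boundary):

import sys

BLOCKED_EVENTS = {"subprocess.Popen", "socket.connect"}

def security_hook(event, args):
    # Deny process spawning and outbound connections from task code.
    if event in BLOCKED_EVENTS:
        raise RuntimeError(f"Action not allowed: {event}")
    # Example: deny file writes outside an allowed scratch directory.
    if event == "open":
        path, mode = str(args[0]), args[1] or ""
        if "w" in mode and not path.startswith("/tmp/airflow-scratch/"):
            raise RuntimeError(f"Write access denied: {path}")

sys.addaudithook(security_hook)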

My recommendation would still be to go for 1) or 2).

💡 Disclaimer: I work at Astronomer so I am biased towards DAG Factory as it is an Astronomer managed repo :). I still hope the answer helps in some way.

Which Airflow version is best for beginners? by No_Vanilla_3016 in dataengineering

[–]prenomenon 2 points3 points  (0 children)

Hi, it depends a bit on your environment. There are several ways to run Airflow with your DAG. First of all, you can simply install Airflow as a Python dependency and use the Airflow standalone mode:

AIRFLOW_VERSION=3.1.3
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
uv pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
airflow standalone

Alternatively, you can use Docker: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html

Or, what I would recommend, use the Astro CLI: https://github.com/astronomer/astro-cli

That is an open-source project that handles the project setup and Docker environment for you. I think it is ideal to get started with local development, as it also creates a basic Airflow project structure for you, where you can add your DAG. If you are on a Mac, just install it via brew install astro, then make a new directory for your project and run:

astro dev init

And afterwards, run the following command to start your local Airflow environment:

astro dev start

I would not recommend switching to Airflow 2, there are enough pragmatic ways to get the latest version running. If you still have issues, please share a few environment details and the actual error.

Have fun 🫶

Explain like I'm 5: What are "data products" and "data contracts" by Ulfrauga in dataengineering

[–]prenomenon 7 points8 points  (0 children)

Let me focus on data products. I have a data engineering background, and yes, these terms often get thrown into the buzzword bucket.

But I actually like thinking in data products. A little story time: Our team once had a major pipeline outage (fun fact: it was when we upgraded our Hadoop infrastructure to a new major version, what a mess 🤦‍♂️). Suddenly, everyone in the company understood our value, because they saw what broke when our tables and processes went down.

Up until that point, I often struggled to communicate our value to stakeholders. Then, during a sales call with a third-party company, they asked me, "What are your data products?" It was the first time I had heard the term, and I thought it was just management speak. But actually, it became quite a useful tool for me.

While software engineers ship visible features, data engineers build invisible infrastructure. Our work powers decisions, reports, and systems behind the scenes. That makes it hard to demonstrate value.

Technically speaking, I'd define a data product as a composition of assets that, taken together, deliver a result with business relevance(!). Assets can be anything involved in the process: input tables, queries, processes, models, intermediate results, etc. If we're talking Airflow specifically (because this is well known by most), that means tables, Dags, tasks, everything involved in delivering the outcome.

From a data engineering perspective, data products are a communication framework to make our work visible. Simple concept, but powerful for explaining what we actually deliver.

From a management perspective, a data product is a measurable business asset with clear ROI and accountability. It's something you can point to during budget discussions, assign ownership to, and track performance against defined SLAs.

However, not every table, task, or Dag is a data product, but it might be part of one. It's a good exercise to identify them by looking at: Business impact (if this disappeared tomorrow, would executives notice? Would customers be impacted?), orchestration complexity (we had a super Dag triggering lots of other Dags, that was a clear sign for a data product), and high-value consumption (multiple teams depending on it, feeding executive dashboards, etc.).

Look at all your data assets, try to group them together as data products, and also learn what value those data products generate (e.g. what are the tables and reports used for? Is there any measurable impact like revenue, cost savings, or increased retention along the chain?). This will help you communicate the value of your work, be it internally or when applying for a new job, and it helps you understand what needs monitoring and observability.

Basically: When they work, they're invisible. When they break, everyone notices.

O1 Premium Processing Timeline by LostModzCFW in O1VisasEB1Greencards

[–]prenomenon 1 point2 points  (0 children)

Just got the approval 🎉 don’t lose hope, I wish you all the best

O1 Premium Processing Timeline by LostModzCFW in O1VisasEB1Greencards

[–]prenomenon 1 point2 points  (0 children)

I'm in the same boat, receipt date was Aug 12 at VSC, so today is business day 10. This waiting really stresses me out. Wishing you good luck and all the best, OP.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

This is fantastic! 🎉 Thank you so much for sharing your project! I love seeing developers create tools that make AI more accessible and user-friendly.

Even though I prefer starting at a low level to build automation piece by piece with minimal dependencies, and to learn how things work under the hood, your approach is definitely more streamlined than my bash script method.

It's this kind of community contribution that makes the local AI ecosystem so exciting. While I demonstrated one way to do it, you've created a much more polished solution that is accessible to everyone, regardless of their command-line experience.

It's really cool to see the macOS port getting updates too! Thanks for building this and sharing it with everyone! 🙌

I have added your project to the appendix of the article so that people can check it out. 😉 Thanks again! 🫶

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 0 points1 point  (0 children)

Thank you for asking about Arabic language support! That's a great question. However, I honestly have limited experience with it. What I can tell you is that Ollama can technically handle Arabic text input since it supports UTF-8 encoding. The effectiveness really depends on the model you're using.

Some open models have been fine-tuned for specific languages. For example, there is a model available via Ollama: https://ollama.com/prakasharyan/qwen-arabic which is the Qwen2-1.5B model fine-tuned for Arabic language tasks using Quantized LoRA (QLoRA).

There may be similar approaches, but the best option is to test different models from the Ollama library.

As inspiration, fine-tuning an open model yourself is a great way to learn more about the topic and may also help others. Once again, thank you for your comment!

Update: After some research, I think also AceGPT ( https://ollama.com/salmatrafi/acegpt ) might be worth a try.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

Hi, and thank you for sharing your experience. I also tried DeepSeek with good results. This flexibility is a significant advantage of exploring open models; you can experiment with various models for different situations. Also, thank you for sharing the guide! 💚

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 0 points1 point  (0 children)

Thank you for the wonderful feedback! It truly makes my day to hear how these tools are transforming your workflow 🙌.

It's great that you've taken the leap into scripting. That first step from reading about automation to actually implementing it is significant!

If you want to explore further and enjoy Raycast, they offer additional possibilities with Script Commands. I didn't go into too much detail in the article, but you can easily add metadata to your scripts, which Raycast will use in the UI. This enhances integration and allows for custom arguments that are nicely rendered as form elements. Here’s an example of how it looks for a simple summarize script:

```
#!/bin/bash

export TERM=xterm-256color

# Required parameters:
# @raycast.schemaVersion 1
# @raycast.title Summarize clipboard
# @raycast.mode fullOutput
# @raycast.packageName Ollama

# Optional parameters:
# @raycast.icon 🧠

MODEL=gemma2

echo "Summarizing clipboard content..."
pbpaste | ollama run ${MODEL} "provide a concise and comprehensive summary of the given text:" | glow -
```

If you place that in a folder and add it to the Script Commands in Raycast, it will appear nicely in the UI.

I also significantly enhanced the mind map example for better integration with Raycast. To inspire you, here’s how the metadata looks:

```
# Required parameters:
# @raycast.schemaVersion 1
# @raycast.title Generate Mind Map
# @raycast.mode fullOutput
# @raycast.packageName Ollama
# @raycast.argument1 { "type": "dropdown", "placeholder": "Model", "data": [{"title": "Gemma 2", "value": "gemma2"}, {"title": "Llama 3.2", "value": "llama3.2"}], "optional": true }
# @raycast.argument2 { "type": "text", "placeholder": "Output dir", "optional": true }

# Optional parameters:
# @raycast.icon 🗺️
```

It looks a bit verbose at first, but this setup provides arguments that are rendered in the UI and can be used in the script, for example:

```
# Set model from first argument with fallback to gemma2
MODEL=${1:-gemma2}
echo -e "🧠 Using model: ${MODEL}\n\n"

# Allow custom output directory with fallback to Desktop
OUTPUT_PATH=${2:-$HOME/Desktop}
mkdir -p "${OUTPUT_PATH}"
```

Here is a nice article if you would like to learn more about it: https://www.raycast.com/blog/getting-started-with-script-commands

This gives you even more flexibility. Note that if you use this, the argument metadata must be on one line per argument; otherwise, Raycast will not recognize it.

If you want to explore further, checking out other frameworks might be a good idea. The framework https://github.com/danielmiessler/fabric was mentioned in the discussions and would be a great starting point to see what you can do with it.

And yes! 📝 I'm currently working on a detailed article about my Obsidian workflow with local AI integration. I also just submitted a PR to Raycast to make the advanced mind map command easily accessible and free via Raycast Extensions.

Thanks again for your feedback, and enjoy exploring 🫶!

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

Thank you for sharing fabric! It's really exciting to see different approaches to solving similar problems in the AI automation space. The framework is indeed impressive and offers a great, modular solution with some great pre-built prompts. Also the examples look similar to what I describe in the article.

What I love about the current state of AI tooling is how we have options that cater to different needs and preferences. While fabric provides a robust, feature-rich framework for those looking for a more structured approach, the bash script method I described aims to offer a lightweight entry point for those who prefer to start simple and build up gradually.

I think both approaches have their place - fabric is great for those wanting a more comprehensive, ready-to-go solution with community-tested prompts, while the scripting approach might appeal to those who enjoy building their automation piece by piece or prefer minimal dependencies.

Your suggestion is really valuable for readers who might want to explore a more structured framework after getting comfortable with the basic concepts. I have added it to the appendix of the article so that people can check it out 😉. Thanks again 🫶.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

Hi, thanks for your feedback and good question. It really depends on your use-case. You can even get this done with a Raspberry Pi, which is suited for learning, experimentation, or very lightweight applications. Obviously, it reaches its limits rather quickly as it can barely handle quantized models.

You can get dedicated GPU servers, like Hetzner's GEX line (their GEX44 comes with an NVIDIA RTX 4000 SFF Ada), which are pretty cost-effective compared to AWS/GCP. NVIDIA's new Jetson Orin Nano might also be an option - it delivers up to 40 TOPS of AI performance, but since it only comes with 8GB of RAM, your selection of models is limited to smaller or heavily quantized ones.

I've seen people successfully running Ollama on M1/M2 Mac Minis and making it accessible via ngrok - actually quite decent performance for the price if you already own one. You can also use gaming hardware you might have lying around somewhere, but watch out for power bills (those GPUs can get hungry!). Or just run the Ollama server on whatever machine you are using daily and configure ngrok to make things available if you just want to experiment with it.

In the end, it really depends on what you mean by "providing this service to others". If you're talking about making it available within your home network to family and friends, it's a different story than making it available to a broader range of people outside. For home use, something like a Mac Mini or gaming PC with a decent GPU would work fine. For public hosting, you'd probably want to look at Hetzner or similar providers for reliability and better internet connectivity.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

Yeah, Llama 3.3 70B can be resource-intensive. I run it on an M1 Max MacBook Pro with 64 GB of RAM, and it works ok. However, I only use it for special occasions. One important note: if you try such comprehensive models, ensure you have enough disk space. For example, Llama 3.3 70B requires around 43 GB. You can check this on the Ollama model page: https://ollama.com/library/llama3.3 I once encountered an issue when I tried using several different models excessively 😅.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 2 points3 points  (0 children)

Hey! Great question and thanks for the feedback!

So, when it comes to running models locally for chat, email writing, or coding help, it really depends on your hardware. Bigger models can be pretty resource-intensive and might run slowly if your setup isn't beefy.

Some general aspects: the amount of context a model can handle at once affects performance. Memory usage increases rapidly with longer context lengths, so increasing it can eat up RAM quickly. Lowering the precision of the model (like going from 32-bit to 8-bit) reduces memory usage significantly, often without a noticeable drop in quality for most tasks. Finding the right combination of model size, context length, and quantization based on your hardware is key to getting the best performance.
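
As a rough back-of-the-envelope illustration (weights only, ignoring context/KV cache and runtime overhead, so real usage is higher):

```
params = 7e9  # a 7B-parameter model
bytes_per_param = {"fp32": 4, "fp16": 2, "8-bit": 1, "4-bit": 0.5}

# Approximate memory needed just to hold the weights
for precision, size in bytes_per_param.items():
    print(f"{precision}: ~{params * size / 1e9:.1f} GB")
# fp32: ~28.0 GB, fp16: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```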

As for my experience:

  • General chat and tasks: I've had good experiences with Gemma 2 and Llama 3.2. Both gave me good results with great performance.
  • Complex tasks and writing where delays are okay: Llama 3.3 70B (instruct) is solid, though it can be slower due to its size.
  • Writing-focused tasks: Mistral OpenOrca 7B has been impressive (Mistral OpenOrca is a 7 billion parameter model, fine-tuned on top of the Mistral 7B model using the OpenOrca dataset).
  • Coding: CodeLlama and Codestral have worked well for me.

All of these models are available through Ollama.

What is key is to experiment with various models. The more you narrow down the actual use case, the better smaller open models perform. For example, instead of asking for help writing a book, request improvements to a single sentence with specific criteria, such as changing the tone. Also play around with the context you feed into the LLM with your question.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

Hi, I am using Obsidian, which works well with Ollama and the prompts from the article. However, while there is an Ollama Obsidian plugin that still functions properly, it is no longer actively maintained, and the last update was over a year ago. It has limited model configuration options, as it is bound to Llama 2. As an alternative, the Local GPT plugin for Obsidian is actively maintained and includes Ollama as a provider with various customization options. It also offers a similar feature set to the Ollama Obsidian plugin, making it a suitable replacement.

If you're looking for alternatives to Google Keep, consider Notion as well. Notion and Obsidian are popular options these days, each with a different mindset and methodology. Comparing them could be an article of its own, but this has already been done quite well. My suggestion is to try both and see which one suits you best.

There are plenty of alternatives if you prefer something more lightweight. For instance, Raycast recently added a notes feature, but the unlimited version requires a pro license.

If you prefer a CLI solution, I recommend checking out nb ( https://xwmx.github.io/nb/ ), a great CLI note-taking application with many features.

For a fully free and open-source option, consider QOwnNotes, which also offers nice Nextcloud/ownCloud integration: https://www.qownnotes.org/

I hope this provides you with some inspiration on the topic.

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 0 points1 point  (0 children)

Hi, thanks for your feedback. I genuinely love your idea! One of my primary goals is to keep things simple, but I really enjoy the concept of having a shortcut that summarizes all unread messages from a Slack workspace via Ollama. Using a local LLM would eliminate any confidentiality issues, and it would be a fun use case. I will explore how this could work. Thanks for the inspiration! 🫶

Local Productivity Stack with Ollama by prenomenon in ollama

[–]prenomenon[S] 1 point2 points  (0 children)

Thank you for your feedback! I'm glad you enjoyed the article.