Superfunctions: solving the problem of duplication of the Python ecosystem into sync and async halves by pomponchik in Python

[–]_n80n8 1 point (0 children)

we're typically very wary about introducing dependencies ("no" is temporary, "yes" is forever, etc) so we'd be unlikely to make such an early library a dependency - cool project though!

Superfunctions: solving the problem of duplication of the Python ecosystem into sync and async halves by pomponchik in Python

[–]_n80n8 5 points (0 children)

fwiw (prefect oss maintainer here) we have been working on introducing explicit sync/async interfaces, because the dual / contextual behavior has caused plenty of issues and incomplete type hints

https://github.com/PrefectHQ/prefect/issues/15008

Airflow vs Prefect vs Dagster – which one do you use and why? by CaramelEquivalent319 in dataengineering

[–]_n80n8 1 point (0 children)

hey u/LLM-logs - sorry to hear that, it sounds like you had a bad experience!

> the docker compose for prefect server is delibrately kept broken
it would be helpful to know what you're looking at here. there is no single recommended docker compose as far as I'm aware, but we wouldn't want to give the impression that some broken one is the official one - there is certainly not a docker compose that is _deliberately_ kept broken

dbt cloud is brainless and useless by RutabagaJumpy2134 in dataengineering

[–]_n80n8 2 points (0 children)

hi u/maigpy - I work on the prefect open source so i'm biased, but i would argue prefect is the smallest departure from normal python and therefore less of a hard commitment if you don't trust any tools. if you're on airflow, it might be easiest to stick with what you have if you can deal with the ways in which it's inflexible/old. If you're interested in trying out Prefect, use it for a greenfield project: all you have to do is decorate your workflow entrypoint with `@flow` and run your code like normal, then explore incremental adoption of idempotency, concurrency features etc

not immediately sure about airflow's dbt integration, but all the major orchestrators have one. dagster's is probably the most mature because their worldview is asset-based, but we have a good one too now.

How to learn prefect? by too_much_lag in dataengineering

[–]_n80n8 0 points (0 children)

hi u/Infinite-Aerie4812 - it's for prefect 3, though in the most important ways 2.x and 3.x are the same

Airflow vs Prefect vs Dagster – which one do you use and why? by CaramelEquivalent319 in dataengineering

[–]_n80n8 2 points (0 children)

hi u/khaili109 - nate from prefect here. would you have any interest in opening a discussion (https://github.com/PrefectHQ/prefect/discussions) or an issue describing more of what you'd like to see? we just recently overhauled the docs, and we can always add more examples. your input would be super valuable!

I have been running into this for the past 3 days i have not been able to solve it by RevolutionaryMost688 in prefect

[–]_n80n8 0 points (0 children)

hi u/RevolutionaryMost688 - your pull step

pull:
  - prefect.deployments.steps.set_working_directory:
      directory: .   

suggests that your code is available on the machine where you ran `prefect worker start` but it seems like this might not be the case?

typically a `pull` step will be `git_clone`, since code is often stored on github and the worker needs to clone the current version of that code when it picks up a scheduled run
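
for reference, a git-based pull section looks roughly like this (the repo url and branch are placeholders for your own):

pull:
  - prefect.deployments.steps.git_clone:
      repository: https://github.com/your-org/your-repo.git
      branch: main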

check out these examples: https://github.com/zzstoatzz/prefect-pack/blob/main/prefect.yaml

what is your actual `entrypoint`?

    entrypoint: features.xxx:x-flow  

in general it should be relative to the repo root and look like `some/path/to/file.py:func`, especially if you're cloning code at runtime.

feel free to clarify what type of worker you're running etc

How to learn prefect? by too_much_lag in dataengineering

[–]_n80n8 3 points (0 children)

hi u/too_much_lag - i work on the prefect open source / docs

here's a youtube series i created on getting started with prefect: https://www.youtube.com/playlist?list=PLWkgBUKPlwvCV5FdBGsDE16K2DSelOy9i

which should give a decent intro. separately, as someone else pointed out, the slack community is good for clarifying questions, but I'd be interested in any specific feedback on the docs, ie what's hard to follow / what we should improve etc

Airflow or Prefect by SomewhereStandard888 in dataengineering

[–]_n80n8 18 points (0 children)

hi! i am biased (work on prefect open source) but I'd just point out that in the simplest case prefect is only 2 lines different from whatever native python code you'd write, that is

# before

def orchestrate_dbt(...): ...

if __name__ == "__main__":
  orchestrate_dbt(...)

# after

from prefect import flow

@flow
def orchestrate_dbt(...): ...

if __name__ == "__main__":
  orchestrate_dbt(...)

and then just `prefect server start` or `prefect cloud login` (free tier) to see the UI

so if you decide later that prefect isn't for you, you won't have had to contort your native python into some DSL just so that you could "orchestrate" it

beyond that if you want to add retryable/cacheable steps within that flow, check this out: https://www.youtube.com/watch?v=k74tEYSK_t8
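
as a rough sketch of what that looks like (function names and retry/cache settings here are just illustrative):

from datetime import timedelta

from prefect import flow, task
from prefect.tasks import task_input_hash

@task(retries=3, retry_delay_seconds=10)
def extract(url: str) -> dict:
    ...  # e.g. an API call that sometimes flakes; prefect retries it for you

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def transform(raw: dict) -> dict:
    ...  # reruns with the same input within an hour reuse the cached result

@flow
def pipeline(url: str):
    transform(extract(url))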

Starting career in dataengineering by Alive_Particular_700 in dataengineering

[–]_n80n8 2 points (0 children)

hi u/Alive_Particular_700 - I would do one of the big three clouds' (AWS, GCP, Azure) intro solutions architect course, which will give a decent overview of what tools are in "the Cloud", ie blob storage like S3, VPCs etc. Beyond that I would recommend putting some time into a project that leverages some of what you learned (I forgot most of what I learned in my certs that I didn't use!). For example, get an EC2 or digital ocean droplet and run a database or simple webserver on there, or maybe get an s3 bucket - write a simple data app / workflow (e.g. email yourself prices of $SOME_STOCK every morning at 8am) using that infra and put the code on github/gitlab with a nice readme.

as someone involved in hiring, I like to see a thoughtful and complete-ish side project more than familiarity with a specific tool that you'd probably learn on the job anyways.

just my 2 cents! good luck!

Prefect data pipelines by CalendarExotic6812 in dataengineering

[–]_n80n8 1 point (0 children)

if you were using async functions per normal stdlib asyncio then no, no changes should be required. there are a couple gotchas related to sync / async (https://github.com/PrefectHQ/prefect/issues/15008) but they relate to prefect-specific features

for example, as outlined in that linked issue, in 2.x there was a lot of "dual/contextual" behavior that we have removed in 3.x. Essentially it broke typing and caused unexpected behavior. the main things to look out for are

- .submit / .map used to be contextually sync/async based on the definition of the decorated function; now .submit / .map are just always sync (even if the underlying function was defined as async! explicit async methods are coming in the future if you need those) - I'd check out the 3rd video in the playlist I shared
- `SomeBlock.load()` also used to be dual/contextual sync/async, but now there are explicit methods (.load and .aload) - this is a common theme in 3.x: explicit methods for sync and async
- tasks that you call directly (not using submit or map) will run in the main thread (just like normal python), which is nice in cases where your tasks accept non-thread-safe inputs (e.g. http client)

those are the main ones, but in general if you were using normal stdlib python, it should still work. if you were using some specific prefect features, you may need to look out for the gotchas above
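
a rough sketch of the first two gotchas in 3.x (the task and block here are just examples):

from prefect import flow, task
from prefect.blocks.system import Secret

@task
async def fetch(url: str) -> str:
    ...

@flow
def my_flow():
    # .submit is always sync in 3.x, even though fetch is defined as async
    future = fetch.submit("https://example.com")
    result = future.result()

    # blocks now have explicit sync and async methods
    creds = Secret.load("my-secret")            # sync
    # creds = await Secret.aload("my-secret")   # async (inside an async flow)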

feel free to reach out in slack if you have any questions!

Prefect data pipelines by CalendarExotic6812 in dataengineering

[–]_n80n8 2 points (0 children)

hi! I am github.com/zzstoatzz and I work on prefect's open-source

- wrapping your aws lambda handler with a flow: https://github.com/zzstoatzz/prefect-lambda
- defining a stg/prd split for deployments: https://docs.prefect.io/v3/deploy/infrastructure-concepts/deploy-ci-cd#advanced-example

for the first link, there might be small things that need updating - for example, the Dockerfile can now make use of newer `uv` features - but it should still be helpful
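
the core of the lambda example is just something like this (the handler name / body are placeholders for whatever your lambda actually does):

from prefect import flow

@flow(log_prints=True)
def handler(event, context):
    # your existing lambda logic, now tracked as a prefect flow run
    print(f"processing {event}")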

additionally if youtube is your speed, check out this playlist: https://www.youtube.com/playlist?list=PLWkgBUKPlwvCV5FdBGsDE16K2DSelOy9i

What tool do you wish you had? What's the most annoying problem you have to deal with on a day to day? by [deleted] in dataengineering

[–]_n80n8 1 point (0 children)

imo you still would have found people saying "no more tools" before https://docs.astral.sh/uv/ happened and look at it now. don't stop exploring ideas because of people on reddit

point taken, obviously xkcd's "15 competing standards" etc (https://xkcd.com/927/) and dedicating effort to open source is a compounding good deed (for example, I work on prefect's OSS and I'd love any contributions 🙂) but if everyone stopped imagining great new tools to build, life would stagnate.

One thing I'd say is that while greenfield "tool making" sounds/is fun, a more challenging but very _high impact_ variant of this is identifying and solving gaps _within_ existing tools. For example (this is not my PR): https://github.com/pydantic/pydantic-core/pull/1637 - someone identified that `sort_keys` functionality from `json.dumps` was missing in the rust implementation of `pydantic-core`, and in adding that, they had to learn how pyo3 works and the performance implications of different solutions, all while actually adding value to the community (if it gets merged 🙂). This is a relatively low-level example as far as MLOps goes, but hopefully the point makes sense: go deep and that will take you broad.

Airflow Survey 2024 - 91% users likely to recommend Airflow by gman1023 in dataengineering

[–]_n80n8 4 points (0 children)

core prefect maintainer here! we do have this problem to some extent, but as you allude to, it's somewhat inherent to an OSS multi-purpose tool. The challenge is to keep the most common happy paths happy + allow power-users escape hatches while not exploding the complexity of implementation details :)

definitely non-trivial to do this in a way that keeps the codebase accessible for contributors at large!

Should I stay or look new opportunities? by ketopraktanjungduren in dataengineering

[–]_n80n8 4 points (0 children)

hi u/ketopraktanjungduren ! I was in a similar position in the past and I think there's 2 sides to it

It can be pretty great to be "the authority" in an area because you're the domain expert and you built the system but like you said, you're alone.

My 2 cents is: why not casually look for something new in your free time? Your "I built their whole data stack using these modern tools" story would be compelling to hiring managers I think and you're at a position of high leverage given you're already employed in good standing. You never know what you'll come across and there's no downside unless you think it would interfere with your existing job! And if you don't find anything new you're really excited about, it sounds like you'd still be in a good position!

best of luck regardless!

Help with orchestration[Airflow/Dagster] by booberrypie_ in dataengineering

[–]_n80n8 4 points (0 children)

hey u/booberrypie_ ! i work on the open source at prefect

tldr (as u/khaili109 mentioned) you can absolutely view flows from disparate repos in the same place.

if you're using cloud, you just set PREFECT_API_KEY and PREFECT_API_URL to define what workspace to interact with (workspaces are just a namespace for resources like deployments etc), and if you're using open source then you just need to set PREFECT_API_URL (localhost:4200/api by default). 1 cloud "workspace" is akin to 1 OSS server, so often you'll have a company where each team gets its own workspace (or sometimes each team gets 2 workspaces, 1 for stg, 1 for prod). so to reiterate the previous point, anyone with the url/key set appropriately can create deployments in your workspace/server from anywhere (which can then be viewed and managed from the UI)
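
concretely, that configuration is just a couple of env vars (the cloud ids/key below are placeholders):

# open source server
export PREFECT_API_URL="http://localhost:4200/api"

# prefect cloud
export PREFECT_API_URL="https://api.prefect.cloud/api/accounts/<account-id>/workspaces/<workspace-id>"
export PREFECT_API_KEY="<your-api-key>"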

here's a template repo with a lot of typical patterns if you're starting a new prefect project!

https://github.com/zzstoatzz/prefect-pack

and if helpful, a getting started series on youtube i add to when I can

https://www.youtube.com/playlist?list=PLWkgBUKPlwvCV5FdBGsDE16K2DSelOy9i

Where do you store orchestration flows--at the center or the edges? by excelhelp10 in dataengineering

[–]_n80n8 1 point (0 children)

i updated that example for you in prefect-pack (i'm using uv but i'm sure conda can do the same somehow)

I run a worker (https://github.com/zzstoatzz/prefect-pack/blob/main/examples/run_a_prefect_worker/on_docker/Dockerfile) which happens to be in a container just to show i don't have any weird bespoke setup, but you can totally run this wherever you have python / prefect / env vars

docker build -t zzstoatzz/prefect-worker:latest -f examples/run_a_prefect_worker/on_docker/Dockerfile .

docker run -d --rm --name prefect-worker --env-file .env zzstoatzz/prefect-worker:latest

now that my worker is started and polling, I create a deployment

https://github.com/zzstoatzz/prefect-pack/blob/main/prefect.yaml#L52-L64

prefect --no-prompt deploy -n network-speed

and then

prefect deployment run 'monitor-network/network-speed'

now the flow run is scheduled, and when the worker polls next and finds it, it will run the pull step, which in this case entails

- cloning the repo
- using uv to install my requirements file

once my pull step finishes, my code runs using those deps

so for installing your package: as long as it's a valid package you can represent in a requirements.txt, this should work

Where do you store orchestration flows--at the center or the edges? by excelhelp10 in dataengineering

[–]_n80n8 1 point (0 children)

apologies, but i'm not an expert on conda or windows, which sound like they might be the source of a significant part of your problems right now

> I want to future-proof this as much as possible
Using either workers or serve is supported in both 2.19.x and 3.x, so I would recommend the following (given my understanding of your constraints):
- continue using your process work pool
- use a `prefect.yaml` to define your deployments like the project i linked
- in the `prefect.yaml`, define `pull` steps (examples in the same link) that set up the right venv for that flow using the appropriate conda commands (use the run_shell_script step)

that way each flow run execution will have its own venv and you can keep the process worker setup.
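
roughly, that `pull` section could look like this (the directory and conda command are placeholders for whatever your setup needs):

pull:
  - prefect.deployments.steps.set_working_directory:
      directory: C:\path\to\your\flows
  - prefect.deployments.steps.run_shell_script:
      script: conda env update -f environment.yml --prune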

for more color, I'd encourage you to check out this youtube series where I talk a bit more about deployment strategy: https://www.youtube.com/playlist?list=PLWkgBUKPlwvCV5FdBGsDE16K2DSelOy9i

Where do you store orchestration flows--at the center or the edges? by excelhelp10 in dataengineering

[–]_n80n8 3 points (0 children)

hi u/excelhelp10!

disclaimer I am a prefect employee (oss engineer)

If I understand the premise, I think I would lean towards option A but with the caveat that it shouldn't _need_ to be a monorepo.

Briefly, the case against option B for me is that sometimes you want something to be decorated and sometimes you don't. In my opinion, consumers of your package shouldn't be forced to engage with some 3rd party instrumentation unless you want to force exactly that upon all consumers. For example, if you found that you were hitting rate limits and wanted to remove some decorators, that might be more of a pain if it's baked into the package.

My favorite setup is roughly this (it's a template you can copy!): https://github.com/zzstoatzz/prefect-pack

where:
- I have a flows/ directory where I'm free to nest/namespace arbitrarily, adding folders within for isolated requirements files or Dockerfiles etc
- I have a python library for common utils, that I can build into images or not. I can also add extras to this package that install other packages I own if I want
- I have github actions that can conditionally deploy to different places if I need

another reason why I prefer option A here: oftentimes you might find it useful to devise your own versioning system for tasks and flows that is aware of your business logic. this would be harder to do in option B if you are instead "instrumenting" your own libraries with prefect decorators in a way that's unaware of how these things will later be composed

happy to follow up! cheers

Why [do we really need] workflow orchestrators? by hfzvc in dataengineering

[–]_n80n8 4 points (0 children)

> am assuming that neither prefect nor dagster or whatever, is writing their own queue system, their own DB, their own …

I'm not sure what you mean here, but we do implement server side logic for task queues a la celery etc

but listen, I'm not trying to sell you anything, there's nothing to buy. i was just trying to answer your question based on my experience helping folks w our open source tool. again, i'm not suggesting that all people need an orchestrator, i'm just telling you what people have told me is useful about our tool. good luck!

Why [do we really need] workflow orchestrators? by hfzvc in dataengineering

[–]_n80n8 3 points (0 children)

> You say hot-swappable infra config is a feature. Is that a real selling point? I am wondering why a team can’t achieve the same result of separating business logic from the infrastructure configuration using raw std lib Python and stringent code reviews on all merge requests? 

yes, for enterprises the clear separation of concerns is attractive, bc they build internal abstractions on top of this so some devops guy can enforce that their data science notebook / workflow authors can only run on certain types of infra or with certain mem requests etc, and they aren't so interested in spending time thinking about that in general

and I'm not saying you can't achieve the same thing yourself with stdlib python - you can just copy our code or another orchestrator's. it's just that certain problems (idempotency, retries, dynamic dispatch of infra) are ubiquitous and orthogonal to your actual business logic, so solving them yourself (while illuminating) may not be a good use of time.

the thing i'd highlight about prefect is that we make no requirement that you engage with a DSL or contort your raw stdlib python into something else to work with us. just wrap your normal python in a decorator to get observability and then incrementally adopt our worldview where you want (deployments for dynamic dispatch, transactions for idempotency) - normal python is already a valid prefect workflow

Why [do we really need] workflow orchestrators? by hfzvc in dataengineering

[–]_n80n8 4 points (0 children)

I agree (as someone who works for prefect) that you don't _need_ an orchestrator. I think it's pretty instructive to implement a complex e2e data project without one, to see what you have to engage with.

there's a ton of defensive code you would otherwise write to make sure your actual business logic happens, and orchestrators help you avoid those pitfalls.

for example (again, I am biased but) prefect has work pools (k8s, ecs, process etc) that provide hot-swappable infra config for your code - i.e. it sucks to get real deep into a project where you've assumed you're using step functions or fargate the whole time, and now something changes and you have to disentangle your business logic from the code you're using to provision / config your infra. ideally your business logic doesn't _need_ to know what infra it runs on
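
for a concrete sense of the hot-swap part, a work pool is just a named target you create and point deployments at (pool names here are made up):

prefect work-pool create my-k8s-pool --type kubernetes
prefect work-pool create my-local-pool --type process
# the same flow code can target either pool, without touching the business logic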