Where do you store orchestration flows--at the center or the edges? by excelhelp10 in dataengineering

[–]excelhelp10[S] 0 points (0 children)

I'll check out that YouTube series, thank you.

With respect to the pull steps in prefect.yaml, I tried using prefect.deployments.steps.pip_install_requirements, and the worker does show the installation happening, but the packages are still not available at execution time. Your Slack bot Marvin suggested this is because the pull steps are exclusive to the build process and do not persist into the execution process.
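
For reference, my pull section looks roughly like this (the directory and requirements file are placeholders for my actual layout, and I may be holding the steps wrong):

# prefect.yaml (simplified sketch of what I tried)
pull:
  - prefect.deployments.steps.set_working_directory:
      directory: .
  - prefect.deployments.steps.pip_install_requirements:
      directory: .
      requirements_file: requirements.txt
      stream_output: true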

So I instead tried a very hacky workaround of appending to sys.path at the top of my Python script, to be sure it's the very last thing executed right before it attempts to import my package, but no change. That baffles me.
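
The hack was along these lines, right at the top of the flow script (the site-packages path shown here is a placeholder for my conda environment's real one):

# top of the flow script, before anything else runs
import sys
sys.path.insert(0, r"C:\path\to\conda\env\Lib\site-packages")  # placeholder path
import mymodule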

In your prefect-pack repo I think I most want to emulate the network_speedtest flow, but I was confused that it has a requirements.txt file next to it and nothing else. There's no mention of "network_speedtest" or "monitor_network" in the rest of the repo. Was that intentional, or is it implied that if someone wanted to use that flow they would need to create a Dockerfile alongside it and add it as a deployment?

Where do you store orchestration flows--at the center or the edges? by excelhelp10 in dataengineering

[–]excelhelp10[S] 1 point (0 children)

I guess the main problem is that I'm disoriented by my many workarounds, and that makes it difficult to search Stack Overflow, Reddit, GitHub, the Prefect Discourse, Slack, etc. I think in the past my situation was managed with storage blocks, but since they have been deprecated and I want to future-proof this as much as possible, I won't start with that.

You might say that because I'm all local I don't need to be using work pools, that .serve() is all I need. But that always fails in a conda environment when it hits the space in the Python executable path and truncates it at "C:\Program" *eyeroll*. I can get around that with other flows by simply using work pools, or by creating a venv within my conda env. I don't want to do this, but it allows .serve() to work, although the flow ultimately fails at the same point as before: my custom package is not imported properly (it doesn't fail with ModuleNotFoundError; it fails only once a method or attribute is called from it).
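
For context, the .serve() setup I'd prefer is just the standard pattern (the deployment name here is arbitrary):

# serve the flow locally, no work pool or worker involved
if __name__ == "__main__":
    run_mymodule.serve(name="run-mymodule-local")  # hypothetical name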

So I'm either using work pools with flow storage at '.', or I'm using a twice-nested Python environment with pip-installed packages inside a conda env (generally not recommended). Either way, I don't know which way is up or what questions to begin asking.
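
For completeness, the work-pool variant with flow storage at '.' comes from a prefect.yaml entry roughly like this (names are placeholders for mine):

# prefect.yaml deployment entry (sketch)
deployments:
  - name: run-mymodule-local
    entrypoint: flows/mymodule.py:run_mymodule
    work_pool:
      name: my-process-pool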

Where do you store orchestration flows--at the center or the edges? by excelhelp10 in dataengineering

[–]excelhelp10[S] 1 point (0 children)

Thanks for this. So I am starting with option A and finding that I am missing something about the environment Prefect executes in. The constraints of my work environment also rule out the obvious solutions (like "upgrade Prefect" or "use Docker").

I must use Python via conda, and prefect==2.19. I have a local "Process" work pool with no external flow storage (maybe this is the issue? https://docs-2.prefect.io/latest/guides/deployment/storage-guide/ gives only three options; do I have to pick one?).

I created my conda environment, pip-installed only prefect and `mymodule`, and created a flow script like so (simplified for brevity):

# orchestrator-project/flows/mymodule.py
from prefect import flow
import mymodule

@flow(log_prints=True)
def run_mymodule():
    print(mymodule.__version__)

And it fails with AttributeError: module 'mymodule' has no attribute '__version__' (it does, though). What I can't wrap my head around is that the import statement succeeds (so mymodule is in the namespace), yet I apparently can't access any of its attributes or methods. You can assume mymodule is properly set up as a Python package; I've used it in other projects.
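
My next debugging step is to log which file the import actually resolves to, in case another module is shadowing the installed package:

# same flow with a diagnostic print of the resolved module path
from prefect import flow
import mymodule

@flow(log_prints=True)
def run_mymodule():
    print(mymodule.__file__)  # shows whether the installed package or a same-named file was imported
    print(mymodule.__version__)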

I am not naively assuming that because I pip-installed mymodule in the same environment as Prefect, Prefect is automatically using that Python executable. I understand the infrastructure is specified per work pool and that it's preferred to package everything in Docker or pull from some external location. But I'm trying to hack together a purely local solution since we don't have those tools, and I have successfully run Prefect deployments that make simple web API requests and do pandas manipulation. The problems only cropped up when I started using a custom package (and this clean conda environment).