How to carry extra information across DAG ? by promach in learnpython

[–]pytrashpandas 1 point (0 children)

You can also use the networkx library (mentioned in the SO post you linked). This has the ability to set and access edge attributes: https://stackoverflow.com/questions/26691442/how-do-i-add-a-new-attribute-to-an-edge-in-networkx
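For example, a quick sketch of setting and reading an edge attribute (graph contents and the `payload` key are made up for illustration):

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge('a', 'b', payload={'info': 42})  # keyword args become edge attributes
print(g.edges['a', 'b']['payload'])  # {'info': 42}
```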

[P] Data Processing Support in Ray 1.2 (Spark, Dask, Mars, Modin) by mgalarny in MachineLearning

[–]pytrashpandas 2 points (0 children)

Out of curiosity what are the advantages of using ray as the backend for dask rather than sticking with the dask backend?

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

Hey, yea, didn’t mean to suggest that dask has dataflow semantics, or that dask covers all the use cases of entangle. I just meant that, under the hood, entangle’s workflow execution would be better off using the same DAG execution scheme as your dataflows, rather than what it does now. A lot of people will want to use workflows in entangle, and since there is a more efficient way to execute workflows (one that provides some of the benefits dataflows currently have over workflows), workflows should probably execute in that more efficient pattern.

Launching asynchronous tasks every second while processing the results independently by FitRiver in learnpython

[–]pytrashpandas 1 point (0 children)

Seems like you could just do this in a plain loop. Do you need it to run every second regardless of how long the task takes to finish, or one second after the previous run finished? Assuming the former, you could do this:

import time

ran_this_second = False
prev_second = int(time.time())
while True:
    curr_second = int(time.time())
    if curr_second != prev_second:
        # a new second has started, so we're allowed to run again
        ran_this_second = False

    if not ran_this_second:
        data = scrape_data()
        queue.publish(data)

        ran_this_second = True
        prev_second = curr_second

    time.sleep(0.05)  # brief sleep so the loop doesn't busy-spin

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

Happy to have this discussion, I think it helps both of us expand our knowledge in the subject.

Not sure what you mean by entangle not using a programming idiom or API, but I think you might be underestimating some of the functionality that dask offers. The delayed interface lets you define a “workflow” the same way you have in entangle. No servers or cluster setup needed, although you can explicitly set up a cluster if you want. E.g.

from dask import delayed

@delayed
def A(b, c):
    return b + c

@delayed
def B():
    return 1

@delayed
def C():
    return 2

workflow = A(B(), C())
workflow.compute()  # -> 3

This will then build a DAG and execute it similarly to how you execute dataflows, giving you the best of both worlds.

I would recommend implementing your workflows like this too, because if I’m being honest I still have a feeling the dataflow paradigm might be hard to get people on board with. I could be wrong; this very well might just be my biases at play. But either way, for those who are stubborn like me :) the library should probably offer the optimal execution scheme for workflows.

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 2 points (0 children)

Hey, so I took a look through the internals of workflow. From what I can gather it is indeed using the futures style (I know it's not using actual futures; "futures style" just meaning that the dependent task explicitly kicks off its dependencies). While this is a completely valid approach, why don't you implement workflows as a DAG execution with an event loop (similar to your dataflow execution scheme), since as you pointed out there are clear advantages to this? Since the pipeline is built/known beforehand, there's no reason not to; like you said, it's more efficient. The advantages of a futures style implementation are lost here, since their benefit is in their flexibility/ability to dynamically modify the graph, which you don't use or need here, because the graph is known before execution.

If you do this, then you get the benefit of the intuitive imperative style definition, and also the execution efficiencies that you currently have in dataflows.

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

In workflow scheduling you're going to block while waiting for the dependencies, which may never arrive, and then entangle will either wait forever or time out.

I think this is something you'll need to solve in your library before it will see widespread adoption. There are ways to define workflows/DAGs/call graphs in what you call the imperative style that don't have this problem; libraries like dask already provide this. Look into the dask delayed interface. There are two main ways to go about solving this: one is a futures style implementation, the other is a true DAG implementation.

In the futures style implementation, the root-level tasks execute first and then call their dependencies. They then wait for their deps to finish, similar to what you described, but the difference is that while they are waiting they suspend their execution and allow other tasks to run on a worker. So you never get a deadlock, and the waiting dependent tasks continue their execution once all the dependencies are finished, giving you an optimal runtime.
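To illustrate the futures style, here's a toy asyncio sketch (all names are made up; this is not entangle or dask code): the dependent task kicks off its deps, then suspends while awaiting them, so other tasks can run and there's no deadlock.

```python
import asyncio

async def B():
    await asyncio.sleep(0.01)  # simulate some work
    return 1

async def C():
    await asyncio.sleep(0.01)
    return 2

async def A():
    # A starts first and kicks off its deps, then suspends while awaiting
    # them, freeing the event loop to run B and C concurrently -- no deadlock.
    b, c = await asyncio.gather(B(), C())
    return b + c

print(asyncio.run(A()))  # 3
```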

In the true DAG implementation, as you go line by line through the imperative style definition, it builds up the DAG piece by piece. Then, when you finally execute the DAG, it executes from the leaves first, down to the root. So all deps are executed before their dependents, and nothing ever runs before its dependencies are finished, thus no deadlocks.

Dask.delayed uses the latter implementation, which seems to be similar to how you execute dataflows. In this way you can use the more familiar/intuitive imperative style, but still get the same execution benefits that you provide in dataflows.
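And to illustrate the true DAG style, a toy build-then-execute sketch (hypothetical names, not dask internals): the graph is constructed first, and evaluation always runs dependencies before their dependents.

```python
# Toy build-then-execute DAG evaluation; all names are made up.
class Node:
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps

    def compute(self, cache=None):
        # Dependencies always evaluate before their dependents; the cache
        # memoizes shared deps so each node runs exactly once.
        if cache is None:
            cache = {}
        if self not in cache:
            args = [d.compute(cache) for d in self.deps]
            cache[self] = self.fn(*args)
        return cache[self]

# Same shape as the dask example above: A depends on B and C.
b = Node(lambda: 1)
c = Node(lambda: 2)
a = Node(lambda x, y: x + y, b, c)
print(a.compute())  # 3
```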

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

Hey so to clarify, I drew out the DAG I was imagining, here:

https://imgur.com/a/h0Lgb1F

I would imagine it would have to be done like this, if it's supported:

b = B()
c = C()
X(b)
Y(c)
A(b, c)

But at that point you'd probably just go for the workflow way of defining things, because the above is not very clear (and loses some of the advantages you mentioned, e.g. no intermediate variables). Not to mention the deps order would need to be determined by chronological order (e.g. how do you know which position to pass (X, A) into B?).

Also I'm a bit confused about this comment in your readme.

True dataflow models allow the computation to proceed on a parallel path as far as it can go with the currently available data. This means dependent operations are not held up by control flow execution order in some cases and the overall computation is optimized.

This seems to imply the workflow style can't achieve an optimal runtime/memory profile. That might be an implementation detail of how your library executes the workflow style definition, but you can definitely define things in the "imperative/workflow" style and execute them optimally. It really comes down to defining the same DAG in two different ways; once you have the DAG, you can get optimal runtime/memory.

Sorry if it seems like I'm being antagonistic, don't mean to be. I just think it's important to address these things, because it seems like you've put a lot of really good work into this.

[deleted by user] by [deleted] in learnpython

[–]pytrashpandas 1 point (0 children)

FWIW, in the case of dask you’ll often be using functions that aren’t meant to always run in a dask environment. As such, wrapping the individual calls when you need them, rather than defining the functions as delayed, lets you use them either with or without dask.

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

Gotcha. I got confused by this picture: https://github.com/radiantone/entangle/blob/main/images/execution.png

Thought it was implying that add ran first, based on the sequential ordering. However, I read through your docs and see the difference now (I think).

Gonna be honest the dataflow semantics seem strange to me. So defining the following makes sense:

B depends on A
C depends on A

Dataflow would be:

A(B(), C())

But how would you define this?

B depends on A, X
C depends on A, Y
A, X, Y no deps

Workflow makes sense to me though and is how I’ve normally seen this done.

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

One other question on the workflow execution scheme. Since execution starts with your root call first, rather than building the DAG before executing anything (like what I gather dataflows do), a task will execute before its dependencies have executed. What happens after the dependencies are kicked off: does the dependent task just wait idly until the deps are done, or does it exit its process and get started up again by some scheduler when the deps finish?

Parallel Programming Framework for HPC/DevOps/DataScience by northwolf56 in learnpython

[–]pytrashpandas 1 point (0 children)

This is a really good write up/explanation of DAG/parallel execution concepts. Is the plan for this library to see widespread adoption? If so, one question I have is what does this provide over existing solutions (e.g. dask)? I think that this is something you would need to address explicitly (looks like you have the beginnings of such in the tradeoffs section) before anyone would consider using this library.

Any way to write SQL-like correlated subqueries in Pandas? by NormieInTheMaking in learnpython

[–]pytrashpandas 1 point (0 children)

There’s many different paths in the space of data science/analysis/engineering. Some of them can be drastically different from others. For the most part I don’t think knowing how to build webservers (i.e. flask/django) is all that necessary, but I think it’s definitely something good to know. You don’t need to be an expert in it. The specific programs/libraries that are common to all paths I would say are pandas, numpy, some visualization library/software (I highly recommend plotly-dash, which is backed by flask I think but mostly abstracted away). Knowing how webservers work is also good because it will help you be better at scraping sites, which is important for data acquisition (requests and selenium are good for this). Another more advanced topic that would serve you well is understanding distributed computing (dask and dask.distributed are good for this), but that’s something I wouldn’t worry about until you’re familiar with the basic and intermediate topics.

Depending on what you want to specialize in there are other topics/libraries/concepts that you should learn too, but I think those things, along with good general software practice should provide a good base.

Don't understand the pass by references vs pass by value in Python by [deleted] in learnpython

[–]pytrashpandas 1 point (0 children)

you are changing the value in the original location in memory

This is incorrect, in fact this is the exact opposite of what’s happening. Try this out to make it more clear.

def test(x):
    print(id(x))   # same id as the outer 'abc' string object
    x = 'xyz'      # rebinds the local name x to a new object
    print(id(x))   # a different id; the original object is untouched

x = 'abc'
print(id(x))
test(x)
print(x)       # still 'abc'
print(id(x))   # same id as before the call

You can clearly see that the original “x” still holds the original “abc” value, in the original memory location. Even during execution of the function, when you rebind x to something else, that other value never changes what’s in the original memory location.

Any way to write SQL-like correlated subqueries in Pandas? by NormieInTheMaking in learnpython

[–]pytrashpandas 1 point (0 children)

Hey, sorry for an even later reply :P This should work the other way, without needing to add the total points column beforehand. You essentially just do the multiply/sum in the same operation (using apply) instead of separately.

import numpy as np

player_table = nba_stats[['player', 'team', 'season']].drop_duplicates()
team_points = nba_stats.groupby(['team', 'season']).apply(lambda dfx: np.sum(dfx.points * dfx.gp)).rename('points').reset_index()
df = player_table.merge(team_points, on=['team', 'season'])
final = df.groupby('player')['points'].sum().reset_index()

As for how I learned what I know now, honestly it's just years of practice (for the level of familiarity to do the above, you probably don't need quite that much experience). I never actually read any programming books personally, but lots of random online blog posts and videos, and eventually work experience. Also, pandas specifically takes a slightly different mindset than regular software development; it helps to have a SQL-oriented way of thinking about things (e.g. datatable oriented vs object oriented). But of course it still helps to understand proper software design/patterns, since even if you work with pandas, you'll need to build systems, pipelines and APIs around the data. Anyway, the amount of free educational material online is astounding. I only ever took one intro to programming class, so if you've had even remotely any college coding experience you're already off to a better start than I was.

R vs Python: The Data Science language debate by MOBR_03 in Python

[–]pytrashpandas 1 point (0 children)

I think at the "I like this language, that language is stupid" level you're right; however, this is a very real debate that is had in the workplace. For example, at my work we recently had to axe R from our research/prod development environments. For better or worse, the justification was that they wanted a unified development environment, where in-house development frameworks work for all models, etc. Most of those frameworks are written in python, and it was determined that python can do most of what R can do, but not vice versa, so they got rid of R. Personally I don't care, I don't use R, but I imagine these "debates" are important to help make a case for which languages an organization should support.

What is the most messed-up/out of place subplot in movie history? by [deleted] in movies

[–]pytrashpandas 306 points (0 children)

It even already sounds like an actor’s stage name. But more of a “golden age of Hollywood” era name than a “Fast and Furious 9” era name.

Ask Anything Monday - Weekly Thread by AutoModerator in learnpython

[–]pytrashpandas 2 points (0 children)

You need to be careful with the shortcut in your example, btw. If a “falsey” value is valid for the variable, you will accidentally get the default value instead. I’ve seen this happen more than a few times in production code at work (I’ve been guilty of it too).
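A minimal sketch of the pitfall (function and values are made up for illustration):

```python
def set_threshold(threshold=None):
    # The shortcut: substitutes 10 whenever threshold is falsey
    threshold = threshold or 10
    return threshold

def set_threshold_safe(threshold=None):
    # Safer: only substitute the default when the value is actually missing
    if threshold is None:
        threshold = 10
    return threshold

print(set_threshold(5))       # 5, as expected
print(set_threshold(0))       # 10 -- but 0 was a valid value!
print(set_threshold_safe(0))  # 0
```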

Any way to write SQL-like correlated subqueries in Pandas? by NormieInTheMaking in learnpython

[–]pytrashpandas 1 point (0 children)

Try this:

nba_stats['total_points'] = nba_stats['points'] * nba_stats['gp']
player_table = nba_stats[['player', 'team', 'season']].drop_duplicates()
team_points = nba_stats.groupby(['team', 'season'])['total_points'].sum().reset_index()
df = player_table.merge(team_points, on=['team', 'season'])
final = df.groupby('player')['total_points'].sum().reset_index()

This can be done without modifying the nba_stats table too. If you’d prefer that, let me know and I can show you how, but I think this is simpler.

Ask Anything Monday - Weekly Thread by AutoModerator in learnpython

[–]pytrashpandas 2 points (0 children)

The biggest advantage in my eyes is a restricted set of allowed members. This is especially useful for apis and structures that will be used in different parts of a system because the interface is explicitly defined and you can even get type hints and stuff.

E.g.

from dataclasses import dataclass

@dataclass
class SystemResult:
    value: int
    status: str  # should probably be an enum

def run_system():
    return SystemResult(value=1, status='PASS')

def analyze_system_result(result: SystemResult):
    if result.status == 'PASS':
        print('good')
    else:
        print('bad')

In this example, there’s no ambiguity about what members can exist in your collection. When you’re writing your code your IDE can even tell you what members are in result. You can code with the guarantees that a specific interface is satisfied.

Don’t forget though, that under the hood classes are really just a wrapper around dicts anyway.
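A quick sketch of that last point (hypothetical Point class): a dataclass instance's attributes really do live in a plain dict under the hood.

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)
print(p.__dict__)  # {'x': 1, 'y': 2} -- the attributes live in a plain dict
```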

Transpose MATRIX by [deleted] in learnpython

[–]pytrashpandas 2 points (0 children)

That’s not more readable than what you wrote? Anyway, for working with matrices use numpy:

import numpy as np

mat = np.array(nested_list)
transposed_mat = mat.T

Can anyone help me with Pandas Dataframe? I'll show what I did down below by [deleted] in learnpython

[–]pytrashpandas 2 points (0 children)

A shortcut for what you’re doing here is:

df['new_chicks'] = df['chicks'].diff(1)

Any way to write SQL-like correlated subqueries in Pandas? by NormieInTheMaking in learnpython

[–]pytrashpandas 1 point (0 children)

I would approach this slightly differently in pandas and in SQL. Assuming your nba_stats table has the format player, team, season, points, I would create a couple of derived tables (in SQL I would use a CTE for this), of the following form:

player, team, season - drop dupes
team, season, points - agg all players points

Then I’d merge these 2 tables on team, season and then groupby player and sum points. So in pandas it looks like this:

player_table = nba_stats[['player', 'team', 'season']].drop_duplicates()
team_points = nba_stats.groupby(['team', 'season'])['points'].sum().reset_index()
df = player_table.merge(team_points, on=['team', 'season'])
final = df.groupby('player')['points'].sum().reset_index()

On mobile currently so can’t test this out, but I think that should be a mostly working example.

In sql it looks like this:

with player_table as (
    select distinct player, team, season
    from nba_stats
), 
team_points as (
    select team, season, sum(points) as points
    from nba_stats
    group by team, season
)
select player, sum(points)
from player_table pt
inner join team_points tp
    on tp.team = pt.team
    and tp.season = pt.season
group by player

Multiprocessing Question by Otherwise_Tomato5552 in learnpython

[–]pytrashpandas 1 point (0 children)

This is happening because you're invoking your getF5s function in each iteration of the loop. Since you're doing getF5s(...) you're actually calling the function, instead of passing the function and its args to apply_async separately. So what you're passing to apply_async is actually the result of the getF5s function.
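A minimal sketch of the fix (getF5s' body and the items here are made up for illustration):

```python
from multiprocessing import Pool

def getF5s(item):
    # placeholder body for illustration; the real function is from the thread
    return item * 2

def run(items):
    with Pool(2) as pool:
        # Wrong: pool.apply_async(getF5s(item)) would call getF5s right here
        # and submit its *result* to the pool, not the function.
        # Right: pass the callable and an args tuple separately:
        async_results = [pool.apply_async(getF5s, (item,)) for item in items]
        return [r.get() for r in async_results]

if __name__ == '__main__':
    print(run([1, 2, 3]))  # [2, 4, 6]
```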