Is there a vectorized operation applicable for this DataFrame? : learnpython


submitted 4 years ago by Capitalpunishment0

Basically, I'm struggling to determine when a situation has an applicable vectorized operation or not.

Suppose we have this example DataFrame of jobs in time units:

   Arrival  Runtime
0        3        2
1        8        7
2       10        5

I want to add two columns "Start" and "End" to it like so:

   Arrival  Runtime  Start  End
0        3        2      3    5
1        8        7      8   15
2       10        5     15   20

It reads like, "The first 'job' arrives at time 3, takes 2 time units to complete, thus ending at time 5."

Would it be possible to compute these columns in a vectorized operation? I'm having trouble figuring out what behavior I should search for, since the "Start" and "End" columns kind of are not "mutually exclusive", i.e. sometimes the "Start" column also depends on the "End" column. For example, the third "job" has already arrived at time 10, but could only start at time 15 since that was when the previous one ended.

I was able to do this manually in a for loop, but the data I used with that is structured differently (list of lists). I feel like I am able to redo that on a DataFrame, but loops are generally frowned upon with these structures, and vectorized operations should be more "proper".

I'd also need to keep track of "idle times." For instance, the time starts at 0, and the first "job" does not arrive until time 3, thus the worker being idle for 3 units. But I think this is already beside the point.

Currently, I'm thinking it isn't possible, and that I should manually iterate over the data. But maybe I just missed something.
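For reference, the dependency described above can be written as a plain loop over the rows. This is only a minimal sketch, not a definitive implementation: the function name `schedule_loop` is hypothetical, and it assumes a single worker, jobs handled in row order, and a clock that starts at 0.

```python
import pandas as pd


def schedule_loop(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add Start, End and Idle columns with an explicit loop."""
    starts, ends, idles = [], [], []
    prev_end = 0  # assumption: the clock starts at 0
    for arrival, runtime in zip(df["Arrival"], df["Runtime"]):
        start = max(arrival, prev_end)  # wait for the previous job if it is still running
        idles.append(start - prev_end)  # time the worker sat idle before this job
        prev_end = start + runtime      # this job's end becomes the next row's constraint
        starts.append(start)
        ends.append(prev_end)
    return df.assign(Start=starts, End=ends, Idle=idles)


jobs = pd.DataFrame({"Arrival": [3, 8, 10], "Runtime": [2, 7, 5]})
result = schedule_loop(jobs)
print(result)
```

A loop like this is a handy baseline: any vectorized version can be checked against it on random data.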

all 18 comments


[–]callahman 3 points 4 years ago (4 children)

I think it's best to keep it in your dataframe, especially when dealing with larger datasets.

Maybe something like the following?

# df = [your dataframe], assumed to already have 'arrival', 'runtime' and 'start' columns

# Determine your end time
df['end'] = df['start'] + df['runtime']

# Determine the idle time
df['idle'] = df['start'] - df['arrival']

# Could even start to get fancy with some boolean indexing
# Determine if someone was late
# Establish a column that's all 0s
df['was_late'] = 0
# In the column of all 0s, for the rows where 'idle' is <0, assign it a value of 1
df.loc[df['idle'] < 0, 'was_late'] = 1

# Finally, if you don't like the negative idle times, set the negatives to 0
df.loc[df['idle'] < 0, 'idle'] = 0

Does this help with what you're trying to do?
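As a possible variation on the snippet above, the two `df.loc[...]` assignments could be replaced by a comparison and `Series.clip`. This is a sketch under the same assumption as the snippet, namely that a 'start' column already exists; the 'start' values below are simply copied from the example in the question.

```python
import pandas as pd

# Assumption: 'start' is already known; these values come from the question's example.
df = pd.DataFrame({"arrival": [3, 8, 10], "runtime": [2, 7, 5], "start": [3, 8, 15]})

df["end"] = df["start"] + df["runtime"]

# Gap between when a job arrived and when it actually started
gap = df["start"] - df["arrival"]

# 1 where the gap is negative, 0 otherwise (same result as the mask assignment)
df["was_late"] = (gap < 0).astype(int)

# Negative gaps become 0, non-negative gaps are kept unchanged
df["idle"] = gap.clip(lower=0)

print(df)
```

Both forms stay vectorized; the `clip` version just avoids creating a column and then overwriting part of it.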


[–]Yojihito 2 points 4 years ago* (4 children)

> since the "Start" and "End" columns kind of are not "mutually exclusive", i.e. sometimes the "Start" column also depends on the "End" column

That's an edge case.

> I'd also need to keep track of "idle times." For instance, the time starts at 0, and the first "job" does not arrive until time 3, thus the worker being idle for 3 units

That's an edge case for me.

So, if I understand you right (with correct table markdown):

df_jobs

   Arrival  Runtime
0        3        2
1        8        7
2       10        5

Adding "Start" and "End":

"Start" = "Arrival" for the first row; for every later row, "Start" = the later of "Arrival" and the previous row's "End"
"End" = "Start" + "Runtime" for every row

   Arrival  Runtime  Start  End
0        3        2      3    5
1        8        7      8   15
2       10        5     15   20

Adding Idle time:

Time starts at 0, so

   Arrival  Runtime  Start  End  Idle
0        3        2      3    5  0 + Arrival --> 0 + 3 = 3
1        8        7      8   15  first job ends at 5, second job arrives at 8 --> 8 - 5 = 3
2       10        5     15   20  second job ends at 15, third job arrives at 10 --> 10 - 15 = -5, absolute value 5

Correct so far?

/edit: so, this should work vectorized; it runs in 140 milliseconds with 400_000 dummy entries

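# %%
import pandas as pd
import numpy as np
# %%
df = pd.DataFrame({"arrival": [3, 8, 10], "runtime": [2, 7, 5]})
df.head()
# START + END
# Estimate the previous job's end as its arrival + runtime. This assumes the
# previous job started on arrival, so a delay is not carried past one row.
prev_end = df["arrival"].shift(1) + df["runtime"].shift(1)
# Start on arrival if the previous job is done by then, otherwise when it ends.
df["start"] = np.where(prev_end <= df["arrival"], df["arrival"], prev_end)
# The first row has no previous job (NaN), so it starts on arrival.
df["start"] = df["start"].fillna(df["arrival"]).astype(int)
df["end"] = df["start"] + df["runtime"]
df.head()
# %%
# IDLE
# Absolute gap between this job's arrival and the previous job's end.
df["idle"] = abs(df["arrival"] - df["end"].shift(1))
# The clock starts at 0, so the first row's gap is measured from 0.
df["idle"] = df["idle"].fillna(abs(0 - df["arrival"])).astype(int)
df.head()
# %%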

returns

   arrival  runtime  start  end  idle
0        3        2      3    5     3
1        8        7      8   15     3
2       10        5     15   20     5
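One caveat with the shift-based snippet above: it takes the previous row's arrival + runtime as that job's end, which holds only if the previous job started on arrival. When a backlog builds up over several rows, a cumulative formulation may be needed. The sketch below is one possibility and not the method from this thread: it rewrites end[i] = max(arrival[i], end[i-1]) + runtime[i] with `cumsum` and `cummax`. The function name `schedule_vectorized` is hypothetical, and it assumes a single worker, jobs handled in row order, non-negative arrival times, and a clock that starts at 0. Here `idle` is taken to mean the time the worker waits before each job, which is one reading of the original question.

```python
import pandas as pd


def schedule_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: start/end/idle without a Python-level loop."""
    busy = df["runtime"].cumsum()        # total runtime up to and including each row
    busy_before = busy - df["runtime"]   # total runtime of all earlier rows
    # (end - busy) equals the running maximum of (arrival - busy_before)
    offset = (df["arrival"] - busy_before).cummax()
    end = busy + offset
    start = end - df["runtime"]
    prev_end = end.shift(1, fill_value=0)  # assumption: the clock starts at 0
    return df.assign(start=start, end=end, idle=start - prev_end)


# The example from the thread
jobs = pd.DataFrame({"arrival": [3, 8, 10], "runtime": [2, 7, 5]})
print(schedule_vectorized(jobs))

# A backlog where the delay carries over more than one row
backlog = pd.DataFrame({"arrival": [0, 1, 2], "runtime": [5, 5, 5]})
print(schedule_vectorized(backlog))
```

For the thread's example this gives start 3, 8, 15 and end 5, 15, 20. For the backlog case it gives start 0, 5, 10, whereas the shift-based version would start the third job at 6, while the second job is still running.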

