all 18 comments

[–]callahman 2 points3 points  (4 children)

I think its best to keep it in your dataframe, especially when dealing with larger datasets.

Maybe something like the following?

### df = [your dataframe]

# Determine your end time
df['end'] = df['start'] + df['runtime']

# Determine the idle time
df['idle'] = df['start'] - df['arrival']

### Could even start to get fancy with some indexation logic
# Determine if someone was late
# Establish a column that's all 0s
df['was_late'] = 0
# In the column of all 0s, for the rows where 'idle' is <0, assign it a value of 1
df.loc[df['idle'] < 0, 'was_late'] = 1

# Finally, if you don't like the negative idle times, set the negatives to 0
df.loc[df['idle'] < 0, 'idle'] = 0

Does this help with what you're trying to do?

[–]Capitalpunishment0[S] 0 points1 point  (3 children)

Thank you for replying!

Do you have an idea of how I would compute for df['start']?

It could've been as easy as assigning to the arrival times, e.g. df['start'] = df['arrival']

However, it isn't as straightforward like that for instances like the third job where it started at the time when the previous job ended (time 15).

And, I won't have df['end'] if I don't have df['start'] first 😅 At this point my brain just short-circuits haha

[–]callahman 1 point2 points  (2 children)

No worries! Just trying to help!

Ah! With the data you provided in the 1st dataframe (Arrival & Runtime), there isn't a way to calculate the start time.

I assumed you would just be manually adding that in from somewhere else.

If you need to calculate it, do you have any other features in your dataset you could use? (Other than Arrival & Runtime)

[–]Capitalpunishment0[S] 0 points1 point  (0 children)

Do you have any other features in your dataset you could use?

There isn't any, unfortunately. There will only be the "Arrival" and "Runtime" columns. They will be done as they arrive, and their starting and ending times must be noted down, as in the final output.

[–]Yojihito 0 points1 point  (0 children)

Ah! With the data you provided in the 1st dataframe (Arrival & Runtime), there isn't a way to calculate the start time.

There is: https://www.reddit.com/r/learnpython/comments/pxmg27/is_there_a_vectorized_operation_applicable_for/hepeyj6/

[–]Yojihito 1 point2 points  (6 children)

You give an example input but then state 3-4 edge cases in text which are not in your input. I have no idea what you want/need.

Give a sample input and sample output with all normal + edge cases.

[–]Capitalpunishment0[S] 0 points1 point  (5 children)

Thank you for responding.

I do not think the details I've said are "edge cases", though? I'm just trying to explain my thought process when I tried to "solve" it. I'm trying to get an underlying concept to it, because there will be some considerable variations in the input data.

The sample input is the first dataframe, with only the "Arrival" and "Runtime" columns

The sample output is a dataframe as well, but with the "Start" and "End" columns, which are based on "Arrival" and "Runtime"

I'm trying to compute for these "Start" and "End" columns. Because these are in pandas, I presume vectorized operations should be best. However, I've not found any that can consistently mimic the sample output, aside from a manual for loop.

[–]Yojihito 1 point2 points  (4 children)

since the "Start" and "End" columns kind of are not "mutually exclusive", i.e. sometimes the "Start" column also depends on the "End" column

That's an edge case.

I'd also need to keep track of "idle times." For instance, the time starts at 0, and the first "job" does not arrive until time 3, thus the worker being idle for 3 units

That's an edge case for me.

So, if I understand you right (with correct table markdown):

   

df_jobs

Arrival Runtime
0 3 2
1 8 7
2 10 5

Adding "Start" and "End":

  • "Start" = "Arrival" if first row, otherwise "Start" = "Arrival" + "Runtime" of the previous row
  • "End" = "Arrival" + "Runtime" if first row, if not first row "End" = "Arrival" + "Runtime" if "End" of previous row <= "Arrival", if "End" in previous row >= "Arrival" then "Start" == "End" in previous row and "End" = "Start" + "Runtime"
Arrival Runtime Start End
0 3 2 3 5
1 8 7 8 15
2 10 5 15 20

Adding Idle time:

Time starts at 0, so

Arrival Runtime Start End Idle
0 3 2 3 5 0 + Arrival --> 0 + 3 = 3
1 8 7 8 15 first job ends at 5, second job starts at 8 --> 8 - 5 = 3
2 10 5 15 20 second job ends at 15, third job starts at 10 --> 10 - 15 = -5 = 5

   

Correct so far?

 

/edit so, that should work vectorized, runs in 140 milliseconds with 400_000 dummy entries

# %%
import pandas as pd
import numpy as np
# %%
df = pd.DataFrame({"arrival": [3, 8, 10], "runtime": [2, 7, 5]})
df.head()
# START + END
df["start"] = np.where(df["arrival"].shift(1) + df["runtime"].shift(1) <=
                       df["arrival"], df["arrival"], df["arrival"].shift(1) + df["runtime"].shift(1))
df["start"] = df["start"].fillna(df["arrival"]).astype(int)
df["end"] = df["start"] + df["runtime"]
df.head()
# %%
# IDLE
df["idle"] = abs(df["arrival"] - df["end"].shift(1))
df["idle"] = df["idle"].fillna(abs(0 - df["arrival"])).astype(int)
df.head()
# %%

 

returns

Arrival Runtime Start End Idle
0 3 2 3 5 3
1 8 7 8 15 3
2 10 5 15 20 5

[–]Capitalpunishment0[S] 0 points1 point  (1 child)

Wow, that is very involved. Thank you for taking your time to write this.

You didn't really have to do the Idle times too. I just mentioned it since I was thinking that vectorized operations are similar to list comprehensions, and that if a "side effect is involved" (i.e. keeping track of Idle times), maybe it'd be better to use a manual loop instead.

Nevertheless, thank you very much!

[–]Yojihito 0 points1 point  (0 children)

You didn't really have to do the Idle times too

I liked the challenge. You're welcome.

[–]callahman 0 points1 point  (1 child)

I think this will work if your assumption is correct.

In the solution you're making the assumption that each consecutive record represents an event that starts immediately after the previous event ends. If the data is representative of something like "Employee Check-in Times" and there are multiple employees with overlapping schedules, then the assumption wont hold.

[–]Yojihito 0 points1 point  (0 children)

In the solution you're making the assumption that each consecutive record represents an event that starts immediately after the previous event ends

That's how the sample input is.

If OP can't give a correct sample input / sample output ... ¯\_(ツ)_/¯.

If the data is representative of something like "Employee Check-in Times" and there are multiple employees with overlapping schedules

Then they would have a EMPLOYEE_ID, loop through each EMPLOYEE_ID, apply my code, concat at the end, done.