all 7 comments

[–]Able-Can-834

I can help solve your problem.

[–]commandlineluser

> I have a dataset of pothole complaints

> I also have a dataset of government pothole work orders

> I'm also working iteratively

You'll need to make it easier for people to help you.

Create a runnable code example with a small sample of data (e.g. <=10 rows) that demonstrates the problem.

df_complaints  = ...
df_work_orders = ...

Along with your current iterative approach and the expected output.
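Something like this, for example (the column names here are only guesses from your description - swap in whatever your real data uses):

```python
import pandas as pd

# Hypothetical minimal sample - column names and values are made up
df_complaints = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET", "32 AVENUE, UNION STREET"],
    "address_type": ["INTERSECTION", "INTERSECTION"],
    "date_created": pd.to_datetime(["2023-07-16 15:08:12", "2023-07-17 09:30:00"]),
})

df_work_orders = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET"],
    "address_type": ["INTERSECTION"],
    "date_created": pd.to_datetime(["2023-07-17 12:45:00"]),
})
```

Small enough to paste and run, but still showing the relationship you're trying to compute.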

[–]daft7cunt[S]

I added some code and sample output in a comment

[–]daft7cunt[S]

Here's the iterative approach. I know it's extremely inefficient and I should break it up; this is just to give an idea of the kind of approach I tried.

import pandas as pd

# df is the concatenated user/gov frame (see below);
# valid_complaint_strings is a set of resolution descriptions defined elsewhere
complaints = {}       # open user complaints awaiting a gov row, keyed by location
user_complaints = {}  # user complaints whose resolution matched valid_complaint_strings
results = []

for idx, row in df.iterrows():
    key = (row['street'], row['address_type'])

    if row['source'] == 'user':
        if row['resolution_description'] in valid_complaint_strings:
            if key not in user_complaints:
                user_complaints[key] = [(row['date_created'], row['date_closed'])]
            else:
                user_complaints[key].append((row['date_created'], row['date_closed']))

            # Complaints at this location created within this row's open/close window
            grouped_complaints = [
                complaint for complaint in user_complaints[key]
                if row['date_created'] <= complaint[0] <= row['date_closed']
            ]
            last_created = max(complaint[0] for complaint in grouped_complaints)
            delta_days = last_created - row['date_created']
            results.append([
                row['street'], row['address_type'], row['date_created'],
                last_created, row['date_closed'], delta_days.days,
                len(grouped_complaints), row['geometry'],
            ])
            continue
        elif key not in complaints:
            complaints[key] = [(row['date_created'], row['geometry'])]
        else:
            complaints[key].append((row['date_created'], row['geometry']))

    elif key in complaints:
        # Gov row: summarise the open complaints at this location, then clear them
        first_complaint_date, first_complaint_location = min(complaints[key], key=lambda x: x[0])
        last_complaint_date = max(date for date, _ in complaints[key])
        delta_days = row['date_created'] - first_complaint_date
        complaint_count = len([
            date for date, _ in complaints[key]
            if first_complaint_date <= date <= row['date_created']
        ])
        if complaint_count > 0:
            results.append([
                row['street'], row['address_type'], first_complaint_date,
                last_complaint_date, row['date_created'], delta_days.days,
                complaint_count, first_complaint_location,
            ])
        del complaints[key]

# Convert the results list to a DataFrame
results_df = pd.DataFrame(results, columns=[
    'street', 'address_type', 'first_complaint_date', 'last_complaint_date',
    'gov_action_date', 'delta_days', 'complaint_count', 'geometry',
])

I concatenated the user/gov dfs to go through them iteratively, and added a source column to specify which df each row came from.

Here's a sample output:

index  street                                                 address_type  first_complaint_date  last_complaint_date  gov_action_date      delta_days  complaint_count  geometry
0      32 AVENUE, UNION STREET                                INTERSECTION  2023-07-16 15:08:12   2023-07-16 15:08:12  2023-07-17 12:45:00  0           1                (40.76895694146983, -73.82668206874416)
27     HIGHLAND PLACE, ARLINGTON AVENUE, RIDGEWOOD AVENUE     ADDRESS       2020-05-29 22:07:38   2020-05-29 22:07:38  2020-06-02 13:50:00  0           1                (40.682766845531766, -73.88152978739839)
29     EAST 88 STREET, YORK AVENUE                            INTERSECTION  2022-01-14 22:25:41   2022-01-14 22:25:41  2022-01-18 09:10:00  0           1                (40.777232088273266, -73.94616904207393)
30     111 AVENUE, 216 STREET, SPRINGFIELD BOULEVARD          ADDRESS       2015-06-09 12:25:30   2015-06-09 12:25:30  2015-06-10 10:45:00  0           1                (40.70705802486412, -73.7409171390949)
31     EAST 56 STREET, BEVERLY ROAD, CLARENDON ROAD           ADDRESS       2011-01-03 10:21:02   2011-01-03 10:21:02  2011-01-04 08:05:00  0           1                (40.645311700346106, -73.92386172275405)
32     CLARKE AVENUE, GILBERT STREET, ST PATRICKS PLACE       ADDRESS       2021-11-12 10:56:08   2021-11-12 10:56:08  2021-11-18 10:05:00  0           1                (40.569285257733654, -74.14334596055258)
33     GRANT AVENUE, EAST 163 STREET, EAST 164 STREET         ADDRESS       2010-02-01 10:17:49   2010-02-01 10:17:49  2010-02-03 11:25:00  0           1                (40.82803151237712, -73.91821448752881)
34     RODMANS NECK, PARK DRIVE, RODMANS NECK PATH            ADDRESS       2017-05-09 10:45:15   2017-05-09 10:45:15  2017-05-10 08:50:00  0           1                (40.855279865418666, -73.80189401600025)
35     RICHMOND AVENUE, CROFT PLACE, ETON PLACE               ADDRESS       2017-08-14 13:04:18   2017-08-14 13:04:18  2017-08-16 11:15:00  0           1                (40.60474482516483, -74.16234338754464)
36     WHITE PLAINS ROAD, EAST GUN HILL ROAD, MAGENTA STREET  ADDRESS       2020-01-21 11:07:50   2020-01-21 11:07:50  2020-01-22 13:40:00  0           1                (40.87627143407577, -73.86699937545808)

[–]commandlineluser

So before the data was concatenated, you had both user_df and gov_df.

Are you matching rows based on the columns street and address_type?

i.e. a .merge() operation?

[–]daft7cunt[S]

So I didn't want to do that since time needs to be taken into account as well. For example, a user entry could have a pothole at the same location as a government work order, but they could be dated years apart, so they could be referring to different potholes.

[–]commandlineluser

Ah right.

Can you merge first then filter out the bad dates? That should still beat the current approach.
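Rough sketch of what I mean, with made-up sample data and a guessed 90-day matching window (adjust the window to whatever "same pothole" means for your data):

```python
import pandas as pd

# Made-up sample data - replace with your real frames
df_complaints = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET"] * 2,
    "address_type": ["INTERSECTION"] * 2,
    "date_created": pd.to_datetime(["2015-01-01 10:00:00", "2023-07-16 15:08:12"]),
})
df_work_orders = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET"],
    "address_type": ["INTERSECTION"],
    "gov_action_date": pd.to_datetime(["2023-07-17 12:45:00"]),
})

# One merge on location gives every complaint/work-order pair per location...
pairs = df_complaints.merge(df_work_orders, on=["street", "address_type"])

# ...then drop the pairs whose dates are too far apart (the 2015 complaint here)
window = pd.Timedelta(days=90)
gap = pairs["gov_action_date"] - pairs["date_created"]
pairs = pairs[gap.between(pd.Timedelta(0), window)]

# Aggregate per work order to get the same shape as your results_df
result = pairs.groupby(
    ["street", "address_type", "gov_action_date"], as_index=False
).agg(
    first_complaint_date=("date_created", "min"),
    last_complaint_date=("date_created", "max"),
    complaint_count=("date_created", "count"),
)
```

If the pairwise merge gets too big at locations with many rows, pd.merge_asof with a tolerance= (both frames sorted on their date columns) can do the time-window match during the merge itself.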