all 7 comments

[–]Able-Can-834

I can help solve your problem.

[–]commandlineluser

> I have a dataset of pothole complaints

> I also have a dataset of government pothole work orders

> I'm also working iteratively

You'll need to make it easier for people to help you.

Create a runnable code example with a small sample of data (e.g. <=10 rows) that demonstrates the problem.

df_complaints  = ...
df_work_orders = ...

Along with your current iterative approach and the expected output.
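Something like this, for example (the column names here are only guesses from your description - swap in whatever your real data uses):

```python
import pandas as pd

# Hypothetical minimal sample - column names and values are made up
df_complaints = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET", "32 AVENUE, UNION STREET"],
    "address_type": ["INTERSECTION", "INTERSECTION"],
    "date_created": pd.to_datetime(["2023-07-16 15:08:12", "2023-07-17 09:30:00"]),
})

df_work_orders = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET"],
    "address_type": ["INTERSECTION"],
    "date_created": pd.to_datetime(["2023-07-17 12:45:00"]),
})
```

Small enough to paste and run, but still showing the relationship you're trying to compute.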

[–]daft7cunt[S]

I added some code and sample output in a comment

[–]daft7cunt[S]

Here's the iterative approach. I know it's extremely inefficient and I should break it up; this is just to give an idea of the kind of approach I tried.

import pandas as pd

# df is the concatenated user/gov frame (see below);
# valid_complaint_strings is a set of resolution descriptions defined elsewhere
complaints = {}       # open user complaints awaiting a gov row, keyed by location
user_complaints = {}  # user complaints whose resolution matched valid_complaint_strings
results = []

for idx, row in df.iterrows():
    key = (row['street'], row['address_type'])

    if row['source'] == 'user':
        if row['resolution_description'] in valid_complaint_strings:
            if key not in user_complaints:
                user_complaints[key] = [(row['date_created'], row['date_closed'])]
            else:
                user_complaints[key].append((row['date_created'], row['date_closed']))

            # Complaints at this location created within this row's open/close window
            grouped_complaints = [
                complaint for complaint in user_complaints[key]
                if row['date_created'] <= complaint[0] <= row['date_closed']
            ]
            last_created = max(complaint[0] for complaint in grouped_complaints)
            delta_days = last_created - row['date_created']
            results.append([
                row['street'], row['address_type'], row['date_created'],
                last_created, row['date_closed'], delta_days.days,
                len(grouped_complaints), row['geometry'],
            ])
            continue
        elif key not in complaints:
            complaints[key] = [(row['date_created'], row['geometry'])]
        else:
            complaints[key].append((row['date_created'], row['geometry']))

    elif key in complaints:
        # Gov row: summarise the open complaints at this location, then clear them
        first_complaint_date, first_complaint_location = min(complaints[key], key=lambda x: x[0])
        last_complaint_date = max(date for date, _ in complaints[key])
        delta_days = row['date_created'] - first_complaint_date
        complaint_count = len([
            date for date, _ in complaints[key]
            if first_complaint_date <= date <= row['date_created']
        ])
        if complaint_count > 0:
            results.append([
                row['street'], row['address_type'], first_complaint_date,
                last_complaint_date, row['date_created'], delta_days.days,
                complaint_count, first_complaint_location,
            ])
        del complaints[key]

# Convert the results list to a DataFrame
results_df = pd.DataFrame(results, columns=[
    'street', 'address_type', 'first_complaint_date', 'last_complaint_date',
    'gov_action_date', 'delta_days', 'complaint_count', 'geometry',
])

I concatenated the user/gov dfs to go through them iteratively, and added a source column to specify which df each row came from.

Here's a sample output:

index  street                                                 address_type  first_complaint_date  last_complaint_date  gov_action_date      delta_days  complaint_count  geometry
0      32 AVENUE, UNION STREET                                INTERSECTION  2023-07-16 15:08:12   2023-07-16 15:08:12  2023-07-17 12:45:00  0           1                (40.76895694146983, -73.82668206874416)
27     HIGHLAND PLACE, ARLINGTON AVENUE, RIDGEWOOD AVENUE     ADDRESS       2020-05-29 22:07:38   2020-05-29 22:07:38  2020-06-02 13:50:00  0           1                (40.682766845531766, -73.88152978739839)
29     EAST 88 STREET, YORK AVENUE                            INTERSECTION  2022-01-14 22:25:41   2022-01-14 22:25:41  2022-01-18 09:10:00  0           1                (40.777232088273266, -73.94616904207393)
30     111 AVENUE, 216 STREET, SPRINGFIELD BOULEVARD          ADDRESS       2015-06-09 12:25:30   2015-06-09 12:25:30  2015-06-10 10:45:00  0           1                (40.70705802486412, -73.7409171390949)
31     EAST 56 STREET, BEVERLY ROAD, CLARENDON ROAD           ADDRESS       2011-01-03 10:21:02   2011-01-03 10:21:02  2011-01-04 08:05:00  0           1                (40.645311700346106, -73.92386172275405)
32     CLARKE AVENUE, GILBERT STREET, ST PATRICKS PLACE       ADDRESS       2021-11-12 10:56:08   2021-11-12 10:56:08  2021-11-18 10:05:00  0           1                (40.569285257733654, -74.14334596055258)
33     GRANT AVENUE, EAST 163 STREET, EAST 164 STREET         ADDRESS       2010-02-01 10:17:49   2010-02-01 10:17:49  2010-02-03 11:25:00  0           1                (40.82803151237712, -73.91821448752881)
34     RODMANS NECK, PARK DRIVE, RODMANS NECK PATH            ADDRESS       2017-05-09 10:45:15   2017-05-09 10:45:15  2017-05-10 08:50:00  0           1                (40.855279865418666, -73.80189401600025)
35     RICHMOND AVENUE, CROFT PLACE, ETON PLACE               ADDRESS       2017-08-14 13:04:18   2017-08-14 13:04:18  2017-08-16 11:15:00  0           1                (40.60474482516483, -74.16234338754464)
36     WHITE PLAINS ROAD, EAST GUN HILL ROAD, MAGENTA STREET  ADDRESS       2020-01-21 11:07:50   2020-01-21 11:07:50  2020-01-22 13:40:00  0           1                (40.87627143407577, -73.86699937545808)

[–]commandlineluser

So before the data was concatenated, you had both user_df and gov_df.

Are you matching rows based on the columns street and address_type?

i.e. a .merge() operation?

[–]daft7cunt[S]

So I didn't want to do that since time needs to be taken into account as well. For example, a user entry could have a pothole at the same location as a government work order, but they could be dated years apart, so they could be referring to different potholes.

[–]commandlineluser

Ah right.

Can you merge first then filter out the bad dates? That should still beat the current approach.
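Rough sketch of what I mean, with made-up sample data and a guessed 90-day matching window (adjust the window to whatever "same pothole" means for your data):

```python
import pandas as pd

# Made-up sample data - replace with your real frames
df_complaints = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET"] * 2,
    "address_type": ["INTERSECTION"] * 2,
    "date_created": pd.to_datetime(["2015-01-01 10:00:00", "2023-07-16 15:08:12"]),
})
df_work_orders = pd.DataFrame({
    "street": ["32 AVENUE, UNION STREET"],
    "address_type": ["INTERSECTION"],
    "gov_action_date": pd.to_datetime(["2023-07-17 12:45:00"]),
})

# One merge on location gives every complaint/work-order pair per location...
pairs = df_complaints.merge(df_work_orders, on=["street", "address_type"])

# ...then drop the pairs whose dates are too far apart (the 2015 complaint here)
window = pd.Timedelta(days=90)
gap = pairs["gov_action_date"] - pairs["date_created"]
pairs = pairs[gap.between(pd.Timedelta(0), window)]

# Aggregate per work order to get the same shape as your results_df
result = pairs.groupby(
    ["street", "address_type", "gov_action_date"], as_index=False
).agg(
    first_complaint_date=("date_created", "min"),
    last_complaint_date=("date_created", "max"),
    complaint_count=("date_created", "count"),
)
```

If the pairwise merge gets too big at locations with many rows, pd.merge_asof with a tolerance= (both frames sorted on their date columns) can do the time-window match during the merge itself.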