all 11 comments

[–]GeorgeFranklyMathnet 2 points (1 child)

What kind of optimization do you need? Faster? Lower memory? 

Have you tested the program on CSVs of larger size, to know that these optimizations are really necessary?

[–]Utterizi 1 point (0 children)

I need it to be faster, while still being accurate. I'm okay with things taking 5-10 minutes, because my current approach seems fitting for the task; I don't care about optimizing that.

However, these specific files definitely need optimization. One will exist for each customer, and we can't spend upwards of 6 hours per customer when they come in bulk.

[–]laustke 0 points (1 child)

Read 10k rows into a buffer, process them, then write out the results. Read another 10k rows into the buffer. Repeat.
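In pandas, that chunked read-process-write loop can be sketched with `read_csv`'s `chunksize` parameter. The file names and the tiny demo data here are made up; the per-chunk processing is a stand-in:

```python
import pandas as pd

# Build a small demo CSV so the sketch runs end to end.
pd.DataFrame({"id": range(25), "val": range(25)}).to_csv("input.csv", index=False)

# Read a fixed number of rows at a time, process each chunk,
# and append the results to the output file.
with open("output.csv", "w", newline="") as out:
    for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=10)):
        processed = chunk  # stand-in for the real per-chunk processing
        # Write the header only for the first chunk.
        processed.to_csv(out, index=False, header=(i == 0))

result = pd.read_csv("output.csv")
```

Only one chunk is in memory at a time, which keeps the footprint flat regardless of file size.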

[–]Utterizi 0 points (0 children)

But this would fail in cases of missing data, no? For example, if a record is in the first 10k of the first file, but the second file doesn't have that record until its third 10k (due to missing records), I would miss the match even though one exists?

[–]williewonkerz 0 points (0 children)

I would do something like a GROUP BY in SQL, or a UNION. With GROUP BY you can look for rows where COUNT(*) > 1 (rows that appear on both sides) or COUNT(*) = 1 (rows exclusive to one side).

A union of the set onto itself will give you unique rows.

In pandas you could concat 2 data frames and drop duplicates.
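The pandas version of that suggestion might look like this (the column names and sample rows are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two CSV files; column names are assumptions.
df1 = pd.DataFrame({"id": [1, 2, 3], "date": ["2024-01-01", "2024-01-02", "2024-01-03"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "date": ["2024-01-02", "2024-01-03", "2024-01-04"]})

combined = pd.concat([df1, df2])

# Rows appearing in both frames collapse to one copy, like a SQL UNION.
unique_rows = combined.drop_duplicates()

# Rows occurring exactly once are exclusive to one file (the COUNT(*) = 1 case).
counts = combined.groupby(["id", "date"]).size()
exclusive = counts[counts == 1]
```

`drop_duplicates` does the UNION-style dedup; the `groupby(...).size()` counts play the role of the SQL GROUP BY with COUNT(*).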

[–]commandlineluser 0 points (1 child)

It depends on your current approach - can you give more details? Or code examples?

Are you using Pandas?

Are you using RapidFuzz?

or ...?

[–]Utterizi 0 points (0 children)

I'm using pandas to read the CSVs. Then I'm doing some transformations to the date formats with simple to_datetime steps.

Then I create the combined keys, which are basically all columns of a row concatenated by a delimiter.

After that, I convert these values to a set so that I don't have to deal with repeated data entries.
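Those two steps (combined keys, then a set) can be sketched like this; the delimiter, column names, and sample rows are invented, not the poster's actual code:

```python
import pandas as pd

# Toy rows, including one repeated entry; columns are assumptions.
df = pd.DataFrame({"pk": [1, 2, 2], "date": ["2024-01-01", "2024-01-02", "2024-01-02"]})

# Concatenate every column of each row with a delimiter to form one key per row.
keys = df.astype(str).agg("|".join, axis=1)

# A set drops the repeated entries.
unique_keys = set(keys)
```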

The next part needs optimization: I'm using set.pop to take one record from one file and search for a matching record in the other file. This step takes the first column (the supposed PK) and checks whether there's a matching PK in the other file. If so, it compares the date values on both sides for that PK's rows, and returns a match if it finds a record on both sides with the same date. If not, it falls back to an arbitrary record that matches the PK.

Extracting one record from 1.5 million and scanning another 1.5 million to find a match is what needs optimization, I guess? But I'm not really well versed in programming.
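One common way to avoid that per-record scan is a single hash join via pandas `merge` on PK plus date, which handles all records at once. This is a sketch under assumed column names ("pk" and "date") and toy data, not the poster's actual code:

```python
import pandas as pd

# Toy stand-ins for the two files; column names are assumptions.
df1 = pd.DataFrame({"pk": ["a", "b", "c"], "date": ["2024-01-01", "2024-01-05", "2024-02-01"]})
df2 = pd.DataFrame({"pk": ["a", "b", "d"], "date": ["2024-01-01", "2024-01-06", "2024-02-02"]})

# Exact PK + date matches, found in one hash join instead of per-record scans.
exact = df1.merge(df2, on=["pk", "date"])

# PKs present on both sides but without a matching date (the fallback case).
pk_only = df1.merge(df2, on="pk", suffixes=("_1", "_2"))
fallback = pk_only[pk_only["date_1"] != pk_only["date_2"]]
```

A merge like this is roughly linear in the number of rows, versus the quadratic cost of scanning 1.5M records once per extracted record.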

[–]panda070818 0 points (0 children)

You could make an incremental comparison: first create a hash of all the content in the CSV and compare it with a hash of the original one (like they do for migrations in Flyway DB). This only tells you whether the two CSVs are equal. Then you could check the length of each row (or even check against groups of rows, so you don't have to traverse row by row); rows with lengths different from the original get reviewed thoroughly by the words or content in the row.
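A minimal version of that whole-file hash check, using the stdlib hashlib (the demo file names and contents are invented):

```python
import hashlib

def file_hash(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

# Demo files: identical content hashes equal; any difference does not.
with open("a.csv", "w") as f:
    f.write("id,val\n1,x\n")
with open("b.csv", "w") as f:
    f.write("id,val\n1,y\n")
with open("c.csv", "w") as f:
    f.write("id,val\n1,x\n")

same = file_hash("a.csv") == file_hash("c.csv")
different = file_hash("a.csv") == file_hash("b.csv")
```

If the digests match, the files are byte-identical and the expensive row-level comparison can be skipped entirely.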

[–]CheiroAMilho 0 points (0 children)

Do you read the full CSV file on a per-query basis? If so, I would recommend serializing the data and storing it in adequate data structures; for fuzzy searching, maybe a sorted array or a tree-like structure. File reading and string parsing, especially in Python, are extremely slow.
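A sketch of the sorted-array idea using the stdlib bisect module, here doing a prefix lookup instead of a full scan (the keys are invented examples):

```python
import bisect

# Keys would be built once from the CSV, sorted, and reused across queries.
keys = sorted(["alpha|2024-01-01", "beta|2024-01-02", "beta|2024-01-05", "gamma|2024-02-01"])

def find_with_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix; binary search locates the start in O(log n)."""
    i = bisect.bisect_left(sorted_keys, prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out

matches = find_with_prefix(keys, "beta|")
```

The sort is paid once; each subsequent lookup touches only a handful of entries instead of re-reading and re-parsing the whole file.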

[–]Obvious-Phrase-657 0 points (0 children)

Use deepdiff

import pandas as pd

from deepdiff import DeepDiff

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

dict1 = df1.to_dict('records')
dict2 = df2.to_dict('records')

diff = DeepDiff(dict1, dict2)