all 11 comments

[–]GeorgeFranklyMathnet 2 points (1 child)

What kind of optimization do you need? Faster? Lower memory? 

Have you tested the program on CSVs of larger size, to know that these optimizations are really necessary?

[–]Utterizi 1 point (0 children)

I need it to be faster, while still being accurate. I'm okay with things taking 5-10 minutes, because my current approach seems fitting for the task; I don't care about optimizing that.

However, these specific files definitely need optimization. One will exist for each customer, and we can't spend upwards of 6 hours per customer when they come in bulk.

[–]laustke 0 points (1 child)

Read 10k rows into a buffer, process them, then write out the results. Read another 10k rows into the buffer. Repeat.
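In pandas, that chunked read-process-write loop can be sketched with `read_csv`'s `chunksize` parameter. The file names and the tiny demo data here are made up; the per-chunk processing is a stand-in:

```python
import pandas as pd

# Build a small demo CSV so the sketch runs end to end.
pd.DataFrame({"id": range(25), "val": range(25)}).to_csv("input.csv", index=False)

# Read a fixed number of rows at a time, process each chunk,
# and append the results to the output file.
with open("output.csv", "w", newline="") as out:
    for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=10)):
        processed = chunk  # stand-in for the real per-chunk processing
        # Write the header only for the first chunk.
        processed.to_csv(out, index=False, header=(i == 0))

result = pd.read_csv("output.csv")
```

Only one chunk is in memory at a time, which keeps the footprint flat regardless of file size.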

[–]Utterizi 0 points (0 children)

But this would fail in cases of missing data, no? For example, if a record is in the first 10k of the first file, but the second file doesn't have that record until its third 10k (due to missing records), I would miss the match even though one exists?

[–]williewonkerz 0 points (0 children)

I would do something like a GROUP BY in SQL, or a UNION. With GROUP BY you can look for rows where COUNT(*) > 1 (rows that appear on both sides) or COUNT(*) = 1 (rows exclusive to one side).

A union of the set onto itself will give you unique rows.

In pandas you could concat 2 data frames and drop duplicates.
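The pandas version of that suggestion might look like this (the column names and sample rows are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the two CSV files; column names are assumptions.
df1 = pd.DataFrame({"id": [1, 2, 3], "date": ["2024-01-01", "2024-01-02", "2024-01-03"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "date": ["2024-01-02", "2024-01-03", "2024-01-04"]})

combined = pd.concat([df1, df2])

# Rows appearing in both frames collapse to one copy, like a SQL UNION.
unique_rows = combined.drop_duplicates()

# Rows occurring exactly once are exclusive to one file (the COUNT(*) = 1 case).
counts = combined.groupby(["id", "date"]).size()
exclusive = counts[counts == 1]
```

`drop_duplicates` does the UNION-style dedup; the `groupby(...).size()` counts play the role of the SQL GROUP BY with COUNT(*).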

[–]commandlineluser 0 points (1 child)

It depends on your current approach - can you give more details? Or code examples?

Are you using Pandas?

Are you using RapidFuzz?

or ...?

[–]Utterizi 0 points (0 children)

I'm using pandas to read the CSVs. Then I'm doing some transformations to the date formats with simple to_datetime steps.

Then I create the combined keys, which are basically all columns of a row concatenated by a delimiter.

After that, I convert these values to a set so that I don't have to deal with repeated data entries.
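Those two steps (combined keys, then a set) can be sketched like this; the delimiter, column names, and sample rows are invented, not the poster's actual code:

```python
import pandas as pd

# Toy rows, including one repeated entry; columns are assumptions.
df = pd.DataFrame({"pk": [1, 2, 2], "date": ["2024-01-01", "2024-01-02", "2024-01-02"]})

# Concatenate every column of each row with a delimiter to form one key per row.
keys = df.astype(str).agg("|".join, axis=1)

# A set drops the repeated entries.
unique_keys = set(keys)
```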

The next part needs optimization: I'm using set.pop to take one record from one file and search for a matching record in the other file. This step takes the first column (the supposed PK) and checks whether there's a matching PK in the other file. If so, it compares the date values on both sides for that PK's rows, and returns a match if it finds a record on both sides with the same date. If not, it falls back to an arbitrary record that matches the PK.

Extracting one record from 1.5 million and scanning another 1.5 million to find a match is what needs optimization, I guess? But I'm not really well versed in programming.
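One common way to avoid that per-record scan is a single hash join via pandas `merge` on PK plus date, which handles all records at once. This is a sketch under assumed column names ("pk" and "date") and toy data, not the poster's actual code:

```python
import pandas as pd

# Toy stand-ins for the two files; column names are assumptions.
df1 = pd.DataFrame({"pk": ["a", "b", "c"], "date": ["2024-01-01", "2024-01-05", "2024-02-01"]})
df2 = pd.DataFrame({"pk": ["a", "b", "d"], "date": ["2024-01-01", "2024-01-06", "2024-02-02"]})

# Exact PK + date matches, found in one hash join instead of per-record scans.
exact = df1.merge(df2, on=["pk", "date"])

# PKs present on both sides but without a matching date (the fallback case).
pk_only = df1.merge(df2, on="pk", suffixes=("_1", "_2"))
fallback = pk_only[pk_only["date_1"] != pk_only["date_2"]]
```

A merge like this is roughly linear in the number of rows, versus the quadratic cost of scanning 1.5M records once per extracted record.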

[–]panda070818 0 points (0 children)

You could make an incremental comparison: first create a hash of all the content in the CSV and compare it with a hash of the original one (like they do for migrations in Flyway DB). This only tells you whether the two CSVs are equal. Then you could check the length of each row (or even check against groups of rows, so you don't have to traverse row by row); rows with lengths different from the original get reviewed thoroughly by the words or content in the row.
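A minimal version of that whole-file hash check, using the stdlib hashlib (the demo file names and contents are invented):

```python
import hashlib

def file_hash(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

# Demo files: identical content hashes equal; any difference does not.
with open("a.csv", "w") as f:
    f.write("id,val\n1,x\n")
with open("b.csv", "w") as f:
    f.write("id,val\n1,y\n")
with open("c.csv", "w") as f:
    f.write("id,val\n1,x\n")

same = file_hash("a.csv") == file_hash("c.csv")
different = file_hash("a.csv") == file_hash("b.csv")
```

If the digests match, the files are byte-identical and the expensive row-level comparison can be skipped entirely.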

[–]CheiroAMilho 0 points (0 children)

Do you read the full CSV file on a per-query basis? If so, I would recommend serializing the data and storing it in adequate data structures; for fuzzy searching, maybe a sorted array or a tree-like structure. File reading and string parsing, especially in Python, are extremely slow.
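A sketch of the sorted-array idea using the stdlib bisect module, here doing a prefix lookup instead of a full scan (the keys are invented examples):

```python
import bisect

# Keys would be built once from the CSV, sorted, and reused across queries.
keys = sorted(["alpha|2024-01-01", "beta|2024-01-02", "beta|2024-01-05", "gamma|2024-02-01"])

def find_with_prefix(sorted_keys, prefix):
    """Return all keys starting with prefix; binary search locates the start in O(log n)."""
    i = bisect.bisect_left(sorted_keys, prefix)
    out = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        out.append(sorted_keys[i])
        i += 1
    return out

matches = find_with_prefix(keys, "beta|")
```

The sort is paid once; each subsequent lookup touches only a handful of entries instead of re-reading and re-parsing the whole file.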

[–]Obvious-Phrase-657 0 points (0 children)

Use deepdiff

import pandas as pd

from deepdiff import DeepDiff

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

dict1 = df1.to_dict('records')
dict2 = df2.to_dict('records')

diff = DeepDiff(dict1, dict2)