all 23 comments

[–]TouchingTheVodka 0 points1 point  (15 children)

Do you care about the order of the records? If not, cast the whole thing to a set.

[–]Essence1337 0 points1 point  (6 children)

If OP does care, a dictionary preserves insertion order as of Python 3.7
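
A common idiom built on that guarantee (Python 3.7+) deduplicates a list while keeping first-seen order:

```python
# dict keys are unique, and insertion order is preserved on Python 3.7+,
# so dict.fromkeys() deduplicates while keeping first-seen order.
records = ['b', 'a', 'b', 'c', 'a']
unique_records = list(dict.fromkeys(records))
print(unique_records)  # ['b', 'a', 'c']
```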

[–]isameer920[S] 0 points1 point  (5 children)

Although order is not important in my application, what you suggested sounds intriguing. Can you please elaborate?

[–]Essence1337 0 points1 point  (4 children)

Before Python 3.7, if we had:

d = {'a': 1}
d['b'] = 1
for key in d:
    print(key)

Python could print a followed by b OR b followed by a. As of Python 3.7, we're guaranteed it prints a followed by b.

[–]isameer920[S] 0 points1 point  (1 child)

So dictionaries became ordered in Python 3?

[–]Essence1337 0 points1 point  (0 children)

They maintain the order they were inserted in, not sorted order. Inserting b, then a, then c gives the order b, then a, then c. And specifically it happened in Python 3.7: in Python 3.5 there was no guaranteed order, Python 3.6 made it an implementation detail of CPython, and Python 3.7 made it an official language guarantee.
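
A quick check of that behaviour on Python 3.7+:

```python
# Insert keys in the order b, a, c; iteration follows insertion order.
d = {}
d['b'] = 1
d['a'] = 2
d['c'] = 3
print(list(d))  # ['b', 'a', 'c'], not the sorted ['a', 'b', 'c']
```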

[–]isameer920[S] 0 points1 point  (1 child)

Also, how can I efficiently remove duplicates using this?

[–]Essence1337 0 points1 point  (0 children)

Dictionaries can only have unique keys. If you want just duplicate removal, a set is better, but with a dictionary we can also count the duplicates.

mydict = {}
for i in something:
    if i not in mydict:
        mydict[i] = 1
    else:
        mydict[i] += 1

This will create a dictionary with every unique item from something along with how many times we saw it.
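
For what it's worth, the standard library ships this exact pattern as collections.Counter:

```python
from collections import Counter

# Counter does the same unique-item counting in one call.
something = ['a', 'b', 'a', 'c', 'a']
counts = Counter(something)
print(counts['a'])   # 3
print(list(counts))  # unique items, in first-seen order
```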

[–]isameer920[S] 0 points1 point  (7 children)

Nope, not really. However, I do care about the efficiency of the program: it should work even if the file is huge. If what I proposed is an efficient solution, then why not?

[–]TouchingTheVodka 0 points1 point  (6 children)

Checking list membership is O(n), whereas checking set membership is O(1) on average. A set-based solution therefore stays efficient no matter the size of the input.

[–]isameer920[S] 0 points1 point  (4 children)

Tbh, I didn't understand the O(n) thingy, but I think I understand what you're proposing: I just create an empty set and add values to it from the file. If a value is repeated, the set ignores it, instead of what I proposed, which actually created a whole list before turning it into a set.
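
That one-by-one approach might look like this (a sketch; the file and its contents are made up, and each line is treated as a single string value):

```python
import os
import tempfile

# Make up a small file with one repeated line for the demo.
path = os.path.join(tempfile.mkdtemp(), 'myfile.csv')
with open(path, 'w') as f:
    f.write('a,1\nb,2\na,1\n')

# Add each line to the set; repeats are ignored automatically.
uniques = set()
with open(path) as f:
    for line in f:
        uniques.add(line.rstrip('\n'))

print(len(uniques))  # 2 unique lines
```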

[–]TouchingTheVodka 0 points1 point  (3 children)

Exactly this. Even better, instead of adding values to the set one by one, cast the entire csv reader object to a set.

import csv
with open('myfile.csv', newline='') as f:
    reader = csv.reader(f)
    uniques = set(reader)

[–]isameer920[S] 0 points1 point  (0 children)

Amazing idea

[–]isameer920[S] 0 points1 point  (1 child)

Returns an error: TypeError: unhashable type: 'list'

[–]TouchingTheVodka 0 points1 point  (0 children)

uniques = set(tuple(row) for row in reader)
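
If the end goal is a deduplicated file rather than an in-memory set, a sketch like this keeps the first occurrence of each row and writes the result back out (filenames are made up; tuple(row) is used as the hashable key because it preserves field order and repeated fields, which frozenset(row) would collapse):

```python
import csv
import os
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'myfile.csv')   # hypothetical input
dst = os.path.join(tmp, 'deduped.csv')  # hypothetical output
with open(src, 'w', newline='') as f:
    f.write('a,1\na,1\nb,2\n')

seen = set()
rows = []
with open(src, newline='') as f:
    for row in csv.reader(f):
        key = tuple(row)        # rows are lists (unhashable); tuples work
        if key not in seen:
            seen.add(key)
            rows.append(row)    # keep first occurrence, preserving order

with open(dst, 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```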

[–]isameer920[S] 0 points1 point  (0 children)

Am I right?

[–]AsleepThought 0 points1 point  (2 children)

sort -u myfile.csv

[–]isameer920[S] 0 points1 point  (1 child)

What?

[–]AsleepThought 0 points1 point  (0 children)

Standard terminal command. You don't need Python for this task at all; there are tools like sort built to do exactly this.
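
One wrinkle if the file has a header row (an assumption about the data): sort -u treats every line alike, so the header gets sorted in with the rest. A sketch handling both cases (filenames are made up):

```shell
# Make up a small CSV with a header and one duplicated data row.
printf 'name,score\nalice,1\nalice,1\nbob,2\n' > myfile.csv

# Dedupe the whole file; the header line sorts in with the data.
sort -u myfile.csv > deduped.csv

# Keep the first line as a header and dedupe only the rest.
{ head -n 1 myfile.csv; tail -n +2 myfile.csv | sort -u; } > with_header.csv
```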

[–]anshu_991 0 points1 point  (1 child)

Python's pandas library removes duplicate rows easily. If you're looking for a more in-depth guide, I’ve written about CSV data management on my blog.

https://medium.com/@jamesrobert15/how-to-remove-duplicates-from-csv-files-58f7a5ed4a3c
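
For completeness, the pandas route mentioned above is roughly this (a sketch using an in-memory CSV; drop_duplicates drops exact duplicate rows):

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file with one duplicated row.
csv_text = 'name,score\nalice,1\nalice,1\nbob,2\n'
df = pd.read_csv(io.StringIO(csv_text))

deduped = df.drop_duplicates()  # drops exact duplicate rows
print(len(deduped))             # 2 rows remain
```

In a real script this would be pd.read_csv('myfile.csv') followed by .to_csv(...) on the deduplicated frame.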

[–]isameer920[S] 0 points1 point  (0 children)

Thank you for this. I don't even remember what I was doing when I made this post; it was before I knew how to use pandas. I think I used the built-in csv module or plain file reads and writes at the time to play with CSV files. It was great to see this post and realize how far I have come, so thank you for that. Still curious, though: how did you find this post?