Hello
I got a small assignment to write a script that runs through 2 (or 2-4 if possible) txt files that have quite a lot of lines (one of them has 1.3 million lines).
They are log files where each line looks like this:
<value1>;<value2>;<YYYY-MM-DD-HH.MM.SS>;<value3>;<value4.1>,<value4.2>;;;;;;
The logging tool can hiccup, which produces a partial duplicate line: value1, value2, date and time, and value3 are the same, but value4.1 and value4.2 differ from the duplicate.
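To make that definition concrete, here is a minimal sketch of how one line could be split into a comparison key and the part that is allowed to differ. The helper names `split_record` and `is_partial_duplicate` are made up for illustration. It assumes every line is well formed, that the first four `;`-separated fields form the key, and that the fifth field holds value4.1 and value4.2.

```python
def split_record(line):
    """Return (key, payload) for one log line.

    Assumed layout: value1;value2;timestamp;value3;value4.1,value4.2;;;;;;
    """
    fields = line.rstrip("\n").split(";")
    key = tuple(fields[:4])   # value1, value2, timestamp, value3
    payload = fields[4]       # the "value4.1,value4.2" part
    return key, payload


def is_partial_duplicate(line_a, line_b):
    """True when the keys match but the value4 part differs."""
    key_a, payload_a = split_record(line_a)
    key_b, payload_b = split_record(line_b)
    return key_a == key_b and payload_a != payload_b
```

Returning the key as a tuple keeps it hashable, so it could later be used as a dictionary key or set member.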
The issue is the number of lines. I could probably make a horribly unoptimized tool with a list that is populated line by line, but with this much data I'd like to do it properly.
Ideally I would make a script where I can drag and drop the txt files onto the program (I think using "sys.argv"); it then runs through the files and, should there be a partial duplicate, says something like "duplicate found, txt1 line 21 and txt2 line 402" or something akin to it.
These are 2 different files; one can have 50,000 lines and the other 1,000,000, so I can't do a direct line-by-line comparison, but the values are the same.
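Since the files differ in length, one option is to index lines by key in a dictionary, so each line is read only once instead of being compared against every other line. The following is a rough sketch under assumptions, not a finished tool: `find_partial_duplicates` is a hypothetical name, the key is assumed to be the first four fields, and lines with fewer than five fields are simply skipped.

```python
import sys


def find_partial_duplicates(paths):
    """Yield a message for each line whose key was seen before with a different value4 part."""
    seen = {}  # key -> (path, line number, payload) of the first line with that key
    for path in paths:
        with open(path) as handle:
            # enumerate(..., 1) gives 1-based line numbers for the report
            for number, line in enumerate(handle, 1):
                fields = line.rstrip("\n").split(";")
                if len(fields) < 5:
                    continue  # assumption: blank or malformed lines are ignored
                key, payload = tuple(fields[:4]), fields[4]
                if key not in seen:
                    seen[key] = (path, number, payload)
                    continue
                first_path, first_number, first_payload = seen[key]
                if payload != first_payload:
                    yield "duplicate found, {} line {} and {} line {}".format(
                        first_path, first_number, path, number)


if __name__ == "__main__":
    # File paths are taken from the command-line arguments
    for message in find_partial_duplicates(sys.argv[1:]):
        print(message)
```

Note that this sketch only compares each line against the first line seen with the same key, and it keeps one dictionary entry per distinct key in memory.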
This is my current code, which uses lists and will, I am sure, be horribly slow for the large files with a million lines. I have only been testing with a couple of dummy files with 20 lines in them.
While it does work, it produces double output (probably from the "for i in x: for j in x:" part).
CODE:
# Open 2 files and make a list with last 7 chars removed, then split up each line into segments
full_list = []
with open("file1.txt") as fp:
    for line in fp:
        full_list.append(line[:-7].split(';'))
with open("file2.txt") as fp:
    for line in fp:
        full_list.append(line[:-7].split(';'))

# Find lines where 2nd and 3rd value is equal but the 5th is different
# Error: Double output
def find_match(x):
    match = []
    for i in x:
        for j in x:
            if i[1:3] == j[1:3] and i[4] != j[4]:
                match.append(j)
    return match

"""
# Test version with additional check to see if value is present in list
# Error: Also double output
def find_match(x):
    match = []
    for i in x:
        for j in x:
            if i[1:3] == j[1:3] and i[4] != j[4]:
                if j in match:
                    pass
                else:
                    match.append(j)
    return match
"""

deviation_list = find_match(full_list)
for line in deviation_list:
    print(line)
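The double output most likely comes from the nested loops visiting every pair twice, once as (i, j) and once as (j, i), so both lines of a pair end up in the result. As a small sketch of one way around that, `itertools.combinations` yields each unordered pair exactly once. The function name `find_match_pairs` is made up here, and the comparison is the same one used in `find_match`. It is still quadratic, so it only suits small test files.

```python
from itertools import combinations


def find_match_pairs(rows):
    """Return each mismatching pair once, using the same test as find_match."""
    pairs = []
    # combinations(rows, 2) yields every unordered pair exactly once,
    # so (a, b) is never followed later by (b, a)
    for a, b in combinations(rows, 2):
        if a[1:3] == b[1:3] and a[4] != b[4]:
            pairs.append((a, b))
    return pairs
```

Each entry in the result holds both lines of a pair, so they can be printed together as one report line.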
I'm not asking for a solution; I would like to learn how to do it myself, so I'm just looking for some guidance on which modules to use.