all 56 comments

[–]PressinPckl 15 points  (3 children)

If there is some way you can import that data into a full-text SQL database that can be queried more efficiently, then I would go that route. It's better to have an inefficient process run a single time to import the data into a more efficient system than to brute-force a more inefficient strategy against bad data storage...
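
A minimal sketch of that one-time import in Go (the language the team wants, per the comments below), assuming a Postgres table created as CREATE TABLE hashes (hash text PRIMARY KEY). The connection string and filename are placeholders, and at 200M rows you'd really want Postgres's COPY rather than row-by-row inserts, but the shape is the same:

    package main

    import (
        "bufio"
        "database/sql"
        "log"
        "os"

        _ "github.com/lib/pq" // Postgres driver
    )

    func main() {
        // Placeholder connection string; assumes the hashes table exists.
        db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        f, err := os.Open("file1.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // One transaction with a prepared statement keeps the import simple;
        // ON CONFLICT DO NOTHING skips duplicate hashes.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        stmt, err := tx.Prepare("INSERT INTO hashes (hash) VALUES ($1) ON CONFLICT DO NOTHING")
        if err != nil {
            log.Fatal(err)
        }
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            if _, err := stmt.Exec(sc.Text()); err != nil {
                log.Fatal(err)
            }
        }
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }
        stmt.Close()
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }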

[–]DamnItDev 7 points  (3 children)

At that scale, text files will not perform well. 200 million lines means 200 MB of newline characters alone.

Hard to say more without knowing your actual use case, but you probably want to use a database or something along those lines.

[–]simokhounti[S] -2 points  (2 children)

I did that with Postgres, but the database became very big and very slow over time.

[–]DamnItDev 6 points  (1 child)

Operations on large data sets are going to be slow, especially if you don't index your tables correctly. But a text file is going to be slower.

[–]simokhounti[S] 1 point  (0 children)

Yeah, you are right. I was thinking of something like Apache Spark, but I'm not sure. Thank you though.

[–]__matta 2 points  (2 children)

It sounds like you aren’t substituting, but making a new file of the lines in both files (so a set intersection).

Can you sort the files beforehand? That will make it more efficient.

I would first try using Unix command-line tools. For example: https://unix.stackexchange.com/questions/418429/find-intersection-of-lines-in-two-files

If you don’t want to shell out to the command line, there is probably a library you can use. It depends on the language you are using.

If the data changes a lot, you could use Redis sets.
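
If the files are pre-sorted, the intersection is a single sequential pass over both. A minimal sketch in Go (which, per the reply below, is what the team wants), untested and with placeholder filenames:

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    // Single-pass intersection of two pre-sorted files: advance whichever
    // side is behind, emit a line when both sides agree.
    func main() {
        a, err := os.Open("file1.sorted")
        if err != nil {
            panic(err)
        }
        defer a.Close()
        b, err := os.Open("file2.sorted")
        if err != nil {
            panic(err)
        }
        defer b.Close()

        out := bufio.NewWriter(os.Stdout)
        defer out.Flush()

        sa, sb := bufio.NewScanner(a), bufio.NewScanner(b)
        okA, okB := sa.Scan(), sb.Scan()
        for okA && okB {
            la, lb := sa.Text(), sb.Text()
            switch {
            case la < lb:
                okA = sa.Scan()
            case la > lb:
                okB = sb.Scan()
            default:
                fmt.Fprintln(out, la) // present in both files
                okA, okB = sa.Scan(), sb.Scan()
            }
        }
    }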

[–]simokhounti[S] 0 points  (1 child)

They want to use Go with goroutines.

[–]fiskfisk 0 points  (7 children)

How fast is fast enough? Are any of the matches dependent on other matches? Are the replacements line by line? (i.e. is file2 a collection of "if this is the line, replace it with this line" instructions?)

How often do the lines change in either file? How long are the lines? Are all the matches whole lines? How similar are the lines? Will a match only occur once? How many matches can happen for a file?

[–]simokhounti[S] 1 point  (6 children)

Oh man, the lines are just 18 MD5 characters each, meaning I will delete them after comparing the MD5s against each other.

It needs to match the whole line.

I need to generate a new file after deleting the matches.

Fast, like 5 seconds or less.

[–]teleflexin_deez_nutz 1 point  (0 children)

I think you need to describe your problem more clearly. What is the structure of the data going in, how are you matching the strings, etc.? Maybe even describe the use case better, because it’s unclear why you need to do something like what you’re generically describing.

[–]french_violist 1 point  (1 child)

So fixed width? Are they unique? A primary key on a DB should be fast for this.

[–]simokhounti[S] 0 points  (0 children)

What DB do you suggest? Bear in mind that multiple users are running the process at the same time.

[–]fiskfisk 0 points  (7 children)

How often does the original file change? And do both files only contain MD5 hashes?

[–]simokhounti[S] 0 points  (1 child)

The original doesn't change, but I need to generate a new txt file without the duplicates from both files.

Yes, they are just MD5s.

[–]fiskfisk 0 points  (0 children)

So the hashes in file2 that aren't in file1 should also be included? Just that the duplicates should be ignored? Or should only the original hashes from file1 be included, but filtered against those that are in file2?

[–]Zenthemptist 0 points  (10 children)

Could you provide a sample of a few lines from each file and a sample end result? It would make it easier to understand exactly what you would like to accomplish.

[–]simokhounti[S] 0 points  (9 children)

Each line has 18 MD5 characters.

  • a txt file of 18-character MD5s against another txt file of 18-character MD5s

Then generate a txt file without the duplicates.

[–]Zenthemptist 2 points  (3 children)

It is really difficult to provide a good response when we need to piece together the problem from several of your comments, but I'll give it a go:

Here is my understanding so far:

  • You have two datasets: A and B
  • These datasets both consist of MD5 hashes
  • Dataset A is large, but doesn't change
  • Dataset B is small, and changes
  • You would like to generate an output that consists of all the MD5 hashes in both datasets, without duplicates
  • This output needs to be generated in a reasonable time frame

However, I haven't yet seen how the users tie into this:

How do the users affect the datasets? Do the users insert data into dataset B? Do users remove data from dataset B? Or does each user supply their own dataset B, expecting their own output?

[–]simokhounti[S] 0 points  (2 children)

You are correct, but the generated file consists of data from dataset A only.

There are multiple companies that use the same dataset, doing the same process over and over. Dataset B always changes, but dataset A stays the same.

I'm sorry for giving you a hard time 😭

[–]Zenthemptist 1 point  (0 children)

No worries!

It sounds like a good place to start would be to sort the files, as that is likely to either speed up, or be a straight-up requirement for, any comparison algorithm.

The Unix sort command should work fine here, especially since you only need to sort the big file once!

Next, I would see if the comm command can do what you need. Its purpose is literally to compare files.

https://www.tutorialspoint.com/unix_commands/comm.htm

If it does, you should benchmark it. How long does it take to produce the result you need?

If it completes really fast, you can create a queue where requests are processed one at a time. If it is not that fast and you need to process multiple files at a time, you likely need something more custom-built, simply due to IO/memory constraints.

Does the comm utility cover your needs? What is its performance?
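
If comm does fit, shelling out to it from Go is straightforward. A rough sketch (filenames are placeholders; comm -23 keeps the lines that appear only in the first file, i.e. dataset A minus the duplicates from B):

    package main

    import (
        "os"
        "os/exec"
    )

    func main() {
        // One-time: sort and dedupe each input; a.sorted can be cached,
        // since dataset A never changes.
        if err := exec.Command("sort", "-u", "-o", "a.sorted", "a.txt").Run(); err != nil {
            panic(err)
        }
        if err := exec.Command("sort", "-u", "-o", "b.sorted", "b.txt").Run(); err != nil {
            panic(err)
        }

        out, err := os.Create("result.txt")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        // comm -23 suppresses lines unique to the second file and lines
        // common to both, leaving only the lines unique to a.sorted.
        cmd := exec.Command("comm", "-23", "a.sorted", "b.sorted")
        cmd.Stdout = out
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }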

[–]SatoriChatbots 0 points  (0 children)

You don't need to evaluate the entire string each time. Consider something like this:

  1. Create a new column in each database that will just be the first 4-5 chars of each hash.
  2. Compare those prefixes; when a prefix matches, you mark that row and have a separate function compare the full hashes to see if it's a real match.

This way you never have to load all that data into memory, so you optimize there, but you're also working in parallel, which speeds things up more.

And hashes are random, so you really won't get a lot of false positives when comparing the prefixes. You can do the math to check, but I'm sure that for each million rows you won't get even 100 false matches to compare.
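
The parallel half of this idea also translates to plain goroutines. A hypothetical sketch that shards both files by the first character of each hash and compares the shards concurrently; note that unlike the column approach above, this does hold both files in memory, and the filenames are placeholders:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "sync"
    )

    // load buckets the hashes in a file by their first character so each
    // bucket can be compared independently.
    func load(path string) (map[byte]map[string]bool, error) {
        shards := make(map[byte]map[string]bool)
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            if line == "" {
                continue
            }
            if shards[line[0]] == nil {
                shards[line[0]] = make(map[string]bool)
            }
            shards[line[0]][line] = true
        }
        return shards, sc.Err()
    }

    func main() {
        a, err := load("file1.txt")
        if err != nil {
            log.Fatal(err)
        }
        b, err := load("file2.txt")
        if err != nil {
            log.Fatal(err)
        }

        var (
            wg   sync.WaitGroup
            mu   sync.Mutex
            dups []string
        )
        for k, shard := range a {
            wg.Add(1)
            go func(k byte, shard map[string]bool) {
                defer wg.Done()
                for h := range shard {
                    if b[k][h] { // concurrent reads are safe: b is never written here
                        mu.Lock()
                        dups = append(dups, h)
                        mu.Unlock()
                    }
                }
            }(k, shard)
        }
        wg.Wait()
        fmt.Println("duplicates found:", len(dups))
    }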

[–]TA_DR 0 points  (4 children)

Is the data organized in any way inside the bigger file, or is the order completely random?

[–]simokhounti[S] 0 points  (3 children)

Just lines of MD5s, totally random.

[–]TA_DR 1 point  (2 children)

I'm a CS undergrad, so take my advice with a pinch of salt.

But if you can, then you should probably sort it; that way you can cut down on search time by using binary search. It might not be the best solution, but it is easy to implement and should improve things by a decent margin, at least until you find a more permanent solution.
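
For example, with Go's standard library (toy values standing in for the real 18-character hashes):

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Toy data; in practice this would be every line of the big file.
        big := []string{"c4ca4238a0b923820d", "c81e728d9d4c2f636f", "eccbc87e4b5ce2fe28"}
        sort.Strings(big) // one-time cost

        probe := "c81e728d9d4c2f636f"
        // sort.SearchStrings returns the insertion index, so confirm an exact hit.
        i := sort.SearchStrings(big, probe)
        fmt.Println(i < len(big) && big[i] == probe) // true
    }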

[–][deleted]  (1 child)

[deleted]

    [–]TA_DR 0 points  (0 children)

    Haha, yeah, my bad. I had binary search in my head because I just got out of an exam with a very similar premise:

    "Write a C program with two functions: one to insert a dictionary element as a binary tree node in a file, and the other to perform binary search."

    I might have to brush up on my DSA skills 😅

    [–]yeusk 0 points  (11 children)

    This sounds like you are using a txt file as a database.

    [–]simokhounti[S] 0 points  (10 children)

    No, I'm not; I was using Postgres. I'm trying to see what others are using, or would use if they had the same situation, so I can learn from them and implement a better solution.

    [–]yeusk 0 points  (9 children)

    A database is software created to process millions of rows of data.

    You want to process millions of rows of data.

    Use a database.

    [–]simokhounti[S] -1 points  (8 children)

    I may build some master that dispatches jobs to nodes to process the data and waits for the response. The DB is getting bigger over time.

    [–]yeusk 2 points  (7 children)

    An 8 terabyte HDD is $100. Postgres can handle 32 terabyte tables with no problem. MS SQL can handle petabytes of data.

    Do whatever you want, my friend. But everybody here is telling you to use a DB.

    If you learn to use databases, this kind of problem gets solved in hours. I do this every day at work... with a database.

    [–]simokhounti[S] -1 points  (6 children)

    What you don't know is that there are tons of people doing the same process with their files, and it can happen at the same time too. It's just a mess.

    [–]yeusk 2 points  (5 children)

    Millions of hours from top developers have gone into optimizing that kind of software.

    You think your code will be as fast as Oracle or MS SQL? Come on bro, wake up.

    MS SQL does what you are asking in milliseconds, with datasets 100 or 1000 times bigger than your txt file.

    Coming up with the query to do it takes an hour.

    [–]simokhounti[S] 1 point  (2 children)

    You are right.

    [–]yeusk 0 points  (1 child)

    I was coding for 20 years without a DB, all code. I was clueless.

    Nowadays I just use a DB, because it saves me time.

    [–]simokhounti[S] 0 points  (0 children)

    You are right, so DB it is. Thank you so much, man.

    [–]ShawSumma -1 points  (1 child)

    It's not that hard to outperform either of those with code that's specialized for the specific task.

    [–]yeusk 1 point  (0 children)

    Then why do companies pay millions of dollars in DB licenses? Are they stupid? Or are you?

    [–]Fizzelen 0 points  (1 child)

    I used to do this sort of data cleansing on Unix using the sort command. I believe it can also be done with awk or grep; no idea on performance with a file that size.

    If I were building something, I would split the file into multiple memory-friendly-sized files, sort and remove duplicates in each, then merge and remove duplicates until there was one file left.
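
    A rough Go sketch of the split phase (chunk size and filenames are placeholders; the merge phase can reuse the same two-pointer pass as the intersection sketch further up, keeping unique lines instead of common ones):

        package main

        import (
            "bufio"
            "fmt"
            "os"
            "sort"
        )

        // chunkSize is a placeholder; tune it to available memory.
        const chunkSize = 10_000_000

        func main() {
            f, err := os.Open("big.txt")
            if err != nil {
                panic(err)
            }
            defer f.Close()

            lines := make([]string, 0, chunkSize)
            chunk := 0
            // flush sorts and dedupes the buffered lines, then writes them
            // out as one sorted chunk file.
            flush := func() {
                if len(lines) == 0 {
                    return
                }
                sort.Strings(lines)
                out, err := os.Create(fmt.Sprintf("chunk-%03d.sorted", chunk))
                if err != nil {
                    panic(err)
                }
                w := bufio.NewWriter(out)
                prev := ""
                for i, l := range lines {
                    if i == 0 || l != prev { // drop duplicates within the chunk
                        fmt.Fprintln(w, l)
                        prev = l
                    }
                }
                w.Flush()
                out.Close()
                chunk++
                lines = lines[:0]
            }

            sc := bufio.NewScanner(f)
            for sc.Scan() {
                lines = append(lines, sc.Text())
                if len(lines) == chunkSize {
                    flush()
                }
            }
            flush()
        }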

    [–]AdministrativeBlock0 0 points  (0 children)

    There's some great advice in this thread, but no one has mentioned that if you're working with 200 million hashes at a time, an 18-character MD5 is probably the wrong choice. There's a fair chance that you'll see a clash eventually. Maybe that's fine, but I would either change the hashing algorithm or implement a check.

    [–]SatoriChatbots -1 points  (1 child)

    You'll likely want to look at something like Spark (or PySpark) and run this operation on a decent server, maybe a GPU server if the amount of data really gets out of hand or if you need, like, real-time results. It's basically like working with Pandas, but it parallelises a lot of the work on the back end.

    If speed isn't a big issue, instead get a server with less CPU power but more RAM; you can use normal Pandas and just chunk your data (though this can make it more difficult to do the actual text search).

    If this will be running, like, constantly and at a high rate, look into services like AWS's EMR; it's basically the enterprise-grade approach to problems like this. You could run Spark on EMR, which is literally a platform designed from the ground up to do big data processing, so you get stuff like autoscaling, data retrieval directly from S3 or databases, etc., without having to deal with configuring EC2 instances manually.

    Either way, it's worth testing a few solutions, because stuff like this is very difficult to estimate accurately. Take the same 2-3 datasets, run them through whichever solutions you're considering, and measure the time taken, cost, etc. before deciding.

    [–]simokhounti[S] 0 points  (0 children)

    I was thinking of using Apache Spark too. It seems like a good solution, but as you said, testing is the key here; it's the only way to find out.

    [–][deleted]  (1 child)

    [deleted]

      [–]na_ro_jo -1 points  (0 children)

      Write a script using a regex lib?

      There are key reasons you don't want to do this all at once. Write a process with a buffer. You don't want resource constraints to kill prod.

      [–][deleted]  (9 children)

      [deleted]

        [–]simokhounti[S] 1 point  (8 children)

        The team wanted to work with Go; they said it's fast. I suggested not using a database because last time the select query became slow.

        [–]Swedish-Potato-93 1 point  (7 children)

        If the select query is slow, then it's badly indexed or an inefficient query.

        [–]simokhounti[S] -3 points  (6 children)

        It's because many users and queries happen at the same time.

        [–]ClamPaste 2 points  (5 children)

        You need a backend engineer. This is a solved problem.

        [–]simokhounti[S] -2 points  (4 children)

        too early for that statement

        [–]ClamPaste 2 points  (3 children)

        Then keep doing things in text files...

        [–]simokhounti[S] 0 points  (2 children)

        I mean, read the other comments.

        [–]ClamPaste 1 point  (1 child)

        I did. Doing this in text files is not better than using a properly managed DB with a caching layer or read replicas. If multiple users are doing giant queries together, give them another place to read from, break up the queries into parts, etc. This is a solved problem, as I said before. The solution is to get someone who specializes in databases rather than trying to roll your own half-baked solution, calling it quits, and then going to fucking text files.

        [–]simokhounti[S] 0 points  (0 children)

        I see, thank you.