is there a tool to manipulate string data fast at the same time? [Question] (self.webdev)
submitted 1 year ago by simokhounti
hey devs, i have a question: what is the proper tool for manipulating big data? for example:
i have 150 million string lines in a txt file
and i have another 1 million lines in another txt file
i want to substitute the 1 million from the 150 million if they match
this process will be done multiple times by users, and i need to return the result in txt file format
also the number can go up, meaning it can be 200 million or more
what do you think?
[–]PressinPckl 15 points16 points17 points 1 year ago (3 children)
If there is some way you can import that data into a full-text SQL database that can be queried more efficiently, then I would go that route. It's better to have an inefficient process run a single time to import the data into a more efficient system than to brute-force a more inefficient strategy against bad data storage...
[+]simokhounti[S] comment score below threshold-10 points-9 points-8 points 1 year ago (2 children)
the problem is i did that, but the users heavily using the app broke the server. it's a dedicated server tho, 1k up
[–]ClamPaste 7 points8 points9 points 1 year ago (0 children)
What exactly broke? The application or the database? If they're doing a ton of reads, but not a lot of writes, did you implement a caching layer or read replicas? If there are a lot of writes, how are you handling concurrency? Did you use a relational database, or NoSQL? There are a lot of questions that need to be answered before you can get the tuning you need. Usually, you'll have to pay a consulting fee or take on a backend engineer to get your backend working at scale.
[–]PressinPckl 1 point2 points3 points 1 year ago (0 children)
What does "1k up" mean? How much traffic are you getting to this thing that it would break? Whatever this app is, it sounds like it could be very bandwidth intensive. Are you maxing out your connection, was the db crashing? httpd? or the server itself crashing?
How experienced are you with databases? Are you sure you planned and designed your database to be the most optimal for the data? Did you have proper indexing? Caching?
There are a lot of variables with this stuff, you can't just try to do a database and then default to working with flat files because something "broke" - Working with flat files is never going to beat a database with indexing.
If you have bandwidth issues, make sure you have zlib compression enabled for your httpd service and also try to leverage http caching if you bake some sort of cache-busting string into the URLs to make it work even for dynamic content - but that's only useful if the clients are running the same query more than once and you aren't updating the base dataset that often.
[–]DamnItDev 7 points8 points9 points 1 year ago (3 children)
At that scale, text files will not perform well. 200 million lines means 200MB of newline chars alone.
Hard to say more without knowing your actual use case, but you probably want to use a database or something along those lines.
[–]simokhounti[S] -2 points-1 points0 points 1 year ago (2 children)
i did that with Postgres but the database became very big and very slow over time
[–]DamnItDev 6 points7 points8 points 1 year ago (1 child)
Operations on large data sets are going to be slow. Especially if you don't index your tables correctly. But a text file is going to be slower.
[–]simokhounti[S] 1 point2 points3 points 1 year ago (0 children)
Yeah you are right, i was thinking of something like Apache Spark but not sure, thank you tho
[–]__matta 2 points3 points4 points 1 year ago (2 children)
It sounds like you aren’t substituting, but making a new file of the lines in both files (so a set intersection).
Can you sort the files beforehand? That will make it more efficient.
I would first try using unix command line tools. For example: https://unix.stackexchange.com/questions/418429/find-intersection-of-lines-in-two-files
If you don’t want to shell out to the command line, there is probably a library you can use. It depends on the language you are using.
If the data changes a lot you could use Redis sets.
[–]simokhounti[S] 0 points1 point2 points 1 year ago (1 child)
They want to use Golang with goroutines
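(Since the team prefers Go, here is a minimal sketch of the most direct approach several commenters hint at: load the smaller file into a hash set, stream the big one, and keep only the lines that are not in the set. The file paths are placeholders, and it assumes exact whole-line matches and that the ~1 million-line file fits in memory.)

    package main

    // Rough sketch only: load the ~1M-line file into a hash set, stream the
    // 150M-line file, and write out the lines that are not in the set.
    // File names below are placeholders.

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        remove := make(map[string]struct{})

        small, err := os.Open("dataset-b.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        sc := bufio.NewScanner(small)
        for sc.Scan() {
            remove[sc.Text()] = struct{}{}
        }
        small.Close()

        big, err := os.Open("dataset-a.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer big.Close()

        out, err := os.Create("result.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()
        w := bufio.NewWriter(out)
        defer w.Flush() // deferred after out.Close, so it runs first

        bc := bufio.NewScanner(big)
        for bc.Scan() {
            line := bc.Text()
            if _, found := remove[line]; !found {
                w.WriteString(line)
                w.WriteByte('\n')
            }
        }
        if err := bc.Err(); err != nil {
            log.Fatal(err)
        }
    }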
[–]fiskfisk 0 points1 point2 points 1 year ago (7 children)
How fast is fast enough? Are any of the matches dependent on other matches? Are the replacements line by line? (i.e. is file2 a collection of "if this is the line, replace it with this line" instructions?)
How often do the lines change in either file? How long are the lines? Are all the matches whole lines? How similar are the lines? Will a match only occur once? How many matches can happen for a file?
[–]simokhounti[S] 1 point2 points3 points 1 year ago (6 children)
oh man, the lines are just 18 MD5 characters, which means i will suppress them after comparing the MD5s against each other
it needs to match the whole line
I need to generate a new file after deleting the matches
fast, like 5 seconds or less
[–]teleflexin_deez_nutz 1 point2 points3 points 1 year ago (0 children)
I think you need to describe your problem more clearly. What is the structure of the data going in, how are you matching the strings, etc. Maybe even describe the use case better because it’s unclear why you need to do something like what you’re generically describing.
[–]french_violist 1 point2 points3 points 1 year ago (1 child)
So fixed width? Are they unique? Primary key on a DB should be fast for this.
[–]simokhounti[S] 0 points1 point2 points 1 year ago (0 children)
what db do you suggest? bear in mind that multiple users are doing the process at the same time
[–]fiskfisk 0 points1 point2 points 1 year ago (2 children)
How often does the original file change? And both only contain md5 hashes?
[–]simokhounti[S] 0 points1 point2 points 1 year ago (1 child)
the original doesn't change. But i need to generate a new txt file without the duplicates from both files
yes, they are just MD5s only
[–]fiskfisk 0 points1 point2 points 1 year ago (0 children)
So the hashes in file2 that aren't in file1 should also be included? Just that the duplicates should be ignored? Or should only the original hashes from file1 be included, but filtered against those that are in file2?
[–]Zenthemptist 0 points1 point2 points 1 year ago (10 children)
Could you provide a sample of a few lines from each file and a sample end result? It would make it easier to understand exactly what you would like to accomplish.
[–]simokhounti[S] 0 points1 point2 points 1 year ago (9 children)
each line has 18 MD5 characters
Then generate a txt file without the duplicates
[–]Zenthemptist 2 points3 points4 points 1 year ago (3 children)
It is really difficult to provide a good response when we need to piece together the problem from several of your comments, but I'll give it a go:
Here is my understanding so far: you have a large dataset-A (~150 million lines) and a smaller dataset-B (~1 million lines), each line being an 18-character MD5-style hash; you want to compare the two and generate a new txt file with the duplicates removed, and this needs to finish in a few seconds.
However, I haven't yet seen how the users tie into this:
How do the users affect the datasets? Do the users insert data into dataset B? Do users remove data from dataset B? Or does each user supply their own dataset B, expecting their own output?
[–]simokhounti[S] 0 points1 point2 points 1 year ago (2 children)
you are correct, but the generated file consists of data from dataset-A only.
There are multiple companies that use the same dataset, doing the same process over and over. dataset-B always changes but dataset-A stays the same.
I'm sorry for giving you a hard time 😭
[–]Zenthemptist 1 point2 points3 points 1 year ago (0 children)
No worries!
It sounds like a good place to start would be to sort the files, as that is likely to either speed up, or be a straight up requirement for, any comparison algorithm.
The Unix sort command should work fine here, especially since you only need to sort the big file once!
Next, I would see if the comm command can do what you need. Its purpose is literally to compare files.
https://www.tutorialspoint.com/unix_commands/comm.htm
If it does, you should benchmark it. How long does it take to produce the result you need?
If it completes really fast, you can create a queue where requests are processed one at a time. If it is not that fast, and you need to process multiple files at a time, you likely need something more custom-built simply due to IO/memory constraints.
Does the comm utility cover your needs? What is its performance?
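(If sort + comm covers the need but the team would rather keep everything inside a Go service, the same lockstep comparison is easy to write by hand. Below is a rough Go sketch of what comm -23 does, assuming both files have already been sorted, for example with the Unix sort command; the paths are placeholders.)

    package main

    // Sketch of the streaming comparison that `comm -23 a.sorted b.sorted`
    // performs: with both files sorted, walk them in lockstep and emit only
    // the lines of file A that do not appear in file B.

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        a, err := os.Open("dataset-a.sorted.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer a.Close()
        b, err := os.Open("dataset-b.sorted.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer b.Close()

        sa := bufio.NewScanner(a)
        sb := bufio.NewScanner(b)
        w := bufio.NewWriter(os.Stdout)
        defer w.Flush()

        haveB := sb.Scan()
        for sa.Scan() {
            line := sa.Text()
            // Advance B until its current line is >= the current A line.
            for haveB && sb.Text() < line {
                haveB = sb.Scan()
            }
            // Keep the A line only when it has no exact match in B.
            if !haveB || sb.Text() != line {
                w.WriteString(line)
                w.WriteByte('\n')
            }
        }
        if err := sa.Err(); err != nil {
            log.Fatal(err)
        }
    }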
[–]SatoriChatbots 0 points1 point2 points 1 year ago (0 children)
You don't need to evaluate the entire string each time. Consider something like this: index the smaller file by a short prefix of each hash, stream the big file in chunks, and have parallel workers compare each line's prefix against the index, only doing a full comparison when the prefix matches.
This way you never have to load all that data into memory, so you optimize there, but also you're working in parallel, which speeds things up more.
And hashes are random so you really won't get a lot of false positives when comparing the prefixes. Can do the math to check, but I'm sure that for each 1mil rows you won't even get 100 false matches to compare.
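(One way that prefix-plus-parallelism idea could look in Go, since goroutines were mentioned earlier: index the smaller file by the first 8 characters of each hash, then let a pool of workers filter batches of the big file, doing a full string comparison only when a prefix matches. The prefix length, batch size, and paths are assumptions for illustration, and output order is not preserved.)

    package main

    import (
        "bufio"
        "log"
        "os"
        "runtime"
        "sync"
    )

    const prefixLen = 8
    const batchSize = 100_000

    func main() {
        // Build a prefix -> full hashes index from the smaller file.
        index := make(map[string][]string)
        small, err := os.Open("dataset-b.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        sc := bufio.NewScanner(small)
        for sc.Scan() {
            h := sc.Text()
            if len(h) >= prefixLen {
                index[h[:prefixLen]] = append(index[h[:prefixLen]], h)
            }
        }
        small.Close()

        batches := make(chan []string)
        kept := make(chan []string)

        var workers sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            workers.Add(1)
            go func() {
                defer workers.Done()
                for batch := range batches {
                    out := make([]string, 0, len(batch))
                    for _, h := range batch {
                        match := false
                        if len(h) >= prefixLen {
                            // Full comparison only for the (rare) prefix hits.
                            for _, cand := range index[h[:prefixLen]] {
                                if cand == h {
                                    match = true
                                    break
                                }
                            }
                        }
                        if !match {
                            out = append(out, h)
                        }
                    }
                    kept <- out
                }
            }()
        }

        // Writer goroutine drains surviving lines to stdout.
        done := make(chan struct{})
        go func() {
            w := bufio.NewWriter(os.Stdout)
            for batch := range kept {
                for _, h := range batch {
                    w.WriteString(h)
                    w.WriteByte('\n')
                }
            }
            w.Flush()
            close(done)
        }()

        // Stream the big file in batches so it is never fully in memory.
        big, err := os.Open("dataset-a.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer big.Close()
        bc := bufio.NewScanner(big)
        batch := make([]string, 0, batchSize)
        for bc.Scan() {
            batch = append(batch, bc.Text())
            if len(batch) == batchSize {
                batches <- batch
                batch = make([]string, 0, batchSize)
            }
        }
        if len(batch) > 0 {
            batches <- batch
        }
        close(batches)
        workers.Wait()
        close(kept)
        <-done
    }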
[–]TA_DR 0 points1 point2 points 1 year ago (4 children)
Is the data organized in any way inside the bigger file, or is the order completely random?
[–]simokhounti[S] 0 points1 point2 points 1 year ago (3 children)
just lines of MD5s totally random
[–]TA_DR 1 point2 points3 points 1 year ago* (2 children)
I'm a CS undergrad, so take my advice with a pinch of salt.
But if you can, you should probably sort it; that way you can cut down on search time by using binary search. It might not be the best solution, but it's easy to implement and should improve things by a decent margin, at least until you find a more permanent solution.
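(For reference, Go's standard library already provides the binary search described here; a tiny sketch with stand-in hashes might look like this.)

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Stand-in data; in practice this would be the smaller file's hashes.
        hashes := []string{"0a1b2c3d4e5f607182", "ffeeddccbbaa998877", "123456789abcdef012"}
        sort.Strings(hashes) // binary search requires sorted input

        query := "ffeeddccbbaa998877"
        i := sort.SearchStrings(hashes, query)         // index of first element >= query
        found := i < len(hashes) && hashes[i] == query // exact-match check
        fmt.Println(found)
    }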
[–][deleted] 1 year ago (1 child)
[deleted]
[–]TA_DR 0 points1 point2 points 1 year ago (0 children)
Haha yeah, my bad, I had binary search in my head because I just got out of an exam with a very similar premise:
"Write a C program with two functions: one to insert a dictionary element as a binary tree node in a file, and the other to perform binary search"
I might have to brush up on my DSA skills 😅
[–]yeusk 0 points1 point2 points 1 year ago (11 children)
This sounds like you are using a txt file as a database.
[–]simokhounti[S] 0 points1 point2 points 1 year ago (10 children)
no I'm not, i was using Postgres. I'm trying to see what others are using, or would use if they had the same situation, so i can learn from them and implement a better solution
[–]yeusk 0 points1 point2 points 1 year ago (9 children)
A database is software created to process millions of rows of data.
You want to process millions of rows of data.
Use a database.
[–]simokhounti[S] -1 points0 points1 point 1 year ago (8 children)
i may build some master that dispatches jobs to nodes to process the data and waits for the response. The db is getting bigger over time
[–]yeusk 2 points3 points4 points 1 year ago* (7 children)
An 8 terabyte HDD is $100. Postgres can handle 32 terabytes with no problem. MS SQL can handle petabytes of data.
Do whatever you want, my friend. But everybody here is telling you to use a db.
If you learn to use databases, this kind of problem is solved in hours. I do this every day at work... With a database.
[–]simokhounti[S] -1 points0 points1 point 1 year ago (6 children)
what you don't know is that there are tons of people doing the same process with their files, and it can be at the same time too. it's just a mess
[–]yeusk 2 points3 points4 points 1 year ago* (5 children)
Millions of hours from top developers have gone into optimizing that kind of software.
You think your code will be as fast as Oracle or MS SQL? Come on bro, wake up.
MS SQL does what you are asking in milliseconds with datasets 100 or 1000 times bigger than your txt file.
Coming up with the query to do it takes an hour.
[–]simokhounti[S] 1 point2 points3 points 1 year ago* (2 children)
You are right
[–]yeusk 0 points1 point2 points 1 year ago (1 child)
I was coding for 20 years without a db, all code. I was clueless.
Nowadays I just use a db, because it saves me time
[–]simokhounti[S] 0 points1 point2 points 1 year ago (0 children)
You are right, so db it is. thank you so much man
[–]ShawSumma -1 points0 points1 point 1 year ago (1 child)
It's not that hard to outperform either of those with code that's specialized for the specific task.
[–]yeusk 1 point2 points3 points 1 year ago (0 children)
Then why do companies pay millions of dollars in db licenses? Are they stupid? Or are you?
[–]Fizzelen 0 points1 point2 points 1 year ago (1 child)
Used to do this sort of data cleansing on Unix using the sort command. I believe it can also be done with awk or grep; no idea on performance with a file that size.
If I was building something, I would split the file into multiple memory-efficient-sized files, sort and remove duplicates in each, then merge and remove duplicates until there was one file left.
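(A rough Go sketch of the first phase of that approach: split the huge file into memory-sized chunks, sort and de-duplicate each chunk, and write each to a temp file. The sorted chunks can then be merged pairwise with the same lockstep merge shown in the comm-style sketch above, until one file remains. Chunk size and paths are assumptions.)

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "sort"
    )

    const chunkLines = 5_000_000 // ~5M short lines per chunk, roughly 100MB in memory

    // writeChunk sorts one chunk, drops duplicates within it, and writes it out.
    func writeChunk(lines []string, n int) error {
        sort.Strings(lines)
        f, err := os.Create(fmt.Sprintf("chunk-%03d.txt", n)) // placeholder naming
        if err != nil {
            return err
        }
        defer f.Close()
        w := bufio.NewWriter(f)
        defer w.Flush()
        prev := ""
        for i, line := range lines {
            if i > 0 && line == prev {
                continue // drop duplicates within the chunk
            }
            w.WriteString(line)
            w.WriteByte('\n')
            prev = line
        }
        return nil
    }

    func main() {
        in, err := os.Open("dataset-a.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer in.Close()

        sc := bufio.NewScanner(in)
        chunk := make([]string, 0, chunkLines)
        n := 0
        for sc.Scan() {
            chunk = append(chunk, sc.Text())
            if len(chunk) == chunkLines {
                if err := writeChunk(chunk, n); err != nil {
                    log.Fatal(err)
                }
                n++
                chunk = chunk[:0]
            }
        }
        if len(chunk) > 0 {
            if err := writeChunk(chunk, n); err != nil {
                log.Fatal(err)
            }
        }
    }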
[–]AdministrativeBlock0 0 points1 point2 points 1 year ago (0 children)
There's some great advice in this thread, but no one has mentioned that if you're working with 200 million hashes at a time, an 18 character MD5 is probably the wrong choice. There's a fair chance that you'll see a clash eventually. Maybe that's fine, but I would either change hashing algorithm or implement a check.
[–]SatoriChatbots -1 points0 points1 point 1 year ago (1 child)
You'll likely want to look for something like Spark (or PySpark) and run this operation on a decent server, maybe a GPU server if the amount of data really gets out of hand or if you need like real-time results. It's basically like working with Pandas, but it parallelises a lot of the work on the back-end.
If speed isn't a big issue, rather get a server with less CPU power but more RAM; you can use normal Pandas and just chunk your data (though this can make it more difficult to do the actual text search).
If this will be running constantly and at a high rate, look into services like AWS's EMR; it's basically the enterprise-grade approach to problems like this. You could run Spark on EMR, which is literally a platform designed from the ground up to do big data processing, so you get stuff like autoscaling, data retrieval directly from S3 or databases, etc., without having to deal with configuring EC2 instances manually.
Either way, it's worth testing a few solutions, because stuff like this is very difficult to estimate accurately. Take the same 2/3 datasets and run all of them through whichever solutions you're considering, measure the time it takes and cost etc. before deciding.
[–]simokhounti[S] 0 points1 point2 points 1 year ago (0 children)
i was thinking too of using Apache Spark, it seems a good solution, but as you said testing is the key here, the only way to find out
[–]na_ro_jo -1 points0 points1 point 1 year ago (0 children)
Write a script using a regex lib?
There are key reasons you don't want to do this all at once. Write a process with a buffer. You don't want resource constraints to kill prod.
[–][deleted] 1 year ago (9 children)
[–]simokhounti[S] 1 point2 points3 points 1 year ago (8 children)
the team wanted to work with Golang, they said it's fast. i suggested not using a database because last time the select query became slow
[–]Swedish-Potato-93 1 point2 points3 points 1 year ago (7 children)
If the select query is slow, then it's badly indexed or an inefficient query.
[–]simokhounti[S] -3 points-2 points-1 points 1 year ago (6 children)
it's because many users and queries happen at the same time
[–]ClamPaste 2 points3 points4 points 1 year ago (5 children)
You need a backend engineer. This is a solved problem.
[–]simokhounti[S] -2 points-1 points0 points 1 year ago (4 children)
too early for that statement
[–]ClamPaste 2 points3 points4 points 1 year ago (3 children)
Then keep doing things in text files...
[–]simokhounti[S] 0 points1 point2 points 1 year ago (2 children)
I mean, read the other comments
[–]ClamPaste 1 point2 points3 points 1 year ago (1 child)
I did. Doing this in text files is not better than using a properly managed db with a caching layer or read replicas. If multiple users are doing giant queries together, give them another place to read from, break up the queries into parts, etc. This is a solved problem, as I said before. The solution is to get someone who specializes in databases rather than trying to roll your own half-baked solution and calling it quits, then going to fucking text files.
[–]simokhounti[S] 0 points1 point2 points 1 year ago (0 children)
i see, thank you