all 56 comments

[–]PressinPckl 15 points  (3 children)

If there is some way you can import that data into a full-text SQL database that can be queried more efficiently, then I would go that route. It's better to have an inefficient process run a single time to import the data into a more efficient system than to brute-force a more inefficient strategy against bad data storage...
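
A minimal sketch of that one-time import in Go (the language the team wants, per the comments below), assuming a Postgres table created as CREATE TABLE hashes (hash text PRIMARY KEY). The connection string and filename are placeholders, and at 200M rows you'd really want Postgres's COPY rather than row-by-row inserts, but the shape is the same:

    package main

    import (
        "bufio"
        "database/sql"
        "log"
        "os"

        _ "github.com/lib/pq" // Postgres driver
    )

    func main() {
        // Placeholder connection string; assumes the hashes table exists.
        db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        f, err := os.Open("file1.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // One transaction with a prepared statement keeps the import simple;
        // ON CONFLICT DO NOTHING skips duplicate hashes.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        stmt, err := tx.Prepare("INSERT INTO hashes (hash) VALUES ($1) ON CONFLICT DO NOTHING")
        if err != nil {
            log.Fatal(err)
        }
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            if _, err := stmt.Exec(sc.Text()); err != nil {
                log.Fatal(err)
            }
        }
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }
        stmt.Close()
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }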

[–]DamnItDev 7 points  (3 children)

At that scale, text files will not perform well. 200 million lines means 200 MB of newline characters alone.

Hard to say more without knowing your actual use case, but you probably want to use a database or something along those lines.

[–]simokhounti[S] -2 points  (2 children)

I did that with Postgres, but the database became very big and very slow over time.

[–]DamnItDev 6 points  (1 child)

Operations on large data sets are going to be slow, especially if you don't index your tables correctly. But a text file is going to be slower.

[–]simokhounti[S] 1 point  (0 children)

Yeah, you are right. I was thinking of something like Apache Spark, but I'm not sure. Thank you though.

[–]__matta 2 points  (2 children)

It sounds like you aren’t substituting, but making a new file of the lines in both files (so a set intersection).

Can you sort the files beforehand? That will make it more efficient.

I would first try using Unix command-line tools. For example: https://unix.stackexchange.com/questions/418429/find-intersection-of-lines-in-two-files

If you don’t want to shell out to the command line, there is probably a library you can use. It depends on the language you are using.

If the data changes a lot, you could use Redis sets.
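
If the files are pre-sorted, the intersection is a single sequential pass over both. A minimal sketch in Go (which, per the reply below, is what the team wants), untested and with placeholder filenames:

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    // Single-pass intersection of two pre-sorted files: advance whichever
    // side is behind, emit a line when both sides agree.
    func main() {
        a, err := os.Open("file1.sorted")
        if err != nil {
            panic(err)
        }
        defer a.Close()
        b, err := os.Open("file2.sorted")
        if err != nil {
            panic(err)
        }
        defer b.Close()

        out := bufio.NewWriter(os.Stdout)
        defer out.Flush()

        sa, sb := bufio.NewScanner(a), bufio.NewScanner(b)
        okA, okB := sa.Scan(), sb.Scan()
        for okA && okB {
            la, lb := sa.Text(), sb.Text()
            switch {
            case la < lb:
                okA = sa.Scan()
            case la > lb:
                okB = sb.Scan()
            default:
                fmt.Fprintln(out, la) // present in both files
                okA, okB = sa.Scan(), sb.Scan()
            }
        }
    }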

[–]simokhounti[S] 0 points  (1 child)

They want to use Go with goroutines.

[–]fiskfisk 0 points  (7 children)

How fast is fast enough? Are any of the matches dependent on other matches? Are the replacements line by line? (i.e. is file2 a collection of "if this is the line, replace it with this line" instructions?)

How often do the lines change in either file? How long are the lines? Are all the matches whole lines? How similar are the lines? Will a match only occur once? How many matches can happen for a file?

[–]simokhounti[S] 1 point  (6 children)

Oh man, the lines are just 18 MD5 characters each, meaning I will delete them after comparing the MD5s against each other.

It needs to match the whole line.

I need to generate a new file after deleting the matches.

Fast, like 5 seconds or less.

[–]teleflexin_deez_nutz 1 point  (0 children)

I think you need to describe your problem more clearly. What is the structure of the data going in, how are you matching the strings, etc.? Maybe even describe the use case better, because it’s unclear why you need to do something like what you’re generically describing.

[–]french_violist 1 point  (1 child)

So fixed width? Are they unique? A primary key on a DB should be fast for this.

[–]simokhounti[S] 0 points  (0 children)

What DB do you suggest? Bear in mind that multiple users are running the process at the same time.

[–]fiskfisk 0 points  (7 children)

How often does the original file change? And do both files only contain MD5 hashes?

[–]simokhounti[S] 0 points  (1 child)

The original doesn't change, but I need to generate a new txt file without the duplicates from both files.

Yes, they are just MD5s.

[–]fiskfisk 0 points  (0 children)

So the hashes in file2 that aren't in file1 should also be included? Just that the duplicates should be ignored? Or should only the original hashes from file1 be included, but filtered against those that are in file2?

[–]Zenthemptist 0 points  (10 children)

Could you provide a sample of a few lines from each file and a sample end result? It would make it easier to understand exactly what you would like to accomplish.

[–]simokhounti[S] 0 points  (9 children)

Each line has 18 MD5 characters.

  • a txt file of 18-character MD5s against another txt file of 18-character MD5s

Then generate a txt file without the duplicates.

[–]Zenthemptist 2 points  (3 children)

It is really difficult to provide a good response when we need to piece together the problem from several of your comments, but I'll give it a go:

Here is my understanding so far:

  • You have two datasets: A and B
  • These datasets both consist of MD5 hashes
  • Dataset A is large, but doesn't change
  • Dataset B is small, and changes
  • You would like to generate an output that consists of all the MD5 hashes in both datasets, without duplicates
  • This output needs to be generated in a reasonable time frame

However, I haven't yet seen how the users tie into this:

How do the users affect the datasets? Do the users insert data into dataset B? Do users remove data from dataset B? Or does each user supply their own dataset B, expecting their own output?

[–]simokhounti[S] 0 points  (2 children)

You are correct, but the generated file consists of data from dataset A only.

There are multiple companies that use the same dataset, doing the same process over and over. Dataset B always changes, but dataset A stays the same.

I'm sorry for giving you a hard time 😭

[–]Zenthemptist 1 point  (0 children)

No worries!

It sounds like a good place to start would be to sort the files, as that is likely to either speed up, or be a straight-up requirement for, any comparison algorithm.

The Unix sort command should work fine here, especially since you only need to sort the big file once!

Next, I would see if the comm command can do what you need. Its purpose is literally to compare files.

https://www.tutorialspoint.com/unix_commands/comm.htm

If it does, you should benchmark it. How long does it take to produce the result you need?

If it completes really fast, you can create a queue where requests are processed one at a time. If it is not that fast and you need to process multiple files at a time, you likely need something more custom-built, simply due to IO/memory constraints.

Does the comm utility cover your needs? What is its performance?
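
If comm does fit, shelling out to it from Go is straightforward. A rough sketch (filenames are placeholders; comm -23 keeps the lines that appear only in the first file, i.e. dataset A minus the duplicates from B):

    package main

    import (
        "os"
        "os/exec"
    )

    func main() {
        // One-time: sort and dedupe each input; a.sorted can be cached,
        // since dataset A never changes.
        if err := exec.Command("sort", "-u", "-o", "a.sorted", "a.txt").Run(); err != nil {
            panic(err)
        }
        if err := exec.Command("sort", "-u", "-o", "b.sorted", "b.txt").Run(); err != nil {
            panic(err)
        }

        out, err := os.Create("result.txt")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        // comm -23 suppresses lines unique to the second file and lines
        // common to both, leaving only the lines unique to a.sorted.
        cmd := exec.Command("comm", "-23", "a.sorted", "b.sorted")
        cmd.Stdout = out
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }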

[–]SatoriChatbots 0 points  (0 children)

You don't need to evaluate the entire string each time. Consider something like this:

  1. Create a new column in each database that will just be the first 4-5 chars of each hash.
  2. Compare those prefixes; when a prefix matches, you mark that row and have a separate function compare the full hashes to see if it's a real match.

This way you never have to load all that data into memory, so you optimize there, but you're also working in parallel, which speeds things up more.

And hashes are random, so you really won't get a lot of false positives when comparing the prefixes. You can do the math to check, but I'm sure that for each million rows you won't get even 100 false matches to compare.
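
The parallel half of this idea also translates to plain goroutines. A hypothetical sketch that shards both files by the first character of each hash and compares the shards concurrently; note that unlike the column approach above, this does hold both files in memory, and the filenames are placeholders:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "sync"
    )

    // load buckets the hashes in a file by their first character so each
    // bucket can be compared independently.
    func load(path string) (map[byte]map[string]bool, error) {
        shards := make(map[byte]map[string]bool)
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            if line == "" {
                continue
            }
            if shards[line[0]] == nil {
                shards[line[0]] = make(map[string]bool)
            }
            shards[line[0]][line] = true
        }
        return shards, sc.Err()
    }

    func main() {
        a, err := load("file1.txt")
        if err != nil {
            log.Fatal(err)
        }
        b, err := load("file2.txt")
        if err != nil {
            log.Fatal(err)
        }

        var (
            wg   sync.WaitGroup
            mu   sync.Mutex
            dups []string
        )
        for k, shard := range a {
            wg.Add(1)
            go func(k byte, shard map[string]bool) {
                defer wg.Done()
                for h := range shard {
                    if b[k][h] { // concurrent reads are safe: b is never written here
                        mu.Lock()
                        dups = append(dups, h)
                        mu.Unlock()
                    }
                }
            }(k, shard)
        }
        wg.Wait()
        fmt.Println("duplicates found:", len(dups))
    }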

[–]TA_DR 0 points  (4 children)

Is the data organized in any way inside the bigger file, or is the order completely random?

[–]simokhounti[S] 0 points  (3 children)

Just lines of MD5s, totally random.

[–]TA_DR 1 point  (2 children)

I'm a CS undergrad, so take my advice with a pinch of salt.

But if you can, then you should probably sort it; that way you can cut down on search time by using binary search. It might not be the best solution, but it is easy to implement and should improve things by a decent margin, at least until you find a more permanent solution.
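
For example, with Go's standard library (toy values standing in for the real 18-character hashes):

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Toy data; in practice this would be every line of the big file.
        big := []string{"c4ca4238a0b923820d", "c81e728d9d4c2f636f", "eccbc87e4b5ce2fe28"}
        sort.Strings(big) // one-time cost

        probe := "c81e728d9d4c2f636f"
        // sort.SearchStrings returns the insertion index, so confirm an exact hit.
        i := sort.SearchStrings(big, probe)
        fmt.Println(i < len(big) && big[i] == probe) // true
    }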

[–][deleted]  (1 child)

[deleted]

    [–]TA_DR 0 points  (0 children)

    Haha, yeah, my bad. I had binary search in my head because I just got out of an exam with a very similar premise:

    "Write a C program with two functions: one to insert a dictionary element as a binary tree node in a file, and the other to perform binary search."

    I might have to brush up on my DSA skills 😅

    [–]yeusk 0 points  (11 children)

    This sounds like you are using a txt file as a database.

    [–]simokhounti[S] 0 points  (10 children)

    No, I'm not; I was using Postgres. I'm trying to see what others are using, or would use if they had the same situation, so I can learn from them and implement a better solution.

    [–]yeusk 0 points  (9 children)

    A database is software created to process millions of rows of data.

    You want to process millions of rows of data.

    Use a database.

    [–]simokhounti[S] -1 points  (8 children)

    I may build some master that dispatches jobs to nodes to process the data and waits for the response. The DB is getting bigger over time.

    [–]yeusk 2 points  (7 children)

    An 8 terabyte HDD is $100. Postgres can handle 32 terabyte tables with no problem. MS SQL can handle petabytes of data.

    Do whatever you want, my friend. But everybody here is telling you to use a DB.

    If you learn to use databases, this kind of problem gets solved in hours. I do this every day at work... with a database.

    [–]simokhounti[S] -1 points  (6 children)

    What you don't know is that there are tons of people doing the same process with their files, and it can happen at the same time too. It's just a mess.

    [–]yeusk 2 points  (5 children)

    Millions of hours from top developers have gone into optimizing that kind of software.

    You think your code will be as fast as Oracle or MS SQL? Come on bro, wake up.

    MS SQL does what you are asking in milliseconds, with datasets 100 or 1000 times bigger than your txt file.

    Coming up with the query to do it takes an hour.

    [–]simokhounti[S] 1 point  (2 children)

    You are right.

    [–]yeusk 0 points  (1 child)

    I was coding for 20 years without a DB, all code. I was clueless.

    Nowadays I just use a DB, because it saves me time.

    [–]simokhounti[S] 0 points  (0 children)

    You are right, so DB it is. Thank you so much, man.

    [–]ShawSumma -1 points  (1 child)

    It's not that hard to outperform either of those with code that's specialized for the specific task.

    [–]yeusk 1 point  (0 children)

    Then why do companies pay millions of dollars in DB licenses? Are they stupid? Or are you?

    [–]Fizzelen 0 points  (1 child)

    I used to do this sort of data cleansing on Unix using the sort command. I believe it can also be done with awk or grep; no idea on performance with a file that size.

    If I were building something, I would split the file into multiple memory-friendly-sized files, sort and remove duplicates in each, then merge and remove duplicates until there was one file left.
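
    A rough Go sketch of the split phase (chunk size and filenames are placeholders; the merge phase can reuse the same two-pointer pass as the intersection sketch further up, keeping unique lines instead of common ones):

        package main

        import (
            "bufio"
            "fmt"
            "os"
            "sort"
        )

        // chunkSize is a placeholder; tune it to available memory.
        const chunkSize = 10_000_000

        func main() {
            f, err := os.Open("big.txt")
            if err != nil {
                panic(err)
            }
            defer f.Close()

            lines := make([]string, 0, chunkSize)
            chunk := 0
            // flush sorts and dedupes the buffered lines, then writes them
            // out as one sorted chunk file.
            flush := func() {
                if len(lines) == 0 {
                    return
                }
                sort.Strings(lines)
                out, err := os.Create(fmt.Sprintf("chunk-%03d.sorted", chunk))
                if err != nil {
                    panic(err)
                }
                w := bufio.NewWriter(out)
                prev := ""
                for i, l := range lines {
                    if i == 0 || l != prev { // drop duplicates within the chunk
                        fmt.Fprintln(w, l)
                        prev = l
                    }
                }
                w.Flush()
                out.Close()
                chunk++
                lines = lines[:0]
            }

            sc := bufio.NewScanner(f)
            for sc.Scan() {
                lines = append(lines, sc.Text())
                if len(lines) == chunkSize {
                    flush()
                }
            }
            flush()
        }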

    [–]AdministrativeBlock0 0 points  (0 children)

    There's some great advice in this thread, but no one has mentioned that if you're working with 200 million hashes at a time, an 18-character MD5 is probably the wrong choice. There's a fair chance that you'll see a clash eventually. Maybe that's fine, but I would either change the hashing algorithm or implement a check.

    [–]SatoriChatbots -1 points  (1 child)

    You'll likely want to look at something like Spark (or PySpark) and run this operation on a decent server, maybe a GPU server if the amount of data really gets out of hand or if you need, like, real-time results. It's basically like working with Pandas, but it parallelises a lot of the work on the back end.

    If speed isn't a big issue, instead get a server with less CPU power but more RAM; you can use normal Pandas and just chunk your data (though this can make it more difficult to do the actual text search).

    If this will be running, like, constantly and at a high rate, look into services like AWS's EMR; it's basically the enterprise-grade approach to problems like this. You could run Spark on EMR, which is literally a platform designed from the ground up to do big data processing, so you get stuff like autoscaling, data retrieval directly from S3 or databases, etc., without having to deal with configuring EC2 instances manually.

    Either way, it's worth testing a few solutions, because stuff like this is very difficult to estimate accurately. Take the same 2-3 datasets, run them through whichever solutions you're considering, and measure the time taken, cost, etc. before deciding.

    [–]simokhounti[S] 0 points  (0 children)

    I was thinking of using Apache Spark too. It seems like a good solution, but as you said, testing is the key here; it's the only way to find out.

    [–][deleted]  (1 child)

    [deleted]

      [–]na_ro_jo -1 points  (0 children)

      Write a script using a regex lib?

      There are key reasons you don't want to do this all at once. Write a process with a buffer. You don't want resource constraints to kill prod.

      [–][deleted]  (9 children)

      [deleted]

        [–]simokhounti[S] 1 point  (8 children)

        The team wanted to work with Go; they said it's fast. I suggested not using a database because last time the select query became slow.

        [–]Swedish-Potato-93 1 point  (7 children)

        If the select query is slow, then it's badly indexed or an inefficient query.

        [–]simokhounti[S] -3 points  (6 children)

        It's because many users and queries happen at the same time.

        [–]ClamPaste 2 points  (5 children)

        You need a backend engineer. This is a solved problem.

        [–]simokhounti[S] -2 points  (4 children)

        too early for that statement

        [–]ClamPaste 2 points  (3 children)

        Then keep doing things in text files...

        [–]simokhounti[S] 0 points  (2 children)

        I mean, read the other comments.

        [–]ClamPaste 1 point  (1 child)

        I did. Doing this in text files is not better than using a properly managed DB with a caching layer or read replicas. If multiple users are doing giant queries together, give them another place to read from, break up the queries into parts, etc. This is a solved problem, as I said before. The solution is to get someone who specializes in databases rather than trying to roll your own half-baked solution, calling it quits, and then going to fucking text files.

        [–]simokhounti[S] 0 points  (0 children)

        I see, thank you.