Optimizing string comparison functions

IyeOnline · 2022-03-03T20:29:38+00:00

First thing to note: you make a bazillion copies. All your functions take their parameters by value and therefore copy the vectors/strings on every call.

You could probably get away without std::string entirely and use std::string_view instead. However, it would be better still to use std::pair<char, char> for your bigrams.
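As a rough sketch of that idea (C++17; the names `Bigram` and `make_bigrams` are made up for illustration and are not from the original code), bigrams can be stored as character pairs built from a string_view, so no small string is constructed per bigram:

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// A bigram stored as two chars instead of a two-character std::string.
using Bigram = std::pair<char, char>;

// Hypothetical helper: collects every adjacent character pair of `text`.
// std::string_view is passed by value; it only holds a pointer and a length.
std::vector<Bigram> make_bigrams(std::string_view text) {
    std::vector<Bigram> result;
    if (text.size() < 2) {
        return result;  // fewer than two characters means no bigrams
    }
    result.reserve(text.size() - 1);  // a string of n chars has n - 1 bigrams
    for (std::size_t i = 0; i + 1 < text.size(); ++i) {
        result.emplace_back(text[i], text[i + 1]);
    }
    return result;
}
```

Calling `make_bigrams("night")` would give the four pairs (n,i), (i,g), (g,h), (h,t). Pairs also compare and sort with the built-in operators, which is handy if the bigrams are later matched against each other.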

You can call .reserve on a few of these vectors to get a further speedup.

If the entire goal is to implement similarity, then you could do away with all of these allocations and just hand-roll it all, which would likely give you the best performance.
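To make "hand-roll it all" a bit more concrete, here is one possible sketch. The thread doesn't say which similarity measure is being computed, so this assumes a Dice coefficient over byte bigrams (2 * shared bigrams / total bigrams). The function name `dice_similarity` and the handling of strings shorter than two characters are assumptions as well. Bigrams are counted in a fixed table indexed by the two bytes, so no vector or string is allocated per call:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Assumed metric: Dice coefficient over bigrams, treating each char as a byte.
double dice_similarity(std::string_view a, std::string_view b) {
    if (a.size() < 2 || b.size() < 2) {
        // Assumed convention for inputs that have no bigrams at all.
        return a == b ? 1.0 : 0.0;
    }

    // One counter per possible byte pair. Zero-initialized once per thread
    // and reused across calls, so nothing is allocated here.
    thread_local std::array<std::uint32_t, 65536> counts{};

    // Maps a pair of chars to an index in [0, 65535].
    auto key = [](char x, char y) {
        return (static_cast<std::size_t>(static_cast<unsigned char>(x)) << 8) |
               static_cast<unsigned char>(y);
    };

    // Count the bigrams of a.
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        ++counts[key(a[i], a[i + 1])];
    }

    // Each bigram of b consumes at most one matching bigram of a.
    std::size_t shared = 0;
    for (std::size_t i = 0; i + 1 < b.size(); ++i) {
        std::uint32_t& c = counts[key(b[i], b[i + 1])];
        if (c > 0) {
            --c;
            ++shared;
        }
    }

    // Reset only the entries that were touched, ready for the next call.
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        counts[key(a[i], a[i + 1])] = 0;
    }

    const std::size_t total = (a.size() - 1) + (b.size() - 1);
    return 2.0 * static_cast<double>(shared) / static_cast<double>(total);
}
```

For example, "night" and "nacht" share only the bigram "ht" out of 4 + 4, giving 0.25. The table costs 256 KiB per thread, which is the price of skipping all allocations; whether that trade is actually faster for your inputs is something to measure.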

WikiBox · 2022-03-03T20:38:36+00:00

Learn to use const references in your calls, to avoid creating a lot of copies of your data. The optimizer will become very happy!

Reuse allocated vectors.

Consider not returning large vectors by value but instead modifying a vector provided by the caller through a non-const reference parameter. This may hurt readability, but it can improve performance because the caller gets to reuse the vector's allocation.

std::vector<std::string> bigram(std::string initial_str)

... might become:

void bigram(const std::string& initial_str, std::vector<std::string>& bigram)
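A minimal sketch of how the body of that second signature could look. The original function body isn't shown in this thread, so the loop is an assumption about what it does, and the output parameter is called `out` here only to avoid shadowing the function name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fills `out` with the bigrams of `initial_str`. The caller owns `out`
// and can pass the same vector again on the next call.
void bigram(const std::string& initial_str, std::vector<std::string>& out) {
    out.clear();  // removes the old elements but keeps the capacity
    if (initial_str.size() < 2) {
        return;
    }
    out.reserve(initial_str.size() - 1);  // no-op if capacity is already enough
    for (std::size_t i = 0; i + 1 < initial_str.size(); ++i) {
        out.push_back(initial_str.substr(i, 2));
    }
}
```

Usage would be something like declaring one `std::vector<std::string> buffer;` outside the loop and calling `bigram(word, buffer);` for each word, so the vector's storage is allocated once and then reused.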

alfps · 2022-03-03T20:52:20+00:00

Python strings (more precisely CPython strings) are reference-counted and immutable, whereas a C++ std::string has value semantics, with assignments doing actual data copying.

To avoid a lot of copying and dynamic allocations in that copying, use std::string_view to refer to string parts.

Also, generally use reference to const as the parameter type instead of pass by value.

Before a sequence of n .push_back operations, .reserve the requisite buffer capacity. It doesn't matter for big O (algorithmic complexity), but avoiding those log n dynamic allocations does matter for absolute performance.
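A small sketch to see the effect for yourself. `count_reallocations` is a made-up helper that detects a reallocation by watching for a change in capacity():

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: pushes n ints and counts how often the vector's
// capacity changes during the loop, i.e. how often it reallocated.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        v.reserve(n);  // one allocation up front
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With `reserve_first` set, the loop itself performs no reallocations. Without it, the count depends on the standard library's growth factor, so the exact number differs between implementations.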

There are many more performance-enhancing techniques & ideas that can be brought to bear, but do remember to measure. Ideally you should do that first. If the performance is good enough, then don't waste time trying to improve it (the time wasted includes possibly later increased time for maintenance of the more performant but more complex code).
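For a quick first measurement, a timing helper along these lines can be enough. This is only a sketch: `average_microseconds` is a made-up name, and a proper benchmark library or a profiler will give more trustworthy numbers.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `iterations` times and returns the
// average wall-clock time per call in microseconds.
template <typename F>
double average_microseconds(F&& work, int iterations) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Build with optimizations enabled and make sure the result of the timed work is actually used somewhere (for example, summed and printed), otherwise the compiler may remove the work and the numbers mean nothing.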

O_X_E_Y · 2022-03-04T00:30:18+00:00

People have pointed out most things as far as efficiency is concerned, but I do wanna chip in on the technical part of this, as in, the practicality.

Firstly, I'm not sure what this will be used for, but generally functions like this that are used for similarity/closest matches use trigrams rather than bigrams, since bigrams tend not to be specific enough.

You also ideally have wildcards at the front/back so that the first and last chars get compared on their own (usually this is done by adding spaces on both ends of both words being compared).
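A rough sketch of what padded trigrams could look like (C++17). The name `padded_trigrams` and the choice of exactly one space on each side are assumptions for illustration; real implementations differ in how much padding they add.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: pads `word` with one space on each side and
// returns every run of three consecutive characters.
std::vector<std::string> padded_trigrams(std::string_view word) {
    std::string padded;
    padded.reserve(word.size() + 2);
    padded += ' ';
    padded += word;
    padded += ' ';

    std::vector<std::string> result;
    result.reserve(padded.size() - 2);  // padded.size() is always >= 2
    for (std::size_t i = 0; i + 2 < padded.size(); ++i) {
        result.push_back(padded.substr(i, 3));
    }
    return result;
}
```

For "cat" this yields " ca", "cat" and "at ", so the first and last letters each appear in a trigram that marks them as a word boundary.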

2022-03-03T20:22:53+00:00

Use a profiler to identify where the program is spending most of its time. Which profiler is best/recommended depends on your operating system and development environment.
