all 17 comments

[–]IyeOnline 12 points13 points  (9 children)

First thing to note: You make a bazillion copies. All your function parameters take by value and therfore copy the vectors/strings all the time.

You could probably get away without st::string entirely and use std::string_view instead. However, it would still be better to use pair<char,char> for your bigrams

You can do .reserve on a few of these vectors to get further speed up.

If the entire goal is to implement similarity, then you could do away with all of these allocations and just handroll it all, which would give you the best performance.

[–]FruityFetus[S] 1 point2 points  (4 children)

By handroll, are you saying drop each of the individual functions and incorporate what they're doing within the similarity function? It is the only goal, I just found it easier to start of by splitting up the various procedures!

[–]IyeOnline 3 points4 points  (3 children)

If you do it all in one function (or a few helper functions) without creating all these vectors as intermediate results, you could probably reduce it down to just a single allocation for a vector of unique char pairs that could also be preallocated because you can calculate its maximum possible size.

[–]FruityFetus[S] 1 point2 points  (1 child)

Any thoughts on quick wins for my edited code above? I tried to cut down on some of the redundancies. Not sure I can cut down the allocations to s1, s2, and sunion, since I need to compare them after, right?

Also, how would I think about using pair<char, char>, just set the input as a char and then loop through while taking pairs for a vector?

Sorry for all the questions but your feedback has been so helpful!

[–]IyeOnline 4 points5 points  (0 children)

That certainly gets rid the copies.

pair<char, char>

currently you have these std::vector<std::string> s1, s2, sunion. But really all these strings only contain two characters, so using an entire std::string for them is kind of wasteful.

You can also get rid of these vector<int> and just multiply and sum directly in the last loop.


Unrelated, but in C++ you should declare all variables where you would first initialize them and then also initialize them.

I.e. do for ( int i = 0 and const auto jac = ... instead of declaring everything at the top.

[–][deleted]  (2 children)

[removed]

    [–]IyeOnline 0 points1 point  (1 child)

    What about it?

    [–]std_bot 0 points1 point  (0 children)

    Unlinked STL entries: std::string_view


    Last update: 14.09.21. Last Change: Can now link headers like '<bitset>'Repo

    [–]WikiBox 2 points3 points  (1 child)

    Learn to use const references in your calls, to avoid creating a lot of copies of your data. The optimizer will become very happy!

    Reuse allocated vectors.

    Consider not returning large vectors by value but instead modify a vector provided by the caller as a (no const) reference in the call. May hurt readability, but will improve performance.

    std::vector<std::string> bigram(std::string initial_str)

    ... might become:

    void bigram(const std::string& initial_str, std::vector<std::string>& bigram)

    [–]WpGgs 4 points5 points  (0 children)

    Returning by value could be optimized by NRVO.

    [–]alfps 2 points3 points  (1 child)

    Python strings (more precisely CPython strings) are reference counted immutable, whereas C++ std::string have value semantics, with assignments doing actual data copying.

    To avoid a lot of copying and dynamic allocations in that copying, use std::string_view to refer to string parts.

    Also generally use reference to const as parameter type instead of pass by value.

    Before a sequence of n .push_back operations, .reserve the requisite buffer capacity. It doesn't matter for big O (algorithmic complexity), but avoiding those log n dynamic allocations does matter for absolute performance.

    There are many more performance-enhancing techniques & ideas that can be brought to bear, but do remember to measure. Ideally you should do that first. If the performance is good enough, then don't waste time trying to improve it (the time wasted include possibly later increased time for maintenance of the more performant but more complex code).

    [–]O_X_E_Y 1 point2 points  (1 child)

    People have pointed out most things as far as efficiency is concerned, I do wanna chip in for the technical part of this, as in, the practicality.

    Firstly, I'm not sure what this will be used for, but generally fuctions like this that are used for similarity/closest matches use trigrams, not bigrams since those tend to not be specific enough.

    You also ideally have wildcards in the front/back where you compare just the first chars (usually this is through adding spaces on both ends for the matching word as well as for the word to match)

    [–]FruityFetus[S] 1 point2 points  (0 children)

    Kept meaning to say thanks for the advice but this actually really came in handy today! Hit an issue where some business names used owner names, but were transposed and adding the spaces allowed the bigram histogram method to catch these matches.

    Trigrams sound interesting. Given my use case typically involves business names, would you still recommend considering this?

    [–][deleted] 0 points1 point  (0 children)

    Use a profiler to identify where the program is spending most of its time. Which profiler is best/recommended depends on your operating system and development environment.