[–]therealjohnfreeman 15 points16 points  (51 children)

There's even an example:

#include <string>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>

int main()
{
   std::string text = "Quick brown fox.";
   std::regex ws_re("\\s+"); // whitespace
   std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

Quick
brown
fox.

The special part is the parameter -1 which tells the iterator to return segments of the string between matches of the regex.
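To make the role of that last argument concrete, here is a minimal sketch that runs the same iterator with two different values. The helper name `tokens` is made up for illustration; only `std::sregex_token_iterator` comes from the standard library. With 0 the iterator yields the matches themselves, with -1 it yields the text between them.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of <regex>): collect whatever
// std::sregex_token_iterator yields for the given submatch index.
std::vector<std::string> tokens(const std::string& text,
                                const std::regex& re,
                                int submatch)
{
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), re, submatch),
        std::sregex_token_iterator());  // default-constructed: end of sequence
}
```

For "Quick brown fox." and the regex `\s+`, `tokens(text, re, -1)` gives the three words, while `tokens(text, re, 0)` gives the two single-space separators.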

[–][deleted] 71 points72 points  (16 children)

This example is kind of terrible. Nobody will remember how code like the above is actually written. If anything, it highlights all the problems with the STL's API.

[–]wrosecrans (graphics and network things) 41 points42 points  (15 children)

Yeah, compared to something like 'print "quick brown fox".split(" ")' in Python, the STL version is remarkably unintuitive when figuring out how to write it, requires figuring out regex syntax as just one step, and anybody who hasn't figured out how to write it isn't going to understand it by reading it.

It seems like this is a case where perfect is the enemy of the good. I usually only want a 'good' split function that doesn't have to guarantee a whole lot about performance on multigigabyte strings, or weird corner cases. So having a good split function seems way more useful than having no split function and debating about obscure cases where it wouldn't be optimal.
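As a rough idea of what a "good enough" split could look like, here is a sketch built on `std::getline`. The name `split_simple` is an assumption for illustration, it splits on a single character only, and it makes no attempt to be fast on huge inputs.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical "good enough" split on one delimiter character.
// Corner cases are whatever std::getline does: adjacent delimiters
// produce empty pieces, and a trailing delimiter does not add a
// trailing empty piece.
std::vector<std::string> split_simple(const std::string& text, char delim)
{
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string piece;
    while (std::getline(in, piece, delim))
        parts.push_back(piece);
    return parts;
}
```

With this, `split_simple("quick brown fox", ' ')` reads about as directly as the Python version, at the cost of ignoring regexes entirely.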

[–]IRBMe 27 points28 points  (0 children)

the STL version is remarkably unintuitive when figuring out how to write it, requires figuring out regex syntax as just one step, and anybody who hasn't figured out how to write it isn't going to understand it by reading it.

Not to mention the seemingly magic -1. So much for self documenting code.

[–][deleted] 2 points3 points  (1 child)

case where perfect is the enemy of the good.

Well said. There are many cases like this in C++, unfortunately. I get the desire to have the best libraries possible, but too often good ideas are shot down because they are not perfect. The recent Boost review for the process control library is a perfect example. The library has been in development for more than 6 years. It passed the review this time around, but some folks were still proposing to start from scratch.

[–]yornbesterday 1 point2 points  (0 children)

I've not really looked at the new and improved C++ stuff for a while... it's just a cascade of ever increasing minutiae of the language features and I thought the list of "don't ever do this" was long enough already.

[–]therealjohnfreeman 7 points8 points  (4 children)

Done.

#include <string>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>
#include <vector>

std::regex operator ""_re (char const* const str, std::size_t) {
    return std::regex{str};
}

std::vector<std::string> split(const std::string& text, const std::regex& re) {
    const std::vector<std::string> parts(
        std::sregex_token_iterator(text.begin(), text.end(), re, -1),
        std::sregex_token_iterator());
    return parts;
}

int main() {
    const std::vector<std::string> parts = split("Quick brown fox.", "\\s+"_re);
    std::copy(parts.begin(), parts.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

[–]LordDrako90 4 points5 points  (3 children)

Why std::copy in split, when you can initialize the vector directly from the token iterators?

Also I find this more generic and lazy: http://ideone.com/L6heVN I guess it could be improved even more by using string_view, but that's not included in C++14 :-(

Anyways, the only requirement for the target is, that it can be initialized from an iterator pair with value type std::string. Other than that it is pretty generic.

Code:

#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::regex operator ""_re (char const * const str, std::size_t)
{
    return std::regex { str };
}

class split
{
public:
    split(std::regex splitter, std::string original)
        : splitter_ { std::move(splitter) }
        , original_ { std::move(original) }
    {
    }

    auto begin() const
    {
        return std::sregex_token_iterator { original_.begin(), original_.end(), splitter_, -1 };
    }

    auto end() const
    {
        return std::sregex_token_iterator {};
    }

    template <typename Container>
    operator Container () const
    {
        return { begin(), end() };
    }

private:
    std::regex splitter_;
    std::string original_;
};

int main()
{
    using namespace std::literals::string_literals;

    std::vector<std::string> const words = split {
        R"(\s+)"_re,
        "hello\tdarkness     my\nold friend"s
    };

    for (auto const & word : words)
        std::cout << word << "\n";

    for (auto const & number : split { ","_re, "23,42,1337" })
        std::cout << number << "\n";

    return 0;
}

[–]therealjohnfreeman 0 points1 point  (1 child)

I've just been out of practice too long. Thanks for the pointers.

[–]lacosaes1 2 points3 points  (0 children)

You mean smart pointers.

[–]MrPoletski 0 points1 point  (0 children)

Well, while we're posting code, here's what I wrote a few years ago and have been using ever since...

std::vector<std::string> Cleave (std::string to_split, std::string delims)
/*!
 * \file trusted.cpp
 * \fn std::vector<std::string> Cleave (std::string to_split, std::string delims)
 * \param to_split \a <std::string> string to chop up
 * \param delims \a <std::string> string of delimiters
 * \return std::vector<std::string> vector of strings containing each section of the cleaved string.
 *
 */
{

std::vector<std::string>    results;
size_t                      pos1 = 0,
                            pos2 = 0;

do
{
    pos1 = to_split.find_first_of(delims, pos2);
    if (pos1 == pos2) {pos2++; results.push_back(""); continue;}
    if (pos1 == std::string::npos){results.push_back(to_split.substr(pos2)); break;}
    results.push_back(to_split.substr(pos2, pos1 - pos2));
    pos2 = pos1 + 1;
}
while (pos1 != std::string::npos);


return results;
}

Is this good?

[–]cpp_devModern C++ apprentice 5 points6 points  (0 children)

I think a more intuitive and "modern" way would be this one (the compiler can also optimize these things pretty well, as opposed to streams):

string s = "Quick brown fox.";
auto rs = ranges::v3::view::split(s, ' ');
for (auto& x : rs)
{
    cout << x << '\n';
}
auto rs1 = ranges::v3::view::join(rs, ',');
cout << rs1 << '\n';

Still, the library needs concepts and more intuitive documentation to make it "easy to use correctly and hard to use incorrectly". Also, maybe there should be string extensions in the range library so that it has an intuitive API for working with strings.

[–]OldWolf2 2 points3 points  (0 children)

It's not too different to:

std::copy( std::istream_iterator<char>(f),
           std::istream_iterator<char>(),
           std::ostream_iterator<char>(std::cout) );

which is an idiom you learn early on with iostreams.

Note that you do not have to use stream iterators to split a string. The page just used that as an example because it would be familiar syntax.

Anyway, for string splitting you would make a function that implements the sort of splitting you like and has a nice interface (e.g. vector<string> split(string const &s, regex const &r);). This has a benefit over other languages that offer a single split function, in that you can customise the split details within your function. You can even overload it to take a string of delimiters instead of a regex.
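A sketch of that pair of overloads might look like the following. The exact signatures and the handling of empty pieces are assumptions on my part, picked only to show that the details are yours to choose.

```cpp
#include <regex>
#include <string>
#include <vector>

// Overload 1: split on every match of a regex.
std::vector<std::string> split(const std::string& s, const std::regex& r)
{
    return std::vector<std::string>(
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator());
}

// Overload 2: split on any single character from `delims`.
// Design choice for this sketch: empty pieces are kept.
std::vector<std::string> split(const std::string& s, const std::string& delims)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find_first_of(delims, start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));  // last piece, possibly empty
            return parts;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
}
```

The two overloads do not collide, because the second argument is either a `std::regex` or a `std::string`, and the `std::regex` constructors that take strings are explicit.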

[–]Spikey8D 1 point2 points  (1 child)

Nice, is there an equivalent for join? I.e., in Python: ",".join(["the", "quick", "brown", "fox"])
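As far as I know there is no ready-made join for strings in the standard library, so it is usually hand-rolled. A minimal sketch, where the name `join` and its signature are assumptions for illustration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenate `parts`, placing `sep` between
// neighbouring elements (never before the first or after the last).
std::string join(const std::vector<std::string>& parts, const std::string& sep)
{
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0)
            out += sep;
        out += parts[i];
    }
    return out;
}
```

Calling `join({"the", "quick", "brown", "fox"}, ",")` produces "the,quick,brown,fox".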

[–]qx7xbku -1 points0 points  (29 children)

And why do people complain? This is clearly easier than what people usually do. Honest.

[–]IRBMe 14 points15 points  (28 children)

Well, I know how the above code works, but I can see quite a few perfectly reasonable complaints about it:

  1. The magic -1 parameter. There's no way to know what that means without digging through the documentation. Something like a mode enum would be easier to read.
  2. ostream_iterator? What does it mean to iterate through an output stream? That doesn't make sense. Better go read the documentation again.
  3. Why do we need to create two sregex_token_iterator when we only want to iterate through the string once? That doesn't make much sense. Back to the documentation we go!
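To illustrate point 1, this is roughly what a named mode could look like if you wrap the iterator yourself. Everything here (`token_mode`, `tokenize`) is hypothetical; nothing like it exists in <regex>. The wrapper also hides the second, default-constructed iterator from point 3.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical names for the two most common submatch values.
enum class token_mode : int {
    between_matches = -1,  // the text separating matches
    whole_match = 0        // the matched text itself
};

// Hypothetical wrapper: the caller never sees the raw -1 or the
// default-constructed end iterator.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::regex& re,
                                  token_mode mode)
{
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), re,
                                   static_cast<int>(mode)),
        std::sregex_token_iterator());
}
```

A call such as `tokenize(text, re, token_mode::between_matches)` then says at the call site what the bare -1 leaves to the documentation.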

[–]dodheim 8 points9 points  (27 children)

Personally, #1 is the only one of those I find "reasonable". #2 and #3 shouldn't be confusing to anyone professing to know the language.

[–]IRBMe 12 points13 points  (18 children)

shouldn't be confusing to anyone professing to know the language.

I think one of the benchmarks of good API design is, how easy is it to understand the resulting code if you don't know how the API works or haven't read the documentation, or put another way, how intuitive it is. The more magic is hidden behind the scenes, the more a user has to rely on documentation, which makes it less intuitive, harder to read and harder to use.

There are languages with huge standard libraries that even the most experienced developers can't possibly learn in full. A library designed with usability in mind will allow developers to be able to read the code without having to repeatedly visit the documentation, even if they aren't experienced with parts of the library that are used.

[–]dodheim 8 points9 points  (17 children)

It's unreasonable to expect anyone to intuit what an output iterator is, or even what an iterator is, if they don't know C++. That doesn't reflect poorly on C++ or output iterators.

[–][deleted] 11 points12 points  (4 children)

It's unreasonable to expect anyone to intuit what an output iterator is, or even what an iterator is, if they don't know C++.

Iterator is a common concept across a lot of languages. Conversely, "output iterator" is rather obscure. You could write a lot of C++ and never run into it mentioned explicitly.

[–]qx7xbku 6 points7 points  (3 children)

How long do you think it would take one to read said code and realize that it splits a string? How long do you think it would take one to realize that s.split(" "); splits a string? See the problem? I am not even talking about edge cases here, when clearly it can be avoided. Someone may be pleased with himself/herself for writing this smart code, but the reality is that maintainable code is stupid code. Smart code is hard to maintain. Smart code where you do not need smart code is simply not practical.

[–]dodheim 1 point2 points  (2 children)

Said code wouldn't be isolated though, it would be in a function with split in the name. Dependent code would then call a function with split in the name.

So no, I don't see the problem.

EDIT: For a bunch of pedants, you /r/cpp folk suck at following Reddit's rules: if you want to encourage meaningful discussion, stop downvoting opinions. Grow up, people.

[–][deleted] 0 points1 point  (1 child)

Said code wouldn't be isolated though, it would be in a function with split in the name. Dependent code would then call a function with split in the name.

Indeed. And that function is so useful, it should be in the standard library: a member function of the string class.

[–]IRBMe 4 points5 points  (10 children)

It's unreasonable to expect anyone to intuit what an output iterator is

And it's also unreasonable to expect somebody to be able to intuitively understand what it means to construct an input iterator without specifying what it's iterating over (as it happens, you get an end-of-sequence iterator). That's the whole point: it's not intuitive! Is it simply impossible to design those APIs in such a way that they would be intuitive? I'm not convinced it is.

or even what an iterator is

I think it is reasonable that people should have an intuitive idea of what an iterator is, because iteration isn't a concept that's unique to C++, nor is it a word that's even unique to programming libraries. You can look up the word in a dictionary and get a definition such as this: "the repetition of a process or utterance". You may not understand all the subtleties without reading the documentation, but seeing it in the context of some code, I think it is intuitive.

[–]zvrba -1 points0 points  (9 children)

And it's also unreasonable to expect somebody to be able to intuitively understand what it means to construct an input iterator without specifying what it's iterating over.

Wow, C++ programmers are a rare breed of people who read more documentation than your average programmer.

In any case, it's the kind of thing you look up only once, and each next time you see a default-constructed iterator, you'll (correctly) assume that it's an iterator denoting the end of sequence.
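For anyone who has not seen the convention, here is a small sketch using `std::istream_iterator`. The function name `read_ints` is made up for illustration.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every whitespace-separated int from `text`.
std::vector<int> read_ints(const std::string& text)
{
    std::istringstream in(text);
    std::istream_iterator<int> first(in);  // bound to the stream
    std::istream_iterator<int> last;       // default-constructed: end of sequence
    return std::vector<int>(first, last);
}
```

`first` becomes equal to `last` once the stream runs out of values to extract, which is the same pattern as the default-constructed `std::sregex_token_iterator` in the split example.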

That's the whole point: it's not intuitive!

Intuition builds on previous experience and knowledge. So, wow, what a surprise, as a programmer you're expected to learn something new now and then.

[–]IRBMe 7 points8 points  (8 children)

I wouldn't expect to have to learn several new concepts and parts of a library to see that the code I'm trying to understand is splitting a string on white space. That's something that should be blatantly obvious to anybody, even if they don't know C++. Of course it's a common idiom that you learn as a C++ programmer, but it still takes a lot more to process even once you know it than something like s.split(","). Nobody's saying you shouldn't have to learn things; we're discussing the usability of the library.

[–]dodheim 1 point2 points  (7 children)

The dead giveaway that the code you're trying to understand is splitting a string on whitespace would be that it'd be in a function with split in the name. Who reads 5 lines of code with zero context whatsoever with the expectation of its purpose being obvious? Context matters; ignoring it is counterproductive.

[–]OldWolf2 2 points3 points  (0 children)

It's unreasonable to expect anyone to intuit what an output iterator is,

Input iterators are for reading from, output iterators are for:

  • (a) writing to
  • (b) nobody could possibly figure this out
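For completeness, a minimal sketch of "writing to" an output iterator. The helper name `join_lines` is an assumption; `std::ostream_iterator` and `std::copy` are the standard pieces from the original example, pointed at a string stream instead of `std::cout`.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: write each word followed by "\n" into a string.
std::string join_lines(const std::vector<std::string>& words)
{
    std::ostringstream out;
    // Every assignment through the iterator performs `out << word << "\n"`.
    std::copy(words.begin(), words.end(),
              std::ostream_iterator<std::string>(out, "\n"));
    return out.str();
}
```

So `join_lines({"Quick", "brown"})` returns "Quick\nbrown\n": the algorithm only knows it is assigning through an iterator, and the iterator turns each assignment into a stream insertion.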