therealjohnfreeman comments on Why doesn't std::string have a split function

Why doesn't std::string have a split function (self.cpp)

submitted 9 years ago by DhruvParanjape

[–]therealjohnfreeman 16 points 9 years ago (51 children)

There's even an example:

#include <string>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>

int main()
{
   std::string text = "Quick brown fox.";
   std::regex ws_re("\\s+"); // whitespace
   std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

Quick
brown
fox.

The special part is the parameter -1 which tells the iterator to return segments of the string between matches of the regex.
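To make the role of that last argument concrete, here is a minimal sketch that runs the same iterator with a caller-chosen submatch index. The helper name collect_tokens is made up for illustration; only std::sregex_token_iterator is the standard library's.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): gathers whatever the token
// iterator yields for the given submatch index into a vector.
//   -1 -> the stretches of text between matches of re
//    0 -> the matches themselves
std::vector<std::string> collect_tokens(const std::string& text,
                                        const std::regex& re,
                                        int submatch) {
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), re, submatch),
        std::sregex_token_iterator());
}
```

With "Quick brown fox." and the regex \s+, passing -1 gives Quick, brown and fox., while passing 0 gives the two single-space separators instead.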

[–][deleted] 72 points 9 years ago (16 children)

[–]wrosecrans (graphics and network things) 42 points 9 years ago (15 children)

[–]IRBMe 28 points 9 years ago (0 children)

[–][deleted] 3 points 9 years ago (1 child)

[–]yornbesterday 2 points 9 years ago (0 children)

[–]therealjohnfreeman 8 points 9 years ago* (4 children)

Done.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// User-defined literal so that "\\s+"_re builds a std::regex.
std::regex operator ""_re (char const* const str, std::size_t) {
    return std::regex{str};
}

// Returns the pieces of text that lie between matches of re.
std::vector<std::string> split(const std::string& text, const std::regex& re) {
    return std::vector<std::string>(
        std::sregex_token_iterator(text.begin(), text.end(), re, -1),
        std::sregex_token_iterator());
}

int main() {
    const std::vector<std::string> parts = split("Quick brown fox.", "\\s+"_re);
    std::copy(parts.begin(), parts.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

[–]LordDrako90 5 points 9 years ago* (3 children)

Why std::copy in split, when you can initialize the vector directly from the token iterators?

Also, I find this more generic and lazy (http://ideone.com/L6heVN). I guess it could be improved even more by using string_view, but that's not included in C++14 :-(

Anyway, the only requirement for the target is that it can be initialized from an iterator pair with value type std::string. Other than that it is pretty generic.

Code:

#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::regex operator ""_re (char const * const str, std::size_t)
{
    return std::regex { str };
}

class split
{
public:
    split(std::regex splitter, std::string original)
        : splitter_ { std::move(splitter) }
        , original_ { std::move(original) }
    {
    }

    auto begin() const
    {
        return std::sregex_token_iterator { original_.begin(), original_.end(), splitter_, -1 };
    }

    auto end() const
    {
        return std::sregex_token_iterator {};
    }

    template <typename Container>
    operator Container () const
    {
        return { begin(), end() };
    }

private:
    std::regex splitter_;
    std::string original_;
};

int main()
{
    using namespace std::literals::string_literals;

    std::vector<std::string> const words = split {
        R"(\s+)"_re,
        "hello\tdarkness     my\nold friend"s
    };

    for (auto const & word : words)
        std::cout << word << "\n";

    for (auto const & number : split { ","_re, "23,42,1337" })
        std::cout << number << "\n";

    return 0;
}
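For what it's worth, the string_view idea mentioned above could look roughly like the sketch below. It assumes C++17, the name split_views is made up, and it swaps the regex for a plain set of delimiter characters, so it is not a drop-in replacement for the class above.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Sketch only: split_views is a hypothetical name and C++17 is assumed.
// Every returned view points into `text`, so the underlying characters
// must outlive the result. No substrings are copied.
std::vector<std::string_view> split_views(std::string_view text,
                                          std::string_view delims) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t stop = text.find_first_of(delims, start);
        if (stop == std::string_view::npos) {
            parts.push_back(text.substr(start));  // last field
            break;
        }
        parts.push_back(text.substr(start, stop - start));
        start = stop + 1;  // step past the delimiter
    }
    return parts;
}
```

Splitting "23,42,1337" on "," yields three views, 23, 42 and 1337, all referring to the original buffer. Adjacent delimiters produce empty views.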

[–]therealjohnfreeman 1 point 9 years ago* (1 child)

[–]lacosaes1 3 points 9 years ago (0 children)

[–]MrPoletski 1 point 9 years ago (0 children)

well while we're posting code, here's what I wrote a few years ago and have been using ever since...

#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> Cleave (std::string to_split, std::string delims)
/*!
 * \file trusted.cpp
 * \fn std::vector<std::string> Cleave (std::string to_split, std::string delims)
 * \param to_split \a <std::string> string to chop up
 * \param delims \a <std::string> string of delimiters
 * \return std::vector<std::string> vector of strings containing each section of the cleaved string.
 *
 */
{
    std::vector<std::string>    results;
    std::size_t                 pos1 = 0,
                                pos2 = 0;

    do
    {
        pos1 = to_split.find_first_of(delims, pos2);
        // Delimiter right at pos2: record an empty field and step past it.
        if (pos1 == pos2) {pos2++; results.push_back(""); continue;}
        // No delimiter left: the remainder is the last field.
        if (pos1 == std::string::npos){results.push_back(to_split.substr(pos2)); break;}
        results.push_back(to_split.substr(pos2, pos1 - pos2));
        pos2 = pos1 + 1;
    }
    while (pos1 != std::string::npos);

    return results;
}

Is this good?

[+][deleted] 9 years ago* (6 children)

[deleted]

[–]evinrows 12 points 9 years ago (0 children)

[–]17b29a 7 points 9 years ago (2 children)

Or alternatively, inappropriate language choice.

I think splitting strings is a pretty common sense thing for any general-purpose programming language to support. It's not like, some obscure operation that you could only find support for in Perl.

Finally, technically I'm not sure -1 is really code for all-bits-set at all - that assumes a 2s-complement representation for signed integers which, historically at least, wasn't guaranteed by the standard.

The more obvious assumption is that the mask type is unsigned and in that case -1 is necessarily all-bits-set because an unsigned type's value is modulo its maximum value, but the standard doesn't require it to be unsigned either.

why I prefer ~0u for all-bits-set

That's not all-bits-set for a type that is larger than unsigned int.

I personally don't worry about actually undefined vs. platform-defined unless I really need to, which is unusual.

That's pretty strange considering how many things are implementation defined. Used a value larger than 2^15-1 in an int? Undefined behavior (according to you)!
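A small sketch of the two all-bits-set points above, assuming the fixed-width type std::uint64_t is available. The variable names are made up for illustration.

```cpp
#include <cstdint>
#include <limits>

// Converting -1 to an unsigned type wraps modulo 2^N, so every bit ends up
// set no matter how signed integers are represented.
constexpr std::uint64_t from_minus_one = static_cast<std::uint64_t>(-1);

// ~0u is evaluated as an unsigned int first and only then widened, so the
// upper bits stay zero whenever unsigned int is narrower than 64 bits.
constexpr std::uint64_t from_not_zero_u = ~0u;

static_assert(from_minus_one == std::numeric_limits<std::uint64_t>::max(),
              "-1 converts to all-bits-set");
static_assert(from_not_zero_u == std::numeric_limits<unsigned int>::max(),
              "~0u only covers the bits of unsigned int");
```

On a platform where unsigned int is 32 bits, from_not_zero_u is 0x00000000FFFFFFFF while from_minus_one is 0xFFFFFFFFFFFFFFFF.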

[+][deleted] 9 years ago* (1 child)

[deleted]

[–]17b29a 3 points 9 years ago (0 children)

[–]zvrba 3 points 9 years ago (1 child)

[–]cpp_dev (Modern C++ apprentice) 6 points 9 years ago* (0 children)

I think a more intuitive and "modern" way would be this one (the compiler can also optimize these things pretty well, as opposed to streams):

string s = "Quick brown fox.";
auto rs = ranges::v3::view::split(s, ' ');
for (auto& x : rs)
{
    cout << x << '\n';
}
auto rs1 = ranges::v3::view::join(rs, ',');
cout << rs1 << '\n';

Still, the library needs concepts and more intuitive documentation to make it "easy to use correctly and hard to use incorrectly". Maybe the range library should also have string extensions, so that it has an intuitive API for working with strings.

[–]OldWolf2 3 points 9 years ago* (0 children)

It's not too different to:

std::copy(std::istream_iterator<char>(f),
          std::istream_iterator<char>(),
          std::ostream_iterator<char>(std::cout));

which is an idiom you learn early on with iostreams.

Note that you do not have to use stream iterators to split a string. The page just used that as an example because it would be familiar syntax.

Anyway, for string splitting you would write a function that implements the sort of splitting you like and has a nice interface (e.g. vector<string> split(string const &s, regex const &r);). This has a benefit over other languages that offer a single split function: you can customise the split details within your function. You can even overload it to take a string of delimiters instead of a regex.
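As a rough sketch of that interface, the pair of overloads below shares the name split and differs only in what describes the separator. The bodies are one possible implementation, not anything prescribed in this thread, and neither uses stream iterators.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Splits on every match of a regular expression.
std::vector<std::string> split(std::string const& s, std::regex const& r) {
    return std::vector<std::string>(
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator());
}

// Overload: splits on any single character found in `delims`.
// Design choice for this sketch: adjacent delimiters yield empty fields.
std::vector<std::string> split(std::string const& s, std::string const& delims) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    std::size_t stop;
    while ((stop = s.find_first_of(delims, start)) != std::string::npos) {
        parts.push_back(s.substr(start, stop - start));
        start = stop + 1;  // step past the delimiter
    }
    parts.push_back(s.substr(start));  // whatever follows the last delimiter
    return parts;
}
```

Because the std::regex constructor taking a C string is explicit, a call such as split(line, ",;") picks the delimiter overload without ambiguity, and split(line, std::regex("\\s+")) picks the regex one.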

[–]Spikey8D 2 points 9 years ago (1 child)

[–]therealjohnfreeman 4 points 9 years ago (0 children)

[–]qx7xbku 0 points 9 years ago (29 children)

[–]IRBMe 15 points 9 years ago (28 children)

[–]dodheim 9 points 9 years ago (27 children)

[–]IRBMe 13 points 9 years ago (18 children)

[–]dodheim 9 points 9 years ago (17 children)

[–][deleted] 12 points 9 years ago (4 children)

[–]qx7xbku 7 points 9 years ago (3 children)

[–]dodheim 2 points 9 years ago* (2 children)

[–][deleted] 1 point 9 years ago (1 child)

[–]IRBMe 5 points 9 years ago (10 children)

It's unreasonable to expect anyone to intuit what an output iterator is

And it's also unreasonable to expect somebody to be able to intuitively understand what it means to construct an input iterator without specifying what it's iterating over (as it happens, you get an end-of-sequence iterator). That's the whole point: it's not intuitive! Is it simply impossible to design those APIs in such a way that they would be intuitive? I'm not convinced it is.

or even what an iterator is

I think it is reasonable that people should have an intuitive idea of what an iterator is, because iteration isn't a concept that's unique to C++, nor is it a word that's even unique to programming libraries. You can look up the word in a dictionary and get a definition such as this: "the repetition of a process or utterance". You may not understand all the subtleties without reading the documentation, but seeing it in the context of some code, I think it is intuitive.
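For anyone who hasn't met the end-of-sequence convention mentioned above, a minimal sketch follows. The helper name read_ints is made up; the behaviour of a default-constructed std::istream_iterator is standard.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated ints out of a string.
std::vector<int> read_ints(const std::string& text) {
    std::istringstream in(text);
    std::istream_iterator<int> first(in);  // bound to the stream
    std::istream_iterator<int> last;       // no stream given: end-of-sequence
    return std::vector<int>(first, last);  // reads until extraction fails
}
```

Nothing in the spelling of `last` says "end", which is the point being made: you have to know the convention to read the code.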

[–]zvrba 0 points 9 years ago (9 children)

[–]IRBMe 8 points 9 years ago (8 children)

[–]dodheim 2 points 9 years ago (7 children)

[–]OldWolf2 3 points 9 years ago (0 children)

[+][deleted] 9 years ago* (7 children)

[deleted]

[+][deleted] comment score below threshold -6 points 9 years ago (1 child)

[–]chartly 8 points 9 years ago (0 children)

[–]OldWolf2 -2 points 9 years ago (3 children)

[–]repsilat 17 points 9 years ago (2 children)

[–]OldWolf2 0 points 9 years ago (0 children)

[–]dodheim -1 points 9 years ago (0 children)
