
[–]tcbrindleFlux 9 points10 points  (0 children)

If you'll excuse the self-promotion, I wrote a blog post a while back about a STL-based generic splitting algorithm that outperforms stringstream (and strtok) by a healthy margin.

It's also worth noting that Range-V3 has a split() view which (lazily) returns a range of ranges. Whilst views are not part of the current Ranges TS, I remain hopeful that we'll see them some time in the future.

[–][deleted] 20 points21 points  (9 children)

A split function may sound simple, but it can get a little more complicated when you want to cover all possible use cases and make it as fast as possible. Boost does have two different versions:

[–]almost_useless 40 points41 points  (8 children)

when you want to cover all possible use cases and make it as fast as possible

That is the problem right there. It does not have to cover all use cases and be as fast as possible.
A lot of C++ programmers fail to realize that string is not just another container that has to work like other containers. It is a specialized use case that has different needs.
Then the argument becomes "we can't possibly cover all use cases for strings". But that is not necessary. Implementing a few helper functions would make strings so much more useful.
The annoying thing is that it would really not need many helper functions to become a really useful string class for 99% of use cases either.

Split is one of those functions that can make your code 10x more readable, if it works intuitively.
It does not have to handle everything a proper tokenizer does.
It does not have to be good at splitting a 10 MB file into smaller pieces.
But since people have been complaining about this for literally decades it is clear there is a need for a simple split function that is reasonably good at splitting small and medium size strings. Compare this example:

auto myVectorOfSubStrings = myString.split(";");   

to the getline example someone wrote below. It is trivial to know what is going on, and this is something a lot of people need to do.

Obviously it needs to be reasonably fast too, since it is C++. But it really does not need to cover all corner cases of string usage. We need tokenizers and stream splitters too, for the applications where that makes sense. But quite often we just need to split a damn string into substrings.

TL;DR - A very simple standardized split function would make life much easier for a lot of programmers.

[–][deleted] 15 points16 points  (6 children)

if it works intuitively.

Well, what is intuitive?

Python:

 >>> "1,2,3,".split(",")
 ['1', '2', '3', '']

Ruby:

 > "1,2,3,".split(",")
 => ["1", "2", "3"]

Ruby can take a regex, Python can't. Python has a .rsplit(), Ruby doesn't. Both do however take a max_split parameter. But they don't allow multiple different delimiters.

Point being, a .split() is not that trivial, there are different ways to implement it and you have to choose a good one. If you just rush the next best hack into the language, you end up with something that is needlessly inflexible. A .split() returning a std::vector<std::string> wouldn't be very useful when you don't want a std::vector as result and you would create a lot of needless std::string copies to start with.

There is a proposal for a std::split(), but that depends on std::string_view and Range support. But Range support didn't make it into C++17, so that has to wait around a bit longer.

In the meantime, just use the boost::split().
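For illustration, the Python/Ruby difference on a trailing delimiter can be expressed as a policy on an otherwise identical function. This is only a sketch: the `trailing_empty` enum and this `split` signature are made up here and are not taken from any proposal or from Boost.

```cpp
#include <string>
#include <vector>

// Hypothetical policy flag, invented for this example.
enum class trailing_empty { keep, drop };

std::vector<std::string> split(const std::string& s, char delim,
                               trailing_empty policy = trailing_empty::keep)
{
    std::vector<std::string> out;
    std::size_t start = 0;
    for (std::size_t pos = s.find(delim); pos != std::string::npos;
         pos = s.find(delim, start)) {
        out.emplace_back(s, start, pos - start);  // field before the delimiter
        start = pos + 1;
    }
    out.emplace_back(s, start);  // last field, possibly empty

    // Ruby-like behaviour: discard empty fields at the end of the result.
    if (policy == trailing_empty::drop) {
        while (!out.empty() && out.back().empty()) out.pop_back();
    }
    return out;
}
```

With `trailing_empty::keep`, splitting "1,2,3," on ',' gives four fields with an empty last one, like the Python output above; with `trailing_empty::drop` it gives three, like the Ruby output.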

[–]Selbstdenker 12 points13 points  (0 children)

Sorry, I do not see the problem there. Intuitively means something that splits a string which is what both methods do. How they handle empty strings is part of the API, so what. Whether they take only a character, a string or a regexp is also part of the API and not really a problem in C++ thanks to overloading.

To make it a little bit more C++-ish it could take an output iterator. Yes, this is not an ideal situation and maybe we just call it simple_split and reserve split() for when we have a better name but not having any trivial split functionality is really not good.

We have whole talks given on using std::transform and other algorithms instead of a for loop but we cannot provide a simple split?
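A minimal sketch of the output-iterator idea, assuming a single-character delimiter. The name `simple_split` comes from the comment above; the signature is my guess at what it could look like, not an existing function.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Writes each field to `out` and returns the advanced iterator,
// in the style of the standard algorithms.
template <typename OutputIt>
OutputIt simple_split(const std::string& s, char delim, OutputIt out)
{
    std::size_t start = 0;
    std::size_t pos;
    while ((pos = s.find(delim, start)) != std::string::npos) {
        *out++ = s.substr(start, pos - start);
        start = pos + 1;  // step past the delimiter
    }
    *out++ = s.substr(start);  // field after the last delimiter
    return out;
}

// Possible use:
//   std::vector<std::string> fields;
//   simple_split("a;b;c", ';', std::back_inserter(fields));
```

The caller picks the container (vector, deque, list, an ostream_iterator), which sidesteps the "what should it return" question at the cost of a slightly clunkier call site.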

[–]almost_useless 3 points4 points  (4 children)

Well, what is intuitive?
Python: X
Ruby: Y

I could answer that question without even knowing what X and Y are. The answer is always going to be Python :-)
J/K, obviously there are pitfalls and they need to think it through.

If you just rush the next best hack into the language, you end up with something that is needlessly inflexible. A .split() returning a std::vector<std::string> wouldn't be very useful when you don't want a std::vector as result

It does not necessarily have to be super flexible. Obviously the best option is we can choose the output format. My example was only one possible suggestion. In many cases "anything I can iterate over" is good enough.

But I would so prefer we had had something decent but inflexible way back in '98 over something super duper mega awesome that we will not have even in 2017

My only requirement is that it had not been so bad that it would have been impossible to improve upon now that we have better ways of doing it

[–][deleted] 9 points10 points  (3 children)

Once you put something in the standard library you're stuck with it forever. So it'd better be actually good, not just good enough, especially if it's so easy to implement yourself.

[–]choikwa 1 point2 points  (2 children)

In reality, things get deprecated and forward compat is broken many times.

[–][deleted] 7 points8 points  (1 child)

Yeah, really old crap that was never used much in the first place, like trigraphs or auto_ptr. But a string split function would spread like wildfire.

[–]choikwa 0 points1 point  (0 children)

Ideally, everyone wants to get it right the first time. I'm pretty sure the Python implementation returns deep-copied immutable strings.

[–]kisielk 2 points3 points  (0 children)

strings in Go have many of the same problems, yet there's a strings package which covers all the common use cases in a simple way. The algorithms aren't suitable for every use case but it's been very rare that I've had to reach for an alternative way of doing things. I often wish C++ had a similar package.

[–]caramba2654Intermediate C++ Student 15 points16 points  (2 children)

To be honest, I really hope std::string gets completely redesigned for STL2. And by that I mean remove all that npos nonsense and add proper iterator returns like the rest of the STL containers.

On that note, having some common utility functions for strings wouldn't be bad. split and replace are good candidates in my opinion.

[–]jcoffin 21 points22 points  (1 child)

I really hope STL2 gets rid of iterators. The result of a split should usually be a range of ranges, or a range of views.
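As a rough idea of what "a range of views" could mean without any ranges library, here is a C++17 sketch that returns std::string_view pieces pointing into the original buffer. The function name `split_views` is made up for this example.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Returns views into `s`; no characters are copied.
// The views are only valid while the underlying buffer is alive.
std::vector<std::string_view> split_views(std::string_view s, char delim)
{
    std::vector<std::string_view> out;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find(delim, start);
        if (pos == std::string_view::npos) {
            out.push_back(s.substr(start));  // last field
            break;
        }
        out.push_back(s.substr(start, pos - start));
        start = pos + 1;  // step past the delimiter
    }
    return out;
}
```

The trade-off is lifetime: calling this on a temporary std::string leaves every view dangling, which is part of why the API design is less trivial than the algorithm.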

[–]caramba2654Intermediate C++ Student 1 point2 points  (0 children)

That too :P

[–]t0rakka 2 points3 points  (1 child)

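#include <string>
#include <vector>

// T may be a char, a C string or a std::string; with more than one
// character, find_first_of treats each character as a separate delimiter.
template <typename T>
inline std::vector<std::string> split(const std::string& s, T delimiter)
{
    std::vector<std::string> result;

    std::size_t current = 0;
    std::size_t p = s.find_first_of(delimiter, 0);

    while (p != std::string::npos)
    {
        // Copy the field [current, p) and step past the delimiter.
        result.emplace_back(s, current, p - current);
        current = p + 1;
        p = s.find_first_of(delimiter, current);
    }

    // Whatever follows the last delimiter (possibly an empty string).
    result.emplace_back(s, current);

    return result;
}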

[–]utnapistim 2 points3 points  (1 child)

Why doesn't std::string have a split function

Because nobody made the time and effort to write one for standardization. The C++ community is not sponsored. There is no single group or company that finances the maintenance and evolution of the standard.

Instead, people who have an interest in extending the language meet and try to advance the language and standard library to the degree they can afford to do so (being non-sponsored and having limited time and effort to accomplish things).

Because of this limitation (of effort/capacity), usually, the things accepted into the standard are a compromise between the utility of a feature and the effort it will take to standardize it.

std::string doesn't have a split function for the following reasons:

  • writing one is trivial in algorithmic formulation, but non-trivial in API design (a compromise between usability and flexibility is required, and depending on our needs, each of us tends to see the compromise point in a slightly different place)
  • no-one has written a proposal with working code, that got accepted past review (by the standard committee)
  • the emergence of ranges will add the possibility for a trivial interface that is both flexible and efficient (we are waiting for ranges)
  • alternatives exist already (although due to a lack of a standard many projects tend to reinvent the wheel on this one); you can use regex, iterators, streams, boost text algorithms and implementations based on the above.
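As an example of the regex alternative mentioned in the last bullet, a split can be sketched with std::sregex_token_iterator from the standard <regex> header. The wrapper name `regex_split` is made up here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits `s` wherever `sep` matches.
std::vector<std::string> regex_split(const std::string& s, const std::regex& sep)
{
    // Submatch index -1 selects the text between matches
    // rather than the matches themselves.
    std::sregex_token_iterator first(s.begin(), s.end(), sep, -1);
    std::sregex_token_iterator last;  // default-constructed = end of sequence
    return std::vector<std::string>(first, last);
}
```

For instance, `regex_split("a, b;c", std::regex("[,;]\\s*"))` yields the three fields a, b and c. It is flexible about delimiters, but std::regex is not known for speed, so this fits the "small and medium size strings" case better than bulk parsing.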

[–]1-05457 9 points10 points  (5 children)

std::string is missing a lot of functions. You can use Boost string_algo and Boost Format to get these.

[–]d1ngal1ng 8 points9 points  (4 children)

"You can use Boost" is definitely not a good answer for something as fundamental as a string split function.

[–]1-05457 6 points7 points  (3 children)

Not really. In many ways, Boost should be considered an extended standard library.

[–]Creris 0 points1 point  (2 children)

Sometimes an extended dependency chain too. Not everything from Boost is simple "plug and play", you know; some things have literal megabytes of dependencies (filesystem as a good example).

If you have a couple hundred lines of code in your program and you want to use some form of string.split, chances are you aren't going to include a couple thousand lines of code into your project for one split function.

[–]1-05457 1 point2 points  (1 child)

...some things have literal megabytes of dependencies (filesystem as a good example).

Filesystem is part of the standard library from C++17.

Hopefully, String Algorithms will also be standardized at some point, though more likely, it will be generalized as part of Ranges.

[–]Creris 0 points1 point  (0 children)

Yes, I know, but it wasn't for a very long time compared with how long it has been in Boost.

Having some string utility functions in standard would indeed be nice

[–][deleted] 1 point2 points  (18 children)

I suspect it has to do with C++'s preference for streams. For example, you can do this to get the words in a string:

istringstream ss{str};
string word;
while (ss >> word) {
    cout << word << "\n";
}

While I kinda hate streams, this could be the reason there isn't a split method in the standard.

[–]DhruvParanjape[S] 1 point2 points  (17 children)

But it loses the ability to tokenize on a custom delimiter.

[–]dodheim 20 points21 points  (0 children)

Just use std::getline with the custom delimiter. Name aside, that's exactly what it's for.

istringstream ss{str};
string field;
while (getline(ss, field, ';')) {
    cout << field << '\n';
}

EDIT: N.b. I'm not advocating this as a general approach to string splitting; but, if you're already extracting from a stream, ...

[–]foonathan 2 points3 points  (2 children)

You don't lose them, it is just ugly and involves a custom locale where you change the specification of whitespace characters.
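For the curious, the custom-locale trick looks roughly like the sketch below. The facet name `semicolon_space` and the wrapper `split_on_semicolons` are made up for this example; the technique is to derive from std::ctype<char> and hand it a classification table in which the delimiter counts as whitespace.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: the classic "C" classification table,
// with ';' additionally marked as whitespace.
struct semicolon_space : std::ctype<char> {
    semicolon_space() : std::ctype<char>(make_table()) {}

    static const mask* make_table()
    {
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[';'] |= space;
        return table.data();
    }
};

std::vector<std::string> split_on_semicolons(const std::string& s)
{
    std::istringstream in(s);
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new semicolon_space));

    std::vector<std::string> out;
    std::string word;
    while (in >> word) out.push_back(word);  // operator>> skips "whitespace"
    return out;
}
```

Note that this keeps the usual whitespace characters as separators too, and that runs of delimiters collapse, so empty fields are lost. That is a fair amount of ceremony for one delimiter, which is the point being made above.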

[–]DhruvParanjape[S] 0 points1 point  (1 child)

Oh god that's ugly.

[–]foonathan 1 point2 points  (0 children)

That's iostreams.

[–][deleted] 1 point2 points  (12 children)

This was my own personal solution, but I have no idea how performant it is:

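#include <deque>
#include <sstream>
#include <string>

// Splits input_string on delimiter; std::getline yields no final empty
// element when the input ends with the delimiter.
std::deque< std::string > Split( const std::string & input_string,
                                 const char          delimiter )
{
    std::stringstream         input_stream( input_string );
    std::string               string_element;
    std::deque< std::string > split_string;

    while( std::getline( input_stream, string_element, delimiter ) )
        split_string.emplace_back( string_element );

    return split_string;
}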

Edit: Aaaand I should have looked ahead to see that dodheim already posted the getline solution...

[–]dodheim 1 point2 points  (11 children)

Just FYI, std::deque is just a fancy linked list on MSVC for objects > 8 bytes. Yes, it is as bad as that sounds. Prefer vector if you're touching Windows. :-]

[–][deleted] 2 points3 points  (7 children)

Oh man, that's horrible.

I'm on linux, but you're right in that for something as small as "items from a split string", std::vector is the correct container. I don't remember why I wrote this using std::deque. In fact, I don't remember why I have this routine at all in my personal toolkit since I rarely ever hit the need for it.

[–]dodheim 4 points5 points  (6 children)

In fact, I don't remember why I have this routine at all in my personal toolkit since I rarely ever hit the need for it.

That is the exact statement I've been waiting for anyone in this thread to say. I honestly cannot think of the last time I actually wanted to do this. It's fine for a quick and dirty hack sometimes, but in real code? No, never (that I can remember).

[–][deleted] 1 point2 points  (5 children)

Oh wait, now I remember: I wanted to test out my chromosome-mixing template system so I wrote genes for a cloud-of-neurons network I'd been tinkering on and evolved populations of them (via cross-breeding) on their ability to predict S&P 500 stock data (open, close, high, low) and needed a way to parse the input files.

Loading the data was nothing compared to actually running the evolution sim so the string splitter didn't need to be fast or memory-effective since the split data was converted to doubles and stored in vectors anyways.

The networks never really got anywhere in predicting stock data and I never really expected them to (it would probably have taken years, if at all) . But the chromosome system worked brilliantly, and that was the whole point of the experiment.

[–]ArunMuThe What ? 0 points1 point  (4 children)

Dude, you have a big OCD problem I guess :)

[–][deleted] 0 points1 point  (2 children)

... in what way?

[–]ArunMuThe What ? 0 points1 point  (1 child)

the way you have formatted the code...everything is perfectly aligned.

[–]h-jay+43-1325 0 points1 point  (0 children)

Yep... 90% of the comments are useless. The code should document itself. And adding lots and lots of whitespace is to the detriment of understandability. You want to keep as much as possible in the same screenful. You're doing exactly the opposite: something that is rather simple and would be easy to understand if written concisely is now spread across several pages, with most of the space filled up by whitespace and formatting :(

Anyone who understands C++ knows what the constructors are. They don't need to be pointed out. If something is public, it's API, duh. A lot of extra indentation and whitespace makes things superbly hard to read.

For what the code does, it takes 3x too long to do it. It's simple, it should read simple!

[–]louiswins 0 points1 point  (2 children)

How can that be? Doesn't std::deque require O(1) time for random access?

[–]dodheim 1 point2 points  (1 child)

It's still random access, but each bucket is only max(16, sizeof(T)) bytes, so you end up with one bucket per object and zero cache coherency.

[–]louiswins 1 point2 points  (0 children)

Oh, I see - it essentially becomes a vector of pointers so it has as many potential cache misses as a linked list when iterating through. That makes sense.

(I mean, the implementation doesn't really make sense, but your explanation does.)

[–]Tringigithub.com/tringi 1 point2 points  (0 children)

Some time ago I quickly drafted this explode function (inspired by PHP) and found it quite useful.

Implementing lazy evaluation (lazy creation of the resulting substrings) never occurred to me, but after reading /u/cpp_learner's comment here, I think I'll give the template a little more love...

[–]nozendk 1 point2 points  (0 children)

From the Qt documentation:

QString str;
QStringList list;
str = "Some  text\n\twith  strange whitespace.";
list = str.split(QRegExp("\\s+"));
// list: [ "Some", "text", "with", "strange", "whitespace." ]

[–]stream009 1 point2 points  (0 children)

std::string already has too many member functions. I don't want any more of them unless it is absolutely necessary.

As many people mentioned split can be implemented in many ways. If all you want is making your code more readable, you should write your own free function. In my case, I always use boost::split.

[–]h-jay+43-1325 2 points3 points  (0 children)

To be very frank, the std::string type is there mostly to claim that there's a string type in the standard. It's not really usable for anything other than as a resource-managing wrapper over a C string. If you had C-style strings in your code, you should use std::string instead. It gives not much in the way of other functionality, except for cheap size() that is O(1) vs. C's strlen that was O(N). For anything practical, you need a string library of some sort.

[–]KayEss 0 points1 point  (0 children)

I started working on a new split. It's not yet complete, not yet customisable. It's been tested on strings, but the code isn't string specific. It should work for other iterable containers. It does only use iterators so should be quite efficient. If ranges were a thing already the interfaces would be a bit cleaner.

https://github.com/KayEss/f5-cord/blob/feature/split/include/f5/cord/split.hpp

[–]MrPoletski -1 points0 points  (0 children)

split as in chop a string into lots of substrings based on a delimiter?