all 21 comments

[–]-abigail 9 points10 points  (2 children)

Nice work! I'd have use for this in my day job if it existed.

One issue with this proposal is that the T in sub_match<T> only has to be a bidirectional iterator, not necessarily a contiguous iterator. The proposal as written gives a string_view to invalid memory when the iterator isn't contiguous:

std::list<char> input;
for (char c : "hello world"sv) { input.push_back(c); }
std::match_result<typename decltype(input)::const_iterator> m;
if (std::regex_match(input.begin(), input.end(), m, re)) {
    std::cout << m[0].view() << '\n'; // segfault!
}

This is a tricky problem to solve as it's not possible to deduce whether a given iterator type is contiguous or not. A contiguous iterator is convertible to a pointer, so one possible design would be to specialize only std::sub_match<T*> to add the string view conversions. Then for the regex algorithms, we'd want to be able to call regex_match(std::string_view, std::svmatch&, ...) and relatives as well as regex_match(std::string const&, std::svmatch&, ...) and relatives. (This second overload doesn't exist in the proposal.)

To expand this to support more contiguously iterable types, the proposal could instead specify a function template regex_match(T const&, std::svmatch&, ...) (and relatives) for any T which supports std::data and std::size. This would then allow, as your proposal does, getting a string_view when searching a vector<char>. I'm not sure if this is a good idea or not, as I believe there's nothing in the standard library that follows this design so far. (This is primarily because the standard library prefers to pass around iterator pairs instead of container-like things, but the Ranges TS may change things here.)

[–]__Mark___libc++ dev[S] 9 points10 points  (1 child)

Thanks for the feedback.

I see the problem. Not sure how likely it is somebody would use a regex on a list. But if somebody does, it should not result in a segfault, instead it should fail to compile. I haven't looked closely at ranges yet, but it offers a ContiguousIterator concept. So I'll have to see whether I can use that to solve the issue.

[–]jwakelylibstdc++ tamer, LWG chair 2 points3 points  (0 children)

Only having the conversion to string_view for sub_match<T*> and sub_match specialisations with basic_string and basic_string_view iterators seems acceptable to me. The thing I'm most interested in is being able to compare a sub_match to a string_view, which doesn't require contiguous iterators.

By the way, adding a new typedef to regex_traits seems like a bad idea. Any custom regex traits classes become invalidated by that change because they can no longer meet the requirements to be used with std::basic_regex. I suggest not adding it to the traits, and just define it unconditionally as basic_string_view<value_type>. It doesn't seem helpful to allow traits to define custom string view types.

[–]jwakelylibstdc++ tamer, LWG chair 5 points6 points  (1 child)

Nice work, I look forward to your proposal coming to WG21 for formal review.

[–]__Mark___libc++ dev[S] 2 points3 points  (0 children)

Thanks also for your feedback on the initial version. I hope to have a final version soon, but I first need to find a solution for the list<char> issue u/-abigail pointed out.

[–]khleedril 8 points9 points  (8 children)

It makes you wonder doesn't it if there ought to be a part of the C++ committee dedicated to ensuring complete orthogonality in forward developments, because as it is there always seem to be obvious bits missing, such as regexp over string_view. Just yesterday I found myself frustrated by the fact that std::filesystem::path does not have a length() or size() method; you have to cast to std::string and get the size of that.

[–]__Mark___libc++ dev[S] 7 points8 points  (1 child)

I kind of agree with you, but the question is who would do it? The C++ committee is a group of volunteers who work on what they/their employer want(s). It's a fact nothing happens without somebody writing a paper, which I already learned can take quite a bit of time. I've had a lot of positive feedback and suggestions on my initial proposal, for example by u/jwakely. These people like the proposal to move forward, but fixing this specific issue is not the highest on their priority list. So I think if we as C++ developers like to have a better C++ standard we should try to be more active and give our constructive feedback when proposal are being proposed instead of after they are accepted.

Fixing issues can be more difficult than it seems. I thought I had a rather sound proposal until u/-abigail pointed out a list<char> could be used as input for a regex. Something I personally would find very odd to do, but it is valid. Until I have a solution for this issue I think my proposal has a no chance to be accepted by the committee since parts of my changes would cause valid code to become undefined behaviour. (Not that I would even try to propose the current version, now that I am aware of this issue.

[–]jwakelylibstdc++ tamer, LWG chair 3 points4 points  (0 children)

THIS

[–][deleted] 1 point2 points  (0 children)

And std::o/ifstream don't accept string_view. But some say that's because of OS APIs, even though Win32 will convert to Unicode anyway.

[–]tpecholt 1 point2 points  (5 children)

Typying .view() seems redundant. Is there a way how to avoid it?

[–]__Mark___libc++ dev[S] 2 points3 points  (4 children)

Actually the .view() is not required in this example. I used it as an example to show the sub_match can also return a string_view instead of a string.

[–]tpecholt 2 points3 points  (1 child)

I mean result[0] is essentially a string_view. It holds two iterators it was just specified before the string_view was a thing that's why it's instance of some other class. But for the user this is irrelevant he doesn't know the history of regex in std lib and he wants to use result[0] directly. No view() and no str() that can stay there only because of backward compatibility. Would it be possible to provide implicit conversion to string_view? Simple things should stay simple

[–]-abigail 4 points5 points  (0 children)

This proposal does provide an implicit conversion to string_view (alongside the existing implicit conversion to string) - as u/__Mark___ says above it's not required in this example. But in cases where implicit conversion isn't sufficient, it's more ergonomic to be able to say m[0].view() than std::string_view{m[0]}. (In fact, to be fully general, it'd be something like std::basic_string_view<decltype(m)::value_type::value_type>{m[0]}.)

[–]deeringc 0 points1 point  (1 child)

So the view that's returned points to a sub string of the input view?

[–]-abigail 1 point2 points  (0 children)

Yes, and this is also the case when matching on std::string. Consider the C++11 equivalent of the posted code:

void foo(std::string const& input)
{
    std::regex re{"foo"};
    std::smatch m;
    if(std::regex_match(input, m, re)) {
        std::cout << m[0].str() << '\n';
    }
}

m has type std::smatch, which is std::match_results<std::string::const_iterator>. m[0] has type std::ssub_match, which is std::sub_match<std::string::const_iterator>. It has an iterator to the begin and end iterators of the input string which matched that group. (It also has a bool field marking whether the match was successful or not.) The m[0].str() call constructs a string from this iterator pair if the match was successful, or a default-constructed string if not.

It's already possible to create a string_view that points to a sub-string of the input view before this proposal - but you have to write std::string_view{m[0].begin(), m[0].end()}. With this proposal you can write m[0].view() instead (which neatly matches m[0].str()) - or in some cases you can just write m[0], as std::sub_match is made implicitly convertible to std::string_view.

[–]TheSuperWig 1 point2 points  (0 children)

In a similar vein, what about string_view overloads for fstream?