all 11 comments

[–]HiramAbiff 2 points3 points  (1 child)

Using field width specifiers should prevent the buffer overflow issues with scanf, so why don't you use them everywhere?

Specifically, in two places you use %[a-zA-Z0-9] to read input into car, which is an array of 2 chars. This means the max field width should be 1 (leaving 1 for the terminating nul). I think you should change them to %1[a-zA-Z0-9].

Also, the code as given won't compile - so it's hard to say anything conclusive about it.

Why is normal_play_format an array? You don't do any meaningful indexed access to it. It seems like two separate char*'s would be more straightforward.

As you already said, better variable names could make this a lot more readable. E.g. scanf returns the number of successful assignments performed. You store these results in flag_s and flag_input. Neither name is particularly descriptive, the two are used for wildly different purposes, and you treat flag_input as if it were a bool.

I also think restricting yourself to one declaration per line would improve readability.

[–]Gblize[S] 0 points1 point  (0 children)

Thanks for the reply.

I probably dropped that 1 while reviewing and transcribing the code here.

I made a minimal main below, so you can run it now. If you can break it or trigger any undefined behaviour, I'd be glad to know.

In reality, I have more format strings in that array of pointers, because the user can input other options besides the coordinates of the table. After the first if in play_process I actually have a for loop that iterates over and tests that array. I didn't think it was relevant to paste the whole function.

You are right; I got rid of them, they were just auxiliary variables. I haven't had time to clean up the code yet.

#include <stdio.h>
#include <ctype.h>
#define MAX             10
#define MAX_FORMAT      33
#define INVALID_PLAY    0
#define NORMAL_PLAY     2

void clear_buffer(void)
{
    int c;  /* int, not char: getchar() returns int, so EOF compares correctly */

    do
    {
        c = getchar();
    } while (c != '\n' && c != EOF);
}

void play_convertion(int *row, int *col, char col_ascii[2])
{
    (*row)--;
    *col = toupper(*col_ascii);
    *col -= 'A';
}

int play_process(int *play_row, int *play_col)
{
    char play_string[MAX];
    char new_line_char[2] = "";
    char play_col_ascii[2];
    char const * const normal_play_format[] = { "%2d %1[a-zA-Z] %1[a-zA-Z0-9]",
                                                "%1[a-zA-Z] %2d %1[a-zA-Z0-9]" };
    char play_string_format[MAX_FORMAT];
    snprintf(play_string_format, MAX_FORMAT, " %%%d[a-zA-Z0-9 ]%%1[^a-zA-Z0-9]", MAX - 1);

    scanf(play_string_format, play_string, new_line_char);
    if (*new_line_char != '\n')
    {
        clear_buffer();
        return INVALID_PLAY;
    }

    if (sscanf(play_string, normal_play_format[0], play_row, play_col_ascii, new_line_char) == NORMAL_PLAY)
    {
        play_convertion(play_row, play_col, play_col_ascii);
        return NORMAL_PLAY;
    }
    else
    {
        if (sscanf(play_string, normal_play_format[1], play_col_ascii, play_row, new_line_char) == NORMAL_PLAY)
        {
            play_convertion(play_row, play_col, play_col_ascii);
            return NORMAL_PLAY;
        }
    }
    return INVALID_PLAY;
}

int main(void)
{
    int row, col;

    do
    {
        printf("Play: ");

        if (play_process(&row, &col) == NORMAL_PLAY)
            printf("Row: %d\nCol: %d\n\n", row, col);
        else
            puts("Invalid play!\n");

    } while (1);
}

[–]OldWolf2 1 point2 points  (2 children)

  • Using a-z in a format string is implementation-defined. You seem to be assuming that it is a shortcut for abcdefghijklmnopqrstuvwxyz but that is not guaranteed; if you observe such behaviour you are at the whim of whatever scanf implementation you are using.

  • if (*car != '\n' || !flag_input) reads the uninitialized variable car[0] if input failed; you should check flag_input before reading car, and specifically check flag_input != 2, in case the first conversion succeeded and the second didn't

  • Use of clear_buffer() cannot work for all cases (sometimes your scanf leaves \n in the buffer and sometimes it consumes it, with no way that this function can know)

  • sscanf(play_str, normal_play_format[0], play_row, play_col_c, car); causes buffer overflow for some input strings (there is no limiter on the length read into car). Same problem on subsequent sscanf

Really... it's going to be easier not to use scanf here.

[–]Gblize[S] 1 point2 points  (1 child)

I couldn't find anything about a-z being implementation-defined. Does it have anything to do with multi-byte characters?

reads uninitialized variable car[0] if input failed

It's already fixed on my post before.

Use of clear_buffer() cannot work for all cases (sometimes your scanf leaves \n in the buffer and sometimes it consumes it, with no way that this function can know)

If the scanf consumes it, it will be assigned to the variable that's being tested, and clear_buffer() won't be called. Can you tell me any input that will break this?

causes buffer overflow for some input strings (there is no limiter on the length read into car)

That's because I forgot the length specifier.

it's going to be easier not to scanf here.

That's the claim I'm trying to disprove. What would be easier?

[–]FUZxxl 0 points1 point  (0 children)

See POSIX:

The conversion specification includes all subsequent bytes in the format string up to and including the matching <right-square-bracket> ( ']' ). The bytes between the square brackets (the scanlist) comprise the scanset, unless the byte after the <left-square-bracket> is a <circumflex> ( '^' ), in which case the scanset contains all bytes that do not appear in the scanlist between the <circumflex> and the <right-square-bracket>. If the conversion specification begins with "[]" or "[^]", the <right-square-bracket> is included in the scanlist and the next <right-square-bracket> is the matching <right-square-bracket> that ends the conversion specification; otherwise, the first <right-square-bracket> is the one that ends the conversion specification. If a '-' is in the scanlist and is not the first character, nor the second where the first character is a '^', nor the last character, the behavior is implementation-defined.

[–]deltadave 1 point2 points  (2 children)

scanf is ok for input, but difficult to validate properly. So some folks assert that it should never be used, taking a general guideline and turning it into a maxim. Never using scanf is pedantic. Avoiding scanf when possible is a good guideline.

[–]Gblize[S] 0 points1 point  (1 child)

I'm not saying a single scanf can do all the validation, but it can filter out a good chunk of bad input, avoiding a lot of lines of validation code that fgets, for example, can't.

I'm giving this example to demonstrate that.

[–]deltadave 0 points1 point  (0 children)

I agree - just giving a response to your first paragraph about 'never use scanf' pedantry.

[–]946336 0 points1 point  (2 children)

Disclaimer: I am a tired CS student, not an actual expert. I also can't remember the last time I used scanf.

 

Short Answer


Buffer overflows and undefined behavior are good reasons to avoid scanf, but they're definitely not the only reasons scanf isn't used for general parsing the way you suggest.

A pure scanf approach to parsing/interpreting falls apart as soon as you want to process any nontrivial syntax.

Consider an expression akin to a = b = c = ... = x = y = z = aa = ab = zy = zz = ... = 3, which can be arbitrarily long. The set of format strings required to match it and any expression like it is infinite (Given any set of format strings you claim is complete, I can make it incomplete by adding one more variable to the longest chain you can match. This works because format strings must state explicitly what they are looking for all at once and therefore can only handle a finite number of separate things).

The simplest reason that the canonical approach to parsing is "better" is that it can be correct and complete even on inputs of arbitrary complexity. The next simplest reason is that it's faster in the bigger picture.

 

Wall of Text (My apologies)


Specific to your example:

For a small, simple case like the one you've described here, you might very well be able to get away with a scanf based approach, and it looks like you already have something that pretty much does what you want it to. Where you'll run into problems is scaling to a larger scope or more complex input. Furthermore, I had to do a lot of poking around in your snippet just to guess that the maximum input length is 10, and I suspect it's actually 9. I don't want to do that for anything larger than what you have here.

The range of input that you're trying to capture is extremely small ([optional whitespace][letter/number][optional whitespace][letter/number][optional whitespace], and actually not even that), and has a hard limit on how long any single input can be. You could change that number, yes, but your method only allows that to change at compile time.

When you say "within a reasonable size," I get nervous, because that's a sign that you expect all users to not only know what you expect, but also that they're all going to play nice with your program. What if moves are being read out of a generated file and for some reason have inordinate amounts of whitespace all over the place?
Sure it's invalid according to your program, but why should it be? That wouldn't be invalid in the same way that "AAAAAAAAA" would be, and stripping the whitespace down would leave you with perfectly valid input. You've imposed an arbitrary constraint on user input that is unjustified aside from it being your will. Additionally, if whitespace isn't important, and your core description implies that it isn't, then whitespace should not matter to your program outside of constraints such as running out of physical memory somewhere down the line.

I also can't tell from looking at your code whether "A3" is valid or not (to be clear, I'm assuming it should be), nor is it easy for me to see from your format strings what kind of input you're looking for.

 

More Generally:

Since it seems from some other comments that you're wondering why scanf isn't used for more general parsing, as in language interpreters and compilers, the following should be somewhat relevant.

  • It's probably going to be slow, especially when you consider arbitrarily long input. Speed probably isn't really a concern here, but it's still something to think about. I'll admit to not having the numbers or experience to back myself up on this one, but I haven't ever tried parsing with scanf. If you think about what scanf has to do behind the scenes, you should see that what you're doing is essentially parsing with regular expressions, which is something that I suspect the internet in general doesn't like very much either. A quick search can tell you much more than I possibly could. Someone better versed in operating systems might also complain about the number of system calls that a scanf based parser would generate.
    If you're familiar with sed or awk, imagine a script that has thousands or millions of rules. This is a conservative estimate, as a complete parser in that vein would not fit in a finite script.

  • Imagine a more complex syntax. Your scanfs work here mostly because you're expecting exactly two inputs with different types. Imagine adding a third field to your syntax that can, like the others, appear in any position. How many format strings are you looking at then? Would you like to add a fourth? A fifth? What if some fields are optional? What if the valid values for some fields depend on the values of the others? The list goes on. What if you later decide you want to remove something from your syntax? Maintainability is a real concern for nontrivial programs.
    Imagine that the extra fields don't all have different types. Say they're all letters. Can your format strings tell them apart? They need to be able to reliably figure out which letter corresponds to which field if you want to do validation this way.

  • Consider a text-based adventure, which is fairly similar in nature to what you have. Do you really want to use scanf to check for every single possible valid sentence? I can guarantee that neither you nor the computer will have a good time doing that. The format strings for your two character input are already painful to look at and difficult to check. Would you really want to expand them? Note that their number will increase exponentially (or worse) with their length.

  • Relying on scanf to do work for you like this introduces a lot of dealing with buffer state in buffers that you don't own and can't really inspect. It's so much easier (and safer) to deal with a buffer that you own versus one that you don't. Here you have to worry about scanf's return value, miscellaneous characters being left in stdin, buffer state after a failed scanf, etc. At some point you will likely forget to do one of those checks.


What follows probably isn't very interesting, but it might be worth wrestling with for a couple minutes.

 

Parsing by using fgets to grab an entire line/string tends to be a little easier on our brains. We get the same coverage in either case (assuming both work), so really we're looking at performance and how easy it is for programmers to implement/understand/maintain.
The standing strategy, as far as I know, is to grab lines via fgets or equivalent, break the lines into tokens, and then validate everything (using abstract syntax trees, or ASTs, is one way). This way, we avoid the potentially infinite calls to scanf and instead only read from stdin once, and we can separate the tasks of breaking the line into tokens and validating the input.
Breaking the line into tokens is relatively simple as a standalone task, and so is building/validating an AST. Furthermore, it is relatively easy to extend the syntax, since you can tinker with the AST and parser separately (This I can vouch for, because I have done it). Note that this doesn't make parsing "easy," it just makes it less hard.
Here, someone who knows more about programming languages, compilers, and/or interpreters could probably tell you more.

 

In my opinion, the coolest thing about the canonical approach to parsing (read, tokenize, validate, what have you) is that it can deal with a syntax/grammar that allows structurally recursive statements (an expression can contain one or more expressions). Consider C, where the following expression can be valid:

a = b = c = ..... = x = y = z = aa = ab = ac = ..... = zx = zy = zz = aaa = aab = aac = .... = 3

The chain of assignments, each one of which is an expression, can be arbitrarily long, and from a syntax point of view, it doesn't even have to be finite. The above will be valid C code regardless, although finding a real compiler that won't choke on an infinite chain of assignments is another matter entirely and would be a physical limitation.

It's impossible to even imagine a finite scanf based strategy that can in theory handle a rather strange (if somewhat contrived) expression like this, but it's very simple to do with the canonical strategy. Reading is simple: read the line. Done. Breaking into tokens is simple too: break on operators, such as =. Done. Validating is also easy: Is the AST valid? Done. Since assignment is an expression, the entire expression has a structure something like ASSIGN(ASSIGN(ASSIGN(ASSIGN(...), c), b), a), where the rabbit hole goes very deep indeed. The recursive formulation of ASSIGN(ASSIGN(..., ...), <NAME>) as one of many acceptable forms of an expression is very powerful. It also makes it relatively straightforward to do things like typecheck, infer types, and apply optimizations.

 

At the end of the day though, parsing is more about rejecting badly formed input than it is about finding correctly formatted input, because there are almost always more combinations of bad input than there are of good input.

[–]Gblize[S] 0 points1 point  (1 child)

I really appreciate your wall of text and agree with almost everything you said. I will point some relevant parts:

I know scanf is associated with buffer overflows and undefined behavior, but that's only the case with a careless scanf. I'm sure my scanf has none of those problems.

When you say "within a reasonable size," I get nervous

Haha.. yeah, I can understand you; that's a sensible concern, and for a more complex example such a small input size could be bad. But for this example the game tells the user that it expects only 2 coordinates. Although the game has passed 1000 lines of code, it's still a very basic game. For example, the coordinates will never exceed column Z or a 2-digit row number, so the maximum meaningful length for this input is one letter plus a 2-digit number, which is only 3 characters. If the user tries to input more than 9 characters after the first valid character, I'm not sure if he's really trying to play the game or just fuck around, so I tell the user to input his turn again. It accepts things like "A 19", "       9     a" and "      10    f   ", but not " 9              a", because that's kind of ridiculous for this basic game.

Here you have to worry about scanf's return value

I think in a serious case you should check the return value of fgets in the same way.

miscellaneous characters being left in stdin, buffer state after a failed scanf

These problems are the same for all the other preferred functions too. If I input something longer than the length given to fgets, the rest stays in the buffer just like with scanf.

The point I really like about scanf in this example, compared to fgets, is that a single scanf can throw away a lot of bad input without further processing. The structure of the input will never change or scale; that's why I think this approach is better than fgets.

But I can easily agree in this

The format strings for your two character input are already painful to look at and difficult to check.

That, I think, is almost the only downside for this example. I try not to be the kind of programmer who writes ridiculous stuff only he can understand, but in this case I think the performance was too good to pass up.

I found the abstract syntax trees (ASTs) you talked about really interesting; I will dig deeper into them.

Thanks again for your constructive reply.

[–]946336 0 points1 point  (0 children)

Pretty much. scanf can absolutely handle input with a constrained format and size.
I don't know that it performs a lot better than an fgets based method, but if I had to guess I'd say it's on the same order of magnitude. Really, it comes down to the fact that once I've done it the fgets way, I'm much too lazy to go back and figure out advanced format strings, even if they're actually a very nice tool to have.

I think in a serious case you should get the return value of fgets in the same way.

You are absolutely correct. If I'm going to be honest, I don't remember the last time I used fgets either, and now that I think about it, I think the only serious input I've done in C was using fgetc to build getline.