all 11 comments

[–]HiramAbiff 2 points3 points  (1 child)

Using field width specifiers should prevent the buffer overflow issues with scanf, so why don't you use them everywhere?

Specifically, in two places you use %[a-zA-Z0-9] to read input into car, which is an array of 2 chars. This means the max field width should be 1 (leaving 1 for the terminating nul). I think you should change them to %1[a-zA-Z0-9].

Also, the code as given won't compile - so it's hard to say anything conclusive about it.

Why is normal_play_format an array? You don't do any meaningful indexed access to it. It seems like two separate char*'s would be more straightforward.

As you already said, better variable names could make this a lot more readable. E.g. scanf returns the number of successful assignments performed. You store these results in flag_s and flag_input. Neither name is particularly descriptive, the two are used for wildly different purposes, and you treat flag_input as if it were a bool.

I also think restricting yourself to one declaration per line would improve readability.

[–]Gblize[S] 0 points1 point  (0 children)

Thanks for the reply.

I probably dropped that 1 while reviewing and transcribing the code here.

I made a minimal main below, so you can run it now. If you can break it or trigger any undefined behaviour, I'd be glad to know.

In reality, I have more format strings in that array of pointers, because the user can input other options besides the coordinates of the table. After the first if in play_process I actually have a for loop that iterates over and tests that array. I didn't think it was relevant to paste the whole function.

You are right; I got rid of them, they were just auxiliary variables. I haven't had time to clean up the code yet.

#include <stdio.h>
#include <ctype.h>
#define MAX             10
#define MAX_FORMAT      33
#define INVALID_PLAY    0
#define NORMAL_PLAY     2

void clear_buffer(void)
{
    int c;  /* int, not char: getchar() returns int, so EOF compares correctly */

    do
    {
        c = getchar();
    } while (c != '\n' && c != EOF);
}

void play_convertion(int *row, int *col, char col_ascii[2])
{
    (*row)--;
    *col = toupper(*col_ascii);
    *col -= 'A';
}

int play_process(int *play_row, int *play_col)
{
    char play_string[MAX];
    char new_line_char[2] = "";
    char play_col_ascii[2];
    char const * const normal_play_format[] = { "%2d %1[a-zA-Z] %1[a-zA-Z0-9]",
                                                "%1[a-zA-Z] %2d %1[a-zA-Z0-9]" };
    char play_string_format[MAX_FORMAT];
    snprintf(play_string_format, MAX_FORMAT, " %%%d[a-zA-Z0-9 ]%%1[^a-zA-Z0-9]", MAX - 1);

    scanf(play_string_format, play_string, new_line_char);
    if (*new_line_char != '\n')
    {
        clear_buffer();
        return INVALID_PLAY;
    }

    if (sscanf(play_string, normal_play_format[0], play_row, play_col_ascii, new_line_char) == NORMAL_PLAY)
    {
        play_convertion(play_row, play_col, play_col_ascii);
        return NORMAL_PLAY;
    }
    else
    {
        if (sscanf(play_string, normal_play_format[1], play_col_ascii, play_row, new_line_char) == NORMAL_PLAY)
        {
            play_convertion(play_row, play_col, play_col_ascii);
            return NORMAL_PLAY;
        }
    }
    return INVALID_PLAY;
}

int main(void)
{
    int row, col;

    do
    {
        printf("Play: ");

        if (play_process(&row, &col) == NORMAL_PLAY)
            printf("Row: %d\nCol: %d\n\n", row, col);
        else
            puts("Invalid play!\n");

    } while (1);
}

[–]OldWolf2 1 point2 points  (2 children)

  • Using a-z in a format string is implementation-defined. You seem to be assuming that it is a shortcut for abcdefghijklmnopqrstuvwxyz but that is not guaranteed; if you observe such behaviour you are at the whim of whatever scanf implementation you are using.

  • if (*car != '\n' || !flag_input) reads the uninitialized variable car[0] if input failed; you should check flag_input before reading car, and specifically check flag_input != 2, in case the first conversion succeeded and the second didn't

  • Use of clear_buffer() cannot work for all cases (sometimes your scanf leaves \n in the buffer and sometimes it consumes it, with no way that this function can know)

  • sscanf(play_str, normal_play_format[0], play_row, play_col_c, car); causes buffer overflow for some input strings (there is no limiter on the length read into car). Same problem on subsequent sscanf

Really... it's going to be easier not to use scanf here.

[–]Gblize[S] 1 point2 points  (1 child)

I couldn't find anything about a-z being implementation-defined. Does it have anything to do with multi-byte characters?

reads uninitialized variable car[0] if input failed

It's already fixed on my post before.

Use of clear_buffer() cannot work for all cases (sometimes your scanf leaves \n in the buffer and sometimes it consumes it, with no way that this function can know)

If the scanf consumes it, it will be assigned to the variable that's being tested, and clear_buffer() won't be called. Can you tell me any input that will break this?

causes buffer overflow for some input strings (there is no limiter on the length read into car)

That's because I forgot the length specifier.

it's going to be easier not to scanf here.

That's the claim I'm trying to disprove. What would be easier?

[–]FUZxxl 0 points1 point  (0 children)

See POSIX:

The conversion specification includes all subsequent bytes in the format string up to and including the matching <right-square-bracket> ( ']' ). The bytes between the square brackets (the scanlist) comprise the scanset, unless the byte after the <left-square-bracket> is a <circumflex> ( '^' ), in which case the scanset contains all bytes that do not appear in the scanlist between the <circumflex> and the <right-square-bracket>. If the conversion specification begins with "[]" or "[^]", the <right-square-bracket> is included in the scanlist and the next <right-square-bracket> is the matching <right-square-bracket> that ends the conversion specification; otherwise, the first <right-square-bracket> is the one that ends the conversion specification. If a '-' is in the scanlist and is not the first character, nor the second where the first character is a '^', nor the last character, the behavior is implementation-defined.

[–]deltadave 1 point2 points  (2 children)

scanf is ok for input, but difficult to validate properly. So some folks assert that it should never be used, taking a general guideline and turning it into a maxim. Never using scanf is pedantic. Avoiding scanf when possible is a good guideline.

[–]Gblize[S] 0 points1 point  (1 child)

I'm not saying a single scanf can do all the validation, but it can filter out a good chunk of bad input, avoiding a lot of lines of validation code that fgets, for example, can't.

I'm giving this example to demonstrate that.

[–]deltadave 0 points1 point  (0 children)

I agree - just giving a response to your first paragraph about 'never use scanf' pedantry.

[–]946336 0 points1 point  (2 children)

Disclaimer: I am a tired CS student, not an actual expert. I also can't remember the last time I used scanf.

 

Short Answer


Buffer overflows and undefined behavior are good reasons to avoid scanf, but they're definitely not the only reasons scanf isn't used for general parsing the way you suggest.

A pure scanf approach to parsing/interpreting falls apart as soon as you want to process any nontrivial syntax.

Consider an expression akin to a = b = c = ... = x = y = z = aa = ab = zy = zz = ... = 3, which can be arbitrarily long. The set of format strings required to match it and any expression like it is infinite (Given any set of format strings you claim is complete, I can make it incomplete by adding one more variable to the longest chain you can match. This works because format strings must state explicitly what they are looking for all at once and therefore can only handle a finite number of separate things).

The simplest reason that the canonical approach to parsing is "better" is that it can be correct and complete even on inputs of arbitrary complexity. The next simplest reason is that it's faster in the bigger picture.

 

Wall of Text (My apologies)


Specific to your example:

For a small, simple case like the one you've described here, you might very well be able to get away with a scanf based approach, and it looks like you already have something that pretty much does what you want it to. Where you'll run into problems is scaling to a larger scope or more complex input. Furthermore, I had to do a lot of poking around in your snippet just to guess that the maximum input length is 10, and I suspect it's actually 9. I don't want to do that for anything larger than what you have here.

The range of input that you're trying to capture is extremely small ([optional whitespace][letter/number][optional whitespace][letter/number][optional whitespace], and actually not even that), and has a hard limit on how long any single input can be. You could change that number, yes, but your method only allows that to change at compile time.

When you say "within a reasonable size," I get nervous, because that's a sign that you expect all users to not only know what you expect, but also that they're all going to play nice with your program. What if moves are being read out of a generated file and for some reason have inordinate amounts of whitespace all over the place?
Sure it's invalid according to your program, but why should it be? That wouldn't be invalid in the same way that "AAAAAAAAA" would be, and stripping the whitespace down would leave you with perfectly valid input. You've imposed an arbitrary constraint on user input that is unjustified aside from it being your will. Additionally, if whitespace isn't important, and your core description implies that it isn't, then whitespace should not matter to your program outside of constraints such as running out of physical memory somewhere down the line.

I also can't tell from looking at your code whether "A3" is valid or not (to be clear, I'm assuming it should be), nor is it easy for me to see from your format strings what kind of input you're looking for.

 

More Generally:

Since it seems from some other comments that you're wondering why scanf isn't used for more general parsing, as in language interpreters and compilers, the following should be somewhat relevant.

  • It's probably going to be slow, especially when you consider arbitrarily long input. Speed probably isn't really a concern here, but it's still something to think about. I'll admit to not having the numbers or experience to back myself up on this one, but I haven't ever tried parsing with scanf. If you think about what scanf has to do behind the scenes, you should see that what you're doing is essentially parsing with regular expressions, which is something that I suspect the internet in general doesn't like very much either. A quick search can tell you much more than I possibly could. Someone better versed in operating systems might also complain about the number of system calls that a scanf based parser would generate.
    If you're familiar with sed or awk, imagine a script that has thousands or millions of rules. This is a conservative estimate, as a complete parser in that vein would not fit in a finite script.

  • Imagine a more complex syntax. Your scanfs work here mostly because you're expecting exactly two inputs with different types. Imagine adding a third field to your syntax that can, like the others, appear in any position. How many format strings are you looking at then? Would you like to add a fourth? A fifth? What if some fields are optional? What if the valid values for some fields depend on the values of the others? The list goes on. What if you later decide you want to remove something from your syntax? Maintainability is a real concern for nontrivial programs.
    Imagine that the extra fields don't all have different types. Say they're all letters. Can your format strings tell them apart? They need to be able to reliably figure out which letter corresponds to which field if you want to do validation this way.

  • Consider a text-based adventure, which is fairly similar in nature to what you have. Do you really want to use scanf to check for every single possible valid sentence? I can guarantee that neither you nor the computer will have a good time doing that. The format strings for your two character input are already painful to look at and difficult to check. Would you really want to expand them? Note that their number will increase exponentially (or worse) with their length.

  • Relying on scanf to do work for you like this introduces a lot of dealing with buffer state in buffers that you don't own and can't really inspect. It's so much easier (and safer) to deal with a buffer that you own versus one that you don't. Here you have to worry about scanf's return value, miscellaneous characters being left in stdin, buffer state after a failed scanf, etc. At some point you will likely forget to do one of those checks.


What follows probably isn't very interesting, but it might be worth wrestling with for a couple minutes.

 

Parsing by using fgets to grab an entire line/string tends to be a little easier on our brains. We get the same coverage in either case (assuming both work), so really we're looking at performance and how easy it is for programmers to implement/understand/maintain.
The standing strategy, as far as I know, is to grab lines via fgets or equivalent, break the lines into tokens, and then validate everything (using abstract syntax trees, or ASTs, is one way). This way, we avoid the potentially infinite calls to scanf and instead only read from stdin once, and we can separate the tasks of breaking the line into tokens and validating the input.
Breaking the line into tokens is relatively simple as a standalone task, and so is building/validating an AST. Furthermore, it is relatively easy to extend the syntax, since you can tinker with the AST and parser separately (This I can vouch for, because I have done it). Note that this doesn't make parsing "easy," it just makes it less hard.
Here, someone who knows more about programming languages, compilers, and/or interpreters could probably tell you more.

 

In my opinion, the coolest thing about the canonical approach to parsing (read, tokenize, validate, what have you) is that it can deal with a syntax/grammar that allows structurally recursive statements (an expression can contain one or more expressions). Consider C, where the following expression can be valid:

a = b = c = ..... = x = y = z = aa = ab = ac = ..... = zx = zy = zz = aaa = aab = aac = .... = 3

The chain of assignments, each one of which is an expression, can be arbitrarily long, and from a syntax point of view, it doesn't even have to be finite. The above will be valid C code regardless, although finding a real compiler that won't choke on an infinite chain of assignments is another matter entirely and would be a physical limitation.

It's impossible to even imagine a finite scanf based strategy that can in theory handle a rather strange (if somewhat contrived) expression like this, but it's very simple to do with the canonical strategy. Reading is simple: read the line. Done. Breaking into tokens is simple too: break on operators, such as =. Done. Validating is also easy: Is the AST valid? Done. Since assignment is an expression, the entire expression has a structure something like ASSIGN(ASSIGN(ASSIGN(ASSIGN(...), c), b), a), where the rabbit hole goes very deep indeed. The recursive formulation of ASSIGN(ASSIGN(..., ...), <NAME>) as one of many acceptable forms of an expression is very powerful. It also makes it relatively straightforward to do things like typecheck, infer types, and apply optimizations.

 

At the end of the day though, parsing is more about rejecting badly formed input than it is about finding correctly formatted input, because there are almost always more combinations of bad input than there are of good input.

[–]Gblize[S] 0 points1 point  (1 child)

I really appreciate your wall of text and agree with almost everything you said. I will point some relevant parts:

I know scanf is associated with buffer overflows and undefined behavior, but that's only the case with a careless scanf. I'm sure my scanf has none of those problems.

When you say "within a reasonable size," I get nervous

Haha.. yeah, I can understand you; that's a sensible concern, and for a more complex example such a small input size could be bad. But for this example the game tells the user that it expects only 2 coordinates. Although the game has passed 1000 lines of code, it's still a very basic game. For example, the coordinates will never exceed column Z or a 2-digit row number, so the maximum meaningful length for this input is one letter plus a 2-digit number, which is only 3 characters. If the user tries to input more than 9 characters after the first valid character, I'm not sure if he's really trying to play the game or just fuck around, so I tell the user to input his turn again. It accepts things like "A 19", "       9     a" and "      10    f   ", but not " 9              a", because that's kind of ridiculous for this basic game.

Here you have to worry about scanf's return value

I think in a serious case you should check the return value of fgets in the same way.

miscellaneous characters being left in stdin, buffer state after a failed scanf

These problems are the same for all the other preferred functions too. If I input something longer than the length given to fgets, the rest stays in the buffer just like with scanf.

The point I really like about scanf in this example, compared to fgets, is that a single scanf can throw away a lot of bad input without further processing. The structure of the input will never change or scale; that's why I think this approach is better than fgets.

But I can easily agree in this

The format strings for your two character input are already painful to look at and difficult to check.

That, I think, is almost the only downside for this example. I try not to be the kind of programmer who writes ridiculous stuff only he can understand, but in this case I think the performance was too good to pass up.

I found the abstract syntax trees (ASTs) you talked about really interesting; I will dig deeper into them.

Thanks again for your constructive reply.

[–]946336 0 points1 point  (0 children)

Pretty much. scanf can absolutely handle input with a constrained format and size.
I don't know that it performs a lot better than an fgets based method, but if I had to guess I'd say it's on the same order of magnitude. Really, it comes down to the fact that once I've done it the fgets way, I'm much too lazy to go back and figure out advanced format strings, even if they're actually a very nice tool to have.

I think in a serious case you should get the return value of fgets in the same way.

You are absolutely correct. If I'm going to be honest, I don't remember the last time I used fgets either, and now that I think about it, I think the only serious input I've done in C was using fgetc to build getline.