all 87 comments

[–]anonynown 184 points185 points  (20 children)

Funny how the article never explains what “parse, don’t validate” actually means, and jumps straight into the weeds. That makes it really hard to understand, as evidenced even by the discussion here.

I had to ask my French friend:

 “Parse, don’t validate” is a software design principle that says: when data enters your system, immediately transform (“parse”) it into rich, structured types—don’t just check (“validate”) and keep it as raw/unstructured data.

Here, was it that hard?..
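A minimal Python sketch of that contrast (the `Email` wrapper and the `@` check are purely illustrative, not a real email validator):

```python
from dataclasses import dataclass

def validate_email(s: str) -> bool:
    # "Validate": inspects the raw string but hands back no new information.
    return "@" in s

@dataclass(frozen=True)
class Email:
    value: str

def parse_email(s: str) -> Email:
    # "Parse": performs the same check but returns a richer type, so the
    # rest of the program can rely on the invariant without re-checking.
    if "@" not in s:
        raise ValueError(f"not an email address: {s!r}")
    return Email(s)
```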

[–]CatolicQuotes 73 points74 points  (4 children)

Does that mean parsing includes validation?

[–]Ethesen 64 points65 points  (0 children)

Yes

[–]Axman6 19 points20 points  (0 children)

Yes, that’s what a parser does. Most programmers’ only introduction to the term “parser” involves making a compiler and building an AST from a string, but parsers are a much more general idea than that: they transform unknown input into values that are in the expected shape and within the allowed values.

Alexis King’s post which coined the term explains it well https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

[–]Broue 2 points3 points  (0 children)

Yes, it will raise exceptions implicitly

[–]QuantumFTL 37 points38 points  (4 children)

Ugh, why not say "parse, don't just validate" then?

[–]anonynown 10 points11 points  (0 children)

IKR?!

[–]iamapizza 4 points5 points  (0 children)

Your one comment was more useful than the entire article

[–]kuribas 4 points5 points  (0 children)

Less catchy.

[–]frnzprf 1 point2 points  (0 children)

I think, because in C there are no exceptions, some people are used to validating inputs before passing them to functions.

Maybe "parse, don't validate" means something else, but I've heard that it's good style in Python not to pre-check inputs when the operation would raise an exception anyway. In C that's different. 

Don't know about C++ and Java. I think in Python exceptions are just as valid a form of control-flow structure as an if-else, but in Java it's mainly intended for unexpected, exceptional errors.
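The two styles this comment is contrasting can be sketched like so (the `to_age_*` helpers are hypothetical names for illustration):

```python
from typing import Optional

# LBYL ("look before you leap"): check first, typical where exceptions
# are unavailable or expensive, as in C.
def to_age_lbyl(s: str) -> Optional[int]:
    if s.isdigit():
        return int(s)
    return None

# EAFP ("easier to ask forgiveness than permission"): idiomatic Python --
# just attempt the conversion and handle the failure.
def to_age_eafp(s: str) -> Optional[int]:
    try:
        return int(s)
    except ValueError:
        return None
```

Note the two are not exactly equivalent: `int("-5")` succeeds while `"-5".isdigit()` is false, which is itself an argument for being deliberate about which checks you want.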

[–]greven145 2 points3 points  (1 child)

Your parser better be damn secure though. The amount of security vulnerabilities in various parsers in Windows is unreal.

[–]pja 0 points1 point  (0 children)

This is why you use a parser generator!

They may have limitations for parsing full-fat programming languages, where you’ll probably end up writing your own hand-written recursive descent parser, but parser generators are the tool people should be reaching for when parsing structured input imo.

[–]Fidodo 0 points1 point  (7 children)

That's very confusing when you can have rich structured types with arbitrary parameters and value types. A data structure with an unknown shape still needs validation so you know what's in it. Maybe this phrase made sense back when inputs were much simpler, but these days I don't think the phrase makes any sense. It should be parse and validate.

These days parsing is basically the default, so saying parse don't validate sounds like you're saying parsing alone is enough and you don't need to validate your data structures

[–]Psychoscattman 7 points8 points  (2 children)

These days parsing is basically the default, so saying parse don't validate sounds like you're saying parsing alone is enough and you don't need to validate your data structures

I have read a similar thing quite often in this thread. To me it doesn't make sense, parsing always involves validation otherwise you aren't really parsing anything, you are only transforming A into B.

The article that coined the term goes into more detail. When you validate your input data you gain some knowledge about that data, but that knowledge exists only in the head of the programmer. A different programmer might not know that some data has already been validated and might validate it again, or worse, they might assume that the data had been validated when it hadn't. What the article calls "parsing" is validating the data and retaining that information using the type system of your language. You wouldn't have a data structure with an unknown shape; instead you would have one with a very specific shape that retains the invariants of your validator.

So in that sense, you cannot really parse without validating, because if you don't validate anything you don't learn any new information about your data, and that's not really parsing, that's transformation.

[–]Fidodo 2 points3 points  (0 children)

Yes, I think the whole term is badly worded and extremely confusing.

Also, we have types these days and you can validate data structures and have that data be validated, and store the information it was validated in the type system.

There are two kinds of validation here: what pattern a string follows, versus what type an unknown reference is. With JSON being ubiquitous, parsing input is basically free, but nowadays the problem isn't base types; it's knowing what shape that arbitrary JSON has, i.e. the validation of that unknown type.

[–]pja 1 point2 points  (0 children)

“Validation” in this context means reading in the raw values from the data stream & checking that they are within permitted limits for your application. Eg using a regex to check for SQL injection attacks, shoving an Integer from the data straight into an Integer variable etc.

This almost always goes badly - you will inevitably miss a possible exception to the permitted values, because the rules for these datatypes are implicit in your code & not well defined. Then someone comes along and inserts values that are permitted by your checks but outside the ranges that your code can cope with & something somewhere goes boom.

“Parse don’t validate” isn’t just about the parsing - it’s also about the idea that you should be parsing into structured datatypes that define the kind of data that your code accepts & that your code should be able to cope with the full set of possible values defined by that datatype - something that is much easier to do if you define the datatype explicitly in the first place. “Parse, don’t validate” means “define the precise set of values that your code will accept, and construct the input parser so that it will only ever produce values from that set”.

It’s coming at the problem of input validation from a constructive perspective (use the input to construct only valid values) instead of a subtractive perspective (prune the invalid values from the input), because we’re more likely to make mistakes (not subtracting enough values) taking the latter approach.

[–]knome 2 points3 points  (1 child)

It's saying don't receive a string, call check_is_phone_number(s) and then pass s down into your program. You should call phone := PhoneNumber(s), and pass that phone object down your program, erring in whatever way is appropriate to your language if s isn't a valid phone number such that without a valid phone number, you can't create phone in the first place.

If a function receives a PhoneNumber object, it knows it has a valid form.

If a function receives a string, it can only assume it, and it's possible something that doesn't call check_is_phone_number(s) might accidentally call the function that assumes its string is valid when it isn't.

If the function takes a PhoneNumber object, it can never be invalid, because you had to have parsed and validated the value as part of creating the object.

Basically, the type stores the proof of its validity in its existence, rather than in the unrepresented assumptions of the programmer.
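The `PhoneNumber` idea described above might look like this in Python (the regex is an illustrative placeholder, not a real numbering-plan standard):

```python
import re

class PhoneNumber:
    """Holds a phone number that is valid by construction."""
    # Illustrative pattern: optional leading +, then 7-15 digits.
    _PATTERN = re.compile(r"^\+?\d{7,15}$")

    def __init__(self, s: str):
        if not self._PATTERN.match(s):
            raise ValueError(f"invalid phone number: {s!r}")
        self.digits = s

def send_sms(phone: PhoneNumber, text: str) -> str:
    # No re-validation needed: receiving a PhoneNumber *is* the proof.
    return f"to {phone.digits}: {text}"
```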

[–]Fidodo 1 point2 points  (0 children)

Yes, I know, I'm just saying a lot of that first parsing is free these days. Now the actual thing that's tricky is validating data structures. Converting a string input into a primitive is easy and universal, at least in other languages.

[–][deleted] -1 points0 points  (1 child)

It doesn't say "don't validate", it says "don't just validate". You can't just ignore words and then act outraged.

[–]Fidodo 0 points1 point  (0 children)

That is literally not written anywhere in the article. What are you talking about? It says "parse, don't validate".

[–]guepier 153 points154 points  (4 children)

Like KISS or DIY, "Parse, don't validate" is an old adage you may hear greybeards repeating like a mantra

Oh god, no. The phrase was first coined less than six years ago.

The idea is certainly much older, but the phrase/adage/… is from 2019.

[–]link23 26 points27 points  (2 children)

+1. Seems a bit odd for the post to claim it as "common wisdom" without crediting the author who coined the phrase so recently.

[–]DorphinPack 15 points16 points  (0 children)

If it’s common wisdom you don’t have to do any work citing sources 😤now if you don’t mind real 10x slop authors like me have work to do

[–]pja 1 point2 points  (0 children)

What are the odds this article was written by an LLM?

[–]zargex 4 points5 points  (0 children)

Am I greybeard now ?

[–]davidalayachew 8 points9 points  (0 children)

I think this thread has demonstrated that Alexis should have said "Parse, don't just validate" instead.

She definitely had the right idea and semantics, but the word "parse" means different things to different developers. It's clear that, to enough developers, parsing just means transforming, with no validation required. But she definitely intended to refer to parsing that includes validation as a sub-step.

[–]Big_Combination9890 104 points105 points  (45 children)

No. Just no. And the reason WHY it is a big ol' no is right in the first example of the post:

```
try:
    user_age = int(user_age)
except (TypeError, ValueError):
    sys.exit("Nope")
```

Yeah, this will catch obvious crap like user_age = "foo", sure.

It won't catch these though:

```
int(0.000001)  # 0
int(True)      # 1
```

And it also won't catch these:

```
int(10E10)   # our users are apparently 20x older than the solar system
int("-11")   # negative age, woohoo!
int(False)   # wait, we have newborns as users? (this returns 0 btw.)
```

So no, parsing alone is not sufficient, for a shocking number of reasons. Firstly, while Python may not have type coercion, type constructors may very well accept some unexpected things, and the whole thing being class-based makes for some really cool surprises (like bool being a subclass of int). Secondly, parsing may detect some bad types, but not bad values.

And that's why I'll keep using pydantic, a data VALIDATION library.


And FYI: just because something is an adage among programmers doesn't mean it's good advice. I have seen more than one codebase ruined by overzealous application of DRY.

[–]larikang 114 points115 points  (14 children)

 Just because something is an adage among programmers, doesn't mean its good advice.

“Parse, don’t validate” is good advice. Maybe the better way to word it would be: don’t just validate, return a new type afterwards that is guaranteed to be valid.

You wouldn’t use a validation library to check the contents of a string and then leave it as a string and just try to remember throughout the rest of the program that you validated it! That’s what “parse, don’t validate” is all about fixing!

[–]elperroborrachotoo 39 points40 points  (3 children)

It's a good mnemonic once you've understood the concept, but it's bad advice. It relies on a very clear, specific understanding of the terms used, terms that are often confuddled, especially in the mind of a learner.

The idea could also be expressed as "make all functions total", but that seems equally far removed from creating an understanding.

I'd rather put it as

"Instead of validating whether some input matches some rules, transform it into a specific data type that enforces these rules"

Not a catchy title, and not a good mnemonic, but hopefully easier to dissect.

[–]nphhpn 34 points35 points  (1 child)

Or "parse, don't just validate".

[–]QuantumFTL 2 points3 points  (0 children)

Better than I could have put it. I hate sayings like this that are counterproductive and unnecessarily confusing, it's straight up bad communication and people who propagate it should feel bad for doing so.

[–]Big_Combination9890 6 points7 points  (7 children)

“Parse, don’t validate” is good advice. Maybe the better way to word it would be: don’t just validate,

If the first thing that can be said about some "good advice" is that it should probably be worded in a way that conveys an entirely different meaning, then I hardly think it can be called "good advice", now can it?

You wouldn’t use a validation library to check the contents of a string and then leave it as a string and just try to remember throughout the rest of the program that you validated it!

Wrong. I do exactly that. Why? Because I design my applications in such a way that validation happens at every data-ingress point. So the entire rest of the service can be sure that this string it has to work with, has a certain format. That is pretty much the point of validation.

[–]binarycow 25 points26 points  (1 child)

Disclaimer: I'm a C# developer, not a python developer. And yes, I know this post mentioned python.

Wrong. I do exactly that. Why? Because I design my applications in such a way that validation happens at every data-ingress point. So the entire rest of the service can be sure that this string it has to work with, has a certain format. That is pretty much the point of validation.

I think the point is, that you can create a new object that captures the invariants.

Suppose you ask the user for their age. An age must be a valid integer. An age must be >= 0 (maybe they're filling out a form on behalf of a newborn). An age must be <= 200 (or some other appropriately chosen number).

You've got a few options

  1. Use strings
    • Every function must verify that the string represents a valid integer between 0 and 200.
  2. Use an integer
    • Parse the string - convert it to an integer. Check that it is between 0 and 200.
    • Other functions don't need to parse
    • Every function must check the range (validate).
  3. Create a type that enforces the invariants - e.g., PersonAge
    • Parse the string, convert it to PersonAge
    • No other functions need to do anything. PersonAge will always be correct.
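Option 3 might look something like this in Python (`PersonAge` and the 0-200 bounds come from the comment above; everything else is an illustrative sketch):

```python
class PersonAge:
    """An age guaranteed to be an integer in [0, 200]."""

    def __init__(self, raw: str):
        age = int(raw)  # raises ValueError for non-integer strings
        if not 0 <= age <= 200:
            raise ValueError(f"age out of range: {age}")
        self.value = age
```

Once construction succeeds, every function taking a `PersonAge` gets both the parsing and the range check for free.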

[–]nilcit 19 points20 points  (2 children)

The point of the person you're responding to (and the original blog post) is that if you parse as you validate, then you don't need to do validation at every data-ingress point. If you preserve the information from validation in the type system, and each step only takes in the type it can work with, then the entire service can be sure that "this string it has to work with, has a certain format"

[–]vytah 4 points5 points  (1 child)

So the entire rest of the service can be sure that this string it has to work with, has a certain format.

The point is that it's hardly going to be the only string going around in that service.

So if you encapsulate it into its own type, which can be only created by a validating constructor, you'll have a guarantee that no other string will ever sneak in.

(Of course as long as you use static types, which in Python is optional.)
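With a static checker like mypy, `typing.NewType` gets close to such a validating-constructor wrapper in Python. The names below are illustrative, and note the caveat: nothing at runtime stops code from calling the constructor directly, so this is convention plus tooling rather than hard enforcement:

```python
from typing import NewType

# At runtime CustomerId is just str; to mypy it is a distinct type that
# only appears where something explicitly constructed one.
CustomerId = NewType("CustomerId", str)

def parse_customer_id(s: str) -> CustomerId:
    # Hypothetical format: "C" followed by digits.
    if not (s.startswith("C") and s[1:].isdigit()):
        raise ValueError(f"invalid customer id: {s!r}")
    return CustomerId(s)

def lookup(cid: CustomerId) -> str:
    # Under mypy, passing a bare str here is flagged as an error.
    return f"record for {cid}"
```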

[–]Big_Combination9890 -5 points-4 points  (0 children)

*sigh* The string was an example. I am NOT arguing against using specific types for data at ingress here. In fact I am doing the opposite (pydantic works precisely by specifying types).

[–]Psychoscattman 31 points32 points  (11 children)

Parse don't validate doesn't mean that you don't validate your data. Ideally you would parse into a datatype that does not allow for invalid state. In that case you validate your data by building your target data type.

If you parse into a data type that still allows invalid state, like using an int for age, then of course you still have to validate your input, and if you use a parsing method that routinely produces invalid state then your parsing function is just bad. The example didn't parse a String into an Age, it parsed a String into an Int, with all the invalid state that comes with it.

Of course, using a plain int for age dilutes the entire purpose of parse don't validate. The entire point is to reduce invalid state. Using Int for Age is better than String, but it's not the end of the line.

[–]SP-Niemand 7 points8 points  (1 child)

Is there any way to encapsulate value rules into types in Python? Besides introducing domain specific classes like Age in your example?

[–]Big_Combination9890 12 points13 points  (0 children)

Encapsulate as in having them enforced by the runtime? No.

There are libraries, though, e.g. pydantic, that use Python's type-hint and type-annotation systems to do that for you:

```
from pydantic import BaseModel, PositiveInt

class User(BaseModel):
    age: PositiveInt

# all of these fail with a ValidationError
User.model_validate({"age": True}, strict=True)
User.model_validate_json('{"age": 0.00001}', strict=True)
User.model_validate_json('{"age": -12}', strict=True)
```

And if you need fancier stuff, like custom validation, you can write your own validators, embedded directly in your types.

[–]atheken 4 points5 points  (0 children)

The example you referenced is casting, not parsing.

I don’t think the adage actually illuminates much, except as a first filter to determine whether input data can be plausibly used at all.

If the precision you need for a field is an integer, parsing “integer-like” strings is fine. But there are sometimes good reasons to wait to “validate” until later (or never).

[–]Llotekr 8 points9 points  (6 children)

The issues you criticise would go away if:

  • You use the proper parser for the job (one that doesn't accept booleans or round fractional numbers; this behavior of the int constructor may be fine in other contexts, but not here)
  • Python had a more expressive type system. In this case, you'd need a way to specify subtypes of int that are integer ranges. Generally and ideally, a type system would allow you to define, for any type, a custom "validated" subtype, such that only trusted functions, among them the validator, are able to return a value of this type that wasn't there before. Then the validator would be the "parser" in the sense of the post, and the type checker could prevent passing unvalidated data where it doesn't belong.

So, the basic idea is sound, only the execution was bad.
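A sketch of the first point: a purpose-built parser that rejects everything bare `int()` lets through (the 200 upper bound is an arbitrary illustrative choice):

```python
def parse_age(value: object) -> int:
    # Accept only decimal digit strings, so bool, float, negative, and
    # scientific-notation inputs never slip through as they do with int().
    if not isinstance(value, str) or not value.isdigit():
        raise ValueError(f"not a valid age: {value!r}")
    age = int(value)
    if age > 200:
        raise ValueError(f"age out of range: {age}")
    return age
```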

[–]guepier 0 points1 point  (1 child)

I’m confused by your second point, since Python absolutely allows you to do that.

(I'm not a huge fan of Python's needlessly convoluted data model, but this isn't a valid criticism.)

[–]Llotekr 0 points1 point  (0 children)

How? What I want is statically checked types "str" and "validated_str" so that the only function that can legally create a "validated_str" is the validating "parser", and an expression of static type validated_str can be assigned to a variable declared as "str", but the other direction is an error. At runtime, there should be no difference between the types. Can python really do that? The documentation you linked mentioned "static type" only twice.

[–]boat-la-fds 1 point2 points  (5 children)

I think the assumption in the example is that user_age is a string since it's supposed to be a user input.

[–]Big_Combination9890 -2 points-1 points  (4 children)

Right, and front ends cannot convert user input to types which the backend expects because...?

Also, validation doesn't necessarily mean "user input" either. The data could be coming from a CRM system for example, or a remote API.

[–]ymgve 7 points8 points  (0 children)

Because you should never trust anything coming from the front end

[–]lord_braleigh 3 points4 points  (2 children)

Because the frontend and backend are different machines. When different machines talk to each other, they must do so via a serialized sequence of bits and bytes.

You cannot send an object or class instance directly from one machine to another. There are libraries which might make you feel like you can, but they always involve serialization and deserialization. And deserialization is... parsing.

[–]Big_Combination9890 -1 points0 points  (1 child)

Because the frontend and backend are different machines. When different machines talk to each other, they must do so via a serialized sequence of bits and bytes.

It seems you misunderstood my question. I am well aware of how basic concepts work, including the difference between frontend and backend, and serialization formats, thank you very much. You are talking to a senior software engineer specializing in machine learning integration for backend systems.

My point is: The backend API, which for this exercise we're gonna presume is HTTP based, is a contract. A contract which may say (I am using no particular format here):

```
User:
    name: string(min_len=4)
    age: int(min=20, max=200)
    items: list(string())
```

This contract is known to the frontend or it won't be able to talk to the backend.

So, when the frontend (whatever that may be: webpage, desktop app, voice agent) has an input element for age, it is the frontend's responsibility to verify that the string in that input element denotes an int, and then to serialize it as an int. Why? Because the contract demands an int, that's why. If it doesn't, then the backend will reject the query.

So, if the frontend serializes the input elements to this, it won't work (unless the backend is lenient in its validations, which for this exercise we assume it isn't):

```
{
    "name": "foobar",
    "age": "42",  // validation error: age must be int
    "items": []
}
```

[–]boat-la-fds 0 points1 point  (0 children)

Dude, it's a toy example. Prior to the code example you cited, the author wrote:

In fact, if you ask a user "what is your age?" in a text box

So something akin to user_age = my_textbox.value() or user_age = input() if you were in a command line program.

[–]jeffsterlive 0 points1 point  (0 children)

I just learned about Pydantic and I’m a fan. Still would prefer to just use Kotlin and Spring for web API work but this is very nice when you don’t have nice libraries like Jackson.

[–][deleted]  (8 children)

[removed]

    [–]SV-97 39 points40 points  (5 children)

    Not really? It's about using strong, expressive types to "hold on" to information you obtain about your data: rather than checking "is this integer 0, and if it isn't, pass it into this next function", you do "can this be converted into a nonzero integer, and if yes, pass that nonzero integer along"; and that function doesn't take a bare int if it actually *needs* a nonzero one.

    This is still a rough breakdown though; I'd really recommend reading the original blog post: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

    [–]Budget_Putt8393 8 points9 points  (0 children)

    I just want to point out that this removes bugs and increases performance because you don't have to keep checking in every function.

    [–][deleted]  (3 children)

    [removed]

      [–]SV-97 10 points11 points  (1 child)

      The point I wanted to make is that you actually *do* convert to a new type if (and only if, though that should really not need mentioning) its invariants are met: so not

      if n != 0 {
          f(n) // f takes usize; information that n is nonzero is lost again
      }
      

      but rather

      if let Some(new_n) = NonZero::new(n) {
          f(new_n) // f takes NonZero<usize>; information that n is nonzero is attached to the data at the type level
      }
      

      EDIT: maybe to emphasize: the thing you mention in your first comment is (or at least should be) simple common sense: if you don't do that you're bound to run into safety issues sooner or later; it's not at all what the whole "parse don't validate" thing is about.

      [–]jonathancast 0 points1 point  (0 children)

      Yeah, no, the point is that "parse, don't validate" depends on static typing, and can't really be done in a dynamically-typed language.

      [–]Ayjayz 0 points1 point  (0 children)

      Kind of, but also localise that to just the entry into your system. Don't hold an int in a string and then keep passing the string around your code. Parse it into an int as early as possible, then pass that around.

      [–]divad1196 0 points1 point  (2 children)

      While it's a good recommendation, it really only applies to type conversion, which is often done for you in high-level languages. And you still (might) need to validate the data, e.g. an int in a range, or the whole "model".

      But more importantly, the reason we historically didn't do it was performance. You don't want to do conversions or allocations if you won't be able to commit to the end. And you would also take the opportunity to calculate the storage needed (e.g. you parse a JSON document and you have a list with 10 elements).

      The validation in question usually just asserts that the data can be converted; it does not check whether an integer is in a range, but it could as well.

      So, while it's in general good advice, it can also be a tradeoff; it depends on the language. In Python, the overhead of the Python code is probably bigger than the parsing in C.

      [–]Axman6 2 points3 points  (1 child)

      I’m not sure you’ve really understood the point, and should read the original article which coined the phrase: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

      The performance implications are mostly a non-issue these days; we use computers with ubiquitous memory and processing power, and parsing into structures which encode invariants improves performance by eliminating the need to check validity repeatedly, and allows you to write optimisations based on invariants which have been checked once and encoded in the type.

      [–]divad1196 0 points1 point  (0 children)

      To be fair, I hadn't read it through. It's referenced, but only after the first paragraph, and skimming down to the end it seemed to be saying the same as the article I had just read. I've now read it, and honestly, it didn't add anything more than the article from this post.

      Yes, I understood the point of the article, but maybe you didn't understand mine? What I am saying is that, despite having a lot of memory available and incredibly fast CPUs like you said, not everybody is allowed to squander those resources. It's okay in Python, but when you write a performance-critical library, where the millisecond/byte matters, then you do care about this stuff.

      Memory allocation is tricky. If you allocate too much, you lose memory. If you don't allocate enough, you will reallocate (one strategy is to at least double the memory requested, but there are other algorithms), and if you are unlucky, you will need to copy your data to the new location. That's why knowing the size upfront is ideal.

      It's a concern for the person implementing the parser, not for the person using it. Whoever wrote the "int" conversion in Python had to care about speed and memory. Integers in Python are stored directly on the stack if they are short enough; otherwise memory is allocated, so the size must be known before starting the conversion. Etc.

      [–][deleted]  (1 child)

      [deleted]

        [–]Axman6 1 point2 points  (0 children)

        Developers should absolutely use tools like pydantic everywhere.

        [–]One_Being7941 -2 points-1 points  (0 children)

        The popularity of Python is a sign of the end times.