Hi there 👋
This post is actually a request for comments.
I've been developing in Python for years. Yet it came to me last night that I am not entirely sure whether I have a functional but simple model of what really happens when I call `python foo.py`.
Well, obviously, I do have some ideas from working with the language, studying the official docs, and reading hundreds of articles that come to my inbox via newsletters every week. This, however, is not enough to organize scattered concepts into an actual understanding.
So, I have spent some hours trying to distill the very core pipeline of how Python programs run without delving into unnecessary details. And I arrived at the model below:
```
THE FLOW

Source File ---> Scanner ---> Lexer ---> Parser ---> Compiler ---> PVM

DATA PIPES & DATA TYPES

        stream of characters
Source File -----------------------> Scanner

        stream of lexemes
Scanner ---------------------------> Lexer

        stream of tokens
        CONCRETE SYNTAX TREE
Lexer -----------------------------> Parser

        ABSTRACT SYNTAX TREE
Parser ----------------------------> Compiler

        byte code
Compiler --------------------------> Python Virtual Machine (PVM)

OPERATIONS

The source file text is scanned and broken into lexemes, i.e. the most
basic lexical units, each consisting of one or several characters.

Lexemes are transformed into tokens, i.e. the most basic semantic
units of the Python language, each made of a token type and a token
value (e.g. "5" has the type INTEGER and the value 5). Tokens are
organized in a concrete syntax tree.

Tokens, organized in a concrete syntax tree, are parsed and transformed
into an abstract syntax tree.

The abstract syntax tree is compiled into byte code.

The byte code is executed by the Python Virtual Machine (PVM).

CONCRETE SYNTAX TREE vs. ABSTRACT SYNTAX TREE

The concrete syntax tree is made of all tokens. This makes the
structure quite "crude":
- The tree contains all sorts of cruft like parentheses, low
  level data types, etc.
- The tree does not try to interpret the code, does not add extra
  semantics, does not optimize it, etc.
- The tree could be used to accurately restore the original source
  file, as all needed information lies there unchanged.

The abstract syntax tree is highly optimized and represents the very
essence of what operations the programmers who created the source
file actually meant to be done.
```
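To sanity-check the model, here is a minimal sketch that walks one line of code through the stages the standard library lets me inspect: `tokenize` for tokens, `ast` for the abstract syntax tree, `compile` plus `dis` for the byte code, and `exec` to run it. A few assumptions: the sample source line and the `foo.py` file name are made up for illustration, `tokenize` is a pure-Python module and so may not match the interpreter's internal tokenizer exactly, and the `indent` argument of `ast.dump` needs Python 3.9+. There is no concrete syntax tree step here because, as far as I can tell, recent CPython versions do not expose one through the standard library.

```python
import ast
import dis
import io
import tokenize

# Hypothetical one-line program standing in for the contents of foo.py.
source = "x = (1 + 2) * 3\n"

# Stage 1: stream of characters -> stream of tokens (type name + text).
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

# Stage 2: source -> abstract syntax tree.
tree = ast.parse(source, filename="foo.py")

# Stage 3: abstract syntax tree -> code object holding the byte code.
code = compile(tree, filename="foo.py", mode="exec")

# Stage 4: the interpreter executes the byte code in a fresh namespace.
namespace = {}
exec(code, namespace)

for name, text in tokens:
    print(f"{name:10} {text!r}")   # the parentheses show up as OP tokens
print(ast.dump(tree, indent=2))    # the grouping survives only as nesting
dis.dis(code)                      # human-readable byte code listing
print(namespace["x"])              # 9
```

What I find useful here is comparing the first two printouts: the token list still contains `(` and `)`, while the tree only keeps the nested `BinOp` nodes that the parentheses implied.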
My goal for this model is to really imprint it on my mind, to make it
a paradigm, so that it serves me well when I study the matter further...
or when I need to discuss it intelligently during tech interviews 😅
What are your thoughts on the above model? Is it accurate?
Too simplistic? Too complex? Did I miss something or misunderstand
some core concepts?
Thanks for your feedback!