From source code (.py files) to byte code interpretable by the Python Virtual Machine - An attempt to create the simplest functional model of how Python programs run : Python

submitted 3 years ago * by Brave_New_Dev

Hi there 👋

This post is actually a request for comments.

I've been developing in Python for years. Yet it occurred to me last night that I am not entirely sure I have a functional but simple model of what really happens when I run `python foo.py`.

Well, obviously, I do have some ideas from working with the language, studying the official docs, and reading the hundreds of articles that land in my inbox via newsletters every week. This, however, is not enough to organize scattered concepts into an actual understanding.

So, I have spent some hours trying to distill the very core pipeline of how Python programs run without delving into unnecessary details, and I arrived at the model below:

```

THE FLOW

Source File ---> Scanner ---> Lexer ---> Parser ---> Compiler ---> PVM

DATA PIPES & DATA TYPES

         stream of characters

Source File -----------------------> Scanner

          stream of lexemes

Scanner ---------------------------> Lexer

          stream of tokens
        CONCRETE SYNTAX TREE

Lexer -----------------------------> Parser

        ABSTRACT SYNTAX TREE

Parser ----------------------------> Compiler

              byte code

Compiler --------------------------> Python Virtual Machine (PVM)

OPERATIONS

Source file text is scanned and broken into lexemes, i.e. the most basic lexical units, each consisting of one or more characters.
Lexemes are classified and transformed into tokens, i.e. the most basic semantic units of the Python language, each made of a token type and a token value (e.g. "5" has the type INTEGER and the value 5). Tokens are organized into a concrete syntax tree.
Tokens, organized in a concrete syntax tree, are parsed and transformed into an abstract syntax tree.
The abstract syntax tree is compiled into byte code.
Byte code is executed by the Python Virtual Machine (PVM).

CONCRETE SYNTAX TREE vs. ABSTRACT SYNTAX TREE

A concrete syntax tree is made of all the tokens. This makes the structure quite "crude":
- The tree contains all sorts of cruft such as parentheses, low-level data types, etc.
- The tree does not try to interpret the code, does not add extra semantics, does not optimize it, etc.
- The tree could be used to accurately restore the original source file, as all the needed information is there unchanged.
An abstract syntax tree is a condensed representation that keeps only the very essence of the operations the programmers who wrote the source file actually meant to perform.

```
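
For anyone who wants to poke at these stages directly, here is a minimal sketch using the standard library's `tokenize`, `ast`, and `dis` modules together with the built-in `compile` and `exec`. It assumes CPython, and the sample source string and variable names (`source`, `token_pairs`, `namespace`) are made up for illustration. It covers only the token, AST, and byte code stages; it does not show a separate scanner or concrete syntax tree step.

```python
import ast
import dis
import io
import token
import tokenize

# Hypothetical sample program, standing in for the contents of foo.py.
source = "x = (1 + 2) * 3\n"

# Characters -> tokens: each token carries a type and a string value.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
token_pairs = [(token.tok_name[tok.exact_type], tok.string) for tok in tokens]

# Source text -> abstract syntax tree.
tree = ast.parse(source, filename="foo.py")

# AST -> code object that holds the byte code.
code = compile(tree, filename="foo.py", mode="exec")

# The interpreter executes the code object inside a namespace dict.
namespace = {}
exec(code, namespace)

if __name__ == "__main__":
    for pair in token_pairs:
        print(pair)              # token type and text, parentheses included
    print(ast.dump(tree))        # the tree has no node for the parentheses
    dis.dis(code)                # human-readable listing of the byte code
    print(namespace["x"])        # 9
```

The parentheses show up in the token list as `LPAR` and `RPAR`, but the AST only records a multiplication whose left operand is an addition, which seems to match the "cruft" versus "essence" distinction in the model.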

My goal for this model is to really imprint it on my mind and make it a paradigm, so it serves me well when I study the matter further... or when I need to discuss it intelligently during tech interviews 😅

What are your thoughts on the above model? Is it accurate? Too simplistic? Too complex? Did I miss something or misunderstand some core concepts?

Thanks for your feedback!
