Recently, I’ve been interested in learning some of the internal workings of CPython in order to more deeply understand Python and possibly contribute to it. I’ve also come across Nim and wanted to try a project using it. To accomplish two goals at once, I decided to write a toy Python VM using Nim to execute Python bytecode.
This post is where I intend to compile information about Python’s compilation step as I learn about it, as a reminder for myself and a resource for anyone else who might be curious.
What Is Python’s Compilation Step?
For a detailed overview of CPython and the steps a program goes through from source code to execution, check out this RealPython guide by Anthony Shaw. Part 3, in particular, describes the step in question here.
In short, and broadly speaking, the process goes source code -> lexing/tokenizing -> parsing to AST -> compiling to bytecode -> execution. You can compile code using Python’s built-in compile
function, which results in a code object.
Here’s the JSON object that results from a code object compiled from the statement hello = 3000
. This JSON contains everything needed to run the program. The most important items in this example are "code"
, which contains the opcodes; "consts"
, which is a list of constants used in the program; and "names"
, which is a list of variable names used in the programs. This will make more sense later.
{ "argcount": 0, "cellvars": [], "code": "d\u0000Z\u0000d\u0001S\u0000", "consts": [3000, null], "filename": "test.py", "firstlineno": 1, "flags": 64, "freevars": [], "kwonlyargcount": 0, "lnotab": "", "name": "<module>", "names": ["hello"], "nlocals": 0, "stacksize": 1, "varnames": [] }
Another helpful tool is dis
, which outputs the opcodes of a compiled program.
For example, the code object, according to dis
, looks like this:
1 0 LOAD_CONST 0 (3000) 2 STORE_NAME 0 (hello) 4 LOAD_CONST 1 (None) 6 RETURN_VALUE
The first column is the line number. The second column I’m not sure about yet; it might be the index of the opcode, since it goes like [opcode, arg, opcode, arg…]. The third column is the name of the opcode. The fourth column is the argument for the opcode, which is usually an index for a different list. Finally, the fifth column is what that index points to.
Putting it all together:
The first opcode, LOAD_CONST
, loads a constant from index 0 in the consts
list, which is 3000
, and puts it on the stack. The next opcode, STORE_NAME
, pops a value (3000) off the stack and associates it with the name at index 0 of the names
list, which is hello
.
The next two opcodes just indicate the end of the frame.
That’s a lot to digest, and we’ve only just managed to assign an integer to a variable!
Strategy
This is a big project, so I’m planning on doing it a bite at a time. At its current stage, my Nim VM can take the bytecode of programs that simply assign constants to variables —and the constant has to be either an integer or a string.
My strategy is to write a Python program that requires just a bit more functionality than what my VM currently implements. Then work on the VM until it can successfully run the program. Rinse and repeat.
I’ll try to write it in the best Nim that I can, but I won’t let analysis paralysis prevent me from going ahead and getting something working. Since I’m doing this to learn Nim, a lot of it probably isn’t going to be the most idiomatic, performant Nim code. That’s okay. I’ll try to fix things as I continue to learn.
When I finish implementing an opcode, or learn something useful about code objects and such like, I’ll update this article with what I’ve learned.
Credits/References
Thanks to Alison Kaptur for an excellent article describing a simple implementation of a Python VM written in Python. That pointed me to the actual VM, Byterun, which she worked on, though it was primarily written by Ned Batchelder. It has been a great help in understanding how all of this works. So, without further ado, here are my notes on terminology, code objects, and opcodes.
Definitions
Code Object
: A collection of fields containing all the information necessary to execute a Python program.Stack
: A data structure holding the values that the VM is working on. You push values to the top of the stack to store them and pop value from the top of the stack to work with them.Opcode
: A byte that instructs the VM to take a particular action, such as pushing a value to the stack or popping the top two values from the stack and adding them together.Jump
: An opcode that “jumps,” skips to another part of the bytecode. This can be relative (jump forward 10 bytes from the current position, for example) or absolute (jump to the 23rd byte from the beginning).
Code Objects
Code objects consist of several fields containing various information about the program.
Field Name | Description |
---|---|
argcount | I’m not sure yet. |
cellvars | I’m not sure yet. |
code | A bytestring representing the opcodes and their arguments. Every other byte is an opcode, and in between each opcode is a byte for the argument. |
consts | A list of constants. These could be integers, strings, None [null], etc. These are put onto the stack by the LOAD_CONST opcode. |
filename | The name of the file which was compiled. When compiling from within a program, this is passed to the compile built-in function as an argument, so it could be any string, really. |
firstlineno | The first line number. I’m not sure what this is for, yet. I’m assuming it has something to do with frames. |
flags | I’m not sure yet. |
freevars | I’m not sure yet. |
kwonlyargcount | I’m not sure yet. I assume it has something to do with keyword-only arguments to functions. |
lnotab | I’m not sure yet. |
name | The name of the module. I assume this is related to importing. |
names | A list of names of variables, which will be referenced by certain opcodes to associate variables with values. |
nlocals | I’m not sure yet. |
stacksize | I’m not sure yet. It’s possible that the max size of the stack is precomputed so that the data structure can be initialized to the correct size. |
varnames | I’m not sure yet. |
Opcodes
Note: sometimes the byte value skips ahead, to leave room for new opcodes to be inserted there in the future. To indicate when that happens, in the table below, the Byte Value is in bold.
Some of the information here comes from the dis
module’s documentation. Some of it comes from the opcode
module’s source code. Some of it comes from what I’ve learned through experimentation.
Name | Byte Value | Description |
---|---|---|
POP_TOP | 1 | Pop the top value from the stack. |
ROT_TWO | 2 | Rotate the top two values of the stack. For example: [1, 2, 3, 4] -> [1, 2, 4, 3]. |
ROT_THREE | 3 | As above, but rotate the top three values. |
DUP_TOP | 4 | Duplicate the top stack value. |
DUP_TOP_TWO | 5 | Duplicate the top two stack values. |
ROT_FOUR | 6 | As ROT_TWO, but rotate the top four values. |
NOP | 9 | No operation. Does nothing. Used as a placeholder. |
UNARY_POSITIVE | 10 | Unary operations take the top of the stack apply an operation to it, then push it back on the stack. This one adds 1 to it. |
UNARY_NEGATIVE | 11 | Same as above, but subtract 1. |
UNARY_NOT | 12 | Negates the top stack value (as in not x ). |
UNARY_INVERT | 15 | Inverts the top stack value (as in ~x ). |
BINARY_MATRIX_ MULTIPLY | 16 | Binary operations take the top two values from the stack, perform an operation on them, then push the result onto the stack. This one performs matrix multiplication (the @ syntax, new in 3.5). |
INPLACE_MATRIX_ MULTIPLY | 17 | Performs an inplace matrix multiplication. |
BINARY_POWER | 19 | |
BINARY_MULTIPLY | 20 | |
BINARY_MODULO | 22 | |
BINARY_ADD | 23 | |
BINARY_SUBTRACT | 24 | |
BINARY_SUBSCR | 25 | |
BINARY_FLOOR_DIVIDE | 26 | |
BINARY_TRUE_DIVIDE | 27 | |
INPLACE_FLOOR_DIVIDE | 28 | |
INPLACE_TRUE_DIVIDE | 29 | |
GET_AITER | 50 | |
GET_ANEXT | 51 | |
BEFORE_ASYNC_WITH | 52 | |
BEGIN_FINALLY | 53 | |
END_ASYNC_FOR | 54 | |
INPLACE_ADD | 55 | |
INPLACE_SUBTRACT | 56 | |
INPLACE_MULTIPLY | 57 | |
INPLACE_MODULO | 59 | |
STORE_SUBSCR | 60 | |
DELETE_SUBSCR | 61 | |
BINARY_LSHIFT | 62 | |
BINARY_RSHIFT | 63 | |
BINARY_AND | 64 | |
BINARY_XOR | 65 | |
BINARY_OR | 66 | |
INPLACE_POWER | 67 | |
GET_ITER | 68 | |
GET_YIELD_FROM_ITER | 69 | |
PRINT_EXPR | 70 | |
LOAD_BUILD_CLASS | 71 | |
YIELD_FROM | 72 | |
GET_AWAITABLE | 73 | |
INPLACE_LSHIFT | 75 | |
INPLACE_RSHIFT | 76 | |
INPLACE_AND | 77 | |
INPLACE_XOR | 78 | |
INPLACE_OR | 79 | |
WITH_CLEANUP_ START | 81 | |
WITH_CLEANUP_ FINISH | 82 | |
RETURN_VALUE | 83 | |
IMPORT_STAR | 84 | |
SETUP_ANNOTATIONS | 85 | |
YIELD_VALUE | 86 | |
POP_BLOCK | 87 | |
END_FINALLY | 88 | |
POP_EXCEPT | 89 | |
STORE_NAME | 90 | All opcodes from here on have arguments. This operation pops the top value from the stack and associates it with a name in the names list. The argument is the index of the name. |
DELETE_NAME | 91 | Deletes the association between a name and a value. The argument is the index of the name. |
UNPACK_SEQUENCE | 92 | The argument is the number of tuple items. |
FOR_ITER | 93 | The argument is a relative jump. |
UNPACK_EX | 94 | |
STORE_ATTR | 95 | The argument is the index of a name in the names list. |
DELETE_ATTR | 96 | The argument is the index of a name in the names list. |
STORE_GLOBAL | 97 | The argument is the index of a name in the names list. |
DELETE_GLOBAL | 98 | The argument is the index of a name in the names list. |
LOAD_CONST | 100 | Push a constant onto the stack. The argument is the index of a constant in the consts list. |
LOAD_NAME | 101 | Push the value associated with a name onto the stack. The argument is the index of a name in the names list. |
BUILD_TUPLE | 102 | Pop the top n values from the stack and build a tuple from them. The argument is the number of tuple items to pop. Push the resulting tuple onto the stack. |
BUILD_LIST | 103 | Pop the top n values from the stack and build a list from them. The argument is the number of list items to pop. Push the resulting tuple onto the stack. |
BUILD_SET | 104 | Pop the top n values from the stack and build a set from them. The argument is the number of set items to pop. Push the resulting set onto the stack. |
BUILD_MAP | 105 | Pop the top n values from the stack and build a map (dictionary) from them. The argument is the number of dict entries. Push the resulting dict onto the stack. |
LOAD_ATTR | 106 | The argument is the index of a name in the names list. |
COMPARE_OP | 107 | |
IMPORT_NAME | 108 | |
IMPORT_FROM | 109 | |
JUMP_FORWARD | 110 | The argument is the number of bytes to skip (a relative jump). |
JUMP_IF_FALSE_ OR_POP | 111 | The argument is the byte index to jump to (an absolute jump). |
JUMP_IF_TRUE_ OR_POP | 112 | The argument is the byte index to jump to (an absolute jump). |
JUMP_ABSOLUTE | 113 | The argument is the byte index to jump to (an absolute jump). |
POP_JUMP_IF_FALSE | 114 | The argument is the byte index to jump to (an absolute jump). |
POP_JUMP_IF_TRUE | 115 | The argument is the byte index to jump to (an absolute jump). |
LOAD_GLOBAL | 116 | The argument is the index of a name in the names list. |
SETUP_FINALLY | 122 | The argument is the number of bytes to jump (a relative jump). |
LOAD_FAST | 124 | The argument is the local variable number. |
STORE_FAST | 125 | The argument is the local variable number. |
DELETE_FAST | 126 | The argument is the local variable number. |
RAISE_VARARGS | 130 | The argument is the number of raise arguments (1, 2, or 3). |
CALL_FUNCTION | 131 | |
MAKE_FUNCTION | 132 | |
BUILD_SLICE | 133 | |
LOAD_CLOSURE | 135 | |
LOAD_DEREF | 136 | |
STORE_DEREF | 137 | |
DELETE_DEREF | 138 | |
CALL_FUNCTION_KW | 141 | |
CALL_FUNCTION_EX | 142 | |
SETUP_WITH | 143 | |
LIST_APPEND | 145 | |
SET_ADD | 146 | |
MAP_ADD | 147 | |
LOAD_CLASSDEREF | 148 | |
EXTENDED_ARG | 144 | Note: this one is out of order in opcode.py, so I’ve listed it out of order here, too. |
BUILD_LIST_UNPACK | 149 | |
BUILD_MAP_UNPACK | 150 | |
BUILD_MAP_UNPACK _WITH_CALL | 151 | |
BUILD_TUPLE_UNPACK | 152 | |
BUILD_SET_UNPACK | 153 | |
SETUP_ASYNC_WITH | 154 | |
FORMAT_VALUE | 155 | |
BUILD_CONST_KEY_MAP | 156 | |
BUILD_STRING | 157 | |
BUILD_TUPLE_UNPACK_ WITH_CALL | 158 | |
LOAD_METHOD | 160 | |
CALL_METHOD | 161 | |
CALL_FINALLY | 162 | |
POP_FINALLY | 163 |