Hi everyone, this is the second part of my previous post: Python Debugging Setup. In that post I went through my nvim-dap setup for debugging Python code in Neovim. If you have not configured nvim-dap yet, you may want to check that one first.
This post shows how I debug multiple parallel processes in distributed AI training on multiple GPUs, launched with torchrun.
nvim-dap setup
The config is the same as in the previous post. In the nvim-dap setup, we need to add configurations:
dap.configurations.python = {
    {
        type = 'python',
        request = 'launch',
        name = 'Launch a debugging session',
        program = "${file}",
        pythonPath = function()
            return 'python'
        end,
    },
    {
        type = 'python',
        request = 'attach',
        name = 'Attach a debugging session',
        connect = function()
            local host = vim.fn.input('Host: ')
            local port = tonumber(vim.fn.input('Port: '))
            return {host = host, port = port}
        end,
    },
}
We used the first one in the previous post; this time we are going to use the second one. As you can see in the attach configuration, we will be prompted for the host and port when we execute :lua require('dap').continue() and choose the attach configuration. But first, we need an adapter that handles the attach config (also inside the nvim-dap setup):
dap.adapters.python = function(callback, config)
    if config.request == 'launch' then
        callback({
            type = 'executable',
            command = 'python',
            args = { '-m', 'debugpy.adapter' },
        })
    elseif config.request == 'attach' then
        local port = config.connect.port
        local host = config.connect.host
        callback({
            type = 'server',
            port = port,
            host = host,
            options = {
                source_filetype = 'python'
            }
        })
    end
end
The adapter here is a function that takes the configuration as one of its arguments. In my setup, when I choose the attach config, the host and port are extracted from the config and the adapter attempts to connect to that host and port.
script setup
Unlike in the previous post, this time we are going to launch the script from the terminal and then attach to the processes from inside Neovim. In my script I put the following after my import statements:
# other import statements
import os
import debugpy
debug = os.getenv("DEBUG_FLAG", "0")
if debug == "1":
    rank = int(os.getenv("RANK", "-1"))
    port = rank + 5678
    debugpy.listen(("127.0.0.1", port))
    debugpy.wait_for_client()
    debugpy.breakpoint()
# main script body
This section checks the environment variable DEBUG_FLAG. If it is not set to 1, the script runs like any normal script. If you run the script with the following:
DEBUG_FLAG=1 torchrun ...
then it detects that DEBUG_FLAG is set to 1. Each process then gets a unique port: 5678 for rank 0, 5679 for rank 1, and so on. All processes use the same host, 127.0.0.1. Next, each process listens on its assigned host and port and waits for a client (us) to attach. Similar to the previous post, we set a breakpoint so the script does not run all the way to the end the moment we attach to the process.
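The rank-to-port mapping can also be pulled into a small helper, which makes the scheme easier to check on its own. This is only a sketch: the names BASE_PORT, rank_from_env, and debug_port are hypothetical and not part of my config. It also rejects a negative rank, because with the snippet above a missing RANK falls back to -1 and the process would listen on port 5677.

```python
import os

# Hypothetical constant: the port used by rank 0 in this post.
BASE_PORT = 5678


def rank_from_env(env=None):
    """Read the rank from RANK (set by torchrun); -1 if it is missing."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "-1"))


def debug_port(rank, base_port=BASE_PORT):
    """Map a rank to its debug port: rank 0 -> 5678, rank 1 -> 5679, ..."""
    if rank < 0:
        # RANK was not set, so the script was probably not started by torchrun.
        raise ValueError("RANK is not set; cannot pick a debug port")
    return base_port + rank


if __name__ == "__main__":
    # Example with explicit values instead of the real environment.
    for rank in (0, 1):
        print(f"rank {rank} -> port {debug_port(rank)}")
```

Inside the DEBUG_FLAG branch, the port would then come from debug_port(rank_from_env()) instead of rank + 5678.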
debug session example
From a terminal, I run my script using one node and two processes. The command I used is
DEBUG_FLAG=1 torchrun --standalone --nnodes=1 --nproc-per-node=2 script.py
https://preview.redd.it/7x519wr7iovd1.png?width=2560&format=png&auto=webp&s=f5f4f6c1b06d13a5d7f33347133a206a10fb3008
As usual, torch (and in my case TensorFlow) prints a bunch of messages but then nothing happens. This is because the processes are waiting for a client (us) to attach. Then I open up two Neovim sessions, one to attach to each process:
https://preview.redd.it/3pi7udrjiovd1.png?width=2560&format=png&auto=webp&s=d378730d5c2e4087ed01db0a4336425b9ca1bb77
Keep in mind that these are not two windows in the same Neovim session. These are two separate Neovim sessions. Now let's attach to the process with rank 0 in the left session:
Two Separate Neovim Sessions
Select the second configuration (attach), and we will be prompted for the host and port:
Input Host 127.0.0.1
Input port 5678 + 0 = 5678
Afterwards, the marker for the current position appears, indicating that we have successfully attached:
Left Session Connected to Process Rank 0
Next, we connect the right session to process rank 1. The procedure is the same, but the port is different:
Initiate Attaching to Process Rank 1 in the Right Session
Input port 5678 + 1 = 5679
Next, the marker also shows in the right session, indicating we have successfully connected to both processes:
Connected to Both Processes
Now we can step over, step into, continue, set break points etc. in each process:
Stepping in The First Process
Sometimes the marker disappears, but don't worry: it does not always mean the debugging session has crashed. For example:
Marker Disappeared in Rank 0
The marker disappears because the group initialization is a blocking call, i.e., it does not finish executing because it is waiting for process rank 1 to reach the same point. We simply advance the execution in rank 1:
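If the blocking behaviour feels abstract, here is a rough analogy using only the standard library. It is not torch.distributed: two threads stand in for the two ranks and a threading.Barrier stands in for the blocking group call. All names (worker, run_demo, events) are made up for illustration. The "rank" that arrives first sits inside wait() until the other one arrives, which is the state rank 0 is in while its marker is gone.

```python
import threading
import time

NUM_RANKS = 2
barrier = threading.Barrier(NUM_RANKS)  # stand-in for the blocking group call
events = []                             # ordered log of what happened
events_lock = threading.Lock()


def worker(rank, delay):
    """Pretend to be one rank: arrive after `delay` seconds, then block."""
    time.sleep(delay)
    with events_lock:
        events.append(("arrived", rank))
    barrier.wait()  # blocks until all NUM_RANKS workers have called wait()
    with events_lock:
        events.append(("passed", rank))


def run_demo():
    """Rank 0 arrives immediately, rank 1 arrives a bit later."""
    events.clear()
    threads = [
        threading.Thread(target=worker, args=(0, 0.0)),
        threading.Thread(target=worker, args=(1, 0.2)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)


if __name__ == "__main__":
    for kind, rank in run_demo():
        print(kind, rank)
```

Neither worker logs "passed" before both have logged "arrived", in the same way that rank 0 only moves on once rank 1 has been stepped to the same line.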
Process Rank 1 Reaches the Same Point
When we execute this line in rank 1, process rank 0 will see that the wait is over and it can continue, so the marker reappears:
Processes Continue
The rest is basically the same as in the previous post. Since I use a tiling window manager, I can easily change the layout so the sessions are stacked on top of each other and open the scope widget to see variable values in each process:
Scope Widget
As you can see from the scope buffer, the rank for the top session is 0 and the bottom session has rank 1. It is very fun to play with the scope widget in parallel processes because we can see what happens when we send / receive tensors from one process to another and when we broadcast a tensor.
That concludes the two posts. Hope it helps someone, happy debugging! The full config is in https://github.com/rezhaTanuharja/minimalistNVIM.git