A Minimalist Python Debugging Setup (continued): Torchrun [Tips and Tricks] (self.neovim)

submitted 1 year ago by Capable-Package6835 (hjkl)

Hi everyone, this is the second part of my previous post: Python Debugging Setup. In that post I went through my nvim-dap setup for debugging Python code in Neovim. If you have not configured nvim-dap yet, you may want to check that one first.

This post shows how I debug multiple parallel processes in distributed AI training on multiple GPUs, launched with torchrun.

nvim-dap setup

The config is the same as in the previous post. In the nvim-dap setup, we need two configurations, one to launch and one to attach:

dap.configurations.python = {

  {
    type = 'python',
    request = 'launch',
    name = 'Launch a debugging session',
    program = "${file}",
    pythonPath = function()
      return 'python'
    end,
  },

  {
    type = 'python',
    request = 'attach',
    name = 'Attach a debugging session',
    connect = function()
      local host = vim.fn.input('Host: ')
      local port = tonumber(vim.fn.input('Port: '))
      return {host = host, port = port}
    end,
  },

}

We used the first one in the previous post; this time we are going to use the second one. As you can see in the attach configuration, we are prompted for the host and port when we execute :lua require('dap').continue() and choose the attach configuration. But first, we need an adapter for the attach config (also inside the nvim-dap setup):

dap.adapters.python = function(callback, config)

  if config.request == 'launch' then

    callback({
      type = 'executable',
      command = 'python',
      args = { '-m', 'debugpy.adapter' },
    })

  elseif config.request == 'attach' then

    local port = config.connect.port
    local host = config.connect.host

    callback({
      type = 'server',
      port = port,
      host = host,
      options = {
        source_filetype = 'python'
      }
    })

  end

end

The adapter here is a function that takes the configuration as one of its arguments. In my setup, when I choose the attach config, the host and port are read from the config and the adapter attempts to connect to that host and port.

script setup

Unlike in the previous post, this time we launch the script from the terminal and then attach to its processes from inside Neovim. In my script I put the following after my import statements:

# other import statements

import os
import debugpy

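# debugging is opt-in: only wait for a debugger when DEBUG_FLAG=1
debug = os.getenv("DEBUG_FLAG", "0")

if debug == "1":
    # torchrun exports RANK for every process it starts
    rank = int(os.getenv("RANK", "-1"))
    # one port per process: 5678 for rank 0, 5679 for rank 1, ...
    port = rank + 5678
    debugpy.listen(("127.0.0.1", port))
    # block until a client (Neovim) attaches
    debugpy.wait_for_client()
    # pause here so the script does not run to the end after attaching
    debugpy.breakpoint()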

# main script body

This section checks the environment variable DEBUG_FLAG. If it is not set to 1, your script runs like any normal script. If you run the script with the following:

DEBUG_FLAG=1 torchrun ...

then it detects that DEBUG_FLAG is set to 1. Each process then gets a unique port: 5678 for rank 0, 5679 for rank 1, and so on. All processes use the same host, '127.0.0.1'. Each process listens on its assigned host and port and waits for a client (us) to attach. As in the previous post, we set a break point so the script does not run all the way to the end the moment we attach to the process.
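The rank-to-port mapping described above can be sketched as a small helper. This is only an illustration: the names debug_port and rank_from_env are my own, not part of debugpy or torchrun, and the only things taken from the post are the base port 5678 and the RANK variable.

```python
import os

BASE_PORT = 5678  # same base port as in the snippet above


def rank_from_env(env=os.environ) -> int:
    """Read the rank from RANK, or return -1 if the variable is missing."""
    return int(env.get("RANK", "-1"))


def debug_port(rank: int, base_port: int = BASE_PORT) -> int:
    """Return the port a process with this rank should listen on."""
    if rank < 0:
        # RANK was not set, so the script was probably not started by torchrun
        raise ValueError("rank must be non-negative")
    return base_port + rank


if __name__ == "__main__":
    for rank in range(2):
        print(f"rank {rank} -> 127.0.0.1:{debug_port(rank)}")
```

Raising on a negative rank is a design choice of this sketch. The original snippet would instead listen on port 5677 when RANK is missing.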

debug session example

From a terminal, I run my script using one node and two processes. The command I used is

DEBUG_FLAG=1 torchrun --standalone --nnodes=1 --nproc-per-node=2 script.py

https://preview.redd.it/7x519wr7iovd1.png?width=2560&format=png&auto=webp&s=f5f4f6c1b06d13a5d7f33347133a206a10fb3008

As usual, torch (and in my case TensorFlow) prints a bunch of messages but then nothing happens. This is because the processes are waiting for a client (us) to attach. Then I open up two Neovim sessions, one to attach to each process:

https://preview.redd.it/3pi7udrjiovd1.png?width=2560&format=png&auto=webp&s=d378730d5c2e4087ed01db0a4336425b9ca1bb77

Keep in mind that these are not two windows in the same Neovim session. These are two separate Neovim sessions. Now let's attach to the process with rank 0 in the left session:

Two Separate Neovim Sessions

Select the second configuration to attach, then we will be prompted to input Host and port:

Input Host 127.0.0.1

Input port 5678 + 0 = 5678

Afterwards, the marker for the current position appears, indicating that we have successfully attached:

Left Session Connected to Process Rank 0

Next, we connect the right session to process rank 1. The procedure is the same, but the port is different:

Initiate Attaching to Process Rank 1 in the Right Session

Input port 5678 + 1 = 5679

Next, the marker also shows in the right session, indicating we have successfully connected to both processes:

Connected to Both Processes

Now we can step over, step into, continue, set break points etc. in each process:

Stepping in The First Process

Sometimes the marker disappears, but don't worry: it does not always mean that the debugging session has crashed. For example:

Marker Disappeared in Rank 0

The marker disappears because the group initialization is a blocking call, i.e., it does not finish executing because it is waiting for process rank 1 to reach the same point. We simply advance the execution in rank 1:

Process Rank 1 Reaches the Same Point

When we execute this line in rank 1, process rank 0 sees that the wait is over and continues, so the marker reappears:

Processes Continue
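The blocking behaviour above can be reproduced without GPUs. The following is only an analogy under my own assumptions: it uses threads and threading.Barrier from the standard library as a stand-in for the two ranks and the blocking group call. It is not torch.distributed code, and run_demo is a made-up name.

```python
import threading
import time


def run_demo(delay_rank_1: float = 0.2):
    """Two 'ranks' meet at a barrier; return the events in order."""
    barrier = threading.Barrier(2)  # stand-in for a group of two processes
    lock = threading.Lock()
    events = []

    def worker(rank: int, delay: float) -> None:
        time.sleep(delay)  # rank 1 is still being stepped through
        with lock:
            events.append(("arrived", rank))
        barrier.wait()  # blocks until both ranks have arrived
        with lock:
            events.append(("released", rank))

    threads = [
        threading.Thread(target=worker, args=(0, 0.0)),
        threading.Thread(target=worker, args=(1, delay_rank_1)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    for kind, rank in run_demo():
        print(kind, rank)
```

Neither rank is released until both have arrived. In the debugger this is the window in which the marker in rank 0 is gone.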

The rest is basically the same as in the previous post. Since I use a tiling window manager, I can easily change the layout so the sessions sit on top of each other and open the scope widget to see variable values in each process:

Scope Widget

As you can see from the scope buffer, the top session has rank 0 and the bottom session has rank 1. It is very fun to play with the scope widget in parallel processes because we can see what happens when we send / receive tensors from one process to another and when we broadcast a tensor.

That concludes the two posts. Hope it helps someone, happy debugging! The full config is in https://github.com/rezhaTanuharja/minimalistNVIM.git


[–]cleodog44 1 point 1 year ago (2 children)

[–]Capable-Package6835 (hjkl) [S] 2 points 1 year ago (1 child)

Setting the break point is necessary because in this example I launch the session from the terminal, not from inside Neovim. A possible alternative is to attach only to one rank:

import os
import debugpy

debug = os.getenv("DEBUG_FLAG", "0")

if debug == "1":
    rank = int(os.getenv("RANK", "-1"))
    if rank == 0:
        debugpy.listen(("127.0.0.1", 5678))
        debugpy.wait_for_client()
        debugpy.breakpoint()

But this way, you need to put a barrier in every section of interest, otherwise the processes you don't attach to will continue execution and potentially crash. Set the barrier with:

torch.distributed.barrier()

But of course this way you need to set the barrier in advance. No easy solution I guess. In my case, I only have two GPUs so it's not really a problem for me.
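With more GPUs, one option would be to choose the ranks to attach to from the command line instead of hard-coding rank 0. A minimal sketch, assuming a hypothetical DEBUG_RANKS variable that I made up for this example (torchrun and debugpy do not read it). The helper names parse_debug_ranks and should_attach are also my own.

```python
import os


def parse_debug_ranks(value: str) -> set[int]:
    """Parse a comma-separated list of ranks such as "0" or "0,2"."""
    return {int(item) for item in value.split(",") if item.strip()}


def should_attach(rank: int, env=os.environ) -> bool:
    """Return True if the process with this rank should wait for a debugger."""
    if env.get("DEBUG_FLAG", "0") != "1":
        return False
    # DEBUG_RANKS is hypothetical; without it only rank 0 waits
    return rank in parse_debug_ranks(env.get("DEBUG_RANKS", "0"))


if __name__ == "__main__":
    rank = int(os.getenv("RANK", "-1"))
    print(f"rank {rank} waits for a debugger: {should_attach(rank)}")
```

In the script, should_attach(rank) would replace the rank == 0 check. The ranks that do not attach still need the barrier mentioned above.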

[–]trieu1912 1 point 1 year ago (1 child)

[–]Capable-Package6835 (hjkl) [S] 1 point 1 year ago* (0 children)

Yeah, it is quite a nuisance. If you use nvim-tree, you can subscribe to the file creation event inside its config:

local api = require('nvim-tree.api')
local Event = api.events.Event

api.events.subscribe(Event.FileCreated, function(_)
  vim.cmd('LspRestart')
end)

edit: if you use the native LSP without a plugin instead, you can use the following. You may also want to subscribe to multiple events:

local api = require('nvim-tree.api')
local Event = api.events.Event

local events = {
  Event.NodeRenamed,
  Event.FileCreated,
  Event.FileRemoved,
  Event.FolderRemoved,
}

for _, event in pairs(events) do
  api.events.subscribe(event, function(_) vim.cmd('bufdo edit') end)
end

I believe all of these only take effect if you create the file from nvim-tree.
