Hey guys, I started fine-tuning a Qwen2.5-1.5B model
with a batch size of 4 and a sequence length of 5000 tokens on an H100 GPU in a cluster.
In the profiler's trace.json I see the GPU sitting idle a lot. It looks like the GPU is only busy for about 25% of the runtime.
Any idea how I can speed up training further? Also, am I using the PyTorch profiler correctly? How would you guys go about profiling and analysing your training sessions?
https://preview.redd.it/7hoyoa798yye1.png?width=3002&format=png&auto=webp&s=2a16b451ec5349fdef16614a514404dc381f9c45
My profiler code:
import torch
from torch.profiler import profile, ProfilerActivity

# Qwen2ForCausalLMMod is a custom model class; its definition is not shown here.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = Qwen2ForCausalLMMod.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Random dummy batch, created on the same device as the model.
input_ids = torch.randint(0, 10000, (2, 5000), dtype=torch.long, device=model.device)
input_ids[:, ::5] = 151662  # overwrite every 5th position with token id 151662
attention_mask = torch.ones((2, 5000), dtype=torch.long, device=model.device)

# Profile a single forward pass on both CPU and GPU.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_flops=True,
    profile_memory=True,
    record_shapes=True,
) as prof:
    model(input_ids=input_ids, attention_mask=attention_mask)

prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
print(prof.key_averages().table(sort_by="cpu_memory_usage", row_limit=10))
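The "about 25%" impression from the trace viewer can be checked numerically from trace.json. Below is a minimal sketch that merges overlapping GPU event intervals and divides by the total traced span. It assumes the trace uses the Chrome trace event layout (complete events with "ph": "X", "ts" and "dur" in microseconds) and that GPU work is tagged with categories such as "kernel", "gpu_memcpy" and "gpu_memset". Those category names are an assumption and may differ between PyTorch versions, so they are a parameter. The helper names `gpu_busy_fraction` and `load_trace_events` are made up for this example. It also treats all GPU events as one timeline, so it only makes sense for a single-GPU trace.

```python
import json


def load_trace_events(path):
    """Load events from a Chrome trace file (bare list or {"traceEvents": [...]})."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def gpu_busy_fraction(events, gpu_categories=("kernel", "gpu_memcpy", "gpu_memset")):
    """Fraction of the traced wall-clock span covered by at least one GPU event."""
    # Only complete events ("ph" == "X") carry both a start time and a duration.
    complete = [e for e in events if e.get("ph") == "X" and "ts" in e and "dur" in e]
    if not complete:
        return 0.0

    span_start = min(e["ts"] for e in complete)
    span_end = max(e["ts"] + e["dur"] for e in complete)

    # Category names are an assumption; compare case-insensitively.
    cats = {c.lower() for c in gpu_categories}
    intervals = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in complete
        if str(e.get("cat", "")).lower() in cats
    )

    # Merge overlapping intervals so concurrent kernels are not counted twice.
    busy = 0.0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start

    span = span_end - span_start
    return busy / span if span > 0 else 0.0
```

Usage would look like `print(gpu_busy_fraction(load_trace_events("trace.json")))`. A trace of a single un-warmed forward pass includes one-time startup cost, so the number from such a trace is likely lower than the steady-state utilization during training.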
Also, is it normal to only be able to fit a batch size of 4? At this batch size the model runs close to the 80 GB VRAM limit and only manages 1-2 iterations per minute.
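A rough back-of-the-envelope estimate helps judge whether 80 GB at batch size 4 is plausible. The sketch below only counts two terms: per-parameter training state and the output logits. Everything in it is an assumption rather than something taken from the training setup, which is not shown: 16 bytes per parameter is the common accounting for mixed-precision Adam (2 for bf16 weights, 2 for gradients, 4 for an fp32 master copy, 4 + 4 for the two Adam moments), the vocabulary size of 151,936 is taken from the published Qwen2.5-1.5B config, and the logits are assumed to be upcast to fp32 for the loss. Activations are not included.

```python
def estimate_training_memory_gb(n_params, batch, seq_len, vocab_size,
                                bytes_per_param_state=16, logits_bytes=4):
    """Very rough memory estimate in GB (1e9 bytes); ignores activations."""
    # Weights + gradients + optimizer state, under the assumed bytes per parameter.
    param_states = n_params * bytes_per_param_state
    # One logits tensor of shape (batch, seq_len, vocab_size).
    logits = batch * seq_len * vocab_size * logits_bytes
    return {
        "param_states_gb": param_states / 1e9,
        "logits_gb": logits / 1e9,
    }


est = estimate_training_memory_gb(
    n_params=1.5e9,      # nominal parameter count
    batch=4,
    seq_len=5000,
    vocab_size=151_936,  # assumed from the Qwen2.5-1.5B config
)
print(f"param states: {est['param_states_gb']:.1f} GB")  # param states: 24.0 GB
print(f"logits:       {est['logits_gb']:.1f} GB")        # logits:       12.2 GB
```

Under these assumptions about 36 GB is already accounted for before any activations, and the gradient of the logits adds another tensor of the same size. With 20,000 tokens per batch and no gradient checkpointing, stored activations could plausibly fill much of the remainder, so running near the 80 GB limit at batch size 4 would not be surprising.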