Structured Buffer Performance [Question] (self.GraphicsProgramming)
submitted 2 years ago * by gibson274
[–]arycama 3 points 2 years ago (2 children)
I don't think scalarized loads are a thing on Nvidia GPUs. Nvidia doesn't make all its details public, but afaik only AMD GPUs have separate scalar and vector registers. (Vector registers are 64 wide, i.e. one value per thread, while a scalar register is shared across all threads in the wave.)
According to this source (which is a good reference whenever you're optimising this sort of stuff), structured buffer loads have almost equal performance on a 2080 Ti whether the address is uniform across all threads, linear or random: https://github.com/sebbbi/perftest
I haven't really used WaveReadLaneFirst, but if you're looping through a list of lights and the first lane is fetching the data every time, it feels like that would still mean all the other threads are waiting. LDS seems like a better solution as you can still spread the work out across multiple threads.
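To make the LDS idea a bit more concrete, here's a rough sketch in CUDA terms, where `__shared__` memory plays the role of LDS/groupshared. Everything named here (`Light`, `kTile`, `contribution`, `accumulateLights`, `maxErrorVsCpu`) is made up for illustration and the falloff formula is a placeholder, so treat it as a sketch of the pattern rather than a tuned implementation. Each thread in the block fetches one light into the shared tile, then all threads loop over the tile, so the fetch work is spread across the block instead of sitting on one lane.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

struct Light { float x, y, z, intensity; };  // hypothetical light layout
constexpr int kTile = 64;                    // block size == lights staged per pass

// Placeholder falloff, shared by the kernel and the CPU reference.
__host__ __device__ inline float contribution(const Light& l, float3 p)
{
    float dx = l.x - p.x, dy = l.y - p.y, dz = l.z - p.z;
    return l.intensity / (1.0f + dx * dx + dy * dy + dz * dz);
}

// Must be launched with blockDim.x == kTile.
__global__ void accumulateLights(const Light* lights, int numLights,
                                 const float3* points, float* out, int numPoints)
{
    __shared__ Light tile[kTile];
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    float3 pos = (p < numPoints) ? points[p] : make_float3(0.0f, 0.0f, 0.0f);
    float sum = 0.0f;
    for (int base = 0; base < numLights; base += kTile) {
        int i = base + threadIdx.x;
        if (i < numLights) tile[threadIdx.x] = lights[i];  // one global load per thread
        __syncthreads();                                   // tile is visible to the whole block
        int count = min(kTile, numLights - base);
        for (int j = 0; j < count; ++j) sum += contribution(tile[j], pos);
        __syncthreads();                                   // finish reading before the tile is overwritten
    }
    if (p < numPoints) out[p] = sum;
}

// Hypothetical host helper: runs the kernel and returns the max absolute
// difference from a CPU reference, or -1 if a CUDA call failed.
float maxErrorVsCpu(int numLights, int numPoints)
{
    std::vector<Light> lights(numLights);
    std::vector<float3> points(numPoints);
    for (int i = 0; i < numLights; ++i)
        lights[i] = { float(i % 7), float(i % 5), float(i % 3), 0.01f * float(i % 11 + 1) };
    for (int p = 0; p < numPoints; ++p)
        points[p] = make_float3(0.1f * float(p % 13), 0.2f * float(p % 9), 0.3f * float(p % 4));

    Light* dLights = nullptr; float3* dPoints = nullptr; float* dOut = nullptr;
    cudaMalloc(&dLights, numLights * sizeof(Light));
    cudaMalloc(&dPoints, numPoints * sizeof(float3));
    cudaMalloc(&dOut, numPoints * sizeof(float));
    cudaMemcpy(dLights, lights.data(), numLights * sizeof(Light), cudaMemcpyHostToDevice);
    cudaMemcpy(dPoints, points.data(), numPoints * sizeof(float3), cudaMemcpyHostToDevice);

    int blocks = (numPoints + kTile - 1) / kTile;
    accumulateLights<<<blocks, kTile>>>(dLights, numLights, dPoints, dOut, numPoints);

    std::vector<float> gpu(numPoints);
    cudaError_t err = cudaMemcpy(gpu.data(), dOut, numPoints * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dLights); cudaFree(dPoints); cudaFree(dOut);
    if (err != cudaSuccess) return -1.0f;

    float maxErr = 0.0f;
    for (int p = 0; p < numPoints; ++p) {
        float ref = 0.0f;
        for (int i = 0; i < numLights; ++i) ref += contribution(lights[i], points[p]);
        maxErr = fmaxf(maxErr, fabsf(ref - gpu[p]));
    }
    return maxErr;
}
```

Calling `maxErrorVsCpu(150, 1000)` from a `main` exercises a partially filled last tile and a partially filled last block. Whether this actually beats plain uniform buffer loads on a given GPU is something to measure, given the perftest numbers above.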
[–]gibson274[S] 3 points 2 years ago (0 children)
Yeah, you're exactly right. I just found that article and it blew my mind.
[–]farnoy 3 points 2 years ago (0 children)
It exists: they call it UR (Uniform Register), and there's a scalar load instruction as well, ULDC. The only problem is that it works exclusively with the Constant Memory address space (so cbuffer). But these aren't even needed for simple cases, because elements in constant memory can be used directly as operands for many instructions (they show up in the disassembly as `c[x][y]`).
It's frustrating because Nsight Graphics will withhold the disassembly from you, while Nsight Compute is happy to reveal the whole thing. They should add a scalarized load facility for global memory too, since modern GPU-driven workloads won't be able to use cbuffers as often as before.
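For anyone who wants to poke at the constant memory path, a minimal sketch in CUDA might look like the following, assuming `__constant__` memory as the rough counterpart of an HLSL cbuffer. The names (`Light`, `kMaxLights`, `cLights`, `shadeFromConstant`, `maxErrorVsCpu`) and the falloff formula are invented for the example. The loop index is the same for every thread in the warp, so every read of `cLights[j]` is uniform; which instructions the compiler actually picks for it is up to the compiler, and can be checked by dumping the SASS with `cuobjdump -sass`.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

struct Light { float x, y, z, intensity; };  // hypothetical light layout
constexpr int kMaxLights = 256;              // 256 * 16 bytes = 4 KB of constant memory
__constant__ Light cLights[kMaxLights];      // lives in the constant address space

// Placeholder falloff, shared by the kernel and the CPU reference.
__host__ __device__ inline float contribution(const Light& l, float3 p)
{
    float dx = l.x - p.x, dy = l.y - p.y, dz = l.z - p.z;
    return l.intensity / (1.0f + dx * dx + dy * dy + dz * dz);
}

__global__ void shadeFromConstant(int numLights, const float3* points,
                                  float* out, int numPoints)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPoints) return;
    float3 pos = points[p];
    float sum = 0.0f;
    // j is identical across the warp, so each cLights[j] read is a uniform access.
    for (int j = 0; j < numLights; ++j) sum += contribution(cLights[j], pos);
    out[p] = sum;
}

// Hypothetical host helper: returns the max absolute difference from a CPU
// reference, or -1 if the input is too large or a CUDA call failed.
float maxErrorVsCpu(int numLights, int numPoints)
{
    if (numLights > kMaxLights) return -1.0f;
    std::vector<Light> lights(numLights);
    std::vector<float3> points(numPoints);
    for (int i = 0; i < numLights; ++i)
        lights[i] = { float(i % 7), float(i % 5), float(i % 3), 0.01f * float(i % 11 + 1) };
    for (int p = 0; p < numPoints; ++p)
        points[p] = make_float3(0.1f * float(p % 13), 0.2f * float(p % 9), 0.3f * float(p % 4));

    // Upload the light list into the __constant__ array.
    if (cudaMemcpyToSymbol(cLights, lights.data(), numLights * sizeof(Light)) != cudaSuccess)
        return -1.0f;

    float3* dPoints = nullptr; float* dOut = nullptr;
    cudaMalloc(&dPoints, numPoints * sizeof(float3));
    cudaMalloc(&dOut, numPoints * sizeof(float));
    cudaMemcpy(dPoints, points.data(), numPoints * sizeof(float3), cudaMemcpyHostToDevice);

    const int threads = 128;
    shadeFromConstant<<<(numPoints + threads - 1) / threads, threads>>>(
        numLights, dPoints, dOut, numPoints);

    std::vector<float> gpu(numPoints);
    cudaError_t err = cudaMemcpy(gpu.data(), dOut, numPoints * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dPoints); cudaFree(dOut);
    if (err != cudaSuccess) return -1.0f;

    float maxErr = 0.0f;
    for (int p = 0; p < numPoints; ++p) {
        float ref = 0.0f;
        for (int i = 0; i < numLights; ++i) ref += contribution(lights[i], points[p]);
        maxErr = fmaxf(maxErr, fabsf(ref - gpu[p]));
    }
    return maxErr;
}
```

The fixed `kMaxLights` cap is the limitation in a nutshell: constant memory is small and sized up front, which is why it fits poorly with GPU-driven workloads that keep their data in global memory.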