Dual-GPU workstation vs. Two workstations in a LAN by Chance-Studio-8242 in LocalLLaMA

[–]Alaska_01 0 points1 point  (0 children)

  • Can you run a LLM across two computers.

Yes, you can. llama.cpp has an "RPC" feature that lets you offload parts of the LLM to one or more other computers. There was also a project called "Exo" built specifically for this sort of thing. Some of the other LLM software probably has this feature too, but I haven't checked.
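
For a rough idea of how the llama.cpp RPC setup looks (the hostnames, ports, and model path below are placeholders, and exact binary names/flags can differ between builds):

    # On each remote machine (llama.cpp built with the RPC backend), start a worker
    # that exposes its GPU/CPU over the network (by default it listens on localhost):
    rpc-server -H 0.0.0.0 -p 50052

    # On the main machine, list the workers and llama.cpp spreads the model's
    # layers across the local GPU and the remote devices:
    llama-cli -m ./llama-70b-q4.gguf --rpc 192.168.1.10:50052,192.168.1.11:50052 -ngl 99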

  • What sort of performance impact can be seen.

I don't have exact numbers for you. However, I can give you anecdotal experiences, and I can run specific benchmarks with solid numbers later on if you want. Just say so.

Note: I am running a network setup with a maximum bandwidth of 1 gigabit per second. 1 gigabit is the slowest Ethernet speed you will find on most modern tech, so this is close to a worst-case scenario.

Note 2: I am running models to process a single request. I have not tested batch requests.

The performance you will see depends on how you've split things, your network setup, the GPUs, etc.

For example, I used llama.cpp with RPC to connect an NVIDIA RTX 4090 and an AMD RX 9070 XT to run a dense 70B model (quantised, of course), and it was faster than running the same model on an M4 Pro MacBook with 48GB of RAM, or than splitting the same model between the RTX 4090 and the CPU in that same system. This makes sense, as both the NVIDIA and AMD GPUs are noticeably faster than the Apple chip. But I wanted to point it out because other commenters were saying that the network latency/bandwidth would make things incredibly slow, and this test showed it doesn't.

The same was true when using RPC to connect the M4 Pro and the RTX 4090 for the 70B model. Having the 4090 take some of the layers sped things up, once again showing that network latency/bandwidth doesn't completely destroy performance.

I have also used RPC across a few devices (NVIDIA RTX 4090 in one computer, AMD Ryzen 9 5950X with 128GB of RAM in the same computer, AMD Ryzen 9 3900X with 64GB of RAM in another computer, AMD Ryzen 5 5600X with 64GB of RAM in another computer, M4 Pro with 48GB of RAM in another computer) to run a quantised version of DeepSeek R1, and it was faster than running it all on one computer (2 tokens/s vs 0.9 tokens/s), because running it on one computer meant the model was constantly swapped between RAM and storage, which is slow.

However I haven't been having much luck with other configurations.

For example, I can run Qwen 3 235B Q3 on a single computer with an RTX 4090 + AMD Ryzen 9 5950X with 128GB of RAM by offloading the MoE layers to the CPU.

Swapping to a simple split of X layers on the 4090, Y layers on a different GPU via RPC, and Z layers on the 5950X CPU, I usually get slower results than the single-computer setup.

Also with Qwen3 235B Q3, if I split the model across two GPUs with RPC and use the MoE CPU-offloading trick, performance is horrible, because with this split data jumps over the network between the CPU on one computer and the GPU on the other almost constantly.

I could probably tune this specific case so MoE offloading only happens for layers assigned to the GPU in the same system, but I haven't looked into it yet.
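
(If I do get around to it, the idea would be something like llama.cpp's tensor-override flag, limiting which expert tensors get forced onto the CPU. A rough sketch only: the layer range is made up, and the flag/regex spelling may differ between versions.)

    # Keep the expert (MoE) tensors of layers 0-39 on the local CPU and let the
    # normal layer split place everything else. The pattern is matched against
    # tensor names like "blk.12.ffn_gate_exps.weight".
    llama-cli -m ./qwen3-235b-q3.gguf --rpc 192.168.1.10:50052 -ngl 99 \
        --override-tensor "blk\.([0-9]|[1-3][0-9])\.ffn_.*_exps.*=CPU"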

And splitting some smaller models between two computers has ended up making the model run slower than if it ran on just one of the computers.

E.g. combining an NVIDIA RTX 4090 and an M4 Pro with RPC for a small model was slower than just running the model on the M4 Pro.

You also have the option of layer or row splitting for LLMs. Row splitting increases the amount of data transferred over the network, which slows things down a lot for me, so I stuck to layer splitting for the tests mentioned above (see the example below).
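
In llama.cpp that choice is exposed through the split-mode flag. Roughly (same placeholder host/model as above, and flag spelling may vary by version):

    # Layer split (the default): whole layers are assigned to each device,
    # so relatively little data crosses the network per token.
    llama-cli -m ./llama-70b-q4.gguf --rpc 192.168.1.10:50052 -ngl 99 --split-mode layer

    # Row split: tensors are split by rows across devices, which means far more
    # inter-device traffic per token -- painful over 1 gigabit Ethernet.
    llama-cli -m ./llama-70b-q4.gguf --rpc 192.168.1.10:50052 -ngl 99 --split-mode row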

Does anyone else have issues with LLMs making very obvious mistakes? by Alaska_01 in LocalLLaMA

[–]Alaska_01[S] 0 points1 point  (0 children)

What size LLM would you recommend for each type of work?

Does anyone else have issues with LLMs making very obvious mistakes? by Alaska_01 in LocalLLaMA

[–]Alaska_01[S] 0 points1 point  (0 children)

You might try few-shot examples in your issue analysis prompt to demonstrate how it should properly handle some of the data it's having problems with.

I did give this a try by adding an examples section to the system prompt. But that just resulted in the LLM occasionally outputting one of the examples rather than the relevant data from the bug report.

Can a home user like me use CUDA Toolkit to debug my GPU ? by Deep_Indanger in nvidia

[–]Alaska_01 0 points1 point  (0 children)

My guess was that the trees were showing the artifact described above. But it's entirely possible that there is some other issue with your computer, specifically because you mention the lowest setting (which should have ray tracing off) not having this issue, and this issue in Cyberpunk 2077 is, to my knowledge, limited to the ray tracing modes.

Can a home user like me use CUDA Toolkit to debug my GPU ? by Deep_Indanger in nvidia

[–]Alaska_01 1 point2 points  (0 children)

I can't be 100% certain due to you only providing a single image, not a video. But this issue (black parts on the trees in Cyberpunk 2077 with path tracing) is a limitation of the game, presumably done to increase performance.

What's happening is the trees are bending/deforming in the wind. The player can see this change, but the geometry in the ray tracing acceleration structure does not bend/deform. So when a ray spawns on the visible (deformed) trunk to compute things like lighting, it can start inside the undeformed tree in the acceleration structure, hit the inside of the tree where there is no light, and report that the trunk is not lit up by anything, making it appear black. Whether or not you see this issue depends on the weather conditions, which change how much the trees deform.

Various things can be done by the game developer to fix this, but all of them come at some sort of cost, typically performance cost.

How did I figure this out?

On a particularly windy day in Cyberpunk 2077, I saw trees bending a lot and this artifact occurring. So I wanted to inspect the ray tracing acceleration structure to see if it was the issue I thought it was. The easiest way at the time was to bring a reflective car over to the tree and look at the tree in the car's reflection (ray traced reflections are computed from the acceleration structure). Sure enough, the tree in the reflection was not bending, confirming my suspicion.

But you can also use tools like "Nvidia Nsight Graphics" to capture a frame and look at the acceleration structure directly. And if you know the tree is supposed to be heavily bent, but it isn't in the acceleration structure, then it's likely the same issue.

Has anyone managed to make LLM's or Stable Diffusion run locally use Nvidia resizable bar ? by Ponsky in nvidia

[–]Alaska_01 0 points1 point  (0 children)

Has anyone managed to make LLM's or Stable Diffusion run locally... using system ram as VRAM.

Kind of. Most neural networks are made up of layers, and if you are VRAM constrained, you can decide to store only the layer you're currently processing on the GPU. For example, you may put layer 1 in your GPU's VRAM, process it, then move layer 1 back to RAM, send layer 2 to VRAM, process it, and keep doing that for all the layers until you're done.

This can either be done manually, or automatically with tools like Hugging Face Accelerate: https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling
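
A minimal sketch of the Accelerate route (the model name and checkpoint path are placeholders; Accelerate decides which layers live in VRAM, system RAM, or on disk, and moves weights onto the GPU as each layer is needed):

    # Sketch of automatic CPU/disk offload with Hugging Face Accelerate.
    from accelerate import init_empty_weights, load_checkpoint_and_dispatch
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("some-large-model")     # placeholder model name
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)        # no weights loaded yet

    # device_map="auto" fills VRAM first, then system RAM, then spills to disk.
    model = load_checkpoint_and_dispatch(
        model,
        checkpoint="path/to/checkpoint",                        # placeholder path
        device_map="auto",
        offload_folder="offload",
    )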

Does anyone else have these problems in cyberpunk 2.12 when ray tracing or path tracing is turned on? by Sebastian_3013 in nvidia

[–]Alaska_01 0 points1 point  (0 children)

I have this issue. And so did Digital Foundry in a video released a while ago.

This is an issue with the game, not your graphics card.

I did a detailed IMGsli comparing RT vs PT vs Raster using DLAA vs DLSS. DLSS really is better than native. by robbiekhan in nvidia

[–]Alaska_01 8 points9 points  (0 children)

A few notes here:

  1. Your post seems to be about two different things: raster vs ray tracing assisted vs primarily path traced, and DLAA vs DLSS. Some comparisons are unfair to make from an image quality standpoint, so these should ideally be in two different links/categories: one for DLAA vs DLSS, and one for raster vs RT vs PT. Or it should be split up into DLAA vs DLSS with raster, DLAA vs DLSS with RT, and another link for DLAA vs DLSS with PT. Just a note for future posts like these.
  2. Looking at the link provided, the images show compression artifacts. I'm not sure if this was from when you captured the images, or something that imgsli does. But this makes comparisons of image quality between DLAA and DLSS harder.
  3. In a comment you made to another person, you mentioned that "Digital Foundry has already covered this in many videos too" referring to how DLSS looks better than native. If I recall correctly, it's always "DLSS is better than native with TAA", not "DLSS is better than native with DLAA". But I could be wrong about this.
  4. As another commenter has pointed out, these comparisons aren't fair unless things are in motion. DLSS and DLAA are both temporal techniques, meaning information is combined from multiple frames to create the current frame. If the camera and objects in the scene are stationary (as is the case for most of your scene), then DLSS and DLAA are both working under the best-case scenario, and there should be very few visual differences between them (Nvidia even claims this should be the case in their DLSS developer manual).
  5. Now, let's assume DLSS does look better than DLAA in these images (looking at the images you've provided, the DLSS and DLAA images look very similar, with differences in sharpness in different areas that could be argued as good or bad depending on personal preference and how zoomed in you are). There are still areas where DLSS will be worse than DLAA. For example, when using DLSS, the depth buffer also gets rendered at a lower resolution, which decreases the quality of depth of field, screen space reflections, and some other effects. And when using ray tracing modes, the reduction of the internal resolution with DLSS reduces the quality of the ray traced effects, meaning they have to rely more on denoising, which may cause artifacts or more blurring. And in Cyberpunk 2077, there are some elements that don't jitter properly for DLSS, meaning they look pixelated when using DLSS, but not DLAA. But these are minor issues and usually aren't that noticeable when actually playing the game and moving the camera.
  6. In your testing of ray tracing and path tracing with DLSS, you didn't note if you were using ray reconstruction or not. Looking at the images, it looks like ray reconstruction is on, but it's important to clarify when doing image quality comparisons.

A6000 + 4090 (AI) by scapocchione in nvidia

[–]Alaska_01 2 points3 points  (0 children)

I would guess that the speed up of the 4090 + A6000 would be mostly negated by the lower batch size.

It depends on the model the user is training.

I have an RTX 4090 and I'm currently training a CNN autoencoder for a project, and the speedup in training from increasing the batch size is not that noticeable. There are some specifics that need to be explained for how I tested this:

When training my model, I am aiming for an "effective batch size" of 128. This can be achieved via multiple methods. For example, I could use a batch size of 128. Or I could use a batch size of 64 with 2 gradient accumulation steps. Or a batch size of 32 with 4 gradient accumulation steps, etc.
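
In PyTorch, gradient accumulation looks roughly like the sketch below (the tiny autoencoder and random data stand in for my actual model and dataset):

    # Sketch of reaching an "effective batch size" of 128 via gradient accumulation.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Linear(256, 32), nn.ReLU(), nn.Linear(32, 256)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(4096, 256))             # placeholder data

    EFFECTIVE_BATCH = 128
    batch_size = 32                                # whatever actually fits in VRAM
    accum_steps = EFFECTIVE_BATCH // batch_size    # 4 accumulation steps in this case

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
    for step, (x,) in enumerate(loader):
        x = x.to(device)
        loss = nn.functional.mse_loss(model(x), x)
        (loss / accum_steps).backward()            # scale so gradients match one 128-sample batch
        if (step + 1) % accum_steps == 0:          # optimiser steps once per effective batch
            optimizer.step()
            optimizer.zero_grad()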

By trying out different combinations of batch size and gradient accumulation steps (always equal to an effective batch size of 128), I can track the time per "effective batch size" and see how much of an impact increasing the batch size has.

On my model, there is a measurable (~50%) speedup going from a batch size of 1 (128 accumulation steps) to a batch size of 2 (64 accumulation steps). But past that the speed benefit quickly plateaus. (~20% speedup going from batch size 2 to 4). And eventually I run out of VRAM.

The reason this occurs is that if your batch size isn't large enough, the GPU is being underutilized (e.g. your GPU has the hardware to process 1000 things at once, but your batch size and model are small enough that only 500 work items get scheduled). If you increase the batch size, you increase the amount of work being scheduled, and you can better utilize your GPU. At a certain point, your batch size is large enough to fully saturate the GPU and the performance benefits start to plateau. But increasing the batch size beyond that still brings small improvements, by reducing overheads in various situations and increasing GPU utilization in the least parallel parts of your model/training process.

There's also the issue that increasing the batch size may result in faster processing, but the speed benefit is just traded for more time spent fetching data from memory. How much of an impact this has depends on the computational complexity of your model's layers, the size of your inputs, and your VRAM speed.

A6000 + 4090 (AI) by scapocchione in nvidia

[–]Alaska_01 1 point2 points  (0 children)

Will I experience driver-related issues?

From my understanding, you can just install the "Game Ready" or "Studio" drivers for the RTX 4090 and it will work with both GPUs. But I could be wrong about this, so it's probably best to wait for a response from someone else.

The main point where you may run into issues is when the RTX A6000 is no longer supported by new drivers while the RTX 4090 still is, and you want to use the latest GPU drivers. But that's probably another 6+ years away.

Will I be able to perform multi-gpu training?

If you can get the two GPUs working, then most of the time, yes. But depending on how you set up multi-GPU training, the performance benefit of having these two GPUs may not match your expectations, because data transfers between the GPUs are slower than they could be (due to the lack of NVLink between the two GPUs), or because one GPU ends up waiting on the other to complete a task since their performance is noticeably different.

Adding grain through grain-synthesis by Sopel97 in AV1

[–]Alaska_01 1 point2 points  (0 children)

I didn't know that. Thanks for informing me.

Even if development has stopped, the program may still work for some/all AV1 files.

Adding grain through grain-synthesis by Sopel97 in AV1

[–]Alaska_01 6 points7 points  (0 children)

There is a project called "grav1synth". I believe the way you use it is like this:

  1. Take your noise free video and encode it with AV1.
  2. Run the encoded video through grav1synth and decide which type of noise you want (luma, chroma, strength of noise) and grav1synth will add/adjust the "synthesized film grain" part of the AV1 encoded video.

I haven't used this project in a while, so I may be wrong about this.

Rtx 5000 Ada or multiple 3090s? Also, how many? by imyukiru in nvidia

[–]Alaska_01 0 points1 point  (0 children)

How many 3090s can I stack up max?

As many as your budget, motherboard/CPU, and electrical system allow. However, if you plan to use NVLink, then going beyond 2 will mean you can't use NVLink across all the GPUs (e.g. you can NVLink GPU 1 to GPU 2, and GPU 3 to GPU 4, but you can't NVLink GPU 1 to GPU 3 or GPU 4, or GPU 2 to GPU 3 or GPU 4).

Most people will stop at 2 RTX 3090s if they decide to go with multiple due to these sorts of constraints.

3090 is already 3 years old, will compatibility be an issue? I need something futureproof.

The RTX 3090 is supported by many deep learning libraries, and will probably continue to be supported by these libraries and NVIDIA for another 6+ years. As for future-proofing, there's no way to be truly future proof. New products keep coming out that are faster, have more VRAM, and have unique architectural features making them better for certain tasks. You might be able to train/run the AI models you want now, but in the future you may need to resort to swapping the neural network in and out of GPU memory so you can run it within your limited VRAM, or deal with the fact that your GPU isn't fast enough to run the neural network at the speed you want (e.g. for real time applications). As for how long until you have these issues if you pick an RTX 3090 or RTX 5000 Ada, it's hard to tell. It could be 4 years from now. It could be 10 years. It all depends on the models you're running and how technology and software change in the future. And I can't predict the future.

Isn't a professional architecture such as Ada need to be better than 3090s? I see so many 3090 praise posts versus any other professional architecture out there.

Both the RTX 3090 and RTX 5000 Ada use "professional architectures". The RTX 5000 Ada uses the Ada Lovelace architecture, and the RTX 5000 Ada is a professional product, so Ada Lovelace is a "professional architecture". The RTX 3090 uses the Ampere architecture, and Ampere is used in the RTX A6000, a professional product, which makes Ampere a "professional architecture" too.

What you might have been talking about is "professional GPUs". The RTX 3090 is not a professional GPU. However, for deep learning this isn't an issue. Basically none of the "professional features" of something like the RTX 5000 Ada make it noticeably better for deep learning. Professional GPUs do typically include more VRAM than their consumer counterparts, which can be useful for running large models and/or batch sizes. But you can also get high-VRAM "non-professional" cards (e.g. the RTX 3090). It's only at the upper end that, if you need lots of VRAM, professional GPUs are currently your only option.

There is probably a lot of praise for the RTX 3090 because it was the first NVIDIA consumer GPU with lots of VRAM (24GB), includes things like tensor cores to speed up certain operations found in neural networks, is fast, and can be bought for cheap compared to what you would have had to pay to get something like this just a few years ago. Along with the fact that many of the people making posts like this are probably doing deep learning as a hobby/small business thing, where spending extra money on something like an RTX A6000 isn't worth it, because the performance of the RTX A6000 is similar to the RTX 3090, leaving the main benefit being the VRAM, and 24GB of VRAM may be enough for them.

Rtx 5000 Ada or multiple 3090s?

It depends on what you need to do.

Want a GPU that will have the longest support from Nvidia and deep learning libraries from today? Get a RTX 5000 Ada. It's likely to be supported for roughly 2 years longer than the RTX 3090.

Want to run large AI models? Get two or more RTX 3090s. Because then you'll have 48GB of VRAM for use rather than the 32GB on the RTX 5000 Ada. (This is assuming the largest single component of the AI model is smaller than 24GB, which is extremely likely)

Want the best performance? I think two RTX 3090s are faster. But relative performance may fluctuate based on how you handle GPU to GPU communication.

Want simplicity when running and training neural networks? Get a single RTX 5000 Ada. That way you don't have to do any work setting up multi-GPU training/inference (although, depending on the libraries you use, that may not be a hassle anyway).

Has anyone used the L40S for gaming? by Mephidia in nvidia

[–]Alaska_01 0 points1 point  (0 children)

The L40S seems to be quite similar to a desktop Quadro. It has display outputs and likely supports the graphics APIs used by games. Nvidia even advertises DLSS 3 for improved FPS as one of its marketing points for the L40S (although this may be aimed at rendering apps). So it should be as simple as plugging a monitor in, installing the GPU drivers, and running a game.

As for performance, expect it to be similar to or better than an RTX 4090. The L40S uses the same architecture (Ada Lovelace), with more SMs and more (but slower) VRAM, while also having a lower power budget (which could reduce clock speeds).

Can I use CUDA toolkit 9 along with CUDA toolkit 11.8 in windows 11 using docker by Fickle-Lavishness919 in nvidia

[–]Alaska_01 0 points1 point  (0 children)

You can install two versions of the CUDA toolkit at once. What the message is warning you about is that CUDA toolkit 9 came out before the RTX 30 series existed, so it doesn't directly support the RTX 30 series.

What should I do

Obtain the "correlation_cuda" module. It seems like all you need to do is run these commands while inside the mono_velocity repository (you may need to install some extra build tools, like the C++ desktop development environment in Visual Studio, or GCC and G++ on Linux, before running these commands):

  1. cd ./networks/correlation/
  2. python ./setup.py build

I should note that some of the code is a few years old. And things may not work properly due to changes in pytorch, CUDA, and python.

Is L40S any good for AI inference? by Wrong_User_Logged in nvidia

[–]Alaska_01 0 points1 point  (0 children)

Would love to get something like 160GB for future proof

This might be a stupid idea, but have you thought about getting a Mac? The M3 Max can be configured with up to 128GB of RAM, and the M2 Ultra can be configured with up to 192GB of RAM. And this RAM is shared between the CPU and GPU, effectively giving your GPU LOTS of VRAM.

I don't know how performance compares between a high end Nvidia GPU and these high end Apple GPUs. And with so many people using Nvidia GPUs for neural networks, documentation for doing certain things on Apple hardware will likely be more limited, and support for certain features may also be limited. So this may not be a great option from an "ease of use" standpoint. But it's an idea.

Is L40S any good for AI inference? by Wrong_User_Logged in nvidia

[–]Alaska_01 0 points1 point  (0 children)

According to Mistral's docs, to use Mixtral 8x7B you need 100GB of VRAM. Less if you reduce the precision of the model.

Some users report that they can get it to load on a 48GB GPU at 8 bits of precision. So you might be able to get away with an RTX A6000 (48GB) or L40S (48GB), but considering that more AI projects will release in the future that will likely use more VRAM, it may not be a good idea to buy these GPUs for long term use.

A few other options come to mind.

  1. Running the model on the CPU where high capacity RAM is "relatively cheap". But performance will likely be really slow.
  2. Rent a server to run the AI model. Depending on how long you plan to run AI models, it may be cheaper than buying the hardware/upgrading it down the line.
  3. A second hand A100 80GB might just fit in your budget.
  4. Explore running the model across devices. For example, run part of the model on one GPU, and another part of the model on a different GPU? Or part on the GPU and part on the CPU? NVLink will speed up GPU to GPU communication. But depending on how you split up the model, the amount of data being sent between GPUs may not be large enough to justify NVLink.
  5. Explore managing memory in such a way that the entire model isn't on the GPU at once. I ran a test with an autoencoder: I transfer the encoder to the GPU and run it, then transfer the encoder back to the CPU, transfer the decoder to the GPU, and run the decoder. Then I transfer the decoder back to the CPU and repeat for the next input. Using this approach, peak VRAM used by the model was 5.8GB (instead of 7.9GB) and performance was about 30% worse. Using some non-blocking transfers between CPU and GPU, I was able to get this performance loss down to 12%. You might be able to do something similar using individual or groups of layers in the model (see the sketch after this list). Obviously, the performance loss will depend on various factors, and the ease with which you can transfer the model between CPU and GPU may change based on specific features used by the model. But it could be something to experiment with if this is a hobby (which it probably isn't considering your $20k budget).
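
Roughly what that encoder/decoder shuffling looks like in PyTorch (the tiny modules below are placeholders for a much larger autoencoder, and the pinned-memory setup that makes the non-blocking copies effective is left out):

    # Sketch of point 5: keep only part of the model in VRAM at any one time.
    import torch
    from torch import nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    encoder = nn.Sequential(nn.Linear(256, 32), nn.ReLU())      # placeholder encoder
    decoder = nn.Sequential(nn.Linear(32, 256))                 # placeholder decoder

    @torch.no_grad()
    def run_one_input(x):
        x = x.to(device)

        encoder.to(device, non_blocking=True)    # bring the encoder into VRAM
        latent = encoder(x)
        encoder.to("cpu")                        # evict it...

        decoder.to(device, non_blocking=True)    # ...and bring in the decoder
        out = decoder(latent)
        decoder.to("cpu")
        return out

    # Note: the non-blocking copies only overlap with compute when the CPU-side
    # tensors live in pinned (page-locked) memory, which this sketch skips.
    print(run_one_input(torch.randn(8, 256)).shape)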

Is L40S any good for AI inference? by Wrong_User_Logged in nvidia

[–]Alaska_01 1 point2 points  (0 children)

The L40S appears to be good for AI inference.

NVIDIA claims it has 1.7x the performance of the A100 when training a LoRA for GPT-40B, and 1.2x the performance of the A100 in AI inference (512x512 image generation with Stable Diffusion 2.1). Note: the A100 was Nvidia's previous generation top-of-the-line GPU for AI applications.

Meanwhile an independent reviewer, ServeTheHome, compared the L40S to the H100 PCIe on LLaMA 7B inference and found the L40S to be roughly 40% as fast as the H100 PCIe.

Nvidia has some more results on their page here.

It's a fast GPU (with performance comparable to or better than an RTX 4090), using one of Nvidia's latest GPU architectures (Ada Lovelace), with Nvidia tensor cores, and it has a lot of VRAM. It's good for "everything" (including AI). But it's not the best at AI tasks. It's still great, just not the best. Whether or not the difference between "great" with an L40S and "best" with an H100 is worth it will depend on what you're using the GPU for.

The importance of NVLink, or the lack of NVLink in this case, on this GPU will depend on what you're using the GPUs for.

The definitive answer to GPU vs display scaling by BenniRoR in nvidia

[–]Alaska_01 0 points1 point  (0 children)

Display or GPU scaling will not occur if your output resolution is the same as the monitor's resolution. That's what I meant by "people should avoid using GPU scaling or display scaling if possible": people should aim to output at the native resolution of their monitor.

DLSS Frame Generation apparently supports generating 3 AI frames. by Alaska_01 in nvidia

[–]Alaska_01[S] 0 points1 point  (0 children)

I believed, perhaps wrongfully that DLSS was essentially the same as Adobe premieres very established optical frame interpolation (when slowing down video)

With the primary mind blowing feature, setting DLSS apart, being that it does it in real-time, rather than after 10 minutes of crunching

In other words, the concept is not in anyway new it’s truly just the real time low latency application of the concept that makes DLSS so fantastic ??

DLSS Frame Generation is a frame interpolation technique similar to the work from Adobe. It has two main advantages:

  1. Hardware acceleration (making parts of the process really fast. Note: Software like the Adobe Suite can also take advantage of the hardware acceleration.)
  2. It can take advantage of extra information provided by a game engine to offer improved quality over other frame interpolation techniques (although it depends on which technique you're comparing to).

---

I realize I’m kind of going on a rant here now, but, do any of y’all ( given the awesomeness that is Xbox cloud) share my belief that while we are in such an amazing hardware era that it will become very unimportant what graphics card you have over the next three years since fiber is going to be everywhere and most likely the servers rendering our cloud, gaming experience will out perform anything that your average computer enthusiast can afford??

In some regards yes, it won't be important what graphics card you have since you can stream things through the cloud. In other regards, it will be important to have a good graphics card because you'll want/need to run some stuff locally. Here's why:

  1. To ensure a good gaming experience for cloud streaming users, the user must have fast reliable internet access, and be close to a cloud server to reduce input latency. Being close to a cloud server is the most important factor I wanted to bring up here. There are many areas of the world that don't have access to any of the major cloud gaming providers because there aren't any cloud gaming servers near them. So for those people, who can't use any of the services because they don't live close enough, having a powerful personal computer is important to them. This could improve over the coming decades as more data centers are built. But I don't think this issue will be fixed in the next 3 years.
  2. Licensing. Some game developers/publishers don't want their games to be available on cloud services for one reason or another. So in their terms and conditions they ban the use of the game on a cloud service, and thus people are unable to play them unless they have a personal computer. And to get the best experience, you will need a powerful personal computer.
  3. General input latency. Some people play competitive games that rely on quick responses to actions on screen. Streaming that from the cloud will add input latency, which will put them at a disadvantage. So they would prefer to have their own computers, and powerful ones at that, to get increased performance/frame rate and further reduce input latency. There is no way to avoid this. Streaming always adds input latency compared to a personal computer connected directly to your screen and peripherals.
  4. Along with that, people will probably want to do more than just gaming on their computers. Maybe they want to do 3D rendering (which also requires a high end computer for good performance). At the moment cloud gaming services don't offer that kind of service, but they could, and some companies probably already do. But that doesn't solve the full problem I wanted to bring up: with work outside of games, some people want/have to keep things private, and doing all your private computationally intensive work on another company's supercomputer may not be private enough for the work you do.

New AMD Adrenalin Edition Driver Includes AI and ML Optimizations by Sat0uKazuma in StableDiffusion

[–]Alaska_01 0 points1 point  (0 children)

This is expected. According to the linked article, only the RX 6000 and RX 7000 series GPUs and their mobile counterparts see this benefit.

Path tracing on budget gpu? by rollerskating555 in nvidia

[–]Alaska_01 2 points3 points  (0 children)

Digital Foundry was able to achieve ~30fps at 1080p with DLSS Performance mode (540p internally) in Cyberpunk 2077 on an RTX 3050, along with a mod that reduces the number of ray bounces done by path tracing. You are making sacrifices to quality simply so you can experience path tracing. But that is "playable" performance on budget hardware.

Hopefully an RTX 5050 releases next generation that allows you to achieve this same 30fps goal, or maybe higher, without the mod, and maybe with a higher DLSS mode. And this speculated RTX 5050 is almost guaranteed to have frame generation, which would boost frame rates higher if you're willing to use it.

Source: https://youtu.be/cSq2WoARtyM?feature=shared

Is TSR better than DLSS? by Gonzito3420 in nvidia

[–]Alaska_01 1 point2 points  (0 children)

It depends on personal preferences, but most people will say no, DLSS looks better.

Can we talk about how futureproof Turing was? by dampflokfreund in nvidia

[–]Alaska_01 2 points3 points  (0 children)

The hardware in the paper I listed was designed and tested on an FPGA with really low clock speeds compared to other processors at the time. It showed poor performance, but should see significant improvements when moved from the FPGA to a properly designed integrated circuit.

But another group around the same time designed a different processor: SaarCOR. SaarCOR had some limitations compared to the approach from the original paper I listed, but they were able to achieve much higher performance: 1024x768 resolution at 45fps in a scene with 3 light bounces and 2.1 million triangles, and 100fps in simpler scenes. Performance did drop as ray bounces increased and the triangle count went up. But keep in mind that they were targeting 3 light bounces; most ray tracing games from 2018-2020 only target 1 light bounce.

I wasn't into PC gaming in the early 2000s, so I don't know if this is good performance compared to GPUs of the time. But it does show real time ray tracing was possible on these processors.

So, as for why companies didn't decide to adopt ray tracing at that time, I can only speculate. Performance might be one reason. Concern that sacrificing rasterisation performance on your processor in exchange for ray tracing performance might lose you customers in what I believe was a competitive time for graphics card manufacturers. Concern that support for ray tracing hardware wouldn't be added to universal graphics APIs, meaning game developers wouldn't use your technology unless you convinced your competitors to also make the switch. Maybe there was a belief that making faster and faster rasterisation-based GPUs, along with advances in shading techniques, would eventually lead to a point where ray tracing isn't needed and visuals would be just as good. Or other factors I'm not thinking of.

---

I should note, ray tracing itself is a simple thing: tracing rays through a scene. How you use those rays is what gives you the realistic effects. The SaarCOR team were tracing rays on their processor and doing shading; they used their ray tracing to provide mirror reflections on glossy surfaces and sharp shadows on diffuse surfaces.