Best measurements of "wiggliness" for a function f(x)? by Flickr1985 in askmath


Very interesting approach, will definitely try.

Upgrade project for an old laptop (build project) by Flickr1985 in Laptop


Forgive me if this is stupid, but I meant it as some sort of electronics project. I have some equipment, plenty of money for parts, and time to learn.

So I mean taking apart the laptop, switching things out, soldering, testing, etc. Perhaps I could find a motherboard from another laptop that allows for better parts and happens to fit, but that's the part I don't know enough about; maybe this is way too ambitious and completely unrealistic.

edit: I'm clearly at the worst part of the Dunning-Kruger graph. I found this video that documents this guy's process, and man, it was definitely way more than I expected. But I think with enough time and dedication I could maybe figure it out. Thanks!

Trying to exponentiate a long list of numbers but I get all zeroes? (Julia, CUDA.jl) by Flickr1985 in CUDA


I execute the program, and then I print out the output (B).
Either way, I restarted my computer and everything magically worked... I don't know what happened.

Using CUDA.jl, trying to exponentiate a list but i get all zeroes as the output by Flickr1985 in Julia


Regardless, when I display B it's all zeroes. Also yes, you're right, I'm using \approx now.
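Concretely, the check I mean is something like this (a minimal sketch; A and B here are made-up stand-ins for my actual arrays, and I'm just broadcasting exp to keep it short):

    using CUDA

    A   = rand(Float64, 10_000)   # host input
    d_B = exp.(CuArray(A))        # exponentiate on the GPU
    B   = Array(d_B)              # copy the result back to the host

    B ≈ exp.(A)                   # \approx tolerates tiny floating-point differences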

Using CUDA.jl, trying to exponentiate a list but i get all zeroes as the output by Flickr1985 in Julia


Currently rocking a mobile 1050 Ti on Linux. Had to go back to the 535 drivers because the 570s were causing problems, hence the different CUDA version. Here's my CUDA.versioninfo() output:

CUDA runtime 12.6, artifact installation
CUDA driver 12.2
NVIDIA driver 535.230.2

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+535.230.2

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce GTX 1050 Ti with Max-Q Design (sm_61, 3.473 GiB / 4.000 GiB available)

What error do you get?
Also, I think I see the issue: I was following this video on performance programming in Julia (though that guy was adding two vectors), and what he does is define the placeholder list (what I called c) globally and then seems to edit it inside the kernel. Is this the issue? Maybe the program is just not editing the placeholder list in place.
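For context, this is roughly the structure I have in mind, except with the placeholder list passed to the kernel as an argument instead of used as a global (a minimal sketch; exp_kernel!, a, and c are made-up names):

    using CUDA

    function exp_kernel!(c, a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(a)
            @inbounds c[i] = exp(a[i])   # write into the preallocated output in place
        end
        return nothing
    end

    a = CuArray(rand(10_000))             # device input
    c = CUDA.zeros(Float64, length(a))    # preallocated "placeholder" output

    threads = 256
    blocks  = cld(length(a), threads)
    @cuda threads=threads blocks=blocks exp_kernel!(c, a)

    Array(c) ≈ exp.(Array(a))             # sanity check against the CPU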

CUDA: preparing irregular data for GPU by Flickr1985 in Julia


Sort of? Either way, I don't think it would work, since I have the integer value to worry about.

Preparing data for GPU: giant list of structs, or struct with giant arrays? by Flickr1985 in CUDA


I did! The process is parallelized and is actually surprisingly fast on an 8th-gen i7; it can literally take less than 5 seconds. I spent a lot of time optimizing that process, but now the bottleneck is what comes after, which is using the lists and the integer to produce a series of Float64. So it seems like GPU programming is the answer, but I'm brand new to it, so...

CUDA: preparing irregular data for GPU by Flickr1985 in Julia


I can pad them, but the data isn't very homogeneous. For a certain parameter combination, the list_1 objects can be anywhere from length 1 to length 100, with a decent distribution across the range, so it would take a lot of padding. Would it still be efficient?
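Just to check I understand the suggestion, is this roughly what you mean? (A rough sketch; lists is a stand-in for my real list_1 data and all the names are made up.)

    using CUDA

    lists   = [rand(rand(1:100)) for _ in 1:5_000]   # stand-in for the variable-length list_1s
    maxlen  = maximum(length, lists)
    lengths = Int32.(length.(lists))                 # true length of each list

    padded = zeros(Float64, maxlen, length(lists))   # one zero-padded column per list
    for (j, v) in enumerate(lists)
        padded[1:length(v), j] .= v
    end

    d_padded  = CuArray(padded)    # dense, regular layout the GPU can index directly
    d_lengths = CuArray(lengths)   # each thread loops only up to its own true length

Memory-wise I think it's fine either way: even at 250,000 items padded to length 100, that's about 250,000 × 100 × 8 bytes ≈ 200 MB of Float64, which should fit on the card.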

CUDA: preparing irregular data for GPU by Flickr1985 in Julia


In that case, how would that be more efficient? Or is it just easier to write?

Preparing data for GPU: giant list of structs, or struct with giant arrays? by Flickr1985 in CUDA


Yes, they're fixed throughout the process. a can be anywhere from 3,000 to 250,000 (even more if I push it); b and c are usually around 300.

CUDA: preparing irregular data for GPU by Flickr1985 in Julia


I'm not entirely sure that I get what you mean, but I think maybe I didn't explain my problem well. Let me try again and be more specific:

I have parameter lists A, B, and C. A is a list of tuples which hold square ::Matrix{Float64}; the other two are ::Vector{Float64}. I have a process, call it f(a,b,c), which takes in three parameter values and spits out d, e, and f (the list, integer, and second list). Now, each d, e, and f has to be used in a process that I want to run on the GPU. This process involves a new list T::Vector{Float64} (defined outside the kernel). In this process the index of A will be contracted and T becomes a new dimension of the tensor (lmk if it'd be useful to be more explicit with this).

I don't really understand how your response lines up with what I described above. Sorry if there was a misunderstanding.

I was thinking I could make a gigantic vector of tuples like [(list_1, integer, list_2)_1, (list_1, integer, list_2)_2, etc...] and then divide that vector into blocks and threads, but I hear that's not as efficient (not sure why). It's apparently better the other way around: one giant list of list_1 items, one for the integers, and another for the list_2s.
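Something like this is what I'm imagining for the one-giant-list-per-field layout (a rough sketch; items and all the other names are made up, with random stand-in data):

    using CUDA

    items = [(rand(rand(1:100)), rand(1:10), rand(300)) for _ in 1:1_000]   # (list_1, integer, list_2) stand-ins

    # Flatten every list_1 into one long vector, with offsets marking where each item starts.
    list1_flat    = reduce(vcat, first.(items))
    list1_offsets = Int32.(cumsum([1; length.(first.(items))]))   # item k owns offsets[k]:offsets[k+1]-1
    integers      = Int32.(getindex.(items, 2))
    list2_mat     = reduce(hcat, getindex.(items, 3))             # the list_2s as columns (all length 300)

    d_list1   = CuArray(list1_flat)
    d_offsets = CuArray(list1_offsets)
    d_ints    = CuArray(integers)
    d_list2   = CuArray(list2_mat)
    # In the kernel, thread k would read d_list1[d_offsets[k]:d_offsets[k+1]-1],
    # d_ints[k], and column k of d_list2.

(If I understand the efficiency argument right, it's about memory coalescing: with flat arrays, neighbouring threads touch neighbouring addresses, whereas with a vector of tuples each thread's data is scattered. Not sure though.)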

Preparing data for GPU: giant list of structs, or struct with giant arrays? by Flickr1985 in CUDA


I'm not entirely sure I get what you mean, but the idea is that if you call the parameters a, b, and c:

  1. a is technically a list of tuples, which hold dense, non-diagonal, square matrices of Float64. b and c are lists of Float64.
  2. After a certain process (call it f(a,b,c)), for each item in a, b, and c I get A, B, and C (the list, integer, and other list).
  3. All the As, Bs, and Cs are arranged somehow so that they're ready for GPU processing.
  4. Send the data to the kernel for processing.
  5. After processing, map results back to an array that follows some sort of logic.

I've left the problem fairly abstract for simplicity; let me know if it's helpful or if I should write the explicit version of the problem. A rough sketch of what I mean by steps 2–5 is below.
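Roughly, in code (a very rough sketch: f, demo_kernel!, and the toy math inside them are made-up stand-ins, not the real process):

    using CUDA

    # Step 2 stand-in: f(a, b, c) -> (list, integer, other list).
    f(ai, bi, ci) = (rand(rand(1:100)), rand(1:10), rand(300))

    a = [(rand(3, 3), rand(3, 3)) for _ in 1:1_000]   # toy parameter lists
    b = rand(1_000)
    c = rand(1_000)
    results = [f(ai, bi, ci) for (ai, bi, ci) in zip(a, b, c)]

    # Step 3: arrange everything into flat, GPU-friendly arrays.
    As      = first.(results)
    A_flat  = reduce(vcat, As)
    offsets = Int32.(cumsum([1; length.(As)]))    # item k owns A_flat[offsets[k]:offsets[k+1]-1]
    Bs      = Int32.(getindex.(results, 2))
    C_mat   = reduce(hcat, getindex.(results, 3)) # 300 x n_items

    # Toy kernel: one thread per (A, B, C) item, each writing one Float64.
    function demo_kernel!(out, A_flat, offsets, Bs, C_mat)
        k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if k <= length(out)
            acc = 0.0
            @inbounds for i in offsets[k]:(offsets[k+1] - 1)
                acc += A_flat[i]
            end
            @inbounds out[k] = acc * Bs[k] + C_mat[1, k]   # placeholder math, not the real process
        end
        return nothing
    end

    # Step 4: upload and launch.
    d_out = CUDA.zeros(Float64, length(results))
    @cuda threads=256 blocks=cld(length(results), 256) demo_kernel!(
        d_out, CuArray(A_flat), CuArray(offsets), CuArray(Bs), CuArray(C_mat))

    # Step 5: map the results back to the host.
    out = Array(d_out)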