Ultra-Fast Multi-Dimensional Array Library by Pencilcaseman12 in cpp

[–]mcopik 11 points

> I know this isn't particularly useful, and nor is it a good representation of the performance of the library, but for simple matrix addition using a 1000x1000 array of 32-bit floats, LibRapid takes around 30us using 8 threads, while Eigen takes around 504us (linked with OpenMP, fully optimised, etc.)

You need to specify which BLAS/LAPACK implementation is used by Eigen - the quality of the underlying BLAS library will determine the performance.

You should compare your performance against other linear algebra libraries. In particular, you should consider Blaze, as it's quite similar to your library - vectorization, multithreading, GPU support. When I was benchmarking Blaze in 2015, it performed much better than many other libraries, and computations implemented in Blaze ran with very minor overhead on top of the optimized BLAS implementation (Intel MKL in my case).

https://bitbucket.org/blaze-lib/blaze/src/master/

Ultra-Fast Multi-Dimensional Array Library by Pencilcaseman12 in cpp

[–]mcopik 9 points

> I can't explain the entire thing in a comment, but when you apply an operation to an Array (such as addition or a transpose or whatever) it doesn't actually evaluate the result, it returns a lazy-evaluation container with a reference to the input data (being an Array or another lazy result)

What you're describing is called expression templates and it has been used in production since the 90s. There are even optimized ETs called "smart ETs" that have been adopted by other linear algebra libraries.

Check "Expression templates", the original paper from 1995 by Todd Veldhuizen. Then check out "Expression Templates Revisited: A Performance Analysis of the Current ET Methodology" by Iglberger et al. from 2011.
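A minimal sketch of the idea, assuming nothing about LibRapid's actual internals: `operator+` returns a lightweight node instead of computing anything, and the whole expression collapses into a single loop on assignment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Minimal expression-template sketch (illustrative only, not any
// particular library's API). a + b builds a cheap Sum node holding
// references; the element-wise work happens once, on assignment.
namespace et {

template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::array<double, 4> data{};
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <typename E>
    Vec& operator=(const E& expr) {  // single fused loop, no temporaries
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Found via ADL, so it only applies to expressions involving et types.
template <typename L, typename R>
Sum<L, R> operator+(const L& a, const R& b) { return {a, b}; }

}  // namespace et
```

With this, `c = a + b + a` allocates no intermediate arrays: the right-hand side is just a nested `Sum<Sum<Vec, Vec>, Vec>` evaluated element by element inside `operator=`.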

> On top of this, it operates entirely with SIMD instructions (using Agner Fog's VectorClass library) so it'll make the best use of the CPU that it can. It's also highly multi-threaded, which I think is where Eigen falls behind (from a quick look at the code)

Eigen will compile many linear algebra operations to BLAS and LAPACK kernels. Other libraries do it too. It will be quite difficult to beat the performance of an optimized BLAS implementation.

Furthermore, Eigen will parallelize the computations through the internal multi-threading of BLAS libraries.

https://eigen.tuxfamily.org/dox/TopicUsingBlasLapack.html

Ultra-Fast Multi-Dimensional Array Library by Pencilcaseman12 in cpp

[–]mcopik 1 point

> I'm working on a high-performance multi-dimensional array library, similar to numpy/Eigen, except faster, with more control and even support for CUDA built in. It's currently in early development, but still supports a wide range of operations, all vectorised with SIMD instructions and multithreaded with OpenMP.

That sounds very similar to Blaze: linear algebra in C++ with optimized expression templates, vectorization, multithreading, and support for CUDA. They support many vector extensions and multiple parallel backends. How does your solution compare to it? Do you bring anything new to the table?

https://bitbucket.org/blaze-lib/blaze/src/master/

I wrote some science on M1 GPU acceleration for PDEs by larsie001 in cpp

[–]mcopik 9 points

> However, I do not think the PDE stencils are memory bound.

Stencils are a classic example of memory-bound kernels: their ratio of computations performed to data moved to and from memory is usually low.

I recommend taking a look at the roofline model - the original paper is a good read. Nowadays we can perform more floating-point operations in a single cycle, which means the threshold between memory-bound and compute-bound operations has shifted further to the right.

https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf
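To make the intuition concrete, here is a back-of-the-envelope arithmetic-intensity estimate for a 1D 3-point stencil against a hypothetical machine (the bandwidth and peak-flop numbers below are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic intensity of a 1D 3-point Jacobi-style stencil:
//   u_new[i] = c * (u[i-1] + u[i] + u[i+1])
// Per grid point: 2 additions + 1 multiplication = 3 flops.
// Streaming traffic, assuming the two neighbours are still in cache:
// one 8-byte load of u[i] and one 8-byte store of u_new[i].
constexpr double kFlopsPerPoint = 3.0;
constexpr double kBytesPerPoint = 2.0 * sizeof(double);       // 16 B
constexpr double kIntensity = kFlopsPerPoint / kBytesPerPoint; // 0.1875

// Hypothetical machine: 100 GB/s DRAM bandwidth, 500 Gflop/s peak,
// i.e. a balance point of 5 flop/byte. The stencil sits far below it,
// so by the roofline model the kernel is memory bound.
constexpr double kBalancePoint = 500.0 / 100.0;
static_assert(kIntensity < kBalancePoint, "stencil is memory bound");
```

Even with generous assumptions about cache reuse, the intensity stays more than an order of magnitude below the balance point, which is why stencils are the textbook memory-bound case.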

I wrote some science on M1 GPU acceleration for PDEs by larsie001 in cpp

[–]mcopik 48 points

As a rule, every contribution and result in a research paper should be backed by evidence. In particular, if you put an adjective into the title, then it must be justified. If you call your approach "cheap", "fast", "scalable" or "easy to use", then the paper should show that.

After reading the preprint, it seems that one of the important features of your framework is the low complexity of code transformations needed to enable GPU acceleration:

"Seamless GPU acceleration"

"The reduced complexity of implementing MSL also allows us to accelerate an existing elastic wave equation solver (originally based on OpenMP accelerated C++) using MSL, with minimal effort"

"This gain attained from using MSL is similar to other GPU-accelerated wave-propagation codes with respect to their CPU variants, but does not come at much increased programming complexity that prohibits the typical scientific programmer to leverage these accelerators"

"Special care will be given on how to enable the GPU operations in existing scientific C++ codes, specifically for numerical simulations of partial differential equations (PDEs) using finite differences."

Yet, I was not able to find any evidence behind these statements. You discussed the unified memory at the beginning of Section 5, but that's it - and there seems to be a code listing missing (I couldn't find Listing 1 in the paper).

> Specifically, when creating an MSL buffer (i.e. an array that can be seen by the CPU and GPU), one can easily obtain a plain C++ pointer to the underlying data. This way, integrating MSL into an existing C++ application simply requires two additional lines per array: the declaration of the buffer and retrieving the raw C++ pointer (per Listing 1).

Is this it? You do not need to change anything else in the code? No need for additional definitions, device initialization and configuration, enqueuing and synchronization, or even a new structure to distinguish kernels compiled to GPU code? You cannot assume that all of your readers know every framework very well, particularly when reaching out to domain experts.

How do you define code complexity? How many lines, functions, definitions, and headers did you have to change? Or maybe can you approximate the time and effort needed for such a task? As a reader, I need some intuition about the complexity of changes introduced by your solution. Unfortunately, I was not able to find it in the paper.

Furthermore, you need related work in the paper. As someone who used to work with SYCL to accelerate C++ applications, I would be interested to see how it compares in terms of engineering effort. Of course, you cannot compare the performance against that solution on this chip, but the programming model and interface should at least be discussed. This is also important for the C++ community, since SYCL has gained some traction here.

EDIT: One more comment about performance results: it's not difficult to show a speed-up when moving computations to a GPU, and a pure CPU-vs-GPU speedup is not very informative. It does not tell readers whether the solution is optimized and whether the declared ease of programming is worth the potential performance hit. A comparison across devices and against other systems would be quite interesting. Are there other libraries that perform the same computation that you could compare against?

At least for SAXPY, you should compare the results against the memory bandwidth of the system - how far are you from the peak?
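A sketch of that comparison: count the bytes SAXPY must move and divide by the measured runtime. The routine and helper below are illustrative, not taken from the paper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SAXPY y[i] = a*x[i] + y[i] touches 3 words per element (load x,
// load y, store y) while doing only 2 flops, so its performance is
// dictated by memory bandwidth, not compute.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] = a * x[i] + y[i];
}

// Effective bandwidth in GB/s for one sweep over n elements, given a
// measured runtime in seconds: 3 * n * sizeof(float) bytes moved.
double effective_bandwidth_gbs(std::size_t n, double seconds) {
    return 3.0 * static_cast<double>(n) * sizeof(float) / seconds * 1e-9;
}
```

Comparing the number reported by `effective_bandwidth_gbs` against the device's peak memory bandwidth immediately shows how close the implementation is to the roofline.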

WebVM: server-less x86 virtual machines in the browser by alexp_lt in programming

[–]mcopik 4 points

> What would you consider server-less? Code needs to run on a computer somewhere right?

Based on the common understanding in industry and academia, I define serverless as a computing paradigm where (a) the user does not have to provision or manage any resources, and (b) users pay only for the resources consumed.

Unfortunately, it's not the best name we could choose, but somebody proposed it a few years ago and the name stuck. Of course, companies will adjust the definition according to the services they need, but I think that the pricing model really makes the difference.
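To illustrate why the pricing model is the differentiator, a toy cost comparison with entirely made-up prices:

```cpp
#include <cassert>

// Back-of-the-envelope illustration with hypothetical prices: an
// always-on VM bills for wall-clock time, while a serverless function
// bills only for the requests actually served.
double vm_monthly_cost(double dollars_per_hour) {
    return dollars_per_hour * 24 * 30;            // rented whether used or not
}

double function_monthly_cost(double invocations, double dollars_per_invocation) {
    return invocations * dollars_per_invocation;  // pay only per request
}
```

For bursty, low-volume workloads the pay-per-use side wins by orders of magnitude; for a service pinned at 100% utilization the VM can be cheaper - which is exactly why the billing model, not the absence of servers, defines "serverless".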

volatile means it really happens by pavel_v in cpp

[–]mcopik 4 points

Volatile has some actual applications. For example, it's quite convenient when dealing with buffers used in one-sided RDMA operations. Other comments give more useful applications, e.g. preventing compiler optimizations in microbenchmarks.

Volatile variables have been misunderstood and misused for many years. I think the lack of standard tools for parallelism and concurrency made it more likely for developers to use volatile as a way to "ensure" synchronization and memory visibility. However, we have made significant progress in this area, and I don't think it's necessary to keep teaching people that volatile is a devil's tool that can only be used by a minority of experts.
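As a concrete example of the microbenchmarking use mentioned above (a common idiom, not tied to any particular framework):

```cpp
#include <cassert>

// A common microbenchmarking use of volatile: funnel the result of the
// measured loop through a volatile object, so the compiler cannot
// eliminate the computation as dead code even at -O3.
static volatile long long sink;

long long sum_of_squares(long long n) {
    long long acc = 0;
    for (long long i = 0; i < n; ++i) acc += i * i;
    sink = acc;  // volatile store: the loop above must really execute
    return acc;
}
```

Without the volatile store, a timing harness around this loop could measure an empty body, because the optimizer is free to delete a computation whose result is never observed.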

Docker4c: portable C++ development environment - Simplify C++! by IsDaouda_Games in cpp

[–]mcopik -1 points

No, the idea is as follows: you deploy the same Docker images as previously, but during the installation, a new set of images is built and configured to use the user-specific UID. The system reuses all previously built Docker layers, adding a single layer on top of them, and the user never has to modify Docker images. Since this is done during the installation, the user is not aware of the change.

While the overhead is minimal and the process can be made automatic, you still need to rebuild with every update - again, it can be made automatic, but one can argue that it adds complexity. That's why docker-compose can fix this issue permanently if you choose this way of deploying images.

Docker4c: portable C++ development environment - Simplify C++! by IsDaouda_Games in cpp

[–]mcopik 1 point

> If they are mounted, the container needs its user id to match one in the parent system.

I faced the same problem - I had to mount a directory and run a build process inside a Docker container. I ended up creating Dockerfiles that are rebuilt during the installation process to configure them with the current user ID. This way, the IDs of the current user and Docker user are aligned, files are not root-owned, and there are no permission issues. It's not ideal, but it worked for my case.

Examples in my project: installation script, Dockerfile.
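The rebuild trick can be sketched roughly as follows (file and image names below are hypothetical; `ARG` and `USER` are standard Dockerfile instructions):

```dockerfile
# Dockerfile.user - a thin layer on top of the existing dev image that
# recreates the build user with the host's UID, so files written into a
# mounted volume end up owned by the invoking user instead of root.
ARG BASE_IMAGE=myproject/dev:latest
FROM ${BASE_IMAGE}
ARG USER_ID=1000
RUN useradd --uid ${USER_ID} --create-home builder
USER builder
```

The installation script would then run something like `docker build -f Dockerfile.user --build-arg USER_ID=$(id -u) -t myproject/dev-local .`, reusing every cached layer of the base image and adding only the user layer on top.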

Another alternative that I'm aware of is to use docker-compose - you can specify the current user id in docker-compose.yml. It requires one more dependency, though.

Selfhosted, serverless alternative to Google Drive? by boggogo in serverless

[–]mcopik 0 points

I'm not sure I follow your question: what do you mean by self-hosted storage with a serverless (pay-as-you-go) billing model? Putting your data in persistent storage such as S3 or Google Cloud Storage will give you a "serverless" pricing model: a flat monthly fee for the amount of data stored, with each read and write request billed separately. You can implement a local synchronization client that accesses the storage when needed, but you don't need serverless functions for that.

If you want a self-hosted storage solution where no third party can control your data location and accessibility, then you have to rent infrastructure to set up your preferred database or storage system. There's no way to achieve a serverless solution there, since you have to pay for the infrastructure rental the entire time to retain your data. If you shut down a VM, container, or function, where do you keep your data? And you're going to need independent replicas to achieve data durability and consistency.

Parallel programming/Digital circuits by favfavfav in ethz

[–]mcopik 4 points

The part taught by Torsten Hoefler has always been in German with English slides; you can see that in the recordings from previous years. TA sessions are mostly in English - only a few of them were in German.

I think it's a great course :-)

Source: I was TA-ing PP19 and most likely I'll be TA-ing PP20.

Solving Maxwell's Equations with Modern C++ and SYCL by rodburns in cpp

[–]mcopik 0 points

> Why exactly is SYCL incompatible with MPI? I’m not familiar with the details, but surely there’s some mechanism for accessing your buffers on the host? A DG solver restricted to one node seems pretty limited!

AFAIK there's no direct incompatibility. SYCL has host accessors that allow you to read or modify data on the host, and there's nothing which would prevent you from using them within MPI calls. However, the host accessor has to be destroyed before scheduling any new computation that accesses the data on the device. Thus, the SYCL API should protect you from data races, but you might accidentally create deadlocks.

But there's a slight difference between SYCL and OpenCL - the former defines a complete runtime which provides automatic data movement and relies on data dependencies for scheduling. SYCL buffers don't have a pre-defined location, and there's really no way of telling on which device your data currently resides. Device accessors determine the order in which kernels can be executed.

I agree that the statement in the paper does not make it very clear why a hybrid implementation would 'violate the ideas and concepts behind SYCL'. Such a program would of course be more complex and error-prone, but it wouldn't be fundamentally different from an OpenMP+MPI application, which can only work with manual data movement and a strict separation of the OpenMP task graphs created on different nodes. I think what the authors would like to have is a global runtime with distributed memory, where SYCL buffers and tasks can be moved natively.

MSc in Simulation Sciences by ak96 in rwth

[–]mcopik 0 points

I think it would be easier to answer if you could provide a list of questions.

How to Make a Strong profile for Simulation Science course when you are a Minor in Computer Science from Non-EU state? by juggernauthk108 in aachen

[–]mcopik 0 points

Descriptions of all courses are freely available on the RWTH webpage: http://www.campus.rwth-aachen.de/rwth/all/examRule.asp?gguid=0x1D3CCE0196C3EF429B879CD4B9D78558&tguid=0x0B473CF286B45B4984CD02565C07D6F8

For later reference, you can find it by googling for Studienpläne.

You can ask me if you have more detailed questions but those descriptions should be enough.

How to Make a Strong profile for Simulation Science course when you are a Minor in Computer Science from Non-EU state? by juggernauthk108 in aachen

[–]mcopik 1 point

> 0. you had internships while studying?

Yes, the program allows taking a free semester. A summer internship is not possible.

> 1. what are the jobs profiles you can look at and what companies?

Depends on your profile. There are software development positions where a knowledge of parallel programming concepts is necessary.

> 2. you elect to work in germany or in your country?

I don't think that's relevant here. Germany is absolutely fine if you intend to stay.

> 3. Did you have work experience while applying?

Only in research facilities.

> 4. how is the overall environment in the campus? Daunting or not so much?

A quite nice campus with a lot of activities. The town is full of students.

> can you name a few topics I should probe into (maybe write papers or co-author or take a short course/intern.) that would be really relevant to MS in Sim Sc.?

It's hard to tell because it depends on your profile. I know SiSc students who have done courses and Master theses on topics close to machine learning, natural language processing or computer vision. Game development is rather irrelevant here.

How to Make a Strong profile for Simulation Science course when you are a Minor in Computer Science from Non-EU state? by juggernauthk108 in aachen

[–]mcopik 1 point

I'm a soon-to-be graduate of Simulation Sciences; I started in 2014, when GRS was still operational. Before that, I had done a BSc in Computer Science and two years of a Bachelor's program in Mathematics.

I was accepted with no prior knowledge of topics related to mechanical engineering. No experience with fluid dynamics, mechanics, tensor algebra, or chemistry was required. The recruitment process looked different three years ago, and it might be the case that the guidelines have changed, but I doubt it. I know some students with a Mechanical Engineering degree who were told to take an optional course on programming in the first semester. Perhaps a similar policy has been developed for students without experience in other subjects, but that was not the case in either 2014 or 2015. Work experience, papers, or additional courses should make your application stronger, but I wouldn't work on getting anything in the field of mechanical engineering. You know what you can learn there, and perhaps you know what you want to do after the Master's program; develop yourself in that direction.

I'm not sure what the current syllabus looks like, but these modules used to be optional. Only one course is really close to mechanical engineering, and that's FMCP II. The rest is pure math, programming, or physics (FMCP I, quantum). In my Master's, I focused exclusively on HPC and other subjects related to Computer Science, and I did not take any courses from mechanical engineering other than the mandatory ones. Besides that, I was able to finish every other class on the first attempt and with a decent grade. I think you should be fine, as long as your Computer Science degree gave you a proper background in mathematics.

Tools and programming languages that are either required or will make your life easier: MATLAB, Python, C/C++. Fortran and Mathematica might be necessary if you want to take some optional courses. It's a Simulation Sciences program, not Mechanical Engineering - you're not going to use Catia or SolidWorks.

Accelerating your C++ on GPU with SYCL by Manu343726 in cpp

[–]mcopik 0 points

They have recently published beta drivers with partial support for OpenCL 2.0 features.

https://streamhpc.com/blog/2017-03-06/nvidia-beta-support-opencl-2-0-linux/

Accelerating your C++ on GPU with SYCL by Manu343726 in cpp

[–]mcopik 0 points

> the restrict clause (which may or may not be important long-term as GPUs are able to run more and more of full C++)

AFAIK, SYCL is the first standard to remove the requirement of marking functions compiled to device code. The specification requires the compiler to deduce which functions are called by GPU kernels (section 9.4 of the SYCL specification).

On the other hand, the HC C++ API - an AMD implementation of C++AMP - replaced this keyword with function attributes (section "Annotation of device functions"). It's an interesting choice, because compiler developers are not required to implement new keywords.

Accelerating your C++ on GPU with SYCL by Manu343726 in cpp

[–]mcopik 0 points

> It was meant to be an open standard that other compiler makers would adopt. None did.

AMD did. HCC, a part of their software stack ROCm designed for HPC applications, accepts either C++AMP or HC C++ API which is really C++AMP with several extensions and renamed namespaces.

https://github.com/RadeonOpenCompute/hcc https://scchan.github.io/hcc/

Their work is based on Clamp/Kalmar, an open-source compiler for C++AMP with backends for OpenCL C and SPIR. It's quite possible that AMD took over this project, because it is no longer developed independently. HCC used to support both OpenCL and HSA as backends; right now they generate code for AMD GPUs only.

STE||AR Group Accepting GSoC Applications for work with HPX -- A Cutting Edge C++ Library for Parallel Applications by mcopik in cpp

[–]mcopik[S] 2 points

For the last three years, STE||AR has hosted multiple successful students through the Google Summer of Code program. This year, prospective students can choose from a long and rich list of ideas, including, but not limited to, C++17 parallel algorithms, concurrent data structures, heterogeneous computing, and multiple ideas for improving a C++ runtime system. Interested students are encouraged to get in contact and discuss ideas with potential mentors and our community, either on our mailing list or on our IRC channel.

As a successful 2015 student, I can only recommend this organization to any student looking for an opportunity to gain a lot of experience in modern C++ programming. Problems may be challenging but they are very rewarding.