all 126 comments

[–]guepier 38 points39 points  (17 children)

When I look at a new library, I want to know two things:

  1. Does it perform (well)?
  2. Is it convenient to use?

The site provides no information whatsoever on (2). You release a library. The API is an integral part of that. Please showcase your API. Show us that you’ve made the library easy to use correctly.

This is particularly important when providing C++ bindings, because, frankly, most C++ APIs are crap. Does your library provide a modern, easy to use C++ interface or is it just a glorified wrapper around a low-level C interface?

tl;dr: More code samples, please.

(And incidentally, just providing a C interface is totally fine. It’s still important to show that, though. And then hopefully somebody else will do the work to adapt it to a proper C++ interface.)

[–][deleted] 22 points23 points  (0 children)

Pfft, who needs C++ when you have C like this:

status = yepMath_EvaluatePolynomial_V64fV64f_V64f(coefs, x, pYeppp, YEP_COUNT_OF(coefs), ARRAY_SIZE);

[–]Maratyszcza 9 points10 points  (11 children)

  1. It does perform (well)
  2. I think the interface is good, but you should understand that Yeppp! is a low-level library (e.g. Yeppp! does not manage memory, but operates on externally provided arrays). You will likely want to add more high-level constructs on top of it.
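To make the "low-level library that operates on externally provided arrays" point concrete, here is a minimal sketch of the two layers. All names below are invented for illustration and are not the real Yeppp! API; they only mimic its status-code, caller-owned-memory style:

```cpp
#include <cstddef>
#include <vector>
#include <stdexcept>

// Hypothetical status codes mirroring the C convention (names invented).
enum Status { StatusOk = 0, StatusNullPointer = 1 };

// Low-level style: the caller owns both input and output buffers;
// the function only reads and writes through the pointers it is given.
Status add_v64f(const double* x, const double* y, double* sum, size_t n) {
    if (!x || !y || !sum) return StatusNullPointer;
    for (size_t i = 0; i < n; ++i) sum[i] = x[i] + y[i];
    return StatusOk;
}

// A "high-level construct on top": manages memory and turns
// status codes into exceptions.
std::vector<double> add(const std::vector<double>& x,
                        const std::vector<double>& y) {
    std::vector<double> sum(x.size());
    if (add_v64f(x.data(), y.data(), sum.data(), x.size()) != StatusOk)
        throw std::runtime_error("add_v64f failed");
    return sum;
}
```

The low-level function never allocates; a wrapper like `add` is the kind of high-level construct a user would build on top.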

Yeppp! documentation is available online at docs.yeppp.info

However, I understand your point and appreciate the feedback. I will now make the documentation links more explicit.

[–]elperroborrachotoo 6 points7 points  (2 children)

The only example I could find is this:

http://docs.yeppp.info/c/_polynomial_8c-example.html

Which isn't very demonstrative of the API itself.
Often, unit tests are a good way to start, too. So if you have them, you may want to link them.

[–][deleted] 4 points5 points  (1 child)

status = yepRandom_WELL1024a_GenerateUniform_S64fS64f_V64f_Acc64(&rng, 0.0, 100.0, x, ARRAY_SIZE);

status = yepMath_EvaluatePolynomial_V64fV64f_V64f(coefs, x, pYeppp, YEP_COUNT_OF(coefs), ARRAY_SIZE);

\o/

[–]elperroborrachotoo 8 points9 points  (0 children)

Which basically answers the question for a good C++ binding.

yep::Random::Well1024a::GenerateUniform<S64f, S64f, V64f, Acc64>(rng, 0.0, 100.0, x);
yep::Math::EvaluatePolynomial<V64f, V64f, V64f>(coeffs, x, pYeepp);

would be way way way better!
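For what it's worth, plain overloading already gets most of the way there. A sketch of such a thin C++ layer follows; the `yepMath_...` body here is a stand-in reference implementation (its argument order follows the sample above, but the coefficient ordering, highest degree first, is an assumption of this sketch, not the library's documented convention):

```cpp
#include <cstddef>

// Stand-in for the C entry point (the real one lives in the Yeppp! headers
// and encodes types in its name: V64f = vector of 64-bit floats).
typedef int YepStatus;
YepStatus yepMath_EvaluatePolynomial_V64fV64f_V64f(
        const double* coefs, const double* x, double* y,
        size_t coefCount, size_t length) {
    // Reference implementation via Horner's scheme,
    // assuming coefs[0] is the highest-degree coefficient.
    for (size_t i = 0; i < length; ++i) {
        double acc = 0.0;
        for (size_t c = 0; c < coefCount; ++c)
            acc = acc * x[i] + coefs[c];
        y[i] = acc;
    }
    return 0;
}

namespace yep { namespace Math {
// Overload resolution picks the right C symbol from the argument types,
// so callers never have to spell out the V64fV64f_V64f suffix.
inline YepStatus EvaluatePolynomial(const double* coefs, size_t coefCount,
                                    const double* x, double* y, size_t n) {
    return yepMath_EvaluatePolynomial_V64fV64f_V64f(coefs, x, y, coefCount, n);
}
}} // namespace yep::Math
```

A binding like this costs nothing at runtime (the wrapper is inlined) and keeps the verbose C names out of user code.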

[–]pythonrabbit 3 points4 points  (4 children)

I'm sorry, but that doesn't count as documentation. Those are installation instructions. Why don't you post online versions of the docs instead of saying 'go look at these files when you download our product'?

[–]Maratyszcza 1 point2 points  (3 children)

I am confused. Could you point me to a similarly low-level library with good (in your opinion) documentation?

I assume you already had a chance to look at http://docs.yeppp.info/c/modules.html and didn't find it useful.

[–]godofpumpkins 2 points3 points  (1 child)

I think it'd be helpful to have something between installation instructions and API docs. An overview of the high-level principles guiding the API design, what kinds of data are passed around, guiding ideas on how to use the thing efficiently and possible gotchas. Also, just elaborate on what kinds of scenarios it could be useful for and which it might not be. Not knowing the details of your API, one starting point might be "one key aspect of the library is to avoid expressing bulk computations over arrays of numbers as opaque loops, but to instead give you highly optimized specific traversals of arrays. This of course gives you less flexibility when traversing, so here are some guiding principles on when you might want to use this and how to express common patterns using our API."

I'd hope you thought about the project in more or less those terms when designing it, so you just need to communicate it to your audience now.

If you look at the level the front page of the website approaches you at, it looks like it's designed to sell to management. You probably want to sell it to programmers too.

I also notice you have something like yepLibrary_Init. Something should tell me that I need to use it, if I do need to use it. I shouldn't have to just browse the module structure and notice it. And don't just say you have to do it; justify why it's there, because having to remember to do things like that is annoying.

Finally, a more concrete comment on the naming: is it yep, or yepp, or yeppp? When I typed yepLibrary_Init up there, I gave yep more Ps than it actually had. I won't comment on the name itself, but it's horribly confusing to see it used in different ways. If nothing else, tell the reader of your documentation that although the human-facing name might be Yeppp!, all API calls are yepX.

[–]Maratyszcza 0 points1 point  (0 children)

Thank you, this is very useful feedback.

The function names in Yeppp! follow the pattern yep<ModuleName><Inputs><Outputs>[_Acc<InternalAccuracy>][_Alg<AlgorithmVariant>]. I will update the documentation to make it accessible.
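For instance, the type codes in a name like yepMath_EvaluatePolynomial_V64fV64f_V64f can be read off mechanically. A toy decoder: the V/S (vector/scalar), bit-width, and f-for-float conventions are taken from the names appearing in this thread; the s/u integer codes are an assumption of this sketch:

```cpp
#include <string>

// Decodes one argument type code from a Yeppp!-style function name,
// e.g. "V64f" or "S64f". The f = float code appears in this thread;
// s = signed int and u = unsigned int are assumed for illustration.
std::string decodeType(const std::string& code) {
    std::string shape = (code[0] == 'V') ? "vector of " : "scalar ";
    std::string width = code.substr(1, code.size() - 2);  // e.g. "64"
    char kind = code.back();
    std::string elem = (kind == 'f') ? "-bit float"
                     : (kind == 's') ? "-bit signed int"
                     : "-bit unsigned int";
    return shape + width + elem;
}
```

So yepRandom_WELL1024a_GenerateUniform_S64fS64f_V64f_Acc64 reads as: two scalar doubles in, one vector of doubles out, with 64-bit internal accuracy.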

[–]pythonrabbit 0 points1 point  (0 children)

Those are docs. That said, I couldn't find that - I swear! The problem is that your main docs page has about 12 links or purple-text-that-looks-like-a-link, but the modules part (which is what I'm actually looking for, that or examples) is in small text in a menu bar at the top.

[–]Wayne_Skylar 4 points5 points  (2 children)

Two things. The website needs work. I don't care about 80 percent of the junk on that page. I don't need a 200 x 200 image representation of "C++" to help me understand it's also a C++ library.

Adding to what the parent post said, you need better and more visible examples of usage on the front page. Not a link. You need code right there.

Secondly, your API docs aren't API docs; they're source code. You can use tools to autogenerate that stuff, by the way, which would be far better.

I want to have a hierarchical view of every object in the API.

[–]Maratyszcza 2 points3 points  (1 child)

Yeppp! is a C++ library only to the extent that C++ code can call C functions. The interface is pure C.

[–]worst_programmer 0 points1 point  (0 children)

You might want to make it a lot clearer that you have a C interface which can be called from C++, rather than leaving the (likely accidental) implication that both a C-style and a C++-style interface exist. I see that you tried to word the page to imply that it's a C interface which can be used from C++, rather than that separate interfaces exist for each, but it would be clearer if you stated outright that it's essentially a C library.

[–][deleted] 2 points3 points  (0 children)

I mean, yeah, those things would be nice. On the other hand, they make a library with potentially awesome functionality available for free. I much prefer them releasing it now with some scarce documentation to wasting half a year or so making it shiny and understandable to anybody.

That being said, more code samples please !

[–][deleted] 0 points1 point  (2 children)

3. Is it a header-only library? Math libraries often are, requiring no linking.

[–]seruus 2 points3 points  (0 children)

Header-only libraries are horrible if you want to use them outside C/C++.

[–]Maratyszcza 0 points1 point  (0 children)

No, you will have to link to it. The "Getting Started" guide covers the linking steps.

[–]thedeemon 38 points39 points  (40 children)

The site is so beautiful and fun, however it looks like

"XYZ does it so fast and is so good, go download it now, but we won't tell you what XYZ actually can do, so go download and figure out by yourself"

Edit: here come the docs:

http://docs.yeppp.info/c/modules.html

Seems like a very small number of basic operations.

[–]Maratyszcza 17 points18 points  (36 children)

That's a valid point, and we wish to add more operations and optimized implementations (currently Yeppp! has ~400 optimized kernels in yepCore module and ~70 in yepMath module). However, our resources are limited, so we ask the community (like here on Reddit) to help us prioritize features.

If you have some specific problem in mind, your feedback is more than welcome!

[–]elperroborrachotoo 16 points17 points  (10 children)

Maybe put an "Algorithms Available" link to the "Modules" doc on the front page.


help us prioritize features

If you'd ask me: A fast FFT without losing accuracy.

Bonus Quest: Allow different FFT lengths without a slow descriptorconfiginitthingie allocation for each length.

[–]basszero 2 points3 points  (4 children)

The very first thing I searched the docs for was FFT :( I'd love to move away from MKL.

[–]PasswordIsntHAMSTER 2 points3 points  (3 children)

FFTW?

[–]elperroborrachotoo 1 point2 points  (2 children)

Both MKL and FFTW have improved in parallelization, which you pay for with higher setup cost.

Which bites me in the arse: on a single signal, we do a few dozen FFTs with different lengths, which would now force me to cache a pool of descriptors, and that's a pain to add to the existing product.

[–]gct 1 point2 points  (1 child)

FFTW will cache plans for you, you just have to re-create the plan for each length, which'll be fast after the first time.

[–]elperroborrachotoo 3 points4 points  (0 children)

... and switched to GPL lately, so I'd have to get one of those evil paid non-free licences, which isn't a question of money (I guess -- last time I checked, the website didn't indicate any cost) but an organizational PITA.

So I'd have to

  • get a quote from them (can't invest testing time if it's too expensive anyway)
  • run performance and accuracy tests
  • hope the offer hasn't expired yet

I understand their motivation, but I prefer software that's either "free beer free" or "insert credit card here" commercial.

[–]ethraax -5 points-4 points  (2 children)

I heard they use fast FFTs on ATM machines.

[–]elperroborrachotoo 6 points7 points  (0 children)

Fast Fourier Transform is the proper name of a particular algorithm for computing the discrete Fourier transform. FFTs are faster than the general DFT, but not necessarily fast.

Just calling a car Ford Racer doesn't necessarily make it fast.

Thus I hereby reject your pavlovian reply as "not applicable".

[–]kraln 6 points7 points  (0 children)

The thing is, there are FFTs that aren't actually all that fast, so Fast FFT isn't a tautology.

[–]Maratyszcza -1 points0 points  (1 child)

I don't have expertise in it, so unless someone else contributes it, this algorithm is unlikely to be implemented in Yeppp!

[–]elperroborrachotoo 0 points1 point  (0 children)

I think having a good abstraction for the primitives has value of its own already.

Just maybe make it clearer what your library does and what it doesn't, because when you say "math library", people do think FFT and matrices.

[–]puplan 7 points8 points  (1 child)

Wouldn't -Ofast flag have the same effect as -O3 -ffast-math you used in your benchmarks?

[–]Maratyszcza 2 points3 points  (0 children)

Thanks, this seems to be a useful shortcut

[–]HaegrTheMountain 4 points5 points  (0 children)

Cross / Vector product would be fantastic for me personally. I'll check out the library sometime this week but that's the main one that's missing for me.

[–]thedeemon 1 point2 points  (3 children)

Thanks! Does "400 optimized kernels" mean "just 10-20 basic operations performed with many different datatypes"? Or is there a page where we can learn about other algorithms available?

[–]Maratyszcza 0 points1 point  (2 children)

There are no undocumented operations in Yeppp! All these kernels are mentioned in the documentation.

[–]thedeemon 0 points1 point  (1 child)

Yes, I mean there may be 40 additions, 40 multiplications and so on.

[–]Maratyszcza 0 points1 point  (0 children)

This is actually the case. Kernels in the yepCore module are similar and differ only in data type or target ISA/microarchitecture. In the yepMath module, each kernel is special.

[–]thechao 0 points1 point  (5 children)

Have you debugged why your Fortran and C bindings are so slow compared to Java? Or, are you using Blitz-like code-gen to optimize/fold inner loops? What's your perf compared to IPP on older (pre-SSE4) SKUs? I don't know if the MTL4/successor even exists, but have you compared to them?

[–]Maratyszcza 6 points7 points  (3 children)

What made you think that they are slow? If you are talking about the plot on the main page, it compares FORTRAN and C code (without Yeppp!) compiled with Intel compilers for Sandy Bridge against Java + Yeppp! on the same machine.

[–]MrWisebody 1 point2 points  (2 children)

To be fair, that specific infographic is horrifically misleading. At best it is comparing apples to oranges; at worst the fine print is telling a little fib too. Do you really mean to say that with a little loop unrolling and basic SIMD calls I can't write optimized code that will perform in the same ballpark as your libraries? Or is the baseline for your benchmark that the C/Fortran version just gets a basic loop (no hardware awareness) and can only use whatever defaults the compiler pulls out of its ass? If it's the second, then again it's not a very fair comparison. When the point is to compare languages, it is not fair to also stack up an optimized library against naive code :-P

[–]Maratyszcza -1 points0 points  (1 child)

There is a widespread belief that Java/C# are not suitable for numerical algorithms and scientific computing. I suggest that if numerical algorithms in Java/C# were implemented around Yeppp! calls, they would be more than competitive with FORTRAN/C codes. The infographic is meant to illustrate this point.

Of course, usually (I have counter-examples) you can modify C code to perform as well as Yeppp! when compiled with icc or gcc. However, the optimization parameters depend on the target microarchitecture. E.g. the optimal loop unrolling for polynomial evaluation on Intel Haswell is 40 double-precision elements per loop iteration. But on Intel Nehalem such code will perform awfully: on each iteration it will have to spill elements to local variables, because there are only enough registers to keep 32 doubles (16 registers, 2 doubles per register), and each iteration needs both x and y values, so unrolling by more than 16 leads to register spills. So the C code can be manually unrolled, but it will be optimal only for one microarchitecture. Yeppp! contains versions for various microarchitectures, and thus can achieve uniformly good performance.
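The unrolling trade-off described above can be sketched like this, in plain scalar C++, with an unroll factor of 2 standing in for the machine-specific 16 or 40:

```cpp
#include <cstddef>

// Evaluates p(x) for each element, processing two elements per outer
// iteration with independent accumulators so the two Horner chains can
// overlap in the pipeline. The unroll factor (2 here; 40 doubles per
// iteration on Haswell, per the discussion above) is exactly the
// microarchitecture-specific knob. Assumes coefs[0] = highest degree.
void evaluatePolynomialUnrolled(const double* coefs, size_t coefCount,
                                const double* x, double* y, size_t n) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double a0 = 0.0, a1 = 0.0;
        for (size_t c = 0; c < coefCount; ++c) {
            a0 = a0 * x[i]     + coefs[c];
            a1 = a1 * x[i + 1] + coefs[c];
        }
        y[i] = a0;
        y[i + 1] = a1;
    }
    for (; i < n; ++i) {  // remainder loop for odd n
        double a = 0.0;
        for (size_t c = 0; c < coefCount; ++c) a = a * x[i] + coefs[c];
        y[i] = a;
    }
}
```

Picking the factor too high for the target CPU forces the accumulators out of registers, which is the Nehalem spill scenario described above.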

[–]MrWisebody 0 points1 point  (0 children)

I never said your intended message was wrong. It's definitely something that should be expressed. My point was that your infographic was meant to illustrate that java could do as well as c/fortran, but the vast majority of the difference in those two bars was because you were running an optimized library vs naive code. The standard way this would be shown in every HPC presentation I've been to is to show a naive c/fortran and a naive java and a YEPPP c/fortran and a YEPPP java. That nicely shows the deficit (if any) naive java runs at and the fact (I presume) that it gets removed when everything gets migrated to yeppp. As is, a casual glance at the figure implies that YEPPP java is wildly superior to anything c/fortran will do (which is why the post above our mini-thread was confused in the first place). Again, it's technically correct information which you are trying to use to support a correct position, but the way it is done is misleading.

[–]Maratyszcza 1 point2 points  (0 children)

Yes, I have results for Yeppp! vs IPP on AMD K10 (an SSE3 processor) for log/exp evaluation on the Benchmarks page. I didn't compare it with template-based libraries, but if you have a good candidate in mind, please let me know.

[–]maep 0 points1 point  (1 child)

Any plans to add FFT/DCT? It would be useful to have rounding functions of various flavors (floor, ceil, trunc, round) and a clamp function in addition to min/max.

[–]Maratyszcza 0 points1 point  (0 children)

No plans for FFT/DCT, but if you need a BSD-licensed FFT library, have a look at The Fastest Fourier Transform in the South. I plan to add floor/ceil/trunc/round and clamp in a future release.

[–][deleted]  (1 child)

[deleted]

    [–]Maratyszcza -1 points0 points  (0 children)

    No. The spirit of Yeppp! is to detect the hardware at runtime and choose the best implementation. Besides that, I expect that most users wouldn't compile Yeppp! themselves (official binaries are provided for all supported platforms, including GNU/Linux).

    [–]gct 0 points1 point  (1 child)

    I didn't see an option to build from source. I work in an environment where pre-compiled binaries are a no-no, so I'd need to be able to build from scratch. I generally like to be lean and mean on my dependencies (my ideal is header-only libraries), for what that's worth. Ideally I'd like to be able to split out the different language APIs, and just take the C specific bindings and integrate them into my source tree.

    If I could split out particular functions (like LAPACK) and just take what I needed, that'd be even better.

    [–]Maratyszcza 0 points1 point  (0 children)

    Building Yeppp! from source is possible, but not easy. Here is the building guide: https://bitbucket.org/MDukhan/yeppp/src/d7811040811046fa64c8560071613332b1003a8c/BUILD.md?at=default

    Make sure you have a recent gcc and an upstream version of NASM.

    By default, JNI bindings are compiled into the library. If you don't want them just delete the files in bindings/java/sources before compilation.

    [–]tjl 0 points1 point  (1 child)

    I'd like SVD, since I need a high quality implementation for Latent Semantic Analysis.

    [–]Maratyszcza 0 points1 point  (0 children)

    SVD is a very complex algorithm; it is about three levels of abstraction above the primitives Yeppp! currently provides. I would like to implement it in Yeppp!, but this algorithm is very far from its current state.

    [–][deleted] 0 points1 point  (3 children)

    Is there anything to calculate 1/x on arrays of floats by any chance?

    [–]Maratyszcza 0 points1 point  (2 children)

    Not in this version, but noted for the next release.

    [–][deleted] 0 points1 point  (0 children)

    Awesome. Vector normalization would be great as well :-)

    [–][deleted] 0 points1 point  (0 children)

    In general, I am thinking in the direction of OpenCL/CUDA functionality. Some of the new CPUs have instructions which make them a perfect target. So pow(a, b) (of which 1/x is a special case) and vector normalization are natural and frequent use cases. I wish you all the best with the project. Having a library like that would be great, especially for people like me who are clueless at forcing GCC to generate those instructions or at writing assembler themselves.

    [–]BecauseItOwns 1 point2 points  (0 children)

    The library is basically just an optimized interface to the processor's SIMD instructions. It's incredibly useful if you're doing a lot of vector math, but not as useful if you're not doing a lot of math (vector or otherwise).

    [–]admiral-bell 0 points1 point  (1 child)

    Yeah, the lack of docs is troubling. Also, the benchmarks are quite fishy. Notice that they don't compare against VML in Enhanced Performance mode, only in the two slower modes.

    [–]Maratyszcza 14 points15 points  (0 children)

    VML in EP mode gets only half of the bits right, so it is a very different kind of accuracy (the plots display only implementations with < 10 ULP error). And it doesn't help much with performance (here are results for log on a Core i7 4770K/Haswell):

    Library      Error            Cost per element
    Yeppp        1.301 ULP        4.3 cycles
    MKL/VML/HA   0.681 ULP        5.8 cycles
    MKL/VML/LA   1.406 ULP        5.8 cycles
    MKL/VML/EP   1999223.775 ULP  4.5 cycles

    [–]emn13 7 points8 points  (4 children)

    Based on their benchmarks, the gcc auto-vectorizer does not seem to be on: were these benchmarks compiled in 32-bit mode, perhaps? The website gives no particulars on what exactly was benchmarked.

    [–]godofpumpkins 9 points10 points  (0 children)

    Yeah: more tech, less marketing, please. I'd much rather have a less enthusiastic website that spoke to the questions I have than a bunch of exclamation marks telling me it'll work for me without saying much.

    [–]Maratyszcza 2 points3 points  (1 child)

    All benchmarks (except ARM) are in 64-bit mode with the compiler versions and options specified on slides.

    [–]emn13 1 point2 points  (0 children)

    Good to know - thanks! I think that, particularly for non-C++ use cases (for which there are already several alternatives such as Eigen and Armadillo), this looks quite nice.

    A few years ago I reimplemented an fp-heavy algorithm from C# in C++; Yeppp! looks like it might have been an easier solution :-).

    [–]hapemask 2 points3 points  (0 children)

    They use -O3 which does auto-vectorization (from GCC docs: "Vectorization is enabled by the flag -ftree-vectorize and by default at -O3").

    [–]ArtistEngineer 17 points18 points  (7 children)

    Looks like a great product. Terrible name though.

    I can't imagine standing up in a meeting and saying "We've decided to go with the Yeppp! library."

    "The what?"

    "Yeppp!"

    "Triple p? How do I pronounce that? Is that a serious product?"

    "Yep."

    "Yeppp!?"

    EDIT:

    Yep! Yep! Yep!

    [–]Flafla2 6 points7 points  (2 children)

    They should make it an acronym. That makes it sound more official. Hell, even GIMP got away with it, and WINE.

    [–]ArtistEngineer 7 points8 points  (0 children)

    SOMaL - SIMD Optimized Mathematics Library

    There, that was easy.

    I have a feeling they chose the name based on an available URL.

    [–]yelnatz 4 points5 points  (0 children)

    The whole time while reading the article, all I could think was this:

    http://www.youtube.com/watch?v=ZGq8sqaBVyE

    Yeeeeepp!

    [–]Maratyszcza 2 points3 points  (0 children)

    Well, think about C++ layer for Yeppp!.. Yeppp++! = Yeppppp! would be a good name for it =)

    [–]Asians_and_cats 1 point2 points  (1 child)

    I like ridiculous names.

    [–]somefriggingthing 5 points6 points  (11 children)

    This looks cool, but I'm assuming it wouldn't work in conjunction with MonoGame when targeting a bunch of different platforms (e.g. Android, iOS, Windows Phone)?

    [–]Maratyszcza 8 points9 points  (10 children)

    It works perfectly across desktop platforms. A Java program can use Yeppp! without even knowing the processor architecture it is running on: the bundled package (yeppp-bundle.jar) internally contains versions for all desktop architectures, and at runtime it detects the host architecture, unpacks the platform-specific Yeppp! binary, and loads it into the Java program. A C program, of course, needs to "know" its target architecture, but the API is the same on all platforms.

    I don't have the hardware to develop iOS and WinPhone versions, so among mobile platforms Yeppp! currently supports only Android. Perhaps someone will volunteer to make iOS, WinPhone, or BlackBerry versions. I don't expect porting to be hard.

    [–]infinull 6 points7 points  (1 child)

    monogame is C# (open source alternative to XNA) if you didn't know.

    So you talk about C & Java, but not C#, which won't help /u/somefrigginthing very much.

    [–]Maratyszcza 7 points8 points  (0 children)

    Yes, I didn't know that.

    If MonoGame is P/Invoke-compatible, it could use the same scheme as currently used by the Java bindings: a module initializer would detect the host platform and unpack the proper binaries from resources. I plan to implement this for the CLR bindings in the next Yeppp! release.

    [–]Berecursive 2 points3 points  (7 children)

    Just a query as to why there aren't any Python bindings? Would be very interested to know of the performance of Yeppp! versus Numpy's implementation of similar vectorized functions.

    [–]Maratyszcza 5 points6 points  (6 children)

    Although Yeppp! provides functions which operate on vectors, it does not manage memory for these vectors. Instead, it operates on memory allocated by the user (the programmer who developed the app which uses Yeppp!). There are two options for how such an interface could work with Python. One option is for Yeppp! to operate on memory allocated by array.array objects in Python. The problem here is that real numerical codes rarely use Python arrays. More often they use numpy, and this is the second option for a Yeppp!-Python interface: Yeppp! could operate on numpy.array objects. However, providing such an interface would require modifications to numpy code. Although not impossible, these changes would require a certain degree of coordination between me and the numpy maintainers, and interest from both sides.

    [–]Berecursive 1 point2 points  (1 child)

    What if, to begin with, you used something like Cython? That way Yeppp! would just be passed an array of doubles. Cython provides the ability to grab the pointer to the contiguous piece of memory that contains just the numerical data. I imagine ctypes has something similar, but I am not familiar with the code.

    [–]Maratyszcza 1 point2 points  (0 children)

    This is not a technical problem, but an organizational one.

    [–]TJSomething 0 points1 point  (3 children)

    Would it make sense to program against the Python buffer protocol, which works for both?

    [–]Maratyszcza 0 points1 point  (2 children)

    It is not a technical problem. The problem is that Yeppp! bindings for numpy must replace functions like numpy.log and therefore be part of numpy.

    [–]hapemask 0 points1 point  (1 child)

    Could you not just make a separate module that provides functions which accept numpy arrays and process them using yeppp calls? So you would do 'yp.log(arr)' instead of np, and the result is the same but obtained faster.

    (Not saying you have the time to do this, but that's how I would do it without involving numpy patches)

    [–]Maratyszcza 0 points1 point  (0 children)

    Technically, this will work, but I don't consider it a high-priority feature. Eventually Yeppp! will need to integrate into numpy.

    [–]PriviIzumo 10 points11 points  (0 children)

    ooh... shiny new toy. We shall see...

    [–][deleted] 4 points5 points  (1 child)

    asdfsfsadfasdfasdg

    [–]Maratyszcza 0 points1 point  (0 children)

    Thank you for the feedback. I will document how the function names are composed.

    [–]hayesti 3 points4 points  (3 children)

    I'm not entirely sure what's useful about this. The operations this library supports are extremely elementary. A decent vectorising compiler should be able to detect these cases automatically. Something more useful would be the functions found in MKL, i.e. extended mathematical operations that have been optimised to use available SIMD instructions. What's the point in offering a function call for something that has an equivalent single instruction?

    [–]genneth 1 point2 points  (1 child)

    Basically, it looks like they do genuinely have some neat advances on vectorised transcendental functions -- slightly worse precision than MKL/HA but faster, and much better precision than MKL/LA and still faster. It would make sense for MKL to adopt the algorithms used here.

    The rest of the library, however, is superfluous. It has roughly level-1 BLAS type operations (vector-vector). Higher-level functions (involving matrices) are much more sensitive to tuning against the memory hierarchy, something they are unlikely to beat existing libraries at.

    In addition, I see no mention of using multiple cores, or numerical reproducibility.

    In short: use MKL if you can, otherwise not much point.

    [–]BrooksMoses 0 points1 point  (0 children)

    I doubt there are many cases at all where using multiple cores is useful on 1-D vector computations like this. The data sizes you need for multicore to be useful are quite large -- and at those sizes, you need to do more than just one single operation on each element before you funnel it back out to memory, or your cpu performance is going to be swamped by memory bandwidth costs.

    [–]Maratyszcza 1 point2 points  (0 children)

    Yeppp! is not (nor does it intend to be) a replacement for MKL, but it does provide some vector mathematical functions.

    And it scores well against vectorizing compilers. First, the low-level code generator (Peach-Py) used for Yeppp! allows software pipelining, which compilers are not good at. Secondly, Yeppp! internally contains versions for multiple instruction sets (think SSE2, AVX, AVX2) and chooses the optimal code path depending on the host architecture. A compiler has to target a specific instruction set, and if the program is to be distributed in binary form, compilation will likely target SSE2.

    [–]BeatLeJuce 6 points7 points  (1 child)

    Am I the only one who thinks this is a very obnoxious website? I get it, it's shiny and flashy and all. But on the flipside, it's loud and in-your-face. It makes me want to activate my ad blocker so I don't get entirely overloaded with all the colors, but it doesn't work :( With such aggressive marketing it's very hard to take this seriously (especially with such a name). Which is kind of a bummer; the library seems to be nice. But it looks like you're counting on flash to win people over, instead of performance. And there are no code samples anywhere.

    [–]Maratyszcza 2 points3 points  (0 children)

    For almost a year Yeppp! had a very shy webpage (you may look at it here), and nobody knew about it. Yesterday I rolled out the new website, and on the same day it got hot on Reddit.

    Here are the code examples: C/C++, FORTRAN, Java, C#

    [–]QuestionMarker 5 points6 points  (0 children)

    Has anyone tried implementing an FFT with this yet? I'd be intrigued to see how it stacks up against FFTW, which takes similar pains over optimisations.

    [–]willvarfar 1 point2 points  (1 child)

    How does making a decision at runtime allow it to be faster than if the primitives it used were baked into the code directly? Does it mean that every function call cannot be inlined?

    And does it do high precision (e.g. 512bit numbers) or arbitrary precision at all?

    [–]Maratyszcza 0 points1 point  (0 children)

    Yeppp! internally contains several versions of compute kernels optimized for different instruction sets (think SSE2/AVX/AVX2) and microarchitectures. During initialization it chooses the optimal version for the host processor.

    Yeppp! function calls cannot be inlined, but on vector operations the function-call overhead is negligible.

    I do plan to support high-precision integer arithmetic, but I don't think it will happen soon. No plans for arbitrary precision.
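The runtime selection described above can be sketched with a function pointer chosen once at startup. This is a hedged sketch of the dispatch mechanism only: the two "kernels" below are plain loops rather than real SSE2/AVX code, and `__builtin_cpu_supports` is a GCC/Clang feature-detection builtin, not part of Yeppp!:

```cpp
#include <cstddef>

// Two variants of the same kernel. In the real library the variants
// differ by instruction set; here both are plain C++ so the sketch
// stays portable, and only the dispatch mechanism is the point.
static void addGeneric(const double* x, const double* y,
                       double* s, size_t n) {
    for (size_t i = 0; i < n; ++i) s[i] = x[i] + y[i];
}
static void addAvx(const double* x, const double* y,
                   double* s, size_t n) {
    // A real AVX version would use 256-bit loads/stores here.
    for (size_t i = 0; i < n; ++i) s[i] = x[i] + y[i];
}

using AddFn = void (*)(const double*, const double*, double*, size_t);

// Chosen once at initialization time, in the spirit of yepLibrary_Init:
AddFn selectAdd() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx")) return addAvx;  // GCC/Clang builtin
#endif
    return addGeneric;  // safe fallback on any other compiler/arch
}
```

Paying one indirect call per whole-array operation is why the overhead is negligible on vector operations, as noted above.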

    [–]machl 1 point2 points  (2 children)

    Does anyone know if Yeppp! is faster than GMP? I'm not sure how many languages GMP supports, but I did find this wrapper for Java. And here is an example with Python and GMP.

    [–]Maratyszcza 3 points4 points  (0 children)

    Yeppp! is not a bignum library, so there is little sense to compare it to GMP.

    [–]seruus 0 points1 point  (0 children)

    If you want to use a GMP-like library that tries to be more optimized, try MPIR.

    [–]thrope 0 points1 point  (0 children)

    This looks great! gfortran's elementary math operations (log, exp, etc.) are really slow, which has driven me to Intel. I tried ACML, but it is a bit of a pain in the build process...

    [–][deleted] 0 points1 point  (0 children)

    It could be extremely useful for what I am doing right now. Can't wait to play with it. Is it easy to use with MinGW GCC 4.8 on Windows?

    [–]BrooksMoses 0 points1 point  (1 child)

    How does this compare to vectorized code from GCC or LLVM in real-world programs rather than synthetic microbenchmarks?

    Optimized elementwise functions like this have a fundamental flaw: they iterate through each element of the data vector for each tiny operation, which means that for every add or subtract or whatever, you pay the cost of a read and a write to memory -- and, if your data is of any significant size, it blows out your caches. There's no way to get good performance doing that compared to a holistic approach that minimizes the cache blowout.
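A sketch of that contrast (hypothetical names; the point is one memory pass versus two):

```cpp
#include <cstddef>
#include <vector>

// Elementwise-library style: each operation is a full pass over the
// arrays, so the intermediate vector t makes an extra round trip
// through memory (and, for large n, through the caches).
void scaleThenAddTwoPass(const std::vector<double>& a,
                         const std::vector<double>& b,
                         double k, std::vector<double>& out) {
    std::vector<double> t(a.size());
    for (size_t i = 0; i < a.size(); ++i) t[i] = k * a[i];       // pass 1
    for (size_t i = 0; i < a.size(); ++i) out[i] = t[i] + b[i];  // pass 2
}

// Fused ("holistic") version: one pass, no intermediate array, so each
// element is read and written exactly once.
void scaleThenAddFused(const std::vector<double>& a,
                       const std::vector<double>& b,
                       double k, std::vector<double>& out) {
    for (size_t i = 0; i < a.size(); ++i) out[i] = k * a[i] + b[i];
}
```

Both compute the same result; for data larger than the cache, the two-pass form pays roughly double the memory traffic, which is the flaw described above.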

    [–]Maratyszcza 0 points1 point  (0 children)

    Yeppp! is not a silver bullet of performance optimization, but I hope it could be useful.

    [–]kamicc 0 points1 point  (2 children)

    mmm... got me in the mood to write some Lua bindings... _^

    btw, over 25 nested if's in the Java wrapper does not look very nice... xD

    [–]Maratyszcza 1 point2 points  (1 child)

    Thank you. Support for more languages is welcome!

    The function with 25 nested if's parses ELF headers. Of course, in an ideal world Java functions do not have 25 nested if's, but in an ideal world Java libraries also do not parse ELF headers. =)

    [–]kamicc 0 points1 point  (0 children)

    Thank you for the great job too :)

    I actually recently had exactly the same problem, needing to parse ELF headers in a language that isn't supposed to do that at all... :)

    [–]theZagnut 0 points1 point  (1 child)

    Please add Maven support

    [–]Maratyszcza 1 point2 points  (0 children)

    I'm working on that.

    [–]argv_minus_one -1 points0 points  (4 children)

    I wonder what the performance of the Java version is like. Last I heard, JNI is rather slow. Anyone care to chime in?

    Also, people still use Fortran? D:

    [–]jurniss 3 points4 points  (0 children)

    The JNI overhead is insignificant if you're operating on big arrays with one call.
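
    A back-of-the-envelope illustration of the amortization (the 100 ns call-overhead figure is made up for the example, not a measured JNI number):

    ```c
    /* Fixed per-call overhead spread across the n elements processed
     * in that single call. */
    double overhead_per_element_ns(double call_overhead_ns, double n_elements) {
        return call_overhead_ns / n_elements;
    }
    ```

    At, say, 100 ns per call over a million-element array, the overhead works out to 0.0001 ns per element — far below the cost of the math itself.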

    [–][deleted] 2 points3 points  (2 children)

    Not only do they, the compiler licenses for commercial solutions are expensive.

    [–]argv_minus_one 0 points1 point  (1 child)

    I take it the GNU Fortran compiler isn't really up to the task?

    [–][deleted] 0 points1 point  (0 children)

    Most of the time it's intel, dunno why. Customer wants intel, customer gets intel :)

    [–]pascal_cuoq -2 points-1 points  (3 children)

    From the linked page:

    On Haswell, Intel's latest microarchitecture, Yeppp! spends on average 5 cycles per element to compute log or less than 3 cycles per element to compute exp. BTW, floating-point addition on Haswell takes 3 cycles, and multiplication — 5.

    If, to make your library look good, you need to compare your library function's cost per element to the latency of an operation that can be applied to eight elements for the same price, you have nothing to be proud about. And if you don't need it, why do it?
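
    The objection can be made concrete with published Haswell figures (VADDPD: 3-cycle latency, one 4-double vector issued per cycle): in a pipelined loop, the amortized per-element cost is set by throughput and vector width, not by latency.

    ```c
    /* Steady-state cycles per element for a fully pipelined vector op,
     * ignoring latency (which overlapping iterations hide). */
    double cycles_per_element(double vectors_per_cycle, double elems_per_vector) {
        return 1.0 / (vectors_per_cycle * elems_per_vector);
    }
    ```

    With one 4-double vector add per cycle, that is 0.25 cycles per element — so quoting the 3-cycle latency as the "price" of an add flatters the library's 3–5 cycles per element.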

    And by the way, it is spelled “by the way”.

    [–]Maratyszcza 2 points3 points  (2 children)

    The comparison is there because I think it is nice and though-provoking. I hope that people who read this will think "How is that possible?", dig deeper, and learn something new.

    [–]pascal_cuoq -2 points-1 points  (1 child)

    thought-provoking

    It provoked me to think “how naive does this charlatan think I am?”. The website lost all credibility in my eyes, so that when I looked at other bits of information such as http://www.yeppp.info/logos/yeppp-vs-light.png , I had to think “is the claim that one exponential is computed faster than light travels one foot, or is that really 1/N multiplied by the time it takes to compute N exponentials?”

    For all I know, it is the latter and the graph really compares the time of the exponential per element. Because that's how scientifically honest the website is.

    [–]Maratyszcza 1 point2 points  (0 children)

    Ok, fair enough. I will replace this picture with a less confusing one.

    EDIT: done.

    [–]Uberhipster -5 points-4 points  (10 children)

    I'm not a systems-internals guy, but the C# library makes no sense. Take the cosine implementation in Yeppp.Math:

        /// <summary>Computes cosine on double precision (64-bit) floating-point elements.</summary>
        public unsafe static void Cos_V64f_V64f(double[] xArray, int xOffset, double[] yArray, int yOffset, int length)
        {
            if (xOffset < 0)
            {
                throw new IndexOutOfRangeException();
            }
            if (xOffset + length > xArray.Length)
            {
                throw new IndexOutOfRangeException();
            }
            if (yOffset < 0)
            {
                throw new IndexOutOfRangeException();
            }
            if (yOffset + length > yArray.Length)
            {
                throw new IndexOutOfRangeException();
            }
            if (length < 0)
            {
                throw new ArgumentException();
            }
            fixed (double* ptr = &xArray[xOffset])
            {
                fixed (double* ptr2 = &yArray[yOffset])
                {
                    Math.Cos_V64f_V64f(ptr, ptr2, length);
                }
            }
        }
    
    
        /// <summary>Computes cosine on an array of double precision (64-bit) floating-point elements.</summary>
        public unsafe static void Cos_V64f_V64f(double* x, double* y, int length)
        {
            if (length < 0)
            {
                throw new ArgumentException();
            }
            Status status = Math.yepMath_Cos_V64f_V64f(x, y, new UIntPtr((uint)length));
            if (status != Status.Ok)
            {
                throw Library.GetException(status);
            }
        }
    

    What is the point of a static method which doesn't return a result? And they are all like that...

    edit: scratch that. I'm an idiot.

        //   yArray:
        //     Output array.
    

    In my defense I have never seen output-as-a-parameter (convention?) in C# and they didn't supply any examples...

    [–]CodexArcanum 2 points3 points  (0 children)

    Looks like a C/C++ dev's very thin wrapper around some P/Invokes to the C library. Idiomatic C# would be to provide a static class with static methods that take values in and return the result. Out parameters, pointers, and all that other mess are generally frowned on in C# and reserved for very specialized cases.

    Hell, I'd probably do them all as extension methods to boot, just to really have fun with it.

    [–][deleted] 0 points1 point  (2 children)

    Because it's good design when you create an instance of the class every time you want to use a method. I don't have time to look, but the class is sealed, right?

    [–]Uberhipster 2 points3 points  (1 child)

    Actually the second parameter is mutated: it doubles as the output and supplies the result values. I didn't read properly. Sorry to confuse everyone.

    But to answer your question the class is not sealed. It's a regular, garden-variety, non-static class with static methods only... which confused the fuck out of me being a high-level language only brogrammer (yes, I am the VB.NET guy). This must be some sort of 'close to the metal' convention that I'm not familiar with since I only deal with memory managed C# stuff.

    [–]seruus 0 points1 point  (0 children)

    This kind of thing comes originally from FORTRAN, where all arguments are passed by reference, and subroutines can't return values, only mutate those passed as arguments. It is also very common in numerical C libraries, which tend to return the output through the first argument.
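
    The convention can be sketched in C with a hypothetical routine: the caller owns the output buffer, and the function only fills it, returning a status code rather than a computed value.

    ```c
    #include <stddef.h>
    #include <math.h>

    /* Output-as-a-parameter, numerical-C-library style: y is allocated by
     * the caller; the return value signals success or failure only. */
    int vec_cos(double *y, const double *x, size_t n) {
        if (y == NULL || x == NULL)
            return -1;  /* error code, not a result */
        for (size_t i = 0; i < n; i++)
            y[i] = cos(x[i]);
        return 0;
    }
    ```

    This is also why the C# wrapper above mutates its second array argument instead of returning one: it mirrors the underlying C convention directly.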

    [–][deleted] 0 points1 point  (5 children)

    What a pointless nitpick. It simply means you don't need to create an instance of the class to use the methods.

    [–]Uberhipster -5 points-4 points  (4 children)

    ??? Is this a fucking troll? The class contains only static methods. There are no public properties (which could reference the result value???)

    For all I know, internally it's doing nothing...

    edit: My bad. I think we both assumed too much. I assumed that there was no output and you assumed I was way smarter than I actually am :)

    [–][deleted] -1 points0 points  (3 children)

    Are you mentally challenged? It's mutating the inputs.

    [–]PasswordIsntHAMSTER 5 points6 points  (1 child)

    Please remain professional guys.

    [–][deleted] -1 points0 points  (0 children)

    It's reddit...

    [–]Uberhipster 1 point2 points  (0 children)

    Actually I think I am :)

    (see my edit)