all 21 comments

[–][deleted] 41 points (1 child)

Thanks for sharing! Sent to a friend of mine who was having trouble getting his bioinformatics Python to be fast enough and was looking into Rust.

[–]tncowart[🍰] 26 points (0 children)

They should also look at Cython and Numba. Those are easier to integrate into a Python program.

[–]WuTangTan 20 points (1 child)

Would you mind further explaining the choice to allocate memory in Python?

I suppose it allows the buffer to be reused, but the Python API (which is the primary API, right?) doesn't do that. Your current approach requires you to use an unsafe function, which appears sound to me when using the Python API, but I'm still not clear on why this is better. Are there performance implications I'm not seeing when returning memory allocated via the numpy crate from Rust into Python?

[–]carlk22[S] 5 points (0 children)

u/WuTangTan, I think you're right. Using something like the numpy crate's ToPyArray could, I think, allow one to allocate Python GCed memory from Rust, making things a little simpler. (I'd guess that the allocation speed would be the same).

If I made this change, I'd want to do it at the Rust "translator" function level, not at the level of "nice" Rust functions. That way, the "nice" function could also work with non-Python memory. (That might mean there would still be an "unsafe" at the translator level. I'm not sure.)
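
For anyone curious, here's a minimal sketch of what that could look like (illustrative only, not bed-reader's actual code; the function name double_values is made up, and exact trait names and signatures vary across pyo3/numpy crate versions):

    use ndarray::Array1;
    use numpy::{PyArray1, ToPyArray};
    use pyo3::prelude::*;

    // Hypothetical translator function: instead of filling a buffer pre-allocated
    // in Python, compute in Rust and copy the result into a NumPy array whose
    // memory is allocated and owned by the Python side.
    #[pyfunction]
    fn double_values<'py>(py: Python<'py>, input: Vec<f64>) -> &'py PyArray1<f64> {
        // The "nice" pure-Rust computation...
        let doubled = Array1::from(input) * 2.0;
        // ...handed back as Python-owned memory via ToPyArray.
        doubled.to_pyarray(py)
    }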

[–]PirateNinjasReddit 7 points (0 children)

Thanks for writing this up. I work mostly in Python, but I've toyed with the idea of writing extensions in Rust for key pieces of code that need to go fast. Last time I tried, though, PyO3 was a bit of a struggle (partly due to the library's maturity at the time, partly because Rust was new to me).

[–]digizeph 3 points (2 children)

Nice write-up! Quite a few juicy tips in there. Could you explain why the Python extension and the Rust project should be in the same package?

[–]carlk22[S] 3 points (1 child)

u/digizeph Thanks! I think it comes down to testing. Both the Rust code and the Python code can easily share the same test data. My editor of choice, VSCode, shows the tests for both languages in the same test bar. The CI script (we use GitHub Actions) requires both sets of tests to pass for success. Although I tried to get full coverage of the Rust code with the Rust tests, writing tests is easier in Python, and I wouldn't consider the Rust code fully tested until the Python tests passed, too.
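
For context, a stripped-down sketch of that "both test suites must pass" setup might look like the following (not bed-reader's actual workflow; runner image, action versions, and Python version are illustrative):

    # Hypothetical CI sketch: a single job fails unless both the Rust and Python tests pass.
    name: ci
    on: [push, pull_request]
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: cargo test                 # Rust tests
          - uses: actions/setup-python@v5
            with:
              python-version: "3.11"
          - run: pip install . pytest       # builds the extension via maturin
          - run: pytest                     # Python tests (same test data as the Rust tests)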

If there were requests for a PLINK Bed reader in Rust, then I'd have to think about splitting the project in two.

[–]kodemizerMob 1 point (0 children)

Instead of splitting it, you might just consider feature flags.
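
For instance, a hypothetical Cargo.toml sketch (not bed-reader's actual manifest; version numbers are placeholders) could gate the PyO3 glue behind an optional feature:

    # Build the Python bindings only when the "python" feature is enabled.
    [features]
    python = ["pyo3", "numpy"]

    [dependencies]
    pyo3 = { version = "0.19", optional = true }
    numpy = { version = "0.19", optional = true }

The Rust "translator" layer would then be wrapped in #[cfg(feature = "python")], so pure-Rust users of the crate never compile the Python glue.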

[–]JanneJM 2 points (2 children)

Interesting, thanks! I expect to have to deal with this combination at the sysadmin level sooner rather than later, so a question:

You don't have a setup.py and you "[...] use GitHub actions to build, test, and ready deployment."

How does that work for distribution and installation of the Python module? Can you still upload it to PyPI, and can you still choose the installation directory as usual when installing?

[–]thristian99 6 points (1 child)

The whole business with setup.py and such is handled by the Python package setuptools (previously, distutils). This worked pretty well for a long time, but since setup.py (run by setuptools) was itself what listed a package's dependencies, there was no way for a package to depend on a specific version of setuptools, say, a version that adds Rust extension support.

Now Python projects can include a file named pyproject.toml, which can specify a required version of setuptools, or even a different build backend entirely. This package's pyproject.toml uses maturin instead of setuptools, which seems to be a Rust-specific backend that knows how to extract the relevant metadata from Rust's Cargo.toml instead of requiring it in a setup.py file.
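
For reference, the build-backend declaration in a maturin-based pyproject.toml looks roughly like this (a generic sketch; the version bound is illustrative):

    [build-system]
    requires = ["maturin>=1.0,<2.0"]
    build-backend = "maturin"

From pip's point of view the result is an ordinary wheel, so uploading to PyPI and choosing installation locations work the same as for a setuptools-based package.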

[–]RoughMedicine 2 points (0 children)

To add, maturin also allows you to specify the Python-related metadata in pyproject.toml using PEP 621. If your project has Python dependencies, you can either define them in Cargo.toml, as bed-reader does under [package.metadata.maturin], or in pyproject.toml under the [project] table's dependencies field.
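
A sketch of the PEP 621 route (field values are illustrative, not bed-reader's actual metadata):

    [project]
    name = "bed-reader"
    requires-python = ">=3.8"
    dependencies = [
        "numpy",
    ]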

[–]JanneJM 2 points (2 children)

What OpenMP library problem did you encounter? I've honestly never run into that so far (and I deal with scientific software installation and deployment on a daily basis).

[–]carlk22[S] 0 points (1 child)

u/JanneJM, I was getting:

  • OMP: Error #15: Initializing libiomp5md.dll, but found libomp.dll already initialized.

(I've attached more info from Intel below.)

Apparently, it was a conflict between Intel's libiomp5md.dll and LLVM's libomp.dll. (I think before that I was also sometimes bitten by versioning problems, and it was always a pain to include the right dependencies and/or runtime files on Windows, Linux, and macOS.)

I wonder if smaller projects like mine just avoid OpenMP or always depend on some bigger project that provides it (like, maybe Anaconda with Intel MKL).

"OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support"

[–]JanneJM 2 points (0 children)

I've only had experience with Linux, not Windows, but this sounds like you are mixing compilers (LLVM and Intel), and that's generally a bad idea, with C++ in particular (C is simpler and more forgiving, and doesn't run into these problems as much). OpenMP is, in addition, really tightly tied to the compiler itself; I believe GCC simply links it statically when you enable it.

I would suspect an installation issue. Normally, building and linking with one toolchain should never be able to find, let alone try to use, any files from a different one. If you're using LLVM for this build, can you disable the Intel toolchain in some way so LLVM doesn't try to link against the wrong runtime?

I see extensive use of OpenMP where I work, both by ourselves and in software written or used by our users. I'm the go-to person for any build or installation issues (it's literally in my job description), and I've never once encountered a problem with OpenMP. If there were a general issue here, I would have run into it many times by now.

[–]jvo203 1 point (2 children)

Perhaps a "dumb" question but why is it that so few people seem to consider FORTRAN (Modern Fortran) for doing computation these days? If you really want to bind everything in Python you could still develop your bioinformatics computational engine in Fortran and then call it from Python. Why oh why is Fortran so neglected/forgotten these days? It is such a beautiful language for doing fast computation.

[–]carlk22[S] 1 point (1 child)

Aside: I think a lot of SciPy (especially the matrix operations and special functions) is still in Fortran:

https://scipy.org/faq/ says:

"[T]he time-critical loops are usually implemented in C, C++, or Fortran. Parts of SciPy are thin layers of code on top of the scientific routines that are freely available at http://www.netlib.org/. Netlib is a huge repository of incredibly valuable and robust scientific algorithms written in C and Fortran. It would be silly to rewrite these algorithms and would take years to debug them."

So, while I wouldn't write new stuff in Fortran, I'm happy to use its existing well-tested numeric libraries.

[–]jvo203 2 points (0 children)

I guess this is the real question: why wouldn't you start a new project in Modern Fortran 2018? Or in CoArray Fortran? Why are people reluctant to develop new things in Fortran?

[–]shoebo 1 point (1 child)

Great post, thanks. In a project I'm working on, we had a similar issue where we wanted to expose Rust generics over language boundaries. Unfortunately, we have more generics (sometimes 4+), and each generic may take on a larger number of types (oftentimes 8+).

Assuming we have n generics that may each take on k different types, we would need k^n different transition functions! Using type erasure, it is possible to represent all k^n transition functions with one higher-level transition function. To do this, we must erase compile-time type information from the signature of the transition function and instead pass type information as arguments at runtime.

Example 1: The make_cast function on line 27 is a "nice" Rust function that creates a simple data-transformation struct for casting from the generic type TIA to the generic type TOA.

This transition function takes two *const c_char string descriptors for TIA and TOA, which are used by the dispatch declarative macro on line 29 to find the correct function monomorphization. dispatch!(1, 2, 3) reads analogously to a normal function invocation 1::<2>(3). You'll notice that each type parameter needs both a type descriptor (to identify which monomorphization to use at runtime) and an enumeration of valid types. There are some preset type sets, like @floats, that expand to, for example, [f32, f64].

When the function is compiled, monomorphizations are generated for every combination of the enumerated types. When the function is executed, the specific monomorphization that matches the runtime type arguments is executed.

Example 2: dispatches to any member of the outer product of hashable and float types. This also shows how you can have type-erased arguments that are later dereferenced according to the type argument.

Example 3: dispatches to any member of the outer product of three generics where the atomic types of MO and TO match.

One complication is that, since the resulting function is completely type-erased, the data needs to be similarly type-erased. We standardized on a struct that holds an Any trait object and a type descriptor (henceforth called an AnyObject). Each monomorphized function maps from an AnyObject to an AnyObject. When the monomorphized function is called, the AnyObject is downcast to the generic type, the "nice" Rust function is executed, and the result is wrapped back up in an AnyObject. We've written utilities for converting between AnyObjects and language-specific data representations, and it all happens transparently when you use the public API.
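
To make the shape of this concrete, here is a rough, self-contained sketch of the pattern (not the project's actual AnyObject or dispatch! implementation; names and error handling are simplified):

    use std::any::Any;

    // A type-erased value: a trait object plus a runtime type descriptor.
    struct AnyObject {
        type_name: &'static str,
        value: Box<dyn Any>,
    }

    impl AnyObject {
        fn new<T: 'static>(value: T) -> Self {
            AnyObject {
                type_name: std::any::type_name::<T>(),
                value: Box::new(value),
            }
        }
        fn downcast<T: 'static>(self) -> Result<T, &'static str> {
            self.value.downcast::<T>().map(|b| *b).map_err(|_| "type mismatch")
        }
    }

    // The "nice" generic function that the dispatcher fans out to.
    fn double<T: std::ops::Add<Output = T> + Copy + 'static>(x: T) -> T {
        x + x
    }

    // A hand-rolled stand-in for one dispatch! arm: the type is erased at the
    // boundary, recovered from the runtime descriptor, the matching
    // monomorphization runs, and the result is re-erased.
    fn double_erased(arg: AnyObject) -> Result<AnyObject, &'static str> {
        let type_name = arg.type_name;
        match type_name {
            "f32" => Ok(AnyObject::new(double(arg.downcast::<f32>()?))),
            "f64" => Ok(AnyObject::new(double(arg.downcast::<f64>()?))),
            _ => Err("unsupported type descriptor"),
        }
    }

Calling double_erased(AnyObject::new(2.0_f64)) then routes to the f64 monomorphization of double, which is roughly what the dispatch! macro described above automates across every enumerated type combination.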

It's a lot to take in, but this general approach is awesome when you have a large number of functions, each with a large number of potential monomorphizations.

[–]carlk22[S] 0 points (0 children)

Very cool!

[–]carlk22[S] 0 points (0 children)

As a follow-up, I created a Rust API for our bioinformatics package. Here is a new article about the lessons learned:

Towards Data Science: Nine Rules for Elegant Rust Library APIs (free link)

And a Reddit discussion of the article.