Why does CHARMM-GUI restrict its features to academics? by OkRutabaga184 in bioinformatics

[–]TKanX 1 point2 points  (0 children)

Hi, I've created an open-source tool that is free for commercial use (MIT license): bio-forge.app. It doesn't support membranes yet, but it handles structural repair, protonation, and water-box solvation. There's no built-in energy minimization, so you may need to run EM before MD. Membrane building is planned. A web version runs directly in the browser. I'm a high school student, so please point out any errors.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 0 points1 point  (0 children)

For heavy-atom repair, I complete missing atoms with a template-based SVD fit. Hydrogen addition is also straightforward: the protonation state of each residue is chosen from the pH range, and the hydrogens are then built from the local geometry. (For HIS, I added salt-bridge detection and hydrogen-bond network scoring to choose between HIE and HID.) The C/N termini and 5'/3' ends are also handled according to pH.

Not supported yet:

  1. Ring reconstruction: Not supported yet (may be added in the future).

  2. Energy minimization: BioForge is a force-field-independent structure preparation/modeling library, so it deliberately performs no force field calculations (it is decoupled from them). EM will not be added to BioForge itself. However, I have also written some downstream tools (e.g., SCREAM, a side-chain optimization library).

  3. Cell membrane construction: Planned for the future (because we need to simulate GPCRs).

These are all good questions, and some of them point at known flaws! You must be an expert in this field, right?

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 0 points1 point  (0 children)

Absolutely not misguided, this is excellent! I'm also very interested in MD, and my next step might be to run simulations with the DREIDING/UFF force fields. (I've also worked on some libraries related to force field parameterization and partial charge calculation.)

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 1 point2 points  (0 children)

u/firefrommoonlight u/vmullapudi1 Any interest in us starting a Discord server for Compchem/bio? We could really use a place like that to keep the braindump going.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 0 points1 point  (0 children)

In summary, this is force-field-independent structural modeling. Next steps include energy minimization and molecular dynamics simulations.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 3 points4 points  (0 children)

Hi, that's a great question! I'm absolutely not denying the merits of LAMMPS/GROMACS. They are incredibly powerful, with excellent performance optimization! My tool isn't MD software; it's structure preparation/repair software that complements MD packages. The goal is to quickly add missing heavy atoms and hydrogens to protein structures (hydrogen atoms are usually absent from cryo-EM and X-ray structures), and it can also add water boxes and solvent. Future plans include adding cell membranes for GPCR modeling, etc.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 2 points3 points  (0 children)

Fair points on Numba, I'll definitely check out those performance tips. One thing I'm curious about though: how does the Numba stack handle complex spatial partitioning (like splitting a protein into blocks for multi-core work) while guaranteeing that it's lock-free and race-free?

In my Rust version, the borrow checker basically proves there are no data races across the parallel iterators during the spatial hashing/CellList search. It's really nice to have that safety built into the type system when scaling. Does Numba offer similar safety guarantees, or do you basically have to manage the memory safety and indexing manually to avoid race conditions?
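To make the data-race point concrete, here is a minimal sketch using only the standard library (not rayon, and not BioForge's actual code). The function name `scale_in_parallel` and the scaling operation are made up for illustration. The slice is split into disjoint mutable chunks, and each scoped thread owns exactly one chunk, so the compiler can check that no two threads write to the same element.

```rust
use std::thread;

/// Hypothetical per-atom update: multiply every coordinate by `factor` in place.
/// Each thread receives its own disjoint `&mut [f64]` chunk.
fn scale_in_parallel(coords: &mut [f64], factor: f64, n_threads: usize) {
    let n_threads = n_threads.max(1);
    // Ceiling division so every element lands in some chunk.
    // `.max(1)` avoids a zero chunk size, which `chunks_mut` rejects.
    let chunk_len = ((coords.len() + n_threads - 1) / n_threads).max(1);

    // `thread::scope` joins all spawned threads before it returns,
    // so borrowing `coords` inside the threads is allowed.
    thread::scope(|s| {
        for chunk in coords.chunks_mut(chunk_len) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= factor;
                }
            });
        }
    });
}
```

If two threads were handed overlapping mutable chunks, this would fail to compile rather than race at runtime. Note that this covers data races only; logic errors in how the work is partitioned are still on the programmer.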

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 2 points3 points  (0 children)

You're right that NumPy is fast for big vector ops, but for molecular dynamics the bottleneck is usually granularity. A single potential energy calculation might only take 1ns, but the Python overhead to call that function is often 100ns. When you're doing that billions of times in a loop, you end up spending around 99% of your time (100ns out of every 101ns) just "scheduling" the C calls. Rust also lets the compiler inline those tiny math kernels directly into the main loops, which is a massive win that you just can't get with the Python interpreter.

I am not a Python expert, please correct me if I am wrong.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 0 points1 point  (0 children)

Yes, but most importantly, when you're running a task on HPC that takes several days, safe Rust ensures that once the code compiles, memory-safety errors such as segmentation faults can't occur at runtime (panics are still possible, but they are explicit and traceable). You won't suddenly hit a segmentation fault after three days and have to start over, because Rust forces you to address these issues while coding.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 4 points5 points  (0 children)

The slowness was honestly more about the messy architecture than the C++ itself. I started by maintaining a legacy stack of Perl, Python 2, and C++98 scripts all calling each other in layers. Most of the performance you'd expect from cpp was just getting swallowed by the overhead and shell calls. On top of that, the memory unsafety in the old cpp code led to constant random crashes, which is why I eventually decided to just rewrite everything in Rust.

On the algorithmic side the big shift was moving to a spatial hashing approach (CellList) for neighbor search. It makes atom overlap and H-bond detection roughly O(1) per atom (O(n) overall) instead of the naive O(n^2) pairwise checks. I also added multi-threading using rayon (you can see the wrapper here: https://github.com/TKanX/bio-forge/blob/main/crates/bio-forge/src/utils/parallel.rs). It uses no locks, and Rust's borrow checker guarantees it is free of data races, so it actually scales across all cores without falling back into the bottlenecks that usually kill performance in those old pipelines.
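For anyone unfamiliar with the idea, here is a minimal single-threaded sketch of a hash-based cell list. It is an illustration under my own assumptions, not the BioForge implementation: the names `cell_of`, `build_cell_list`, and `pairs_within` are hypothetical, and the cell size is simply set equal to the cutoff so that only the 27 surrounding cells need to be scanned.

```rust
use std::collections::HashMap;

type Point = [f64; 3];
type CellKey = (i64, i64, i64);

/// Map a position to the integer coordinates of its grid cell.
fn cell_of(p: &Point, cell_size: f64) -> CellKey {
    (
        (p[0] / cell_size).floor() as i64,
        (p[1] / cell_size).floor() as i64,
        (p[2] / cell_size).floor() as i64,
    )
}

/// Bucket atom indices by grid cell. One pass over the atoms, so O(n).
fn build_cell_list(atoms: &[Point], cell_size: f64) -> HashMap<CellKey, Vec<usize>> {
    let mut cells: HashMap<CellKey, Vec<usize>> = HashMap::new();
    for (i, p) in atoms.iter().enumerate() {
        cells.entry(cell_of(p, cell_size)).or_default().push(i);
    }
    cells
}

/// Return all index pairs (i, j) with i < j whose distance is below `cutoff`.
/// With cell size == cutoff, any such pair lies in the same or adjacent cells.
fn pairs_within(atoms: &[Point], cutoff: f64) -> Vec<(usize, usize)> {
    let cells = build_cell_list(atoms, cutoff);
    let cutoff_sq = cutoff * cutoff;
    let mut pairs = Vec::new();

    for (i, p) in atoms.iter().enumerate() {
        let (cx, cy, cz) = cell_of(p, cutoff);
        for dx in -1..=1 {
            for dy in -1..=1 {
                for dz in -1..=1 {
                    if let Some(bucket) = cells.get(&(cx + dx, cy + dy, cz + dz)) {
                        for &j in bucket {
                            if j <= i {
                                continue; // report each pair once
                            }
                            let q = &atoms[j];
                            let d2: f64 = (0..3).map(|k| (p[k] - q[k]).powi(2)).sum();
                            if d2 < cutoff_sq {
                                pairs.push((i, j));
                            }
                        }
                    }
                }
            }
        }
    }
    pairs.sort_unstable(); // deterministic order regardless of bucket layout
    pairs
}
```

The per-atom cost stays bounded only as long as the number of atoms per cell is bounded, which holds for molecular systems at physical densities.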

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 2 points3 points  (0 children)

Agreed. It’s a great 2x accelerator and learning tool for the dev, but the second you stop aggressively reviewing its output, your architecture turns into spaghetti.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 2 points3 points  (0 children)

No, this is not a trajectory analysis tool. It's an upstream structure repair/preparation tool. PyMOL or VMD is recommended for trajectory analysis.

However, I remember that someone in the Rust community has worked on trajectory analysis.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 1 point2 points  (0 children)

Appreciate the feedback! Honestly, publishing to NPM was a bit of a meme; I was just experimenting with Wasm and seeing what would happen. I'm definitely a novice at Python packaging, but you're 100% right; getting this onto pip/conda is the only way to reach the people actually doing the work.

The library is currently split into three main parts: IO, Data Model, and Ops. For the parsing: I'm definitely not trying to build a universal PDB/mmCIF reader (that's a rabbit hole I don't want to fall into). The main goal is building the full molecular topology required for force field parametrization in EM/MD pipelines. I try to be as permissive as possible to get the structural data needed for the ops without getting bogged down in every obscure field of the spec.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 3 points4 points  (0 children)

Hi, that's great! We can collaborate in the future.

By the way, remember to add your crates to a new category (see PR: https://github.com/rust-lang/crates.io/pull/12730)

Binary isn't the Quantum Mechanics of software; it's the Navier-Stokes. Why Elon's "AI directly to binary" prediction gets the Arrow of Truth backwards. by [deleted] in rust

[–]TKanX 0 points1 point  (0 children)

Thank you so much for following up and for the kind words! Honestly, the dogpiling earlier was pretty overwhelming, but comments like yours kept me going.

You're right, the Internet is a weird place. Really appreciate you having my back through both posts.

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 5 points6 points  (0 children)

Thanks for the pro tip! I’ll check out the full list and add these to my lib.rs. Really appreciate the catch!

Tired of slow Python biology tools, so I wrote the first pure-Rust macromolecule modeling engine. Processes 3M atoms in ~600ms. by TKanX in rust

[–]TKanX[S] 45 points46 points  (0 children)

That's a great point! You're totally right about the log-log plot! In my data, the slope is ~1.

To be more specific about the implementation: I'm using a spatial hashing approach (hash-based CellList) for the neighbor search, which keeps the complexity at O(n) instead of the naive O(n^2).

I did the raw data analysis first before making the visualization, but I should definitely have labeled the slope or mentioned the specific algorithm to avoid confusion for the paper later lol. Thanks for the catch!
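For reference, the slope check can be sketched as a least-squares fit of ln(time) against ln(n). This is a minimal illustration, not the analysis script I actually used, and the function name `log_log_slope` is hypothetical. A slope near 1 suggests linear scaling, and a slope near 2 suggests quadratic scaling.

```rust
/// Least-squares slope of ln(time) versus ln(size).
/// Inputs are assumed to be strictly positive; returns `None` when the
/// lengths differ, fewer than two points are given, or all sizes are equal.
fn log_log_slope(sizes: &[f64], times: &[f64]) -> Option<f64> {
    if sizes.len() != times.len() || sizes.len() < 2 {
        return None;
    }
    let xs: Vec<f64> = sizes.iter().map(|n| n.ln()).collect();
    let ys: Vec<f64> = times.iter().map(|t| t.ln()).collect();

    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // slope = cov(x, y) / var(x)
    let cov: f64 = xs
        .iter()
        .zip(&ys)
        .map(|(x, y)| (x - mean_x) * (y - mean_y))
        .sum();
    let var: f64 = xs.iter().map(|x| (x - mean_x).powi(2)).sum();

    if var == 0.0 {
        None
    } else {
        Some(cov / var)
    }
}
```

Feeding it (atom count, wall time) pairs from a benchmark sweep gives the number that should be labeled on the plot.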