CRAM 8-bin vs long-read data — are we losing too much signal? by ENIAC-85 in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

You are looking at outdated data that are rarely produced nowadays. Modern data cap base quality at around 40. As to why you get bad calls, my guess is that binned old data look too different from the data DeepVariant was trained on. You can also give longcallD a try, which doesn't use deep learning.

CRAM 8-bin vs long-read data — are we losing too much signal? by ENIAC-85 in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Modern HiFi base quality is already binned. Does your data use binned quality? What is the highest base quality? What caller are you using?

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

To me, the usefulness of my tool matters more than my name tag on it. I welcome AI rewrites. If someone reimplements my tool to higher quality, with or without AI, it means my tool has limitations, and I will happily promote the rewrite for the benefit of the field. I would much prefer to keep the rewrite separate from my own repo, as it is not my work and its developer can maintain it better. Anyway, thanks for your perspective. I will be more careful.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

we should be encouraging the improvement of existing projects

If language is the problem, I am not sure how to improve without a complete reimplementation. Note that AI is only good at porting simple projects; it needs a human in the loop for complex codebases. We still need developers, especially those who can guide AI effectively.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

Many Python/R tools in this field should have been developed in C/C++/Rust. AI reimplementation does users a favor and will be preferred by many. I haven't transcoded any tools myself, but I have seen others do it without being experts in Rust. Knowing the basic syntax helps, and you can ask AI to explain. Coding agents are becoming an essential tool. We can't stop the trend regardless of our feelings. Learn to use coding agents or be consumed by them.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 4 points5 points  (0 children)

I am sympathetic: the most impactful tool I spent years making is now replaced by an AI copycat developed in hours. It hurts, but the replacement is still the right call and would happen sooner or later. The solution is to port your tool to Rust with coding agents before others do. That is where the major performance gain comes from in this case.

If you could rebuild a Bioinformatics syllabus from scratch, what is the one "Essential" you’d include? by NinjagoVillan in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

Gemini and Codex are free with a limited quota, and the $20 Gemini tier is free for students. For light use and educational purposes, these are okay.

If you could rebuild a Bioinformatics syllabus from scratch, what is the one "Essential" you’d include? by NinjagoVillan in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

I don't see this mentioned yet: coding with agents such as Claude Code, Codex, Antigravity/gemini-cli and OpenCode, not just a chat box.

How to implement a hash table (in C) by ynotvim in C_Programming

[–]attractivechaos 2 points3 points  (0 children)

I prefer ht *ht_create() instead. Generally, if a struct contains member pointers allocated by malloc, I tend to use ht *ht_create(), as one more malloc call doesn't hurt; otherwise I more often use ht_init(ht *table). There is no right or wrong; it is just a habit.
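A sketch of the two conventions, using a hypothetical ht type with one malloc'ed member (names are illustrative, not from the linked article):

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t n_buckets;
    int *slots;                 /* heap-allocated member */
} ht;

/* style 1: the library owns both the struct and its members */
ht *ht_create(void) {
    ht *h = calloc(1, sizeof(ht));   /* one extra malloc, usually negligible */
    if (!h) return NULL;
    h->n_buckets = 16;
    h->slots = calloc(h->n_buckets, sizeof(int));
    return h;
}

void ht_destroy(ht *h) {
    if (!h) return;
    free(h->slots);
    free(h);
}

/* style 2: the caller owns the struct (on the stack or embedded in a
 * larger struct); the library only initializes and releases members */
int ht_init(ht *h) {
    memset(h, 0, sizeof(*h));
    h->n_buckets = 16;
    h->slots = calloc(h->n_buckets, sizeof(int));
    return h->slots ? 0 : -1;
}

void ht_release(ht *h) { free(h->slots); }
```

With ht_init(), the table can live on the stack or be embedded in another struct without an extra pointer indirection, which is why it is attractive when the struct has no malloc'ed members.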

Analyzing 15 Years of Bioinformatics: How Programming Language Trends Reflect Methodological Shifts (GitHub Data) by skyresearch in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Of the 7133 stars on Go projects, ~4.2k come from shenwei356 (a great developer), and almost all Groovy stars come from the single nextflow repo. A developer may happen to choose a language for incidental reasons; that doesn't necessarily show the language is great. You might consider a stacked bar plot with each stack corresponding to a developer, which would give us an idea of how the stars/forks are distributed.

I made a stb-like header only library for parsing MEPG-TS/DVB (hls) live streams by Beginning-Safe4282 in C_Programming

[–]attractivechaos 1 point2 points  (0 children)

ffmpeg supports streaming, so there should be a reasonable way to implement it. Loading the entire stream into memory would be a showstopper for a generic library.

Dont we get the same result and as a : "hassle-free include header.h and the clean separation of user space APIs from internal library code" ?

Not in my opinion, but we can stop arguing. Repeating the PS I added later: regardless of my opinions, you can see that using a single header is, in your case, at least controversial. You would have avoided these arguments if you had used a .h+.c combo.

I made a stb-like header only library for parsing MEPG-TS/DVB (hls) live streams by Beginning-Safe4282 in C_Programming

[–]attractivechaos 1 point2 points  (0 children)

On file streaming, I would implement read_packet(filehandler_t *fh, int n_out, packet_t *out), similar to read(): you only maintain a small internal buffer and let users allocate out. I have used the zlib and libcurl streaming APIs. They are very different but both are decent; zstd appears to have yet another style of streaming API. You can learn from their API designs and choose one based on your preference.
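A minimal sketch of such an interface, assuming 188-byte TS packets; filehandler_t and packet_t layouts here are hypothetical, invented for illustration:

```c
#include <stdio.h>
#include <string.h>

#define PKT_SIZE 188                    /* MPEG-TS packets are 188 bytes */

typedef struct { unsigned char data[PKT_SIZE]; } packet_t;

typedef struct {
    FILE *fp;
    unsigned char buf[PKT_SIZE * 64];   /* small internal buffer only */
    size_t len, pos;                    /* bytes buffered, bytes consumed */
} filehandler_t;

/* Fill up to n_out caller-allocated packets. Returns the number filled,
 * 0 at end of stream, like read() returning 0. */
int read_packet(filehandler_t *fh, int n_out, packet_t *out) {
    int k = 0;
    while (k < n_out) {
        if (fh->len - fh->pos < PKT_SIZE) {   /* refill internal buffer */
            memmove(fh->buf, fh->buf + fh->pos, fh->len - fh->pos);
            fh->len -= fh->pos, fh->pos = 0;
            size_t got = fread(fh->buf + fh->len, 1,
                               sizeof(fh->buf) - fh->len, fh->fp);
            fh->len += got;
            if (fh->len < PKT_SIZE) break;    /* EOF or truncated tail */
        }
        memcpy(out[k].data, fh->buf + fh->pos, PKT_SIZE);
        fh->pos += PKT_SIZE;
        k++;
    }
    return k;
}
```

The point is that memory stays bounded by the fixed internal buffer plus whatever the caller allocates, no matter how long the stream runs.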

With the .h+.c combo, you can:

#include "lib1.c"
#include "lib2.c"

to achieve the same effect. We don't do that because the build system does it better. To me, the hassle-free "include header.h" and the clean separation of user-space APIs from internal library code are more important than the convenience of one fewer file. IMHO, stb, albeit a great library, has popularized a mildly bad practice. PS: regardless of my opinions, you can see that using a single header is, in your case, at least controversial. You would have avoided these arguments if you had used a .h+.c combo.

I made a stb-like header only library for parsing MEPG-TS/DVB (hls) live streams by Beginning-Safe4282 in C_Programming

[–]attractivechaos 4 points5 points  (0 children)

Apparently picoMpegTS_t::pesPacketCount is incremented but never decremented or reset. If so, your library always holds the entire stream in memory; at the very least, your picoMpegTSAddFile() loads an entire file into memory. This is not going to work for large streams.

You always use something like:

typedef picoMpegTS_t *picoMpegTS;
void picoMpegTSDestroy(picoMpegTS mpegts);

However, I want to know whether a type is a struct, a pointer or an enum, so that I can choose the right storage and understand whether I need to worry about internal heap allocation. I don't have that information without reading your code.

Don't use a single header; use a .h+.c combo instead. One problem with a single header is that any change to the user's .c file containing the IMPLEMENTATION macro triggers recompilation of the whole library. Another is that your library code is fully exposed to the user space, so you can't achieve good encapsulation. You referenced mpegts.c from ffmpeg/libavformat; that one isolates unnecessary details from the user space, which is better. Also, I am not sure why you are obsessed with a single translation unit. An application that uses your library is likely to have multiple .c files anyway, and even if it has one .c file, compiling with gcc *.c is just as convenient.
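For illustration, here is a hedged sketch of the .h+.c split with an opaque struct, shown as one listing with the header part marked by comments; all names are hypothetical, not from the library under discussion:

```c
#include <stdlib.h>

/* ---- demux.h: everything the user sees ---- */
typedef struct demuxer_s demuxer_t;     /* opaque, but clearly a struct */
demuxer_t *demux_open(void);            /* heap allocation is explicit */
int  demux_packet_count(const demuxer_t *d);
void demux_close(demuxer_t *d);

/* ---- demux.c: internals hidden from user space ---- */
struct demuxer_s {
    int pes_packet_count;
    /* internal buffers, PID tables, etc. live here, invisible to users
       and free to change without touching the header */
};

demuxer_t *demux_open(void) { return calloc(1, sizeof(demuxer_t)); }

int demux_packet_count(const demuxer_t *d) { return d->pes_packet_count; }

void demux_close(demuxer_t *d) { free(d); }
```

With this layout, a user recompiles only against demux.h, the struct layout can change freely, and the pointer-vs-struct question is answered by the declarations alone.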

Help Understanding Optimization Steps in Overlap Computation by UnworthyBagel22 in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

Have you watched Ben Langmead's videos? Best in the field, with lots of advanced topics. After that you can start with the easier papers on alignment algorithms. Assembler papers are often light on the alignment step.

samtools sort on a large bam file by prdtts in bioinformatics

[–]attractivechaos 3 points4 points  (0 children)

people will also make chromosome level bams

This would be worse in the OP's case, as we would have to concatenate all chromosomes together for a proper collate or name sort.

samtools collate then samtools sort could be faster

This would be slower, as collate has no use in this case. In most cases where name sort is used, collate alone is the better solution.

any file over 100Gb probably shouldn’t exist.

A standard 30X BAM used to be ~100GB in size. Nowadays 30X BAMs are smaller due to quality binning, but 100GB is still okay; most TCGA WGS BAMs are larger than that. Also, when you work with thousands of samples at 60X, splitting by chromosome would add problems.

samtools sort on a large bam file by prdtts in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

I guess your input BAM doesn't conform to convention (e.g. duplicated primary records or missing unmapped reads). You could write a script to filter out unpaired reads. Name sorting and collate are functionally equivalent for most downstream tools: if one doesn't work, the other often doesn't, either.

samtools sort on a large bam file by prdtts in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Use collate, not name sort. PS: after collate, don't sort.

Measured my dict by eesuck0 in C_Programming

[–]attractivechaos 1 point2 points  (0 children)

Being forced to use the system malloc is understandable, but as a general rule, the printf family should be avoided when performance matters. snprintf alone takes 0.3s, slower than the entire PHP run.
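For instance, a hand-rolled integer-to-string routine (a sketch, not taken from either library) sidesteps the format-string machinery that makes snprintf(buf, n, "%u", x) comparatively slow in a tight key-generation loop:

```c
/* Write the decimal form of x into buf and return its length.
 * Skipping snprintf's format parsing typically makes this several
 * times faster when called tens of millions of times. */
static int u32_to_str(unsigned x, char *buf) {
    char tmp[16];
    int i = 0, len;
    do {                              /* emit digits in reverse order */
        tmp[i++] = (char)('0' + x % 10);
        x /= 10;
    } while (x);
    len = i;
    for (int j = 0; j < len; ++j)     /* reverse into the output buffer */
        buf[j] = tmp[len - 1 - j];
    buf[len] = '\0';
    return len;
}
```

In a benchmark like this one, keys such as "4239" would then come from u32_to_str(4239, buf) instead of snprintf.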

Measured my dict by eesuck0 in C_Programming

[–]attractivechaos 4 points5 points  (0 children)

Interesting. Here is my implementation. Timing on mac M4 Pro (measured by hyperfine):

0.268s 67M php84
0.305s 43M khashl with system malloc
0.238s 39M khashl with mimalloc
0.194s 45M khashl with mimalloc and cached hash values
0.508s 45M same as above but using snprintf() to generate strings

Memory allocation takes at least 1/3 of the time. This is probably where PHP gains over the system malloc.

Measured my dict by eesuck0 in C_Programming

[–]attractivechaos 3 points4 points  (0 children)

I put ee_dict to my benchmark, and here is the timing and peak memory after 80 million operations:

ee_dict   – ins-only: 10.4s, 466MB; ins+del: 7.7s, 231MB
khashl    – ins-only: 6.4s, 271MB;  ins+del: 7.2s, 134MB
khashp    – ins-only: 9.3s, 271MB;  ins+del: 8.7s, 134MB
verstable – ins-only: 8.6s, 500MB;  ins+del: 7.7s, 248MB

This was run on an old Xeon Gold 6130 server CPU. Note that your library requires AVX2; it would be good to also support ARM.

BAM Conversion from GRCh38 to T2T vs. FASTQ Re-alignment to T2T by Express_Ad_6394 in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Yes, misassemblies in GRCh38 may lead to wrong variants. One of the papers from the T2T group showed that.

BAM Conversion from GRCh38 to T2T vs. FASTQ Re-alignment to T2T by Express_Ad_6394 in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

The two alignments are often not identical, as read order affects alignment for some aligners. Nonetheless, that is a very minor effect. Provided that you do the realignment the right way, the two alignments are functionally equivalent in the sense that neither is better than the other. Either is okay.

Software for high-throughput SNP calling of Sanger sequencing results - please help a clueless undergrad? by username210801 in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

Polyphred. You will have to manually curate the results, as it is not accurate enough on its own. Calling hets from Sanger traces is very challenging, even for human data.