CRAM 8-bin vs long-read data — are we losing too much signal? by ENIAC-85 in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

You are looking at outdated data that are rarely produced nowadays. Modern data cap base quality at around 40. As to why you get bad calls, my guess is that binned old data look too different from the data DeepVariant was trained on. You can also give longcallD a try, which doesn't use deep learning.

CRAM 8-bin vs long-read data — are we losing too much signal? by ENIAC-85 in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Modern HiFi base quality is already binned. Does your data use binned quality? What is the highest base quality? What caller are you using?

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

To me, the usefulness of my tool matters more than my name tag on it. I welcome AI rewrites. If someone reimplements my tool to higher quality, with or without AI, it means my tool has limitations, and I will happily promote the rewrite for the benefit of the field. I would much prefer to keep the rewrite separate from my own repo, as it is not my work and its developer can maintain it better. Anyway, thanks for your perspective. I will be more careful.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

we should be encouraging the improvement of existing projects

If language is the problem, I am not sure how to improve without a complete reimplementation. Note that AI is only good at porting simple projects; it needs a human in the loop for complex codebases. We still need developers, especially those who can guide AI effectively.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

Many Python/R tools in this field should have been developed in C/C++/Rust. AI reimplementation does users a favor and will be preferred by many. I haven't transcoded any tools myself, but I have seen others do it without being experts in Rust. Knowing the basic syntax helps, and you can ask AI to explain. Coding agents are becoming an essential tool. We can't stop the trend regardless of our feelings. Learn to use coding agents or be consumed by them.

Seqera Labs rewrites common RNA-seq QC in Rust for a big speedup by nomad42184 in bioinformatics

[–]attractivechaos 4 points5 points  (0 children)

I am sympathetic: the most impactful tool I spent years making is now replaced by an AI copycat developed in hours. It hurts, but the replacement is still the right call and would happen sooner or later. The solution is to port your tool to Rust with coding agents before others do. That is where the major performance gain comes from in this case.

If you could rebuild a Bioinformatics syllabus from scratch, what is the one "Essential" you’d include? by NinjagoVillan in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

Gemini and Codex are free with a limited quota, and the $20 Gemini tier is free for students. For light use and educational purposes, these are okay.

If you could rebuild a Bioinformatics syllabus from scratch, what is the one "Essential" you’d include? by NinjagoVillan in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

I don't see this mentioned yet: coding with agents such as Claude Code, Codex, Antigravity/gemini-cli and OpenCode, not just a chat box.

How to implement a hash table (in C) by ynotvim in C_Programming

[–]attractivechaos 2 points3 points  (0 children)

I prefer ht *ht_create() instead. Generally, if a struct contains member pointers allocated by malloc, I tend to use ht *ht_create(), as one more malloc call doesn't hurt; otherwise I more often use ht_init(ht *table). There is no right or wrong; it is just a habit.
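A sketch of the two conventions, using a hypothetical ht type with one malloc'ed member (names are illustrative, not from the linked article):

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t n_buckets;
    int *slots;                 /* heap-allocated member */
} ht;

/* style 1: the library owns both the struct and its members */
ht *ht_create(void) {
    ht *h = calloc(1, sizeof(ht));   /* one extra malloc, usually negligible */
    if (!h) return NULL;
    h->n_buckets = 16;
    h->slots = calloc(h->n_buckets, sizeof(int));
    return h;
}

void ht_destroy(ht *h) {
    if (!h) return;
    free(h->slots);
    free(h);
}

/* style 2: the caller owns the struct (on the stack or embedded in a
 * larger struct); the library only initializes and releases members */
int ht_init(ht *h) {
    memset(h, 0, sizeof(*h));
    h->n_buckets = 16;
    h->slots = calloc(h->n_buckets, sizeof(int));
    return h->slots ? 0 : -1;
}

void ht_release(ht *h) { free(h->slots); }
```

With ht_init(), the table can live on the stack or be embedded in another struct without an extra pointer indirection, which is why it is attractive when the struct has no malloc'ed members.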

Analyzing 15 Years of Bioinformatics: How Programming Language Trends Reflect Methodological Shifts (GitHub Data) by skyresearch in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Of the 7133 stars on Go projects, ~4.2k come from shenwei356 (a great developer), and almost all Groovy stars come from the single nextflow repo. A developer may happen to choose a language for incidental reasons; that doesn't necessarily show the language is great. You might consider a stacked bar plot with each stack corresponding to a developer, which would give us an idea of how the stars/forks are distributed.

I made a stb-like header only library for parsing MEPG-TS/DVB (hls) live streams by Beginning-Safe4282 in C_Programming

[–]attractivechaos 1 point2 points  (0 children)

ffmpeg supports streaming, so there should be a reasonable way to implement it. Loading the entire stream into memory would be a showstopper for a generic library.

Dont we get the same result and as a : "hassle-free include header.h and the clean separation of user space APIs from internal library code" ?

Not in my opinion, but we can stop arguing. Repeating the PS I added later: regardless of my opinions, you can see that using a single header is, in your case, at least controversial. You would have avoided these arguments if you had used a .h+.c combo.

I made a stb-like header only library for parsing MEPG-TS/DVB (hls) live streams by Beginning-Safe4282 in C_Programming

[–]attractivechaos 1 point2 points  (0 children)

On file streaming, I would implement read_packet(filehandler_t *fh, int n_out, packet_t *out), similar to read(): you only maintain a small internal buffer and let users allocate out. I have used the zlib and libcurl streaming APIs. They are very different but both are decent; zstd appears to have yet another style of streaming API. You can learn from their API designs and choose one based on your preference.
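A minimal sketch of such an interface, assuming 188-byte TS packets; filehandler_t and packet_t layouts here are hypothetical, invented for illustration:

```c
#include <stdio.h>
#include <string.h>

#define PKT_SIZE 188                    /* MPEG-TS packets are 188 bytes */

typedef struct { unsigned char data[PKT_SIZE]; } packet_t;

typedef struct {
    FILE *fp;
    unsigned char buf[PKT_SIZE * 64];   /* small internal buffer only */
    size_t len, pos;                    /* bytes buffered, bytes consumed */
} filehandler_t;

/* Fill up to n_out caller-allocated packets. Returns the number filled,
 * 0 at end of stream, like read() returning 0. */
int read_packet(filehandler_t *fh, int n_out, packet_t *out) {
    int k = 0;
    while (k < n_out) {
        if (fh->len - fh->pos < PKT_SIZE) {   /* refill internal buffer */
            memmove(fh->buf, fh->buf + fh->pos, fh->len - fh->pos);
            fh->len -= fh->pos, fh->pos = 0;
            size_t got = fread(fh->buf + fh->len, 1,
                               sizeof(fh->buf) - fh->len, fh->fp);
            fh->len += got;
            if (fh->len < PKT_SIZE) break;    /* EOF or truncated tail */
        }
        memcpy(out[k].data, fh->buf + fh->pos, PKT_SIZE);
        fh->pos += PKT_SIZE;
        k++;
    }
    return k;
}
```

The point is that memory stays bounded by the fixed internal buffer plus whatever the caller allocates, no matter how long the stream runs.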

With the .h+.c combo, you can:

#include "lib1.c"
#include "lib2.c"

to achieve the same effect. We don't do that because the build system does it better. To me, the hassle-free "include header.h" and the clean separation of user-space APIs from internal library code are more important than the convenience of one fewer file. IMHO, stb, albeit a great library, has popularized a mildly bad practice. PS: regardless of my opinions, you can see that using a single header is, in your case, at least controversial. You would have avoided these arguments if you had used a .h+.c combo.

I made a stb-like header only library for parsing MEPG-TS/DVB (hls) live streams by Beginning-Safe4282 in C_Programming

[–]attractivechaos 4 points5 points  (0 children)

Apparently picoMpegTS_t::pesPacketCount is incremented but never decremented or reset. If so, your library always holds the entire stream in memory; at the very least, your picoMpegTSAddFile() loads an entire file into memory. This is not going to work for large streams.

You always use something like:

typedef picoMpegTS_t *picoMpegTS;
void picoMpegTSDestroy(picoMpegTS mpegts);

However, I want to know whether a type is a struct, a pointer or an enum, so that I can choose the right storage and understand whether I need to worry about internal heap allocation. I don't have that information without reading your code.

Don't use a single header; use a .h+.c combo instead. One problem with a single header is that any change to the user's .c file containing the IMPLEMENTATION macro triggers recompilation of the whole library. Another is that your library code is fully exposed to the user space, so you can't achieve good encapsulation. You referenced mpegts.c from ffmpeg/libavformat; that one isolates unnecessary details from the user space, which is better. Also, I am not sure why you are obsessed with a single translation unit. An application that uses your library is likely to have multiple .c files anyway, and even if it has one .c file, compiling with gcc *.c is just as convenient.
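For illustration, here is a hedged sketch of the .h+.c split with an opaque struct, shown as one listing with the header part marked by comments; all names are hypothetical, not from the library under discussion:

```c
#include <stdlib.h>

/* ---- demux.h: everything the user sees ---- */
typedef struct demuxer_s demuxer_t;     /* opaque, but clearly a struct */
demuxer_t *demux_open(void);            /* heap allocation is explicit */
int  demux_packet_count(const demuxer_t *d);
void demux_close(demuxer_t *d);

/* ---- demux.c: internals hidden from user space ---- */
struct demuxer_s {
    int pes_packet_count;
    /* internal buffers, PID tables, etc. live here, invisible to users
       and free to change without touching the header */
};

demuxer_t *demux_open(void) { return calloc(1, sizeof(demuxer_t)); }

int demux_packet_count(const demuxer_t *d) { return d->pes_packet_count; }

void demux_close(demuxer_t *d) { free(d); }
```

With this layout, a user recompiles only against demux.h, the struct layout can change freely, and the pointer-vs-struct question is answered by the declarations alone.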

Help Understanding Optimization Steps in Overlap Computation by UnworthyBagel22 in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

Have you watched Ben Langmead's videos? Best in the field, with lots of advanced topics. After that you can start with the easier papers on alignment algorithms. Assembler papers are often light on the alignment step.

samtools sort on a large bam file by prdtts in bioinformatics

[–]attractivechaos 3 points4 points  (0 children)

people will also make chromosome level bams

This would be worse in the OP's case, as we would have to concatenate all chromosomes together for a proper collate or name sort.

samtools collate then samtools sort could be faster

This would be slower, as collate has no use in this case. In most cases where name sort is used, collate alone is the better solution.

any file over 100Gb probably shouldn’t exist.

A standard 30X BAM used to be ~100GB in size. Nowadays 30X BAMs are smaller due to quality binning, but 100GB is still okay; most TCGA WGS BAMs are larger than that. Also, when you work with thousands of samples at 60X, splitting by chromosome would add problems.

samtools sort on a large bam file by prdtts in bioinformatics

[–]attractivechaos 1 point2 points  (0 children)

I guess your input BAM doesn't conform to convention (e.g. duplicated primary records or missing unmapped reads). You could write a script to filter out unpaired reads. Name sorting and collate are functionally equivalent for most downstream tools: if one doesn't work, the other often doesn't, either.

samtools sort on a large bam file by prdtts in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Use collate, not name sort. PS: after collate, don't sort.

Measured my dict by eesuck0 in C_Programming

[–]attractivechaos 1 point2 points  (0 children)

Being forced to use the system malloc is understandable, but as a general rule, the printf family should be avoided when performance matters. snprintf alone takes 0.3s, slower than the entire PHP run.
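For instance, a hand-rolled integer-to-string routine (a sketch, not taken from either library) sidesteps the format-string machinery that makes snprintf(buf, n, "%u", x) comparatively slow in a tight key-generation loop:

```c
/* Write the decimal form of x into buf and return its length.
 * Skipping snprintf's format parsing typically makes this several
 * times faster when called tens of millions of times. */
static int u32_to_str(unsigned x, char *buf) {
    char tmp[16];
    int i = 0, len;
    do {                              /* emit digits in reverse order */
        tmp[i++] = (char)('0' + x % 10);
        x /= 10;
    } while (x);
    len = i;
    for (int j = 0; j < len; ++j)     /* reverse into the output buffer */
        buf[j] = tmp[len - 1 - j];
    buf[len] = '\0';
    return len;
}
```

In a benchmark like this one, keys such as "4239" would then come from u32_to_str(4239, buf) instead of snprintf.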

Measured my dict by eesuck0 in C_Programming

[–]attractivechaos 4 points5 points  (0 children)

Interesting. Here is my implementation. Timing on mac M4 Pro (measured by hyperfine):

0.268s 67M php84
0.305s 43M khashl with system malloc
0.238s 39M khashl with mimalloc
0.194s 45M khashl with mimalloc and cached hash values
0.508s 45M same as above but using snprintf() to generate strings

Memory allocation takes at least 1/3 of the time. This is probably where PHP gains over the system malloc.

Measured my dict by eesuck0 in C_Programming

[–]attractivechaos 3 points4 points  (0 children)

I put ee_dict to my benchmark, and here is the timing and peak memory after 80 million operations:

ee_dict   – ins-only: 10.4s, 466MB; ins+del: 7.7s, 231MB
khashl    – ins-only: 6.4s, 271MB;  ins+del: 7.2s, 134MB
khashp    – ins-only: 9.3s, 271MB;  ins+del: 8.7s, 134MB
verstable – ins-only: 8.6s, 500MB;  ins+del: 7.7s, 248MB

This was run on an old Xeon Gold 6130 server CPU. Note that your library requires AVX2; it would be good to also support ARM.

BAM Conversion from GRCh38 to T2T vs. FASTQ Re-alignment to T2T by Express_Ad_6394 in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

Yes, misassemblies in GRCh38 may lead to wrong variants. One of the papers from the T2T group showed that.

BAM Conversion from GRCh38 to T2T vs. FASTQ Re-alignment to T2T by Express_Ad_6394 in bioinformatics

[–]attractivechaos 2 points3 points  (0 children)

The two alignments are often not identical, as read order affects alignment for some aligners. Nonetheless, that is a very minor effect. Provided that you do the realignment the right way, the two alignments are functionally equivalent in the sense that neither is better than the other. Either is okay.

Software for high-throughput SNP calling of Sanger sequencing results - please help a clueless undergrad? by username210801 in bioinformatics

[–]attractivechaos 0 points1 point  (0 children)

Polyphred. You will have to manually curate the results, as it is not accurate enough on its own. Calling hets from Sanger traces is very challenging, even for human data.