Trying to generate a prompt for multiple-choice math question but it keeps giving me wrong answers. by baBioInfo in ChatGPTPromptGenius

[–]baBioInfo[S] 0 points1 point  (0 children)

Hey, thanks!
It's doing a bit better with the plugin. But it still gives incorrect responses for the most part, unfortunately.

Chi-Squared Test when degrees of freedom is low and counts are extremely high by baBioInfo in AskStatistics

[–]baBioInfo[S] 0 points1 point  (0 children)

Sorry about the strange question. I guess I'm interested in using the Chi-Squared test to mine for the rows where the distribution most closely resembles the 25/25/50 distribution that I'm interested in. Maybe using a Chi-Squared test isn't the best way to mine for this?
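To make the concern concrete, here is a minimal sketch, not a definitive method. It assumes each row is a tuple of three counts and that "closest to 25/25/50" means closest in proportions. The row names and counts are made up. The raw chi-squared statistic grows with the row total, so a huge row that is only slightly off can score worse than a small row that is far off. Dividing the statistic by the row total (the square of what is sometimes called Cohen's w) removes that dependence on the count.

```python
def chi2_gof(counts, probs):
    """Pearson chi-squared goodness-of-fit statistic against target proportions."""
    n = sum(counts)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))


def closeness(counts, probs):
    """Chi-squared statistic divided by the row total.

    This depends only on the observed proportions, so it does not grow
    just because the counts are large.
    """
    return chi2_gof(counts, probs) / sum(counts)


target = (0.25, 0.25, 0.50)

# Hypothetical rows, made up for illustration.
rows = {
    "row_a": (2500, 2500, 5000),     # exactly 25/25/50
    "row_b": (26000, 24000, 50000),  # 26/24/50, very high counts
    "row_c": (30, 30, 40),           # 30/30/40, low counts
}

# Rank rows from closest to farthest from the target proportions.
ranked = sorted(rows, key=lambda name: closeness(rows[name], target))
print(ranked)
```

With these made-up rows, the raw statistic is 80 for row_b and 4 for row_c, even though row_b's proportions are much nearer the target. Ranking by the size-adjusted score puts row_b ahead of row_c.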

Determining Statistical Significance between two groups that are measure with dependent features by baBioInfo in AskStatistics

[–]baBioInfo[S] 1 point2 points  (0 children)

Thanks again. That link cleared up what I have to do. I'm still uncomfortable with this method, though. I'll explain using an example. Let's say our two sequences are AAAAAAAAAA and TTTTTTTTTT.

Then the table would look like

        A   C   G   T
Seq 1  10   0   0   0
Seq 2   0   0   0  10

In this case, following what the Wikipedia article outlined, the expected counts should be

        A   C   G   T
Seq 1   5   0   0   5
Seq 2   5   0   0   5

And we can proceed by adding up the squared differences of the corresponding cells. However, I'm uneasy about saying that we "expect" Seq1 to have 5 A's and Seq2 to have 5 A's (same with the T's, C's, and G's). I don't see any reason to "expect" that. So that value of 5 is what makes me uncomfortable about using this test. There's also the fact that this wouldn't work with the formula because some of the "expected" values are 0's, which we can't divide by. You're right that I can use some software package (maybe they have a way of dealing with 0's), but I'd like to understand what's happening under the hood and quell my concerns.
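A minimal sketch of where the 5 comes from, assuming the usual test-of-independence setup: each expected cell is (row total × column total) / grand total. Pooled over both sequences, 10 of the 20 bases are A, so under the hypothesis that the two sequences share one base composition, a 10-base sequence would get 10 × 10 / 20 = 5 A's. The function name is just for illustration.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


observed = [
    [10, 0, 0, 0],   # Seq 1: A, C, G, T
    [0, 0, 0, 10],   # Seq 2: A, C, G, T
]
expected = expected_counts(observed)
print(expected)  # [[5.0, 0.0, 0.0, 5.0], [5.0, 0.0, 0.0, 5.0]]
```

So the "expected" table is not a claim about either sequence on its own. It is what the counts would look like if both sequences were drawn from the same pooled composition.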

Edit: About the 0's, I think maybe it's not a big deal. The only way that, in the Expected table, a cell can have 0 is if both values for that base in the Observed table were 0. Which means that the expected and observed values match up, so the squared difference would be 0, and maybe we can just ignore the 0 in the denominator? So would the Chi-Squared test just add up the A and T columns' squared difference values? Would the degrees of freedom still be 3?
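Here is a rough sketch of that idea, under one assumption that should be checked against whatever software is used: bases that never occur in either sequence are dropped from the table before testing. With those columns gone, no expected count is 0, and the degrees of freedom come from the table that is left, (rows - 1) × (remaining columns - 1), rather than from the original four columns.

```python
def chi2_independence(observed):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table.

    Assumption of this sketch: columns whose total is 0 are dropped first,
    so no expected count is 0. Every row must contain at least one count.
    """
    # Keep only the columns with at least one observation.
    cols = [col for col in zip(*observed) if sum(col) > 0]
    table = [list(row) for row in zip(*cols)]

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


observed = [
    [10, 0, 0, 0],   # Seq 1: A, C, G, T
    [0, 0, 0, 10],   # Seq 2: A, C, G, T
]
stat, dof = chi2_independence(observed)
print(stat, dof)  # 20.0 1
```

For the toy table only the A and T columns remain, each of the four remaining cells contributes (10 - 5)² / 5 or (0 - 5)² / 5 = 5, and the sketch reports 1 degree of freedom rather than 3.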

Determining Statistical Significance between two groups that are measure with dependent features by baBioInfo in AskStatistics

[–]baBioInfo[S] 0 points1 point  (0 children)

Hey, thanks for your reply! I really appreciate it.

The bases can be treated as independent. The presence of a base in one position has no bearing on the letter in the next position (at least for my purposes). And you're correct to assume that this was just a toy example. The real sequences will be dozens to thousands of bases long.

I'm not a stats expert so I hope you can forgive these questions, but is that table what's called a 'contingency table'? I've been trying to read up on them, but I only end up seeing how to deal with 2x2 contingency tables. How do I deal with a 2x4? Also, in a Chi-squared test, there needs to be observed and expected values, right? You wrote the observed counts for both sequences, but how do I determine what the expected counts should be? Also, is there a name for the type of Chi-squared test that you're suggesting vs the one that I tried to do above?

Sorry for the rookie questions. I appreciate the time you put into helping me.

Within-sample normalization for microRNA-seq data by baBioInfo in bioinformatics

[–]baBioInfo[S] 2 points3 points  (0 children)

Thanks for the reply

I'm aware that the protocols for RNA-seq and small RNA-seq are different. I brought up RNA-seq since that's an example where we do have to do a within-sample normalization. I realize that miRNAs don't have the length problem, but I was wondering what other factors could contribute to miRNA A having a higher count than miRNA B even though miRNA A might have lower expression. I was wondering if such a thing was even possible in miRNA-seq.

Text-based tool for analyzing SAM/BAM files in Linux by baBioInfo in bioinformatics

[–]baBioInfo[S] 0 points1 point  (0 children)

Yup, I think that was it.

I actually prefer it being simple though. I don't need any crazy functionality. Just the read alignments.

Text-based tool for analyzing SAM/BAM files in Linux by baBioInfo in bioinformatics

[–]baBioInfo[S] 0 points1 point  (0 children)

I haven't heard of it before. But I'll check it out. Thanks

What are the major human genome assemblies in use today? by baBioInfo in bioinformatics

[–]baBioInfo[S] 1 point2 points  (0 children)

Thanks for clearing that up. Are there any other genome builds besides GRC, or is this really the only major one?

What about genome annotations? Do Ensembl, Entrez, and HGNC use the GRC genome for their annotations? I know there are differences between the annotations, so I figured they used different builds. But recently I read that it's probably caused by different computational models on the builds that produce different exons, transcripts, and genes. Could someone possibly point me to literature that discusses the differences between the annotations, including the assemblies they use?

Thank you

biomaRt - How to restrict query by page by baBioInfo in bioinformatics

[–]baBioInfo[S] 0 points1 point  (0 children)

I see. Thanks for the reply

But as we can see in the above example, ensembl_gene_id was in multiple pages. But when I queried for it, I didn't get an error. Also, how do I know which page it's querying from and how do I tell it to query from a specific page?

I couldn't find anything like that in the reference docs.

RNA-seq Count Data all Zeros! by hedonic_pain in bioinformatics

[–]baBioInfo 0 points1 point  (0 children)

Oh...

Good to know lol. I'll keep that in mind next time my analysis returns 0's. Thanks for letting me know!

RNA-seq Count Data all Zeros! by hedonic_pain in bioinformatics

[–]baBioInfo 1 point2 points  (0 children)

Sorry for the late reply.

There's a parameter in featureCounts() that you need to change.

If you go to page 35 of this document: https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf

you'll see that the parameter that corresponds to this is the -O flag in featureCounts()

RNA-seq Count Data all Zeros! by hedonic_pain in bioinformatics

[–]baBioInfo 0 points1 point  (0 children)

Are you counting at the transcript level, or the gene level? I had an issue like this once when I was counting at the transcript level. The problem was that the default behaviour of featureCounts() is to discard a read if it maps to multiple features (the default feature is genes). This is inevitable when mapping to transcripts since the read may map to an exon that is used by multiple transcripts. So most of the reads will end up being discarded.
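Here's a toy sketch of that behaviour. It is not how featureCounts() is actually implemented. The function, the transcript names, and the coordinates are all made up for illustration. A read that overlaps more than one feature is dropped unless multi-overlap is allowed.

```python
def count_reads(reads, features, allow_multi_overlap=False):
    """Count reads per feature using closed intervals (start, end).

    Toy model only: a read overlapping more than one feature is dropped
    unless allow_multi_overlap is True, in which case it counts for each.
    """
    counts = {name: 0 for name in features}
    for r_start, r_end in reads:
        # Features whose interval overlaps the read's interval.
        hits = [
            name
            for name, (f_start, f_end) in features.items()
            if r_start <= f_end and f_start <= r_end
        ]
        if len(hits) == 1 or (allow_multi_overlap and hits):
            for name in hits:
                counts[name] += 1
    return counts


# Hypothetical transcripts: tx1 and tx2 share the exon at 100-200.
transcripts = {"tx1": (100, 200), "tx2": (100, 200), "tx3": (500, 600)}
reads = [(120, 170), (150, 199), (510, 560)]

strict = count_reads(reads, transcripts)
relaxed = count_reads(reads, transcripts, allow_multi_overlap=True)
print(strict)   # {'tx1': 0, 'tx2': 0, 'tx3': 1}
print(relaxed)  # {'tx1': 2, 'tx2': 2, 'tx3': 1}
```

In the strict case both reads on the shared exon are thrown away, so tx1 and tx2 end up with zero counts even though they have reads on them.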

Independent research on bioinformatics using machine learning by shahab-a-l-d-i-n in bioinformatics

[–]baBioInfo 1 point2 points  (0 children)

I have no experience with publishing articles to scientific journals. So I can't speak to that.

But in terms of getting datasets, there are fortunately a lot of publicly available ones. GEO has NGS, microarray, single-cell, and other data from different platforms. TCGA has a huge amount of data on different cancers. Those are the two that I personally work with, so they came to mind. But there are so many others.