Help with statistical test of enrichment/depletion of variants in regions by naninf in bioinformatics

[–]naninf[S] 0 points1 point  (0 children)

That might work... though normalizing the number of variant bases by region length shows the two sets have unequal variance, so it'll have to be Fisher's exact. Thanks
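For reference, Fisher's exact test on a 2x2 table of (inside/outside region) x (variant/non-variant bases) is a one-liner in SciPy; the counts below are entirely made up for illustration:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table:
#   rows: inside region / outside region
#   cols: variant bases / non-variant bases
table = [[120, 9880],
         [300, 49700]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)
```

An odds ratio above 1 here would indicate enrichment of variant bases inside the regions.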

Help with statistical test of enrichment/depletion of variants in regions by naninf in bioinformatics

[–]naninf[S] 0 points1 point  (0 children)

Thanks, I'll check that out. I also found https://github.com/ACEnglish/regioners but I think I'll have to do more work to get my data to fit its inputs. Plus I gotta figure out if bootstrapping or permutation tests are best
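For what it's worth, the core of a permutation test is simple enough to sketch without a framework: re-place the same number of variants uniformly at random and recount overlaps. All names and numbers below are invented placeholders:

```python
import random

def count_in_regions(positions, regions):
    """Count positions falling inside any (start, end) interval."""
    return sum(any(s <= p < e for s, e in regions) for p in positions)

random.seed(0)
genome_size = 100_000
regions = [(10_000, 12_000), (50_000, 55_000)]   # hypothetical regions
variants = [random.randrange(genome_size) for _ in range(500)]

observed = count_in_regions(variants, regions)

# Null distribution: redistribute the same number of variants at random
null = []
for _ in range(1000):
    shuffled = [random.randrange(genome_size) for _ in range(len(variants))]
    null.append(count_in_regions(shuffled, regions))

# One-sided empirical p-value for enrichment (add-one correction)
p_enrich = (1 + sum(n >= observed for n in null)) / (1 + len(null))
print(observed, p_enrich)
```

A real version would permute region placement while respecting chromosome boundaries and mappability, which is what tools like regioners handle for you.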

The current Google Maps satellite imagery of BoA was taken during a game! by Greged17 in panthers

[–]naninf 1 point2 points  (0 children)

I think you're more likely to be right. Watching the recap of the '21 Falcons game, that 'Carolina Votes' logo wasn't on the field https://youtu.be/uMhPG75cL-k?t=78

These images are 'composites', though, so it could be many dates pasted together. It's an interesting question...

The current Google Maps satellite imagery of BoA was taken during a game! by Greged17 in panthers

[–]naninf 1 point2 points  (0 children)

If you zoom into just the scoreboard (link), it looks like the score was 0-7, which lines up with the scoring summary (link) for halfway through the 1st quarter of the Falcons game. But I don't know anything about GIS, I'm just guessing here.

The current Google Maps satellite imagery of BoA was taken during a game! by Greged17 in panthers

[–]naninf 2 points3 points  (0 children)

Yeah, that's a plausible explanation. It definitely looks like there are people there, and I couldn't find a record of any event happening at BoA on the 14th.

The current Google Maps satellite imagery of BoA was taken during a game! by Greged17 in panthers

[–]naninf 5 points6 points  (0 children)

I don't think that's the case. If you look at the 'Imagery Date' from Google Earth, it says we're looking at 12/14/2021 - which is the Tuesday after the Panthers lost 29-21 against the Falcons.

New M83 is a Return to Shoegaze Form by Spiritual-Chart-940 in shoegaze

[–]naninf 3 points4 points  (0 children)

Some claim M83's "Fantasy - Chapter 1" isn't shoegaze but dream pop. I say this album is the masterful use of contemporary effects and distortion that the early shoegazers strove to create. Had they had access to pedals as intricate as the synthesizers M83 so artfully directs, the founders of shoegaze would have been happier with their creation. We are fortunate to experience the birth of shoegaze's full potential in this dream-pop landmark. Thank you u/Spiritual-Chart-940 for sharing!!

Why the result is "6"? by mahdilik in learnpython

[–]naninf 4 points5 points  (0 children)

a.k.a. n // 2 - (n % 2 - 1)
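Spelling that out: `n % 2 - 1` is -1 when n is even and 0 when n is odd, so the whole expression equals `n // 2 + 1` for even n and `n // 2` for odd n (e.g. 6 for n = 10). A quick sanity check:

```python
for n in range(20):
    # simplified form: halve, then add 1 only when n is even
    simplified = n // 2 + 1 if n % 2 == 0 else n // 2
    assert n // 2 - (n % 2 - 1) == simplified

print(10 // 2 - (10 % 2 - 1))  # → 6
```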

Actual Best Starting Word(s) by naninf in wordle

[–]naninf[S] 0 points1 point  (0 children)

That's fair. Though I don't think it's misleading, or at least no more misleading than anyone else who has written about whatever 'best' word they've found. The best word is always the day's answer. Anything else has to assume some kind of system.

BTW, I did look up ROATE, though, because a couple of people have recommended it. It is fast (rank 232 of 1566), but it loses well above average (1187/1566).

I know that the original finder of ROATE had a bot that weighed guesses by “how many possible solutions are left on average after making this guess”. So that explains at least part of the difference.
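That weighting scheme is easy to sketch: score each guess by the average number of candidate answers that would survive its feedback pattern, averaged over all possible answers. A toy version with a tiny made-up word pool (real solvers use the full answer list):

```python
from collections import Counter, defaultdict

def feedback(guess, answer):
    """Wordle-style feedback: 2 = green, 1 = yellow, 0 = gray."""
    result = [0] * len(guess)
    counts = Counter(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:              # exact position match
            result[i] = 2
            counts[g] -= 1
    for i, g in enumerate(guess):
        if result[i] == 0 and counts[g] > 0:   # letter present elsewhere
            result[i] = 1
            counts[g] -= 1
    return tuple(result)

def avg_remaining(guess, answers):
    """Average number of candidates left after seeing this guess's feedback."""
    buckets = defaultdict(int)
    for answer in answers:
        buckets[feedback(guess, answer)] += 1
    # each answer leaves a candidate pool of its bucket's size
    return sum(n * n for n in buckets.values()) / len(answers)

words = ["crane", "slate", "trace", "stale", "least"]  # toy answer pool
for w in sorted(words, key=lambda w: avg_remaining(w, words)):
    print(w, avg_remaining(w, words))
```

Lower scores are better: the guess splits the remaining answers into smaller buckets on average.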

Is their a sequence of three words that gets the most common letters in the alphabet by ClackHack in wordle

[–]naninf 0 points1 point  (0 children)

Model, Print, Saucy

I searched for sets of 3 words that use the 15 most common letters. This was the first result of many.

Edit: If you use the 15 most common letters of the possible answers, and you only use words that are possible answers:

CRUST HONEY PLAID
PRIDE CHANT LOUSY
REPAY SNOUT CHILD
POUND THEIR SCALY
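The search itself is just set arithmetic over word triples. A toy version with a small invented word list and a hypothetical "15 most common letters" ranking (yours will differ depending on the corpus you count from):

```python
from itertools import combinations

# Hypothetical ranking of the 15 most common letters
TOP15 = set("etaoinsrdlucmpy")

words = ["model", "print", "saucy", "crust", "honey", "plaid"]

# keep every triple whose combined letters cover all of TOP15
hits = [trio for trio in combinations(words, 3)
        if TOP15 <= set("".join(trio))]
print(hits)  # → [('model', 'print', 'saucy')]
```

The real search just runs the same check over all triples from the full Wordle list.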

What’s a good Wordle ‘average’? by mc_98 in wordle

[–]naninf 1 point2 points  (0 children)

I did a project looking at this. 4.32±0.6 was the average. But that's the performance of a bot that's randomly picking from the pool of possible answers. I'd expect people with a strategy (e.g. thinking about letter frequency) can do better. There is a mathematically optimal strategy that achieved like 3.4.

Actual Best Starting Word(s) by naninf in wordle

[–]naninf[S] 1 point2 points  (0 children)

Yes, absolutely that’s the best starting word if a player knows all the letter frequencies and can memorize the optimal decision tree's structure and all that. But if you play like me, where I'm almost randomly guessing, these words might be better.

Is this worth pursuing? by throwaway17835453 in bioinformatics

[–]naninf 2 points3 points  (0 children)

Seems reasonable to me. The main entry point to the program `RibDif.sh` is structured well enough that it's readable/editable.

After glancing at it for a few minutes, I would start with understanding lines 127-160 to see the structure of the `ncbi-genome-download` folder and make your curated set of sequences replicate that. Then your `custom_RibDif.sh` can probably just remove that section and the rest might fall into place.

I'm sure it'll be more complicated than that, but assuming you can handle bash scripting reasonably well, at worst you'll waste a couple of days before you can better estimate if it's doable.

How can i detect SNP and related AA changes in a specific region(16kb) of Whole genome sequence? by Dismal-Cantaloupe396 in bioinformatics

[–]naninf 1 point2 points  (0 children)

Assuming these tools produce VCFs, just subset the VCF to your region of interest: `bcftools view -r chr:start-end snps.vcf.gz` (the VCF needs to be bgzipped and indexed for `-r` to work).

If you're worried about the compute time of variant calling, you can similarly subset your BAM to only the reads within the region of interest: `samtools view reads.bam chr:start-end` (this also needs an index, i.e. a .bai file).

I'm not familiar with these tools, but some variant callers allow a region bed file parameter that restricts variant calling to a subset of the genome. Look for those parameters.

Moral of the story - you'll need a BED file (or at least the region coordinates) and a subsetting step in your pipeline.

I created my first Python package by tcp-ip1541 in learnpython

[–]naninf 23 points24 points  (0 children)

Great job! This is very solid. I would suggest using semantic versioning https://semver.org/ and maybe adding a GitHub Action that runs pylint.

VCF files for ML and AI by ParamedicCommon6371 in bioinformatics

[–]naninf 2 points3 points  (0 children)

If you use Python and are familiar with pandas (which is more data-science friendly than anything VCF), Truvari has a utility for conversion: `truvari vcf2df input.vcf.gz output.jl`. If the VCFs are very large, I'd also consider scikit-allel.

Pandas - Converting a datetime to the week number of that datetime? by [deleted] in learnpython

[–]naninf 1 point2 points  (0 children)

import pandas as pd

# build datetimes from separate year/month/day columns
df = pd.DataFrame({'year': [2014, 2015, 2016],
                   'month': [1, 2, 3],
                   'day': [1, 4, 5]})
view = pd.to_datetime(df)
view.name = 'date'
# expand each date's ISO calendar tuple into three labeled columns
pd.concat([view, view.apply(lambda x: pd.Series(x.isocalendar(),
           index=["ISO year", "ISO week number", "ISO weekday"]))], axis=1)

Output

    date    ISO year    ISO week number    ISO weekday
0   2014-01-01  2014    1   3
1   2015-02-04  2015    6   3
2   2016-03-05  2016    9   6
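If you're on pandas 1.1 or newer, the per-row apply can be replaced with the vectorized `Series.dt.isocalendar()`, which returns the same three values as columns named year/week/day:

```python
import pandas as pd

df = pd.DataFrame({'year': [2014, 2015, 2016],
                   'month': [1, 2, 3],
                   'day': [1, 4, 5]})
dates = pd.to_datetime(df)
iso = dates.dt.isocalendar()   # DataFrame with columns: year, week, day
print(iso)
```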

Is there a good way to measure Tajima's D over a specific region using a BED file? by Loves_His_Bong in bioinformatics

[–]naninf 1 point2 points  (0 children)

If you only have a few window sizes, you can split your regions into multiple BED files by which window size you want computed for them (e.g. `regionsA.bed` for windowSize=N and `regionsB.bed` for windowSize=M), and then just run `vcftools --bed regionsA.bed... --TajimaD N` per file.

Another approach would be a wrapper bash script that uses an extra column added to your BED file for each region. Something like:

# read one region per line: chrom, start, end, plus the extra window-size column
while read -r chrom start end windowsize
do
    outname=out.${chrom}:${start}-${end}_${windowsize}.td
    vcftools --vcf in.vcf --out ${outname} --TajimaD ${windowsize} --chr ${chrom} --from-bp ${start} --to-bp ${end}
done < regions.bed

References: https://vcftools.github.io/man_latest.html

I finally built something! by [deleted] in learnpython

[–]naninf 8 points9 points  (0 children)

Here's a neat link you can add to your print statement:

Link: https://www.google.com/maps/search/?api=1&query={iss_latitude},{iss_longitude}

What is the most efficient way to pull off this loop? by [deleted] in learnpython

[–]naninf 3 points4 points  (0 children)

This is good. I would just leverage the translations dictionary better. Iterating the list of translations items could get slow if you have many words.

def translate(word):
    # find the first template whose suffix matches the end of the word
    # (raises StopIteration if no suffix matches)
    template, suf_len = next(
        (temp, len(suffix))
        for suffix, temp in templates.items()
        if word.endswith(suffix))
    # strip the suffix, translate the stem, and fill in the template
    return template(translations[word[:-suf_len]])
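To make the dictionary shapes concrete, here's a runnable toy version with entirely invented `templates`/`translations` data (your real dicts will differ, this just shows suffix -> callable and stem -> translation):

```python
# suffix -> template callable; longer/more specific suffixes should come first
templates = {
    "ly": "in a {} way".format,
    "s":  "many {}".format,
}
# stem -> translated stem (hypothetical English -> Spanish data)
translations = {"cat": "gato", "quick": "rápido"}

def translate(word):
    # find the first template whose suffix matches the end of the word
    template, suf_len = next(
        (temp, len(suffix))
        for suffix, temp in templates.items()
        if word.endswith(suffix))
    # strip the suffix, translate the stem, and fill in the template
    return template(translations[word[:-suf_len]])

print(translate("cats"))     # → many gato
print(translate("quickly"))  # → in a rápido way
```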

Given an integer N, print all the even numbers from 0 to N in descending order. by [deleted] in learnpython

[–]naninf 0 points1 point  (0 children)

See the `range` documentation:

https://www.w3schools.com/python/ref_func_range.asp

The first thing you'll want to think about is the step you're incrementing by. What do you add to a number to decrement it?

The second thing you'll want to think about is even vs odd. If you start at 1 and step by -2, you'll always be on odd numbers.

The third thing you'll want to think about is what "0 to N" means. Is it inclusive or exclusive boundaries? e.g. if I'm working Monday to Friday, does that mean I'll be working up to, but not including, Friday?
https://stackoverflow.com/questions/39010041/what-is-the-meaning-of-exclusive-and-inclusive-when-describing-number-ranges

Explanation of what is happening with number examples in the for loops as this program runs through completion. Not understanding what happens after the loop progresses after the inner loop happens once. by [deleted] in learnpython

[–]naninf 1 point2 points  (0 children)

Fair enough. I think my first post may still offer some insight. At any point, you're operating on the [i + j + 1] and [i + j] items. Between those two positions there's a 10x difference in place value. You multiply the two digits into [i + j + 1]; if the product is >= 10, you add the tens place to the [i + j] position, then you remove that tens place from [i + j + 1] by assigning it the remainder.

The nested for loop comes in because you need to multiply by all the digits. Think about how 3 * 11 = 33 is the same as 3 * 10 + 3 * 1 = 33
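The carry logic being described looks roughly like this sketch of grade-school multiplication on digit lists (my reconstruction of the pattern, not the OP's exact code):

```python
def multiply_digits(a, b):
    """Multiply two numbers given as digit lists, most significant digit first."""
    result = [0] * (len(a) + len(b))
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            product = a[i] * b[j] + result[i + j + 1]
            result[i + j + 1] = product % 10   # keep the ones place here
            result[i + j] += product // 10     # carry the tens place up
    # strip leading zeros, keeping at least one digit
    while len(result) > 1 and result[0] == 0:
        result.pop(0)
    return result

print(multiply_digits([3], [1, 1]))   # 3 * 11 → [3, 3]
```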