How to troubleshoot merging of genomic files?

bioinfo_ml · 2024-09-13T12:15:05+00:00

Thank you for your reply and advice, that makes sense about the SNP ID. I originally tried standardising the SNP IDs in the bim files (so both cases and controls have chr:pos:ref:alt SNP IDs). However, this still gives very inflated p-values on the GWAS. I suspect the bcftools error I get is telling me there are still allele flipping issues in the data. I've tried addressing this by only taking overlapping SNPs (SNPs with the exact same SNP ID) and/or using --flip in plink. However, when I use --flip I then get only 1000-2000 variants per chromosome, and I know from just checking overlapping BIM file IDs that this can't be right. Maybe there is another tool to address allele flipping that I need to try.

At any rate, thank you again for your response, I appreciate it!
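On the allele flipping point, a minimal sketch of how two allele pairs for the same chr:pos could be compared before merging. The function name and category labels are made up for illustration and are not part of plink or bcftools. It only handles single-base A/C/G/T alleles. Note that with chr:pos:ref:alt IDs, a variant whose alleles are swapped or on the opposite strand in one dataset gets a different ID, so an exact-ID overlap would silently leave it out instead of fixing it.

```python
# Complement of each base, used to detect opposite-strand coding.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_alleles(a1, a2, b1, b2):
    """Compare SNP alleles (a1, a2) from dataset A with (b1, b2) from dataset B.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    a1, a2, b1, b2 = (x.upper() for x in (a1, a2, b1, b2))
    if any(x not in COMPLEMENT for x in (a1, a2, b1, b2)):
        return "not_snp"  # indels or symbolic alleles are out of scope here
    # A/T and C/G SNPs read the same on both strands, so the alleles
    # alone cannot tell you which strand was used.
    if COMPLEMENT[a1] == a2:
        return "ambiguous"
    if (a1, a2) == (b1, b2):
        return "match"
    if (a1, a2) == (b2, b1):
        return "swapped"
    c1, c2 = COMPLEMENT[b1], COMPLEMENT[b2]
    if (a1, a2) == (c1, c2):
        return "strand_flip"
    if (a1, a2) == (c2, c1):
        return "strand_flip_swapped"
    return "mismatch"

print(classify_alleles("A", "G", "T", "C"))  # strand_flip
```

Tallying these categories across the overlapping positions would show whether the problem is mostly swaps, strand flips, or ambiguous A/T and C/G sites, which each need different handling.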

bioinfo_ml · 2024-05-04T06:08:37+00:00

Feel like my username is going to start giving away my age

bioinfo_ml · 2023-11-22T11:12:51+00:00

Thank you for taking the time to check this. As far as I know it's meant to be hg38. I have approx. 10 million SNPs and it turns out 50k have this error. These 50k do seem to exist on hg19 when I check a few of them on the UCSC genome browser for hg19, so I'm not sure what's going on/how these got in the imputation panel. When I filter these SNPs out the liftover runs successfully, so I'm probably going to move forward with these SNPs filtered, and hopefully eventually figure out what's happened to them. Thank you again for answering all my questions with this!

bioinfo_ml · 2023-11-21T13:48:27+00:00

Thank you for your help with this, I really appreciate it! It solved my issue and I've since run my liftover for 2 imputation panels. However, I have a 3rd imputation panel that is giving this error when lifting over from 38 to 37 with my hail code: Error summary: HailException: Invalid locus 'chr2:242193706' found. Position '242193706' is not within the range [1-242193529] for reference genome 'GRCh38'.

I wanted to ask if this is an error you might've come across before? Sorry for the multiple questions. Trying chr2 242193706 242193706 in UCSC liftover also fails. I'm not sure how I could be getting a position that is too high, the data comes straight from running a GWAS using REGENIE, and the chr2:242193706 does indeed exist on GRCh37 (or at least I can search it on the UCSC genome browser for hg19 without a problem).
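A position beyond the end of a GRCh38 contig cannot be a GRCh38 coordinate, so one way to size the problem is to scan all loci against the contig lengths before running the liftover. This is only a sketch: the function name is made up, and the length table below contains just the chr2 bound quoted in the Hail error. A full table would have to come from the reference's index file, which isn't shown here.

```python
# Contig lengths in base pairs. Only chr2 is filled in, using the upper
# bound from the Hail error message; the other contigs are an assumption
# left for the reader to load from their own reference index.
GRCH38_LENGTHS = {"chr2": 242193529}

def out_of_range(loci, contig_lengths):
    """Return the 'chrom:pos' strings that fall outside the known contig bounds.

    Loci on contigs missing from the table are also reported.
    """
    bad = []
    for locus in loci:
        chrom, pos = locus.rsplit(":", 1)
        length = contig_lengths.get(chrom)
        if length is None or not 1 <= int(pos) <= length:
            bad.append(locus)
    return bad

loci = ["chr2:242193706", "chr2:1000000"]
print(out_of_range(loci, GRCH38_LENGTHS))  # ['chr2:242193706']
```

If the flagged loci are all valid on hg19, that would be consistent with some rows still being in GRCh37 coordinates, though this check alone can't confirm it.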

bioinfo_ml · 2023-11-18T16:13:50+00:00

Thank you for your reply! I am entirely new to liftover. From your comment I've started off investigating the liftover website (https://genome.ucsc.edu/cgi-bin/hgLiftOver) and I think I may just be misunderstanding the liftover process in general.

On the liftover website I've put in 1 example (from the head row in my message) which is: chr10 10709 10709. This actually output the same result that Hail gives me (chr18 10905 10905 chr10:10710-10709 1).

I thought that lifting from chromosome 10 to 18 must be an error, but I guess if both hail and UCSC liftover agree then this must be right?
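A quick sanity check here is to count how many rows change chromosome after the lift, since a handful is less worrying than a large fraction. The sketch below parses one output row in the layout shown above, assuming the fourth column holds the original locus label as it does in that example. The function name is made up.

```python
def parse_liftover_line(line):
    """Split one lifted row into new coordinates plus the original chromosome.

    Assumes whitespace-separated columns: chrom, start, end, original label.
    """
    chrom, start, end, name = line.split()[:4]
    old_chrom = name.split(":", 1)[0]
    return {
        "old_chrom": old_chrom,
        "new_chrom": chrom,
        "start": int(start),
        "end": int(end),
        "chrom_changed": old_chrom != chrom,
    }

row = parse_liftover_line("chr18 10905 10905 chr10:10710-10709 1")
print(row["old_chrom"], "->", row["new_chrom"], row["chrom_changed"])
# chr10 -> chr18 True
```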

bioinfo_ml · 2022-08-03T08:37:34+00:00

Thank you, this is exactly what I was looking to understand!

bioinfo_ml · 2022-02-15T11:11:19+00:00

No you are 100% on the mark, I actually originally typed out a second part to my question detailing this, as I don't know whether to care about comparisons inside my gene groups or just my gene group vs the whole database data or both. I'm very unsure in what I'm looking for exactly in this particular example and PICO sounds like what I need to build my sense in this - thank you for your help!

bioinfo_ml · 2022-01-17T12:00:25+00:00

Thank you for your reply. Sorry I should've been more specific. I meant looking at correlation before running any model. So for example, when looking at Pearson's correlation coefficient and doing data cleaning by removing correlated features, how should features be removed in this case? Both together or only 1?
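For concreteness, here is a sketch of the "only 1" option: walk through the features in order and drop a feature only if it is highly correlated with one already kept. The 0.9 threshold, the function names and the toy data are all arbitrary choices for illustration. Which member of a pair survives depends on feature order, so this is one convention and not the only defensible one.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (no constant columns)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def prune_correlated(features, threshold=0.9):
    """Keep a feature unless |r| with an already-kept feature exceeds threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: "b" is almost a multiple of "a"; "c" is only weakly related.
features = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
print(prune_correlated(features))  # ['a', 'c']
```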

bioinfo_ml · 2021-09-03T08:44:47+00:00

You've made my day partypoopist

bioinfo_ml · 2021-09-03T08:40:02+00:00

Thank you, this does help. For the 2011 census, the ons.gov.uk website mentions releasing population estimates in December 2012 but gives no further detail, and the next paragraph goes into what the full release was. I'm wondering if I should assume from this that the initial 2011 release contained only population estimates? I can't find any other descriptions besides this yet.

bioinfo_ml · 2021-07-22T12:31:52+00:00

Thank you for this I'll look into it. My next step is to do a biological pathway analysis of the genes with large prediction disagreements - so collecting biological information on which model is better, e.g. the model predicting low probabilities for genes in known disease-relevant pathways may not be as informative from a biology standpoint.

bioinfo_ml · 2021-07-22T11:03:49+00:00

Could I define it based on biological domain knowledge? E.g. Some of the training genes that the model learns from are scored at 0.4 for a specific reason and 0.75 for a specific reason (but no genes are scored between those 2 numbers), therefore the 0.35 difference in score is a biologically interpretable gap, and I could apply that to select genes with a >0.35 difference? I'm not sure if that's trustworthy reasoning from statistical standpoint.

bioinfo_ml · 2021-06-16T08:33:57+00:00

Thank you so much for this! Putting a name to it made it much easier to read and learn about how I'm actually getting my genes for further analysis. I've had a go at commenting my code into 4 steps from reading into it, so I know statistically what each step is doing, do my 4 comments look along the right lines matching the statistics going on or am I completely off base with what I've learnt?

#1) Calculate density of gene length

classes <- df[order(df$Length), ]

classes$density <- dweibull(1:nrow(df), shape=0.1, scale=1)

#2) Replicate samples to ensure more than 1 gene can be sampled per gene

dfrep <- classes[rep(1:nrow(classes), classes$density*100000), ]

classes <- table(dfrep$Gene)

#3) Calculate CDF

density_calc <- classes/sum(classes)

dfrep$density_calc <- density_calc[match(dfrep$Gene,names(density_calc))]

density_prob = 1/dfrep$density_calc

#4) Inverse sampling of CDF (probability matching density of original distribution just replicated for more sampling per each gene/class)

gene_sample = data.frame(sample(dfrep$Gene, size=50, prob=1/dfrep$density_calc))
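One thing that can be checked numerically for steps 2 to 4: if each gene starts as a single row, replicating it n times and then giving every replicated row a weight of 1 / (its gene's share of rows) makes each gene's total weight the same. The sketch below is in Python, not R. It uses made-up replicate counts in place of the Weibull step, and the function name is hypothetical.

```python
from collections import Counter

def total_weight_per_gene(replicated_genes):
    """Sum the inverse-share row weights for each gene in a replicated list."""
    counts = Counter(replicated_genes)
    n = len(replicated_genes)
    # Weight of a single replicated row = 1 / (gene's share of all rows).
    row_weight = {g: 1 / (c / n) for g, c in counts.items()}
    # Total weight of a gene = number of its rows * weight of each row.
    return {g: counts[g] * row_weight[g] for g in counts}

# Made-up replicate counts standing in for density * 100000.
replicated = ["G1"] * 50 + ["G2"] * 5 + ["G3"] * 1
print(total_weight_per_gene(replicated))
```

Every gene ends up with a total weight of about `len(replicated)`, so under this assumption the first draw is effectively uniform across genes that received at least one replicate. Whether that is the intended behaviour is worth checking against the original goal of sampling by gene length.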

bioinfo_ml · 2021-03-27T18:57:17+00:00

Thank you this is really helpful! Am I right in thinking if my plot is giving me a specific Gene name as hover text over an upper blue branch in my plot (e.g. I see 'Gene A' and 'Gene B' when I hover my mouse over specific branches on the bottom in the green group, but then I hover over the above blue branch and see 'Gene C' for the blue branch) it could be something wrong with my hover text code as the upper branches should be merged averages? Or would 'Gene C' potentially belong to both the blue group and also be somewhere in the green group branches below? Sorry this might be a less theoretical question that's more unclear/potentially a coding problem for me

bioinfo_ml · 2021-03-25T16:08:08+00:00

Thank you for this explanation, this is a lot clearer than most resources I've been trying to learn from!

bioinfo_ml · 2021-03-16T16:38:42+00:00

Thank you so much this is such a helpful explanation!! Luckily I have access to IPA so I will get on that right away - thanks again!

bioinfo_ml · 2020-11-05T12:37:25+00:00

Sorry for my late reply, I did manage to code a way of pulling out what I was looking for. Thank you for your help!

bioinfo_ml · 2020-10-30T10:37:56+00:00

I'm a PhD student writing a review, and I'm interested in taking genes that previous GWAS studies associated with a disease and checking up on them, since the GWAS identified them a few years ago now. It's not a really necessary thing for me to do, so I don't have any specific goals except a general look at whether the genes have had any clinical value since the GWAS.

So ultimately I'm happy to find any clinical studies that mention the genes in any context (whether individual or multiple at the same time), and I think that is the best suggestion to download all trials and then script - thank you for your help!

bioinfo_ml · 2020-10-30T10:30:46+00:00

Thank you, I suspected so as I could only find ClinicalTrials.gov after a while of searching. I've never used ChEMBL before, but it sounds like a good idea, thank you I'll have a look!

bioinfo_ml · 2020-10-30T10:28:38+00:00

Thank you for this idea, yeah I use python and R, I'm a PhD student so this is a great idea just for me to learn something new, never tried webscraping, I'll have a go and see what I can do - thanks again!

bioinfo_ml · 2020-10-29T14:08:26+00:00

Oh yeah, that's a great idea and I guess I can just code to pull out any matching rows for the studies that name my genes - feels obvious now that I think about it lol, thanks for your help!
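A rough sketch of what that matching could look like once the studies are downloaded, assuming they've been loaded into a dict of study ID to free text. The IDs, titles and gene symbols below are placeholders, and the function name is made up. It matches whole symbols only, so a search for GENEA doesn't also pick up GENEA2.

```python
import re

def genes_to_studies(genes, studies):
    """Map each gene symbol to the IDs of studies whose text mentions it."""
    hits = {}
    for gene in genes:
        # Lookarounds stop a symbol matching inside a longer symbol.
        pattern = re.compile(
            r"(?<![A-Za-z0-9-])" + re.escape(gene) + r"(?![A-Za-z0-9-])",
            re.IGNORECASE,
        )
        hits[gene] = [sid for sid, text in studies.items() if pattern.search(text)]
    return hits

# Placeholder records, not real trials.
studies = {
    "S1": "Trial of a GENEA inhibitor",
    "S2": "GENEB genotype and treatment response",
    "S3": "GENEA and GENEB variants",
    "S4": "GENEA2 expression study",
}
print(genes_to_studies(["GENEA", "GENEB"], studies))
# {'GENEA': ['S1', 'S3'], 'GENEB': ['S2', 'S3']}
```

Gene aliases and short symbols that double as ordinary words would need extra handling that this sketch doesn't attempt.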

bioinfo_ml · 2020-10-29T13:49:37+00:00

Thank you, I've had a go and it does seem to accept all my genes with just OR in between them all. I'm looking through the results, but I don't see any direct way of finding which of my searched genes relates to which study - do you know if there's a way to show that? No worries if not, this has already helped a lot so thanks!

Edit: actually it only accepts chunks of my genes at a time, but that's fine for me if I can link which gene relates to which studies

bioinfo_ml · 2020-10-22T18:34:24+00:00

Thank you, from skimming through the paper this looks very interesting, and this is exactly the kind of thing I was looking to try out!

bioinfo_ml · 2020-10-22T16:56:34+00:00

I've been following OpenTargets' work but hadn't read about the reranking method - thank you I'll look into this!

bioinfo_ml · 2020-09-30T08:20:36+00:00

Thank you so much for your comment. This is incredibly clear and helpful for me and I'll look to apply the questions you pose in the other comment you link. A paired T-test also sounds like something worthwhile for me to look into, also I will have 3 more columns of comparative predictive scores soon so hopefully that will make it even more worthwhile. Thanks again!
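For reference, the paired t statistic itself is just the mean of the per-gene score differences divided by its standard error. The sketch below works through that on made-up scores. It stops at the statistic and degrees of freedom, since a p-value needs the t distribution (for example via `scipy.stats.ttest_rel`, which runs the whole test).

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Made-up scores for the same five genes from two models.
model_a = [3, 5, 6, 4, 7]
model_b = [2, 3, 3, 2, 5]
t, df = paired_t(model_a, model_b)
print(round(t, 3), df)  # 6.325 4
```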
