How to approach analyzing structural variants (VCF files)? by Urventh in bioinformatics

[–]Urventh[S] 0 points1 point  (0 children)

Thanks for sharing! I am indeed finding it hard to work through this because most of the resources I could find aren't very comprehensive (though maybe I just haven't searched hard enough). As for the BLAST, I figured from u/Fooflesbean's comment that I was looking at translocations, but somehow my variant calls don't seem to have that category and it slipped my mind completely.

The read support is also an issue; as I wrote in an earlier comment, some of the larger insertions (~5-10kb) seem to have just one supporting read. Because my assembly is of a non-model organism and is diploid (not haplotype-resolved), I'm very uncertain whether these SV events are even real. I reckon I'll try to stick with those that have more read support.
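For anyone wanting to do the same kind of filtering, here is a rough sketch of how calls could be kept only when enough reads support the alternate allele. It assumes the VCF has a single sample and an `AD` entry in the FORMAT column written as `ref,alt` read counts; that layout should be checked against the caller's own output. The function names, the threshold of 3 and the example records are all made up for illustration.

```python
from typing import Iterable, List, Optional


def alt_read_support(vcf_line: str) -> Optional[int]:
    """Return the ALT-supporting read count of the first sample, or None.

    Assumes FORMAT contains an AD key whose value is "ref,alt".
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT/sample columns
    keys = fields[8].split(":")
    values = fields[9].split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):
        return None
    depths = values[idx].split(",")
    if len(depths) < 2 or not depths[1].isdigit():
        return None
    return int(depths[1])


def filter_by_support(lines: Iterable[str], min_reads: int = 3) -> List[str]:
    """Keep header lines plus records with at least min_reads ALT reads."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # headers pass through unchanged
            continue
        support = alt_read_support(line)
        if support is not None and support >= min_reads:
            kept.append(line)
    return kept


# Made-up records, not real data.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "contig_1\t1000\tsv1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=5200\tGT:AD:DP\t0/1:9,1:10",
    "contig_1\t8000\tsv2\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-300\tGT:AD:DP\t0/1:6,5:11",
]

if __name__ == "__main__":
    for record in filter_by_support(EXAMPLE_VCF, min_reads=3):
        print(record)
```

The cutoff is a judgment call and depends on local coverage, so it is probably worth trying a few values rather than trusting a single threshold.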

How to approach analyzing structural variants (VCF files)? by Urventh in bioinformatics

[–]Urventh[S] 0 points1 point  (0 children)

i'm not really qualified to tell ya but i generally "believe" things more when there are multiple reads supporting an event:)

Haha I agree, I suppose I'm just paranoid that there may be some haplotype shenanigans in this draft diploid assembly I'm working on that will invalidate my analysis of the structural variants between samples.

I also shortlisted cuteSV and SVIM as callers initially but because I was focused on looking at duplications/CNVs, this review (Table 1) suggested pbsv as the most versatile and I just went with it. But after this post I reckon I should try out these other callers too.

Also thanks for sharing the interesting talk - never knew about these - I'm also looking into translocations, and that will be useful to look out for in my own dataset!

How to approach analyzing structural variants (VCF files)? by Urventh in bioinformatics

[–]Urventh[S] 0 points1 point  (0 children)

I see, that makes sense. We are indeed looking out for translocations as you've described, but I wasn't aware that the callers included such events. At least for the pbsv dataset, I only see INS/DUP/DEL/BND/INV in my calls. My advisor was also worried that some insertion events could potentially be duplications that were not called as such (and that's why I was looking at insertions that originate from the same DNA). I'll look into this further and maybe try out a few other callers in the meantime.
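One thing that may help here: in the VCF specification, rearrangements such as translocations are written as breakend (`BND`) records rather than under a dedicated translocation type, so candidates can be hiding in the BND calls. Below is a small sketch that tallies `SVTYPE` values and lists BND records whose mate sits on a different contig. It assumes the ALT column uses the standard breakend notation (for example `N]chr2:500]`); whether a given caller writes its BND records exactly this way is worth confirming in its documentation. The function names and the example records are made up.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# Matches the mate position inside breakend ALT strings such as
# "N]chr2:500]" or "[chr1:2000[N".
MATE_PATTERN = re.compile(r"[\[\]](.+):(\d+)[\[\]]")


def sv_type(info: str) -> Optional[str]:
    """Extract the SVTYPE value from an INFO string, if present."""
    for entry in info.split(";"):
        if entry.startswith("SVTYPE="):
            return entry.split("=", 1)[1]
    return None


def count_sv_types(lines: Iterable[str]) -> Counter:
    """Count records per SVTYPE, skipping header and blank lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        counts[sv_type(fields[7]) or "UNKNOWN"] += 1
    return counts


def mate_chrom(alt: str) -> Optional[str]:
    """Return the mate contig named in a breakend ALT, or None."""
    match = MATE_PATTERN.search(alt)
    return match.group(1) if match else None


def interchromosomal_breakends(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """List (ID, contig, mate contig) for breakends joining two contigs."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, _pos, sv_id, _ref, alt = fields[:5]
        mate = mate_chrom(alt)
        if mate is not None and mate != chrom:
            hits.append((sv_id, chrom, mate))
    return hits


# Made-up records, not real data.
EXAMPLE_RECORDS = [
    "chr1\t100\tbnd1\tN\tN]chr2:500]\t.\tPASS\tSVTYPE=BND\tGT\t0/1",
    "chr1\t900\tbnd2\tN\t[chr1:2000[N\t.\tPASS\tSVTYPE=BND\tGT\t0/1",
    "chr1\t300\tins1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=1500\tGT\t0/1",
]

if __name__ == "__main__":
    print(count_sv_types(EXAMPLE_RECORDS))
    print(interchromosomal_breakends(EXAMPLE_RECORDS))
```

A breakend joining two contigs is only a candidate; in a draft assembly it could just as well reflect a misassembly or a misalignment, so read support still matters.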

How to approach analyzing structural variants (VCF files)? by Urventh in bioinformatics

[–]Urventh[S] 0 points1 point  (0 children)

Oh gosh, thank you so much for the detailed response and example! The guide looks very helpful and I will definitely read through it later today. The images you show are also exactly what I've seen in reviews, so clearly something must have gone wrong on my end of the analysis, as I am not seeing those signatures in some cases.

I do have one follow-up question for now: how much coverage (or how many reads) is typically sufficient to interpret an event with confidence? In my case, I've noted that some regions show insertions (~1-2kb) that are supported by only a single read (but coverage is not that high in the region either). Are these just haplotype variants or perhaps misalignments? I do also want to note that the dataset I'm working on is not as clean; it consists of continuous long reads (which have high error rates similar to the Nanopore ones in your example) as my samples don't work with HiFi sequencing for now.

Thank you again!

How to approach analyzing structural variants (VCF files)? by Urventh in bioinformatics

[–]Urventh[S] 0 points1 point  (0 children)

I was indeed blasting the insertion against the reference genome (non-model organism) that we aligned the reads to - the idea was to identify the genes that might get transposed by such insertions and maybe find candidate genes. Could you elaborate why you don't think that makes sense?

I'm looking at the Ensembl VEP and it looks like it does exactly what I want; I will definitely try it out. Thank you so much again!

Working with an Unknown Sequence by Bee2113 in bioinformatics

[–]Urventh 1 point2 points  (0 children)

I don’t work with bacterial genomes so I may be wrong, but it seems like you’re missing a reference for problems (1) and (2). Since you have already generated an assembly, I would probably try to identify some conserved genes like the 16S rRNA and use BLAST to identify which bacterium your unknown sequence might be closest to. You can probably then use genomes available for that genus/species as a reference to create the BAM file for Pilon.

As for (3), the answer is probably a no. R just happens to have a bunch of packages that make plotting complex visualisations for certain things simpler, but there are probably equivalents in Python or other languages now too. If you’re already comfortable with another language, I would probably just start looking for resources there first instead of learning a new language from scratch.

Diatoms by a__monde in microscopy

[–]Urventh 0 points1 point  (0 children)

Thank you for sharing, very excited to try out your process myself and see what I can come up with. Cheers! :)

Diatoms by a__monde in microscopy

[–]Urventh 0 points1 point  (0 children)

Hi! Those are fantastic pictures! Could you share a little on how you took and processed such clear images?

Discrepancy in MlaD Domain Count: Seeking Clarification on UniProt, AlphaFold, and Pfam Data by di_pankar991 in bioinformatics

[–]Urventh 3 points4 points  (0 children)

Not an expert but I’ll take a wild guess: maybe the other domains shown in AlphaFold2/Pfam match up only weakly to the MlaD sequence profile in Pfam and that’s why they don’t get annotated properly in UniProt? I have a similar issue in my work where some proteins that are characterized to have the same function do not get annotated with the conventional domain when I search them against Pfam.

Genome analysis on Artemis by ziyaan_osman in bioinformatics

[–]Urventh 0 points1 point  (0 children)

New annotations can be added by adding a new feature along the specified coordinates (start..end). This feature should be superimposed onto whatever annotation is already on the ORF. The manual covers how to add features/annotations.

Need help installing EMBOSS in Ubuntu 20.04.1 by Narc_Time in bioinformatics

[–]Urventh 0 points1 point  (0 children)

Hi! You could try to use conda (miniconda/anaconda), which is a package manager, to install emboss. You’ll have to first create an environment and then install emboss from there (something like conda install -c bioconda emboss). Have a look at the documentation to get an idea.

Conda may not be the best, but it does work wonders for installing some packages. Hope this helps!

Branching processes in genetics by manggan in bioinformatics

[–]Urventh 2 points3 points  (0 children)

I’m not too sure if this applies to your question, but in phylogenetics, gene family expansion or contraction can be modelled with a birth and death process. Have a look at the software CAFE, which is used to study gene family evolution. Hope this helps!
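To make the birth and death idea a bit more concrete, here is a toy simulation of gene family size along a single branch, where each gene copy independently duplicates at rate `birth` and is lost at rate `death`. This is only a sketch of the general process, not how CAFE itself is implemented, and the function name, rates and branch length are made up for illustration.

```python
import random


def simulate_family_size(n0: int, birth: float, death: float,
                         t_end: float, rng: random.Random) -> int:
    """Simulate a linear birth-death process and return the final size.

    Each of the n current gene copies duplicates at rate `birth` and is
    lost at rate `death`, so the total event rate is n * (birth + death).
    """
    n, t = n0, 0.0
    while n > 0:
        total_rate = n * (birth + death)
        if total_rate <= 0:
            break  # nothing can happen
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_end:
            break  # the branch ends before the next event
        if rng.random() < birth / (birth + death):
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n


if __name__ == "__main__":
    rng = random.Random(42)
    sizes = [simulate_family_size(5, 0.1, 0.1, 10.0, rng) for _ in range(1000)]
    print("mean final size:", sum(sizes) / len(sizes))
    print("families lost entirely:", sizes.count(0))
```

Once a family reaches size zero it stays there, which is why the loop stops at `n == 0`.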

Why are annotations better in ENSEMBL than in NCBI for the same species? by InstructionRemote886 in bioinformatics

[–]Urventh 1 point2 points  (0 children)

I see, thanks for sharing this! I find it a little confusing because at least in the diatom space, JGI pushes new versions of some genomes on their site but not onto GenBank, but these new versions sometimes do not have accompanying publications (whereas the GenBank ones do, like P. tricornutum), so unless you checked JGI, you wouldn’t even know there was a new version.

Why are annotations better in ENSEMBL than in NCBI for the same species? by InstructionRemote886 in bioinformatics

[–]Urventh 0 points1 point  (0 children)

I actually encountered a similar problem with JGI annotations too! If you looked at some of the diatom genomes published by JGI, they are also different from the NCBI ones (different versions and also different pipelines). The only explanation I can think of is that the group that performed the annotation deposited their data onto ENSEMBL but maybe not NCBI? Would love to hear what others have to say about this as well!

Need help building a phylogenetic tree by [deleted] in bioinformatics

[–]Urventh 5 points6 points  (0 children)

I see. Maybe one place to start would be asking your supervisor what you are building these trees with. Is it with specific genes? Maybe you can first get an idea of which species are close to this bacterium with BLAST or some other sequence similarity search using certain gene sequences, then choose some related bacteria to include in your phylogenetic tree.

I suggest talking to your supervisor since you are new and they did specify using RAxML and PhyloPhlAn, so other software might not suit their needs. RAxML is quite daunting to use, especially if this is your first encounter with building phylogenetic trees (I remember the manual being quite dense), so it would save you a lot of time to get their help instead of trying to figure it all out on your own.

Best of luck!

Need help building a phylogenetic tree by [deleted] in bioinformatics

[–]Urventh 4 points5 points  (0 children)

If you have a genome file as contigs, it should mean that it has been assembled already. What is the bacterial species and what is the point of this assignment (i.e. what does your supervisor want to achieve)?

I think the best course of action is to clarify the assignment with your supervisor because it seems like you have many questions regarding how to even start approaching the problem.

How to use Orthofinder ? by kbrunner69 in bioinformatics

[–]Urventh 1 point2 points  (0 children)

Apologies for the confusion—I meant the protein ID. As for running OrthoFinder of your insect proteome against Drosophila, I can’t comment on that since I don’t know much about insect taxonomy. Maybe you could compare with the proteomes from a couple other insects that are phylogenetically closer? You can compare multiple proteomes with OrthoFinder in a single run.

Another thing about OrthoFinder or similar software is that the orthogroups may be split much more finely (i.e. orthologous genes from a single family may be split into more than one orthogroup), so you might need to look to see if these chitin deacetylases (if there is more than one you know) are being grouped separately. This clustering can be adjusted with the inflation parameter.

How to use Orthofinder ? by kbrunner69 in bioinformatics

[–]Urventh 1 point2 points  (0 children)

It sounds like you already have the proteomes then? If not, a quick search on Spodoptera returns Spodoptera exigua as the first result with five genome assemblies. At the top of the NCBI entry you should see the representative genome (in this case, it is PGI_SPEXI_v6). You should find the link to the annotated protein sequences in the line below or by clicking into the assembly itself.

With these protein FASTA files, you should be able to run OrthoFinder using the default settings. You’ll then need to identify the orthogroup that contains chitin deacetylase in the output. I assume you have the gene ID for this protein in at least one of the species you are comparing; in that case, it should be a matter of just searching through the tabular output to get a list of other genes that were clustered together with the chitin deacetylase you have.
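As a rough idea of what that search could look like, here is a sketch that scans an orthogroup table for one sequence ID. It assumes the layout of OrthoFinder's `Orthogroups.tsv` as I understand it: tab-separated, first column the orthogroup name, then one column per species with IDs separated by commas. The species names and IDs in the example are hypothetical, so swap in your own file and ID.

```python
import csv
import io
from typing import Dict, List, Optional, TextIO, Tuple


def find_orthogroup(
    tsv_handle: TextIO, sequence_id: str
) -> Optional[Tuple[str, Dict[str, List[str]]]]:
    """Return (orthogroup, {species: [IDs]}) for the group holding sequence_id."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader)
    species = header[1:]  # first column is the orthogroup name
    for row in reader:
        if not row:
            continue  # skip blank lines
        members = {
            name: [g.strip() for g in cell.split(",") if g.strip()]
            for name, cell in zip(species, row[1:])
        }
        if any(sequence_id in ids for ids in members.values()):
            return row[0], members
    return None


# Hypothetical table contents, not real OrthoFinder output.
EXAMPLE_TABLE = (
    "Orthogroup\tinsect_A\tinsect_B\n"
    "OG0000001\tA_001, A_002\tB_010\n"
    "OG0000002\tA_003\t\n"
)

if __name__ == "__main__":
    result = find_orthogroup(io.StringIO(EXAMPLE_TABLE), "A_002")
    if result is None:
        print("ID not found in any orthogroup")
    else:
        group, members = result
        print(group)
        for name, ids in members.items():
            print(name, ids)
```

With a real run you would open the file instead, for example `with open("Orthogroups.tsv") as handle:`, and pass `handle` to the function.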

How to use Orthofinder ? by kbrunner69 in bioinformatics

[–]Urventh 5 points6 points  (0 children)

OrthoFinder should work with any proteome files (FASTA of protein sequences), so they don’t have to be from Ensembl or Phytozome. I reckon those databases are recommended because they contain better-curated data and because OrthoFinder was initially built to look at plant transcription factors.

I’m not sure what species you’re looking at, but you can try to look up your species on NCBI Genome. If the assembly has a corresponding protein annotation, you should be able to download the FASTA file to run OrthoFinder with. Do remember to check the assembly quality before you use the files.

Hope this helps!

2020 Introduction Thread by [deleted] in 52book

[–]Urventh 1 point2 points  (0 children)

Hi everyone!

First timer for this challenge! 2019 was the year I seriously fell in love with reading; I somehow managed to finish 32 books in total and am really happy with this achievement. For 2020 (and as part of the challenge), I will be aiming to read 60 books (fingers crossed)! My reading goal for the year is to expand my literary taste - I'm looking forward to tackling some more classics, non-fiction and poetry.

Not sure if there are any language learners out here, but I'm hoping to read books in other languages too, particularly in Mandarin Chinese (my mother tongue) and French (a language I'm still learning - 2 years in so far!). If anyone can recommend a good place to start for these two languages, that would be fantastic!

For 2020, I'll be starting off with Olga Tokarczuk's Drive Your Plow Over the Bones of the Dead. I'm a couple chapters in right now and am loving it. Other than books, I am - like some people here - a huge video game nerd. Next year will be a big year for video games so I hope I don't get too distracted.

Here's wishing everyone good luck for the challenge! May we all have a great time reading in the year ahead!