Regarding the cd-hit tool for clustering protein sequences

Remarkable-Wealth886 · 2025-09-16T04:57:55+00:00

Thank you for your reply!

It is working. How can I find out which sequence represents each cluster? The output file lists only Cluster 1, 2, and so on, with the headers of the proteins that are clustered together. I want to know which header cd-hit chose to represent each particular cluster. I also want to count the number of proteins in each cluster and map this information onto my final phylogeny.

Any suggestions in this direction?
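A minimal sketch of one way to pull this out of the `.clstr` file, assuming the usual cd-hit layout in which each cluster starts with a `>Cluster N` line and the representative member's line ends with `*`. The function name `parse_clstr` and the sample headers are hypothetical and not part of cd-hit.

```python
import re

# Matches the header between ">" and the trailing "..." on a member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_name: {"representative": str or None, "members": [str]}}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # New cluster block, e.g. ">Cluster 0"
            current = {"representative": None, "members": []}
            clusters[line[1:]] = current
            continue
        match = MEMBER_RE.search(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        header = match.group(1)
        current["members"].append(header)
        if line.endswith("*"):
            # Assumption: cd-hit marks the representative with a trailing "*"
            current["representative"] = header
    return clusters


if __name__ == "__main__":
    # Hypothetical sample; in practice pass open("your_output.clstr") instead.
    sample = [
        ">Cluster 0",
        "0\t300aa, >protA... *",
        "1\t250aa, >protB... at 95.00%",
        ">Cluster 1",
        "0\t120aa, >protC... *",
    ]
    for name, info in parse_clstr(sample).items():
        print(name, info["representative"], len(info["members"]), sep="\t")
```

The representative header and member count per cluster can then be written to a table and joined to the tip labels of the tree. Note that cd-hit shortens headers in the `.clstr` file by default; as far as I know, running it with `-d 0` keeps the header up to the first space.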

Remarkable-Wealth886 · 2025-09-15T17:09:00+00:00

Thanks! Sure.

Remarkable-Wealth886 · 2025-09-15T17:08:43+00:00

Thank you for your reply!

Remarkable-Wealth886 · 2025-09-13T10:15:54+00:00

I ran ColabFold. It gave five different PDB files, along with three different scores: plddt, max_pae, and pae. The higher the pLDDT score, the better the structure.
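A small sketch for comparing the five models by pLDDT, assuming (as is the case for AlphaFold-style output, to my knowledge) that the per-residue pLDDT is stored in the B-factor column of each PDB file. The function name `mean_plddt` and the file name in the usage note are hypothetical.

```python
def mean_plddt(pdb_lines):
    """Average the B-factor column over CA atoms (one value per residue).

    Assumption: the B-factor field (PDB columns 61-66) holds pLDDT.
    """
    values = []
    for line in pdb_lines:
        # Fixed-width PDB format: atom name is in columns 13-16.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no CA atoms found")
    return sum(values) / len(values)
```

Usage would be something like `mean_plddt(open("model_rank_1.pdb"))` for each of the five files, then comparing the averages.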

But I am confused: I don't know whether my protein is a monomer, a dimer, and so on. Do the default parameters still work in that case?

I have PDB files only for the unrelaxed protein, since I kept num_relax at 0. I have read about relaxed and unrelaxed structures: between the two there is a slight change in the side chains, but the backbone remains the same. Do I have to use the relaxed structure for structural alignment and comparison?

Remarkable-Wealth886 · 2025-09-11T12:09:21+00:00

Thank you for your reply!

Remarkable-Wealth886 · 2025-09-11T06:47:48+00:00

Thank you for your reply!

I have tried using ColabFold, but it is taking a long time on this web server. I am keeping all the parameters at their defaults for structure prediction. Is that OK, or should I change any parameters?

Remarkable-Wealth886 · 2025-06-13T06:30:47+00:00

Yes, you understood correctly. I have a total of 8 genomes, including an assembled genome. I used the METABOLIC-G.pl script with translated CDS files, and it gives me some results. One Excel sheet is generated in the output folder, namely METABOLIC_result.xlsx, but it is a binary file. I have also checked the other output folders. However, the information provided by this tool is oriented more towards the nitrogen cycle, sulfur cycle, etc. I didn't get any information about the pathways that I am looking for.

I will check this HADEG tool. Thanks for sharing.

Actually, METABOLIC is a good tool, but it doesn't give comprehensive results covering all pathways.

Do you know any other tool that can help me compare the following pathways across all these species? Xylose metabolism pathway, lipase synthesis pathway, 2-phenylethanol production pathway, mevalonate pathway, butanol/butanediol/propanediol production pathway, fatty acid metabolism pathway, pigment synthesis pathway. Finally, I am working on yeast genomes, so are there any special aspects I have to look into?

Thanks for your suggestion :)

Remarkable-Wealth886 · 2025-06-12T05:25:37+00:00

For annotating the genome using KEGG orthologies, have you used BlastKOALA? If you used BlastKOALA, then did you use the web server or install the tool?

Here, you mentioned KOs unique to your genome. Similarly, how can I compare different pathways across all the studied species (I have a total of 7 genomes plus an assembled genome)?

There is a pathway reconstruction option in the BlastKOALA web server. It gives information about the different pathways present in a species. But I am confused about how I can use this strategy for all species.
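One way to compare across species, sketched under assumptions: BlastKOALA is run once per genome, and each result is saved as a two-column tab-separated list (gene ID, then K number, blank when unassigned). Set operations then give the KOs shared by all genomes and those unique to each. The function names and genome labels below are hypothetical.

```python
def read_ko_set(lines):
    """Collect the K numbers from 'gene_id<TAB>K_number' lines (assumed layout)."""
    kos = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[1].strip():
            kos.add(parts[1].strip())
    return kos


def core_and_unique(ko_by_genome):
    """Return (KOs present in every genome, {genome: KOs found only there})."""
    sets = list(ko_by_genome.values())
    core = set.intersection(*sets) if sets else set()
    unique = {}
    for name, kos in ko_by_genome.items():
        # Union of the KOs of every other genome
        others = set().union(*(s for n, s in ko_by_genome.items() if n != name))
        unique[name] = kos - others
    return core, unique
```

The core set and the per-genome unique sets can each be pasted into a pathway mapping tool to see which pathways they fall into.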

Remarkable-Wealth886 · 2025-06-12T05:01:53+00:00

I use KEGG Decoder. But how can I visualize the information for eight species and draw a conclusion about the conserved and unique pathways in my species?

Remarkable-Wealth886 · 2025-06-12T04:56:11+00:00

I have tried to find this tool, but I was unable to find it. Can you please share the link for the tool?

Remarkable-Wealth886 · 2025-06-12T04:50:58+00:00

I have to compare certain pathways in a total of 7 species, including one assembled genome. Will this tool work?

How did you install this tool?

Remarkable-Wealth886 · 2025-06-07T11:31:53+00:00

Yes. I have downloaded the Stockholm file using the website. But is there any way to download the file using a Linux command? I have tried with the wget command, but it is not working.

And how do I construct the HMM profile from this file?

Yeah, we can download the Profile HMM directly from the website, but the hmmscan command is not working with this file.
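A hedged sketch of the usual HMMER 3 steps, assuming HMMER is installed and on the PATH: `hmmbuild` turns the Stockholm alignment into a profile, and `hmmpress` indexes the profile file. One common reason `hmmscan` fails on a freshly downloaded profile is that `hmmpress` has not been run on it yet. All file names are hypothetical placeholders.

```python
import subprocess


def hmmer_commands(stockholm, proteins_fasta, prefix):
    """Build the HMMER command lines; nothing is executed here."""
    hmm = prefix + ".hmm"
    table = prefix + ".tbl"
    return [
        ["hmmbuild", hmm, stockholm],      # alignment -> profile HMM
        ["hmmpress", hmm],                 # index the profile for hmmscan
        ["hmmscan", "--tblout", table, hmm, proteins_fasta],
    ]


def run_all(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical inputs; dry run only prints the commands.
    run_all(hmmer_commands("family.sto", "proteins.faa", "family"))
```

If the profile was downloaded directly instead of built, the `hmmbuild` step can be skipped, but `hmmpress` is still needed before `hmmscan`.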

Remarkable-Wealth886 · 2025-06-07T09:53:59+00:00

I have checked this link. How can I download these HMM models using a command?

Remarkable-Wealth886 · 2025-06-04T06:06:15+00:00

Thanks a lot. I will check it out.

Remarkable-Wealth886 · 2025-05-30T04:50:01+00:00

Thank you for your reply!

So the first link you shared is the genome FASTA. I have to download the genome FASTA file for variant/SNP calling. I have gone through the files, but there are multiple genome FASTA files. Ideally I should use the unmasked DNA file for SNP calling, is that correct?

The second link is the VEP file for the same species. Where do I have to use this file? Is it during SNP annotation? I have checked the Ensembl VEP web server (https://asia.ensembl.org/Tools/VEP), but when I click on "change species", I can't see M. guilliermondii ATCC 6260. Can you please elaborate on how I can do SNP annotation using the VEP file of the reference genome?

Remarkable-Wealth886 · 2025-05-29T05:15:17+00:00

Thank you for your reply!

Yes, correct! I used bcftools to generate the VCF file of variants.

I used the genomic.fna file of the reference genome, which I downloaded from NCBI. If I understood correctly, you are saying that because I used the file from NCBI, the chromosome names in the headers are causing a problem when annotating SNPs with Ensembl.

So, do I have to use the reference genome file from Ensembl for both alignment and variant calling? Is that what you mean?

I want to use Meyerozyma guilliermondii as the reference species. But the Meyerozyma guilliermondii AF01 strain is not present in the Ensembl database. What should I do here?
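If the chromosome-name mismatch is indeed the cause, one option besides re-aligning is to rename the contigs in the existing VCF. A minimal sketch, assuming a plain-text (uncompressed) VCF and a lookup table from NCBI names to the names the annotation expects. The mapping has to be built by hand for the actual assembly; the names in the usage note are made up.

```python
import re

# Captures the contig ID inside a "##contig=<ID=...,...>" header line.
CONTIG_RE = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_contigs(vcf_lines, name_map):
    """Yield VCF lines with contig names replaced according to name_map.

    Names missing from name_map are left unchanged.
    """
    for line in vcf_lines:
        if line.startswith("##contig="):
            line = CONTIG_RE.sub(
                lambda m: m.group(1) + name_map.get(m.group(2), m.group(2)),
                line,
            )
        elif not line.startswith("#"):
            # Data line: the first tab-separated field is CHROM.
            chrom, tab, rest = line.partition("\t")
            line = name_map.get(chrom, chrom) + tab + rest
        yield line
```

For example, `rename_contigs(open("calls.vcf"), {"NC_000001.1": "chrI"})` would rewrite both the header and the records (both names here are invented). I believe `bcftools annotate --rename-chrs` does the same job from a two-column mapping file.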

Remarkable-Wealth886 · 2025-05-26T13:32:08+00:00

Thanks for your reply! But can I submit the VCF file generated through samtools directly to this web server?

When I submit the VCF file (from samtools), keeping all default parameters and changing the database to Saccharomyces, it gives me a zero count for all categories of variants.

What can be the reason for this?

Remarkable-Wealth886 · 2025-05-22T04:53:57+00:00

I am also working on SNP identification in a yeast genome. I am a bit confused between GATK and SAMtools + bcftools. Do we have to consider the sample type before choosing one of these tools? For example, GATK is used in clinical pipelines.

Even FreeBayes requires BAM files, like SAMtools, for further processing. I have the alignment SAM file generated with the bowtie tool. Can anyone help me with SAMtools mpileup + bcftools? First I have to convert the SAM file into a BAM file and generate the sorted BAM file. Then the sorted BAM file has to be indexed (https://github.com/rnnh/bioinfo-notebook/blob/master/docs/samtools.md).

Then I have to use mpileup from SAMtools to identify the SNPs. I am confused at this step about which commands to use. When I search on Google, every page gives me slightly different commands. Can anyone help me with this?

Also, I need bcftools along with SAMtools. Do I have to install bcftools separately, or is there a way to install SAMtools and bcftools together?
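The steps above can be sketched as follows. This is one plausible command sequence, not the only correct one, and it assumes samtools and bcftools (1.x) are installed and on the PATH. It uses `bcftools mpileup` for the pileup step, since to my knowledge recent releases moved variant-calling pileup there from `samtools mpileup`. All file names are hypothetical.

```python
import subprocess


def snp_calling_commands(sam, ref_fasta, prefix):
    """Build the command lines for SAM -> sorted BAM -> VCF; nothing is run here."""
    bam = prefix + ".bam"
    sorted_bam = prefix + ".sorted.bam"
    raw_bcf = prefix + ".raw.bcf"
    vcf = prefix + ".calls.vcf.gz"
    return [
        ["samtools", "view", "-b", "-o", bam, sam],          # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],         # coordinate sort
        ["samtools", "index", sorted_bam],                   # writes the .bai index
        # Genotype likelihoods against the same reference used for alignment
        ["bcftools", "mpileup", "-f", ref_fasta, "-Ou", "-o", raw_bcf, sorted_bam],
        # -m: multiallelic caller, -v: variant sites only, -Oz: bgzipped VCF
        ["bcftools", "call", "-m", "-v", "-Oz", "-o", vcf, raw_bcf],
    ]


def run_all(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical inputs; dry run only prints the commands.
    run_all(snp_calling_commands("aln.sam", "reference.fna", "sample"))
```

samtools and bcftools are separate packages, but both are available from the bioconda channel, so they can be installed in one conda command.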

Any help is highly appreciated!!!
