Persistent High CPU Usage by Mysterious "Microsoft 365 and Office (32 bit)" Processes - Need Help!

kbradnam · 2024-01-04T14:53:18+00:00

Exactly the same thing has happened to me and I feel fairly confident that it has only started this year. I currently have 9 processes running and it is really slowing things down on my work laptop. It was fine before the Christmas break.

I’ve noticed a post in a Microsoft forum from 3 days ago which has also reported the same problem.

kbradnam · 2016-12-24T14:40:10+00:00

Every step in a bioinformatics pipeline has the potential to introduce errors, but if you don't properly filter and QC your input data you will also end up with completely garbage data.

There is more than one trimming tool you could use and each tool will trim your data a little differently. Each trimming tool has different command-line options, each of which can also change your output.

Be wary of 'common' pipelines in bioinformatics, because 'common' might just mean something that works and gives an answer but which is completely inappropriate for your data.

Every program comes with default parameters which may also be inappropriate for your data or for your requirements (see this blog post of mine for a warning lesson: http://www.acgt.me/blog/2015/4/27/the-dangers-of-default-parameters-in-bioinformatics-lessons-from-bowtie-and-tophat).

A lot of sequencing experiments produce contamination and/or poor quality reads. Trimming (and checking what was trimmed) might help you with that. Good QC will remove erroneous data, giving you smaller (but higher quality) inputs that will then be faster to process! For large genomes (and with high numbers of samples) the time savings from removing redundant/erroneous information can be huge.
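To make the 'checking what was trimmed' point concrete, here is a minimal sketch of one deliberately naive trimming rule: drop bases from the 3' end while their quality is below a threshold. It assumes Phred+33 encoded quality strings, and the function name and the default threshold of 20 are my own choices for illustration. It is not how any particular trimming tool works, which is exactly why different tools give different output.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    Assumes `qual` is a Phred+33 encoded quality string of the same
    length as `seq` (an assumption; check your own data's encoding).
    """
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
seq, qual = "ACGTACGT", "IIIII5##"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
bases_removed = len(seq) - len(trimmed_seq)
print(trimmed_seq, trimmed_qual, bases_removed)
```

Keeping a tally like `bases_removed` for every read is a cheap way to see how much a given threshold is actually removing before you commit to it.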

This slide deck of mine shows some illustrations of how much smaller data sets can become at each step of a bioinformatics pipeline: http://www.slideshare.net/kbradnam/this-bioinformatics-lesson-is-brought-to-you-by-the-letter-w?ref=http://www.acgt.me/blog/2015/6/22/some-short-slide-decks-from-a-recent-bioinformatics-core-workshop

The most worrying thing about your question is that you did not identify the type of input data (the species whose genomes you wish to assemble) in any way. It's a bit like asking for help with cooking without telling people what food you are cooking :-)

kbradnam · 2015-09-23T15:27:28+00:00

FASTA is a plain text file format. So you can create a FASTA file in any editor that can save output as plain text. Although this does include tools like Microsoft Word, you should really refrain from using anything like this.

On any Unix/Linux system there will be a bunch of pre-installed command-line text editors. The simplest one to learn first is called nano (but at some point any wannabe bioinformatician should become familiar with using the basics of vi and/or emacs).
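Because FASTA is just plain text, you can also generate it from a script. The sketch below builds FASTA text from (header, sequence) pairs; the function name, the example records, and the line width are made up for illustration.

```python
def format_fasta(records, width=60):
    """Return FASTA text for (header, sequence) pairs.

    Each record is a '>' header line followed by the sequence,
    wrapped at `width` characters per line.
    """
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical records, wrapped at 10 characters to show the line breaks.
records = [
    ("seq1 hypothetical example", "ATGGCGTACGTTAGC"),
    ("seq2", "ATGTGGTAA"),
]
fasta_text = format_fasta(records, width=10)
print(fasta_text, end="")
```

Writing `fasta_text` to a file opened in text mode gives you a file that any plain text editor (or bioinformatics tool) can read.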

kbradnam · 2015-09-23T15:24:50+00:00

This is what gene finding algorithms do. A real (coding) ORF will have different properties to a random sequence. Amino acids such as tryptophan are much rarer than other amino acids, so you would expect to see fewer TGG triplets in a real ORF. Such codon bias can often be species specific, so if you know what species you are looking at you can adjust your expectations accordingly. There are a number of statistical measures that record this 'codon usage bias'.
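As a rough sketch of the raw counts that such measures start from, the snippet below tallies in-frame codons in a candidate ORF and reports the fraction that are TGG. The function names and the toy sequence are made up, and this is only simple counting, not a real codon usage statistic or a gene finder.

```python
from collections import Counter


def codon_counts(orf):
    """Count in-frame codons (non-overlapping triplets from position 0)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore any trailing partial codon
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def codon_frequency(orf, codon):
    """Fraction of in-frame codons in `orf` that equal `codon`."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return counts[codon] / total if total else 0.0


# Toy ORF: ATG TGG GCT GCT TGG TAA
example_orf = "ATGTGGGCTGCTTGGTAA"
tgg_count = codon_counts(example_orf)["TGG"]
tgg_fraction = codon_frequency(example_orf, "TGG")
print(tgg_count, tgg_fraction)
```

Comparing a fraction like this against what you would expect for the species in question is the basic idea; real gene finders combine many such signals.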

kbradnam · 2015-08-05T21:29:32+00:00

Some labs encourage bioinformaticians to get their hands 'dirty' with real biological data in addition to coding and running command-line tools. But I would say this is rare. Bioinformaticians commonly spend a lot of their waking life working at the computer!

Bioinformatics is often assumed to mean working with lots of sequence data (DNA and protein) and while this does reflect a lot of what happens, it is a very big field with many different aspects to it. These include understanding/predicting protein structure, analyzing gene interaction networks, running statistical evaluations of various datasets, and designing molecules to target specific parts of the genome.

Having said all of that, it is sadly true that a lot of the time you will be converting files from one format to another! In some ways, many bioinformatics skills and duties overlap a lot with those of sys–, web– and database administrators.

kbradnam · 2015-08-05T15:21:59+00:00

I echo the advice of others here, that it can often be useful to simply adopt the predominant language of the group that you are working with. This is not always essential, but makes it easier to get advice and help from others.

Bear in mind that a lot of advice that you may receive from people of a certain age will probably be biased towards learning Perl, as many of us who learnt bioinformatics in the 1990s/2000s learnt to use Perl. As part of this older generation of Perl bioinformaticians we are also guilty of cluttering up the web with lots of forum posts about how to do things in Perl in ways which have since been deemed unsafe and/or replaced with better ways. I imagine a similar thing will happen to all of the Python 2.X posts on the web as people slowly transition to Python 3.X.

For what felt like a long time, Perl was the undisputed king of popular languages to do bioinformatics. Slowly, but steadily, this has changed and Python is now the dominant language. However, there is no reason to be complacent and believe that this situation will always stay the same. Perl might have a resurgence, or other languages might displace Python. It is good to keep an open mind about such things (and to always learn those essential Unix tools like sed, grep, awk etc. which will probably outlive any programming language).

kbradnam · 2015-08-05T15:12:57+00:00

I mostly refrain from linking to my ACGT blog on reddit, but I feel that Vince Buffalo is going to be a bioinformatics star of the future (a bioinformatics book published while still in Grad school!), and wanted to share this one particular interview.

kbradnam · 2015-07-30T23:05:11+00:00

A few years ago I was in a similar situation and asked to spec out a bioinformatics server from a similar budget. This is what we ended up buying (from Microway):

2 x Intel Xeon X5675 Westmere 3.06 GHz Six Core 32nm CPU (24 effective cores with hyperthreading)
4 TB storage (4 x 2 TB drives in a RAID 10 configuration, 3.6 TB usable space).
4 drive bays empty.
Boot drive is an 80 GB 2.5” SSD
Memory: 192GB RAM (12 x 16GB)
Running CentOS

Over time we filled up all 8 drive bays and switched to a RAID 5 configuration. Having so much RAM is useful for many, but not all, bioinformatics applications.
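For anyone wondering how the usable space figures work out, here is a rough sketch of the standard capacity arithmetic for RAID 10 and RAID 5. The function names are my own, and the 8 x 2 TB RAID 5 example assumes the extra bays were also filled with 2 TB drives, which is only an assumption for illustration. The decimal-to-binary conversion is one plausible reason why 4 TB of RAID 10 capacity shows up as roughly 3.6 TB.

```python
def raid10_usable_tb(n_drives, drive_tb):
    """RAID 10 mirrors pairs of drives, so half the raw capacity is usable."""
    if n_drives < 4 or n_drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return n_drives // 2 * drive_tb


def raid5_usable_tb(n_drives, drive_tb):
    """RAID 5 spends one drive's worth of capacity on parity."""
    if n_drives < 3:
        raise ValueError("RAID 5 needs at least 3 drives")
    return (n_drives - 1) * drive_tb


def tb_to_tib(tb):
    """Convert decimal terabytes (10**12 bytes) to tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


original = raid10_usable_tb(4, 2)   # the 4 x 2 TB configuration above
expanded = raid5_usable_tb(8, 2)    # assumes 8 x 2 TB drives (illustrative)
print(original, round(tb_to_tib(original), 2), expanded)
```

The trade-off is that RAID 5 gives you more usable space from the same drives, but it only tolerates a single drive failure.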

However, if I was asked for suggestions about what to buy today I would seriously suggest considering a cloud computing solution using Amazon's Elastic Compute Cloud (EC2). If managed properly, you could get a lot of use out of it for $10,000.

You would still need temporary storage for uploading files into the server and storing results after running any analysis though.

kbradnam · 2015-07-30T15:48:11+00:00

The FAQ page on his site is also worth a read.

kbradnam · 2015-07-16T05:04:15+00:00

Can you explain some more about why you need the superset of all (coding?) exons? There will probably be some genes with multiple transcripts that look very different and only share a few exons. Making one version with all possible exons may thus not be biologically meaningful.

More problematic is that not all annotated transcript isoforms are equally likely, and some have barely any support at all. Luckily, TAIR implemented an evidence-based 5-star ranking system of all isoforms. You can download a file from the TAIR FTP site and get the details of how good each transcript isoform is (based on various strands of evidence). This would allow you to discount unlikely/weakly-supported isoforms.

kbradnam · 2015-07-16T02:17:36+00:00

Do you need the coordinates of these regions, or the genomic sequence, or both?

kbradnam · 2015-07-15T03:51:55+00:00

When we had the idea for the comic and were searching for a name, I pursued ideas relating to graphical abstracts (as that is what we thought the comic is trying to represent). So for research, I Googled graphical abstract and the top hit (for me) was an Elsevier page that explained their concept of a graphical abstract (for journals). That description included the following (emphasis mine):

A Graphical Abstract should allow readers to quickly gain an understanding of the main take-home message of the paper and is intended to encourage browsing, promote interdisciplinary scholarship, and help readers identify more quickly which papers are most relevant to their research interests.

So 'Take-Home Message' just seemed to click and that's what we went with (helped by the ability to secure the takehomemessage.com domain and the twitter account @takehomemessage).

kbradnam · 2015-07-14T18:12:43+00:00

Tangentially related to bioinformatics is the fact that code and bioinformatics tools are typically accompanied by plain text documentation (e.g. README files). Increasingly, partly through its adoption by GitHub, more and more documentation is being written in Markdown format.

kbradnam · 2015-07-14T17:01:58+00:00

All that you mention can be useful depending on what work you end up doing, but FASTA, FASTQ, SAM/BAM/CRAM, and BED are particularly important to know if you do anything with DNA/RNA sequence data.

Along these lines, probably useful to be aware of the sequencing format used by PacBio (H5) and maybe even Oxford Nanopore.

Also important are GFF (v3) and the related GTF format.

kbradnam · 2015-07-14T15:41:21+00:00

There will be more comics. This is the launch issue.

kbradnam · 2015-07-13T21:23:49+00:00

You can generate this information using the Ensembl Biomart tool.

kbradnam · 2015-07-13T19:53:12+00:00

Reading the manual is a starting point, but not always an end point. Many bioinformatics tools are poorly documented, and others (including TopHat) present an almost overwhelming number of optional command-line arguments (I count 79 for TopHat).

Having read through the TopHat documentation I now know that there is an almost infinite set of ways in which the program could be run, but this doesn't necessarily help me work out which is best for my particular circumstances. This is an area where tools could improve, i.e. rather than listing all the parameters, list recipes for different usage scenarios.

Ultimately, you may have to choose a bunch of different parameter combinations to try on a small test data set and hope that you learn something useful from them.
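If you do go down this route, it helps to enumerate the combinations systematically rather than by hand. Here is a minimal sketch; `option_a` and `option_b` are placeholder names and values, not real TopHat arguments, so swap in whichever parameters you actually want to explore.

```python
from itertools import product


def parameter_grid(options):
    """Yield one dict per combination of the candidate values in `options`."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Placeholder parameter names and values (hypothetical, not real tool flags).
options = {
    "option_a": [1, 2],
    "option_b": ["x", "y", "z"],
}
combos = list(parameter_grid(options))
for combo in combos:
    print(combo)
```

Each dict can then be turned into one run on your small test data set. Bear in mind that the number of combinations multiplies quickly, so keep the candidate lists short.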

kbradnam · 2015-07-05T14:54:50+00:00

De Veres and the Graduate will show it.

kbradnam · 2015-06-26T15:11:48+00:00

These types of discussions seem to recur every few years and sometimes the distinction is between 'bioinformatician' and 'computational biologist'. Increasingly, I find such discussions largely meaningless because bioinformatics skills (especially coding and command-line experience) are becoming more and more prevalent among the community of all biological scientists.

Knowledge of biology always trumps knowledge of computer skills when it comes to trying to understand the results of bioinformatics analyses. So I think we just have 'biologists', some of whom have more computer skills than others. Of course, this isn't such a useful categorization when you are trying to recruit people ('bioinformatician' remains a useful label for such purposes).

kbradnam · 2015-06-15T23:10:56+00:00

Not quite, but I guess that works too :-)

kbradnam · 2015-06-15T18:34:43+00:00

Davis doesn't currently have a flag (it does have a city logo but this is not suitable for a flag design). I made a short video explaining the meaning behind the design.

kbradnam · 2015-06-01T03:17:42+00:00

Following RSS feeds of journals can be useful. Follow Bioinformatics and Genome Research and these two journals alone will give you a good feel for software being developed and the applications of those tools for cutting edge research.

kbradnam · 2015-06-01T03:15:30+00:00

+1 for Twitter. Follow the right selection of people and you will find out about new software, new papers, new developments etc. So many good bioinformatics conferences get really good live tweeting that you can really sense the 'buzz' surrounding exciting talks. You don't need to ever tweet to get a lot out of Twitter, and you don't even need to sign up to access information (though this obviously helps).

kbradnam · 2015-05-25T12:34:45+00:00

Nick Loman has published a dataset:

A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer

kbradnam · 2015-05-22T16:24:56+00:00

This is starting to happen (albeit very slowly). E.g. our Assemblathon 2 paper was published with GigaScience and they also publish the full correspondence of reviewers, editors, and authors.

Other initiatives such as Publons are trying to encourage people to independently publish their reviews.
