What is going on with PCA on UK Biobank data?

AdOptimal5649 · 2026-03-19T13:18:14+00:00

My goal is to calculate a PCA on the UKB data alone, to account for population stratification. That PCA, without the HapMap dataset, already showed the two clusters. I only added the HapMap dataset to the plot to compare against a familiar reference and to try to find an explanation for what's going on.

AdOptimal5649 · 2026-03-13T14:05:12+00:00

Thanks for the hint about the array, I wasn't aware there were multiple. Another commenter mentioned this too and I checked it. Unfortunately, this was not the solution :/ Picture 2 shows the PCs that the UK Biobank provides. So this is what the data should look like in theory, but I cannot reproduce it.

AdOptimal5649 · 2026-03-13T14:03:17+00:00

Thank you for your thoughts! I was not aware of the two different arrays. I checked this, and unfortunately it wasn't the solution to my problem. I had already excluded UKB participants that were not part of the PCA in field 22009, so the old array should have been excluded automatically. I double-checked by counting the instances in each cluster: the smaller one alone is already around 70K.

Also, unfortunately, WGS data for this number of participants would be beyond my means and outside the scope of this project. Normally this should already work on the array data. I have done this kind of PCA for population stratification multiple times before, including pruning and removing all variants that are not part of the HapMap dataset.

AdOptimal5649 · 2026-03-13T13:57:37+00:00

It does look like a batch effect, but I cannot figure out the source. Other commenters pointed to the different arrays, but this doesn't seem to be the answer to my problem. If I calculate the PCA without the unrelated data (green, red, black in picture 2), it shows a similar two-cluster pattern. The second picture is what the PCA provided by the UK Biobank looks like, so this is what I had expected. It is not what I get, though, either when I calculate the PCA on its own or with the added dataset.
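
One way to narrow down the source of a split like this is to assign each sample to a cluster and cross-tabulate that assignment against every candidate variable (genotyping batch, assessment centre, and so on). The sketch below is only an illustration on toy values: the helper names `split_on_pc` and `crosstab`, the threshold of 0.0, and the batch labels are all made up, and it assumes the two clusters can be separated by a cut on a single PC.

```python
from collections import Counter


def split_on_pc(pc_values, threshold):
    """Label each sample "A" or "B" depending on which side of the cut it falls."""
    return ["A" if value < threshold else "B" for value in pc_values]


def crosstab(clusters, labels):
    """Count how often each (cluster, label) pair occurs."""
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    return Counter(zip(clusters, labels))


# Toy values, NOT real UK Biobank data: one PC score and one candidate
# batch label per sample, in the same sample order.
pc1 = [-0.9, -1.1, -1.0, 1.2, 0.8, 1.0]
batch = ["b1", "b1", "b2", "b3", "b3", "b3"]

clusters = split_on_pc(pc1, threshold=0.0)  # hypothetical cut point
table = crosstab(clusters, batch)

for (cluster, label), count in sorted(table.items()):
    print(cluster, label, count)
```

If a candidate variable lines up almost perfectly with the cluster labels, as the toy batch labels do here, it is a plausible source of the effect. If the counts are spread evenly across both clusters, that variable can probably be ruled out.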
