Sunday, March 30, 2014

Eurogenes Genetic Ancestry Detective

Ever since starting the project I've received regular requests from personal genomics clients to investigate their genotype data in more detail and help solve puzzles about their ancestry. I've now decided to make this a more formal and structured process and charge money for it. This will enable me to spend more time and resources on such requests and thus produce higher quality results. It'll also help me to keep updating and improving the free tools offered as part of Eurogenes.

The service is aimed at both beginner and experienced personal genomics clients:

- people who were recently genotyped and are having a hard time making sense of their ancestry results, including from third party providers, will receive a straightforward but detailed brief about what their genotype data says about their origins, in the context of both recent and ancient ancestry. This will include a best fit prediction of ethnic origins and geographic coordinates for the genome.

- advanced users of third party ancestry tools can request a more detailed analysis of a specific issue, such as confirming minor Sub-Saharan or Ashkenazi admixture.

The price is USD $90. Please note, I'm solely focusing on ancestry and staying away from traits and medical genetics. The genotype data will not be shared with anyone and will be deleted after analysis.

Please first make contact at dpwes [at] hotmail [dot] com before sending data and money. I'm accepting 23andMe, FTDNA, Ancestry and Geno 2.0 genome-wide (aka. autosomal) files.

Wednesday, March 19, 2014

Updated Eurogenes K13 and K15 population averages

I just sent off a new population averages spreadsheet for the popular Eurogenes K13 test to GEDmatch. When added to the analysis, the updated data should result in more fine-grained and accurate ancestry predictions for many people, especially Northwest, East and South Europeans. Here's the spreadsheet:

Eurogenes K13 population averages

The Fst (genetic) distances between the K13 ancestral populations can be seen here. Below are a couple of cluster trees of West Eurasian populations based on the data. Note also the presence of two ancient genomes, La Brana-1 and MA-1, among the modern samples. I'm actually very impressed by how accurate their placements are on the trees.

MA-1, the 24,000 year-old Siberian, is sitting between Eastern Europe and the North Caucasus, which is exactly where he was on a PCA in Lazaridis et al. (see here). On the other hand, La Brana-1, the 7,000 year-old Iberian, is showing strong affinity to circum-Baltic groups, like southwest Finns and northern Swedes, which correlates well with the results from Olalde et al. (see here and here).

Update 22/03/2014: Here's an MDS of Eurasia and the Americas, courtesy of one of my project members. It's actually based on population averages from the Eurogenes K15 test, which I've also just updated (see spreadsheet here). I haven't yet sent these off to GEDmatch, but will when the Ad-Mix tools are back online. Note that this MDS also features the other two ancient genomes published recently, Anzick-1 and Saqqaq.


Saturday, December 28, 2013

EEF-WHG-ANE test for Europeans

This is a test that attempts to fit you to the three inferred prehistoric European populations as described in this recent preprint. The relevant Excel file can be downloaded here, and all you have to do is stick your Eurogenes K13 results into the fields provided to get the EEF-WHG-ANE ancestry proportions. A modified version for Near Eastern and Southeast European users can be accessed here.

The test is based on correlations between the average levels of the Eurogenes K13 and the ancient components among selected European populations (see here). Below is a brief description of each of the ancient components.

Early European Farmer (EEF): apparently this is a hybrid component, the result of mixture between "Basal Eurasians" and a WHG-like population possibly from the Balkans. It's based on a 7,500 year old Linearbandkeramik (LBK) sample from Stuttgart, Germany, but today peaks at just over 80% among Sardinians.

West European Hunter-Gatherer (WHG): this ancestral component is based on an 8,000 year old forager from the Loschbour rock shelter in Luxembourg, who belonged to Y-chromosome haplogroup I2a1b. However, today the WHG component peaks among Estonians and Lithuanians, in the East Baltic region, at almost 50%.

Ancient North Eurasian (ANE): this is the twist in the tale, a component based on a 24,000 year old Upper Paleolithic forager from South Central Siberia, belonging to Y-DNA R*, and known as Mal'ta boy or MA-1. This component was very likely present in Southern Scandinavia since at least the Mesolithic, but only seems to have reached Western Europe after the Neolithic. At some point it also spread into the Americas. In Europe today it peaks among Estonians at just over 18%, and, intriguingly, reaches a similar level among Scots. However, the paper didn't give numbers for Finns, Russians and Mordovians. According to one of the maps they also carry very high ANE, but their results are confounded by more recent Siberian (ENA) admixture.

It's important to note that this test is only likely to be accurate for people of European ancestry, and indeed only those who aren't outliers from the main European clines of genetic diversity. For details of what that means, please consult the aforementioned paper. However, roughly speaking, if you're of European origin and don't score more than 3% East Asian, Siberian, Amerindian, South Asian, Oceanian, Northeast African and/or Sub-Saharan admixture, then you should get a coherent result. Users from the Near East and Caucasus should run the version specifically designed for them, while those from Southeastern Europe might find it useful to run both calculators and then compare the results.
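As a rough illustration of the rule of thumb above, here's a minimal Python sketch that checks whether a set of K13 scores stays within the 3% limit. It assumes the limit applies to the combined total of the listed components, and the component labels are simply the names used in this post, so they may need adjusting to match the labels in an actual K13 result. The function name is made up for the example.

```python
# Labels follow the list in the post; the exact spelling used in a real
# K13 result is an assumption and may need adjusting.
NON_EUROPEAN = (
    "East Asian", "Siberian", "Amerindian", "South Asian",
    "Oceanian", "Northeast African", "Sub-Saharan",
)


def likely_coherent(k13_scores, limit=3.0):
    """Return True if the combined non-European score is at most `limit` percent.

    `k13_scores` maps component names to percentages. Components that are
    missing from the mapping are treated as 0.
    """
    exotic = sum(k13_scores.get(name, 0.0) for name in NON_EUROPEAN)
    return exotic <= limit
```

If the check fails, the EEF-WHG-ANE proportions may still be worth a look, but they shouldn't be taken at face value.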

Thanks to project member DESUK1 for putting this together at such short notice, and MfA for the modified version. Please post your results in the comments section below and state your ancestry when you do. This will help us to improve the accuracy of the test. My results make perfect sense, considering my Polish ancestry, relative to those of the reference samples (see here and here).

EEF 42.012706
WHG 40.52702615
ANE 17.46026785

Below is a PCA courtesy of project member PL16, based on the EEF-WHG-ANE test results for selected populations. The positions of the ancestral EEF, WHG and ANE groups reflect the PCA loadings (see here).

This is my interpretation of who these components represent. Of course, this model might change when more ancient genomes are analyzed.

WHG and WHG/ANE: indigenous European hunter-gatherers
EEF: mixed European/Near Eastern Neolithic farmers
ANE/WHG: Proto-Indo-European invaders from the Eastern European steppe
ENA/ANE: early Uralics from the Volga-Ural region
EEF/WHG/ANE: late Indo-Europeans (ie. Celts, Germanics and Slavs)


Thursday, November 21, 2013

Updated Eurogenes K13 at GEDmatch

The old Eurogenes K13 has been replaced by a new model with different, and hopefully more robust, ancestral clusters. The new version also includes Oracles as well as 2D and 3D Principal Component Analyses (PCA). The K13 population averages and genetic (Fst) distances between the inferred ancestral clusters are available here and here, respectively.

GEDmatch > Ad-Mix Utilities > Eurogenes > K13

Below is a 2D PCA based on the average K13 results of the European and Asian reference populations, courtesy of project member PL16.

Thus, Eurogenes now has four tests at GEDmatch with Oracles: the Jtest, EUtest, EUtest V2 and the K13. It's useful to keep in mind that these tests will differ in their interpretation of the data, and perhaps accuracy, depending on the ancestry of the user. For instance, the new K13 should be more useful for Central and South Asians than any of the others, because it features new reference samples from these regions.

Monday, October 7, 2013

Eurogenes K15 now at GEDmatch

This new test is essentially an upgraded version of the EUtest. Unlike the original, it includes an Amerindian component and five native reference populations from North and Central America. So obviously it should be a lot more useful for users from the New World who are wondering about Amerindian admixture.

GEDmatch > Ad-Mix Utilities > Eurogenes > Eurogenes EUtestV2 K15

I just tried it myself, and have to say that the 4-Ancestors Oracle results were impressive. In other words, they were very accurate based on what I know about my recent ancestry. On the other hand, I'd say the default Oracle was picking up more ancient gene flows. However, this might not be the case for everyone, so let's hear some feedback, discuss the outcomes, and perhaps tweak the settings if necessary.

One of the most important things to keep in mind is to ignore all results under 1%. These are likely to be noise.
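For anyone keeping their results in a script or spreadsheet export, the 1% rule can be applied with something like the sketch below. The function name and the component labels in the usage line are placeholders, not actual K15 output.

```python
def drop_noise(scores, threshold=1.0):
    """Keep only the components scoring at least `threshold` percent.

    `scores` maps component names to percentages. Nothing is rescaled:
    the remaining values are returned exactly as they were reported.
    """
    return {name: value for name, value in scores.items() if value >= threshold}


# Placeholder labels and values, for illustration only.
filtered = drop_noise({"Component_A": 55.2, "Component_B": 1.0, "Component_C": 0.4})
```

The remaining scores are deliberately not renormalized to 100%, because the point is just to stop reading anything into the tiny values.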

Here are the population averages and Fst distances between the ancestral components. Below are gradient maps of the main West Eurasian components courtesy of Gui (FR7): Baltic, North Sea, Atlantic, East Euro, West Med, East Med, West Asian.

Sunday, August 25, 2013

Locating and characterizing minor exotic admixture on the X chromosome

I've recently been looking at ways to incorporate X chromosome data into my ancestry tests and experiments. Below I describe an analysis of the X chromosomes of two samples from the 1000 Genomes Project.

The X chromosome is not like the 22 autosomal chromosomes. For instance, males only carry one copy, which means it can be a challenging source of markers for some analyses. I overcame this problem by using only male or only female X chromosomes, and by creating additional female-like X chromosomes, each made by combining two male chromosomes from the same population.
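The idea of pairing two male X chromosomes into one female-like sample can be sketched in Python as follows. This is only an illustration of the concept, not the actual pipeline used here: the function name is made up, the inputs are assumed to be lists of single alleles in the same marker order, and missing calls are assumed to be coded as "0".

```python
def make_pseudo_female(male_x_a, male_x_b):
    """Pair two haploid male X chromosomes into one diploid-style genotype list.

    Each input is a sequence of single alleles (e.g. "A", "G"), one per
    marker, with both sequences in the same marker order. Missing calls
    are assumed to be coded as "0" (an assumption, not a given).
    """
    if len(male_x_a) != len(male_x_b):
        raise ValueError("both chromosomes must cover the same markers")

    genotypes = []
    for a, b in zip(male_x_a, male_x_b):
        if a == "0" or b == "0":
            # If either male lacks a call, treat the whole genotype as missing.
            genotypes.append("00")
        else:
            # The result is unphased, so store the two alleles in sorted order.
            genotypes.append("".join(sorted((a, b))))
    return genotypes
```

Both males should come from the same population, otherwise the combined sample would look like a first-generation mix of two populations.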

I ran a diagnostic global PCA (see here) to find outlier, and thus presumably admixed, X chromosomes in order to test them further. There were quite a few interesting results, including two from Kent, England. They're shown on the global PCA linked to above, and the one below, as Kent_HG00141 and Kent_HG01791. On both plots they're drifting slightly towards East Asia.

A supervised local ancestry test of these English X chromosomes with a Support Vector Machine algorithm shows that they carry several putative East and South Asian segments. Some of the segments are very small, and might well be false positives, but overall the results are in agreement with the PCAs and indicate minor East Asian admixture.

Running PCA on the markers from within some of the segments confirms the result for HG00141. But in this instance the outcome isn't as conclusive for HG01791. I ran two PCAs for each sample, using somewhat different reference samples and numbers of markers. The base pair positions of the markers are listed here.

The finding of likely East Asian admixture in two samples from Kent is an unusual outcome, but not an improbable one. It's possible these individuals have distant East or Southeast Asian ancestors, perhaps from Burma, Hong Kong, or other former British colonies.

The West Asian and Middle Eastern segments on the X chromosome paintings above can be safely ignored. Such segments often show up in European samples in local ancestry tests due to low Fst (genetic) distances between all populations native to West Eurasia. However, the general structure of the two chromosomes does reflect their global and West Eurasian PCA results. For instance, HG00141 has a typically Northwestern European X, while HG01791 has a more Mediterranean one. This is evident on the global PCA above and the intra-West Eurasian PCA below.

I also analyzed the 22 autosomes of both individuals. This showed them to be overwhelmingly of English origin and, interestingly, didn't reveal any Asian admixture.

Saturday, March 9, 2013

Eurogenes K36 at GEDmatch

I've just put together a new test for GEDmatch called the Eurogenes K36. Obviously, the K36 means that it features thirty-six ancestral clusters. It probably won't include any Oracles, mostly because the Calculator Effect would render these useless if they were based on the average results of the reference samples (see the sheet here for details), and it'd be very time consuming for me to test a wide variety of other samples in supervised mode using thirty-six sets of allele frequencies.

The main purpose of the Eurogenes K36 is to help users unravel the ethnic origins of local areas of their genomes (aka. half-segments), hence the high number of ancestral categories, some of which are very specific. In other words, the test is mainly a chromosome painting utility. It's accessible via the GEDmatch Ad-Mix link below:

GEDmatch > Ad-Mix page > Eurogenes > Eurogenes K36

An important point to keep in mind is not to take the ancestry proportions too literally. If you're, say, English, and you get an Iberian score of 12% this doesn't actually mean you have recent ancestry from Spain or Portugal. What it means is that 12% of your alleles look typical of the reference samples classified as Iberian, and this figure might only indicate recent Iberian admixture if it's clearly higher than those of other English users.

Another way to look at it is that the ancestry proportions are like map coordinates, and they'll place you with a very high degree of accuracy on a genetic map featuring other users. Indeed, please feel free to post your scores and ancestry details in the comments below to help others get an idea of what their results might represent. My results are listed below. The scores put me squarely in Poland relative to those of other European samples I've run, which is correct.

Also worth mentioning is that this test focuses on much deeper ancestry than the Ancestry Composition at 23andMe. Hence, I expect that many Europeans will score a few percent in non-European clusters. However, like many ADMIXTURE results, this could give us strong hints about population movements into Europe during prehistory and early history, so it's worth keeping an eye on.
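To make the map coordinates analogy from earlier in this post more concrete, here's a small Python sketch that treats each set of ancestry proportions as a point and ranks reference populations by straight-line (Euclidean) distance from a user. Euclidean distance is just one reasonable choice for the illustration, and the population names, component names and numbers below are invented toy values, not K36 output.

```python
import math


def distance(a, b):
    """Euclidean distance between two score dictionaries (percentages).

    A component missing from either dictionary counts as 0.
    """
    components = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in components))


def rank_populations(user_scores, population_averages):
    """Return (distance, population) pairs, closest population first."""
    return sorted(
        (distance(user_scores, average), name)
        for name, average in population_averages.items()
    )


# Toy values for illustration only; these are not real population averages.
population_averages = {
    "Pop_A": {"C1": 50.0, "C2": 30.0, "C3": 20.0},
    "Pop_B": {"C1": 20.0, "C2": 30.0, "C3": 50.0},
}
user = {"C1": 47.0, "C2": 34.0, "C3": 19.0}

for dist, name in rank_populations(user, population_averages):
    print(f"{name}: {dist:.2f}")
```

With these toy numbers the user lands much closer to Pop_A than to Pop_B, which is the sense in which the scores work like coordinates: the individual percentages matter less than where the whole set of them places you relative to other people.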