If you like making mitochondrial haplotype networks, you’ll also want to know how to make similar figures using multi-locus genotypes.

The most popular blog post I’ve ever written, by far, is the one on how to make haplotype network figures in R.

I’m not sure why this particular post receives so much traffic. But maybe I can recreate some of that success.

Today I am going to show you how to make this figure using SNP polymorphisms:

msn-page-001

 

If you like the aesthetics of haplotype networks, and want to make something like that for your SNP data I’ll show you how to do it.

But instead of a network, we’re going to make a minimum spanning tree, which just shows the shortest connections between genotypes (no reticulations). And we’ll use Euclidean distances between multi-locus genotypes as edges for the tree.

And just so you know, I’m sure you could do this for microsatellite loci too. You can get a genotype distance matrix for any multi-locus marker type.

To demonstrate, I’ll use some red snapper data I have going spare. These data look something like this:

1177     1    1 4 2 1 2 2 2 3
1177     1    1 4 2 1 1 2 2 3
1181     1    3 4 2 4 1 2 3 -9
1181     1    1 4 2 1 1 2 2 -9

This is structure format (.str). The four-digit numbers in the first column are individual sample IDs. The next value, 1, is the population factor. And the numbers after than are nucleotide calles, A G T C, respectively… I think.

Also notice that there is no row of marker names. That is intentional. I deleted it. The data load better without it. Also, this particular structure file has genotypes for each individual in two rows.

You can convert your data to this format from something else using a program like PGD Spider.

Now, lets load the data into R.

> library(adegenet)

> read.structure("RedSnapper.str") -> reds

> reds
/// GENIND OBJECT /////////

// 218 individuals; 171 loci; 342 alleles; size: 384.6 Kb

// Basic content
@tab: 218 x 342 matrix of allele counts
@loc.n.all: number of alleles per locus (range: 2-2)
@loc.fac: locus factor for the 342 columns of @tab
@all.names: list of allele names for each locus
@ploidy: ploidy of each individual (range: 2-2)
@type: codom
@call: read.structure(file = "/comb_outl.str")

// Optional content
@pop: population of each individual (group size range: 47-59)

So, I have 218 individual red snapper in this data set, and 171 loci. There are four populations, numbered 1-4 with 47-59 snappers in each group.

Now we’ll convert this to a genLight obeject and use adegenet to perform a simple principal components analysis of our genotypes, and keep the first two axes.

> reds2 <- genind2df(reds, pop = NULL, sep = "", usepop = TRUE, oneColPerAll = FALSE)

> reds2 <- as.genlight(reds2[,2:length(names(reds))])
> reds2
/// GENLIGHT OBJECT /////////

// 218 genotypes, 171 binary SNPs, size: 7 Mb
3516 (9.43 %) missing data

// Basic content
@gen: list of 218 SNPbin

// Optional content
@ind.names: 218 individual labels
@loc.names: 171 locus labels
@other: a list containing: elements without names

snap.pca <- glPca(reds)
Select the number of axes: 2

We do this because the first two axes of PCA will contain most of the population-level genetic variation in the data. You can keep more than two axes, if you want, based on what your data is like.

The next step is to generate the euclidian distances between each genotype, using our PCA results.

> D <- dist(snap.pca$scores[,1:2])^2 ### class dist 
                                     ### Euclidean distances

You could skip the PCA altogether, if you wanted, by using an R package that will compute genetic distances between all individual genotypes directly from genetic data. As long as you can supply distance data as class “dist” you should be fine.

> class(D)
[1] "dist"

Next we’re going to use the R package popper to make our minimum spanning tree. We are going to go back and use the “reds” object (not reds 2) and convert that into geneclone format, and add some strata.

> library(poppr)

> inds <- c(1:218)
> levels(reds@pop) <- c("pop1","pop2","pop3","pop4")

> pops <-c(rep(1,length(which(reds@pop=="pop1"))), rep(2,length(which(reds@pop=="pop2"))), rep(3,length(which(reds@pop=="pop3"))), rep(4,length(which(reds@pop=="pop4"))))

> hier <- cbind(inds, pops)
> dim(hier)
[1] 218 2

### The "strata" I've made here are nothing more than labels.
### The function rep() just creates repetitive lists.
### rep(1,length(which(reds@pop=="pop1"))) just means repeat
### the number 1 the same number of times as there are individuals
### in population 1.
### It makes the program happy, so I do it.

> snappy <- as.genclone(reds)
> snappy@strata <- as.data.frame(hier)

> poppr.msn(snappy, D, mlg.compute = "original", sublist = "All", blacklist = NULL, vertex.label = "", wscale = FALSE, showplot = TRUE, include.ties = TRUE)

And now we have our plot:

msn-page-001

Each circle is an individual red snapper. And the color indicates the population it came from… or rather the geographic location where it was sampled. The edges in the tree are proportionate to the Euclidean distances between genotypes.

A first glance suggests that there is a lot of mixing among the geographic locations, possibly indicating that this is one big intermixing population. But take a closer look, because the data don’t say that at all!

Most of these circles are positioned next to other circles of the same color. Those colors may be dispersed across the whole tree, but within the tree are many homogeneous clusters of genotypes.

What does this mean? It means that migration between geographic populations is relatively low, and much of it is probably historical. It means population connectivity is high on evolutionary time scales, but low on recent ecological time frames.

Honestly, I don’t think that this is the best plot to show these patterns, which is why I’m not putting it in the paper. But it’s still a useful thing to know how to make. For some data sets, particularly where there has been long-term isolation, something like this could really be the centerpiece of a paper.

The plotting of the tree is based on random functions, so if you don’t like the way everything looks, try plotting it again. Next time you may get some thing better.

I think there is a lot of unexplored potential with plots like this. I haven’t spent too much time playing around with it. If you end up getting creative with this, please share!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s