Identifying private SNPs in R

In any genetic data set it’s almost always desirable to find variants that are unique, or “private”, to a particular group. Most microsatellite-based papers report private alleles per population as part of their summary statistics, but now that genomic SNPs are the molecular marker in vogue, few papers I’ve seen tend to bother with private SNP alleles.

Some papers do report private SNPs, but if you look at their methods they always cite an ad hoc python script, or something.

Not helpful.

Maybe extracting private SNPs is a trivial matter. Maybe anybody with any bioinformatics skill at all can do it. But for those of you who would like a little help identifying private SNPs in your data I thought I’d give you my solution, in R.

Step 1.

Today we are going to be working with the R package hierfstat, which takes FSTAT formatted data.

If you don’t have FSTAT formatted data I recommend using the java-based program PGD-spider to format your input file.

> library(hierfstat)

Some of you old-timers might remember the stand-alone program FSTAT. I used to run it for microsatellite data. But that was before. Hierfstat is now.

I seem to recall that FSTAT would find private alleles for you, but hierfstat has lost that capability. It does, however, calculate the relative allele frequencies for each population with the function pop.freq(), and that will be good enough for us.

> read.fstat(“Data.dat”, na.s = c(“0″,”00″,”000″,”0000″,”00000″,”000000″,”NA”)) -> the_data

> relative_frequencies <- pop.freq(the_data)

> length(rel_freq)
[1] 93058                                   ### I have this many loci

> relative_frequencies[1]

x           1                   2                   3                   4                   5                   6                  7
2 0.66666667 0.80000000 0.80952381 0.81818182 0.72727273 0.83333333 0.90909091

4 0.33333333 0.20000000 0.19047619 0.18181818 0.27272727 0.16666667 0.09090909

            8                   9                   10                11

2 0.90476190 0.79411765 0.81818182 0.93181818
4 0.09523810 0.20588235 0.18181818 0.06818182

Note that when we read in the data we have to specify what our missing data (na.s) look like. Or, if you are like me and can’t be bothered to look inside your data, just offer R a range of possible missing data values.

The result of the pop.freq() function is a list of tables. The first column of each table (x) is the allele, which in Locus.1 is either cytosine (2) or thymine (4).  The other columns of the table represent the populations in your sample… eleven in my data.

Step 2

Now, how do we iterate over every table in the list to find private SNP alleles?

We use the lapply() function to “apply” a function to every item in the list. And the function we want to use needs to pull out all the zero frequencies from each population.

> lapply(relative_frequencies, function(x) which(x==0)) -> zeros

> zeros[1:2]

[1] 5 7 9 11 13 21

> relative_frequencies$Locus.2

x          1                   2                    3                    4                   5                    6                  7
1 0.02380952 0.05000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000

3 0.97619048 0.95000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000

8                9                   10                 11

0.02380952 0.08823529 0.06818182 0.00000000
0.97619048 0.91176471 0.93181818 1.00000000

Looking at our new zeros list object, we see that the first two loci in the list have naught and six zeros, respectively.

Upon examining the second locus, we can indeed see that six out of eleven populations have fixed allelic differences.

Step 3

We are interested mostly in loci where ten out of eleven populations have fixed differences, so now lets iterate over the zeros list and look for loci with ten zeros.

> lapply(zeros, function(x) which(length(x) == 10)) -> ten_zeros

> ten_zeros[3339:3340]

[1] 1

The new object ten_zeros is another list. In this list, loci with ten zeros register a 1 and those with less or more register 0.

Step 4

So now we just need to know which loci in the ten_zeros list have a 1.

> length(which(ten_zeros==1))
[1] 38

> private_SNPs <- relative_frequency[which(ten_zeros==1)]


x         1                     2                   3                   4                   5                    6                   7
2 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000 1.00000000

4 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000

8                    9                   10                11

2 0.97619048 1.00000000 1.00000000 1.00000000
4 0.02380952 0.00000000 0.00000000 0.00000000

And there we have it. Out of 93,058 loci there are only 38 private SNPs.

These 38 loci are now stored in the object labeled private_SNPs.

And that’s all there is to it!


2 thoughts on “Identifying private SNPs in R

  1. Hello, I am trying getting fstat format from 45k snps but thePGDspider seem not be able to run with this ammount of loci: please have you any alternative? Would be possible read any other kind of input in your procedure? thank you!


    1. You’re probably not giving Java enough memory allocation.

      I don’t know what system you’re using, but in a unix-based environment (mac or linux) you can allocate memory to PGDSpider in the command line like this:

      java -Xmx7024m -Xms512m -jar PGDSpider2.jar

      This gives 7.024 gigs of memory max, and 512 megs min. On my computer this is enough to process over 90,000 SNPs… but it does take a while.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s