SNP haplotypes for Mixed-stock Fishery Analysis: Microhaplot to Rubias conversion in R

If you’ve ever done any fisheries genetics, you’ve probably heard of Eric Anderson, who is a computational biologist at the NOAA southwest fisheries science center. His github page has over 100 repositories, many of which contain software programs that are broadly applicable. One of his more recent tools (co-written with Thomas Ng) is called Microhaplot:

This is an R package with an associated shiny app that can be used to group SNP alleles from PCR amplicons into haplotype loci. I have found it very useful.

When I say haplotype, I’m not talking about haploid DNA, like the mitochondrial genome. Sometimes people use the word “haplotype” to describe mtDNA variants, but the haplotypes I am talking about are diploid, recombining, genomic DNA. They are SNP loci found in at least some degree of linkage on the same PCR amplicon. Not all the SNPs will be tightly linked, which means that you can potentially have more than two allelic variants for each haplotype locus, even if all of the SNPs are biallelic.

There are some good reasons why you’d want to organize SNPs into haplotypes, the most important of which is often increased statistical power (see McKinny et al. 2017. Canadian J of Fisheries and Aquatic Sciences, 2017, 74(4): 429-434), but first you need to get the haplotype data into a useful format.

There are several things you could do with your haplotypes, but the program I have been analyzing them with is another Eric Anderson R package called Rubias, which performs mixed-stock fishery analysis and population assignment. The github page for Rubias is found here:

You can also use SNP haplotypes as markers for relatedness analysis, which Anderson and colleagues have done recently (Baetscher et al. 2019. Molecular Ecology. For this they used another one of their in-house software packages called, CKMRsim, which does not yet have good documentation, so I’ll leave it alone for now.

Basically, for mixed-stock fishery analysis, you have a sample of caught fish, from some fishery, that you presume contains individuals from several spawning populations. If you want to know which population each belongs to, you need a sample of reference fish to help you identify the fishery samples. Rubias takes both mixture and reference data sets and tells you the probability of any given fish coming from any given reference population.

At the moment there is no publicly available script to convert a Microhaplot output into Rubias format. I emailed Eric Anderson a while back and asked him if he had one. “It just takes a couple lines of R code,” was all he said.

Well, I don’t know about Eric, but for me coming up with an R function that would move my haplotypes into Rubias format took the better part of an afternoon, and evening. My R function is more than just a few lines, but it does work. And soon it will be yours too.

First, we’ll need some data. Specifically, you’ll need to download comma separated (.csv) text files from Microhaplot. These should be the “filtered” output—one for each population of reference fish, and one for each mixed fishery.

The data should look like this:

Screen Shot 2019-03-18 at 9.55.06 PM
Download your data from Microhaplot as a .csv file. Make sure that it is the “filtered” output.

In reality, you could download one large .csv file with all your baseline and mixed stock samples. But that would be a huge file, and I often find it annoying to deal with big files, especially if I am just going to be subsetting them anyways. In this case, at least, I prefer to take an additive approach to data science, rather than a deconstructive one.

If you decide to download your data as a single file, you can import it into R simply like this:

> read.csv(“my_big_fat_file.csv”, header=TRUE, stringsAsFactors=F, colClasses = c(“character”))) -> big_fat_fisheries.csv

If you have many files, make a list of file names and apply read.csv() to each one.

> ref_list = list.files(pattern = “*.csv”)

> datalist_ref = lapply(ref_list, function(x) read.csv(x, header=TRUE, stringsAsFactors=F, colClasses = c(“character”)))

This gives you an R object that is a list of data frames. Note that in this case I have all the Microhaplot outputs from my reference populations in the current working directory, so that the function list.files(pattern = “*.csv”) returns a list of all files with that suffix.

If you want to do what I did, put all the mixed fishery sample outputs in a separate directory, and then navigate to that directory inside R, using: > setwd(“../fishery_files”), and make a second datalist_mix file.

These files should look like this:

> datalist[[1]][1:3,]
X group indiv.ID locus haplo depth allele.balance rank …
1 ALR 384-3-12O 87515516 TC 1751 1 1 …
2 ALR 384-3-13B 87515516 TC 1739 1 1 …
3 ALR 384-3-12G 87515516 TC 1609 1 1 …

Just so you know, individuals that are heterozygous appear twice, once for each allele, while homozygotes only get one entry. Therefore, to convert this data into a genotypes file you need to find a way to account for some individuals having two entries, while others only have one.

Now I will unveil the R functions:

> make_rubias_ref <- function(df) {
df2 <- arrange(df[,1:8], group, indiv.ID, locus)  # arrange sample ID’s and loci in order
ids <- df2$indiv.ID[!duplicated(df2$indiv.ID)] # a list of your sample ID’s
rubias=NULL                                                           # an object to hold the formatted data
for (i in 1:length(ids)) { id <- df2[which(df2$indiv.ID==ids[i]),]    # loop through fish
dups <- duplicated(id$locus)      # identify heterozygous loci (duplicated locus names)
dups[which(dups==TRUE)-1] <- TRUE   # tag the first allele as TRUE as well as the dup
n.times <- as.numeric(dups)              # count number of duplicates per locus
n.times[which(n.times==0)] <- 2 # loci w/ 0 dups are homozygous, still need 2 alleles
df3 <- id[rep(seq_len(nrow(id)), n.times),]  # dataframe w/ right number of entries
df3$locus <- make.unique(as.character(df3$locus)) # make unique 2nd allele name
gt <- spread(df3[,3:5], locus, haplo) # put everything in the right shape
rubias <- bind_rows(rubias, gt) # now bind it to the rubias object
sample_type <- rep(“reference”, length(ids)) # the rest of these lines are just pasting
repunit <- as.character(rep(df2$group[1], length(ids)))                         # attributes into
collection <- as.character(rep(df2$group[1], length(ids)))                     # the final object
rubias <- cbind(sample_type, repunit, collection, rubias)
colnames(rubias)[4] <- “indiv”

> make_rubias_mix <- function(df) {    # same as above but for the mixture samples
df2 <- arrange(df[,1:8], group, indiv.ID, locus)
ids <- df2$indiv.ID[!duplicated(df2$indiv.ID)]
for (i in 1:length(ids)) { id <- df2[which(df2$indiv.ID==ids[i]),]
dups <- duplicated(id$locus)
dups[which(dups==TRUE)-1] <- TRUE
n.times <- as.numeric(dups)
n.times[which(n.times==0)] <- 2
df3 <- id[rep(seq_len(nrow(id)), n.times),]
df3$locus <- make.unique(as.character(df3$locus))
gt <- spread(df3[,3:5], locus, haplo)
rubias <- bind_rows(rubias, gt)
sample_type <- rep(“mixture”, length(ids))
repunit <- as.character(rep(NA, length(ids)))
collection <- as.character(rep(df2$group[1], length(ids)))
rubias <- cbind(sample_type, repunit, collection, rubias)
colnames(rubias)[4] <- “indiv”

Now you have everything you need to make the conversion. Make a datalist object for your reference populations, and another datalist for your mixed fishery samples, and run the respective R functions on them, like so:

> convlist_ref = lapply(datalist_ref, function(x) make_rubias_ref(x))

> convlist_mix = lapply(datalist_mix, function(x) make_rubias_mix(x))

After you’ve done this there is one last step. The convlist object contains a list of Rubias formatted data, and these need to be merged.

> fish_ref <- bind_rows(convlist_ref[1:length(convlist_ref)])

> fish_mix <- bind_rows(convlist_mix[1:length(convlist_mix)])

There will be a few warnings when executing this, but these warnings can be safely ignored.

Your Rubias-formatted data now looks like this:

> fish_ref[1:3, 1:6]
sample_type repunit collection indiv 86908655 86908655.1 …
1 reference ALR ALR 384-3-10N GT GT …
2 reference ALR ALR 384-3-10P GT GT …
3 reference ALR ALR 384-3-11A GT GT …

Notice that the genotypes for each allele are just “GT”. Rubias takes any sequence of alpha-numeric characters, as uses that to represent allelic variants.

Also notice that the second allele of that first locus has a unique name, with a ‘.1’ appended on the end. That is the result of the, make.unique(as.character(df3$locus)), line of code. The unique allele names are required in Rubias format.

So, how many haplotype variants per locus do you have? Here is an R script that will plot a histogram for you.

> length(names(fish_ref))
[1] 616

> the=NULL
> odds=NULL
> new_the <- for(i in 1:615) {                                      ### we use 615 instead of 616
hello <- c(fish_ref[,i], fish_ref[,i+1])                          ### or else i+1 will be out of bounds
the <- c(the, length(levels(as.factor(hello))))
odds <- the[c(TRUE,FALSE)]                                       ### only odd numbers are real loci.

> odds <- odds[3:308] ### we have 306 loci, the first two values are from the label columns

> max(odds)
[1] 14                 ### Our most polymorphic locus has 14 haplotype variants, not bad!
> hist(odds, breaks=seq(1,14,1), labels=TRUE, xlim = c(1,14), xlab=”number of haplotypes per locus”)

Screen Shot 2019-03-19 at 8.50.19 AM
About two thirds of our loci are no more variable than biallelic SNPs, but one of our loci now has 14 variants.

Now that I see it… I’d better double check that locus with 14 haplotype variants, to make sure the polymorphism in this amplicon is real and not noise. And there you have another reason for haplotyping your SNPs—it helps you detect problem loci.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s