Lately I’ve been struggling to find a good way to visually represent annotations for genomic loci. Specifically, I have a set of outlier SNP loci that I need to describe in a way that is both reductive and hierarchical, for publication. More importantly, it has to look awesome, because everybody who reads my paper needs to remember it.
Well, after several days of scratching my head at icicle plots and sunburst plots, I finally found something that is truly pleasing to the eye.
Lucky for me, there is this great new R package called treemap that draws… well… treemaps. The best way to describe a treemap is just to show you one:
Every color on the plot represents a category, and within each colored box are sub-categories. In the example above, most of my outlier SNP loci were associated with intron sequences, and a small portion of those were in splice site donor regions (at the 5′ end of an intron). Likewise, the exon box is divided into exon, 3′ untranslated region (3′ UTR), and 5′ untranslated region (5′ UTR), according to the proportion of SNPs found in those parts of the gene.
The plot effectively conveys the information I need it to, but it also looks like a piece of mid-century modern art. Looking at it makes me feel a bit… groovy.
So, how did I make it? There is a useful online tutorial found at: https://rpubs.com/brandonkopp/creating-a-treemap-in-r
But since you’re already here, I’ll show you specifically what I did.
> assoc1 <- c(rep("Intron", 2), rep("Exon", 3), "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic")
> assoc2 <- c("Intron", "Splice site donor", "Exon", "5′ UTR", "3′ UTR", "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic")
> assoc3 <- as.numeric(c(39,3,11,2,4,15,12,14))
What I’ve just done above is make three vectors: two levels of categorical labels and one vector of numeric values. You might notice that all the numbers add up to 100. That’s right, they are the percentages of loci found in each category… but raw counts should work just as well, since the box areas are drawn in proportion to the values.
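As a quick sanity check on those vectors, here is a minimal sketch. The `raw_counts` values are not real data: they are simply the percentages doubled, an assumption made only to show how counts could be converted back to percentages.

```r
# The three vectors from the post
assoc1 <- c(rep("Intron", 2), rep("Exon", 3),
            "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic")
assoc2 <- c("Intron", "Splice site donor", "Exon", "5' UTR", "3' UTR",
            "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic")
assoc3 <- c(39, 3, 11, 2, 4, 15, 12, 14)

# All three vectors must be the same length, and the percentages should sum to 100
stopifnot(length(assoc1) == length(assoc2),
          length(assoc2) == length(assoc3),
          sum(assoc3) == 100)

# Hypothetical raw counts (percentages doubled, for illustration only)
raw_counts <- assoc3 * 2

# Converting counts to percentages recovers the same relative sizes
pct <- 100 * raw_counts / sum(raw_counts)
print(all.equal(pct, assoc3))
```

Because only the relative sizes matter for the box areas, either `raw_counts` or `pct` should produce the same layout.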
Next I combine these vectors into a data frame and plot.
> assocX <- as.data.frame(cbind(assoc1, assoc2))
> assocX <- cbind(assocX, assoc3)
> names(assocX) <- c("association", "sub-assoc", "proportion")
> assocX
         association          sub-assoc proportion
1             Intron             Intron         39
2             Intron  Splice site donor          3
3               Exon               Exon         11
4               Exon             5′ UTR          2
5               Exon             3′ UTR          4
6   Upstream < 5 kbp   Upstream < 5 kbp         15
7 Downstream < 5 kbp Downstream < 5 kbp         12
8         Intergenic         Intergenic         14
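The same table can also be built in a single `data.frame()` call. This is an alternative sketch, not what I originally ran; `check.names = FALSE` is there so the hyphenated column name `sub-assoc` survives unchanged. Summing the proportions per top-level category then gives the sizes that the big colored boxes represent.

```r
assoc1 <- c(rep("Intron", 2), rep("Exon", 3),
            "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic")
assoc2 <- c("Intron", "Splice site donor", "Exon", "5' UTR", "3' UTR",
            "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic")
assoc3 <- c(39, 3, 11, 2, 4, 15, 12, 14)

# Build the data frame in one step, keeping the non-syntactic name "sub-assoc"
assocX <- data.frame(
  association = assoc1,
  `sub-assoc` = assoc2,
  proportion  = assoc3,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Total proportion per top-level category (the area of each colored box)
totals <- tapply(assocX$proportion, assocX$association, sum)
print(totals)
```

The intron total comes out to 42 (39 + 3), which is why the intron box dominates the plot.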
> library(treemap)
> treemap(assocX, index=c("association","sub-assoc"), vSize="proportion", type="index", palette="Set1", title="Gene Associations", fontsize.title=14, fontsize.labels=14)
And that’s it. There’s nothing else to it.
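Since the figure is headed for a paper, it may help to write it straight to a file. The sketch below assumes the treemap package is installed (the plot is skipped otherwise); the file name and the 7 x 5 inch size are arbitrary choices for illustration, not settings from my actual figure.

```r
assocX <- data.frame(
  association = c(rep("Intron", 2), rep("Exon", 3),
                  "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic"),
  `sub-assoc` = c("Intron", "Splice site donor", "Exon", "5' UTR", "3' UTR",
                  "Upstream < 5 kbp", "Downstream < 5 kbp", "Intergenic"),
  proportion  = c(39, 3, 11, 2, 4, 15, 12, 14),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temp directory keeps the example side-effect free
out_file <- file.path(tempdir(), "gene_associations.pdf")

# Only draw if the treemap package is available (assumption: it is installed)
if (requireNamespace("treemap", quietly = TRUE)) {
  pdf(out_file, width = 7, height = 5)  # open a PDF graphics device
  treemap::treemap(assocX,
                   index = c("association", "sub-assoc"),
                   vSize = "proportion",
                   type = "index",
                   palette = "Set1",
                   title = "Gene Associations",
                   fontsize.title = 14,
                   fontsize.labels = 14)
  dev.off()                             # close the device so the file is written
}
```

Swapping `pdf()` for `png()` (with sizes in pixels) should work the same way if a raster image is needed.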
Here’s another example. This time, instead of gene association, I made a treemap of putative gene function.
Graphs like these are information sugar: people absorb them quickly and enjoy the stimulation. Bold colors and geometric patterns drive information into people’s brains as effectively as any drug. Ergo, I dare say that these treemaps are some of the most effective figures I’ve ever come across.
Expect to see them in virtually all my future publications.