It is easy to flash overlapping DNA sequences together using Biopython

A very common task in Bioinformatics is to take two overlapping DNA sequences and “flash” them into a single sequence. You know, kinda like:

Before:

AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACA

                                 GTTAACCACATGCGTAGCCACTTGCAATGTCTC

After:

AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACATGCGTAGCCACTTGCAATGTCTC

But what program do you use to do it? There are a number of software packages that can do this kind of thing for you, but if you just want to flash some DNA together as a one-off, or include a flash step into a pipeline, then you should write a short Biopython script.

Biopython is a collection of bioinformatics tools written in python and combined into a package. From a unix terminal it is super easy to install, just type:

pip install biopython

I use biopython a lot. It is my go-to package for parsing any fasta/fastq file. From the python terminal simply type

>>> from Bio import SeqIO

It makes it super easy to read the records of a fasta file one at a time, something that isn’t so straightforward otherwise, because each record takes up two lines

>sequence_1a
AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACA

To read a fasta file you can type:

>>> for records in SeqIO.parse("An_awesome_fasta_file.fasta", "fasta"):
...    for record in records:
          etc. etc.

But to flash two overlapping sequences together we’ll have to import some other modules from the biopython library as well.

>>> from Bio import pairwise2   
### a pairwise sequence alignment tool

>>> from Bio.pairwise2 import format_alignment   
### for nice and neat printing

So, lets say you have two fasta files of the same length, where the sequence in one file corresponds to the counterpart sequence in the other file, and the two overlap.

Let’s take a look at what this data will look like in biopython by reading all fasta records into two lists, one for each file.

>>> file1=[]

>>> file2=[]

>>> for record1 in SeqIO.parse("Awesome_fasta_file1.fasta", "fasta"):
...    file1.append(record1)

>>> for record2 in SeqIO.parse("Awesome_fasta_file2.fasta", "fasta"):
...    file2.append(record2)

Each item in the list is a custom data class just for fasta sequences.

>>> file1
[SeqRecord(seq=Seq('AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACA', SingleLetterAlphabet()), id='sequence_1a', name='sequence_1a', description='sequence_1a', dbxrefs=[])]

This data class is itself a list that includes the sequence, the id, name, description, and other database reference numbers. In this case the last is blank, and the three previous are all identical, because I’ve given the sequence a simple name (sequence_1a). But fasta names can get complex, and if you download fasta sequences from NCBI or other data bases these fields will be unique.

And just so you know, for this example I’m only using fasta files with one sequence each.

>>> len(file1)
1
>>> len(file2)
1

You can access any element of the record just by adding the element’s name to the end of your object, with a dot. Like so:

>>> file1[0].seq
Seq('AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACA', SingleLetterAlphabet())

Too ugly for you? Me too. Let’s turn that into a string object

>>> str(file1[0].seq)
'AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACA'

But I digress.

Back to flashing.

Now we want to align these two sequences together. We can do this with biopython’s pairwise aligner.

>>> a_good_alignment = pairwise2.align.localms(file1[0].seq,file2[0].seq, 2, -1, -3, -3)

Let me break this down for you. First, it is important to note that this is a local alignment:

pairwise2.align.localms

That means that the two sequences are not compared end-to-end, and can align anywhere along the sequence… like, in the middle. Pairwise2 also does global alignments, but that will seriously mess up what we are trying to do here.

local alignement:

>>> print(format_alignment(*a_good_alignment[0]))
AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACA-----------------------
                                 ||||||||||
---------------------------------GTTAACCACATGCGTAGCCACTTGCAATGTCTC

global alignment:

>>> print(format_alignment(*bad_alignment[0]))
AGTCGGTA-C-A-AAT---TA-CACAGCATATGCTTTACTGTTAACCACA
||||||||||||||||||||||||||||||||||||||||||||||||||
-GT---TAACCACA-TGCGTAGC-CA-C-T-TGC-A-A-TG-T--CT-C-

Thus, local = good, and global = bad. Let’s move on.

You can get multiple alignments for any two given sequences. In fact pairwise2 is more than capable of finding hundreds of possible alignments.

>>> len(bad_alignment)
572                    ### Yikes, that's a lot of alignments!

So, you want to give it very strict parameters that penalize opening and extending gaps in the alignment. If we look at the function again

pairwise2.align.localms(file1[0].seq, file2[0].seq, 2, -1, -3, -3)

The first two variables are the two overlapping DNA sequences.

The second set of variables give 2 points (points are good) for every matching character and take away one point for every mismatching character.

And the third set of variables are the penalties for opening a gap and extending a gap, respectively.

For DNA sequence alignments, one must use the stick instead of the carrot, and so you need to set the penalties quite high.

>>> len(a_good_alignment)
1                     ### Ah... much better.

But an alignment is not a flashed consensus sequence. And so, we’ll use a bespoke function to convert the alignment into a single contiguous sequence.

>>> def make_me_a_consensus(algn):
...    consensus=[]
...    for s1, s2 in zip(algn[0][0],algn[0][1]):
...       if s1 == s2:
...          consensus.append(s1)
...       elif s1 == '-':
...          consensus.append(s2)
...       elif s2 == '-':
...          consensus.append(s1)
...    return ''.join(consensus)

Let’s test.

>>> make_me_a_consensus(a_good_alignment)
'AGTCGGTACAAATTACACAGCATATGCTTTACTGTTAACCACATGCGTAGCCACTTGCAATGTCTC'

Pefect. Now, one last thing before I end:

Since you’ll almost certainly want to do this for a whole bunch of sequences, and not just a single pair, lets put it into a loop and write the consensus sequences to a new file.

>>> my_iter = iter(range(len(file1))) ### an iterable to loop through
>>> with open("consensus_loci.fasta", "w") as file: ### new file
...    for i in my_iter:
...       algn=pairwise2.align.local(file1[i].seq,file2[i].seq, 2, -1, -3, -3)
...       con = make_me_a_consensus(algn)
...       file.write(">%s-%s\n%s\n" %(file1[i].id, file2[i].id, con))

And there you go. The whole thing boils down into a relatively small block of code so it is easy to implement on the fly, or fit into larger scripts.

 

 

 

 


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s