The Genealogical World of Phylogenetic Networks: A BLAST search doesn't do what you think it does

BLAST is a computer program that searches a database for similarity matches to a given query sequence, either DNA or amino acid. It is most commonly used to search the GenBank database for matches to any new sequence that we might happen to have, in the hope that we will find one or more homologous sequences.

To most of us BLAST is a black box, in the sense that we have little idea about the details of how it does what it does. So, maybe we should at least look at what it does, just in case we ever need to know.

About 10 years ago I was working with some EST data. For those of you not old enough to know, ESTs consist of short DNA reads from arbitrary primers. In the hope of identifying the coding gene represented by each EST, BLASTX is used to search the GenBank protein database using each translated nucleotide query (in all six possible reading frames). BLASTX produces an E-value for each matching sequence, representing the strength of the match to the query. An E-value is not a probability (as they can vary from 0 to infinity), but at p=0.050 the expected E-value happens to be E=0.051. There is no consensus for what E-value should serve as indicating a "significant" match.

I decided to find out what happens if a DNA query sequence varies in either length or GC content. I used both random sequences (which were thus not in GenBank) as well as real sequences (which were in GenBank). The short answer is that the BLASTX results vary a lot. I never published these results because I figured the first thing a referee would do is ask me to explain BLASTX's behaviour, and I did not have an explanation (and still don't).

I present the results here for what they are still worth. Obviously, the results are not restricted to EST data, but apply any time that we use BLASTX.

Experimentation

The content of GenBank is quite different today to what it was back in late 2003, and so maybe the results will vary if the work was to be repeated. For reference, the first graph shows the GC content of the GenBank protein-coding sequences at the time of my work. Also, it is possible that BLASTX is different as well — I used v. 2.2.6 with default parameters (BLOSUM62, edge correction, length correction, SEG filtering, universal genetic code, gap penalty 11+k). Maybe some intrepid soul will be inspired to find out what happens nowadays.

Random sequences

I generated sets of 1,000 replicate "ESTs" using the perl script Randseq by M. Raymer (5/27/2003). These sets varied in DNA length (100–1,000 nt) and in GC content (0–94%), but were otherwise random sequences of nucleotides. These sequences are not expected to be homologous to anything already in GenBank, and should thus form BLASTX matches only by random chance.

The results for varying the sequence length are shown in the next graph, with each point representing the mean E-value observed. The lines represent four somewhat different GC contents; and the anticipated E-value for random data (0.051) is also shown. Clearly, very few points are near the expected value. The lines all show the same shape, with a minimum E-value near 450 nt, and rising slowly with longer lengths and rising rapidly with shorter lengths.

A more detailed assessment of the results for varying the GC content is shown in the third graph. The lines represent two somewhat different sequence lengths; and the anticipated E-value for random data (0.051) is also shown. It is clear that the E-value is capable of varying by up to seven orders of magnitude in response to variation in the GC content of the sequence.

Real sequences

I used the sequences contained in the Poxvirux Orthologous Clusters database (POCs), which used to be available at: http://athena.bioc.uvic.ca/pbr/POCs/pocs.html. This has since been replaced by the Viral Orthologous Clusters database (VOCs). These virus protein sequences are expected to already be in GenBank, and they should thus form good BLASTX matches.

The POCs database could be queried by both sequence length and GC content, and it was the only such database that I could find at the time. For each combination of length (in 50-nt bands) and GC-value (in 10% bands) I gathered a minimum of 20 sequences. There were few sequences for the shortest lengths, so I chopped up the longest sequences (longer then needed) to increase the sample size. There were also few sequences at the greatest GC values, so I used sequence AE004437.1 from GenBank (a Halobacterium sp.) to increase the sample size.

The results are shown in the final graph, with each point representing the mean E-value observed. The E-values are all small, since they represent actual database matches. Clearly, variation in sequence length can lead to orders of magnitude variation in E-value, while variation in GC content has an effect only at longer sequence lengths.

Conclusions

For a program that is supposed to produce comparable results, no matter what the sequence, these BLASTX results are disquieting. After all, BLAST is one of the most cited programs ever (see Massive citations of bioinformatics in biology papers), and yet I suspect that most people do not realize that it behaves like this.

The random sequences assess the effect of false positives. That they vary so much in E-value is amazing. Clearly, BLASTX E-values are not comparable between sequences. It is interesting that GC content seems to have a bigger effect than sequence length — for any given GC content the effect of length is relatively small for sequences longer than c. 600 nt. However, variation in GC content can produce orders of magnitude of effect at any given sequence length.

The real sequences assess the effect of true positives. That they vary in E-value is also not good — the E-values all represent true database matches (and presumably exact ones). Nevertheless, the effect of variation in sequence length and GC content is repeated for these real sequences. However, variation in GC content only has a large effect for the longer sequences, and instead it is the sequence length that produces the orders of magnitude variation in E-value.

You can make of this what you will.

1 comment:

Christopher HogueJanuary 20, 2015 at 11:41 PM
You are effectively, probing the effects of Compostitionally Adjusted Scoring Matrices, which were described by Alstchul in 2005:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1343503/

and in translated sequence searching in 2006:
http://www.ncbi.nlm.nih.gov/pubmed/17156431

As you are with testing with non-natural random sequences, and varying composition, I am not surprised by what you are reporting.

BLAST here, is doing exactly what I think it is supposed to do, and I'm certain if you dig into the above noted papers you will understand why.

Wednesday, January 14, 2015

A BLAST search doesn't do what you think it does

1 comment: