Saturday, October 27, 2012

Computational Biology CB_J0002


Title : On the Average Sequence Complexity
Author : Svante Janson, Stefano Lonardi and Wojciech Szpankowski
Year Publish : 2004
Place of Publish: Springer Berlin / Heidelberg
Abstract :

In this paper we study the average behavior of the number of distinct substrings in a text of size n over an alphabet of cardinality k. This quantity is called the complexity index and it captures the “richness of the language” used in a sequence. For example, sequences with low complexity index contain a large number of repeated substrings and they eventually become periodic (e.g., tandem repeats in a DNA sequence). In order to identify unusually low- or high-complexity strings one needs to determine how far are the complexities of the strings under study from the average or maximum string complexity. While the maximum string complexity was studied quite extensively in the past, to the best of our knowledge there are no results concerning the average complexity. We first prove that for a sequence generated by a mixing model (which includes Markov sources) the average complexity is asymptotically equal to n 2/2 which coincides with the maximum string complexity. However, for memoryless source we establish a more precise result, namely the average string complexity is n 2/2–nlog k n+(1+(1–?)/ln k +? k (log k n)+o(1))n where??0.577 and ? k (x) is a periodic function with a small amplitude for small alphabet size.

No comments:

Post a Comment