Tools for extracting and displaying the DNA duplicates in a complete genome

Buffat L.1, Vincens P., Hazout S.

Centre de Bioinformatique, INSERM U444 et U155, Universite Paris 7, case 7113, 2 Place Jussieu, 75251 Paris Cedex 05, France; 1 E-mail: buffat@urbb.jussieu.fr

Keywords: DNA duplicates, motif extraction, yeast sequences, genome analysis, suffix tree.

With genome projects in full development and the number of nucleic acid sequences in databases increasing rapidly, efficient computational tools are required to search systematically for duplicated motifs in large sequences, particularly in a complete genome. The yeast genome has been completely sequenced; the sequences of its 16 chromosomes (whose sizes vary between 200 and 1500 kb) together represent 12 megabases. We propose a program that finds all the regions showing a given level of similarity (65%) within the whole genome. We begin by extracting k-duplicates (i.e. the exact motifs present at least k times in a sequence set), then regroup and extend these k-duplicates into larger similar sections.

We briefly describe the principle used to construct the tree of k-duplicates and the rules for merging and extending them, then present the graphic tools developed in Splus. The program ODP ('Occurrence Distribution Process'), which we developed in C++, works in two steps. The first step searches for exact duplicated motifs present at least k times in a set of N DNA or protein sequences (the approach applies to any sequence set over an alphabet of C characters; C = 4 for nucleic acid sequences and C = 20 for protein sequences). The second step extends the exact duplicated motifs into similar regions called "extended duplicates without gap" (EDWG). The tree of k-duplicates is the subtree of the suffix tree (Weiner, 1973) made up of the nodes that have at least k leaves.
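As a rough illustration of what the tree of k-duplicates contains, the sketch below enumerates every exact word that occurs at least k times by refining position lists one character at a time. It is not the ODP implementation (which is written in C++): the function name `k_duplicates`, the separator character `#` and the optional `max_len` cap are assumptions made for this example only.

```python
from collections import defaultdict

def k_duplicates(sequences, k, max_len=None, sep="#"):
    """Return {word: [start positions]} for every exact word found at least k times.

    Hypothetical helper: `sep` is assumed never to occur inside a sequence, and
    positions are indexes into the concatenated text.
    """
    text = sep.join(sequences)                 # concatenate the N sequences
    found = {}
    stack = [("", list(range(len(text))))]     # root node: empty word, all positions
    while stack:
        word, positions = stack.pop()
        if word:
            found[word] = positions
        if max_len is not None and len(word) == max_len:
            continue
        # Distribute the position list of this node among its child nodes,
        # one child per character that can follow the current word.
        children = defaultdict(list)
        for p in positions:
            q = p + len(word)
            if q < len(text) and text[q] != sep:   # never extend across a sequence boundary
                children[text[q]].append(p)
        for ch, plist in children.items():
            if len(plist) >= k:                    # keep only nodes with at least k occurrences
                stack.append((word + ch, plist))
    return found
```

For two toy sequences, `k_duplicates(["ACGTACG", "TTACGA"], k=3)` keeps the word `ACG` with its three start positions in the concatenated text, whereas `TA`, which occurs only twice, is pruned.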

Initially, the N sequences are concatenated into a single large sequence, and the list of all position indexes is attached to the tree root. The main principle consists of distributing the list of position indexes held in a given node (representing a given n-word) among the "sons" of this node (i.e. the words obtained by appending one new character). The process is repeated for every node holding at least k position indexes. At the end of the processing, we obtain the tree of k-duplicates. We then build all possible pairs of positions for every motif. Each pair corresponds to a putative similar region between two sections of the genome. From the set of pairs, the EDWG can be determined. The method consists of joining pairs of motifs when the shifts between their position indexes are identical and when the length of the section between the motifs (i.e. the number of unmatched characters) is less than a user-fixed value. After this motif-joining step, we extend each EDWG at its extremities using a statistical approach based on the proportion of base identity.
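The pairing and joining rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the ODP code: the function name `join_pairs`, the input format (a mapping from each motif to its start positions) and the output tuples are hypothetical, and the final statistical extension at the extremities is not reproduced because the abstract does not specify it.

```python
from collections import defaultdict
from itertools import combinations

def join_pairs(motifs, max_gap):
    """Join motif pairs that share the same shift into gap-free regions.

    motifs  : {word: [start positions]} (assumed input format)
    max_gap : user-fixed limit; two motifs are joined when the number of
              unmatched characters between them is strictly less than max_gap
    Returns a sorted list of (start1, start2, length) tuples.
    """
    # Build every pair of positions for every motif and group the pairs by shift.
    by_shift = defaultdict(list)
    for word, positions in motifs.items():
        for p1, p2 in combinations(sorted(positions), 2):
            by_shift[p2 - p1].append((p1, p1 + len(word)))  # half-open interval on the first copy

    regions = []
    for shift, intervals in by_shift.items():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start - cur_end < max_gap:      # small gap (or overlap): join the two motifs
                cur_end = max(cur_end, end)
            else:                              # gap too large: close the current region
                regions.append((cur_start, cur_start + shift, cur_end - cur_start))
                cur_start, cur_end = start, end
        regions.append((cur_start, cur_start + shift, cur_end - cur_start))
    return sorted(regions)
```

With `motifs = {"ACGT": [0, 100], "GGA": [6, 106], "TTT": [50, 90]}` and `max_gap=3`, the first two motifs share a shift of 100 and are separated by two unmatched characters, so they are joined into one region of length 9; the third motif has a different shift and stays on its own. Note that the number of pairs grows quadratically with the number of occurrences of a motif, so a real implementation would need to handle highly repeated motifs with care.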

Finally, we obtain a database of similar regions, which can be displayed with several graphic tools developed in Splus: (i) a boxplot of the locations, along a given sequence, of the duplicates that are also present in the other sequences; (ii) a dotplot of the k-duplicates between two sequences, with the possibility of zooming in on a particular region; (iii) the distribution of motif occurrences along a given sequence. This database of similar regions can be used to study genome organization. As an illustration, we shall present several macroduplications in the yeast genome.