Java Degenerate Search Algorithm

JDSA (Java Degenerate Search Algorithm)
By Thomas Boulay, Kadonaga Lab, UCSD
Current release: 0.1 (February 2004) - All Rights Reserved

Notice: This project is in sporadic development. If you need to search for promoter elements or degenerate sequences and JDSA is problematic for you, I highly recommend Mark Rebeiz's Gene Palette software (Posakony Lab, UCSD). It is also written in Java (and runs on Macs and PCs), has a very impressive graphical interface, can be set to search for degenerate sequences, and is in regular development. --TB

Disclaimer: No permanent problems have occurred on any test computer while using the JDSA program. However, you assume all risk when you download and use this program. Source code is available here. This program is freeware and available to all. We ask only that, if you use the program or modify the source code, you cite CY Lim et al. (2004).

JDSA is a Java program that performs degenerate searches of DNA sequences. It can search for multiple nucleotides at a given position (positional degeneracy), limited overall sequence accuracy (group degeneracy), or variable spacing between multiple DNA sequences (spacing degeneracy). The program was designed to search the S. pombe genome or the D. melanogaster genome, but custom searches can be performed provided the input is in the correct format. It was written in Java 2, SDK 1.4.2_03, contains a graphical interface (Swing v 1.1), and has been tested on Win98/XP PCs, Mac OS 9.1, and Mac OS X (10.1.4).

Necessary downloads:
Any Java application needs the proper virtual machine installed before it will run. Virtual machines are platform (operating system) specific and can be obtained by following the links below.

Win 9x/2000/XP:
1. You will need to download and install the Java Runtime Environment (JRE). It is available here.
2. Download the program file (JDSAv01.jar) and run it (see instructions section below.)
3. You may also need the Java Foundation Classes (JFC)/Swing Package. That is available here.

Mac OS 9.x:
1. You will need to download and install the Macintosh Runtime for Java (MRJ) 2.2.5. It is available here.
2. You will need to download a file called swingall.jar. This file will need to be placed in the folder: System Files: Extensions: MRJ Libraries: MRJClasses
3. Download the program file and run it (see instructions section below.)

Mac OS X:
1. Download the file: JDSAv01.jar. Double-click the icon and go.

Additional Files: If you want to search an entire genome and the files are not stored locally, you must create a file containing the GI numbers for every genome sequence file and point the file chooser to that file. Here are the Drosophila melanogaster Release v3.2 and S. pombe genome GI lists. These may not be the most up-to-date annotations of the sequencing results.

Principle:
This project was originally begun to search the Drosophila genome for promoter regions that contain the Downstream Promoter Element (or DPE) or Motif Ten Element (MTE). The project design was to create a search algorithm that contained the ability to search for DNA sequences using the following three kinds of degeneracy variables:

1. Positional degeneracy: Use this when searching for a DNA element that is degenerate at a given position, such as the initiator (Inr) region of Drosophila. The Initiator of Drosophila is the consensus sequence from which transcription starts (the +1 of an RNA transcript). This sequence is T-C-A-(G or T)-T-(T or C). Searching for each permutation separately (by BLAST search, for example) is very inefficient. Instead, using the standard IUPAC designation for degenerate nucleotides, the JDSA algorithm will return all permutations of the Inr sequence in a single search. For example, the above Inr sequence can be written: TCAKTY. The Downstream Promoter Element (Burke & Kadonaga (1996); Kutach & Kadonaga (2000)) would be written as: RGWYG. For more information regarding the DPE and promoters, click here.

The IUPAC degeneracy codes are below:

IUPAC   Nucleotide(s)       Complement Nucleotide(s)
A       A                   T
C       C                   G
G       G                   C
T       T                   A
M       A OR C              K
R       A OR G              Y
W       A OR T              W
S       C OR G              S
Y       C OR T              R
K       G OR T              M
V       A OR C OR G         B
H       A OR C OR T         D
D       A OR G OR T         H
B       C OR G OR T         V
N       A OR C OR T OR G    N
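
As an illustration of positional degeneracy, the sketch below expands each IUPAC code from the table above into the nucleotides it allows and scans a sequence for every exact match. It is written in modern Java and is not taken from the JDSA source; the class name IupacMatcher, its method names, and the test sequence are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the JDSA source.
final class IupacMatcher {
    // Each IUPAC code mapped to the nucleotides it stands for (see the table above).
    private static final Map<Character, String> CODES = new HashMap<Character, String>();
    static {
        CODES.put('A', "A");   CODES.put('C', "C");   CODES.put('G', "G");
        CODES.put('T', "T");   CODES.put('M', "AC");  CODES.put('R', "AG");
        CODES.put('W', "AT");  CODES.put('S', "CG");  CODES.put('Y', "CT");
        CODES.put('K', "GT");  CODES.put('V', "ACG"); CODES.put('H', "ACT");
        CODES.put('D', "AGT"); CODES.put('B', "CGT"); CODES.put('N', "ACGT");
    }

    // True if the nucleotide is one of those allowed by the IUPAC code.
    static boolean allows(char code, char nucleotide) {
        String allowed = CODES.get(Character.toUpperCase(code));
        return allowed != null && allowed.indexOf(Character.toUpperCase(nucleotide)) >= 0;
    }

    // Returns the 0-based start index of every position where the whole pattern matches.
    static List<Integer> findAll(String pattern, String sequence) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int start = 0; start + pattern.length() <= sequence.length(); start++) {
            boolean ok = true;
            for (int i = 0; i < pattern.length() && ok; i++) {
                ok = allows(pattern.charAt(i), sequence.charAt(start + i));
            }
            if (ok) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // TCAGTT starts at index 2 and TCATTC at index 10; both are permutations of TCAKTY.
        System.out.println(findAll("TCAKTY", "GGTCAGTTAATCATTCGG")); // prints [2, 10]
    }
}
```

A single pass with the degenerate pattern finds both permutations, which is the advantage over running one search per permutation.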

2. Group degeneracy: When searching for a DNA binding element, a perfect match may not be necessary because the entirety of the element may not be necessary for binding the corresponding protein. That is to say, if an element is 6 nucleotides long, sometimes a given protein that normally binds to this region will bind if there is only a 5/6 match to the consensus sequence. Binding to a perfect match versus an imperfect match could be an integral part of regulating activity. Again using the Inr region, it is possible that transcriptional machinery will bind to an Inr region if only 5/6 nucleotides are a match. In this case, we would say that the maximum allowable mismatch is 1. If you ask the JDSA program to search for a given sequence with a maximum mismatch of 1, the mismatch has the potential to appear anywhere within the sequence.
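
One way to picture group degeneracy is a sliding window that counts mismatching positions and keeps any window at or below the allowed maximum. The sketch below assumes that approach; it is not the JDSA implementation, and the class MismatchSearch, its methods, and the test sequence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the JDSA source.
final class MismatchSearch {
    // Compact IUPAC lookup: the code letter followed by the bases it allows.
    private static final String[] IUPAC = {
        "AA", "CC", "GG", "TT", "MAC", "RAG", "WAT", "SCG", "YCT", "KGT",
        "VACG", "HACT", "DAGT", "BCGT", "NACGT"
    };

    static boolean allows(char code, char base) {
        char c = Character.toUpperCase(code);
        char b = Character.toUpperCase(base);
        for (String entry : IUPAC) {
            if (entry.charAt(0) == c) {
                return entry.indexOf(b, 1) >= 0; // skip the code letter itself
            }
        }
        return false;
    }

    // Counts mismatches in the window at 'start'; stops early once 'limit' is exceeded.
    static int mismatches(String pattern, String sequence, int start, int limit) {
        int count = 0;
        for (int i = 0; i < pattern.length(); i++) {
            if (!allows(pattern.charAt(i), sequence.charAt(start + i))) {
                count++;
                if (count > limit) {
                    break;
                }
            }
        }
        return count;
    }

    // Returns the 0-based start of every window with at most maxMismatch mismatches.
    static List<Integer> findAll(String pattern, String sequence, int maxMismatch) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int start = 0; start + pattern.length() <= sequence.length(); start++) {
            if (mismatches(pattern, sequence, start, maxMismatch) <= maxMismatch) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sequence = "AATCAGTTGGTCACTCAA";
        System.out.println(findAll("TCAKTY", sequence, 0)); // prints [2]
        System.out.println(findAll("TCAKTY", sequence, 1)); // prints [2, 10]
    }
}
```

In the example, TCACTC at index 10 differs from TCAKTY only at the fourth position (C is not allowed by K), so it is reported as a 5/6 match once one mismatch is permitted.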

3. Spacing degeneracy: This type of degeneracy concerns the spacing between two DNA elements. Returning to the promoter example, if you wanted to look for a TATA box that was close to an initiator, you could do so with the JDSA program. A TATA box is most biologically pertinent when it appears no further than 30 nucleotides upstream of the initiator and no closer than 10. To have the JDSA program search for these results, you would enter the following:
- New Search, 2 DNA elements.
- For element #1: TATAAA (the TATA box consensus) with any desired maximum allowable mismatch (see group degeneracy)
- For element #2: TCAKTY (the Drosophila Initiator sequence) with the given maximum allowable mismatch.
- For element #2: Maximum distance away from element 1 would be 30
- For element #2: Minimum distance would be 10.

Please note: the number signifies the nucleotides in between the given elements. For example, if you wanted to search element A and element B, and the end of A could be 0 nucleotides away from the start of element B, then the minimum distance would be 0.
Likewise, if the distance between two DNA elements is fixed, simply enter the same number for both the maximum and minimum distance.
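
The distance rule above can be sketched as a filter over two lists of hit positions, where the gap is the number of nucleotides between the end of element 1 and the start of element 2. This is a minimal sketch under that reading of the rule, not the JDSA implementation; the class SpacingFilter, its parameters, and the hit positions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the JDSA source.
final class SpacingFilter {
    /**
     * Pairs each hit of element 1 with each hit of element 2 whose gap lies in
     * [minGap, maxGap]. Positions are 0-based start indices. A gap of 0 means
     * element 2 starts immediately after element 1 ends.
     */
    static List<int[]> pair(List<Integer> firstStarts, int firstLength,
                            List<Integer> secondStarts, int minGap, int maxGap) {
        List<int[]> pairs = new ArrayList<int[]>();
        for (int a : firstStarts) {
            int endOfFirst = a + firstLength; // index just past the end of element 1
            for (int b : secondStarts) {
                int gap = b - endOfFirst;     // nucleotides in between the two elements
                if (gap >= minGap && gap <= maxGap) {
                    pairs.add(new int[] {a, b});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed hits: a 6-nt TATA box at index 5, initiator candidates at 15, 36 and 50.
        List<Integer> tataHits = Arrays.asList(5);
        List<Integer> inrHits = Arrays.asList(15, 36, 50);
        for (int[] p : pair(tataHits, 6, inrHits, 10, 30)) {
            System.out.println("TATA at " + p[0] + ", Inr at " + p[1]);
        }
    }
}
```

With a minimum of 10 and a maximum of 30, only the initiator at index 36 is kept: it sits 25 nucleotides past the end of the TATA box, while the candidates at 15 and 50 are 4 and 39 nucleotides away.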

Instructions:
Starting the program: On a Win 9x/XP machine, open a DOS prompt in the directory that contains JDSAv01.jar and type
java -jar JDSAv01.jar
On a Mac, click the JDSA icon.

Searching for a sequence: To search for a given sequence or set of sequences, click "New..." from the MenuBar, and then "New JDSA Search..." from the pull-down menu.

A pop-up menu will appear and ask you "How many fragments?" From the pull-down menu, you should select the number of separate DNA elements that you wish to search. In the promoter example, if you just wanted to search for the initiator, you would enter 1. If you wanted to search for a TATA box and an initiator, you would enter 2, and so on. Then click "OK". Clicking "Cancel" will abort the search.

A new screen titled "JDSA input" will appear, and its appearance will depend upon how many fragments you said that you needed to search. In the separate top subpanels, you can enter the needed information (the sequence of a given fragment, how many mismatches you will allow, how far it is from the previous fragment, and so on).

Please note, when you start, you should not be able to click the OK button (bottom right). The OK button will only become enabled when you have entered enough VALID information to proceed. If you enter a character that the program does not recognize (a non-IUPAC character or a letter where a number is expected) the OK button will be disabled and remain disabled until the problem is corrected.
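
The kind of check that could drive the OK button can be sketched as two small validation methods: one for IUPAC sequences and one for whole-number fields. This is only an assumption about how such a check might look, not the JDSA code; the class InputValidator and its methods are hypothetical, and accepting lowercase letters is also an assumption.

```java
// Hypothetical helper for illustration only; not part of the JDSA source.
final class InputValidator {
    private static final String IUPAC_CODES = "ACGTMRWSYKVHDBN";

    // A sequence is valid if it is non-empty and every character is an IUPAC code.
    // Lowercase input is accepted here, which is an assumption.
    static boolean isValidSequence(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            if (IUPAC_CODES.indexOf(Character.toUpperCase(text.charAt(i))) < 0) {
                return false;
            }
        }
        return true;
    }

    // A numeric field is valid if it is non-empty and contains only the digits 0-9.
    static boolean isValidNumber(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // The OK button would be enabled only when every field passes.
    static boolean canProceed(String sequence, String maxMismatch) {
        return isValidSequence(sequence) && isValidNumber(maxMismatch);
    }

    public static void main(String[] args) {
        System.out.println(canProceed("TCAKTY", "1"));   // prints true
        System.out.println(canProceed("TCAXTY", "1"));   // prints false: X is not an IUPAC code
        System.out.println(canProceed("TCAKTY", "one")); // prints false: letters where a number is expected
    }
}
```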

Parsing and Filtering: Several options are available to cut down on unwanted results and to maximize the information returned so that the results have more meaning. These options are in the lower-right corner, just above the Cancel/Proceed buttons.

Parsing: This is an attempt to place the results of a genomic search in its genomic context. If the Parse the results? checkbox is checked, the results will look like this:

1. ggatggattgatttgcctattgcatttata [C]SPAC7D4 {5646}:In ORF: SPAC7D4.12c; Start of exon <-- {1373 bp} SEQUENCE FOUND complement strand {906bp} --> end of exon
6. tataaactgcatatttatactccttttccaatt SPAC7D4 {13918}:Extragenic: Previous gene:SPAC7D4.15c [C]<-- {649 bp} SEQUENCE FOUND {355bp} --> SPAC7D4.08 [C]

Both of these results show how the output is formatted. Each result lists the sequence, the strand (complementary strand results are designated with a [C]), the file name (in this case the pombe file name SPAC7D4), the nucleotide number (5646), and where the sequence lies in the genome: in an ORF, extragenic, intronic, or some basic combination of these possibilities. It also lists how far the resultant sequence is from the surrounding elements. In the extragenic example, the sequence is 649bp from gene SPAC7D4.15c and 355bp from gene SPAC7D4.08.

Parse Filter: Set to No Parse Filter by default, but can be changed to extragenic only, in ORF only, in intron only, or extragenic/in intron.

Strand Filter: Set to No Strand Filter by default, but can be changed to Forward Strand Only (especially useful if you have a custom search to perform, see below) or Complement Strand Only.
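
A common way to find hits on the complement strand is to search the forward strand with the reverse complement of the query, using the complement column of the IUPAC table. Whether JDSA does exactly this is an assumption; the sketch below only illustrates the technique, and the class StrandUtil is hypothetical.

```java
// Hypothetical helper for illustration only; not part of the JDSA source.
final class StrandUtil {
    // Position i in COMPLEMENTS is the complement of position i in CODES (from the IUPAC table).
    private static final String CODES       = "ACGTMRWSYKVHDBN";
    private static final String COMPLEMENTS = "TGCAKYWSRMBDHVN";

    // Reverses the sequence and complements each IUPAC code.
    static String reverseComplement(String sequence) {
        StringBuilder out = new StringBuilder(sequence.length());
        for (int i = sequence.length() - 1; i >= 0; i--) {
            char c = Character.toUpperCase(sequence.charAt(i));
            int index = CODES.indexOf(c);
            if (index < 0) {
                throw new IllegalArgumentException("Not an IUPAC code: " + c);
            }
            out.append(COMPLEMENTS.charAt(index));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("TCAKTY")); // prints RAMTGA
        System.out.println(reverseComplement("TATAAA")); // prints TTTATA
    }
}
```

Under this approach, a Forward Strand Only search would use the query as entered, and a Complement Strand Only search would use only its reverse complement.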

Starting the search: After you've entered all of the information, click Proceed.

That window will disappear and another small window will appear saying "Click here to START". Click when you're ready to begin. The program will begin its search of your query and return your results to you when it is done.

Input Source Parameters/Selecting a Source File

Three types of files are allowed as valid inputs.

FASTA formatted files

Downloaded .HTML files (regardless of name) from NCBI

Files that contain a list of GI numbers to search at NCBI. This list must be in plain text format and organized as follows:

23095176 "AE003474"
23092840 "AE003475"
2894275 "Pombe cosmid c6B1"
6689257

There must be only one entry per line. The GI number must precede any description you wish to provide. The descriptions are optional. If you include a description, it must be in quotations. This way, when JDSA returns the results of the search, the results can be listed along with their respective title.
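
The GI list format described above (one entry per line, GI number first, optional description in quotation marks) can be read with a short parser. The sketch below is one possible reading of those rules, not the JDSA code; the class GiListParser is hypothetical, and rejecting malformed lines with an exception is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the JDSA source.
final class GiListParser {
    // A GI number, optionally followed by whitespace and a description in double quotes.
    private static final Pattern ENTRY = Pattern.compile("(\\d+)(?:\\s+\"(.*)\")?");

    // Maps each GI number to its description (empty string when none is given),
    // preserving the order of the file. Blank lines are skipped.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> entries = new LinkedHashMap<String, String>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            Matcher m = ENTRY.matcher(line);
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed GI entry: " + raw);
            }
            String description = m.group(2) == null ? "" : m.group(2);
            entries.put(m.group(1), description);
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "23095176 \"AE003474\"",
            "2894275 \"Pombe cosmid c6B1\"",
            "6689257");
        for (Map.Entry<String, String> e : parse(lines).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

GI numbers are kept as strings so that long identifiers are not subject to integer overflow.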

Notes on usage:

The speed of this program depends on several factors. Since it is a degenerate search, it may have to do the functional equivalent of searching the genome multiple times to find what you're asking for, and this takes time.

Also, this program pulls the genome files from NCBI as it needs to search them. Therefore, internet traffic and the load on NCBI will impact performance.

In heavy traffic, a complex search of the Drosophila genome using the Internet as the source can take, and has taken, up to an hour to run. There is a status bar, but be forewarned.

Known Bugs:

The file(s) to be searched must be selected before entering the sequence fragments.

Very large FASTA files sometimes crash the program. This is being investigated.

Future Directions

To incorporate a batch client

Contact info: This project is in non-continual development. Any input is welcome. If you have ideas for features that should be added or removed, or you want to report a bug or offer support, please direct all contacts here: tboulay at biomail dot ucsd dot edu.

Last updated, October 2004
