The Doyle Lab ESPSearch    



ESPSearch

If you publish results obtained using ESPSearch, please cite:
Watt, T. J. and Doyle, D. F. ESPSearch: A Program for Finding Exact Sequences and Patterns in DNA, RNA, or Protein. Biotechniques 2005, 38, 109-115.


NEW: ESPSearch Graphical User Interface (ESPSearchGUI)
Current Version: 1.01 (30-May-2005) (changes)
ESPSearchGUI is a graphical interface for ESPSearch that lets you load, save, and change settings. It does NOT add any new functionality to ESPSearch; it merely provides a more convenient way to interact with it. To use it, run espsearchgui.py instead of espsearch.py.


ESPSearch Current Version: 1.01 (24-Apr-2005) (changes)
NOTE: Version 1.01 adds a non-critical option to the espsearch.ini file, so you should download a copy of espsearch.ini in addition to espsearch.py if updating.

What is ESPSearch?

 
ESPSearch, which stands for Exact Sequence and Pattern Search, searches a DNA, RNA, or protein sequence for specific target sequences, such as the 3-base triplets recognized by specific zinc fingers. An unlimited number of target sequences can be searched for simultaneously. Pattern matching can also be performed to look for patterns among the hits, such as direct repeats or other long stretches that are recognized by zinc fingers.

ESPSearch is written in Python (www.python.org). Python is a scripting language, so ESPSearch is distributed as source code rather than as a compiled program; this means you are free to edit the code if you desire (subject to the terms of the GPL). Because Python runs on essentially any platform, ESPSearch should work on essentially any computer with Python installed.

What ESPSearch is Designed to Do

 
ESPSearch is designed to find the exact occurrences of essentially any target sequence within any source sequence. Moreover, it does so through a simple interface that is fast and easy to configure, works with nearly any operating system, and can be modified if additional functionality is needed. Specific abilities include:
  • Search arbitrary source sequences for many, possibly complex, target sequences simultaneously. ESPSearch can search for 1 or 1000 target sequences simultaneously with little or no difference in speed, and is unique in its ability to do so with arbitrarily complex sequences.
  • Identify complex target sequences that may be difficult or impossible to identify with other tools (e.g., BLAST) due to length restrictions, wildcard constraints, variable regions, or specific mismatch levels. ESPSearch has no length restrictions, and a great deal of complexity may be specified for target sequences.
  • Analyze target hits for specific patterns. ESPSearch is unique in its ability to analyze the relationship of target hits according to user-specified patterns constructed from target sequence hits. Patterns can be simple or complex as needed.
  • Provide detailed output, including hit locations, mismatches, and size of variable regions. ESPSearch can provide an annotated sequence indicating where, and in which frame, all target sequences are found, or generate a separate file for each target sequence listing hits.
  • Allow customization of search parameters at all levels. For example, it is possible to search for non-standard DNA bases, arbitrary groups (e.g., hydrophobic amino acids), or use non-standard complements.
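
As a rough illustration of what searching for several targets at once with wildcards and a mismatch limit involves, the sketch below slides each target along a DNA source and counts mismatches against the standard IUPAC nucleotide codes. It is not ESPSearch's code or algorithm; the names `count_mismatches`, `find_targets`, and `max_mismatches`, and the sample sequences, are assumptions made for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(window, target):
    """Count positions where a source base is not allowed by the target code."""
    return sum(1 for base, code in zip(window, target)
               if base not in IUPAC_DNA[code])

def find_targets(source, targets, max_mismatches=0):
    """Return {name: [(1-based start, matched text, mismatches), ...]}.

    source  -- plain DNA string (assumed to contain only A, C, G, T)
    targets -- {name: target sequence, which may contain IUPAC codes}
    """
    source = source.upper()
    hits = {}
    for name, target in targets.items():
        target = target.upper()
        found = []
        # Check every window of the target's length, so overlapping hits count.
        for start in range(len(source) - len(target) + 1):
            window = source[start:start + len(target)]
            mismatches = count_mismatches(window, target)
            if mismatches <= max_mismatches:
                found.append((start + 1, window, mismatches))
        hits[name] = found
    return hits

if __name__ == "__main__":
    # Hypothetical source and targets, chosen only to exercise the function.
    example = find_targets("AAGCGCGTT", {"GCG": "GCG", "GNG": "GNG"})
    for name, found in example.items():
        print(name, found)
```

This simple scan does one pass per target, so its running time grows with the number of targets; it is meant only to make the ideas of wildcards and mismatch levels concrete.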

How ESPSearch is Different

 
ESPSearch is a general tool for identifying any target within any source sequence with complete control over all aspects of the search, but may not be the best choice for specific applications where specialized tools exist for the search.
  • ESPSearch is not a general alignment tool such as BLAST. BLAST is generally faster and more efficient for aligning most sequences than ESPSearch. However, ESPSearch will locate sequences that BLAST cannot find, such as very short sequences and sequences containing many wildcards or gaps.
  • ESPSearch can, like many other tools in existence (such as the TRANSFAC tools), identify a specific class of binding site (e.g., transcription factors). However, ESPSearch is not limited to a particular data set or scanning a particular DNA sequence. On the other hand, these other specialized tools may be faster for certain specific applications that take advantage of their specialization.
  • ESPSearch differs from most software that searches for "patterns" (such as TRANSFAC's "Patch" program) in that the "patterns" searched by ESPSearch are combinations of specified target sequences, determined entirely by you. Most other programs are highly restrictive in their "patterns." Moreover, complex pattern searches in ESPSearch are straightforward: you define specific sequences in a database and one or more patterns that arrange them as necessary.
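
The idea of a pattern as an arrangement of target hits can be sketched as follows. Given hit positions for named targets, the code looks for an ordered series of targets whose hits follow one another within an allowed gap. This is a simplified illustration, not ESPSearch's pattern syntax or algorithm; `find_pattern`, `max_gap`, and the sample hit data are hypothetical.

```python
def find_pattern(hits, order, max_gap=0):
    """Find chains of hits that follow `order`, each hit starting within
    `max_gap` positions after the end of the previous hit.

    hits  -- {target name: [(start, length), ...]} with 1-based starts
    order -- list of target names, e.g. ["GCG", "GAG", "GCG"]
    Returns a list of chains; each chain is a list of (name, start) tuples.
    """
    # Every hit of the first target starts a candidate chain.
    chains = [[(order[0], start, length)]
              for start, length in sorted(hits.get(order[0], []))]
    for name in order[1:]:
        extended = []
        for chain in chains:
            _, prev_start, prev_length = chain[-1]
            prev_end = prev_start + prev_length  # first position after previous hit
            for start, length in sorted(hits.get(name, [])):
                if 0 <= start - prev_end <= max_gap:
                    extended.append(chain + [(name, start, length)])
        chains = extended
    return [[(name, start) for name, start, _ in chain] for chain in chains]

if __name__ == "__main__":
    # Hypothetical hits: two 3-base targets found in some source sequence.
    hits = {"GCG": [(1, 3), (7, 3)], "GAG": [(4, 3)]}
    # Three adjacent triplets forming one 9-base stretch.
    print(find_pattern(hits, ["GCG", "GAG", "GCG"]))
    # A direct repeat of GCG separated by at most 3 bases.
    print(find_pattern(hits, ["GCG", "GCG"], max_gap=3))
```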

Downloading ESPSearch

 
Zip archive of all the following files: ESPSearch.zip

You can download the following files by right-clicking on the file name and choosing "Save As..." or "Save Target As...".
Program Files
ESPSearch.py - program file.
ESPSearch.html - manual & license.
ESPSearchGUI.py - program file for graphical user interface.
ESPSearchGUI.html - manual & license for graphical user interface.
ESPSearch.ini - the configuration file.
ESPSearch_Sample.ini - a configuration file already set to use the DNA rules and Human Zinc Fingers database (below). To use it as a demonstration of ESPSearch:
  1. Rename the file to "espsearch.ini".
  2. Download any DNA sequence (plain, FASTA, EMBL, or GenBank formatted, perhaps a few hundred to a few thousand basepairs).
  3. Name the sequence "source.txt" and place it in the same directory as espsearch.py, espsearch.ini, dna.txt, and humanzif.txt.
  4. Run espsearch.py.
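
The preparation in steps 1 and 3 could be scripted roughly as shown here. This is a sketch under assumptions: the helper name `prepare_demo` is hypothetical, all files are assumed to have been downloaded into one directory, file names are compared without regard to case, and the sample configuration is copied rather than renamed so the original is kept. Steps 2 and 4 (obtaining a sequence and running espsearch.py) are left to you.

```python
import os
import shutil

# File names taken from the steps above; comparing them case-insensitively
# is an assumption of this sketch.
REQUIRED = ["espsearch.py", "espsearch.ini", "dna.txt", "humanzif.txt", "source.txt"]

def prepare_demo(directory, sample_ini="ESPSearch_Sample.ini"):
    """Copy the sample configuration to espsearch.ini and report missing files.

    Returns a sorted list of required file names not found in `directory`.
    """
    sample_path = os.path.join(directory, sample_ini)
    ini_path = os.path.join(directory, "espsearch.ini")
    if os.path.exists(sample_path):
        # Copy instead of rename so the downloaded sample stays untouched.
        shutil.copyfile(sample_path, ini_path)
    present = {name.lower() for name in os.listdir(directory)}
    return sorted(name for name in REQUIRED if name not in present)
```

Calling `prepare_demo(".")` in the download directory returns an empty list once everything is in place; any names it returns still need to be downloaded or created.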

Databases
HumanZif.txt - human zinc fingers database. Use a DNA rules file.
BAXsiRNA.txt - possible siRNA sequences for BAX. Use a DNA rules file.
LXXLL.txt - LXXLL motifs for nuclear receptor-coactivator interactions. Use Protein_Ex rules file.


Rule Files
DNA.txt - rules for searching standard DNA sequences.
DNA2.txt - rules for searching standard DNA sequences when the source sequence contains IUPAC wildcards.
RNA.txt - rules for searching standard RNA sequences (expects and outputs U instead of T).
RNA2.txt - rules for searching standard RNA sequences when the source sequence contains IUPAC wildcards.
Protein.txt - rules for searching standard amino acid sequences.
Protein2.txt - rules for searching standard amino acid sequences when the source sequence contains IUPAC wildcards.
Protein_Ex.txt - same as Protein.txt rules, but adds J = I, L, or V and U = S or T.
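
To illustrate how extended group codes such as J and U broaden a protein search, the sketch below expands a motif into a regular expression. The J and U definitions come from the description of Protein_Ex.txt above; treating X as "any of the 20 standard amino acids", the names `motif_to_regex` and `find_motif`, and the sample peptide are assumptions. This is not the rule-file format that ESPSearch reads.

```python
import re

# J and U as described for Protein_Ex.txt; X as "any residue" is an assumption.
GROUPS = {"J": "ILV", "U": "ST", "X": "ACDEFGHIKLMNPQRSTVWY"}

def motif_to_regex(motif):
    """Translate a motif with group codes into a compiled regular expression."""
    parts = []
    for code in motif.upper():
        members = GROUPS.get(code, re.escape(code))
        parts.append("[" + members + "]" if code in GROUPS else members)
    # The lookahead lets overlapping matches be reported too.
    return re.compile("(?=(" + "".join(parts) + "))")

def find_motif(sequence, motif):
    """Return [(1-based start, matched text), ...] for every match of motif."""
    pattern = motif_to_regex(motif)
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]

if __name__ == "__main__":
    peptide = "AIHRVLA"  # hypothetical peptide, for illustration only
    print(find_motif(peptide, "LXXLL"))  # strict LXXLL motif
    print(find_motif(peptide, "JXXJJ"))  # relaxed motif using the J group
```

With the relaxed J code, a stretch such as IHRVL qualifies even though it would not match the strict LXXLL motif.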