Molecular Biology Databases on the Internet

(Biotechniques Article - September 1996)

Does the Net measure up to the hype? A lot of people think the Net is full of silly stuff (Internet Doom comes to mind), and that you can't do serious things on it. This article introduces some of the most useful molecular biology tools on the Internet: the genomic databases and the search and analysis services built around them.

Most of us have heard of the GenBank or SWISS-PROT databases. Many of us use them on a daily basis for finding sequences of interest or designing PCR primers. These databases are commonly accessed from CD-ROMs and packaged with sequence analysis software. Usually updated quarterly, these packages are a useful and necessary resource for the molecular biologist. Several drawbacks exist, however, including cost, lack of continuous updates, and limited and varied search engines. While the outright replacement of these packages by tools available on the World Wide Web (WWW) is unlikely for heavy users, due to potential delays in receiving results, the WWW now offers services that address all of these drawbacks while performing nearly all the tasks that the commercial packages offer.

The expanding resources on the WWW in this area include search and analysis tools for: proteins, nucleic acids, multiple sequence alignments, PCR primers, gene features and secondary structures in proteins. Due to the brevity of this column, I will summarize the tools available at some of the best starting points on the web. The means of interaction with these tools on the WWW vary greatly. Most of these sites accept data submission directly from the browser, through a “forms” interface. This means that you type (or cut and paste) the sequence data directly into electronic fill-in-the-blank forms in the browser window, and then click a submit button. If you do not have access to a forms-capable browser such as Netscape or Internet Explorer, alternate types of data submission include email, ftp, and even some gopher interfaces. Quick types of searches may be set up to display the results on a web page, whereas more lengthy types of searches may send you the results by e-mail. On-site instructions and documentation are quite good and make these sites very user friendly.
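As a rough illustration of what the browser does when you click the submit button, the sketch below packs form fields into a URL-encoded POST request using Python's standard library. The endpoint URL and the field names (`sequence`, `database`) are hypothetical placeholders, not the interface of any site described in this column, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, fields):
    """Encode form fields the way a browser form does and wrap them in a POST request."""
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical endpoint and field names, for illustration only.
request = build_form_request(
    "http://example.org/cgi-bin/search",
    {"sequence": "ATGGCGTACGTT", "database": "nucleotide"},
)

print(request.get_method())          # the method is POST because a body is attached
print(request.data.decode("ascii"))  # the URL-encoded form body
```

A real service documents its own field names, so the on-site instructions are the place to look before scripting anything like this.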

A good place to start a tour of molecular biology services on the WWW is the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/). The NCBI site contains several major resources. The most well known among these is probably GenBank. GenBank is the NIH genetic sequence database, an annotated compilation of all publicly accessible DNA sequences (835,000 as of June 1996). GenBank at NCBI is part of the International Nucleotide Sequence Database Collaboration, in conjunction with the DNA DataBank of Japan (DDBJ, http://www.ddbj.nig.ac.jp/) and the European Molecular Biology Laboratory (EMBL, http://www.embl-heidelberg.de/). These organizations exchange data daily and maintain nearly identical sequence databases.

Searching GenBank for sequences of interest is a remarkably simple process. By following the links from the NCBI page to the GenBank search page (http://www2.ncbi.nlm.nih.gov/genbank/query_form.html), you will be presented with a simple search utility. The form boxes on this page allow you to enter descriptive information about the sequence you are seeking. For example, if you enter “HIV” on the first line and “gp120” on the second line and click the “Run Query” button, the first one hundred GenBank sequences for HIV gp120 will appear on the next screen. Additional features of this interface allow for field restrictions as well as use of the “or” and “but not” operators. In general, this is how most sequence retrieval pages function, and it is a good example of the simplest type of sequence analysis tool available on the web.
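One way to picture these query operators is as set operations on the records that match each term. The sketch below is only a toy model: the record descriptions are invented, not real GenBank entries, and it assumes that terms entered on separate lines must both match. It says nothing about how the NCBI server actually evaluates a query.

```python
def matching(records, term):
    """Return the IDs of records whose description contains the term (case-insensitive)."""
    term = term.lower()
    return {rec_id for rec_id, text in records.items() if term in text.lower()}


# Invented toy descriptions, not real GenBank entries.
records = {
    "rec1": "HIV-1 envelope glycoprotein gp120 gene",
    "rec2": "HIV-1 gag polyprotein gene",
    "rec3": "SIV envelope glycoprotein gp120 gene",
}

hiv = matching(records, "HIV")
gp120 = matching(records, "gp120")

both = hiv & gp120      # two terms entered together: both must match (assumed)
either = hiv | gp120    # "or": at least one term matches
hiv_only = hiv - gp120  # "but not": the first term without the second

print(sorted(both), sorted(either), sorted(hiv_only))
```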

The WWW Entrez Database (http://www3.ncbi.nlm.nih.gov/Entrez/) allows the user to access three databases: the National Library of Medicine's MEDLINE database, the NCBI protein database, and the NCBI nucleotide database (GenBank). The MEDLINE subset database is compiled from journal publications, and genetic sequences located in this database have the added benefit of being linked to the abstract of the journal article that first reported them. The Protein and Nucleotide entries in Entrez come from several databases: GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, and PDB. Cross-referencing attempts have been made to reduce sequence duplication.

The Basic Local Alignment Search Tool (BLAST) Gateway (http://www.ncbi.nlm.nih.gov/BLAST/) is another means of accessing the databases at NCBI. BLAST searches can be performed at a basic or advanced level. The basic search permits the choice of a search program and uses default filtering parameters. Advanced searches allow the selection of various additional filters. There are five different BLAST programs available, which allow everything from a simple sequence homology search to complex comparisons of the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. The interface is relatively simple, with the requirement that the sequence be in FASTA format (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html), which is essentially a one-line description beginning with “>” followed by the sequence in the standard IUB/IUPAC amino acid and nucleic acid codes. Results are e-mailed to users in either text or HTML format. Options for advanced searches allow the user to include histograms of scores for each search, set statistical significance thresholds, limit the number of alignments considered, specify alternate scoring methods (matrices), or limit analysis to one strand, among other options.
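To make the terms "FASTA format" and "six-frame translation" concrete, here is a minimal sketch in Python. It assumes a FASTA entry is a description line starting with ">" followed by sequence lines, and it uses the standard genetic code. The example sequence is made up, and the code only illustrates the idea; it is not what the BLAST programs run internally.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(text):
    """Return (description, sequence) pairs from FASTA-formatted text."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def translate(dna):
    """Translate complete codons; codons with letters other than A/C/G/T become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three frames of the given strand, then three of its reverse complement."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)
    ]


# Made-up example entry, for illustration only.
example = ">example made-up sequence\nATGGCC\nTAA\n"
description, sequence = parse_fasta(example)[0]
frames = six_frame_translations(sequence)
print(description, sequence, frames)
```

For this nine-base example the six frames come out as MA*, WP, GL, LGH, *A and RP, where "*" marks a stop codon. Searching translated frames in this way is what lets a nucleotide query be compared at the protein level.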

Other NCBI resources demonstrate additional uses of genetic information. The first is OMIM (Online Mendelian Inheritance in Man). This database is a catalog of human genes and genetic disorders and was developed for the WWW by NCBI. In this database you will find pictures and references, as well as text-based information. It also contains copious links to NCBI's Entrez database of MEDLINE articles and sequence information. The NCBI Taxonomy database contains information about all the organisms represented by at least one sequence in the genetic databases. This tool, like OMIM, demonstrates the power of interlaced or cross-referenced data, combining taxonomy with raw genetic information. Other databases available for searching through the NCBI site include dbEST (Database of Expressed Sequence Tags), dbSTS (Database of Sequence Tagged Sites), and MMDB (Molecular Modeling Database).

The European Molecular Biology Laboratory (http://www.embl-heidelberg.de/) is another great site for accessing genomic databases. The European Bioinformatics Institute (EBI, http://www.ebi.ac.uk/), an EMBL subsidiary, offers the Nucleotide Sequence Database--a comprehensive database of DNA and RNA sequences extracted from scientific literature, patent applications, or through direct submission from researchers worldwide. This site offers an interface similar to NCBI's, and the information it provides is nearly identical to that found in GenBank.

EBI is also the home of the SWISS-PROT Protein Sequence Database (http://www.ebi.ac.uk/ebi_docs/swissprot_db/swisshome.html). The SWISS-PROT Protein Sequence Database contains protein sequences (52,205) produced from translations of EMBL Nucleotide Sequence Database sequences. It contains extensive notes and is cross-referenced to the EMBL nucleotide sequence database and the PROSITE pattern database (described below). The BLITZ ultra-fast protein database search (http://www.ebi.ac.uk/searches/blitz.html) found at this site allows the user to select the number of alignments, number of hits, gap and substitution penalties while entering data in a simple browser cut and paste format.
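To see why gap and substitution penalties matter in a search like this, consider a textbook Smith-Waterman local alignment score. The sketch below uses simple match/mismatch scores and a linear gap penalty, all of which are illustrative assumptions. It is not the algorithm or scoring scheme of BLITZ itself, which this column does not describe.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            # Substitution score for aligning a[i-1] with b[j-1].
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                               # start a fresh local alignment
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# The second sequence has one extra T inserted in the middle.
cheap_gaps = local_alignment_score("ACGTACGT", "ACGTTACGT", gap=-2)
costly_gaps = local_alignment_score("ACGTACGT", "ACGTTACGT", gap=-10)
print(cheap_gaps, costly_gaps)
```

With a mild gap penalty the best alignment bridges the inserted base (score 14); with a harsh one the best score falls back to the longest ungapped stretch (score 10). That is the kind of trade-off these form settings control.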

PROSITE pattern searches (http://www.ebi.ac.uk/searches/prosite.html) at EBI allow a comparison of a given protein sequence with all protein patterns stored in the PROSITE pattern database. Knowledge of the known protein patterns or motifs in your sequence can be invaluable in finding its biological function. The site includes much information about how to format your submission, which can be made through the browser forms or by email.
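A PROSITE-style pattern is close in spirit to a regular expression, so a small converter shows what such a search does. The sketch below handles only a common subset of the pattern syntax: residues separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, and "<" / ">" anchors. The function names are my own, and the protein sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):     # repetition such as x(3) or x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":            # any residue
            core = "."
        elif element.startswith("{"):  # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                         # a single residue or an [ABC] choice
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(pattern, protein):
    """Return (start, matched_text) for each non-overlapping match, 0-based."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein)]


# N-glycosylation site pattern applied to a made-up sequence.
hits = find_motifs("N-{P}-[ST]-{P}", "MKNGTAANPSA")
print(prosite_to_regex("N-{P}-[ST]-{P}"), hits)
```

Here the pattern becomes N[^P][ST][^P] and matches NGTA at position 2, while the later NPSA is rejected because a proline follows the asparagine. Note that `finditer` reports non-overlapping matches only, which a real motif scanner may not want.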

Another Sequence Data Analysis page at the EBI site features FASTA homology searches (http://www.ebi.ac.uk/htbin/fasta.py?request). This powerful tool allows the user to set the sensitivity, the number of matched sequences, and the number of aligned sequences, and to select the database to which the query is submitted.

One of my favorite web sites for protein analysis is the PredictProtein server (http://www.embl-heidelberg.de/predictprotein/predictprotein.html) at EMBL. This automated utility works from amino acid sequences submitted by browser or email. By performing multiple sequence alignments, PredictProtein calculates secondary structural interactions, solvent accessibility for individual residues, and the position of transmembrane elements and helices. Three major features of this algorithm are available: (1) fold recognition by prediction-based threading using data obtained from proteins with remote homologies (sequence identity 0-25%) to the sequence of interest; (2) prediction of structures for helical transmembrane proteins; (3) an evaluation of prediction accuracy, utilizing predicted and observed secondary structures with a results breakdown per-residue or per-segment.

BEAUTY (BLAST Enhanced Alignment Utility) is an enhanced version of the NCBI's database search tool. BEAUTY, which can be found at Baylor College of Medicine (http://dot.imgen.bcm.tmc.edu:9331/seq-search/Help/beauty.html), incorporates additional information into BLAST search results. It can, for example, identify conserved domains, incorporate data on family membership and add the locations of any annotated domains. These BLAST additions add significant power to the search utility, particularly when the user is interested in finding regions of weak homology.

The Genome Database (GDB, http://gdbwww.gdb.org/) at Johns Hopkins University School of Medicine is the cyberspace version of the Human Genome Project. This database (GDB 6.0) maps introns, exons, gene families, regulatory elements and gene products. All of these data types are queryable, and online users are allowed to submit additional data as updates or annotation. This is a very user-friendly database, and represents an innovation in object-oriented programming.

To find even more genomic data sites, take a look at Pedro's BioMolecular Research Tools (http://www.public.iastate.edu/~pedro/research_tools.html). Pedro has amassed a huge collection of WWW links for genomic search and analysis sites, along with other molecular biology tools. The newsgroup “bionet.software.www” also lists new sites on the WWW of interest to biologists.

Another great site is sponsored by the Human Genome Center at Baylor College of Medicine. The BCM Search Launcher (http://dot.imgen.bcm.tmc.edu:9331/seq-search/seq-anal-resources.html) is an attempt to organize the molecular biology search and analysis services available on the WWW by function, providing a single access point for related searches. For example, a single page allows access to most protein databases and searches available worldwide. This site is a great example of the power of the Internet.

These unique resources and others like them are becoming increasingly important and useful for molecular biology research. They are worth spending the time to learn to use effectively, and more than justify infrastructure investments to provide adequate internet connections in a molecular biology laboratory. Plus, with enough connections the whole lab can play Internet Doom (http:// ? ). Enjoy the web!


If you have any comments regarding this page please contact:

David M. Sander, Ph.D.
(david.sander@virology.net )


© D. Sander 1995-2007. Established 5/95.