Download and save the mystery sequence linked here.
Search a Genome Database
Recall (and perhaps look again at) my lecture on databases. In a moment, we are going to search a genome-scale database using the mystery sequence. The take-home point here is that these types of database integrate and display large quantities of information, derived both manually and using algorithms, that are used to annotate sequences at the level of the genome.
My lecture introduced the Ensembl Genome Browser, but let's look first at one of its rivals, the UCSC (University of California at Santa Cruz) Genome Browser, the European mirror site for which can be found at this page.
From the UCSC Genome Browser homepage, launch a BLAT search. Read the BLAT landing page to learn a little bit about what this algorithm does and how it differs from BLAST. When you're ready, paste in your nucleotide sequence and search the human genome (hg38). (If you are wondering what hg38 means, scroll down this page and read about genome releases.)
Browse the top scoring alignment (if you weren't feeling lucky). Can you identify the gene?
I also strongly recommend keeping this tab open.
(Optional) Try the Ensembl Genome Browser
Try the same BLAT search using the Ensembl Genome Browser. Do you get the same answer?
Exploring Genome "Tracks"
Go back to the UCSC page. Let's take a closer look at the graphics. Each row in the figure is a track showing some information about the sequence (from gene locations to comparisons with other sequences). See if you can find the following tracks and expand them to the "full" view:
- GENCODE (gene annotation from the ENCODE project)
- RefSeq genes (recall from the previous workshops)
- Conservation (derived from multiple alignment)
- OMIM (Online Mendelian Inheritance in Man)
Hint: you can use the options below the graphic and then hit a refresh button, or right-click on the tabs to the left.
Let's look in detail at the last two tracks indicated above, which may be less familiar, and see what kind of data they are displaying. You may also find this site helpful.
SNPs: Single-Nucleotide Polymorphisms
If you don't know what a SNP is, look at the Wikipedia page and remind yourself of the different functional categories of point mutation. The reference database for SNPs is called dbSNP and can be found via this NCBI portal. This can give you information about the frequencies of a SNP in different populations. Although not as comprehensive right now, there is also a site called SNPedia for crowd-sourcing information about SNPs.
Let's see if we can identify a functionally significant SNP in our gene. First, make sure you have the "full" view of SNPs displayed. Next scroll down to the Variation section below the graphical display and click on the "Common SNPs" link. Learn about the "Coloring Options" (yes, it's a perfect mirror of an American website).
Armed with this information let's pick out an interesting SNP. Go back to the main display and select the only red SNP that you can see (rs61740803). Take a look at the information on this page and also click through to dbSNP. What kind of SNP is this? Write down what effect this SNP has on the sequence of the (mystery) gene's protein product.
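If you want to check your reasoning about the effect of a point mutation, the following is a minimal sketch that classifies a single-base change within one codon using the standard genetic code. The function name `classify_substitution` and the example codon are illustrative assumptions; they are not taken from the mystery gene or from rs61740803.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change in one codon (position is 0, 1 or 2).

    Hypothetical helper for illustration only. Returns a tuple of
    (category, amino acid before, amino acid after); "*" marks a stop codon.
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        kind = "synonymous"
    elif after == "*":
        kind = "nonsense"
    elif before == "*":
        kind = "stop-lost"
    else:
        kind = "missense"
    return kind, before, after


# Made-up example: the middle base of GAG (Glu) changes to T, giving GTG (Val).
print(classify_substitution("GAG", 1, "T"))
```

To apply this to a real SNP you would need the reference codon and the position of the variant within it, both of which you can read off the dbSNP record.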
Repeats and RepeatMasker
As described at its homepage, RepeatMasker is a program that identifies repeat sequences (commonly derived from transposons) using a database called Repbase (or Dfam, a Repbase-derived database enhanced using a Hidden Markov Model). It can be used to "mask" these sequences in order to exclude them from certain kinds of analysis, but it can also be used to identify their positions and identities.
Although RepeatMasker makes use of an algorithm called Tandem Repeats Finder to identify simple repeats, it is dependent on the database. (This is one good reason why you should record the version of the programs you use in an analysis, or at least the access date if using online tools.) Alternative methods exist that rely only on search algorithms, e.g., PClouds, which decomposes the genome into short oligonucleotides, groups them (into "clouds") and looks for those that are probably over-represented. This yields an estimate of >2/3 of the human genome as consisting of repetitive elements. Methods that rely solely on algorithms, and not on prior information in databases, are called de novo methods.
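To get a feel for the de novo idea, here is a minimal sketch that counts short oligonucleotides in a sequence and reports those seen at least a given number of times. This is only the counting step under simple assumptions; it is not the PClouds algorithm itself, which also groups related oligos into clouds. The function names, the toy sequence and the threshold are all made up for illustration.

```python
from collections import Counter


def count_oligos(sequence, k):
    """Count every overlapping k-mer made up only of A, C, G and T."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        oligo = sequence[i:i + k]
        # Skip windows containing N or other ambiguity codes.
        if set(oligo) <= set("ACGT"):
            counts[oligo] += 1
    return counts


def over_represented(sequence, k, min_count):
    """Return the k-mers seen at least min_count times (a crude threshold)."""
    return {o: n for o, n in count_oligos(sequence, k).items() if n >= min_count}


# Toy sequence (not real data) containing a short tandem repeat of ACGT.
toy = "ACGTACGTACGTTTGACCA"
print(over_represented(toy, 4, 3))
```

On a real genome the threshold would have to be chosen relative to how often an oligo is expected to occur by chance, which is the part the sketch leaves out.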
Look again at the main UCSC display for the region identified by your BLAT search. Making sure that the RepeatMasker track is not squished (i.e., it is in "full" mode), identify the only repeat element that is not a simple repeat. Answer these questions:
- What is it? Is it a microsatellite? Or is it a transposon?
- Give a little more detail about this family of elements.
You may find these references helpful if you want to know more about:
Repeats and conservation data can be intersected. Repeats shared by a group of lineages are called ancestral repeats; those limited to a particular taxon are called lineage-specific repeats. Famous examples of the latter are the Alu elements shared by primates.
Further Exploration of the Mystery Gene
OK, so you know what the gene is. Look it up on NCBI's Gene db and choose the human version of the gene. Navigating down to the graphic, hover over the gene until a context menu appears. This gives you the option of downloading the protein sequence (beginning NP) or the RNA sequence (beginning NM), but instead look down and select "FASTA View" which gives you the genomic interval including the gene. In case you want to check, this is what you should see.
Look up to where it says "Send" with the down arrow. Click on this, select "Complete Record", "File" and then choose FASTA as the format. Hit the "Create File" button and your sequence will be downloaded. Rename it if you wish (I called mine "interval.fa"). It should open in your preferred text editor and you will see a standard FASTA file with the header line indicating the chromosome coordinates. From now on, I will refer to this as your interval sequence file.
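If you would rather inspect the interval sequence file programmatically than in a text editor, the following minimal sketch reads FASTA records into a dictionary. The function name `read_fasta` and the example header and sequence are made up for illustration; your real header will carry the accession and coordinates from NCBI.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into {header: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up record standing in for the downloaded file.
example = [">example_interval hypothetical header", "ACGTACGTAC", "GGTTAACC"]
for name, seq in read_fasta(example).items():
    print(name, len(seq))

# With the real file you would pass an open file handle instead, e.g.:
#     with open("interval.fa") as handle:
#         records = read_fasta(handle)
```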
Go to the RepeatMasker homepage and choose RepeatMasking under Services. Input your interval sequence file and run with default parameters. What kinds of repeat can you see? Click through and save the Annotation file.
Yes, you are not wrong: you cannot find that repeat. What's going on? Do you recall that I mentioned versions of programs above? Look at the version of RMLib used by the web-based tool (on the submission page). Compare that to the statement here. Best practice in bioinformatic work is to download a local copy of RepeatMasker and the relevant database and use them offline. Consider yourself warned!
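The Annotation file you saved can also be summarised with a few lines of code, which makes it easier to compare runs made with different library versions. The sketch below assumes the usual whitespace-separated RepeatMasker ".out" layout, in which columns 6 and 7 are the begin and end positions in the query and column 11 is the repeat class/family; check this against your own file before relying on it. The two data rows are invented for illustration and are not results for the mystery gene.

```python
from collections import Counter


def tally_repeat_classes(lines):
    """Sum the annotated query bases per repeat class/family.

    Assumes the RepeatMasker .out column order; header and blank lines
    are skipped because their first field is not an integer score.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        totals[fields[10]] += end - begin + 1
    return totals


# Invented example in the assumed layout (not real output).
sample = """\
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end (left)  ID

  463    1.3  0.6  1.7  seq1         1   150  (850) +  (CA)n     Simple_repeat     1  150   (0)   1
 1200   10.5  0.4  0.0  seq1       300   600  (400) C  L2a       LINE/L2        (12)  300     1   2
""".splitlines()

for repeat_class, bases in tally_repeat_classes(sample).items():
    print(repeat_class, bases)
```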
Gene Structure Prediction
Let's see if we can have more luck predicting gene structure using your interval sequence file. Obviously we could just refer to the annotation tracks on the UCSC or Ensembl Genome Browsers to see the current best information on the genes present in this interval, but what if we had a completely new sequence? Then we would need to use an ab initio gene prediction method (see this page for a description of types of gene prediction).
Again, some of the good software for gene prediction is best run stand-alone on your own (unix-based) box (see the list of software here). Let's have a go with GENSCAN. Go to the GENSCAN website and run it with your interval file and default parameters.
You can see it has made some predictions. Scroll down and you will see it has predicted a polypeptide sequence. Copy this into a text file (yes, it's finicky -> old site) and create a correctly formatted protein FASTA file. Give it a sensible name (I called mine "predicted.fa").
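If tidying the copied peptide by hand proves too finicky, a few lines of code can do it. This minimal sketch strips whitespace from the pasted sequence and writes it out as a header line followed by fixed-width sequence lines. The function name `format_fasta`, the 60-character line width and the toy peptide are assumptions made for illustration; the peptide is not GENSCAN output.

```python
def format_fasta(name, sequence, width=60):
    """Return FASTA text: a '>' header line, then the sequence wrapped to width."""
    # Remove spaces and line breaks picked up when copying from the web page.
    cleaned = "".join(sequence.split()).upper()
    wrapped = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Toy peptide (made up) pasted with stray whitespace.
pasted = "MKVLLAGG tpesr\nLLQWERTY"
print(format_fasta("predicted_peptide", pasted, width=10), end="")

# To save the real sequence you could write the result to a file, e.g.:
#     with open("predicted.fa", "w") as handle:
#         handle.write(format_fasta("predicted_peptide", your_sequence))
```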
Go to NCBI BLAST and run a BLASTP search with the predicted sequence. Some questions for you:
- What did the BLASTP search recover?
- Do you think GENSCAN made an accurate prediction?
- Can you explain any deviations?
For completeness, you might also be interested in searching for domains in your protein of interest. Go to the Pfam homepage and carry out a sequence search using your predicted protein sequence. What do you see?
Coda: Other Resources
OK, that's it for these workshops. Of course we've only scratched the surface.
There are so many more programs and databases out there. For example, in alignments you can use this program to align mRNA to genomic DNA (a challenging thing to do) and many new methods exist for handling next-generation sequencing data.
Many excellent stand-alone programs exist for phylogeny reconstruction (e.g., BEAST, MrBayes, HyPhy and MEGA) and again such approaches are recommended for obtaining the best results. However, as I hope you can see, there is a lot you can do with widely available, and free, web tools provided by the scientific community.