
Questions bioinformatics III, day 1

Blast searches can easily be run on the Internet at many different web sites. Try any of these, or others:


 

 

Question

Hints

 


 

1

In a somewhat simpler biosphere than ours (on the planet Krypton), the DNA contains only 2 bases, G and C, that code, in triplets, for 5 amino acids (the code is a bit redundant). The genetic code has been unraveled, and the sequences of 6 genes have been determined. The nucleotide frequencies in the non-coding sequences are equal (i.e. 50% G and 50% C in each strand).
Genetic code: CCC: A or start; CCG: C; CGC: D; CGG: D; GCC: P; GCG: P; GGC: R; GGG: stop.

>gene_1
CCC GCC CGC CGG CCC GCC GGG
 
>gene_2
CCC CGC CCG GCC GGC GGC GGG
 
>gene_3
CCC GCC CGC GGC CCG CCC GGG
 
>gene_4
CCC CGG GCC CGC GCG CGC GGG
 
>gene_5
CCC GCC GGC CCC GGC GCC GGG
 
>gene_6
CCC CGG GCC CGC GCG CGC GGG
                                                

 

 

  a

Make a table with the codon usage on Krypton.
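A minimal sketch of how the codon counts for this table could be tallied in Python. The gene sequences and the genetic code are copied from the question; the names `GENES`, `GENETIC_CODE` and `codon_usage` are illustrative choices, not part of the exercise.

```python
from collections import Counter

# Gene sequences and genetic code copied from the question text.
GENES = {
    "gene_1": "CCC GCC CGC CGG CCC GCC GGG",
    "gene_2": "CCC CGC CCG GCC GGC GGC GGG",
    "gene_3": "CCC GCC CGC GGC CCG CCC GGG",
    "gene_4": "CCC CGG GCC CGC GCG CGC GGG",
    "gene_5": "CCC GCC GGC CCC GGC GCC GGG",
    "gene_6": "CCC CGG GCC CGC GCG CGC GGG",
}
GENETIC_CODE = {
    "CCC": "A/start", "CCG": "C", "CGC": "D", "CGG": "D",
    "GCC": "P", "GCG": "P", "GGC": "R", "GGG": "stop",
}


def codon_usage(genes):
    """Count how often each codon occurs over all genes."""
    counts = Counter()
    for sequence in genes.values():
        counts.update(sequence.split())
    return counts


counts = codon_usage(GENES)
total = sum(counts.values())
print("codon  meaning   count  fraction")
for codon in sorted(GENETIC_CODE):
    n = counts[codon]
    print(f"{codon}    {GENETIC_CODE[codon]:<8}  {n:>5}  {n / total:.3f}")
```

The fractions are taken over all codons, including the start and stop codons; whether to include those is a choice you should state in your answer.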

 

 

  b

What are the expected frequencies of the "codons" in non-coding sequences? Is the codon usage in the coding sequences different from that in the non-coding sequences?
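For the non-coding expectation, a small Python sketch: if the bases at neighbouring positions are independent (an assumption added here, on top of the 50% G and 50% C given in the question), the expected frequency of a triplet is the product of its base frequencies. `BASE_FREQ` and `expected_triplet_frequencies` are hypothetical names.

```python
from itertools import product

# Base frequencies in non-coding DNA, as stated in the question.
BASE_FREQ = {"C": 0.5, "G": 0.5}


def expected_triplet_frequencies(base_freq):
    """Expected frequency of every triplet, assuming independent positions."""
    freqs = {}
    for triplet in product(sorted(base_freq), repeat=3):
        p = 1.0
        for base in triplet:
            p *= base_freq[base]
        freqs["".join(triplet)] = p
    return freqs


for triplet, p in expected_triplet_frequencies(BASE_FREQ).items():
    print(triplet, p)
```

These values can then be compared with the observed fractions in the codon usage table.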

 

 

  c

As is the case in the standard genetic code of life on Earth, the redundancy in this code resides (mainly) in the third codon positions. In life on Earth this generally results in a compositional bias in the third codon positions relative to the first two. Do you observe a bias here?
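One way to look for such a bias is to count the bases separately at each of the three codon positions. The sketch below does this in Python for the six gene sequences from the question; `CODING` and `base_counts_per_position` are illustrative names, and including the start and stop codons in the counts is a choice of this sketch.

```python
from collections import Counter

# The six coding sequences from the question.
CODING = [
    "CCC GCC CGC CGG CCC GCC GGG",
    "CCC CGC CCG GCC GGC GGC GGG",
    "CCC GCC CGC GGC CCG CCC GGG",
    "CCC CGG GCC CGC GCG CGC GGG",
    "CCC GCC GGC CCC GGC GCC GGG",
    "CCC CGG GCC CGC GCG CGC GGG",
]


def base_counts_per_position(sequences):
    """Return one Counter of bases for each of the three codon positions."""
    positions = [Counter(), Counter(), Counter()]
    for sequence in sequences:
        for codon in sequence.split():
            for i, base in enumerate(codon):
                positions[i][base] += 1
    return positions


for i, counts in enumerate(base_counts_per_position(CODING), start=1):
    total = sum(counts.values())
    summary = ", ".join(f"{b}={counts[b] / total:.2f}" for b in sorted(counts))
    print(f"position {i}: {summary}")
```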

 

 

  d

Draw a simple HMM for the genome. In principle, the codon usage table is the main information that is used for a Hidden Markov Model for the gene-coding regions. Which probability is missing from the data?

No fancy mathematics, just a diagram with tables.

 

  e

How can you include the extra knowledge that the encoded sequences are all 6 amino acids long?

Remember the large diagram of the HMM.

 

  f

Two of the genes (4 and 6) are identical. How does this affect the HMM? Can you think of a solution to the bias that this duplicate introduces into the table?

 

 


 

2

High-throughput alternative splice detection suffers from a number of problems that can lead to over- or underestimates of the amount of alternative splicing. Argue whether and how the following problems lead to an over- or an underestimate of the amount of alternative splicing, and how they can be solved.

 

 

  a

Chimeric ESTs: Some ESTs are actually combinations of mRNAs from different genes that somehow got linked into one EST during the reverse transcription. How could one detect such cases?

 

 

  b

Recent gene duplications in the human genome can give rise to genes with a very high level of sequence identity (>95%) between them. How can this cause errors in the prediction of alternative splicing in methods that do not check against the genomic DNA?

 

 

  c

Lack of complete genomic coverage of ESTs and biases in the distribution of the ESTs on the genes (relatively many ESTs cover just the 3' end of the gene).

 

 

  d

Genomic contamination (the presence of unspliced introns in the ESTs). Can you think of a strategy to tackle this?

 

 


 

3

Determine the following for these cDNA sequences:

>sequence_1
ATTTTCACCC TCCGTGGGAT TTCAGGGAAT TTGAAGTAGA AAAACAGACT GCAGAAGAAA
CGGGGTACGC CATTGGAAAC CTCAAGGAAA ACTCCAGATT CCAGACCTTC CTTGGAAGAA
ACCTTTGAAA TTGAAATGAA TGAAAGTGAC ATGATGTTAG AGACATCTAT GTCAGACCAC
AGCACGTGAC TCCAGTCAGT GGTCCTGGTC CCACTGTCCC AGTGTAGGTT AGTATTCCTT
CACATCCTCT CCATGGCTTA AGAATGTCCC ACTTCCTAAC GTGACTCCAA ACTGCATCTC
TACATTTAGG AACAGAGACC CGCCTTAAGA GACTGGATCG CACACCTTTG CAACAGATGT
GTTCTGATTC TCTGAACCTA CAAAATAGTT ATACATAGTG GAATAAAGAA GGT
 
>sequence_2
AGACATTATC AGCTCTTTAA GGATTGCAGN AGAATAGGCT ACTTTATTTT CTGAAAAGGA
GGGAGTTCCT GCTACCCATC GTGGGAGGCC ACCATCAGGA CTGCGAAGAT GGTGACCCTG
CGGAAGAGGA CCCTGAAAGT GCTCACCTTC CTCGTGCTCT TCATCTTCCT CACCTCCTTC
TTCCTGAACT ACTCCCACAC CATGGTGGCC ACCAC
 
>sequence_3
CGGGGGNGNT GGGGTTGTGT GNATGCTGAT TTTGNATTGN NGTNGGTGAN GATCTGGAGG
CGCTCCTTCG ACATCCCGCC GCCCCCGATG GACGAGAAGC ACCCCTACTA CAACTCCATT
AGCANGGGAT GTCAGACCAG GCGATCATGG AGCTGAACCT GCCCACGGGG ATCCCCATTG
TGTATGAGCT GAACAAGGAG CTGAAGCCCA CCAAGCCCAT GCAGTTCCTG GGTGATGAGG
AAACGGTGCG GAAGGCCATG GAGGCTGTGG CTGCCCAGGG CAAGGCCAAG TGAGGGGTGG
GCTTGGGCAA TAAAGGCACC TCCCA
 
>sequence_4
GTTTCTTCTA TTCCCCACGT TTAAAGCGAT GGCACCTCCG TCCCAGGGTG GTGTGAGGAT
TACCCAGTGT GGGAACAGCT TTGGGGCTGG GGGAACTAGA ACCCACATGT TGGTCTAAAC
CCTGAGAAGG TGGCAGTGAG GAAGTATCCC CTCAGGTGAC TGGATCTGTG TTCCTCCTTA
ACATCATCTG ATGGAATGGC AATGAAAAGC GTGGATTGTG GAAAATACAG AAAAACATAA
AGGAAAAAAC TCCAATCCCC AGCCC
                                  

 

 

  a

Which protein do they encode?

Use BlastX.

 

  b

Is there evidence for alternative splicing here?

Run Blastn against human genomic DNA, or try the BLAT search at genome.ucsc.edu. Click on “browser” to obtain a graphical display, and select “Human ESTs full” in that display. Be sure to use the consensus splice junction signals (exon|GT..intron..AG|exon) to determine the splice junctions. You can see the alignment of your sequence relative to the genomic DNA by clicking on “details”.
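The consensus check from the hint (an intron starts with GT and ends with AG) can be sketched in Python as follows. The function name, the coordinate convention (0-based, end-exclusive) and the toy sequence are assumptions for illustration; real coordinates would come from the alignment of your sequence with the genome.

```python
def has_consensus_splice_sites(genomic, intron_start, intron_end):
    """Return True if genomic[intron_start:intron_end] starts with GT and ends with AG.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy example, not taken from the exercise: exon + intron + exon.
toy = "ACGCAG" + "GTAAGTCCTTTCAG" + "GATTCC"
print(has_consensus_splice_sites(toy, 6, 20))  # intron boundaries placed correctly
print(has_consensus_splice_sites(toy, 5, 20))  # boundary shifted by one base
```

Shifting a proposed boundary by one or two bases, as in the second call, is a quick way to test whether an ambiguous alignment end can be moved to a position that fits the consensus.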

 

  c

What is the splicing pattern? Do the splice sites fit the consensus?

Click on “your” sequence in the genome browser to obtain the alignment with the genome sequences.

 

  d

What is the effect of the alternative splicing on the encoded proteins?

Realize that alternative splicing is not always in a coding region; it can also be in a 5’ or a 3’ UTR. You can see whether something is coding or not from the “width” of the EST.

 


 

4

Gene prediction programs are particularly bad at finding short genes/exons. A typical example is the Lipoprotein LPPL in Pseudomonas aeruginosa (only 46 AA long).

 

 

  a

Find the protein(s) in the following DNA sequence.

CTGCGAGCCG TTCTCGAAGC GTACGGTGAG CACGCCGCCG GCGTTCTCAG GTCGAGGTCA
GGTCGCTGTC ATCGAAGGCG TCTTCCACCT GCTGCTGGAG CGCATCGACC AGGTCGTGGA
AACGCGCTTC ATTCATCGTA CTCATGCATT GCCTCGGCAT TCGCTAACGG GAAAAAAGGC
GGACGACTAC CTTCGTCTTG CCCTATTCCG ACAATTCTAC GCTGGTATTG CCGTAGTCCG
CGCTGTTTTG CCGCGACGCT CGCGGAAAAC GCCGGCATCC CCTCTGCCAC AGGCCATTCC
CCTGCAAGCC CCGGCACACC TGATCCGGCT CGCATAGGCA AGGCGCCGGG GGGTCGGTAT
ACTCCGGAAC AATTCACGTT TTCTACAAGG ATTCCGTCAT GAAGCGGCTG TTCCTGTCCT
TCGTCGCGCT CGCCCTCCTC GCCGGCTCCA TCGCCGCCTG CGGCCAGAAA GGCCCGCTGT
ACCTGCCGGA CGACGAAAAA GCCAAGAAAG AACACAGCAA AGACCGCTAC GGTTTCTGAG
AGAGCGCCCA TGGACACATT
                                  

Use the microbial genomic blast pages at www.ncbi.nlm.nih.gov/blast to search in the genomic DNA and in the predicted proteins.

 

  b

Determine for which species homologs of LPPL have been missed in the genome annotation. Find at least one species.

Compare the results of Blastp searches with those of tBlastn searches on the microbial blast pages at www.ncbi.nlm.nih.gov/blast. Make sure to turn the low-complexity filter off. Search just in proteobacteria to save computation time.
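The comparison in this hint amounts to a set difference: species with a tBlastn hit in their genomic DNA but no Blastp hit among their predicted proteins are candidates for a missed gene. A minimal Python sketch; the species labels are placeholders and not real search results, and the function name is hypothetical.

```python
def candidate_missed_annotations(tblastn_species, blastp_species):
    """Species hit in genomic DNA (tBlastn) but not among predicted proteins (Blastp)."""
    return sorted(set(tblastn_species) - set(blastp_species))


# Placeholder labels; fill these in from your own Blast output.
tblastn_hits = ["species_A", "species_B", "species_C"]
blastp_hits = ["species_A", "species_C"]
print(candidate_missed_annotations(tblastn_hits, blastp_hits))
```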

 

  c

You will see that some of the hits are "borderline" (the E-value is close to insignificant, i.e. E>0.1). Can you think of a strategy to determine whether those hits are "real"?
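Why hits to a protein of only 46 residues so easily end up borderline follows from the relation between bit score and E-value, E = m · n · 2^(−S'), with m the query length, n the database length and S' the bit score. The Python sketch below uses raw lengths and made-up numbers; BLAST itself uses effective lengths, so treat the output as an illustration only.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Illustrative numbers only: a 46-residue query against 10 million residues.
for score in (25, 30, 35):
    print(score, expected_hits(score, 46, 10_000_000))
```

Each extra bit of score halves the E-value, and the E-value scales linearly with the size of the database searched.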

 

 

  d

What is the phylogenetic distribution of the gene (in which species, genera, and higher-order taxa does it occur)? Where did the gene likely originate in evolution?

Click on the sequence identifier and find the sequence file to check the taxonomy of the organism. Check www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html for the complete taxonomy.


5

For the following piece of DNA:

gacgacaact tgtgctttcg gtcttctctc gcgagtcgcc cacatctagc agcaatgcct
actcccgaat cggcggcctt cctggccaag aagcccaccg tcccgcccac ctttgacggc
gtcgactaca atgatacgaa gcgcctgaag caggcccagg atgccattat ccgcgagcaa
tgggtccgag ttatgatggg tcgcctggtg cgggaggagt tgtccaagtg ctattatagg
gagggcgtga accatctgga gaagtgcgga catctgagag aacgctacct ccaactacac
tccgaaaacc gtgtccaggg ttatcttttc gagcagcaga accatttcgc gaaccagcca
aagcaatgag ttctcctcca tcattctggg agcagaccgg agggtacaag ctggagcgag
ctgatttggg gctgaaatga tctgggcagc gagagttttg gcagggcagt gcgctcgaga
accggagagc gagtcagtca tggtcgaacc gggaaggggc tccataacat cgcggaccac
cctttcagac gttcggcttc agcaaccccc agaggagtgc tcgggccctt taatgagact
tacgagcagg catgtaaata ttaacagaac tcggcaacaa cagacacacg ggggactaga
ggaataacag gactcaattc ggcctcccat tttt

  a

Translate it into 6 reading frames.

Tip: use the translate server at ExPASy (http://www.expasy.org/tools/dna.html).

  b

Is there one reading frame that contains a long ORF (longer than the rest)? If so, which one?

  c

Is the protein encoded in this frame homologous to any known protein? If so, which one? From which species is the DNA above (the sequence is actually from an mRNA)?

A simple Blast will give you the answer.

  d

Is the protein that you found part of a protein complex? If so, which complex?

Read the annotation of the protein. To get to the annotation, click on the protein entry in the Blast output.

  e

Now use the automatic tool for comparing a piece of DNA in all 6 reading frames with a protein database (BlastX). Do you find the same protein as with the previous search?

  f

Does the protein encoded in the DNA above have any homologs in the predicted proteome of the fungus Yarrowia lipolytica?

Just examine whether Yarrowia appears in the list of proteins you find with a Blast search. To reduce the amount of output you have to examine, you can either type the name of the species you are interested in under “Limit by Entrez Query” on the BlastP page, or select “Fungi” (Yarrowia is a fungus) under “Genomes” and then select Yarrowia lipolytica to search specifically against that species.

  g

If it does not, this result is a bit weird. The protein you have been searching with, and for, is a member of the protein complex “complex I”, or “NADH dehydrogenase”, which forms the first step of oxidative phosphorylation. Yarrowia lipolytica does have other proteins of this complex, and is actually used as a model species to study the complex. Now we would like to find out whether this protein is actually encoded in the DNA and has maybe been overlooked by the annotation software. So: find whether there is DNA in Yarrowia lipolytica that could code for this protein.

You can do a tBlastN search: search with the protein against the DNA (automatically translated in 6 frames) of the species you are interested in. In principle you are now doing gene prediction by homology search.


6

For the following mRNA:

cggcgtctgc gcagctgcca gcgcctttaa gcccgggctc gcgctctcgg accgtgcttt
cgccgcctgg gagccgtccg gcgcagcagt ttctaggtcc ccactgtccc cgccgtcccg
ccccttcgcg tcccgggaac cggctggctt ccgagccgca ctcgccgatc ctccaggcat
gccccgctac gagctggctt taatcctgaa agccatgcag cggccagaga ctgctgctac
tttgaaacgt acgatagagg ccctgatgga cagaggagca atagtgaggg acttggaaaa
cctgggtgaa cgagcgcttc cttataggat ctctgcccac agtcagcagc acaacagagg
cgggtatttc ttggtggatt tttatgcacc caccgcagct gttgaaagca tggtggagca
cttgtctcga gatatagatg tgattagagg gaatattgtc aaacaccctc tgacccagga
actaaaagaa tgtgaaggga ttgtcccagt cccactcgca gaaaaattat attccacaaa
gaagaggaag aagtgagaag attcgccaga ttttagcctt atatgtaatt ccttcacatt
tgggcagcat ggacgagaag gaagaatttg caagtttggc ctttatataa gcatgtgttg
caggtgctgt ttgatttttc taaggtattt ttagcccttg atcccctttg cttgcgagag
gtggggaact gctcactgac agcttctctg taacctgcag taccagtgga tcattcttga
ttttgttttc attagtgtca tttctttgtc attgaggact tttcccctta caacagtaac
accatttttt gaagagcaaa acttataata cctcctggga ttgtgagcta gtcattcagc
ctgtgtaacc atgtggaaat aaaaattgac gaccaatgta ttatatggac aacttttgct
ttgagtaata aacttgattg taggaatgtg aaaaaaaaaa aaaaaaaaaa aaaaaaa

  a

Which protein does it encode? From which species?

  b

Is the protein part of a protein complex? If yes, which complex?

  c

Does the protein have a homolog in the (predicted) proteome of Tetraodon nigroviridis (pufferfish)?

  d

If it does not have a homolog in the (predicted) proteome of Tetraodon nigroviridis, is there a homolog potentially encoded in its DNA or one of its cDNAs? If yes, in which one?

This question is very similar to question 5; you can reuse all the hints from there.
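Questions 5 and 6 both start by translating a nucleotide sequence in all six reading frames and looking for the longest open reading frame. As an alternative to the web server, this can be sketched in plain Python. The sketch assumes the standard genetic code and defines an ORF as a stop-free stretch that begins with M; all function names are illustrative.

```python
# Standard genetic code, built from the usual compact TCAG-ordered table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string (N is left as N)."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; codons not in the table (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate a sequence in the three forward and three reverse frames."""
    dna = "".join(dna.split()).upper()  # drop the spaces and line breaks of the listing
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def longest_orf(frames):
    """Longest stop-free stretch starting with M, returned as (frame, peptide)."""
    best = ("", "")
    for name, protein in frames.items():
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (name, segment[start:])
    return best


# Toy input; paste the sequence from the question here instead.
frames = six_frames("atggcc taa")
for name, protein in frames.items():
    print(name, protein)
print(longest_orf(frames))
```

The peptide returned by `longest_orf` can be pasted directly into a BlastP search, which is the manual route that BlastX automates.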