compbio.data.sequence
Class SequenceUtil

java.lang.Object
  extended by compbio.data.sequence.SequenceUtil

public final class SequenceUtil
extends Object

Utility class for operations on sequences

Version:
1.0
Author:
Petr Troshin

Field Summary
static Pattern AA
          Valid amino acids
static Pattern AMBIGUOUS_AA
          Same as AA pattern but with two additional letters - XU
static Pattern AMBIGUOUS_NUCLEOTIDE
          Ambiguous nucleotide
static Pattern DIGIT
          A digit
static Pattern NON_AA
          Inversion of the AA pattern
static Pattern NON_NUCLEOTIDE
          Non nucleotide
static Pattern NONWORD
          Non word
static Pattern NUCLEOTIDE
          Nucleotides a, t, g, c, u
static Pattern WHITE_SPACE
          A whitespace character: [\t\n\x0B\f\r]
 
Method Summary
static String cleanSequence(String sequence)
          Removes all whitespace chars in the sequence string
static void closeSilently(Logger log, Closeable stream)
          Closes the Closeable and logs the exception, if any
static String deepCleanSequence(String sequence)
          Removes all special characters and digits as well as whitespace chars from the sequence
static boolean isAmbiguosProtein(String sequence)
          Checks whether the sequence conforms to an ambiguous protein sequence
static boolean isNonAmbNucleotideSequence(String sequence)
          Ambiguous DNA characters: AGTCRYMKSWHBVDN. Every letter in this set except B is also a valid amino acid code.
static boolean isNucleotideSequence(FastaSequence s)
           
static boolean isProteinSequence(String sequence)
           
static HashSet<Score> readAAConResults(InputStream results)
          Reads AACon results with no alignment files.
static List<FastaSequence> readFasta(InputStream inStream)
          Reads fasta sequences from inStream into the list of FastaSequence objects
static List<AnnotatedSequence> readJRonn(File result)
           
static List<AnnotatedSequence> readJRonn(InputStream inStream)
          Reader for JRonn horizontal file format
static void writeFasta(OutputStream os, List<FastaSequence> sequences)
          Writes the FastaSequence objects to the output stream; each sequence is written on a single line
static void writeFasta(OutputStream outstream, List<FastaSequence> sequences, int width)
          Writes a list of FastaSequence objects to the outstream, formatting each sequence so that it contains at most width chars on each line
static void writeFastaKeepTheStream(OutputStream outstream, List<FastaSequence> sequences, int width)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WHITE_SPACE

public static final Pattern WHITE_SPACE
A whitespace character: [\t\n\x0B\f\r]


DIGIT

public static final Pattern DIGIT
A digit


NONWORD

public static final Pattern NONWORD
Non word


AA

public static final Pattern AA
Valid amino acids


NON_AA

public static final Pattern NON_AA
Inversion of the AA pattern


AMBIGUOUS_AA

public static final Pattern AMBIGUOUS_AA
Same as AA pattern but with two additional letters - XU


NUCLEOTIDE

public static final Pattern NUCLEOTIDE
Nucleotides a, t, g, c, u


AMBIGUOUS_NUCLEOTIDE

public static final Pattern AMBIGUOUS_NUCLEOTIDE
Ambiguous nucleotide


NON_NUCLEOTIDE

public static final Pattern NON_NUCLEOTIDE
Non nucleotide
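
The page does not show the regular expressions behind these fields. The following is a minimal sketch of how a pattern such as NUCLEOTIDE and its inverse could be defined and used with java.util.regex. The character classes, the case-insensitive flag, and the class and method names are assumptions made for illustration; the actual SequenceUtil patterns may differ.

```java
import java.util.regex.Pattern;

final class PatternSketch {
    // Assumed character classes; the real SequenceUtil patterns may differ.
    static final Pattern NUCLEOTIDE =
            Pattern.compile("[ATGCU]+", Pattern.CASE_INSENSITIVE);
    static final Pattern NON_NUCLEOTIDE =
            Pattern.compile("[^ATGCU]", Pattern.CASE_INSENSITIVE);

    /** True if every character is one of a, t, g, c, u (any case). */
    static boolean isPlainNucleotide(String sequence) {
        return NUCLEOTIDE.matcher(sequence).matches();
    }

    /** True if at least one character falls outside a, t, g, c, u. */
    static boolean hasNonNucleotide(String sequence) {
        return NON_NUCLEOTIDE.matcher(sequence).find();
    }

    public static void main(String[] args) {
        System.out.println(isPlainNucleotide("atgcUUGA")); // true
        System.out.println(hasNonNucleotide("ATGXC"));     // true
    }
}
```

A positive pattern is typically applied with matches() to test the whole string, while an inverted pattern is applied with find() to detect a single offending character.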

Method Detail

isNucleotideSequence

public static boolean isNucleotideSequence(FastaSequence s)
Returns:
true if the sequence contains only the letters a, c, t, g, u

isNonAmbNucleotideSequence

public static boolean isNonAmbNucleotideSequence(String sequence)
Ambiguous DNA characters: AGTCRYMKSWHBVDN. Every letter in this set except B is also a valid amino acid code.


cleanSequence

public static String cleanSequence(String sequence)
Removes all whitespace chars in the sequence string

Parameters:
sequence -
Returns:
cleaned up sequence

deepCleanSequence

public static String deepCleanSequence(String sequence)
Removes all special characters and digits as well as whitespace chars from the sequence

Parameters:
sequence -
Returns:
cleaned up sequence
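
As an illustration of the two levels of cleaning described above, here is a minimal sketch built on java.util.regex. The patterns stand in for WHITE_SPACE, DIGIT and NONWORD and are assumptions, as are the class and method names; this is not the SequenceUtil implementation.

```java
import java.util.regex.Pattern;

final class CleanSketch {
    // Assumed stand-ins for the WHITE_SPACE, DIGIT and NONWORD fields.
    static final Pattern WHITE_SPACE = Pattern.compile("\\s");
    static final Pattern DIGIT = Pattern.compile("\\d");
    static final Pattern NONWORD = Pattern.compile("\\W");

    /** Removes every whitespace character from the sequence. */
    static String clean(String sequence) {
        return WHITE_SPACE.matcher(sequence).replaceAll("");
    }

    /** Removes whitespace, then digits, then non-word characters. */
    static String deepClean(String sequence) {
        String s = clean(sequence);
        s = DIGIT.matcher(s).replaceAll("");
        // Note: \W does not match '_', so underscores survive this step.
        return NONWORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("MK V\tL\nA "));          // MKVLA
        System.out.println(deepClean(" 1 MKV-L*A 60\n"));  // MKVLA
    }
}
```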

isProteinSequence

public static boolean isProteinSequence(String sequence)
Parameters:
sequence -
Returns:
true if the sequence is a protein sequence, false otherwise

isAmbiguosProtein

public static boolean isAmbiguosProtein(String sequence)
Checks whether the sequence conforms to an ambiguous protein sequence

Parameters:
sequence -
Returns:
true only if the sequence is an ambiguous protein sequence; false otherwise, e.g. if the sequence is a non-ambiguous protein or DNA
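
The distinction between a plain protein, an ambiguous protein and DNA can be sketched as follows. The character classes (the 20 standard amino acid codes, plus X and U for the ambiguous case) and the order of the checks are assumptions for illustration; the names are hypothetical and the real methods may apply different rules.

```java
import java.util.regex.Pattern;

final class ProteinSketch {
    // Assumed character classes; the real SequenceUtil patterns may differ.
    static final Pattern NON_AA =
            Pattern.compile("[^ARNDCQEGHILKMFPSTWYV]", Pattern.CASE_INSENSITIVE);
    static final Pattern AMBIGUOUS_AA =
            Pattern.compile("[ARNDCQEGHILKMFPSTWYVXU]+", Pattern.CASE_INSENSITIVE);
    static final Pattern NUCLEOTIDE =
            Pattern.compile("[ATGCU]+", Pattern.CASE_INSENSITIVE);

    /** True if the sequence uses only the 20 standard codes and is not pure DNA/RNA. */
    static boolean isProtein(String sequence) {
        String s = sequence.replaceAll("\\s", "");
        if (s.isEmpty() || NUCLEOTIDE.matcher(s).matches()) {
            return false;
        }
        return !NON_AA.matcher(s).find();
    }

    /** True only if the sequence needs the extra letters X or U to be valid. */
    static boolean isAmbiguousProtein(String sequence) {
        String s = sequence.replaceAll("\\s", "");
        if (s.isEmpty() || NUCLEOTIDE.matcher(s).matches()) {
            return false; // empty or nucleotide-only input
        }
        if (!NON_AA.matcher(s).find()) {
            return false; // plain, non-ambiguous protein
        }
        return AMBIGUOUS_AA.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isProtein("MKVLA"));          // true
        System.out.println(isAmbiguousProtein("MKXVU")); // true
        System.out.println(isAmbiguousProtein("ATGC"));  // false
    }
}
```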

writeFasta

public static void writeFasta(OutputStream outstream,
                              List<FastaSequence> sequences,
                              int width)
                       throws IOException
Writes a list of FastaSequence objects to the outstream, formatting each sequence so that it contains at most width chars on each line

Parameters:
outstream -
sequences -
width - the maximum number of characters to write in one line
Throws:
IOException
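
The effect of the width parameter can be illustrated with a small sketch that wraps a single record. It works on plain strings rather than FastaSequence objects, and the class name, method names and the choice of ASCII encoding are assumptions; the sketch does not close the stream it writes to.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class FastaWriteSketch {
    /** Formats one record as ">id" followed by lines of at most width chars. */
    static String format(String id, String sequence, int width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be positive");
        }
        StringBuilder sb = new StringBuilder(">").append(id).append('\n');
        for (int i = 0; i < sequence.length(); i += width) {
            int end = Math.min(i + width, sequence.length());
            sb.append(sequence, i, end).append('\n');
        }
        return sb.toString();
    }

    /** Writes the formatted record and flushes; the caller keeps the stream. */
    static void write(OutputStream out, String id, String sequence, int width)
            throws IOException {
        out.write(format(id, sequence, width).getBytes(StandardCharsets.US_ASCII));
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Prints ">seq1", then "MKVL", "AAGI" and "V" on separate lines.
        write(System.out, "seq1", "MKVLAAGIV", 4);
    }
}
```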

writeFastaKeepTheStream

public static void writeFastaKeepTheStream(OutputStream outstream,
                                           List<FastaSequence> sequences,
                                           int width)
                                    throws IOException
Throws:
IOException

readFasta

public static List<FastaSequence> readFasta(InputStream inStream)
                                     throws IOException
Reads fasta sequences from inStream into the list of FastaSequence objects

Parameters:
inStream - the stream to read the sequences from
Returns:
list of FastaSequence objects
Throws:
IOException
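
For orientation, a FASTA reader of this kind can be sketched as below. The sketch returns a map from identifier to sequence instead of FastaSequence objects, joins wrapped sequence lines, and skips blank lines; all names are hypothetical and the behavior of the real readFasta (for example on duplicate identifiers or malformed input) may differ.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class FastaReadSketch {
    /** Reads records into an insertion-ordered map of id -> sequence. */
    static Map<String, String> read(InputStream in) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.US_ASCII));
        String id = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith(">")) {
                if (id != null) {
                    records.put(id, seq.toString()); // a repeated id overwrites
                }
                id = line.substring(1).trim();
                seq.setLength(0);
            } else if (id != null) {
                seq.append(line); // join wrapped sequence lines
            }
        }
        if (id != null) {
            records.put(id, seq.toString());
        }
        return records;
    }

    /** Convenience wrapper for in-memory text. */
    static Map<String, String> parse(String text) {
        try {
            return read(new ByteArrayInputStream(
                    text.getBytes(StandardCharsets.US_ASCII)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(">a\nMKV\nLA\n>b\nATGC\n")); // {a=MKVLA, b=ATGC}
    }
}
```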

writeFasta

public static void writeFasta(OutputStream os,
                              List<FastaSequence> sequences)
                       throws IOException
Writes the FastaSequence objects to the output stream; each sequence is written on a single line

Parameters:
os -
sequences -
Throws:
IOException

readJRonn

public static List<AnnotatedSequence> readJRonn(File result)
                                         throws IOException,
                                                UnknownFileFormatException
Throws:
IOException
UnknownFileFormatException

readJRonn

public static List<AnnotatedSequence> readJRonn(InputStream inStream)
                                         throws IOException,
                                                UnknownFileFormatException
Reader for JRonn horizontal file format
 >Foobar M G D T T A G 0.48 0.42
 0.42 0.48 0.52 0.53 0.54
 
 
 Where all values are tab delimited

Parameters:
inStream - the InputStream connected to the JRonn output file
Returns:
List of AnnotatedSequence objects
Throws:
IOException - is thrown if the inStream has problems accessing the data
UnknownFileFormatException - is thrown if the inStream represents an unknown source of data, i.e. not a JRonn output

closeSilently

public static final void closeSilently(Logger log,
                                       Closeable stream)
Closes the Closeable and logs the exception, if any

Parameters:
log -
stream -
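
A close-and-log helper of this shape can be sketched as follows. The page does not say which Logger type the real method takes, so java.util.logging.Logger is used here as an assumption, as are the null check and the log level.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CloseSketch {
    /** Closes the stream; an IOException is logged instead of propagated. */
    static void closeSilently(Logger log, Closeable stream) {
        if (stream == null) {
            return; // nothing to close (assumed behavior)
        }
        try {
            stream.close();
        } catch (IOException e) {
            log.log(Level.WARNING, "Could not close the stream", e);
        }
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("sketch");
        // The failing close is logged as a warning; no exception escapes.
        closeSilently(log, () -> { throw new IOException("boom"); });
        System.out.println("still running");
    }
}
```

Such a helper is usually called from a finally block, where a failure to close should not mask an earlier exception.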

readAAConResults

public static HashSet<Score> readAAConResults(InputStream results)
Reads AACon results with no alignment files. This method leaves the incoming InputStream results open!

Parameters:
results - output file of AAConservation
Returns:
a set of Score objects, one per Method, each holding that method's float[] values
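
Because the stream is left open, the caller is responsible for closing it. The sketch below shows that pattern with try-with-resources. countBytes is a hypothetical stand-in for any reader that does not close its input, and TrackingStream exists only to make the close visible; neither is part of SequenceUtil.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class OpenStreamSketch {
    /** Hypothetical reader that consumes the stream but never closes it. */
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n; // deliberately no in.close()
    }

    /** InputStream that records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Returns {closed right after reading, closed after the try block}. */
    static boolean[] demo() {
        TrackingStream in = new TrackingStream(new byte[] {1, 2, 3});
        boolean afterRead = false;
        try (TrackingStream s = in) {
            countBytes(s);
            afterRead = s.closed; // still open: the reader did not close it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] {afterRead, in.closed};
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1]); // false true
    }
}
```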