Recent Changes

Sunday, August 3

  1. file gsic.pdf (deleted) uploaded Deleted File
    3:45 pm

Tuesday, October 29

  1. 11:17 pm

Tuesday, April 30

  1. page XML edited ... W3C XML XML.com There are 5 predefined entity references in XML: &lt; < less than…
    ...
    W3C XML
    XML.com
    There are 5 predefined entity references in XML:
    &lt;
    <
    less than
    &gt;
    >
    greater than
    &amp;
    &
    ampersand
    &apos;
    '
    apostrophe
    &quot;
    "
    quotation mark
    Note: Only the characters "<" and "&" are strictly illegal in XML. The greater than character is legal, but it is a good habit to replace it.

    XML DTD (Document type descriptors)
    Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list of legal elements and attributes.
    (view changes)
    10:55 am

Monday, April 15

  1. page Trees, mathematical graphs and networks edited ... To understand graphs and how they can be represented, we will use the wikipedia page . Importa…
    ...
    To understand graphs and how they can be represented, we will use the wikipedia page . Important concepts are the Adjacency Matrix and Distance Matrix.
    Pajek is a popular package for visualizing graphs, and the pajek file format is a popular format for storing graphs.
    Check also the SCI2 tool from Indiana University
    On the (Semantic) Web we can use RDF with OWL ontologies to describe valid relationships and nodes
    Once we represent graphs, we can use them to solve problems using graph theory algorithms. Generic kinds of problem include:
    (view changes)
    5:44 am

Monday, April 1

  1. page Life sciences edited ... See original page Whenever we wish to apply informatics to a particular domain, the first thi…
    ...
    See original page
    Whenever we wish to apply informatics to a particular domain, the first thing we have to think about is how to represent the entities of that domain (at a simple level this applies to dates, images, videos, and so on). Once we can represent the entities, we can develop techniques and algorithms to help solve problems using the representations. The whole package of representations and algorithms constitutes a particular application branch of informatics (or a set of).
    ...
    this (see original page)also Harvard Biovisions page and narrated version) - and
    So what do we do? Well the systems of life are very very very very complex (and still poorly understood), but life seems to be quite efficient in that the complex systems are made up of a small number of kinds of things like proteins, molecules, DNA and so on; and if we can work out how to represent these basic entities we can start to piece them together into more complex entities. But STILL we are in the infancy of this subject: for instance, a single cell is far too complicated for us to fully understand or represent, except incompletely or at high levels of abstraction. We are discovering new things all the time - e.g. Epigenetics (see also on Wikipedia).
    Yet, even with these basic representations, we can do some amazing science on computers that wouldn't be possible without them.
    ...
    See original page
    Chemical structures / small molecules
    ...
    description see gsic.pdf) gsic.pdf {gsic.pdf} )
    Remember from our first class, the 2D chemical structure for Aspirin?
    {Aspirin.jpg}
    ...
    See BLAST database format
    Chemical structure search on PubChem
    DrugBank -
    Molecular Docking
    {1hvi.gif}
    1hvi.pdb -{1hvi.pdb} - a protein-ligand
    See JMOL page for this complex
    {Plotviz-PubChem-Disease.jpg}
    (view changes)
    9:03 am

Tuesday, March 26

  1. page Information Content, Compression and Encryption edited ... Here are some of the more important techniques for lossless compression: Run-length encoding …
    ...
    Here are some of the more important techniques for lossless compression:
    Run-length encoding
    Prefix code
    Huffman coding
    Arithmetic encoding
    ...
    Attacking the algorithm instead
    But just because we cannot bruteforce the key space doesn't mean we should give up. Perhaps there is a weakness in the algorithm itself! It turns out it took about a 1,000 years for somebody to figure out how to break this substitution cipher (or at least it took someone that long to admit it publicly).
    ...
    amazingly simple.
    Kerchoff's principle
    Common wisdom tells us we should not rely on "security by obscurity" of the encryption algorithm. Kerchoff's Principle tells us we should assume the algorithm is public, and the security of the cryptosystem should rely only the secrecy of the key. The general idea is that a publicly known algorithm should stand up to scrutiny in the public forum. If the algorithm is kept secret, it will likely have weaknesses that may have been caught publicly, and if the attacker gets a hold of the algorithm the security of all communications under that algorithm is under risk. This is what happened with the Enigma in World War II.
    Key lengths
    ...
    256-bit keys.
    Entropy of keys
    But not all keys are created equal (don't use a key like 10101010101010....10). You want keys that are random, i.e., the entropy of a 128-bit key should be 128 bits. In other words, each bit in the key should be picked perfectly at random. For example, what is the entropy of your 128-bit key if you pick 16 alphabetical characters at random? That yields a key space of only 52^16 keys, which is only about 2^91. In other words, such keys would have only 91 bits of information encoded into 128 bits, and would be within reach of a brute force attacker.
    (view changes)
    3:08 pm

Wednesday, March 20

  1. page Information Content, Compression and Encryption edited ... Attacking the algorithm instead But just because we cannot bruteforce the key space doesn't m…
    ...
    Attacking the algorithm instead
    But just because we cannot bruteforce the key space doesn't mean we should give up. Perhaps there is a weakness in the algorithm itself! It turns out it took about a 1,000 years for somebody to figure out how to break this substitution cipher (or at least it took someone that long to admit it publicly).
    ...
    amazingly simple. You can play around with an online tool here.
    Kerchoff's principle
    Common wisdom tells us we should not rely on "security by obscurity" of the encryption algorithm. Kerchoff's Principle tells us we should assume the algorithm is public, and the security of the cryptosystem should rely only the secrecy of the key. The general idea is that a publicly known algorithm should stand up to scrutiny in the public forum. If the algorithm is kept secret, it will likely have weaknesses that may have been caught publicly, and if the attacker gets a hold of the algorithm the security of all communications under that algorithm is under risk. This is what happened with the Enigma in World War II.
    Key lengths
    ...
    256-bit keys.
    Entropy of keys
    But not all keys are created equal (don't use a key like 10101010101010....10). You want keys that are random, i.e., the entropy of a 128-bit key should be 128 bits. In other words, each bit in the key should be picked perfectly at random. For example, what is the entropy of your 128-bit key if you pick 16 alphabetical characters at random? That yields a key space of only 52^16 keys, which is only about 2^91. In other words, such keys would have only 91 bits of information encoded into 128 bits, and would be within reach of a brute force attacker.
    (view changes)
    5:58 am
  2. page home edited ... Apr 24 Final Exam (regular classroom) TBD Wed May 1, 8am-10am Assignments & Examinat…
    ...
    Apr 24
    Final Exam (regular classroom)
    TBDWed May 1, 8am-10am
    Assignments & Examinations
    Midterm Examination
    (view changes)
    5:48 am

Sunday, March 17

  1. page Information Content, Compression and Encryption edited Description Length ... or D. We can pick a fixed length encoding where A = 00 ... C = 1…

    Description Length
    ...
    or D.
    We can pick a fixed length encoding where
    A = 00
    ...
    C = 10
    D = 11
    ...
    each character.
    But can we do better? Do we really need 2 bits (on the average) to encode symbols A, B, C, and D?
    ...
    they occur?".
    We

    We
    turn to
    ...
    description length.
    Information Content
    {solid_color.gif} {2d_barcode.jpg}
    ...
    Let's think first about binary events and binary numbers: the Binary Entropy function
    A string of coin tosses (1=heads, 0=tails) of a perfect coin would have maximum information content (H=1); a binary string that was always 000000... would have minimal (H=0) information content. Real life situations generally lie somewhere in between
    ...
    a generalized formvalue for the information content of any (i.e. non binary as well as binary) situationsrepresentation. If we use a base=2 for the logarithm, then it is equivalent to the average minimum number of bits required to store a set of information
    So Shannon converts a set of probabilities into an information content value. The probabilities can be derived via statistical analysis if not known a prori.
    Why does this matter? Well, the Shannon's Entropy can tell us:
    ...
    In this case the key is "shift by 3 places". The message 'BAD CAB' would be encrypted to 'EDGFDE'. The ciphertext looks cryptic, but we can all agree this cipher is easy to break. To recover plaintext from a ciphertext, we could simply try all the keys (how many possible keys are there?) and see if any of the decryptions yields intelligible English. This process of trying all the keys is called the brute force attack since it does not try to break the cipher in a clever way, and instead just tries every single key until it succeeds in decrypting the message.
    What we have performed is called the bruteforce attack, i.e., we have tried all the possible keys to decrypt the message. The key space for an algorithm is the set of all possible keys, and thus the amount of keys tried in the brute force attack is equal to the size of the key space. Note that we are making an assumption about the adversary: we assume the adversary is computationally bound and must use modern computers to crack the cipher. That is, the adversary is a "mere mortal" like the rest of us and cannot try keys faster than one would expect.
    ...
    alphabet, e.g.,
    Plaintext : ABCDEFG....XYZ
    Ciphertext: TIZPBUY....QEL
    ...
    each other)?
    What is the size of the key space? This key space is large enough to make a brute force attack impractical (at least in Caesar's times, now it is borderline feasible).
    Attacking the algorithm instead
    ...
    it publicly).
    It turns out one can exploit frequency analysis (see histogram above!) to realize the letter B seems to occur most often in the ciphertext. With enough ciphertext, one would expect the histogram of frequencies of letters in the ciphertext to match those in plain English, with just the work of mapping the correct letters left to be done. This attack is amazingly simple. You can play around with an online tool here.
    Kerchoff's principle
    ...
    But not all keys are created equal (don't use a key like 10101010101010....10). You want keys that are random, i.e., the entropy of a 128-bit key should be 128 bits. In other words, each bit in the key should be picked perfectly at random. For example, what is the entropy of your 128-bit key if you pick 16 alphabetical characters at random? That yields a key space of only 52^16 keys, which is only about 2^91. In other words, such keys would have only 91 bits of information encoded into 128 bits, and would be within reach of a brute force attacker.
    Consider two real examples:
    ...
    process ID.
    Open source code: Kerberos used 56-bit DES keys derived from only 32 bits of data (process ID, host ID, key count), and some of these bits were predictable leading to really only 20 bits of randomness. That would take only seconds to brute force.
    Lesson: don't cook up your own scheme to derive randomness. Use standard sources of randomness provided by the operating system, for example.
    ...
    The key distribution problem can be greatly alleviated through public key cryptography or asymmetric cryptography.
    Public Key Cryptography
    ...
    private key.
    Enc(k, m) = c
    Dec(s, c) = m
    ...
    A cryptographic hash function is a deterministic procedure that takes an arbitrary number of bits as input (called the message) and gives a fixed sized output (called the digest). For example, SHA-1 gives an output of 160 bits. In addition, this function must satisfy at least the three properties of collision resistance, pre-image resistance, and second pre-image resistance. These properties come in handy when it comes to storing passwords and signing data for example.
    Consider the compromise of a password database. If passwords are stored as cryptographic hashes, it is hard to invert the hash and recover the password (pre-image resistance) or generate some other password with the same hash (second pre-image resistance). Nonetheless, the attacker might spend time and space to build a large dictionary of 8-character inputs (for example) and their corresponding hashes for a popular choice of hash function such as SHA-1. Now if a password database (or an individual hash) is compromised, the attacker and perform a quick lookup on this database. The addition of a salt as input to the hash function forces the attacker to build a new dictionary for each possible salt value. The length of the salt can be increased to make such precomputation attacks infeasible to the attacker. For example, a 1 byte salt now requires 2^8 dictionaries. OS X currently uses SHA-1 as the hash function with 4-byte passwords, requiring 2^32 precomputed dictionaries.
    ...
    signed m1).
    But how can one construct these hash functions? Various algorithms have been proposed such as SHA-1 and MD5. But recent attacks have shown that collision resistance has been broken. In SHA-1's case, it should ideally take about 2^80 tries to generate a collision by bruteforce, but in 2005 researchers showed this could be done in 2^69 tries. In MD5's case, it should ideally take 2^64 tries to generate a collision by bruteforce, but instead collision can be generated "in seconds" today. SHA-1 should work for now. NIST will announce the winner for SHA-3 in 2012.
    (view changes)
    12:31 pm

Thursday, March 7

  1. page XML edited ... For example, if you are representing biological sequence data,you can clearly identify which p…
    ...
    For example, if you are representing biological sequence data,you can clearly identify which portion of the document contains sequence identifiers and cross references to public databases, and which portion contains raw sequence data. These sections are clearly marked and organized in a hierarchical document structure. A human reader or a computer program can therefore easily traverse a complete document and extract individual pieces of data.
    Here is an example of biological sequence data recorded in HTML and XML.
    <<span class="GRcorrect">html</span>>
    <<span class="GRcorrect">body</span>>
    <html>
    <body>

    <h1>NM-171533</h1>
    Organism: <b>Caenorhabditis elegans</b>
    <p>
    <span class="GRcorrect">agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaagggagcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaaggg
    aacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
    acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgca
    aacatgtttattctg</span>aacatgtttattctg</span></span>
    <p>
    [Full sequence has been omitted for brevity.]
    <<span class="GRcorrect">body</span>>
    <<span class="GRcorrect">html</span>>
    <body>
    <html>

    HTML Code
    <xmlversion="1.0" encoding="UTF-8" standalone="yes"?>
    <Sequence>
    <<span class="GRcorrect">accession</span>>NM-171533</accession>
    <<span class="GRcorrect">organism</span>>Caenorhabditis
    <accession>NM-171533</accession>
    <organism>Caenorhabditis
    elegans</organism>
    <sequence_data>
    <span class="GRcorrect">agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaagggaacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
    acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgcaaacatgtttattc</span>
    agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaagggaacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
    acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgcaaacatgtttattc

    [Full sequence has been omitted for brevity.]
    <sequence_data>
    (view changes)
    4:56 am

More