
Whenever we wish to apply informatics to a particular domain, the first thing we have to think about is how to represent the entities of that domain (at a simple level this applies to dates, images, videos, and so on). Once we can represent the entities, we can develop techniques and algorithms that use those representations to help solve problems. The whole package of representations and algorithms constitutes a particular application branch of informatics (or a set of branches).

But... Life is INCREDIBLY complex! If in doubt, watch this (see also Harvard Biovisions page and narrated version) - and this is just our current model of a very small piece of the puzzle...

So what do we do? The systems of life are extremely complex (and still poorly understood), but life seems to be quite efficient, in that the complex systems are built from a small number of kinds of things such as proteins, small molecules, DNA and so on. If we can work out how to represent these basic entities, we can start to piece them together into more complex entities. But STILL we are in the infancy of this subject: for instance, a single cell is far too complicated for us to fully understand or represent, except incompletely or at high levels of abstraction. We are discovering new things all the time, e.g. epigenetics (see also on Wikipedia).

Yet, even with these basic representations, we can do some amazing science on computers that wouldn't be possible without them.

Some things to bear in mind:

  • Our understanding of the operation of living things has advanced immensely in the last few decades, but the more we find out the more we realize there is to discover in the future
  • Many published medical research findings are later contradicted or fail to replicate
  • When we are representing life science entities, we are generally representing a model of the thing, not the thing itself (e.g. a 3D chemical structure is a model of a "real" molecule)
  • Informatics domains are built on a set of representations, plus algorithms that can be applied to those representations (e.g. bioinformatics = proteins, DNA, RNA, etc.; cheminformatics = 2D and 3D chemical structures; genomics = genes)

Here are some things we can do:

  • Represent chemical structures (atoms, bonds), proteins (atoms, bonds, amino acids), and DNA (codons / base pairs)
  • Represent biological pathways involving these entities
  • Store and search all of the above in databases
  • Store and search (ish) scientific publications
  • Store and search biomedical information - epidemiology, Electronic Medical Records, etc
  • Make predictions - activities of chemical compounds, protein function, protein structure, disease-gene associations, etc, etc.
  • Increasingly - map these to individual people (personalized medicine)

So let's take a look at some of those entities.


Chemical structures / small molecules

And let's look at how they are represented.

Remember from our first class, the 2D chemical structure for Aspirin?


Now look again... does the construct look familiar? What is this mathematically?
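
One way to read the question above: a 2D chemical structure is a graph, with atoms as nodes and bonds as edges. The following is a minimal sketch in plain Python, not the representation used by any particular cheminformatics toolkit. The atom numbering, the names ATOMS, BONDS, adjacency and molecular_formula, and the choice to write the benzene ring with alternating single and double bonds are all assumptions made for illustration. Hydrogens are left implicit and inferred from the usual valences of carbon (4) and oxygen (2).

```python
from collections import Counter

# Heavy (non-hydrogen) atoms of aspirin, indexed 0..12 (numbering is arbitrary).
ATOMS = ["C", "C", "O", "O", "C", "C", "C", "C", "C", "C", "C", "O", "O"]

# Bonds as (atom_i, atom_j, bond_order).
BONDS = [
    # acetyl group and ester oxygen
    (0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1),
    # benzene ring, written with alternating double and single bonds
    (4, 5, 2), (5, 6, 1), (6, 7, 2), (7, 8, 1), (8, 9, 2), (9, 4, 1),
    # carboxylic acid group
    (9, 10, 1), (10, 11, 2), (10, 12, 1),
]

# Usual valences, used only to infer implicit hydrogens.
VALENCE = {"C": 4, "O": 2}


def adjacency(n_atoms, bonds):
    """Build an adjacency list: node index -> list of neighbouring node indices."""
    neighbours = {i: [] for i in range(n_atoms)}
    for i, j, _order in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return neighbours


def molecular_formula(atoms, bonds):
    """Count heavy atoms, then add the hydrogens needed to fill each valence."""
    used = Counter()
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[el] - used[k] for k, el in enumerate(atoms))
    return "".join(f"{el}{counts[el]}" for el in ("C", "H", "O"))


graph = adjacency(len(ATOMS), BONDS)
print(len(ATOMS), "nodes,", len(BONDS), "edges")
print("neighbours of ring atom 4:", sorted(graph[4]))
print(molecular_formula(ATOMS, BONDS))
```

Once the molecule is a graph, standard graph algorithms (traversal, ring detection, subgraph matching) become available for tasks such as substructure search.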

Now let's look at a 3D structure...
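
At its simplest, a 3D structure adds an (x, y, z) coordinate to every atom. The sketch below uses water rather than aspirin to keep the numbers small, and it assumes an approximate textbook geometry (O-H distance about 0.96 angstroms, H-O-H angle about 104.5 degrees). The names WATER, distance and angle_deg are hypothetical helpers for this example only.

```python
import math

# Assumed, approximate water geometry used to generate the coordinates.
BOND_LENGTH = 0.96            # angstroms
THETA = math.radians(104.5)   # H-O-H angle

# Each atom is (element, x, y, z), with coordinates in angstroms.
WATER = [
    ("O", 0.0, 0.0, 0.0),
    ("H", BOND_LENGTH, 0.0, 0.0),
    ("H", BOND_LENGTH * math.cos(THETA), BOND_LENGTH * math.sin(THETA), 0.0),
]


def distance(a, b):
    """Euclidean distance between two atoms."""
    return math.dist(a[1:], b[1:])


def angle_deg(centre, a, b):
    """Angle a-centre-b in degrees."""
    va = [p - c for p, c in zip(a[1:], centre[1:])]
    vb = [p - c for p, c in zip(b[1:], centre[1:])]
    dot = sum(x * y for x, y in zip(va, vb))
    cos_angle = dot / (math.hypot(*va) * math.hypot(*vb))
    return math.degrees(math.acos(cos_angle))


oxygen, h1, h2 = WATER
print(round(distance(oxygen, h1), 3))
print(round(angle_deg(oxygen, h1, h2), 1))
```

Geometric quantities such as distances and angles come straight from the coordinates, which is what makes 3D representations useful for tasks like docking.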


Proteins and polypeptides

Proteins are really just big molecules, but they are made up of repeating units (encoded by DNA) called amino acids (or residues). The standard genetic code, shared by humans, other animals and almost all other organisms, specifies 20 amino acids. This means that as well as thinking of protein structure in terms of atoms and bonds, we can consider primary structure (often called a "sequence"), secondary structure and tertiary structure.


Nature has been kind to us: the protein sequence is very computer-friendly, as it is really a "language" with 20 letters (one per amino acid), and a protein at this level is a string of these 20 letters. We even already have a coding system, the standard one-letter amino acid codes. For example, the primary sequence of Tat, a protein involved in HIV, can be written as a single string of these letters.
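
To illustrate the coding system in code (this is NOT the Tat sequence), here is a small sketch of the standard one-letter and three-letter codes. The peptide "MKTAYIAK" is made up for this example, and the function names is_valid_protein and to_three_letter are hypothetical.

```python
# One-letter -> three-letter codes for the 20 standard amino acids.
AMINO_ACIDS = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def is_valid_protein(sequence):
    """True if every character is one of the 20 standard one-letter codes."""
    return all(residue in AMINO_ACIDS for residue in sequence.upper())


def to_three_letter(sequence):
    """Spell a one-letter sequence out in three-letter codes."""
    return "-".join(AMINO_ACIDS[residue] for residue in sequence.upper())


peptide = "MKTAYIAK"  # made-up example peptide, not a real protein
print(is_valid_protein(peptide))
print(to_three_letter(peptide))
```

Because a primary sequence is just a string, ordinary string operations (searching, slicing, comparing) apply to it directly.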



DNA gets even easier - it's a language with 4 letters in a string (ACTG)

RNA uses (almost) the same encoding system
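
The "almost" is that RNA uses U (uracil) where DNA uses T (thymine). A minimal sketch, assuming sequences are plain uppercase strings and that the input is the coding strand; the function names are hypothetical and the example sequence is made up.

```python
DNA_ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def validate_dna(seq):
    """Uppercase the sequence and reject anything outside A, C, G, T."""
    seq = seq.upper()
    unexpected = set(seq) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"not DNA letters: {sorted(unexpected)}")
    return seq


def transcribe(dna):
    """Coding-strand DNA -> RNA: same letters, with T replaced by U."""
    return validate_dna(dna).replace("T", "U")


def reverse_complement(dna):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return validate_dna(dna).translate(COMPLEMENT)[::-1]


dna = "ATGGCT"  # made-up example
print(transcribe(dna))
print(reverse_complement(dna))
```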

So we can simply represent a DNA sequence as a string of text, or with a small number of binary bits per base (how many do we need?)
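
With 4 possible letters, log2(4) = 2 bits per base is enough. The sketch below packs a sequence into a single integer at 2 bits per base and unpacks it again. The particular bit assignment (A=00, C=01, G=10, T=11) is an arbitrary assumption, and pack / unpack are hypothetical names. The sequence length has to be stored alongside the integer because leading A's (00) would otherwise be lost.

```python
import math

BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

# Number of bits needed to distinguish 4 symbols.
BITS_PER_BASE = math.ceil(math.log2(len(BASE_TO_BITS)))


def pack(dna):
    """Pack a DNA string into (integer, length) using 2 bits per base."""
    value = 0
    for base in dna.upper():
        value = (value << BITS_PER_BASE) | BASE_TO_BITS[base]
    return value, len(dna)


def unpack(value, length):
    """Recover the DNA string from the packed integer and its length."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= BITS_PER_BASE
    return "".join(reversed(bases))


packed, n = pack("ACGT")
print(BITS_PER_BASE, bin(packed), unpack(packed, n))
```

Compared with one 8-bit character per base, this is a fourfold saving, at the cost of not being able to represent ambiguity codes such as N.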

Applying algorithms to the representation

Once we have these representations of chemicals, proteins and DNA we can do things...

Sequence Alignment
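
As a rough illustration of what an alignment algorithm computes, here is a sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch approach). This is not how BLAST works internally; BLAST is a faster heuristic for local alignment. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GATACA"))
```

The same table-filling idea works for protein sequences if the simple match/mismatch score is replaced by a substitution matrix.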


Check out the BLAST search on the Expasy page. We can use a Swiss-Prot ID, e.g. A0MPN3

See BLAST database format

Chemical structure search on PubChem

DrugBank

Molecular Docking

1hvi.pdb - a protein-ligand complex for HIV Protease with a bound inhibitor from the Protein Data Bank
See JMOL page for this complex

Chemical structures in the PubChem database projected into 3 dimensions and labelled with inferred disease relationships

Network relationships of PubChem compounds to diseases as visualized in Cytoscape