Data modeling is about understanding and representing how things (in the real world or in a computer) relate to each other within a particular domain. We have already explored. Data models (and their entities) may already be established for a particular domain or subdomain, or may need to be derived from scratch.

Think about the entities and their relationships in the life sciences. What are the entities? How do they relate to each other?

Modeling structured data


A formal Data Model allows specification of entities and how they relate to each other. Information can be modeled at three levels, each called a schema:

Conceptual schema - the entities and their relationships in a particular domain (e.g. biology)
Logical schema - implements the conceptual schema in a particular computer representation (e.g. XML, relational database)
Physical schema - the representation at the hardware level (e.g. how bits and bytes are stored, on which devices, etc.)
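
As a rough illustration of the first two levels, the sketch below takes a tiny conceptual schema (a gene encodes one or more proteins) and implements it as a logical schema in a relational database. The entity, table, and column names are assumptions chosen for this example, not an established life-science data model, and SQLite (through Python's standard sqlite3 module) simply stands in for any relational database. The physical schema is whatever the database engine does with storage underneath, so it does not appear in the code.

```python
import sqlite3

# Conceptual schema (assumed for illustration): a Gene encodes Proteins.
# Logical schema: two tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL UNIQUE
);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    gene_id    INTEGER NOT NULL REFERENCES gene(gene_id)
);
""")

# One illustrative instance of each entity.
conn.execute("INSERT INTO gene (gene_id, symbol) VALUES (?, ?)", (1, "TP53"))
conn.execute(
    "INSERT INTO protein (name, gene_id) VALUES (?, ?)",
    ("Cellular tumor antigen p53", 1),
)

# The relationship is recovered by joining the two tables.
rows = conn.execute("""
    SELECT gene.symbol, protein.name
    FROM protein JOIN gene ON gene.gene_id = protein.gene_id
    ORDER BY gene.symbol, protein.name
""").fetchall()
print(rows)
```

The same conceptual schema could equally be implemented as an XML document or as a set of triples; only the logical level would change.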

Traditionally, one builds a conceptual (and subsequently a logical) schema through the process of Systems Analysis, building graphical models along the way, particularly:

These highly formal techniques tend to be used only in the world of databases. However, their kin appear in many other methods. For example, Contextual Design emphasizes understanding the full picture of a work environment in a particular domain, and from observation sessions one can create:
  • Physical Model (see an example related to designing software for dentists)
  • Cultural Model
  • Sequence Diagram
  • Data Flow Model
Amongst others.

At the logical schema level, the two most dominant forms are the relational database (Oracle, MySQL, etc.) and, increasingly, the semantic database (triple stores, RDF, OWL).
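
To give a feel for the semantic form, here is a minimal sketch of the triple idea in plain Python: every fact is a (subject, predicate, object) statement, and a query is a pattern with some positions left open. This is not a real triple store and uses no RDF syntax; the predicate names and the match function are hypothetical, introduced only for this example.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Illustrative facts; the predicate names are made up for this sketch.
triples: Set[Triple] = {
    ("TP53", "is_a", "Gene"),
    ("p53", "is_a", "Protein"),
    ("TP53", "encodes", "p53"),
}


def match(
    store: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return all triples matching the pattern; None means 'anything'."""
    return sorted(
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "What does TP53 encode?" and "Which things are genes?"
print(match(triples, s="TP53", p="encodes"))
print(match(triples, p="is_a", o="Gene"))
```

Unlike a relational table, no columns are fixed in advance: a new kind of relationship is just another triple with a new predicate.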

When complex entities are represented in a computer program, they are called data structures.
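
As a small sketch of what that can look like, the two classes below represent a gene and its proteins as in-memory data structures. The class and field names are assumptions made for this example rather than part of any established model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Protein:
    name: str


@dataclass
class Gene:
    symbol: str
    # The "encodes" relationship is held as a list of Protein objects.
    proteins: List[Protein] = field(default_factory=list)


gene = Gene(symbol="TP53")
gene.proteins.append(Protein(name="p53"))
print(gene)
```

Here the relationship between entities is expressed by one object holding references to others, rather than by a foreign key or a triple.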

Modeling unstructured data


Nowadays, we can think of the world of information from two perspectives: structured and unstructured. The structured world includes formal models, databases, XML, etc. It is the traditional world of information systems. The unstructured world includes tagging, folksonomies, Web 2.0, natural language processing, data mining, and so on. Until recently, creating structured data was a prerequisite for using information systems; indeed, much of database and systems analysis theory was about identifying and organizing structured data from an unstructured world. In summary:

"For the most part, information systems have grown up around structured data and structured systems. The structured environment is made up of data that has fields, columns, tables, rows and indexes. It centers around transactions and has reports, audits and definitions of words. There is a high degree of predictability and order associated with the structured environment.
The unstructured environment is very different from the structured environment. The unstructured environment has no particular order to it. It consists of text found in medical reports, warranties, contracts, e-mail and spreadsheets. The text has no rules governing its creation or usage. With text, there are no keys, no indexes, no columns or attributes. Text is free-form and is as disorderly as structured data is orderly."
Taken from http://www.b-eye-network.com/view/4955

We need to be able to handle all kinds of data. Real problems are rarely simple enough to involve structured data alone.

Examples of unstructured data
  • Documents, spreadsheets, etc
  • Journal articles
  • Web pages

We'll take an example from this RSC journal article, marked up as part of Project Prospect.

Text mining is done by our local company Megaputer.

We did some work at IU.

See e.g. Autonomy - case studies

Interesting trend - small pieces of structured data derived from unstructured data - see e.g. Nanopublications.
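
The general idea of deriving small structured records from free text can be sketched with a regular expression. This is only a toy illustration, not the nanopublication format and not how any of the tools mentioned above work; the sentence is invented, and the pattern recognizes only the three units listed in it.

```python
import re
from typing import Dict, List, Union

# Invented sentence standing in for unstructured text.
text = "Dissolve 50 mg of the sample in 10 mL of water and stir for 30 min."

# A number followed by one of a few known units (assumed list).
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mL|min)\b")

# Each match becomes a small structured record with named fields.
records: List[Dict[str, Union[float, str]]] = [
    {"value": float(value), "unit": unit}
    for value, unit in QUANTITY.findall(text)
]

for record in records:
    print(record)
```

Once the quantities are in this form they can be stored, indexed, and queried like any other structured data, even though they started life inside a sentence.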