Databases


A database is a collection of data in structured format.
Database Management Systems (DBMS) provide software to manage and administer such databases. In practice, almost all database management systems use the Relational Model which will be discussed more in the theory section. Simply, the relational model considers everything in terms of tables (relations) with rows (tuples) and columns (attributes).

Database Theory

Relational algebra

  • A query involves taking one or more input relations (tables), applying one or more operators, and producing one or more output relations (tables)
  • Selection of rows and Projection of Columns
  • Set operators: Union operator allows logical OR for selection; Intersection operator allows logical AND for selection; Set Difference allows logical A AND NOT B (A or B minus intersection); Cross Product allows merging of columns
  • Join operator allows merging of rows or columns; Equijoin joins based on equality of a particular field

Equivalent Terms

Table
Relation

Row
Tuple
Record
Column
Attribute
Field

Designing the database

  • Three steps
    • Analysis of requirements (e.g. Contextual Design)
    • Conceptual Design (ERM, UML, etc)
    • Logical Design (Mapping to the DBMS)
  • Candidate Key - one or more attributes which can be used as a primary key
  • Primary (or Unique) Key - a candidate key which has been selected to uniquely identify a record
  • Foreign Key - a key which is used to establish a relationship between tables
  • Key Constraints - specify that only one instance is valid of a particular relationship for a particular entity
  • Weak Entities can only be identified uniquely by association with another entity, thus it must use a foreign key.

The most common DBMS you will come across


Notice that these DBMS manage "relational" databases. We will discuss a little more of the theory behind relational databases later, but in short these databases store a collection of tables (or relations). These tables contain rows and columns, where rows correspond to records (or tuples), and columns correspond to attributes. For example, one can envision a table of employee records. Each row in the table corresponds to a particular employee's record, and each column corresponds to attributes such as Employee ID, Name, and Job Title.

The database could hold multiple tables that are related to each other in some way (recall the ER diagrams we built to relate entities to each other). We first orient ourselves with a single table. We will use the Structured Query Language (SQL), which was designed to interact with relational databases (e.g., to insert, update, delete data, and to query the database for stored data based on some constraints).

Example using PostgreSQL


We're going to create a database for a Freight Statistics System. Here is the description:

You just joined a company that develops software for managing the inventory of a large freight haulage company. You are responsible for creating a database system to keep track of the pickup time & date, delivery time & date, average speed of transport, package ID, and name of sending company for each package and generates statistics based on these (including number of packages delivered in a time period, total average time for transport, overall average average speed during transport, and the top 100 companies) . Package ID’s are whole numbers assigned sequentially (e.g. package 123 is the one picked up after package 122). The company currently handles approximately 20 million packages per year. It works predominately in the U.S. and Canada, but has ambitions to develop overseas operations in the coming years (which could mean the number of packages handled approaches 100 million). Here are the main data points for each package:

PICKUP TIME
DELIVERY TIME
PICKUP DATE
DELIVERY DATE
AVERAGE SPEED
PACKAGE ID
NAME OF SENDING COMPANY

First, let's make assignments of to PostgreSQL types. Not in particular how we use the PostgreSQL timestamp type to remove the need for a separate date field. For more details, see PostgreSQL documentation on data types.

Name
PostgreSQL representation
PICKUP TIME
TIMESTAMP WITH TIME ZONE
DELIVERY TIME
TIMESTAMP WITH TIME ZONE
PICKUP DATE
TIMESTAMP WITH TIME ZONE
DELIVERY DATE
TIMESTAMP WITH TIME ZONE
AVG SPEED
INTEGER
PACKAGE ID
BIGINT
SENDING COMPANY
VARCHAR(300)

We can now create a database and do some things with it. You can find a full reference to the commands for PostgreSQL in the manual.

First, we create a FreightStats table:
CREATE TABLE FreightStats (
  pickup_time TIMESTAMP WITH TIME ZONE,
  delivery_time TIMESTAMP WITH TIME ZONE,
  avg_speed INTEGER,
  package_id BIGINT,
  sending_company VARCHAR(300) );
 
And check its schema
=> \d FreightStats
              Table "public.freightstats"
     Column      |           Type           | Modifiers
-----------------+--------------------------+-----------
 pickup_time     | timestamp with time zone |
 delivery_time   | timestamp with time zone |
 avg_speed       | integer                  |
 package_id      | bigint                   |
 sending_company | character varying(300)   |
 
We can then add our first record!
INSERT INTO FreightStats
  (pickup_time, delivery_time, avg_speed, package_id, sending_company)
VALUES (
  '2009-03-20 18:36:00',
  TIMESTAMP WITH TIME ZONE '2009-03-24 09:14:00-07',
  54,
  12586274,
  'DJW SOFTWARE' );
 
There's a lot to note here: the timestamp with implicit timezone, the timestamp with explicit timezone, the way quotes are used. Let's add another three records:
INSERT INTO FreightStats
  (pickup_time, delivery_time, avg_speed, package_id, sending_company)
VALUES (
  '2009-03-01 15:18:00',
  TIMESTAMP WITH TIME ZONE '2009-03-07 12:17:00',
  48,
  12586273,
  'BIG PACKAGES INC' );
 
INSERT INTO FreightStats
  (pickup_time, delivery_time, avg_speed, package_id, sending_company)
VALUES (
  '2009-02-26 01:05:00',
  TIMESTAMP WITH TIME ZONE '2009-03-01 23:59:00',
  55,
  12586272,
  'PETE''S PET SHOP' );
 
INSERT INTO FreightStats
  (pickup_time, delivery_time, avg_speed, package_id, sending_company)
VALUES (
  '2009-02-05 14:31:00',
  TIMESTAMP WITH TIME ZONE '2009-02-10 11:12:00',
  61,
  12586271,
  'INFOREP CORP' );
 
Note the double apostrophe required in PETE''S PET SHOP. We can now look at what we have with the SQL SELECT command:
=> SELECT * FROM FreightStats;
      pickup_time       |     delivery_time      | avg_speed | package_id | sending_company
------------------------+------------------------+-----------+------------+------------------
 2009-03-20 18:36:00-04 | 2009-03-24 12:14:00-04 |        54 |   12586274 | DJW SOFTWARE
 2009-03-01 15:18:00-05 | 2009-03-07 12:17:00-05 |        48 |   12586273 | BIG PACKAGES INC
 2009-02-26 01:05:00-05 | 2009-03-01 23:59:00-05 |        55 |   12586272 | PETE'S PET SHOP
 2009-02-05 14:31:00-05 | 2009-02-10 11:12:00-05 |        61 |   12586271 | INFOREP CORP
(4 rows)
 
Let's do some more fancy searches:
=> SELECT * FROM FreightStats WHERE pickup_time > '2009-03-01';
      pickup_time       |     delivery_time      | avg_speed | package_id | sending_company
------------------------+------------------------+-----------+------------+------------------
 2009-03-20 18:36:00-04 | 2009-03-24 12:14:00-04 |        54 |   12586274 | DJW SOFTWARE
 2009-03-01 15:18:00-05 | 2009-03-07 12:17:00-05 |        48 |   12586273 | BIG PACKAGES INC
(2 rows)
 
=> SELECT * FROM FreightStats WHERE pickup_time > '2009-03-01' AND pickup_time < '2009-03-10';
      pickup_time       |     delivery_time      | avg_speed | package_id | sending_company
------------------------+------------------------+-----------+------------+------------------
 2009-03-01 15:18:00-05 | 2009-03-07 12:17:00-05 |        48 |   12586273 | BIG PACKAGES INC
(1 row)
 
=> SELECT package_id, sending_company FROM FreightStats WHERE pickup_time > '2009-03-01';
 package_id | sending_company
------------+------------------
   12586274 | DJW SOFTWARE
   12586273 | BIG PACKAGES INC
(2 rows)
 
Let's calculate the time taken between pickup and delivery:
=> SELECT package_id, delivery_time-pickup_time FROM FreightStats;
 package_id |    ?column?
------------+-----------------
   12586274 | 3 days 17:38:00
   12586273 | 5 days 20:59:00
   12586272 | 3 days 22:54:00
   12586271 | 4 days 20:41:00
(4 rows)
 
And the overall mean times (and average speeds):
=> SELECT AVG(delivery_time-pickup_time), AVG(avg_speed) FROM FreightStats;
       avg       |         avg
-----------------+---------------------
 3 days 38:33:00 | 54.5000000000000000
(1 row)
 

Ah, notice a problem? Let's fix it:

=> SELECT justify_hours(AVG(delivery_time-pickup_time)), AVG(avg_speed) FROM FreightStats;
       avg       |         avg
-----------------+---------------------
 4 days 14:33:00 | 54.5000000000000000
(1 row)


Pretty cool eh?

Now let's try another example. You need to develop a PostgreSQL records system to log ambulance calls for a local health provider. The database needs to include:
  • Call time
  • Dispatch time
  • On scene time
  • Chief complaint
  • Priority (emergent, non-emergent)
  • Arrival-at-hospital time

What representations would you use? How would these map to PostgreSQL types (see documentation on data typescopied below )? What could you do with SQL queries that would be useful?

Name
Aliases
Description
bigint
int8
signed eight-byte integer
bigserial
serial8
autoincrementing eight-byte integer
bit [ ({{n) ]}}

fixed-length bit string
bit varying [ ({{n) ]}}
varbit
variable-length bit string
boolean
bool
logical Boolean (true/false)
box

rectangular box in the plane
bytea

binary data ("byte array")
character varying [ ({{n) ]}}
varchar [ ({{n) ]}}
variable-length character string
character [ ({{n) ]}}
char [ ({{n) ]}}
fixed-length character string
cidr

IPv4 or IPv6 network address
circle

circle in the plane
date

calendar date (year, month, day)
double precision
float8
double precision floating-point number
inet

IPv4 or IPv6 host address
integer
int, int4
signed four-byte integer
interval [ ({{p) ]}}

time span
line

infinite line in the plane
lseg

line segment in the plane
macaddr

MAC address
money

currency amount
numeric [ ({{p, s) ]}}
decimal [ ({{p, s) ]}}
exact numeric of selectable precision
path

geometric path in the plane
point

geometric point in the plane
polygon

closed geometric path in the plane
real
float4
single precision floating-point number
smallint
int2
signed two-byte integer
serial
serial4
autoincrementing four-byte integer
text

variable-length character string
time [ ({{p) ] [ without time zone ]}}

time of day
time [ ({{p) ] with time zone}}
timetz
time of day, including time zone
timestamp [ ({{p) ] [ without time zone ]}}

date and time
timestamp [ ({{p) ] with time zone}}
timestamptz
date and time, including time zone
tsquery

text search query
tsvector

text search document
txid_snapshot

user-level transaction ID snapshot
uuid

universally unique identifier
xml

XML data


Advanced topics - Normal Forms (in plain english - almost)

Normal forms let us get round problems of updating and efficiency.
  • 1NF - No repeating rows (implies a Unique Key); no intrinsic important ordering of the rows or columns (doesn't matter what order they are returned in)
  • 2NF - All fields in a table are dependent on the whole of a Candidate Key, not just part of it
  • 3NF - All fields in a table are ONLY dependent on the whole of the candidate key.

We'll see how this maps to our ambulance calls example

What are the candidate keys? Which should we use as the primary key?
How does this get put into 3NF?
What are the foreign keys that link the tables? What are the weak entities?

Advanced topics - Input validation and SQL injection


xkcd comic at http://xkcd.com/327/
xkcd comic at http://xkcd.com/327/


Many web applications store data in databases, and interact with these databased based on input from the user. As the cartoon above shows, if your application simply takes whatever the user supplies as input and inserts it into an SQL query, the resultant query might give you unexpected results, where SQL commands of the attacker's choice are injected into the application. This is called the SQL injection attack.

Escape sequences are special characters that send commands to the underlying program or control device. In the case of SQL, the single quote delimits input. Imagine a database with the table students. A web application may ask the user for a name, execute a query such as the one below, and display formatted results:
SELECT * FROM students WHERE (full_name='$INPUT_STRING');

While the developer may expect well formed input such as Bob Armstrong, with the resultant query:
SELECT * FROM students WHERE (full_name='Bob Armstrong');

the developer certainly did not expect to execute the following query, which executes the command to delete the entire table in the database
SELECT * FROM students WHERE (full_name='Robert'); DROP TABLE Students;--';
(note the two dashes -- at the end treat the rest of the line in the SQL query as a comment, and gets around problems of balanced quotes, or whatever else may follow the injected command)

We will look at a detailed example of an attack outlined here. This page also discusses several mitigation strategies.