Workshop 6: An Introduction to Relational Detabases

This workshop provides a basic introduction to Relational Databases using the SQLite program.

There are three main aspects of database usage,

  • database design and construction
  • loading data
  • querying the data

Below I discuss the main points of each and introduce use of the SQL language in the context of the sqlite3 database management program. This document contains the following sections:

Database Design

Relational databases, the most common type, are designed around entities and relationships between entities. Database design deals with these.

For example, a movie database might contain information on movies and actors. These are entities. The relationship that ties certain actors to certain movies can be called the cast. The figure below is part of the design of such a database. In it, the rectangles are entities and the diamond is a relationship. The lines connect the entities to the relationship.

../../_images/movies.actors.cast.er.diagram.png

Relational databases consist of tables of data. Each table consists of rows. In an entity table, each row contains data about one instance of that entity. For example, in a movie table, each row has information about one movie. The following is a description of a table to hold movie data written is SQL. This description is used to create the movie table.

CREATE TABLE movies (
    mid integer primary key,
    title text,
    year integer,
    genres text
);

The data in a row is divided into fields. Each field holds a particular piece of data. In our movie rows, the individual fields are:

  • mid – a unique integer identifier for the row
  • title – the movie title, stored as a text string
  • year – the year the movie came out, stored as an integer
  • genres – a list of classification labels for the movie content, stored as a text string

The primary key notation on the mid field indicates that the data will be sorted for fast lookup on this field. The following are a few rows of data from the movies table. This data comes from the publicly available IMDb (Internet Movie Database) at https://www.imdb.com/interfaces/ .

mid         title                           year  genres
----------  ------------------------------  ----  --------------------
369610      Jurassic World                  2015  Action,Adventure,Sci
1326190     Aliens: Zone-X                  2015  Sci-Fi
1392190     Mad Max: Fury Road              2015  Action,Adventure,Sci
1828251     Journey to Mt. Fuji             2015  Adventure,Family,Sci
2395427     Avengers: Age of Ultron         2015  Action,Adventure,Sci
2577662     The Rise of the Robots          2015  Sci-Fi
2651352     Ratpocalypse                    2015  Fantasy,Sci-Fi

Similarly, each row in an actors table holds data about an actor. The following is a description of a table to hold actor data written is SQL. Again, this description is used to create the actor table.

CREATE TABLE actors (
    aid integer primary key,
    name text
);

Rows in this table hold only two values:

  • aid – a unique integer identifier for the row
  • name – the name of the actor, stored as a text string

The following are a few rows of data from the actors table.

aid         name
----------  ------------------------------
1           Fred Astaire
2           Lauren Bacall
3           Brigitte Bardot
4           John Belushi
5           Ingmar Bergman
6           Ingrid Bergman
7           Humphrey Bogart
8           Marlon Brando
9           Richard Burton
10          James Cagney

Relationship tables are different. They hold values that tie the entities together. Instead of using actual data, the identifiers are used in a relationship table. The following is a description of the cast table.

CREATE TABLE cast (
    mid integer,
    aid integer,
    role text
);

The fields are:

  • mid – an integer identifier from the movies table
  • aid – an integer identifier from the actors table
  • role – a description of the actor’s role in the movie, stored as a text string

Movies typically have more than one actor, so the cast table will typically have more than one row for the same movie, each with a different actor. For example, the movie “Wonder Woman” has the following row in the movies table:

mid         title         year        genres
----------  ------------  ----------  ------------------------
451279      Wonder Woman  2017        Action,Adventure,Fantasy

Note the movie row identifier mid = 451279. In order to tie the movies to its actors, the same identifier, is used in the cast table.

mid         aid         role
----------  ----------  ------------------------------
451279      2933757     ["Diana"]
451279      1517976     ["Steve Trevor"]
451279      705         ["Antiope"]
451279      205063      ["Etta"]

Who are these actors? The only way to find out is to go to the actors table and look for the rows with the corresponding aid identifiers.

aid         name
----------  --------------------
2933757     Gal Gadot
1517976     Chris Pine
705         Robin Wright
205063      Lucy Davis

Adding Data

In sqlite3, the easiest way to add data to a table is to load it from a file. sqlite3 has a special command for this called .import that is one of a series of commands that start with a period and are called Dot Commands.

The file should:

  • contain rows of data
  • have in each row,
    • one value for each field
    • fields in the same order as the create table statement
  • all fields separated by the same character, such as
    • a tab “\t” (a tab separated file or tsv)
    • a comma “,” (a comma separated file or csv)

For example, importing movie data into the movies table can be done as follows. First set the type of field separator. This can be done with .mode csv or .mode tabs command, then import the data from the file movies.tsv. Note that the prompt sqlite> appears when the sqlite3 program is running.

sqlite>.mode tabs
sqlite>.import movies.tsv

Querying Data

Data is queried with SQL select statements. The basic form of an SQL query (Structured Query Language) for a single table is:

SELECT field name, field name, ...
FROM table name
WHERE condition [AND|OR condition etc.]
GROUP BY field name
ORDER BY field name [asc|desc] ...
LIMIT integer

The individual query parts are referred to as clauses. The Select and From clauses are required, all others are optional.

  • Select – lists the fields in the output, any order
  • From – lists the table(s) where the data is stored
  • Where – gives boolean condition(s) (true/false) limiting the rows used
  • Group by – used with aggregates like count(*)
  • Having – gives boolean conditions limiting output after a GROUP BY
  • Order by – sorts the output by field(s), either ascending (ASC) or descending (DESC)
  • Limit – restricts the output to a maximum number of rows

The simplest query returns the whole table. Limit is used because the table contains over 100,000 rows. Here, “*” means “all fields.”

SELECT *
FROM Movies
LIMIT 10
mid     title           year  genres
------  --------------  ----  ----------
35423   Kate & Leopold  2001  Comedy,Fan
66853   Na Boca da Noi  2016  Drama
69049   The Other Side  2018  Drama
88751   The Naked Mons  2005  Comedy,Hor
94859   Chief Zabu      2016  Comedy
96056   Crime and Puni  2002  Drama
97540   Responso        2004  \N
100275  The Wandering   2017  Comedy,Dra
102362  Istota          2000  Drama,Roma
107706  Stupid Lovers   2000  \N

Note that \N means NULL or no value.

To restrict the fields, use field names:

SELECT title, genres, year
FROM Movies
LIMIT 10
title           genres                year
--------------  --------------------  ----
Kate & Leopold  Comedy,Fantasy,Roman  2001
Na Boca da Noi  Drama                 2016
The Other Side  Drama                 2018
The Naked Mons  Comedy,Horror,Sci-Fi  2005
Chief Zabu      Comedy                2016
Crime and Puni  Drama                 2002
Responso        \N                    2004
The Wandering   Comedy,Drama,Fantasy  2017
Istota          Drama,Romance         2000
Stupid Lovers   \N                    2000

To restrict records, impose a condition

SELECT title, genres, year
FROM Movies
WHERE year = 2018
LIMIT 10
title                       genres                year
--------------------------  --------------------  ----
The Other Side of the Wind  Drama                 2018
T.G.M. - osvoboditel        \N                    2018
To Chase a Million          Action,Drama          2018
Fahrenheit 451              Drama,Sci-Fi,Thrille  2018
Nappily Ever After          Comedy,Drama,Romance  2018
Alita: Battle Angel         Action,Adventure,Rom  2018
Surviving in L.A.           Comedy,Drama,Romance  2018
Escape from Heaven          Comedy,Fantasy        2018
The Last Full Measure       Drama,War             2018
Caravaggio and My Mother t  Comedy,Drama          2018

For string comparison several options are available.

  • = – strings must match exactly (usage: field = “pattern”)

    • not case sensitive
  • LIKE – strings must match exactly (usage: field LIKE “pattern”)

    • can use wildcards in pattern
    • ‘%’ for zero or more “I don’t care” letters
    • ‘_’ for one letter
    • not case sensitive

The following example uses a condition on the title and genres to restrict the output to titles which begin with “star” and where “sci-fi” occurs somewhere in the genres field.

sqlite> select title, genres, year
   ...> from movies
   ...> where year = 2017 and title like "star%" and genres like "%sci-fi%"
   ...> limit 10;
title                          genres                year
-----------------------------  --------------------  ----
Star Wars: The Fallen Brother  Action,Fantasy,Sci-F  2017
Starwatch                      Action,Drama,Sci-Fi   2017
Star Wars: The Dark Reckoning  Sci-Fi                2017
Star Trek: The Paradise Maker  Adventure,Animation,  2017

Joins

When you want to combine data from different tables, joins are used. This is how to retrieve information on both actors and movies in the same query. Joins occur in the FROM clause. All the required tables are listed and the columns that should be used to join the rows are specified. Recall the actors – cast – movies diagram from above. Now it’s labeled with the columns that join the entity and relationship tables.

../../_images/movies.actors.cast.er.diagram.with.primary.keys.png

Going back to the Wonder Woman example. Here is a query that returns the actors by looking for the movie name. The results are shown after the query.

sqlite> select mid, title, aid, name, role
   ...> from movies join cast using(mid) join actors using (aid)
   ...> where title like "wonder woman";
mid         title         aid         name        role
----------  ------------  ----------  ----------  --------------------
451279      Wonder Woman  2933757     Gal Gadot   ["Diana"]
451279      Wonder Woman  1517976     Chris Pine  ["Steve Trevor"]
451279      Wonder Woman  705         Robin Wrig  ["Antiope"]
451279      Wonder Woman  205063      Lucy Davis  ["Etta"]

Notice the joins in the from clause. The first one is

movies join cast using(mid)

This indicates that rows from movie should be combined with rows from cast when they share the same mid value. In effect, this produces an intermediate table with the following rows: mid, title, aid, role as can be seen in the following query.

sqlite> select *
from movies join cast using (mid)
limit 10;
mid     title           year  genres      aid       role
------  --------------  ----  ----------  --------  --------------
35423   Kate & Leopold  2001  Comedy,Fan  212       ["Kate McKay"]
35423   Kate & Leopold  2001  Comedy,Fan  413168    ["Leopold"]
35423   Kate & Leopold  2001  Comedy,Fan  630       ["Stuart Besse
35423   Kate & Leopold  2001  Comedy,Fan  5227      ["Charlie McKa
66853   Na Boca da Noi  2016  Drama       180878    ["Vítor Hugo"
66853   Na Boca da Noi  2016  Drama       206883    ["Hugo"]
66853   Na Boca da Noi  2016  Drama       94426     \N
66853   Na Boca da Noi  2016  Drama       138681    \N
69049   The Other Side  2018  Drama       1379      ["Jake Hannafo
69049   The Other Side  2018  Drama       709947    ["John Dale"]

The second join is:

X join actors using (aid)

where X is the result of the first join. This indicates that rows from the first join should be combined with rows from actors when they share the same aid. Again, this has the effect of producing an intermediate table with one additional field, name.

sqlite> select *
from movies join cast using (mid) join actors using (aid)
limit 10;
mid     title           year  genres      aid       role            name
------  --------------  ----  ----------  --------  --------------  --------------------
35423   Kate & Leopold  2001  Comedy,Fan  212       ["Kate McKay"]  Meg Ryan
35423   Kate & Leopold  2001  Comedy,Fan  413168    ["Leopold"]     Hugh Jackman
35423   Kate & Leopold  2001  Comedy,Fan  630       ["Stuart Besse  Liev Schreiber
35423   Kate & Leopold  2001  Comedy,Fan  5227      ["Charlie McKa  Breckin Meyer
66853   Na Boca da Noi  2016  Drama       180878    ["Vítor Hugo"   Rubens Correia
66853   Na Boca da Noi  2016  Drama       206883    ["Hugo"]        Ivan de Albuquerque
66853   Na Boca da Noi  2016  Drama       94426     \N              Roberto Bonfim
66853   Na Boca da Noi  2016  Drama       138681    \N              Marilia Carneiro
69049   The Other Side  2018  Drama       1379      ["Jake Hannafo  John Huston
69049   The Other Side  2018  Drama       709947    ["John Dale"]   Robert Random

To obtain the results we’re interested in, sqlite searches the rows in the final intermediate table for those whose titles match “wonder woman”.