Monday, February 18, 2019

Combine Tables using Joins (SQLite3)

When designing a database we usually divide data between two or more tables to avoid duplication of data. To access information from such a DB we need a mechanism to pull the information between the tables and combine it. This is done through a Join.

A join combines information from two or more tables to create a new set of records, each of which can contain some or all of the information in the tables involved.

We will create a database pop.db which will contain two tables PopByRegion and PopByCountry so that we can use joins to combine the information they contain. The following code creates the required DB and the tables:

import sqlite3 as dbapi

con = dbapi.connect('pop.db')
cur = con.cursor()

cur.execute('CREATE TABLE PopByRegion(Region TEXT, Population INEGER)')

cur.execute('INSERT INTO PopByRegion VALUES("Central Africa", 330993)')
cur.execute('INSERT INTO PopByRegion VALUES("Southeastern Africa", 743112)')
cur.execute('INSERT INTO PopByRegion VALUES("Northern Africa", 1037463)')
cur.execute('INSERT INTO PopByRegion VALUES("Southern Asia", 2051941)')
cur.execute('INSERT INTO PopByRegion VALUES("Asia Pacific", 785468)')
cur.execute('INSERT INTO PopByRegion VALUES("Middle East", 687630)')
cur.execute('INSERT INTO PopByRegion VALUES("Eastern Asia", 1362955)')
cur.execute('INSERT INTO PopByRegion VALUES("South America", 593121)')
cur.execute('INSERT INTO PopByRegion VALUES("Eastern Europe", 223427)')
cur.execute('INSERT INTO PopByRegion VALUES("North America", 661157)')
cur.execute('INSERT INTO PopByRegion VALUES("Western Europe", 387933)')
cur.execute('INSERT INTO PopByRegion VALUES("Japan", 100562)')

con.commit()

Inserting data one row at a time like this requires a lot of typing. It is simpler to make a list of tuples to be inserted and write a loop that inserts the values from these tuples one by one using the placeholder notation. We will use this approach to create and populate the PopByCountry DB as shown in the following code:

import sqlite3 as dbapi

con = dbapi.connect('pop.db')
cur = con.cursor()

cur.execute('CREATE TABLE PopByCountry(Region TEXT, Country TEXT, Population INEGER)')

countries = [("Eastern Asia", "China", 1285238), ("Eastern Asia", "DPR Korea", 24056), ("Eastern Asia", "Hong Kong (China)", 8764), ("Eastern Asia", "Mongolia", 3407), ("Eastern Asia", "Republic of Korea", 41491), ("Eastern Asia", "Taiwan", 1433), ("North America", "Bahamas", 368), ("North America", "Canada", 40876), ("North America", "Greenland", 43), ("North America", "Mexico", 126875), ("North America", "United States", 493038)]
             
for c in countries:
    cur.execute('INSERT INTO PopByCountry VALUES (?, ?, ?)', (c[0], c[1], c[2]))

con.commit()

To check the tables in the DB use the following query:

SELECT name FROM sqlite_master WHERE type = "table" and you should see the following result:

name
PopByRegion
PopByCountry


Now that we have two tables in our database, we can use joins to combine the information they contain. We’ll begin with inner joins, which involve the following:

1. Constructing the cross product of the tables
2. Discarding rows that do not meet the selection criteria
3. Selecting columns from the remaining rows

First, all combinations of all rows in the tables are combined, which makes the cross product. Second, the selection criteria specified by WHERE is applied, and rows that don’t match are removed. Finally, the selected columns are kept, and all others are discarded.

Now let us get the names of the region and countries with projected populations greater than one million by using the inner join. The following query will create the required inner join :

SELECT PopByRegion.Region, PopByCountry.Country
FROM PopByRegion INNER JOIN PopByCountry
WHERE (PopByRegion.Region = PopByCountry.Region)
AND (PopByRegion.Population > 1000000)

1. We construct the cross product of the tables and combine every row of PopByRegion with every row of PopByCountry.

2. We discard rows that do not meet the selection criteria. The join’s WHERE clause specifies two of these: the region taken from PopByRegion must be the same as the region taken from PopByCountry, and the region’s population must be greater than one million.

3. At last,  we select  the  region  and  country  names  from  the  rows  that  have survived.

Let's use this query in the program which will print the names of the region and countries with projected populations greater than one million. See the code below:

import sqlite3 as dbapi

con = dbapi.connect('pop.db')

cur = con.cursor()

cur.execute('SELECT PopByRegion.Region, PopByCountry.Country FROM PopByRegion INNER JOIN PopByCountry WHERE (PopByRegion.Region = PopByCountry.Region)AND (PopByRegion.Population > 1000000)')
print(cur.fetchall())


The output of the program is shown below:

[('Eastern Asia', 'China'), ('Eastern Asia', 'DPR Korea'), ('Eastern Asia', 'Hon
g Kong (China)'), ('Eastern Asia', 'Mongolia'), ('Eastern Asia', 'Republic of Ko
rea'), ('Eastern Asia', 'Taiwan')]
------------------
(program exited with code: 0)

Press any key to continue . . .

If the query results contain duplicate values we can use the DISTINCT clause in the query to remove the duplicates. See the following query which find the regions where one country accounts for more than 10 percent of the region’s overall population:

SELECT PopByRegion.Region
FROM PopByRegion INNER JOIN PopByCountry
WHERE (PopByRegion.Region = PopByCountry.Region)
AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)

Let's use this query in the program which will print the  regions where one country accounts for more than 10 percent of the region’s overall population:

import sqlite3 as dbapi

con = dbapi.connect('pop.db')

cur = con.cursor()

cur.execute('SELECT PopByRegion.Region FROM PopByRegion INNER JOIN PopByCountry WHERE (PopByRegion.Region = PopByCountry.Region) AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)')

print(cur.fetchall())

The output of the program is shown below:

[('Eastern Asia',), ('North America',), ('North America',)]

------------------
(program exited with code: 0)

Press any key to continue . . .

As we can see output contains a duplicate value North America thus we'll use DISTINCT clause in the query and execute it again as shown below:

SELECT DISTINCT PopByRegion.Region
FROM PopByRegion INNER JOIN PopByCountry
WHERE (PopByRegion.Region = PopByCountry.Region)
AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)

We use multiplication and division in our WHERE condition to calculate the percentage of the region’s population by country as a floating-point number.Let's use this query in the program which will print the  regions where one country accounts for more than 10 percent of the region’s overall population:

import sqlite3 as dbapi

con = dbapi.connect('pop.db')

cur = con.cursor()

cur.execute('SELECT DISTINCT PopByRegion.Region FROM PopByRegion INNER JOIN PopByCountry WHERE (PopByRegion.Region = PopByCountry.Region) AND ((PopByCountry.Population * 1.0) / PopByRegion.Population > 0.10)')

print(cur.fetchall())

The output of the program is shown below which doesn't contain any duplicates:

[('Eastern Asia',), ('North America',)]
------------------
(program exited with code: 0)

Press any key to continue . . .

Here I am ending today's discussion. In the next post we'll focus on Keys and Constraints. Till we meet next keep practicing and learning Python as Python is easy to learn!








Share:

0 comments:

Post a Comment