In this report I will analyse the following Gapminder datasets showing the development of countries over a period of time:
I will use this data to answer the following questions about the economic and social development of the world:
The questions are listed in detail in the introduction to the Exploratory Data Analysis.
For the analysis I will take only data for 2010 and 2018. For some datasets the information for these years is not available. In these cases I will use the closest year for which there is information available.
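The "closest available year" rule can be sketched as a small helper. This is only an illustration: `closest_year_column` is a hypothetical function that is not used elsewhere in the report, and it assumes that year columns are named with digits only (for example '2015'). Ties are resolved in favour of the earlier year.

```python
def closest_year_column(columns, target_year):
    """Return the name of the year column closest to target_year.

    Hypothetical helper: assumes year columns have purely numeric names.
    """
    years = sorted(int(c) for c in columns if str(c).isdigit())
    if not years:
        raise ValueError("no year columns found")
    # Sort key: distance to the target first, then the earlier year on ties
    best = min(years, key=lambda y: (abs(y - target_year), y))
    return str(best)


# Made-up list of columns, only to show the behaviour
available = ['country', '2010', '2012', '2015', '2017']
col_2018 = closest_year_column(available, 2018)
print(col_2018)
```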
In some parts of the analysis I will pay special attention to the following three countries that are of most interest to me:
The full datasets are available on the Gapminder website at: https://www.gapminder.org/data/
# Import the packages that will be used in the analysis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from functools import reduce
%matplotlib inline
sns.set_style('darkgrid')
The datasets I chose from the Gapminder site did not have the 'continent' column. So I downloaded an Excel file with country and region information from another Gapminder source at: https://www.gapminder.org/data/geo/.
In Excel I then deleted the columns I did not need and saved the file in csv format. I noticed that the resulting file uses ';' as the separator.
df_region = pd.read_csv('list_of_countries.csv', sep = ';')
df_region.head(3)
I think that the division by 8 regions will be most representative. So I will choose this column to be included in my future combined dataset.
df_region.eight_regions.unique()
# Take the required columns and rename the regions to be used subsequently in plots
df_region = df_region[['name', 'eight_regions']].rename(
    columns={'name': 'country', 'eight_regions': 'region'})
# Assign the result back: inplace=True on a single column may not update the dataframe in recent pandas versions
df_region['region'] = df_region['region'].replace({
    'asia_west': 'Western Asia',
    'europe_east': 'Eastern Europe',
    'africa_north': 'Northern Africa',
    'europe_west': 'Western Europe',
    'africa_sub_saharan': 'Sub Saharan Africa',
    'america_north': 'North America',
    'america_south': 'South America',
    'east_asia_pacific': 'East Asia Pacific',
})
df_region.head(3)
df_region['region'].unique()
Since I chose many separate datasets, I will wrap the csv import procedure into a function and reuse it throughout.
# Write a function to load a csv file into a dataframe and rename the year columns
def import_csv(file, cols, old_col1, new_col1, old_col2, new_col2):
    """Load selected columns of a csv file and rename the two year columns.

    file     - name of the csv file
    cols     - list of columns that will be taken from the file
    old_col1 - old name of the first year column
    new_col1 - new name of the first year column
    old_col2 - old name of the last year column
    new_col2 - new name of the last year column
    """
    df = pd.read_csv(file)
    df = df[cols]
    # Return a renamed copy instead of renaming a slice in place
    return df.rename(columns={old_col1: new_col1, old_col2: new_col2})
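Load the World Bank's dataset of GNI (Gross National Income) per capita based on PPP (Purchasing Power Parity), converted to international dollars. A higher value indicates greater well-being of the nation. For brevity, the columns are named gdp_2010 and gdp_2018, and the rest of the report refers to this indicator as GDP per capita.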
df_gdp = import_csv('gnipercapita_ppp_current_international.csv', ['country', '2010','2018'], '2010','gdp_2010','2018','gdp_2018')
df_gdp.head(3)
Load the dataset of Life Expectancy. The indicator shows the average number of years a newborn child is expected to live if current mortality patterns stay the same.
df_le = import_csv('life_expectancy_years.csv', ['country', '2010','2018'], '2010','le_2010','2018','le_2018')
df_le.head(3)
Load the dataset of Population Growth. The indicator shows the percentage growth compared to the previous year's population.
df_pg = import_csv('population_growth_annual_percent.csv', ['country', '2010','2018'], '2010','pg_2010','2018','pg_2018')
df_pg.head(3)
Load the dataset of the income inequality index (Gini). A higher number means more inequality.
df_ineq = import_csv('gini.csv', ['country', '2010','2018'], '2010','ineq_2010','2018','ineq_2018')
df_ineq.head(3)
Load the dataset of the Human Development Index. The index is based on three dimensions: health level, educational level and living standard. A higher index indicates greater well-being of the nation.
df_hdi = import_csv('hdi_human_development_index.csv', ['country', '2010','2015'], '2010','hdi_2010','2015','hdi_2015')
df_hdi.head(3)
Load the dataset of the Democracy Index. The index is prepared by the Economist Intelligence Unit based on 60 different aspects of society. It ranges from 0 to 100, and a higher index indicates a stronger democracy. The Gapminder database shows the index as a proportion (between 0 and 1).
df_demox = import_csv('demox_eiu.csv', ['country', '2010','2018'], '2010','demox_2010','2018','demox_2018')
df_demox.head(3)
Load the dataset of Government Military Expenditures as a percentage of GDP.
df_gms = import_csv('military_expenditure_percent_of_gdp.csv', ['country', '2010','2018'], '2010','gms_2010','2018','gms_2018')
df_gms.head(3)
Load the dataset of Transparency International score of perceived corruption. Higher values indicate less corruption.
df_pci = import_csv('corruption_perception_index_cpi.csv', ['country', '2012','2017'], '2012','pci_2012','2017','pci_2017')
df_pci.head(3)
I will merge the dataframes on the 'country' column. For the merger I will use the reduce function.
# Create the list of all dataframes:
dataframes = [df_region, df_gdp, df_le, df_pg, df_ineq, df_hdi, df_demox, df_gms, df_pci]
df_gap = reduce(lambda left,right: pd.merge(left,right,on=['country'],
how='outer'), dataframes)
df_gap.head(3)
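To make the mechanics of the reduce-based outer merge easier to follow, here is a minimal sketch on three toy dataframes. The frames and their values are made up for illustration only. With an outer merge, every country that appears in at least one frame is kept, and the indicators it lacks become NaN.

```python
from functools import reduce

import pandas as pd

# Toy frames with made-up values, only to show the mechanics
df_a = pd.DataFrame({'country': ['A', 'B'], 'x': [1, 2]})
df_b = pd.DataFrame({'country': ['B', 'C'], 'y': [3, 4]})
df_c = pd.DataFrame({'country': ['A', 'C'], 'z': [5, 6]})

# reduce merges the first two frames, then merges the result with the third
merged = reduce(
    lambda left, right: pd.merge(left, right, on='country', how='outer'),
    [df_a, df_b, df_c],
)
print(merged)

# Each country is missing exactly one indicator in this toy example
n_missing = int(merged.isna().sum().sum())
print(n_missing)
```

This is also why the combined dataset needs a cleaning step afterwards: the outer merge never drops a country, it only leaves gaps.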
I noticed that some indicators, including:
are expressed as proportions. To bring all indicators to the same scale, I will multiply the proportion-expressed indicators by 100.
# Write a function to convert proportions to percentages
def to_percentage(df, list_of_columns):
    # Note: the dataframe is modified in place, so this should be run only once
    for column in list_of_columns:
        df[column] = 100.00 * df[column]
    return df
# Convert the proportion-based indicators to percentage-based indicators
list_of_columns = ['pg_2010', 'pg_2018', 'hdi_2010', 'hdi_2015', 'demox_2010', 'demox_2018', 'gms_2010', 'gms_2018']
to_percentage(df_gap, list_of_columns)
df_gap.head(3)
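One caveat of an in-place conversion is that re-running the notebook cell multiplies the columns by 100 a second time. A possible alternative, sketched here on made-up data, returns a converted copy and leaves the input untouched. `to_percentage_copy` is a hypothetical variant, not the function used in this report.

```python
import pandas as pd


def to_percentage_copy(df, columns):
    """Return a new dataframe with the given columns multiplied by 100 (hypothetical variant)."""
    result = df.copy()
    result[columns] = result[columns] * 100.0
    return result


# Made-up values for illustration
toy = pd.DataFrame({'country': ['A', 'B'], 'hdi_2010': [0.5, 0.75]})
converted = to_percentage_copy(toy, ['hdi_2010'])
print(converted)
```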
df_gap.info()
df_gap_clean = df_gap.dropna()
df_gap_clean.info()
df_gap_clean.head(3)
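Since dropna removes every row that has at least one missing value, it can be useful to check which columns are responsible for the lost rows. A minimal sketch on a toy dataframe with made-up values:

```python
import numpy as np
import pandas as pd

# Toy data: two different countries each miss one indicator
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'gdp_2018': [1000.0, np.nan, 3000.0, 4000.0],
    'le_2018': [60.0, 70.0, np.nan, 80.0],
})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
rows_before = len(toy)
rows_after = len(toy.dropna())         # rows that have no NaN at all

print(missing_per_column)
print(rows_before, rows_after)
```

In this example each column misses only one value, yet half of the rows are dropped, because the gaps fall in different countries.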
In this section I will use the clean dataset to answer the following questions about the economic and social development of the world, with a particular focus on Russia, Germany and the US:
Column name | Description | Explanation |
---|---|---|
country | Country | Country |
region | Region | One of 8 regions |
gdp_2010 | GNI per capita, 2010 | Gross National Income per capita based on Purchasing Power Parity |
gdp_2018 | GNI per capita, 2018 | |
le_2010 | life expectancy, 2010 | Life expectancy in years |
le_2018 | life expectancy, 2018 | |
pg_2010 | population growth, 2010 | Population growth as a % to the previous year's population |
pg_2018 | population growth, 2018 | |
ineq_2010 | inequality index, 2010 | A higher index shows more inequality |
ineq_2018 | inequality index, 2018 | |
hdi_2010 | human development index, 2010 | A higher index shows better human development |
hdi_2015 | human development index, 2015 | |
demox_2010 | democracy index, 2010 | A higher index shows more democracy |
demox_2018 | democracy index, 2018 | |
gms_2010 | government military spending, 2010 | Military expenses as % of GDP |
gms_2018 | government military spending, 2018 | |
pci_2012 | perceived corruption index, 2012 | A higher index shows less corruption |
pci_2017 | perceived corruption index, 2017 | |
df_gap_clean.describe()
From the summary information we can see the following tentative changes in the indicators in 2018 (2015 for human development index and 2017 for perceived corruption index) compared to 2010:
In the plot below I will visualise the changes in the average GDP per capita in 2018 compared to 2010.
gdp_2010_mean = df_gap_clean['gdp_2010'].mean()
gdp_2018_mean = df_gap_clean['gdp_2018'].mean()
x = ['2010','2018']
y = [gdp_2010_mean, gdp_2018_mean]
plt.xlabel('Year')
plt.ylabel('Mean GDP per capita, USD')
plt.title('Change in the average GDP per capita')
plt.bar(x, y)
plt.show()
Conclusion:
print('We can see that the world average GDP per capita has grown from 2010 to 2018 by: {} USD.'.format(round(gdp_2018_mean - gdp_2010_mean)))
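The absolute difference printed above can also be expressed as a relative change. A minimal sketch, using made-up numbers rather than the actual means; `relative_change` is a hypothetical helper that is not part of the analysis:

```python
def relative_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return 100.0 * (new - old) / old


# Made-up values for illustration only
example_growth = relative_change(200.0, 250.0)
print('Relative change: {} %'.format(example_growth))
```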
In this section I will plot a bar chart showing the growth in the average income per region in 2018 compared to 2010.
gdp_region_2018 = df_gap_clean.groupby('region')['gdp_2018'].mean()
gdp_region_2010 = df_gap_clean.groupby('region')['gdp_2010'].mean()
gdp_region_2018
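The two groupby calls above can also be combined into a single table, which makes it easy to add a column with the change per region. A minimal sketch on toy data; the region names and values are made up:

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'gdp_2010': [10.0, 20.0, 5.0],
    'gdp_2018': [14.0, 26.0, 7.0],
})

# Mean of both year columns per region in one step
region_means = toy.groupby('region')[['gdp_2010', 'gdp_2018']].mean()
region_means['change'] = region_means['gdp_2018'] - region_means['gdp_2010']
print(region_means)
```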
# Function to plot two series of categorical data as grouped bars
def category_bars(df_1, df_2, x_label, y_label, plot_title):
    ind = df_1.index
    bar_places = np.arange(len(df_1))
    width = 0.35
    plt.subplots(figsize=(12, 5))
    plt.bar(bar_places, df_1, width, color='r', alpha=.7, label='2010')
    plt.bar(bar_places + width, df_2, width, color='g', alpha=.7, label='2018')
    # title and labels
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title(plot_title)
    locations = bar_places + width / 2  # xtick locations
    labels = ind  # xtick labels
    plt.xticks(locations, labels)
    # legend
    plt.legend()
    plt.show()
category_bars(gdp_region_2010, gdp_region_2018, 'Regions', 'Mean GDP per capita, USD', 'Comparison of GDP per capita by region by year')
Conclusion: Western Europe seems to have the highest GDP per capita, followed by East Asia Pacific. As expected, Africa shows the lowest GDP per capita.
Let's make a bar plot to see how the regions differ by population growth.
gdp_region_pg_2018 = df_gap_clean.groupby('region')['pg_2018'].mean()
gdp_region_pg_2010 = df_gap_clean.groupby('region')['pg_2010'].mean()
gdp_region_pg_2018
category_bars(gdp_region_pg_2010, gdp_region_pg_2018, 'Regions', 'Population growth, % to previous year', 'Comparison of population growth by region by year')
Conclusion: Western Asia and Africa show the highest population growth. All other regions show positive population growth except Eastern Europe, where population growth is negative.
is_three_countries = df_gap_clean['country'].isin(['Russia', 'Germany', 'United States'])
df_gap_three_countries = df_gap_clean[is_three_countries]
gdp_2010_three_countries = df_gap_three_countries['gdp_2010']
gdp_2018_three_countries = df_gap_three_countries['gdp_2018']
gdp_2018_three_countries
ind_three_countries = df_gap_three_countries['country'].tolist()  # labels taken in the same order as the data rows
bar_places = np.arange(len(gdp_2018_three_countries))
width = 0.35
plt.subplots(figsize=(12, 5))
bars_2010 = plt.bar(bar_places, gdp_2010_three_countries, width, color='r', alpha=.7, label='2010')
bars_2018 = plt.bar(bar_places + width, gdp_2018_three_countries, width, color='g', alpha=.7, label = '2018')
# title and labels
plt.ylabel('GDP per capita, USD')
plt.xlabel('Three countries')
plt.title('Comparison of GDP per capita by country by year')
locations = bar_places + width / 2 # xtick locations
labels = ind_three_countries # xtick labels
plt.xticks(locations, labels)
plt.legend()
plt.show();
Conclusion: Among the three countries, Russia has the lowest GDP per capita, Germany is in second place and the United States has the highest GDP per capita. Both Germany and the United States are far ahead of Russia.
Let's see the changes in the social indicators on a bar chart.
These indicators include:
To see the changes, let's split our dataset in two, one for 2010 (or 2012) and the other one for 2018 (or 2015, or 2017).
# Create the dataset for 2010 with selected columns
columns_2010 = ['ineq_2010','hdi_2010','demox_2010','pci_2012']
# Rename on the selection directly to avoid modifying a slice in place
df_2010 = df_gap_clean[columns_2010].rename(columns={'ineq_2010': 'Inequality', 'hdi_2010': 'Human Development', 'demox_2010': 'Democracy', 'pci_2012': 'Corruption'})
df_2010_mean = df_2010.mean()
df_2010_mean
# Create the dataset for 2018
columns_2018 = ['ineq_2018','hdi_2015','demox_2018','pci_2017']
# Rename on the selection directly to avoid modifying a slice in place
df_2018 = df_gap_clean[columns_2018].rename(columns={'ineq_2018': 'Inequality', 'hdi_2015': 'Human Development', 'demox_2018': 'Democracy', 'pci_2017': 'Corruption'})
df_2018_mean = df_2018.mean()
df_2018_mean
category_bars(df_2010_mean, df_2018_mean, 'Indicators', 'Mean value', 'Mean indicators in 2010 and 2018')
Conclusion: As we can see from the bar chart, the only noticeable change was in the Human Development Index. All other indices stayed at roughly the same level during the period under analysis.
To answer these questions I will draw scatter plots to see if there is any apparent correlation. On the plots I will pay special attention to where Russia, Germany and the United States stand.
Since the mechanism of drawing the scatter plots will be identical, I will wrap it up in a function.
# Write a function to show a scatter plot with two indicators
def scatter_plot(df, xcol, ycol, country, x_label, y_label):
    plt.figure(figsize=(12, 7))
    plt.scatter(df[xcol], df[ycol], c="b", s=3**2, zorder=2)
    # Highlight and label the three countries of interest
    selected_countries = ['Germany', 'Russia', 'United States']
    mask = df[country].isin(selected_countries)
    plt.scatter(df.loc[mask, xcol], df.loc[mask, ycol], c="r", s=5**2, zorder=3)
    for (x, y, label) in df.loc[mask, [xcol, ycol, country]].values:
        plt.text(x, y, label, zorder=4)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title('The correlation between {} and {}'.format(x_label, y_label))
    plt.show()
In the scatter plot below I will analyse the correlation between the GDP per capita and life expectancy. I am inclined to say that there should be a positive correlation: greater well-being should increase life expectancy.
scatter_plot(df_gap_clean, 'le_2018', 'gdp_2018', 'country', 'Life Expectancy', 'GDP per Capita')
Conclusion: Indeed, there appears to be a positive correlation between the GDP per capita and life expectancy. Russia has considerably shorter life expectancy and smaller GDP per capita compared to Germany and the US. Germany has the highest life expectancy among the three countries and the US has the highest GDP per capita.
On the next scatter plot I will check if there is a correlation between the GDP per capita and the level of democracy. In my opinion there should be a positive correlation.
scatter_plot(df_gap_clean, 'demox_2018', 'gdp_2018', 'country', 'Democracy Index', 'GDP per Capita')
Conclusion: Indeed, the scatter plot also shows a positive correlation between the GDP per capita and the level of democracy, although there are many outliers. Russia is at the lower end of this plot, with a very low democracy index. Germany has the highest democracy index among the three countries.
On the next scatter plot I will check if there is a correlation between the GDP per capita and the level of corruption. In my opinion there should be a positive correlation between the GDP per capita and the Corruption Perception Index, that is, richer countries should be perceived as less corrupt.
scatter_plot(df_gap_clean, 'pci_2017', 'gdp_2018', 'country', 'Corruption Perception Index', 'GDP per Capita')
Conclusion: There is a tentative positive correlation between the GDP per capita and the Corruption Perception Index. Remember that a higher index shows less corruption. Again Russia is on the lower end of the sample, and Germany again has the highest Corruption Perception Index, which shows the least corruption among the three countries.
On the next scatter plot I will check if there is a correlation between the human development index and the democracy index. In my opinion, there should be a clear positive correlation between the two indices.
scatter_plot(df_gap_clean, 'hdi_2015', 'demox_2018', 'country', 'Human Development Index', 'Democracy Index')
Conclusion: The positive correlation is not that obvious. A positive correlation may be seen but there are many outliers and the countries are scattered quite a lot. Russia is among the outliers, with a very low democracy index and a rather high human development index.
On the next scatter plot I want to check if there is a correlation between the level of government military expenditures and the democracy index. I would think that more democratic countries would spend less on military expenses.
scatter_plot(df_gap_clean, 'gms_2018', 'demox_2018', 'country', 'Government Military Spending', 'Democracy Index')
Conclusion: There seems to be a very vague negative correlation: I don't see a clear dependency between the level of democracy and the military expenditures. Russia has the lowest democracy level among the three countries but the highest military expenditures as a percentage of GDP.
On the next scatter plot I want to check if there is a correlation between the level of inequality and the democracy index. I would think that more democratic countries would have less inequality in income distribution.
scatter_plot(df_gap_clean, 'ineq_2018', 'demox_2018', 'country', 'Inequality Index', 'Democracy Index')
Conclusion: There seems to be no correlation between the democracy level and the level of income inequality. Russia has a low democracy index and lies between Germany and the US in terms of income inequality.
Let's plot some visualisations to see the distribution of income in 2018 among the countries. I will start with a simple histogram.
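The visual impressions from the scatter plots could be complemented with correlation coefficients. A minimal sketch using the pandas corr method on toy data; the values are made up so that one pair of columns is perfectly positively related and another perfectly negatively related. On the real dataset the coefficients would lie somewhere between these extremes, and they would still only describe linear association, not causation.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    'gdp_2018': [1.0, 2.0, 3.0, 4.0],
    'le_2018': [2.0, 4.0, 6.0, 8.0],   # rises with gdp_2018
    'gms_2018': [4.0, 3.0, 2.0, 1.0],  # falls as gdp_2018 rises
})

# Pairwise Pearson correlation coefficients between all numeric columns
pearson = toy.corr(method='pearson')
print(pearson)
```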
# Plot the distribution of GDP per capita
# The Series gets its own name so that it does not overwrite the df_gdp dataframe loaded earlier
gdp_2018 = df_gap_clean['gdp_2018']
gdp_2018.hist()
plt.title('Distribution of income in the world')
plt.xlabel('Income, USD')
plt.ylabel('Number of countries')
plt.show()
gdp_2018.describe()
From the histogram and the summary above of the 2018 GDP per capita I can see that the distribution of income is far from normal: it is strongly right-skewed. The mean income is around 22K and the standard deviation is almost as large as the mean. The minimum income is only 900, while the maximum income is as much as 95K. The median income is 15.7K, well below the mean, which is consistent with the right skew.
Conclusion: This shows great inequality among the countries in terms of income distribution.
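The right skew can also be checked numerically: in a right-skewed distribution the mean is pulled above the median by a few very large values, and the sample skewness is positive. A minimal sketch on made-up numbers, not on the actual income data:

```python
import pandas as pd

# Made-up incomes with one very large value
incomes = pd.Series([1.0, 2.0, 2.0, 3.0, 20.0])

mean_income = incomes.mean()
median_income = incomes.median()
skewness = incomes.skew()  # positive for a right-skewed sample

print(mean_income, median_income, skewness)
```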
Let's see which countries are at the very bottom and at the very top of the food chain.
# Identify low income countries in the lowest quartile:
low_income = df_gap_clean.query('gdp_2018 < 5555.0')
low_income['country'].count()
# List the poorest 5 countries
sort_by_income = df_gap_clean.sort_values('gdp_2018')
sort_by_income.head(5)
poorest_country = df_gap_clean.query('gdp_2018 == 900.0')
poorest_country
Conclusion: There are as many as 33 countries in the lowest quartile, with income of less than 5.5K. The poorest country is the Democratic Republic of Congo with GDP per capita of USD 900.
# Identify high-income countries in the highest quartile:
high_income = df_gap_clean.query('gdp_2018 >= 33700.00')
high_income['country'].count()
# List the richest 5 countries
sort_by_income = df_gap_clean.sort_values('gdp_2018', ascending = False)
sort_by_income.head(5)
richest_country = df_gap_clean.query('gdp_2018 == 94700.00')
richest_country
Conclusion: There are as many as 33 countries in the highest quartile, with income of more than 33.7K. The richest country is Singapore with GDP per capita of USD 94.7K.
In the plot below I will compare the inequality index by region to identify the regions with the highest income inequality within countries.
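The quartile thresholds and the extreme values used in the queries above were typed in by hand from the describe output. A possible alternative, sketched here on toy data with made-up values, computes them with quantile and picks the extremes with nsmallest and nlargest, so the code keeps working if the data changes.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'gdp_2018': [900.0, 5000.0, 15000.0, 40000.0, 95000.0],
})

# Quartile boundaries computed from the data instead of hard-coded
q1 = toy['gdp_2018'].quantile(0.25)
q3 = toy['gdp_2018'].quantile(0.75)

low_income = toy[toy['gdp_2018'] < q1]
high_income = toy[toy['gdp_2018'] >= q3]

# Poorest and richest country without comparing floats for equality
poorest = toy.nsmallest(1, 'gdp_2018')
richest = toy.nlargest(1, 'gdp_2018')

print(q1, q3)
print(poorest['country'].iloc[0], richest['country'].iloc[0])
```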
gdp_region_ineq_2018 = df_gap_clean.groupby('region')['ineq_2018'].mean()
gdp_region_ineq_2010 = df_gap_clean.groupby('region')['ineq_2010'].mean()
gdp_region_ineq_2018
ind_region_ineq = gdp_region_ineq_2018.index
bar_places = np.arange(len(gdp_region_ineq_2018))
width = 0.35
plt.subplots(figsize=(12, 5))
bars_2010 = plt.bar(bar_places, gdp_region_ineq_2010, width, color='r', alpha=.7, label='2010')
bars_2018 = plt.bar(bar_places + width, gdp_region_ineq_2018, width, color='g', alpha=.7, label = '2018')
# title and labels
plt.ylabel('Inequality index, % of 100%')
plt.xlabel('Regions')
plt.title('Comparison of Inequality Index by region by year')
locations = bar_places + width / 2 # xtick locations
labels = ind_region_ineq # xtick labels
plt.xticks(locations, labels)
plt.legend()
plt.show();
Conclusion: South America, North America and Sub-Saharan Africa show the highest income inequality within countries. Europe seems to have the least income inequality.
Based on Gapminder data and my analysis of it, I summarised below the conclusions regarding the development of the society in 2018 compared to 2010. In the analysis I used the methods of descriptive statistics and visualisations. So the conclusions regarding potential correlations between indicators are tentative and may deviate from the actual state of affairs.
Section 1: General developments in the world in 2018 compared to 2010
GDP per capita has noticeably increased, by 5,353 USD on average. Western Europe seems to have the highest GDP per capita, followed by East Asia Pacific. Africa shows the lowest GDP per capita. Western Asia and Africa have shown the highest population growth. Eastern Europe has negative population growth. The only noticeable increase was in the Human Development Index. The levels of democracy, corruption and inequality stayed roughly the same. Among the three countries, Russia has the lowest GDP per capita, Germany is in second place and the United States has the highest GDP per capita.
Section 2: Are there tentative correlations between certain indicators?
There appear to be positive correlations between GDP per capita and life expectancy, and between GDP per capita and the level of democracy. Russia has considerably shorter life expectancy and smaller GDP per capita compared to Germany and the US. Russia shows a very low democracy index, while Germany has the highest democracy index among the three countries. There is a tentative positive correlation between GDP per capita and the Corruption Perception Index, that is, richer countries tend to be perceived as less corrupt. Russia is on the lower end of the sample, while Germany shows the lowest level of corruption among the three countries. There seem to be no obvious correlations between human development and democracy, between democracy and military expenditures, or between democracy and inequality. Russia is among the outliers, with a very low democracy index, a rather high human development index and high military expenditures.
Section 3: What are the developments in the distribution of income in the world?
There is great inequality among the countries in terms of income distribution. 33 countries are in the lowest quartile, with income of less than 5.5K. The poorest country is the Democratic Republic of Congo with GDP per capita of USD 900. Another 33 countries are in the highest quartile, with income of more than 33.7K. The richest country is Singapore with GDP per capita of USD 94.7K. South America, North America and Sub-Saharan Africa show the highest income inequality within countries. Europe seems to have the least income inequality.