Data Science

Practical 1

Aim: data exploration and visualization using mathematical and statistical tools.

Theory:

This practical aims to explore and visualize data using a programming language such as Python, PHP or Java, together with mathematical and statistical tools such as Tableau, Matplotlib and Seaborn. The following steps are performed on the data:

- Read the CSV file.

- Clean the data.

- Apply mathematical and statistical techniques (mean, mode, median, summation, groupby and standard deviation) using the NumPy and Pandas libraries.

- Visualize the relations and distributions of the data using Plot and Graph techniques.
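The steps listed above can be sketched on a tiny in-memory table before working with the real files. This is only an illustration: the CSV text, the country names A, B, C and the variable names `sample`, `summary` and `per_country` are made up for the example and are not part of the UN data.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for a downloaded file.
csv_text = """Country,Years,Rate
A,1950-1955,6.0
A,1955-1960,5.0
B,1950-1955,4.0
B,1955-1960,
C,1950-1955,4.0
"""

sample = pd.read_csv(io.StringIO(csv_text))   # 1. read the CSV
sample = sample.dropna(subset=["Rate"])       # 2. clean: drop rows with no rate

# 3. basic statistics on the cleaned column
summary = {
    "mean": sample["Rate"].mean(),
    "median": sample["Rate"].median(),
    "mode": sample["Rate"].mode().iloc[0],
    "sum": sample["Rate"].sum(),
    "std": sample["Rate"].std(ddof=0),        # population standard deviation
}
per_country = sample.groupby("Country")["Rate"].mean()
print(summary)
print(per_country)
```

The visualization step is left out here so that the sketch depends only on Pandas.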

Practical:

Details of dataset:

The datasets are downloaded from UNdata (https://data.un.org/):

1. Total Fertility Rate (Live births per woman)

2. Average Income per person Total population, both sexes combined (Income in thousands)

Step: 1: Reading Dataset

First, apply the following filters before downloading the fertility rate dataset:

i. Select years from 1950 to 2010.

ii. Select all countries. (Default selected)

Apply the following filters before downloading the income data:

i. Deselect the current filters (High-Income to Upper-middle-income countries).

ii. Select years from 1950 to 2010.

iii. Select all countries. (Default selected)

Here, Jupyter Notebook is used to perform all the exploring, cleaning and visualizing tasks in the Python programming language.

Import the following Python libraries:

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from scipy import stats

Read the dataset saved in the current working directory.

df = pd.read_csv("UN_Fertility_Rate_Data.csv")

df.head()

	Country or Area	Year(s)	Variant	Value
0	Afghanistan	2005-2010	Medium	6.478
1	Afghanistan	2000-2005	Medium	7.182
2	Afghanistan	1995-2000	Medium	7.654
3	Afghanistan	1990-1995	Medium	7.482
4	Afghanistan	1985-1990	Medium	7.469

Step: 2: Explore the dataset.

df.columns

Index(['Country or Area', 'Year(s)', 'Variant', 'Value'], dtype='object')

df['Variant'].unique()

array(['Medium'], dtype=object)

df['Year(s)'].unique()

array(['2005-2010', '2000-2005', '1995-2000', '1990-1995', '1985-1990',

'1980-1985', '1975-1980', '1970-1975', '1965-1970', '1960-1965',

'1955-1960', '1950-1955'], dtype=object)

df = df.rename(columns={'Country or Area':'Country','Year(s)':'Years','Value':'Rate'})

df = df.drop(['Variant'], axis = 1)

df2 = df

df

	Country	Years	Rate
0	Afghanistan	2005-2010	6.478
1	Afghanistan	2000-2005	7.182
2	Afghanistan	1995-2000	7.654
3	Afghanistan	1990-1995	7.482
4	Afghanistan	1985-1990	7.469
...	...	...	...
3463	Zimbabwe	1970-1975	7.400
3464	Zimbabwe	1965-1970	7.400
3465	Zimbabwe	1960-1965	7.300
3466	Zimbabwe	1955-1960	7.000
3467	Zimbabwe	1950-1955	6.800

3468 rows × 3 columns

df['Index'] = df.index

df = df.set_index(['Country','Index'])

a = [1,2,3,4,5,6,7,8,9,10,11,12]

a = a*289

df['Index'] = a

Step: 3: Plot the graph of Rate vs. Years.

df2 = df

plt.figure(figsize=(14,6))

sns.lineplot(y = df['Rate'], x = df['Years'])

<matplotlib.axes._subplots.AxesSubplot at 0x55ddfa0>

Conclusion: This graph shows that the fertility rate per woman decreased continuously over the years 1950 to 2010.

Step: 4: Pivot the table to make Country the index, Years the columns and Rate the values.
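The pivot call itself is not shown in this step. A minimal sketch of what such a pivot could look like is given here, assuming a long-format table with the renamed columns Country, Years and Rate. The toy values and the names `long_df` and `wide_df` are illustrative only.

```python
import pandas as pd

# Toy long-format table with the same column names as df after renaming.
long_df = pd.DataFrame({
    "Country": ["Country A", "Country A", "Country B", "Country B"],
    "Years":   ["1950-1955", "1955-1960", "1950-1955", "1955-1960"],
    "Rate":    [7.0, 7.5, 2.0, 2.5],
})

# One row per country, one column per five-year period.
# Passing values as a list keeps a two-level column index such as
# ('Rate', '1950-1955').
wide_df = long_df.pivot_table(index="Country", columns="Years", values=["Rate"])
print(wide_df)

# Selecting one country by its index label returns a one-row frame.
print(wide_df.loc[["Country A"]])
```

With Country as the index, a single country can be selected by label, which is what the bar plot of Afghanistan in this step relies on.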

Below is the graph of the fertility rate over the years 1950 to 2010 for Afghanistan, with one bar for each 5-year span. It shows that from 1950 to 1995 the fertility rate was almost constant, then it increased slightly until 2000, and after that it decreased.

df2.loc[['Afghanistan']].plot(kind ='bar',figsize=(14,6))

<matplotlib.axes._subplots.AxesSubplot at 0x1354b550>

Step: 5: Cleaning the Dataset.

df2.columns.values

df3 = pd.DataFrame(

    np.arange(24).reshape(2, 12),

    columns=[('Rate', '1950-1955'), ('Rate', '1955-1960'),

       ('Rate', '1960-1965'), ('Rate', '1965-1970'),

       ('Rate', '1970-1975'), ('Rate', '1975-1980'),

       ('Rate', '1980-1985'), ('Rate', '1985-1990'),

       ('Rate', '1990-1995'), ('Rate', '1995-2000'),

       ('Rate', '2000-2005'), ('Rate', '2005-2010')])

df3.rename(columns='_'.join, inplace=True)

x = df3.columns

df2.columns = x

df2

248 rows × 12 columns
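The cleaning step above builds a placeholder dataframe only to obtain flattened column names. As a hedged alternative, the same labels can be produced directly from the two-level columns with a list comprehension. The frame `wide` and its values are toy data for illustration.

```python
import pandas as pd

# Toy frame with two-level columns like those produced by
# pivot_table(..., values=['Rate']).
cols = pd.MultiIndex.from_tuples([("Rate", "1950-1955"), ("Rate", "1955-1960")])
wide = pd.DataFrame([[6.8, 7.0], [2.2, 2.3]], index=["X", "Y"], columns=cols)

# Join each (level0, level1) tuple into a single string label.
wide.columns = ["_".join(col) for col in wide.columns]
print(list(wide.columns))
```

This avoids typing the period labels by hand, so the names cannot drift out of step with the actual columns.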

Step: 6: Below is the graph of the fertility rate values of all countries for 1955-1960 vs. 1950-1955. The points lie close to a line at roughly 45 degrees to both axes. This means that the regression line has a positive slope and that x and y have similar values. So we can conclude that from 1950-1955 to 1955-1960 there was only a slight or negligible change in the fertility rate of the countries.
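The 45-degree reading can also be checked numerically. The sketch below uses made-up rates for five countries (not the UN values) and computes the fitted slope, the correlation and the mean absolute change between the two periods. A slope near 1 with a small mean change corresponds to points lying close to the line y = x.

```python
import numpy as np

# Hypothetical fertility rates for five countries in two adjacent periods.
rate_1950_1955 = np.array([2.5, 3.8, 5.1, 6.4, 7.2])
rate_1955_1960 = np.array([2.6, 3.7, 5.0, 6.5, 7.1])

# Least-squares line y = slope * x + intercept through the points,
# with 1955-1960 on the x-axis as in the scatter plot.
slope, intercept = np.polyfit(rate_1955_1960, rate_1950_1955, deg=1)

# Pearson correlation and the average absolute change between periods.
r = np.corrcoef(rate_1955_1960, rate_1950_1955)[0, 1]
mean_abs_change = np.mean(np.abs(rate_1955_1960 - rate_1950_1955))

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
print(f"mean absolute change={mean_abs_change:.3f}")
```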

df2.plot(kind ='scatter',figsize=(14,6),x = 'Rate_1955-1960',y='Rate_1950-1955')

<matplotlib.axes._subplots.AxesSubplot at 0x136dfec8>

Step: 7: The heat map below shows the fertility rate values of the first 10 countries of the dataframe over the years 1950-1955 to 2005-2010.

df3 = df2.iloc[0:10, :-1]

df3

plt.figure(figsize=(14,7))

sns.heatmap(data = df3, annot = True)

<matplotlib.axes._subplots.AxesSubplot at 0x1334b0e8>

Step: 8: Here we create the main dataframe of this practical and name it Data.

The Data dataframe is built as follows:

X = df.Years.unique()

Data = df

Data = Data.pivot_table(index = 'Years',columns ='Country', values =['Rate'])

df3 = pd.DataFrame(

    np.arange(496).reshape(2, 248),

    columns=['Afghanistan', 'Africa', 'Albania', 'Algeria', 'Angola',

       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Asia',

       'Australia', 'Australia/New Zealand', 'Austria', 'Azerbaijan',

       . . . .,

       'Western Africa', 'Western Asia', 'Western Europe',

       'Western Sahara', 'World', 'Yemen', 'Zambia', 'Zimbabwe'])

x = df3.columns

Data.columns = x

Data

Step: 9: Next is the most important graph. It is a line graph in which each line represents one country, showing the fertility rate from 1950 to 2010. All countries are drawn as light grey lines, and some countries are highlighted. Among the highlighted countries, Yemen and Niger are poor countries; India, Brazil and Spain are average, developing countries; all the other highlighted countries can be considered rich.

plt.figure(figsize=(14,14))

m = Data.mean(axis = 1)

Data['World Average'] = m

for i in range(0,248):

    sns.lineplot(data = Data, y = Data[Data.columns[i]], x = Data.index, color='grey', alpha = 0.1)

sns.lineplot(data = Data, y = 'India', x = Data.index, color = 'brown', label= 'India')

sns.lineplot(data = Data, y = 'Brazil', x = Data.index, color = 'green',  label= 'Brazil')

sns.lineplot(data = Data, y = 'Niger', x = Data.index, color = 'purple', label= 'Niger')

sns.lineplot(data = Data, y = 'Yemen', x = Data.index, color = 'blue', label= 'Yemen')

sns.lineplot(data = Data, y = 'France', x = Data.index, color = 'orange', label= 'France')

sns.lineplot(data = Data, y = 'Sweden', x = Data.index, color = 'yellow', label= 'Sweden')

sns.lineplot(data = Data, y = 'United Kingdom', x = Data.index, color = 'pink', label= 'United Kingdom')

sns.lineplot(data = Data, y = 'Spain', x = Data.index, color = 'brown', label= 'Spain')

sns.lineplot(data = Data, y = 'Italy', x = Data.index, color = 'blue', label= 'Italy')

sns.lineplot(data = Data, y = 'Germany', x = Data.index, color = 'yellow', label= 'Germany')

sns.lineplot(data = Data, y = 'Japan', x = Data.index, color = 'purple', label= 'Japan')

sns.lineplot(data = Data, y = 'China', x = Data.index, color = 'brown', label= 'China')

sns.lineplot(data = Data, y = 'World Average', x = Data.index, color = 'black', label= 'World Average', linewidth = 5)

plt.savefig('sample.png')

Conclusion: From this graph we can say that poor countries have higher fertility rates, i.e. there are more children per woman in poor countries. Average or developing countries have fertility rates between 3 and 6, and rich countries have very low fertility rates.

Plotting the world average across all countries over these years gives a decreasing curve. This indicates that the fertility rate has been decreasing in almost all countries from 1950 to 2010.
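The world average line is a row-wise mean across the country columns. A small sketch of that calculation on toy data follows. The frame `toy` and its values are assumptions for illustration. In the real Data frame the columns also include regional aggregates such as Africa and World, so the unweighted mean there is not a population-weighted world figure.

```python
import pandas as pd

# Toy version of Data: periods as rows, countries as columns.
toy = pd.DataFrame(
    {"X": [6.0, 5.0, 4.0], "Y": [3.0, 2.5, 2.0], "Z": [6.0, 4.5, 3.0]},
    index=["1950-1955", "1980-1985", "2005-2010"],
)

# axis=1 averages across columns, giving one value per period.
toy_average = toy.mean(axis=1)

# True when no period's average is higher than the one before it.
is_decreasing = toy_average.is_monotonic_decreasing
print(toy_average)
print(is_decreasing)
```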

Part: 2:

Here we will read the next dataset i.e. Average income per person.

Step: 1: Read the dataset of Average income per person for all the countries over the years 1950 to 2010.

income = pd.read_csv('UN_Income_Data.csv')

income.head()

	Country or Area	Year(s)	Variant	Value
0	Afghanistan	2010	Constant fertility	29185.507
1	Afghanistan	2009	Constant fertility	28394.813
2	Afghanistan	2008	Constant fertility	27722.276
3	Afghanistan	2007	Constant fertility	27100.536
4	Afghanistan	2006	Constant fertility	26433.049

Step: 2: Explore the dataset and match the number of countries with the previous dataset.

income = income.drop('Variant', axis = 1)

income = income.rename(columns = {'Country or Area' : 'Country', 'Year(s)' : 'Years', 'Value' : 'Income'})


imean = []

for item in np.split(income, 3477):

    imean.append(item['Income'].mean())

#     print(item['Income'].mean())

imean[1:13]

IncomeCol = imean[0:3384]

# IncomeCol

len(IncomeCol)

len(imean)

country = income['Country'].unique()

len(country)

x = [0]*282

for i in range(0,282):

    x[i] = i

x = np.repeat(x, 12)

x = list(x)

len(x)
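The code above averages the yearly income values in fixed-size blocks so that they line up with the five-year fertility periods. An alternative sketch, assuming a yearly table with an integer Years column, groups by country and period instead of relying on row positions. The data, the `Period` column and the rule that years 1950 to 1954 belong to "1950-1955" are assumptions made for this example.

```python
import pandas as pd

# Hypothetical yearly values for two countries, 1950 to 1959.
yearly = pd.DataFrame({
    "Country": ["X"] * 10 + ["Y"] * 10,
    "Years": list(range(1950, 1960)) * 2,
    "Income": [float(v) for v in range(10)] + [float(v) for v in range(100, 110)],
})

# Map each year to the start of its five-year block, e.g. 1953 -> 1950,
# then build a label in the same style as the fertility data.
start = (yearly["Years"] // 5) * 5
yearly["Period"] = start.astype(str) + "-" + (start + 5).astype(str)

# Average the yearly values within each country and period.
five_year = yearly.groupby(["Country", "Period"], as_index=False)["Income"].mean()
print(five_year)
```

Grouping by label keeps each average tied to its country even if some countries have missing years, which position-based splitting cannot guarantee.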

Step: 3:

The aim of Part 2 is to see whether the average income per person in a particular country affects the fertility rate of that country.

Here we state our expectations for the result as follows:

- The graph will be a combination of line and scatter.

- The income of the countries will be represented as a line and the fertility rate values will be represented as scatter points.

- We want to see the relationship between these two variables.

- Here we assume that both graphs will coincide with each other.

- It should show that countries with high income per person have low fertility rates and countries with low income have high fertility rates.

So, we will explore the dataset first:

avgincome = pd.DataFrame(zip(IncomeCol, x), columns = ['Income', 'Country'])

years = df['Years'][0:3383]

avgincome['Years'] = years

avgincome = avgincome.pivot_table(columns = 'Country',values = ['Income'], index = 'Years')

avgincome.columns = country

avgincome

avgincome2 = avgincome

plt.figure(figsize = (14,7))

avgincome.mean()

avgincome2 = avgincome2.T

avgincome2['mean'] = avgincome.mean()

Data.mean()

Afghanistan      7.367917

Africa           6.140500

Albania          4.006667

Algeria          5.702917

Angola           6.938500

...

World            3.848667

Yemen            7.712500

Zambia           6.691667

Zimbabwe         5.898333

World Average    4.401196

Length: 249, dtype: float64

Step: 4: Next we create a new dataframe combining our two existing dataframes, Data and avgincome. We read it from the file below:

UN_Combined_Data.csv

new_df = pd.read_csv("UN_Combined_Data.csv")

reformed1 = new_df[new_df['Year(s)']=='1990-1995']

reformed1

	Country or Area	Year(s)	Value	Income
3	Afghanistan	1990-1995	7.482	15757.5100
15	Africa	1990-1995	5.724	699691.5748
27	Albania	1990-1995	2.786	3130.6212
39	Algeria	1990-1995	4.120	29234.7394
75	Angola	1990-1995	7.100	15887.8532
...	...	...	...	...
3315	Vanuatu	1990-1995	4.830	2664.3614
3327	Venezuela (Bolivarian Republic of)	1990-1995	3.250	10138.3074
3339	Viet Nam	1990-1995	3.227	76.3872
3363	Western Africa	1990-1995	6.395	41187.3072
3375	Western Asia	1990-1995	4.036	8.8406

245 rows × 4 columns
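How the combined file was produced is not shown here. One way such a table could be built, given long-format fertility and income tables that share country and period columns, is a merge on those two keys. The frames `fertility` and `income_5y` and all their values are toy assumptions.

```python
import pandas as pd

# Toy stand-ins for the fertility and income tables in long format.
fertility = pd.DataFrame({
    "Country or Area": ["X", "X", "Y"],
    "Year(s)": ["1990-1995", "1995-2000", "1990-1995"],
    "Value": [7.4, 7.6, 2.7],
})
income_5y = pd.DataFrame({
    "Country or Area": ["X", "Y", "Z"],
    "Year(s)": ["1990-1995", "1990-1995", "1990-1995"],
    "Income": [15000.0, 3000.0, 500.0],
})

# An inner join keeps only country/period pairs present in both tables.
combined = pd.merge(
    fertility, income_5y, on=["Country or Area", "Year(s)"], how="inner"
)

# Select a single period, as done for 1990-1995 above.
one_period = combined[combined["Year(s)"] == "1990-1995"]
print(one_period)
```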

Step: 5: We will now plot the graph of Income vs. Value (i.e. Fertility Rate).

This plot includes a regression line. We can derive results from the slope of the regression line.

If the slope of the regression line is positive, then the relation between the two variables is positive (i.e. if one value increases, the other increases), and vice versa.

re2 = reformed1.sort_values(['Income'])

plt.figure(figsize = (14,7))

sns.regplot(data = re2, x = 'Income', y = 'Value')

<matplotlib.axes._subplots.AxesSubplot at 0x1485f040>

Conclusion of the above graph:

From the graph, we can see that the slope of the regression line is negative. This indicates that income and fertility rate are negatively related, i.e. countries with high income tend to have lower fertility rates and countries with low income generally have higher fertility rates.
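The sign of the regression slope can be computed directly rather than read off the plot. The sketch below uses the closed form slope = cov(x, y) / var(x) on made-up income and fertility pairs. A negative slope shows a negative linear association in the sample. It does not by itself establish strict inverse proportionality or causation.

```python
import numpy as np

# Hypothetical (income, fertility rate) pairs for six countries.
income_vals = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
fertility_vals = np.array([7.0, 6.5, 5.0, 4.0, 2.5, 2.0])

# Least-squares slope in closed form: covariance divided by variance of x.
# np.cov uses the N - 1 denominator by default, so np.var is given ddof=1.
cov_xy = np.cov(income_vals, fertility_vals)[0, 1]
slope = cov_xy / np.var(income_vals, ddof=1)

# Pearson correlation: same sign as the slope, bounded in [-1, 1].
r = np.corrcoef(income_vals, fertility_vals)[0, 1]

direction = "negative" if slope < 0 else "positive or flat"
print(f"slope={slope:.4f}, r={r:.3f}, relationship is {direction}")
```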

Measures of Central Tendency:

Mean:

def my_mean_fun(xyz):

    my_sum = 0

    # Add up the 12 rows (one per five-year period), column by column.
    for i in range(0, 12):

        my_sum = my_sum + xyz.iloc[i]

    # print(my_sum)

    Mean = my_sum / 12

    print(my_sum / 12)

    return Mean
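The function above adds the rows one by one and divides by their count. The following sketch repeats that idea on a toy frame and compares it with the built-in `DataFrame.mean`. The frame `toy` and the names `manual_mean` and `builtin_mean` are illustrative.

```python
import pandas as pd

# Toy version of Data: 12 five-year periods as rows, countries as columns.
toy = pd.DataFrame({
    "X": [float(v) for v in range(1, 13)],   # 1.0 ... 12.0
    "Y": [2.0] * 12,
})

# Manual mean: add the rows one by one, then divide by the row count.
total = 0
for i in range(len(toy)):
    total = total + toy.iloc[i]
manual_mean = total / len(toy)

# pandas computes the same column-wise mean directly.
builtin_mean = toy.mean()
print(manual_mean)
```

Using `len(toy)` instead of a fixed 12 keeps the manual version correct if the number of periods changes.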

my_mean = my_mean_fun(Data)

sns.distplot(my_mean)

Afghanistan 7.367917

Africa 6.140500

Albania 4.006667

Algeria 5.702917

Angola 6.938500

...

Yemen 7.712500

Zambia 6.691667

Zimbabwe 5.898333

World Average 4.401196

mean NaN

Length: 250, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x181b93b8>

Standard Deviation:

import math

def my_std_fun(xyz):

    sq_sum = 0

    std = []

    # Accumulate squared deviations row by row.
    # Note: this relies on the global my_mean computed above.
    for i in range(0,12):

        diff = my_mean[i] - xyz.iloc[i]

        sq_sum = sq_sum + diff*diff

    var = sq_sum / 12

    # print(var)

    for i in range(0,12):

        std.append(math.sqrt(var[i]))

    # std = math.sqrt(var)

    print(std)

    return std

my_std = my_std_fun(Data)

[3.532038474165607, 2.3819336693552855, 1.193136578140668, 2.1684070906460087, 3.2913903457332605, 1.6880082583353477, 2.0734038451802714, 1.7757598701174224, 1.7190840639995537, 1.0755802527976859, 2.3017676672716605, 2.2473502031868304]

new_data_std = my_std_fun(new_data)

new_data_df = pd.DataFrame(new_data_std)

new_data_df.to_csv("new_data_std.csv")
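As a cross-check on the hand-written routine, the population standard deviation of each column can be written out from its definition and compared with Pandas. This is a sketch on toy data: the frame `toy` and the function `population_std` are hypothetical names, and the values are chosen so that the first column has mean 5 and variance 4.

```python
import math
import pandas as pd

# Toy table: rows are countries, columns are periods (like new_data).
toy = pd.DataFrame({
    "1950-1955": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    "2005-2010": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
})

def population_std(column):
    """Square root of the mean squared deviation from the column mean."""
    values = list(column)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance)

manual_std = {name: population_std(toy[name]) for name in toy.columns}
print(manual_std)

# pandas gives the same result when ddof=0 (divide by N, not N - 1).
pandas_std = toy.std(ddof=0)
print(pandas_std)
```

Each column is compared against its own mean over all of its rows, which is the usual definition of the spread of a period across countries.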

Median:

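def my_median(xyz):

    Median = []

    # For each of the 12 period columns, sort by that column and
    # take the value at position 125 of the sorted data.
    for i in range(0,12):

        result = xyz.sort_values(xyz.columns[i], ascending=[1]).iloc[:,i][[125]]

        Median.append(result)

    return Median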

my_data_median = my_median(new_data)

my_data_median

median_df = pd.DataFrame(my_data_median)

median_df.to_csv("median_data.csv")
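The function above takes one fixed position from the sorted column. A more general sketch of the median, which averages the two central values when the number of rows is even, is shown here for comparison. `manual_median` and the two small series are illustrative only.

```python
import pandas as pd

def manual_median(column):
    """Middle value of the sorted data; mean of the two middle values if N is even."""
    values = sorted(column)
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

odd = pd.Series([7.0, 1.0, 3.0])          # sorted: 1, 3, 7
even = pd.Series([4.0, 1.0, 3.0, 2.0])    # sorted: 1, 2, 3, 4

print(manual_median(odd), manual_median(even))
```

Both results agree with `Series.median()`, which applies the same rule.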

Mode:

new_data.mode().iloc[0].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1850ae38>

Percentile, Quartile and Decile:

def percentile(n):

    # Fractional rank of the n-th percentile.
    p = n * (251) / 100

    pi = int(p)

    result = new_data.sort_values('2000-2005', ascending=[1])

    x = float(result['2000-2005'][[pi]])

    # print(x)

    y = float(result['2000-2005'][[pi+1]])

    # print(y)

    # Interpolate linearly between the two neighbouring values.
    ans = x + (p - pi)*(y - x)

    return ans

p = percentile(10)

print("p10 = D1 = ", p)

p = percentile(25)

print("p25 = Q1 = ", p)

p = percentile(30)

print("p30 = D3 = ", p)

p = percentile(50)

print("p50 = Q2 = ", p)

p = percentile(65)

print("p65 = ", p)

p = percentile(75)

print("p75 = Q3 = ", p)

p = percentile(90)

print("p90 = D9 = ", p)

p = percentile(99)

print("p99 = ", p)

p10 = D1 = 1.371

p25 = Q1 = 1.8415

p30 = D3 = 1.9469

p50 = Q2 = 2.645

p65 = 3.2571500000000007

p75 = Q3 = 4.2145

p90 = D9 = 5.7984

p99 = nan
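The percentile function above interpolates between two neighbouring sorted values at a fractional rank. The sketch below writes out one common version of that rule, rank = n(N + 1)/100 with 1-based ranks, on a short list, and clamps ranks that fall outside the data. Looking past the last value is a likely reason for the `nan` at p99 above, though that is an assumption. `percentile_n_plus_1` is a hypothetical helper name.

```python
import math

def percentile_n_plus_1(values, n):
    """n-th percentile by the (N + 1) rank rule with linear interpolation."""
    data = sorted(values)
    size = len(data)
    rank = n * (size + 1) / 100          # 1-based, may be fractional
    k = math.floor(rank)
    frac = rank - k
    if k < 1:                            # rank falls before the first value
        return data[0]
    if k >= size:                        # rank falls at or past the last value
        return data[-1]
    # data[k - 1] is the k-th smallest value (lists are 0-based).
    return data[k - 1] + frac * (data[k] - data[k - 1])

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1 = percentile_n_plus_1(sample, 25)     # first quartile, Q1
q2 = percentile_n_plus_1(sample, 50)     # median, Q2
p99 = percentile_n_plus_1(sample, 99)    # clamped to the largest value
print(q1, q2, p99)
```

Deciles and quartiles are special cases: D1 is the 10th percentile, Q1 the 25th, Q2 the 50th, and so on.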

Skewness:

Our Data dataframe is transposed into the new_data dataframe, as follows.

new_data

We will find Pearson's coefficient of skewness for our dataframe.

The equation for Pearson's coefficient of skewness is

$$SK_p = \frac{3\,(\text{mean} - \text{median})}{\text{standard deviation}}$$
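The coefficient used in the calculations that follow is 3 * (mean - median) / standard deviation. A minimal sketch of it on two made-up lists shows how the sign follows the direction of the longer tail. The function name `pearson_skewness` and the use of the population standard deviation are assumptions made for this example.

```python
import statistics

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)      # population standard deviation
    return 3 * (mean - median) / std

right_tailed = [1, 2, 2, 3, 3, 3, 10]     # long tail towards large values
left_tailed = [-v for v in right_tailed]  # mirror image

sk_right = pearson_skewness(right_tailed)
sk_left = pearson_skewness(left_tailed)
print(sk_right, sk_left)
```

A positive value means the mean lies above the median (positive skew), a negative value means it lies below (negative skew), and a value near zero suggests a roughly symmetric distribution.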

Now let's plot the distribution of this dataframe. The mean, mode and median lines are also included in the graph.

plt.figure(figsize = (14,7))

mean = new_data_mean.mean()

median = new_data.median().median()

mode = new_data.mode().iloc[0].mean()

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

sns.boxplot(new_data, ax=ax_box)

ax_box.axvline(mean, color='r', linestyle='--')

ax_box.axvline(median, color='g', linestyle='-')

ax_box.axvline(mode, color='b', linestyle='-')

sns.distplot(new_data, ax=ax_hist)

ax_hist.axvline(mean, color='r', linestyle='--')

ax_hist.axvline(median, color='g', linestyle='-')

ax_hist.axvline(mode, color='b', linestyle='-')

plt.legend({'Mean':mean,'Median':median,'Mode':mode})

ax_box.set(xlabel='')

plt.show()

Conclusion: Here we can see that the data is distributed roughly equally on both sides of the mean, giving a symmetric curve over the years 1950 to 2010.

But why is it this symmetric?

To find out, we will make distplots for different periods.

To see the change over the years, we will plot the same graphs for 1950-1955, 1980-1985 and 2005-2010. Here the total span of 60 years is divided into 3 parts.

First plot the graph for 1950-1955 and see the tendency.

mean5055 = new_data_mean['1950-1955']

median5055 = 6.0

mode5055 = new_data['1950-1955'].mode().iloc[0]

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

sns.boxplot(new_data['1950-1955'], ax=ax_box)

ax_box.axvline(mean5055, color='r', linestyle='--')

ax_box.axvline(median5055, color='g', linestyle='-')

ax_box.axvline(mode5055, color='b', linestyle='-')

sns.distplot(new_data['1950-1955'], ax=ax_hist)

ax_hist.axvline(mean5055, color='r', linestyle='--')

ax_hist.axvline(median5055, color='g', linestyle='-')

ax_hist.axvline(mode5055, color='b', linestyle='-')

plt.legend({'Mean': mean5055,'Median': median5055,'Mode': mode5055 })

ax_box.set(xlabel='')

plt.show()

Conclusion: From the above graph we can see that the tail of the graph is long towards the negative end of the x-axis. Also, mean < median < mode, so the graph is negatively skewed.

We can say that during 1950-1955 the fertility rate of most countries was higher than the mean (because the graph is higher on the right side of the mean and lower on the left side).

std5055 = new_data_std[0]

pearsoncoeff5055 = 3*(mean5055 - median5055) / std5055

Pearson's coefficient of skewness for 1950-1955 = -1.4646796041034003

Now plot the graph for 1980-1985 and see the tendency.

mean8085 = new_data_mean['1980-1985']

median8085 = 4.23

mode8085 = new_data['1980-1985'].mode().iloc[0]

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

sns.boxplot(new_data['1980-1985'], ax=ax_box)

ax_box.axvline(mean8085, color='r', linestyle='--')

ax_box.axvline(median8085, color='g', linestyle='-')

ax_box.axvline(mode8085, color='b', linestyle='-')

sns.distplot(new_data['1980-1985'], ax=ax_hist)

ax_hist.axvline(mean8085, color='r', linestyle='--')

ax_hist.axvline(median8085, color='g', linestyle='-')

ax_hist.axvline(mode8085, color='b', linestyle='-')

plt.legend({'Mean': mean8085,'Median': median8085,'Mode': mode8085})

ax_box.set(xlabel='')

plt.show()

Conclusion: From the above graph we can see that the graph is becoming symmetric. Also, mean = median < mode, and the mean, median and mode are very close to each other.

We can say that from 1955 to 1980 the fertility rate of some countries had started decreasing, so the negatively skewed graph started to become evenly distributed about the mean. The countries were divided roughly 50% - 50% between fertility rate > mean and fertility rate < mean.

std8085 = new_data_std[6]

pearsoncoeff8085 = 3*(mean8085 - median8085) / std8085

Pearson's coefficient of skewness for 1980-1985 = -0.9624275516198356

Now let's plot the graph for 2005-2010 (25 years after 1980-1985) and see the tendency.

mean510 = new_data_mean['2005-2010']

median510 = 2.539

mode510 = new_data['2005-2010'].mode().iloc[0]

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})

sns.boxplot(new_data['2005-2010'], ax=ax_box)

ax_box.axvline(mean510, color='r', linestyle='--')

ax_box.axvline(median510, color='g', linestyle='-')

ax_box.axvline(mode510, color='b', linestyle='-')

sns.distplot(new_data['2005-2010'], ax=ax_hist)

ax_hist.axvline(mean510, color='r', linestyle='--')

ax_hist.axvline(median510, color='g', linestyle='-')

ax_hist.axvline(mode510, color='b', linestyle='-')

plt.legend({'Mean':mean510,'Median':median510,'Mode':mode510})

ax_box.set(xlabel='')

plt.show()

Conclusion: From the above graph we can see that the tail of the graph is long towards the positive end of the x-axis. Also, mean > median > mode, so the graph has gone from symmetric to positively skewed over the years 1985 to 2010.

We can say that during 2005-2010 the fertility rate of most countries had become lower than the mean (because the graph is higher on the left side of the mean and lower on the right side).

std510 = new_data_std[11]

pearsoncoeff510 = 3*(mean510 - median510) / std510

Pearson's coefficient of skewness for 2005-2010 = -0.991976218080132

These three graphs show the transition of the fertility rate of the countries from above the mean to below the mean. That is why the overall graph (plotted first) is symmetric, with peaks on both sides of the mean.

Explanation of code ( functions used in plotting ):

- matplotlib.pyplot is imported as plt.

- seaborn is imported as sns.

- plt.subplots: creates a figure and a grid of subplots in a single call. 1st argument: the number of rows (here 2); 2nd argument: sharex=True (a common x-axis for both the boxplot and the distplot); 3rd argument: gridspec_kw, a dictionary of options for the grid the subplots are placed on (here the height ratios).

- axvline: adds a vertical line across the axes. Pass the x position as the 1st argument; colour, linewidth, linestyle etc. can also be set.

- plt.legend: adds a legend naming the plotted items.
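The functions described above can be seen together in a short Matplotlib-only sketch. The values are made up, a plain histogram stands in for the Seaborn distplot, and the "Agg" backend is selected only so the code runs without a display. None of these choices come from the practical itself.

```python
import matplotlib
matplotlib.use("Agg")                    # draw off-screen; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1.5, 2.0, 2.1, 2.6, 3.0, 3.4, 4.2, 5.8, 6.5])  # made-up rates

# Two stacked axes sharing one x-axis; the top one gets less height.
fig, (ax_box, ax_hist) = plt.subplots(
    2, sharex=True, gridspec_kw={"height_ratios": (0.2, 1)}
)

ax_box.boxplot(values, vert=False)       # horizontal box plot on top
ax_hist.hist(values, bins=5)             # histogram underneath

# axvline draws a vertical line at the given x position on one axes.
ax_hist.axvline(values.mean(), color="r", linestyle="--", label="Mean")
ax_hist.axvline(np.median(values), color="g", linestyle="-", label="Median")
ax_hist.legend()                         # uses the label= text given above
```

Giving each line a `label=` and calling `legend()` with no arguments ties every legend entry to the line it describes.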

The reasons for change:

- Empowerment of women (increasing access to education and increasing labour participation)

- Declining Child mortality

- Rising cost of bringing up children

World population by level of fertility without projections

Conclusion: The width given to each country in this chart corresponds to the share of each country's population in the total global population. This is why China and India are so wide.

From this we can say that:

Globally, the fertility rate has fallen to 2.5 children per woman, and low fertility rates are the norm in most of the world.

65% of the world's population lives in countries with a fertility rate below 3 children per woman.

We also see convergence in fertility rates: the countries that already had low fertility rates in the 1950s only slightly decreased their fertility rates further, while many of the countries that had the highest fertility back then saw a rapid reduction in the number of children per woman.

It took Iran only 10 years for the fertility rate to fall from more than 6 children per woman to fewer than 3 children per woman. China made this transition in 11 years, before the introduction of the one-child policy.

The speed with which countries can make the transition to a low fertility rate has increased over time.

Charts:

- How long did it take for the fertility rate to fall from 6 children per woman to fewer than 3 children per woman?

Years-it-took-Fertility-to-fall-from-6-to-below-3

- Age of marriage of women and marital fertility rate

Country or Region	Mean age at first marriage	Births per married women	Percentage never married	Total fertility rate
Belgium	24.9	6.8		6.2
France	25.3	6.5	10	5.8
Germany	26.6	5.6		5.1
England	25.2	5.4	12	4.9
Netherlands	26.5	5.4		4.9
Scandinavia	26.1	5.1	14	4.5

- Children born per woman, 2019

- Birth rates:

- Women's educational attainment vs. number of children per woman:

- Number of children, by level of education of the mother, in countries where women have on average 5 or more children.

DHS data on fertility by the level of education

- Fertility and female labour force participation:

References:

- Book: Alberto Cairo, The Functional Art: An Introduction to Information Graphics and Visualization (New Riders, 2012)

- Fertility Rate - Our World in Data