Weekend Project: Data Science

Justin DeTone
13 min read · May 10, 2021

Photo by Lukas Blazek on Unsplash

Background

To explore the careers that are out there and to better understand what different jobs actually involve, I have decided to do regular weekend projects: I pick an interesting position and build a two-day project around it. The goal is to get a taste of what each job entails. The two-day time frame necessarily keeps the scope of these projects small, which has the added benefit of forcing me to work lean and to practice getting the best results from limited input. I must confess that this project took me two weekends to complete; I could have spent my time better by focusing on the essential parts of the project without getting sidetracked.

For this project I settled on data science, since I hear a lot about it as a growing, high-demand field. I did some initial research on what a data scientist does, which was necessary so that I could craft a mini project around the title. To give a brief summary, a data scientist solves business problems by leveraging data. A data scientist may try to predict the future, explain what happened in the past, or recommend how a business should act. There is an emphasis on working with large amounts of data. It’s a position that sits at the intersection of programming, business and statistics. Data scientists will often use machine learning techniques or models, but it’s important to stress that their priority is addressing a business question, not advancing the science of artificial intelligence.

Getting Started

I was able to get some insight into the process that a data scientist might follow from reading a few articles. One article in particular was especially helpful in laying out a path for a simple project and providing some Python code to work with as a starting point. The process that I followed starts with identifying a problem and finding a dataset. Here the differences between working for a company and doing this as an individual were clear. If I were a data scientist in industry, I imagine that someone would come to me with a business question and that I would have some company data to explore. Instead, as someone doing this as a project, I had to find my own dataset and essentially create my own problem.

The next step was to peek into the dataset, which meant learning some basic characteristics of the data I was working with. Then I would perform exploratory data analysis, which builds on the previous step and seeks to answer questions that I have about the data. Then came feature selection, where I had to consider which values were useful in building a model and manipulate the data for better model-building. Next came model building and evaluation, where I would use a model to drive some kind of decision about the data. Finally, I would conclude by considering model improvements.

Finding a dataset

My first task was to find a dataset that had enough information to work with and was relatively clean. I didn’t want to spend too much time cleaning the data, given the limited project scope mentioned above. After some deliberation, I settled on a dataset that contained survey data measuring happiness for every country. I selected this dataset because it had an obvious metric (happiness) that could be analyzed for its dependence on other features within the dataset (such as GDP per capita and perceived corruption of the surveyed country).

The dataset that I worked with was actually contained in two CSV files. The first file contained survey results of average happiness for every country for the year 2021, as well as survey results for many other features that I’d imagine were thought to influence how likely someone was to say they were happy (ratings of freedom, perceived social support, perceived corruption, etc.). Most of the survey questions in the dataset were asked as yes or no questions, and the mean response of all citizens was then given as a decimal (from 0 to 1). The other file contained multiple entries for each country, with historical happiness data going back to 2016. It also had the survey results for the other features that you might think would correlate with happiness.

In the end, I decided to limit the scope of this project to working with just the 2021 data. I really did want to try to drive a prediction around the historical data, perhaps seeing if survey responses to some questions could predict a country’s happiness in future years, but ultimately had to limit the scope of this project. This is a common theme, but considering that this was a project relegated to a weekend (again, this one took two — but I didn’t want it to), I needed to set some definite scope limits.

Exploring the Dataset

Now that I had a dataset, it was time to explore it. I used the Python library pandas to read the CSV file of 2021 happiness data into a DataFrame, the object that pandas uses to store 2-dimensional data. Here is the Python code that I used to explore the dataset:

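import pandas as pd


def peek(df):
    """Print the shape, column names, and column data types of a DataFrame."""
    print("The dimensions of this table are: ", df.shape)
    print("The column names are: ", df.columns)
    print("These columns have the following data types:\n", df.dtypes)


if __name__ == "__main__":
    # Forward slashes keep the path portable and avoid the invalid "\w" escape.
    happiness2021 = pd.read_csv("data/world-happiness-report-2021.csv")
    peek(happiness2021)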

When run, this code gave me information about the shape of the dataset, the names of all of the columns (which are dataset features) and the data type of each feature. The output of the code is below.

The dimensions of this table are:  (149, 20)
The column names are:  Index(['Country name', 'Regional indicator', 'Ladder score',
       'Standard error of ladder score', 'upperwhisker', 'lowerwhisker',
       'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Ladder score in Dystopia',
       'Explained by: Log GDP per capita', 'Explained by: Social support',
       'Explained by: Healthy life expectancy',
       'Explained by: Freedom to make life choices',
       'Explained by: Generosity', 'Explained by: Perceptions of corruption',
       'Dystopia + residual'],
      dtype='object')
These columns have the following data types:
Country name                                   object
Regional indicator                             object
Ladder score                                  float64
Standard error of ladder score                float64
upperwhisker                                  float64
lowerwhisker                                  float64
Logged GDP per capita                         float64
Social support                                float64
Healthy life expectancy                       float64
Freedom to make life choices                  float64
Generosity                                    float64
Perceptions of corruption                     float64
Ladder score in Dystopia                      float64
Explained by: Log GDP per capita              float64
Explained by: Social support                  float64
Explained by: Healthy life expectancy         float64
Explained by: Freedom to make life choices    float64
Explained by: Generosity                      float64
Explained by: Perceptions of corruption       float64
Dystopia + residual                           float64
dtype: object

Sidenote: if I were to redo this project, I would have done my code in a Jupyter notebook instead of splitting it out into several scripts. That’s because I feel like the formatting of a Jupyter notebook would have lent itself better to this project, which involved outputting a lot of results in python.

I also peeked into the dataset by using the DataFrame head() method, which lets me look at the first n rows of data. I ended up spending too much time on this project trying to generate a usable table image of the first 5 rows of the dataset. My best results came from installing another Python library: plotly. Even so, the formatting of the table is still a bit awkward. This was another area of the project that I felt was good enough to move forward on. For a formal presentation on this dataset, this table would clearly need work, but I didn’t want to get too sidetracked on formatting a table when the focus of this project was on applying data science principles. Anyway, here’s the table, split into two images because it’s so wide.

Again, don’t do this at a job

The code that generated this table is below.

import pandas as pd
import plotly.figure_factory as ff


def add_space(str_header):
    """Put each word of a column header on its own line."""
    header_list = str_header.split()
    output = ""
    for word in header_list:
        output += word + "\n"
    print(output)
    return output[:-1]  # drop the trailing newline


def clip_title(str_header):
    """Truncate a column header to its first 20 characters."""
    return str_header[:20]


def plot_table(df, width, height, img_name, truncate_text=False):
    """Render a DataFrame as a plotly table, save it as an image, and show it."""
    new_df = df.copy()  # copy so the caller's column names are left untouched
    if truncate_text:
        new_df.rename(columns=clip_title, inplace=True)
    fig = ff.create_table(new_df.reset_index())
    fig.update_layout(autosize=False, width=width, height=height)
    fig.write_image(img_name, scale=1)
    fig.show()
    return 1


happiness2021 = pd.read_csv("data/world-happiness-report.csv")
print(happiness2021.columns)
happiness2021.rename(columns=clip_title, inplace=True)
print(happiness2021.columns)
print(happiness2021.head(5))
print("The dimensions of this table are: ", happiness2021.shape)
print(happiness2021.dtypes)
print("columns: ", happiness2021.columns)

# Static image export needs the kaleido package installed alongside plotly.
fig = ff.create_table(happiness2021.head(20))
fig.update_layout(autosize=False, width=1500, height=400)
fig.write_image("historical.png", scale=1)
fig.show()
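For a quick look that doesn't need an image at all, the same peek can be done as plain text. The snippet below is only a sketch: the rows are hypothetical stand-in values (not taken from the dataset), and the column selection is an assumption made to keep the example small.

```python
import pandas as pd

# Hypothetical stand-in rows; the real project would load the CSV instead.
sample = pd.DataFrame(
    {
        "Country name": ["A", "B", "C", "D", "E", "F"],
        "Ladder score": [7.8, 7.6, 7.5, 7.4, 7.3, 7.2],
        "Social support": [0.95, 0.94, 0.93, 0.92, 0.91, 0.90],
    }
)

first_rows = sample.head(5)  # first five rows only
print(first_rows.to_string(index=False))

# Summary statistics for the numeric columns (count, mean, std, quartiles).
print(sample.describe())

# Number of missing values per column; all zeros means nothing to clean here.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

In a Jupyter notebook, leaving `sample.head(5)` as the last expression in a cell would render a formatted table without any extra library.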

I encourage you to follow the link to the dataset if you are curious about the details of each feature. For this project I decided to focus only on: Ladder score (measure of happiness), Logged GDP, Social Support, Life Expectancy, Freedom, Generosity and Perceptions of corruption. The dataset also contained some statistical data on the distribution of answers for the ladder score as well as an already calculated impact of each feature on a country’s happiness ladder score. Now that I have glanced at the dataset, I can start framing some questions about the data.

Exploratory Data Analysis

During this phase of the project I came up with several questions about the data and worked with the data to explore the answers. I had the following questions about the data:

What is the distribution of each variable?

To answer this question, I plotted histograms for the features that I was interested in with the following code. I used another Python library, matplotlib, which is often used to visualize data.

import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['figure.figsize'] = (20, 10)
fig, axes = plt.subplots(nrows=2, ncols=4)
axes[-1, -1].axis('off')  # seven features, so hide the unused eighth panel

df = pd.read_csv("data/world-happiness-report-2021.csv")
print(df.columns)
features = ["Ladder score", "Logged GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]
xaxes = features
yaxes = ["Counts"] * 7

# flatten the 2x4 array of axes
axes = axes.ravel()
for idx, ax in enumerate(axes[:-1]):
    ax.hist(df[features[idx]].dropna(), bins=30)
    ax.set_xlabel(xaxes[idx], fontsize=20)
    ax.set_ylabel(yaxes[idx], fontsize=20)
    ax.tick_params(axis='both', labelsize=15)

plt.show()

This was the result

I think it’s interesting that most of the distributions are densest around the upper quartile of the data, with the exception of generosity (which is concentrated in the lower quartile) and the ladder score (which looks closer to a normal distribution). I also want to point out that this dataset uses the logarithm of GDP per capita. This matters because the range of raw GDP per capita values (in US dollars) is quite large: 761 to 103,000. If the dataset did not already include the logged GDP values, I would have had to add code to take the natural log of the GDP column myself.
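Had the log transform been needed, it could look roughly like the sketch below. The raw values here are hypothetical, chosen only to span the range quoted above, and the column name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical raw GDP-per-capita values in US dollars (not dataset rows),
# chosen to span the range mentioned in the text.
raw_gdp = pd.Series([761.0, 5_000.0, 20_000.0, 103_000.0], name="GDP per capita")

logged_gdp = np.log(raw_gdp)  # natural logarithm, applied element-wise

print(logged_gdp.round(3))
# The raw values differ by a factor of more than 100, while the logged
# values span only a few units.
print("raw ratio max/min:", raw_gdp.max() / raw_gdp.min())
print("logged range:", logged_gdp.max() - logged_gdp.min())
```

Compressing the range this way keeps a handful of very rich countries from dominating a histogram or a regression fit.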

The next question that I looked into was: How much correlation exists between each of these features? For this I generated a Pearson ranking visualization. A Pearson correlation is a measure of the linear correlation between two sets of data, ranging from -1 to 1. The visualization calculates that correlation for each feature with every other feature and then colors the result. A result greater than zero shows a direct linear correlation, a result less than zero shows an inverse linear correlation, and a result near zero indicates little or no linear correlation between the variables. Here is the visualization.

Again, if I had more time I would have formatted this visualization better so that the x-axis labels weren’t cut off. If it helps, the order of the features is the same on both axes. It is interesting that the strongest correlation between the features considered was between life expectancy and GDP per capita. It is also interesting that generosity had very little correlation with overall happiness (ladder score). It does make sense that perceived corruption is consistently inversely related to every other feature.

Here is the code that generated that plot:

import matplotlib.pyplot as plt
import pandas as pd
from yellowbrick.features import Rank2D

df = pd.read_csv("data/world-happiness-report-2021.csv")
features = ["Ladder score", "Logged GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]
data_array = df[features].to_numpy()

# Rank every pair of features by their Pearson correlation.
visualizer = Rank2D(features=features, algorithm='pearson')
visualizer.fit(data_array)
visualizer.transform(data_array)
visualizer.show()  # poof() was renamed to show() in Yellowbrick 1.0
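The numbers behind a heatmap like this can also be computed directly with pandas, which is handy for reading off exact values. This is a sketch on synthetic data, not the happiness dataset; the column names `x`, `y` and `z` are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: y rises with x, z falls with x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "x": x,
        "y": 2.0 * x + rng.normal(scale=0.5, size=200),
        "z": -x + rng.normal(scale=0.5, size=200),
    }
)

# Symmetric matrix of pairwise Pearson coefficients, with 1.0 on the diagonal.
corr = demo.corr(method="pearson")
print(corr.round(2))
```

Applied to the project's data, the same call on `df[features]` would give one row and one column per feature, matching the grid in the visualization.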

Finally, I was curious about how each feature’s effect on the happiness ladder score differs between countries with different amounts of GDP per capita. My idea here is that a poor country might have one feature correlate highly with happiness, while a rich country sees less of an impact on happiness as that feature is varied. To look into this, I first split the data into four sub-datasets, one per quartile of GDP per capita. I then plotted each feature against happiness and performed a linear regression. This was done for each quartile. Below is the code:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data/world-happiness-report-2021.csv")
print(df.columns)
print(df.shape)

# Split the countries into quartiles of logged GDP per capita.
quarter1 = df[df['Logged GDP per capita'] < 8.541]
quarter2 = df[(df['Logged GDP per capita'] >= 8.541) & (df['Logged GDP per capita'] < 9.569)]
quarter3 = df[(df['Logged GDP per capita'] >= 9.569) & (df['Logged GDP per capita'] < 10.421)]
quarter4 = df[df['Logged GDP per capita'] >= 10.421]

quarter = quarter1  # run this for each quarter
y_feature = ["Ladder score"] * 6
features = ["Logged GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]

fig, axes = plt.subplots(nrows=2, ncols=3)
axes = axes.ravel()
for idx, ax in enumerate(axes):
    # Fit a one-feature linear regression of the ladder score on this feature.
    model = LinearRegression()
    model.fit(quarter[features[idx]].to_numpy().reshape(-1, 1), quarter[y_feature[idx]].to_numpy().reshape(-1, 1))
    ax.scatter(quarter[features[idx]], quarter[y_feature[idx]])

    xmin, xmax = ax.get_xlim()
    # ymin, ymax = ax.get_ylim()

    # Draw the fitted line across the visible x range.
    x_new = np.linspace(xmin, xmax, 100)
    y_new = model.predict(x_new[:, np.newaxis])

    ax.plot(x_new, y_new)
    ax.set_xlabel(features[idx], fontsize=10)
    ax.set_ylabel(y_feature[idx], fontsize=10)
    ax.tick_params(axis='both', labelsize=15)

    # R2 and slope for this feature within the current quartile.
    print(model.score(quarter[features[idx]].to_numpy().reshape(-1, 1), quarter[y_feature[idx]].to_numpy().reshape(-1, 1)))
    print("coefficient: ", model.coef_)

plt.show()
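The quartile boundaries in the code above are hard-coded. As an alternative sketch, pandas can compute them from the data with `pd.qcut`. The values below are synthetic stand-ins for the logged GDP column, and the `gdp_quartile` column name and `Q1` to `Q4` labels are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "Logged GDP per capita" column (values are made up).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"Logged GDP per capita": rng.uniform(6.5, 11.5, size=148)})

# q=4 asks for four equal-sized bins; labels name each bin from poorest to richest.
demo["gdp_quartile"] = pd.qcut(
    demo["Logged GDP per capita"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

# observed=True keeps only the groups that actually occur in the data.
for name, group in demo.groupby("gdp_quartile", observed=True):
    print(name, len(group), round(group["Logged GDP per capita"].min(), 3))
```

Looping over the groups this way would also remove the need to rerun the script by hand once per quarter.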

I generated four images of plots.

Bottom 25th percentile of GDP per capita
25 to 50th percentile of GDP per capita
50 to 75th percentile of GDP per capita
75th percentile and up of GDP per capita

While this provides a nice visual, comparing these charts isn’t all that useful. That’s because I didn’t set the axes to have the same range for each variable, so you can’t really judge the slope of the line visually. Below I have reproduced the R2 values as well as the coefficient of each plot.

The R2 values convey how well the data points fit the regression line. More helpful is comparing the coefficient of each feature vs happiness across the four quartiles. You can’t use this data to draw conclusions about the relative importance of each feature in determining happiness, because the features have not been normalized and so their coefficients are on different scales. What we can do is notice that the social support coefficient grows as GDP goes up, going from the first to the last quarter, which suggests that happiness is more sensitive to social support in richer countries.

Another interesting observation is that generosity seems to have a strong correlation with happiness, but only in the highest quartile. Perceived corruption likewise seems to matter for predicting happiness only in the richest quartile of countries: in the top quartile, a high R2 and a very negative coefficient imply this, while a near-zero R2 in every other quartile implies very little correlation between corruption and happiness elsewhere.
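To make coefficients comparable across features, one common option is to standardize each feature before fitting. This was not done in the project; the sketch below uses synthetic data and scikit-learn's `StandardScaler` purely to illustrate the idea, and the two feature scales are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: two features on very different scales.
rng = np.random.default_rng(2)
small_scale = rng.uniform(0.0, 1.0, size=300)    # e.g. a 0-to-1 survey mean
large_scale = rng.uniform(50.0, 80.0, size=300)  # e.g. years of life expectancy
X = np.column_stack([small_scale, large_scale])
y = 2.0 * small_scale + 0.05 * large_scale + rng.normal(scale=0.1, size=300)

raw_coefs = LinearRegression().fit(X, y).coef_

# Rescale every feature to mean 0 and standard deviation 1 before fitting.
X_std = StandardScaler().fit_transform(X)
std_coefs = LinearRegression().fit(X_std, y).coef_

print("raw coefficients:         ", raw_coefs.round(3))
print("standardized coefficients:", std_coefs.round(3))
```

The raw coefficients differ by a large factor mostly because of units, while the standardized ones express the change in the target per one standard deviation of each feature, so they can be compared side by side.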

Building a Model

Finally, I will perform a multiple linear regression on all of the data to create a model that predicts the happiness of a country given all other features. It’s the same idea as above, but instead of correlating one feature with one dependent variable, I look for a linear relationship between all six features and the happiness ladder score. Below is my code for that, which makes use of numpy and statsmodels, in addition to the other libraries already mentioned.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as ssm  # add_constant and OLS live in statsmodels.api

df = pd.read_csv("data/world-happiness-report-2021.csv")
features = ["Logged GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]
dependent = df["Ladder score"]
independent = df[features]

# Fit the multiple linear regression with scikit-learn.
model = LinearRegression()
model.fit(independent, dependent)
intercept = model.intercept_
coefficients = model.coef_
print("R2: ", model.score(independent, dependent))
print("Intercept: ", intercept)
print("coefficients: ", coefficients)

# Repeat the fit with statsmodels to get a full statistical summary.
x = ssm.add_constant(independent)  # adds the intercept column
ols_model = ssm.OLS(dependent, x).fit()
print(ols_model.summary())

This code leaves me with the following data about the model:

R2: 0.7558471374226855

Intercept: -2.2372192944749205

coefficients: [ 0.2795329, 2.47620585, 0.03031381, 2.0104647, 0.36438194, -0.60509177]

These are the coefficients of a linear equation in six variables (a hyperplane rather than a line). They resolve into the following formula:

happiness ladder score =
-2.237
+ (Logged GDP per Capita) * 0.280
+ (Social Support) * 2.476
+ (Life Expectancy) * 0.030
+ (Freedom) * 2.010
+ (Generosity) * 0.364
+ (Perceived Corruption) * -0.605

This model could be used to predict a country’s happiness ladder score given all of the other features surveyed. The model ended up with an R2 value of 0.756. R2 is the fraction of the variance in the dependent feature that the model explains: a score of 1 indicates that the model completely predicts the dependent feature from its independent features, and a score of 0 indicates that it has no ability to predict it. A score of 0.756 is therefore pretty good, with the caveat that it was measured on the same data the model was fit to.
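As a small illustration of using the model for a prediction, the formula can be written as a plain Python function. The intercept and coefficients are the printed values above rounded to three decimals; the function name and the feature values for the example country are hypothetical, not a row from the dataset.

```python
# Intercept and coefficients as printed by the model above, rounded to three decimals.
INTERCEPT = -2.237
COEFFICIENTS = {
    "Logged GDP per capita": 0.280,
    "Social support": 2.476,
    "Healthy life expectancy": 0.030,
    "Freedom to make life choices": 2.010,
    "Generosity": 0.364,
    "Perceptions of corruption": -0.605,
}


def predict_ladder_score(features):
    """Return the intercept plus the coefficient-weighted sum of the feature values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)


# Hypothetical feature values for an imaginary country (not a row from the dataset).
example_country = {
    "Logged GDP per capita": 10.0,
    "Social support": 0.9,
    "Healthy life expectancy": 70.0,
    "Freedom to make life choices": 0.8,
    "Generosity": 0.0,
    "Perceptions of corruption": 0.5,
}

print(round(predict_ladder_score(example_country), 3))
```

With a fitted scikit-learn model in hand, `model.predict` on a one-row DataFrame with the same six columns would do the equivalent calculation without rounding.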

Conclusion

Normally you would want to refine your model further here, but, you know, time constraints. I also didn’t evaluate how much R2 each feature adds to the model, which is something you want to do to ensure that every feature in the model is actually useful in driving a prediction.
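One simple way to look at the R2 contributed by each feature is to add features one at a time and record the change. The sketch below does this on synthetic data, where the feature names `useful` and `noise` are made up for the example; it is an illustration of the idea, not an analysis of the happiness data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: "useful" drives y, "noise" is unrelated to it.
rng = np.random.default_rng(3)
n = 200
columns = {
    "useful": rng.normal(size=n),
    "noise": rng.normal(size=n),
}
y = 3.0 * columns["useful"] + rng.normal(scale=1.0, size=n)

r2_gains = {}
used = []
previous_r2 = 0.0
for name in columns:
    used.append(name)
    X = np.column_stack([columns[c] for c in used])
    r2 = LinearRegression().fit(X, y).score(X, y)
    r2_gains[name] = r2 - previous_r2  # how much R2 this feature added
    previous_r2 = r2

for name, gain in r2_gains.items():
    print(f"{name}: +{gain:.3f}")
```

Because R2 measured on the training data never decreases when a feature is added, a small positive gain does not by itself show that a feature helps; adjusted R2 or a score on held-out data is a fairer check.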

It’s been a while since I’ve taken statistics, so I hope I didn’t butcher the statistical concepts used here too badly. Obviously a data scientist would be required to have a good grounding in statistics. If I were to continue to learn more about this topic I would definitely see value in learning more there. Of course, this project only represents a narrow view of what a data scientist can do. Still it was interesting to work within a dataset and come to some simple conclusions about how happiness in countries is impacted by other considerations.
