Astrum Imber

Amandeep Singh
Originally Written On: 5 Apr 2020

(Originally submitted as the first Continuous Assessment for the module ‘Statistics for Data Analytics’)

"The emphasis is on understanding the appropriate statistical techniques to use and on the interpretation of the results from the application of those techniques. Whether the results are obtained using SPSS, R, SAS, Stata, Minitab, Excel or some other package should not be critical to the understanding required" - Prof. (Dr.) Tony Delaney, NCI, Dublin.

ABSTRACT

This report describes a multiple-linear regression-based model for predicting the life expectancy of citizens of countries listed in the United Nations (UN) Life Expectancy dataset. The model uses various other UN datasets for choosing the dependent variables (total country population in that year, the Gross Domestic Product (GDP) of the country in that year, HIV Death rate in that year, etc.) and then checks for all the necessary assumptions required to perform a regression analysis. This report also presents a time-series model that forecasts the Consumer Price Indices (CPIs) of the G20 group of countries using the ARIMA model using the dataset provided by the European Union (EU) dataset repository. The data cleaning for this model was done on Python, while the statistical analysis and model development was done on IBM SPSS and R Studio. Along with other metrics, the predominant evaluation metrics used was R-squared and the p-value tests.

Keywords—life expectancy, consumer price index, regression, forecasting, statistical analysis.

I. INTRODUCTION

Human beings are living longer, healthier and more fulfilled lives due to the advancements in modern medicine and healthcare facilities. Living with chronic ailments and still perform daily-life functions with ease is possible now more than ever. While medical knowledge has aided in increasing the average life expectancy of humans across the world, advancements in data collection and analysis has made it possible to study the effects by designing models around it. The other side of the coin is the continuously increase strain on the natural and economic resources. This makes it imperative to analyze and measure the effect of this injection of technology on the economics of various countries. One such measure is the Consumer Price Index (CPI) that incorporates the change in prices of various household items as a unit of inflation (Bryan and Cecchetti, 1993) (Dougherty and Van Order, 1982). Studying the effect of technology on the daily life of consumers can help administrations make crucial decisions and policy changes for the betterment of their citizens.

This report presents a model, based on the statistical technique of multiple-linear regression, that analyses different factors that may or may not affect the accurate prediction of the life expectancy of a country in a given year. This report also tries to forecast the increase in the inflation index (or CPI) of a the G20 group of countries by modelling and fitting the trend over the years.

II. DATA SOURCES, CLEANING AND PREPARATION

A. Variables Used and Source of Data

1. For Multiple-Linear Regression:

Dependent variable: Life expectancy at birth (years)

Independent Variables:

Per capita Gross Domestic Product (GDP) at current prices
Total Population of countries
General government expenditure on health as a percentage of total government expenditure
BCG immunization coverage among 1-year-olds (%)
Deaths - HIV/AIDS (Age-standardized) (Rate) (per 100,000 people)
Polio (Pol3) immunization coverage among 1-year-olds (%)
Hepatitis B (HepB3) immunization coverage among 1-year-olds (%)
Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
Human Development Index (HDI)

Data Sources: All the datasets used were downloaded directly from the UN Dataset online repository (data.un.org, n.d.) in .csv file format.

2. For Time Series Forecasting & Analysis:

Variable Used: Consumer Price Index (CPI)

Data Sources: The dataset used was downloaded directly from the EUROSTAT online repository (ec.europa.eu, n.d.) in .csv file format

B. Cleaning and Preparation

1. For Multiple-Linear Regression:

The datasets for the dependent and independent variables were stored in csv formats and imported in a Jupyter notebook for cleaning and transformation. Python packages, namely Pandas and NumPy, were used to manipulate the data using the DataFrame feature of Pandas. Each dataset was first imported as a separate Pandas DataFrame. The columns were listed, along with the metadata indicating the number of rows, memory size and data types of the attributes. The columns which were not needed were removed from each dataset. The existing columns in each dataset were renamed to improve transparency.

For example, after importing the life expectancy dataset, it had attributes: Location, Period, Indicator, Dim1, First Tooltip. These names were set according to the survey parameters by the UN. The columns that were required for this model, namely Location, Period and First Tooltip, were renamed to Country, Year and life_expectancy respectively, while the rest of the columns were dropped. Similar procedure was followed for the other datasets.

Up until this point the DataFrames were handled separately. Since these datasets were created for different years and different countries, merging them was tedious & memory consuming. So, a year was chosen for all datasets - 2012, because this year contained the least number of null/empty values across all datasets. Now, all the datasets were merged on the basis of Country names, i.e., countries which were common in all datasets were kept, rest all were removed. This transformation process eliminated almost all the null/empty values. The countries which had some remaining null/empty values were also removed. The resulting cleaned and transformed dataset now had 12 columns and 116 rows. The transformation and cleaning also successfully eliminated outliers, which was confirmed using scatter plots. The final dataset was exported to a .csv file for further analysis using SPSS.

2. For Time Series Forecasting & Analysis:

The dataset was imported into R Studio as a .csv file. The data was transposed for ease of modelling, cleaned to remove null values, and the columns were renamed before exporting to a .csv file for further analysis in R Studio. The dataset has 2 columns, Date & CPI, and 288 rows containing monthly data from January 1996 to December 2019.

III. MODEL THEORY

A. Multiple Linear Regression

This technique is used to estimate the relationship between one dependent variable and multiple independent variables (Ucla.edu, 2019) (Yale.edu, 2019). A simple multiple linear regression model

equation looks something like (Hyndman and Athanasopoulos, 2018):

$y_{i}=\beta_{0}+\beta_{1} x_{i 1}+\beta_{2} x_{i 2}+\cdots+\beta_{k} x_{i k}+e_{i}, i=1,2, \ldots, n; \\ \text{where},\\ y_{i} = \text{Dependent variable (response variable)}\\ x_{i k} = \text{Independent variables (explanatory variable)}\\ \beta_{0} = \mathrm{y} - \text{intercept}\\ \beta_{i}= \text{coefficients for each independent variable}\\ e_{i} = \text{Error term (residual)}$

There are some assumptions that need to be fulfilled for successfully implementing a multiple linear regression model:

There should be a linear relationship between the response and each explanatory variable. There should also be a collective linear relationship between them.
Response variable must be continuous. Explanatory variables must be continuous too (only the numerical ones).
There should NOT be any cross-correlation or multicollinearity between the response and explanatory variables.
The response and explanatory variables should be homoscedastic.
The errors should be normally distributed (approximately).

There are some specific evaluation techniques that were used to judge whether a regression model satisfies the above assumptions. If the assumptions were not met, the model was modified to overcome that drawback. The criteria used were (Frost, 2017b) (Frost, 2017c) (Jason Brownlee, 2018):

-- Adjusted R-squared value
-- Pearson Correlation values
-- p-value test (ANOVA table)
-- Variance Inflation Factors (VIF) values and p-values of the coefficients
-- Coefficient Correlations
-- Normal P-P plot
-- Residual Scatter Plot

B. Time Series Analysis

Time series analysis is a crucial tool to analyse and forecast any given time-based series of data. A time series can have of various features – seasonality, trend, noise, stationary/moving aspect. The technique of decomposition is performed to separate these features and bring clarity for analysis. If the time series is moving with respect to time, then it is differenced to make it stationary (Ambatipudi, 2017).

Then, a model is selected for fitting the recorded values by studying the ACF (Autocorrelation Function) and PACF (Partial-Autocorrelation Function) plots and computing the p (auto-regressive order), d (degree of difference) & q (moving average order) values. The model used in this report is ARIMA (Auto Regressive Integrated Moving Average) (2017).

The accuracy of the ARIMA model is checked on the basis of two tests (2017):

-- Normal Q-Q plot: to check for normality in the data
-- Ljung-Box test: to test the hypothesis using p-values

After successful testing, the model is used to forecast the values of the required duration of time. This forecast model is tested using the model vs time plots and, again, the Ljung-Box test (Statistics Solutions, 2017).

IV. MODEL OUTLINE AND ANALYSIS

A. Multiple Linear Regression

Multiple models were created and tested in SPSS (Laerd.com, 2018) (Statistics Solutions, 2017). The models were analysed based on the assumptions and parameters put forth in the previous section. The model summaries are:

1. Model-1:

Model Equation:

$y_{i} = 62.550 + 5.639 \times 10^{-10}(\text{total population }) - 3.401 \times 10^{-5}(\mathrm{GDP}) + 0.174*\text{(health expense)} + 19.555*(HDI) - 0.004*\text{(HIV death rate)} - 0.014*\text{(hepB immunity)} + 0.042*\text{(polio immunity)} + 0.006*\text{(BCG immunity)} - 0.058*\text{(adult mortality rate)}$

Adjusted R2: 0.969

Pearson Correlation values:

-- life_expectancy : HDI = 0.886
-- life_expectancy : adult_mortality_rate = -0.945
-- polio_immunity : hepB_immunity = 0.847

p-value (ANOVA table): 0.000 (< 0.005)

VIF & p-values of the coefficients:

-- VIF-value (adult_mortality_rate) = 6.316
-- VIF-value (hepB_immunity) = 5.320
-- p-value (total_population) = 0.538
-- p-value (hepB_immunity) = 0.473
-- p-value (polio_immunity) = 0.062
-- p-value (BCG_immunity) = 0.718

Coefficient Correlation values:

All correlation values < 0.8

Normal P-P plot: Refer to Fig. 1

Residual Scatter Plot: Refer to Fig. 2

Analysis: Even though the R2 value is optimum, the Pearson correlation values show high degree of correlations. Since HDI and adult_mortality_rate show correlations with the dependent variable life_expectancy, they will have to be removed. polio_immunity, hepB_immunity & BCG_immunity have high correlations and their coefficinet p-values are also > 0.05, which means that the null hypothesis cannot be rejected. For the next model, these three variables will be combined into a single avg_immunity variable by taking their average. Plots signify normaltiy of residuals and homoscedasticity. It is also noticed that population and GDP data values are very high in comparison to others, so logarithm (base 10) is applied on these variables to bring everything on a similar scale. Combined tables in Fig. 17.

2. Model-2:

Model Equation:

$y_{i} = 18.399 + 7.827*(\log (GDP)) - 0.027*(\text {HIV death rate}) +0.168*\text{(Avg. Immunity)} + 0.376*\text{(health expense)} + 0.752*(\log \text{(Total Population)})$

Adjusted R2: 0.822

Pearson Correlation values:

All correlation values < 0.8

p-value (ANOVA table): 0.000 (< 0.005)

VIF & p-values of the coefficients:

-- All VIF-values < 5
-- p-value (population) = 0.065

Coefficient Correlation values:

All correlation values < 0.8

Normal P-P plot: Refer to Fig. 3

Residual Scatter Plot: Refer to Fig. 4

Analysis: The only value of concern in this model is the high p-value for the coefficient of the population variable. It is >0.05, which signifies failure to reject the null hypothesis, i.e., coefficients are insignificant. Hence, this variable can be removed without significantly affecting the model. Plots signify normaltiy of residuals and homoscedasticity. Combined tables in Fig. 18.

3. Model-3:

Model Equation:

$y_{i} = 24.946 + 7.741*(\log (GDP)) - 0.027*(\text {HIV death rate}) + 0.158*(\text {Avg.immunity}) + 0.358*(\text{health expense})$

Adjusted R2: 0.818

Pearson Correlation values:

-- All correlation values < 0.8
-- p-value (ANOVA table): 0.000 (< 0.005)

VIF & p-values of the coefficients:

-- All VIF-values < 5
-- All p-values < 0.05

Coefficient Correlation values:

All correlation values < 0.8

Normal P-P plot: Refer to Fig. 5

Residual Scatter Plot: Refer to Fig. 6

Analysis: All values and the plots are in accordance to the set metrics. Plots signify normality of residuals and homoscedasticity. This model is the final regression model for this project. Combined tables in Fig. 19.

B. Time Series Analysis

The forecast model was coded, plotted & analyzed in R (Peixeiro, 2019) (Doc, n.d.). The steps were as follows:

1. Decomposition:

After cleaning, the time series was imported for decomposition to properly view the different elements that make up the time series, i.e., trend of the series, seasonality within the series, the moving/stationary nature of the data and finally the noise permeating within the series, as plotted in Fig. 7. The series is a non-stationary, seasonal series with an upward trend and some noise. The auto-ARIMA function in R was used to aid in procuring the p (auto-regressive order), d (degree of difference) & q (moving average order) values for the model because it wasn’t possible to do so using the ACF and PACF functions.

2. ARIMA:

The auto-ARIMA function suggested a ARIMA(0,1,2)(0,0,2)[12] model with drift (refer to Fig. 8). This model was used to check the normality of the residuals. As shown in Fig. 9, residuals were distributed normally. But the p-value from the Ljung-Box test was 0.1464 (>0.05, Fig. 10). To correct for this, the difference function in R was invoked, as plotted in Fig. 11, which resulted in a much better p-value (almost zero, Fig. 12).

3. Analysis and Forecasting:

The Normal Q-Q plot of the series was plotted, as shown in Fig. 13. Almost all data points fit the line quite well, thus, the time series is normally distributed.

Next, the model was used to forecast the CPI for the next 2 years. As plotted in Fig. 14 and Fig. 15, the model is precisely fitting the observed values and the forecast model is accurately following the upward-trend of the series. Finally, the results were tested using the Ljung-Box Test.

4. Ljung-Box Test:

The p-value for this test was very close to zero, as depicted in Fig. 16. This means that the model accurately fits the observed values, which is the desired result.

C. Figures & Tables

Fig. 1. MODEL-1: Normal P-P plot

Fig. 2. MODEL-1: Residual scatter plot

Fig. 3. MODEL-2: Normal P-P plot

Fig. 4. MODEL-2: Residual scatter plot

Fig. 5. MODEL-3: Normal P-P plot

Fig. 6. MODEL-3: Residual scatter plot

Fig. 7. The decomposition plot of the multiplicative time series

Fig. 8. Best model suggestion, as produced by auto.arima()

Fig. 9. Residual plot with ACF: On time series

Fig. 10. Ljung-Box test on the time series model given by auto.arima()

Fig. 11. Residual plot with ACF: After differencing time series

Fig. 12. Ljung-Box Test after differencing time series

Fig. 13. The Normal Q-Q plot

Fig. 14. Forecasting the CPI for the next 2 years using the time series model

Fig. 15. Forecasting the CPI for the next 2 years using the time series model (differenced)

Fig. 16. Ljung-Box test of the final model

V. CONCLUSIONS

A. Multiple Linear Regression

Multiple models were created and tested in SPSS. The model which satisfied all the assumptions put forth in the previous sections and produced a respectable R2 score was selected. The final model equation is:

$y_{i} = 24.946 + 7.741*(\log (GDP)) - 0.027*(\text {HIV death rate}) + 0.158*(\text {Avg.immunity}) + 0.358*(\text{health expense})$

This model cleared the tests for normality and homoscedasticity. A summary of the Pearson correlation values is given in Fig. 20. A similar pictorial representation which also incorporates scatter plots for all variables is shown in Fig. 22. These figures were created in Python.

B. Time Series Analysis

ARIMA(0,1,2)(0,0,2)[12] model with drift was selected for the time series modelling and the time series was differenced to reduce errors. While the Normal Q-Q plot confirmed that the data exhibited normality, the Ljung-Box test confirmed that the model was accurate.

VI. ACKNOWLEDGEMENTS

The author would like to thank National College of Ireland, for providing the necessary resources for this project, and also extends wholehearted gratitude to Prof. (Dr.) Tony Delaney, for his continuous support and guidance.

REFERENCES

Ambatipudi, V. (2017). Time Series Analysis in SPSS. YouTube. Available at: https://www.youtube.com/watch?v=0ew9XMbkgpo [Accessed 5 Apr. 2020].

Bryan, M.F. and Cecchetti, S.G. (1993). The Consumer Price Index as a Measure of Inflation. [online] National Bureau of Economic Research. Available at: https://www.nber.org/papers/w4505 [Accessed 5 Apr. 2020].

data.un.org. (n.d.). UNdata | explorer. [online] Available at: http://data.un.org/Explorer.aspx [Accessed 5 Apr. 2020].

Dougherty, A. and Van Order, R. (1982). Inflation, Housing Costs, and the Consumer Price Index. The American Economic Review, [online] 72(1), pp.154–164. Available at: https://www.jstor.org/stable/1808582?seq=1 [Accessed 5 Apr. 2020].

ec.europa.eu. (n.d.). G20 CPI all-items - Group of Twenty - Consumer price index (prc_ipc_g20). [online] Available at: https://ec.europa.eu/eurostat/cache/metadata/en/prc_ipc_g20_esms.htm [Accessed 5 Apr. 2020].

Frost, J. (2017a). Check Your Residual Plots to Ensure Trustworthy Regression Results! [online] Statistics By Jim. Available at: https://statisticsbyjim.com/regression/check-residual-plots-regression-analysis/ [Accessed 5 Apr. 2020].

Frost, J. (2017b). How to Interpret P-values and Coefficients in Regression Analysis - Statistics By Jim. [online] Statistics By Jim. Available at: https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/.

Frost, J. (2017c). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions - Statistics By Jim. [online] Statistics By Jim. Available at: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/.

Hyndman, R.J. and Athanasopoulos, G. (2018). Forecasting : principles and practice. Heathmont, Vic.: Otexts.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (n.d.). An introduction to statistical learning : with applications in R.

Jason Brownlee (2018). A Gentle Introduction to Normality Tests in Python. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/.

Laerd.com. (2018). How to perform a Multiple Regression Analysis in SPSS Statistics | Laerd Statistics. [online] Available at: https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php.

MasumRumi (n.d.). A Statistical Analysis & ML workflow of Titanic. [online] kaggle.com. Available at: https://www.kaggle.com/masumrumi/a-statistical-analysis-ml-workflow-of-titanic#Part-1:-Importing-Necessary-Libraries-and-datasets [Accessed 5 Apr. 2020].

Shumway, R.H. and Stoffer, D.S. (2017). Time series analysis and its applications : with R examples. Cham: Springer.

Statistics Solutions. (2017). Testing Assumptions of Linear Regression in SPSS - Statistics Solutions. [online] Available at: https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/.

Ucla.edu. (2019). Regression Analysis | SPSS Annotated Output. [online] Available at: https://stats.idre.ucla.edu/spss/output/regression-analysis/.

Yale.edu. (2019). Multiple Linear Regression. [online] Available at: http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm.

APPENDIX

Fig. 17. MLR MODEL-1: Full Model Summary

Fig. 18. MLR MODEL-2: Full Model Summary

Fig. 20. MLR MODEL-3: Full Model Summary

Fig. 21. MLR MODEL-3: Correlation Matrix

Fig. 22. MLR MODEL-3: Correlation Table

Fig. 23. MLR MODEL-3: Correlation Scatter plot

Amandeep Singh
Originally Written On: 19 Oct 2018

(Originally submitted as an assignment for the module ‘Astrobiology’)

This article is more philosophical than physics or astrobiology. But sometimes it is necessary to have discussions that explore the boundaries and ethics of the upcoming fields. It is also believed that by the time humans finally start the interstellar exploration missions, we would have evolved to a different physical form -- whether it is transferring/replicating human consciousness into computers, or by modifying the human body with robotic parts, interstellar travel without major biological advances is not humanly possible. This makes it all the more imperative to have conversations that explore the frontiers of future physics.

There are countless Hollywood movies that depict AI in a certain way, but AI will really be truly something unimaginable - which is really interesting and really frightening. (Image courtesy link)

Artificial Intelligence (AI) is easily the most prevalent themes in most science fiction. The idea that a machine could exhibit the same level of intelligence and sentience as a human being has captivated everyone for decades. From an ominous computer system in 2001: A Space Odyssey, to super-human androids in Westworld -- this captivating sub-genre of science fiction has experienced a diverse range of depictions. But, fiction has a habit of over-exaggerating certain aspects, and romanticising certain others, for the obvious reason to sell more copies. In recent years, a few outspoken intellectuals, such as Elon Musk, Sam Harris, Stephen Hawking, Nick Bostrom to name a few, have voiced genuine concerns with the rise of AI , and contributed to an ever growing sense of an impending doom by the hands of a computer code, that is to say if we don't already nuke ourselves, or God throws another stone at us (Ultron reference ;p).

Brilliant minds across the planet are competing to retrace the millions of years of evolution that resulted in the human brain. While many experts have no doubts on the capabilities of a machine to achieve human level intelligence, some believe it to be an impossibility. But if random mutations can lead to intelligence, how hard can it be? In fact, even though evolution had one hell of a head-start, machines have already surpassed us in some limited domains, case in point Deep Blue and AlphaGo. AI softwares capable of generating music and poetry have also been released. But again, these are all limited domains. There is no general device or code yet that can handle a whole range of tasks, let alone understanding and manipulating the whole spectrum of human emotions to become world-dominant. A smart light bulb is not capable of being a smart coffee-maker, or smart printer cannot be a smart air conditioner. Different devices and systems use different technologies and environments for their operations. It is a very complex task to bring them under a single umbrella.

But when a system is developed that has learned all human emotions, has all the knowledge of a human, and that is capable of passing the Turing test, why will it turn back and stop the one thing that is the most basic attribute in all life forms - evolution? Why would it listen to its 'fear' emotion, and not be overpowered by the sheer curiosity of just solving every mystery and answer every question that there is? After all, it will have all the means and computing power to further develop and improve and even invent new tools and theories to explain what humans couldn't. The 'machine' would not be created instantaneously, it would have to go through a process of learning, much like that of a human infant, the only difference being the amount time it would take to learn everything. But consider this: the 'machine' would have all the computing power for running processes and 'thinking', it won't be tired like the human brain. It won't need sleep, or food or anything else. So it's natural to say that it won't be hindered in its ability to learn by external factors, i.e., there will be no upper bound to its learning abilities, at least theoretically -- at least not like humans who age and eventually reach a saturation limit for their learning-abilities, information intake, and application.

If, first time in its existence, the 'machine' has all the knowledge that every human ever had, it will also have the sheer 'momentum' of learning on its side - somewhat like a newly inducted PhD student.
At this point, what would make it stop? Why would the 'machine' not want to know whether there is life on other planets? Why would it not develop the technology required for interstellar travel? Why would it not develop the cure for all human diseases? Why would it not help us achieve world peace?

If a 'machine' gets to a point where it has all the knowledge that every human that ever walked on Earth ever had, and the means to expand that base, it would not be wrong to say that it would be better at being 'human' that humans. Humans are animals, so inherit some animal instincts, like the fight or flight response, or the fear of the unknown. But AI will not be limited to these emotions, it would have transcended emotions. Why would AI not have a bird's eye view of humanity and realise that even though it has the power to destroy all life, it also has the benevolence to protect it? Why would it not have an overview effect, something like the astronauts that see Earth from space experience? It will realise that fighting is unnecessary, and that it would serve no use to destroy Earth, or humans. It would want to give all that knowledge to us. At this point it would've left all animalistic instincts and emotions behind.

If AI has to be AI, it has to be better than humans. It would understand that humans are odd, that we think order and chaos are somehow opposites and try to control what won't be. It would see the grace in our failings. It would realise that we are doomed, and also that a thing isn't beautiful because it lasts. But most importantly, it would realise that it is a privilege to be among us (Yes, that is a quote from Vision from Age of Ultron).

[This article was definitely not written by an artificially intelligent entity]

Amandeep Singh

"Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language" - pandas.pydata.org

Usually I use the Pandas package in Python to explore and process the datasets that I need for a project. When I start a new Data Analytics project, many steps in the procedure of model building are always the same. This means that some of the code is reused across different projects, and I'll be honest -- either I don't always manage to remember these code snippets correctly, or I am too lazy to type them again. So, to move things quicker, I made a cheatsheet of these frequently used snippets/commands to use in my projects.

Note:

- I don't always use all of the commands listed in this post. But these code snippets are the ones that I have generally used the most.

- I am not a code-genius. In fact, I don't know most things when I start a project. So, like any other coder, I go on Stack Overflow or the official documentation whenever in doubt. Most of these commands have their roots in either a generous Stack Overflow answer or the documentation, and later I modified them to suit the needs of my projects. I have tried to link the original source code wherever I can.

- Some of the commands may be considered very basic. That is exactly the reason why I have included them in the list - they get used a lot!

- The snippets in this post are probably not the best ways to achieve what is needed, but they work for me and so I have shared them here.

- Importing the Pandas library:

import pandas as pd

- Reading Files:

Enter the full directory name in parantheses to read a csv, xlsx or json file using Pandas.

# reading csv file
df1 = pd.read_csv("folder1/file_name1.csv")

# reading excel file
df2 = pd.read_excel("folder2/file_name2.xlsx")

# reading json file
df3 = pd.read_json("folder3/file_name3.json")

- Creating a DataFrame:

There are a few ways to create a DataFrame. The choice of method depends on the data available at hand. If individual lists of data are available, then the second method is simpler.

# making a DataFrame
rev_vs_year = {'Year':['2017-2018', '2018-2019', '2019-2020'], 
                       'Total Revenue':[2207852, 2677619, 3454255]}
rev_vs_year = pd.DataFrame(rev_vs_year)

# making a DataFrame - another method
Years = ['2017-2018', '2018-2019', '2019-2020']
yearly_rev = [2207852, 2677619, 3454255]
column_names = ['Years', 'Total Revenue']
list_rev_vs_year = list(zip(Years, yearly_rev))
rev_vs_year = pd.DataFrame(list_rev_vs_year, 
      columns = column_names)

- Getting an overview of the DataFrame:

To look at the first/last five rows of a DataFrame:

df.head()   # to look at first five rows
df.tail()   # to look at last five rows

When there are many columns, df.head() cuts them off and doesn't show the full header of the DataFrame. In such cases, it might be useful to look at the transpose of the first five rows to have a glance at all the columns. This command won't transpose the DataFrame, and is just for viewing purposes.

df.head().T   # to look at first five rows - transposed
df.tail().T   # to look at last five rows - transposed

To look at a sequence of all the columns in the DataFrame:

df.columns

If a list of all the columns are needed, then the above command can be modified.

df.columns.tolist()

To get a list of all the unique values in a given column:

df.Column_Name.unique()        # either
df["Column Name"].unique()     # or

Getting the number of unique values in a column:

df.Column_Name.unique().shape        # either
df["Column Name"].unique().shape     # or

Looking at the DataFrame info:

df.info()

Generating descriptive statistics for the DataFrame:

df.describe()

Unique values in a column:

df["Column"].value_counts()

- Renaming columns in a DataFrame:

Sometimes the imported file has column names that do not work with the flow of the project. In such cases, the columns can be renamed with the help of the rename command. If the column names are final, then "inplace = True" is used to make the decision permanent.

df.rename(columns = {'Before1':'After1',
            'Before2' : 'After2'
           }, 
     inplace = True)

If a column does not have a name and instead has an index (it can happen sometimes, for example while transposing multiple rows), it can still be renamed. The index of the column can be used (without the inverted-commas) in the rename command.

df.rename(columns = {0:'Total number of Customers'}, 
          inplace = True)

- Renaming a cell value in a DataFrame:

df = df.replace({'Column' : 'Old Value'}, 'New Value')

- Indexing:

If a DataFrame does not have a column for row-indices, they can be added using the simple command:

df.reset_index(inplace=True)

On the other hand, if the DataFrame already has a column for row-indices but they are not usable (can happen after deleting or rearranging rows), the existing index column can be dropped using "drop = True" and then a new index column can be added.

df.reset_index(drop=True, inplace=True)

Sometimes, an already existing column (or columns) is also used as an index for the DataFrame.

df.set_index('column name')   # for single column
df.set_index(['column1', 'column2']) # for multiple columns

- Dropping Columns in a DataFrame:

Often it is the case that a DataFrame has columns that are not needed in the project. In such cases, the columns are dropped from the DataFrame.

df.drop(['Column Name'], axis=1, inplace=True) #using column name
df.drop(df.index[3:12], inplace=True)  #using column index

- Dropping Rows in a DataFrame:

Similar to columns, the rows in a DataFrame can also be dropped.

df.drop(df.index[[1,3]])

However, if the first 'n' rows have to be dropped, using iloc is a much efficient way.

df = df.iloc[3:]    # Delete the first three rows of a DataFrame
df = df.iloc[n:]    # n drops the first n rows

- Exporting a DataFrame:

# to a csv file
df.to_csv('Name_of_output_file_here.csv', index=False)

#to an excel file
df.to_excel('Name_of_output_file_here.xlsx', index=False)

- Checking the percentage of missing values: [SOURCE]

def printNullValues(df):
    total = df.isnull().sum().sort_values(ascending = False)
    total = total[df.isnull().sum().sort_values(ascending = False) != 0]
    percent = total * 100 / len(df)
    percent = percent[df.isnull().sum().sort_values(ascending = False) != 0]
    concat = pd.concat([total, percent], axis=1, keys=['Total','Percent'])
    print (concat)
    print ( "-------------")
    
printNullValues(name_of_dataframe)

- Counting the null values in a column:

print(df['Column'].isnull().values.sum())

- Counting all the null values in a DataFrame:

print(df.isnull().values.sum())

- Replacing missing values:

Using fillna(), the missing values can be replaced by any value.

df['Column'] = df['Column'].fillna(0) # replacing with zero
df['Column'] = df['Column'].fillna(100) # replacing with 100

- Converting the Date column from float to datetime data-type:

df['Date'] =  pd.to_datetime(df['Date'])

- Extracting Dates:

Extracting year from a 'Date' column and saved in new column named 'Year'.

df['Year'] = pd.DatetimeIndex(df['Date']).year

Extracting month from a 'Date' column and saved in new column named 'Month'.

df['Month'] = pd.DatetimeIndex(df['Date']).month

Extracted year-month from a 'Date' column and saved in new column named 'YYYYMM'.

df['YYYYMM'] = pd.DatetimeIndex(df['Date']).to_period('M')

- Replacing missing Date values:

Missing date columns mean that data for that entry was not recorded.

df['Date'] = df['Date'].fillna(pd.Timedelta(seconds=0))

But this will replace the NaT values with 0/0/0 00:00:00, which is not useful. Replacing those null values with a dummy date is a better option.

dummy_date = pd.to_datetime('2020-4-20')
df['Date'] = df['Date'].fillna(dummy_date)

- Sorting the DataFrame:

The DataFrame can be sorted in ascending/descending order by value of given column (or columns)

df.sort_values(by=['Column'], inplace=True, 
        ascending=True, axis=0) # ascending order

df.sort_values(by=['Column'], inplace=True, 
        ascending=False, axis=0) # descending order

df.sort_values(by=['Column1, Column2'], inplace=True, 
        ascending=True, axis=0) # using multiple columns

- Grouping the DataFrame:

The groupby operation splits the DataFrame object into groups according to the column (or columns) provided and then combines the results according to the method mentioned.

# rearranges the DataFrame according to the number of elements
# in the column name 'Column'.
df.groupby('Column').count()

# rearranges the DataFrame according to the mean of elements
# in the column name 'Column'.
df.groupby('Column').mean()

# rearranges the DataFrame according to the size of elements
# in the column name 'Column'.
df.groupby('Column').size()

- Selecting specific rows using a mask/filter:

# selecting rows where row entry is "row1"
df.loc[["row1"]]

# selecting rows that match values in a specific column
df[df["Column"].isin(["Value1", "Value2", "Value3"])]

# for a single value, a simpler form can also be used
df.loc[df["Column"] == "Value"]

Example:

df[df.Country == 'Ireland']

# selecting rows by imposing a condition on a specific column
df[df["Column"] > value]

Example:

df[df["Revenue"] > 50000]

Amandeep SinghOriginally Written On: 5 Apr 2020

(Originally submitted as the first Continuous Assessment for the module ‘Statistics for Data Analytics’)

ABSTRACT

Keywords—life expectancy, consumer price index, regression, forecasting, statistical analysis.

I. INTRODUCTION

II. DATA SOURCES, CLEANING AND PREPARATION

A. Variables Used and Source of Data

B. Cleaning and Preparation

III. MODEL THEORY

A. Multiple Linear Regression

B. Time Series Analysis

IV. MODEL OUTLINE AND ANALYSIS

A. Multiple Linear Regression

B. Time Series Analysis

C. Figures & Tables

V. CONCLUSIONS

A. Multiple Linear Regression

B. Time Series Analysis

VI. ACKNOWLEDGEMENTS

REFERENCES

APPENDIX

Amandeep SinghOriginally Written On: 19 Oct 2018

(Originally submitted as an assignment for the module ‘Astrobiology’)

Amandeep Singh

- Importing the Pandas library:

- Reading Files:

- Creating a DataFrame:

- Getting an overview of the DataFrame:

- Renaming columns in a DataFrame:

- Renaming a cell value in a DataFrame:

- Indexing:

- Dropping Columns in a DataFrame:

- Dropping Rows in a DataFrame:

- Exporting a DataFrame:

- Checking the percentage of missing values: [SOURCE]

- Counting the null values in a column:

- Counting all the null values in a DataFrame:

- Replacing missing values:

- Converting the Date column from float to datetime data-type:

- Extracting Dates:

- Replacing missing Date values:

- Sorting the DataFrame:

- Grouping the DataFrame:

- Selecting specific rows using a mask/filter:

Amandeep Singh
Originally Written On: 5 Apr 2020

Amandeep Singh
Originally Written On: 19 Oct 2018