Amandeep Singh
Originally Written On: 5 Apr 2020
(Originally submitted as the first Continuous Assessment for the module ‘Statistics for Data Analytics’)
ABSTRACT
This report describes a multiple linear regression model for predicting the life expectancy of citizens of the countries listed in the United Nations (UN) Life Expectancy dataset. The model draws its independent variables from various other UN datasets (total country population in that year, the Gross Domestic Product (GDP) of the country in that year, the HIV death rate in that year, etc.) and is then checked against all the assumptions required to perform a regression analysis. This report also presents a time-series model that forecasts the Consumer Price Index (CPI) of the G20 group of countries with an ARIMA model, using a dataset provided by the European Union (EU) dataset repository. The data cleaning for the regression model was done in Python, while the statistical analysis and model development were done in IBM SPSS and R Studio. Along with other metrics, the predominant evaluation metrics used were R-squared and p-value tests.
Keywords—life expectancy, consumer price index, regression, forecasting, statistical analysis.
I. INTRODUCTION
Human beings are living longer, healthier and more fulfilled lives due to advancements in modern medicine and healthcare facilities. Living with chronic ailments while still performing daily-life functions with ease is possible now more than ever. While medical knowledge has helped increase the average life expectancy of humans across the world, advancements in data collection and analysis have made it possible to study these effects by designing models around them. The other side of the coin is the continuously increasing strain on natural and economic resources. This makes it imperative to analyze and measure the effect of this injection of technology on the economies of various countries. One such measure is the Consumer Price Index (CPI), which incorporates the change in prices of various household items as a measure of inflation (Bryan and Cecchetti, 1993) (Dougherty and Van Order, 1982). Studying the effect of technology on the daily life of consumers can help administrations make crucial decisions and policy changes for the betterment of their citizens.
This report presents a model, based on the statistical technique of multiple linear regression, that analyses different factors that may or may not affect the accurate prediction of the life expectancy of a country in a given year. This report also tries to forecast the increase in the inflation index (or CPI) of the G20 group of countries by modelling and fitting the trend over the years.
II. DATA SOURCES, CLEANING AND PREPARATION
A. Variables Used and Source of Data
1. For Multiple-Linear Regression:
Dependent variable: Life expectancy at birth (years)
Independent Variables:
- Per capita Gross Domestic Product (GDP) at current prices
- Total Population of countries
- General government expenditure on health as a percentage of total government expenditure
- BCG immunization coverage among 1-year-olds (%)
- Deaths - HIV/AIDS (Age-standardized) (Rate) (per 100,000 people)
- Polio (Pol3) immunization coverage among 1-year-olds (%)
- Hepatitis B (HepB3) immunization coverage among 1-year-olds (%)
- Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
- Human Development Index (HDI)
Data Sources: All the datasets used were downloaded directly from the UN Dataset online repository (data.un.org, n.d.) in .csv file format.
2. For Time Series Forecasting & Analysis:
Variable Used: Consumer Price Index (CPI)
Data Sources: The dataset used was downloaded directly from the EUROSTAT online repository (ec.europa.eu, n.d.) in .csv file format.
B. Cleaning and Preparation
1. For Multiple-Linear Regression:
The datasets for the dependent and independent variables were stored in .csv format and imported into a Jupyter notebook for cleaning and transformation. The Python packages Pandas and NumPy were used to manipulate the data using the DataFrame feature of Pandas. Each dataset was first imported as a separate Pandas DataFrame. The columns were listed, along with the metadata indicating the number of rows, memory size and data types of the attributes. The columns which were not needed were removed from each dataset. The remaining columns in each dataset were renamed to improve readability.
For example, after importing the life expectancy dataset, it had the attributes Location, Period, Indicator, Dim1 and First Tooltip. These names were set according to the survey parameters by the UN. The columns that were required for this model, namely Location, Period and First Tooltip, were renamed to Country, Year and life_expectancy respectively, while the rest of the columns were dropped. A similar procedure was followed for the other datasets.
Up until this point the DataFrames were handled separately. Since these datasets were created for different years and different countries, merging them was tedious and memory consuming. So, a single year, 2012, was chosen for all datasets, because this year contained the least number of null/empty values across all datasets. All the datasets were then merged on the basis of country names, i.e., countries which were common to all datasets were kept and the rest were removed. This transformation process eliminated almost all the null/empty values. The countries which had some remaining null/empty values were also removed. The resulting cleaned and transformed dataset had 12 columns and 116 rows. The transformation and cleaning also successfully eliminated outliers, which was confirmed using scatter plots. The final dataset was exported to a .csv file for further analysis using SPSS.
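The cleaning and merging steps described above can be sketched in Pandas as follows. This is a minimal illustration rather than the notebook used for the report: the two small frames stand in for the downloaded .csv files, their numbers are made up, and the GDP column names (`Country or Area`, `Value`, `gdp_per_capita`) and the helper `prepare` are assumptions introduced only for this example.

```python
import pandas as pd

# Hypothetical stand-ins for two of the downloaded files (values are made up);
# in practice each frame would come from pd.read_csv(<path to the .csv file>).
life_raw = pd.DataFrame({
    "Location": ["Ireland", "Ireland", "India", "Chad"],
    "Period": [2012, 2011, 2012, 2012],
    "Indicator": ["Life expectancy at birth (years)"] * 4,
    "Dim1": ["Both sexes"] * 4,
    "First Tooltip": [80.9, 80.6, 67.3, 51.6],
})
gdp_raw = pd.DataFrame({
    "Country or Area": ["Ireland", "India", "Chad", "Peru"],
    "Year": [2012, 2012, 2012, 2012],
    "Value": [48900.0, 1440.0, None, 6390.0],
})


def prepare(df, rename_map, year=2012):
    """Keep only the mapped columns, rename them, and filter to a single year."""
    out = df[list(rename_map)].rename(columns=rename_map)
    return out.loc[out["Year"] == year].drop(columns="Year")


life = prepare(life_raw, {"Location": "Country", "Period": "Year",
                          "First Tooltip": "life_expectancy"})
gdp = prepare(gdp_raw, {"Country or Area": "Country", "Year": "Year",
                        "Value": "gdp_per_capita"})

# The inner join keeps only countries present in both frames; rows that
# still contain a missing value afterwards are dropped.
merged = life.merge(gdp, on="Country", how="inner").dropna().reset_index(drop=True)
print(merged)
```

With more datasets, the same `merge` call would simply be chained once per additional frame.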
2. For Time Series Forecasting & Analysis:
The dataset was imported into R Studio as a .csv file. The data was transposed for ease of modelling, cleaned to remove null values, and the columns were renamed before exporting to a .csv file for further analysis in R Studio. The dataset has 2 columns, Date & CPI, and 288 rows containing monthly data from January 1996 to December 2019.
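Although this step was carried out in R Studio for the report, the same reshaping can be sketched in Pandas for comparison. The wide layout below (one row for the aggregate, one column per month labelled like `1996M01`) and the `GEO` column are assumptions about how the downloaded file might look, and the CPI values are made up.

```python
import pandas as pd

# Hypothetical wide layout: one row for the G20 aggregate, one column per month.
wide = pd.DataFrame({
    "GEO": ["G20"],
    "1996M01": [60.1],
    "1996M02": [float("nan")],
    "1996M03": [60.5],
})

# Transpose so that each month becomes a row holding a single CPI value.
cpi = wide.set_index("GEO").T
cpi.columns = ["CPI"]
cpi.index = pd.to_datetime(cpi.index, format="%YM%m")

# Name the date column and drop months without a recorded value.
cpi = cpi.rename_axis("Date").reset_index().dropna(subset=["CPI"])
print(cpi)
```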
III. MODEL THEORY
A. Multiple Linear Regression
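This technique is used to estimate the relationship between one dependent variable and multiple independent variables (Ucla.edu, 2019) (Yale.edu, 2019). In its general form, a multiple linear regression model equation looks like (Hyndman and Athanasopoulos, 2018):
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$$
where $y$ is the response variable, $x_1, \dots, x_k$ are the explanatory variables, $\beta_0, \dots, \beta_k$ are the coefficients and $\varepsilon$ is the error term. The model rests on the following assumptions: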
- There should be a linear relationship between the response and each explanatory variable. There should also be a collective linear relationship between them.
- The response variable must be continuous. The numerical explanatory variables must be continuous as well.
- There should NOT be any multicollinearity, i.e., strong correlation among the explanatory variables.
- The residuals should be homoscedastic, i.e., have a constant variance across the range of fitted values.
- The errors should be normally distributed (approximately).
There are some specific evaluation techniques that were used to judge whether a regression model satisfies the above assumptions. If the assumptions were not met, the model was modified to overcome that drawback. The criteria used were (Frost, 2017b) (Frost, 2017c) (Jason Brownlee, 2018):
- Adjusted R-squared value
- Pearson Correlation values
- p-value test (ANOVA table)
- Variance Inflation Factors (VIF) values and p-values of the coefficients
- Coefficient Correlations
- Normal P-P plot
- Residual Scatter Plot
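Two of the criteria listed above, the adjusted R-squared value and the VIF values, are simple to compute by hand. The NumPy sketch below is only an illustration on synthetic data (the report obtained these values from SPSS); the function names `fit_ols`, `adjusted_r2` and `vif` are introduced here for the example.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2


def adjusted_r2(r2, n, k):
    """Penalise R^2 for the number of predictors k, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    values = []
    for j in range(X.shape[1]):
        _, r2_j = fit_ols(np.delete(X, j, axis=1), X[:, j])
        values.append(1.0 / (1.0 - r2_j))
    return np.array(values)


# Synthetic data (not the UN dataset): x3 is almost a copy of x1,
# so both of them receive a large VIF while x2 stays close to 1.
rng = np.random.default_rng(0)
n = 116
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

beta, r2 = fit_ols(X, y)
print("adjusted R^2:", adjusted_r2(r2, n, X.shape[1]))
print("VIF:", vif(X))
```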
B. Time Series Analysis
Time series analysis is a crucial tool to analyse and forecast any given time-based series of data. A time series can have various features: seasonality, trend, noise and a stationary or non-stationary character. The technique of decomposition is performed to separate these features and bring clarity for analysis. If the time series is non-stationary, i.e., its statistical properties change with time, then it is differenced to make it stationary (Ambatipudi, 2017).
Then, a model is selected for fitting the recorded values by studying the ACF (Autocorrelation Function) and PACF (Partial-Autocorrelation Function) plots and computing the p (auto-regressive order), d (degree of difference) & q (moving average order) values. The model used in this report is ARIMA (Auto Regressive Integrated Moving Average) (2017).
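As a small illustration of differencing and of the ACF mentioned above, the following NumPy sketch computes the sample autocorrelation before and after taking a first difference. It uses a synthetic random walk with drift rather than the CPI data, the helper `acf` is introduced only for this example, and the PACF is not covered here.

```python
import numpy as np


def acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = x @ x
    return np.array([1.0 if k == 0 else (x[k:] @ x[:-k]) / denom
                     for k in range(nlags + 1)])


# Synthetic trending series (not the CPI data): a random walk with drift.
rng = np.random.default_rng(1)
series = np.cumsum(0.2 + rng.normal(size=288))

# First difference, corresponding to d = 1 in ARIMA(p, d, q).
diffed = np.diff(series, n=1)

# The original series keeps a lag-1 autocorrelation near 1 (slow decay),
# whereas the differenced series behaves much more like white noise.
print("lag-1 ACF, original:   ", acf(series, 1)[1])
print("lag-1 ACF, differenced:", acf(diffed, 1)[1])
```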
The accuracy of the ARIMA model is checked on the basis of two tests (2017):
- Normal Q-Q plot: to check for normality in the data
- Ljung-Box test: to test the residuals for autocorrelation using p-values
After successful testing, the model is used to forecast values over the required period of time. This forecast model is tested using plots of the fitted and forecast values against time and, again, the Ljung-Box test (Statistics Solutions, 2017).
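For reference, the Ljung-Box statistic is Q = n(n+2) Σ ρ_k² / (n − k), summed over the first h lags, and it is compared against a chi-square distribution; the null hypothesis of the test is that the data are independently distributed, i.e., show no autocorrelation up to lag h. The sketch below is a simplified stand-in for the R routine used in the report: the function `ljung_box` is introduced for illustration, it only accepts an even number of lags so that the chi-square tail has a closed form (`scipy.stats.chi2.sf` covers the general case), and it does not reduce the degrees of freedom for fitted model parameters.

```python
import math

import numpy as np


def ljung_box(resid, lags):
    """Return (Q, p-value) of the Ljung-Box test for an even number of lags."""
    if lags % 2:
        raise ValueError("this sketch only handles an even number of lags")
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = e @ e
    ks = np.arange(1, lags + 1)
    rho = np.array([(e[k:] @ e[:-k]) / denom for k in ks])
    q = n * (n + 2) * np.sum(rho ** 2 / (n - ks))
    # Chi-square survival function; this closed form holds for even degrees of freedom.
    half = q / 2.0
    p = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(lags // 2))
    return float(q), float(p)


# Synthetic examples: white noise versus a strongly autocorrelated AR(1) series.
rng = np.random.default_rng(2)
white = rng.normal(size=288)
ar1 = np.empty(288)
ar1[0] = white[0]
for t in range(1, 288):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

q_white, p_white = ljung_box(white, lags=10)
q_ar1, p_ar1 = ljung_box(ar1, lags=10)
print("white noise: Q=%.2f, p=%.4f" % (q_white, p_white))
print("AR(1):       Q=%.2f, p=%.4g" % (q_ar1, p_ar1))
```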
IV. MODEL OUTLINE AND ANALYSIS
A. Multiple Linear Regression
Multiple models were created and tested in SPSS (Laerd.com, 2018) (Statistics Solutions, 2017). The models were analysed based on the assumptions and parameters put forth in the previous section. The model summaries are:
1. Model-1:
Model Equation:
Adjusted R2: 0.969
Pearson Correlation values:
- life_expectancy : HDI = 0.886
- life_expectancy : adult_mortality_rate = -0.945
- polio_immunity : hepB_immunity = 0.847
p-value (ANOVA table): 0.000 (< 0.005)
VIF & p-values of the coefficients:
- VIF-value (adult_mortality_rate) = 6.316
- VIF-value (hepB_immunity) = 5.320
- p-value (total_population) = 0.538
- p-value (hepB_immunity) = 0.473
- p-value (polio_immunity) = 0.062
- p-value (BCG_immunity) = 0.718
Coefficient Correlation values:
All correlation values < 0.8
Normal P-P plot: Refer to Fig. 1
Residual Scatter Plot: Refer to Fig. 2
Analysis: Even though the adjusted R2 value is high, the Pearson correlation values show a high degree of correlation. Since HDI and adult_mortality_rate are strongly correlated with the dependent variable life_expectancy, they will have to be removed. polio_immunity, hepB_immunity & BCG_immunity have high correlations and their coefficient p-values are also > 0.05, which means that the null hypothesis cannot be rejected. For the next model, these three variables will be combined into a single avg_immunity variable by taking their average. The plots indicate normality of residuals and homoscedasticity. It is also noticed that the population and GDP values are very high in comparison to the others, so a logarithm (base 10) is applied to these variables to bring everything onto a similar scale. The combined tables are given in Fig. 17.
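The two transformations decided on here, averaging the three immunization variables and taking the base-10 logarithm of population and GDP, can be sketched in Pandas as below. The rows are made up for illustration, and the names `gdp_per_capita`, `log_population` and `log_gdp` are assumptions; the other column names follow the variable names used in this report.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the values are made up.
df = pd.DataFrame({
    "total_population": [1.0e6, 1.0e9, 1.0e7],
    "gdp_per_capita": [1.0e4, 1.0e3, 1.0e2],
    "polio_immunity": [90.0, 70.0, 45.0],
    "hepB_immunity": [90.0, 74.0, 42.0],
    "BCG_immunity": [60.0, 90.0, 63.0],
})

immunity_cols = ["polio_immunity", "hepB_immunity", "BCG_immunity"]

# Replace the three correlated immunization variables with their row-wise mean.
df["avg_immunity"] = df[immunity_cols].mean(axis=1)

# Bring the very large population and GDP values onto a comparable scale.
df["log_population"] = np.log10(df["total_population"])
df["log_gdp"] = np.log10(df["gdp_per_capita"])

df = df.drop(columns=immunity_cols + ["total_population", "gdp_per_capita"])
print(df)
```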
2. Model-2:
Model Equation:
Adjusted R2: 0.822
Pearson Correlation values:
All correlation values < 0.8
p-value (ANOVA table): 0.000 (< 0.005)
- All VIF-values < 5
- p-value (population) = 0.065
Coefficient Correlation values:
All correlation values < 0.8
Normal P-P plot: Refer to Fig. 3
Residual Scatter Plot: Refer to Fig. 4
Analysis: The only value of concern in this model is the high p-value for the coefficient of the population variable. It is > 0.05, which signifies failure to reject the null hypothesis, i.e., the coefficient is not statistically significant. Hence, this variable can be removed without significantly affecting the model. The plots indicate normality of residuals and homoscedasticity. The combined tables are given in Fig. 18.
3. Model-3:
Model Equation:
Adjusted R2: 0.818
- All correlation values < 0.8
- p-value (ANOVA table): 0.000 (< 0.005)
- All VIF-values < 5
- All p-values < 0.05
Coefficient Correlation values:
All correlation values < 0.8
Normal P-P plot: Refer to Fig. 5
Residual Scatter Plot: Refer to Fig. 6
Analysis: All values and plots are in accordance with the set metrics. The plots indicate normality of residuals and homoscedasticity. This model is the final regression model for this project. The combined tables are given in Fig. 19.
B. Time Series Analysis
The forecast model was coded, plotted & analyzed in R (Peixeiro, 2019) (Doc, n.d.). The steps were as follows:
1. Decomposition:
After cleaning, the time series was imported for decomposition in order to view the different elements that make up the series, i.e., its trend, its seasonality, its stationary or non-stationary nature and finally the noise within the series, as plotted in Fig. 7. The series is a non-stationary, seasonal series with an upward trend and some noise. The auto-ARIMA function in R was used to obtain the p (auto-regressive order), d (degree of difference) & q (moving average order) values for the model, because it was not possible to do so using the ACF and PACF functions.
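To make the decomposition step more concrete, the sketch below implements a classical multiplicative decomposition (series = trend × seasonal × remainder) with NumPy. It is an assumption that this mirrors the routine used in R for Fig. 7; the function `decompose_multiplicative` is introduced for illustration, it assumes an even seasonal period, and the input is a synthetic monthly series rather than the CPI data.

```python
import numpy as np


def decompose_multiplicative(x, period=12):
    """Classical decomposition x = trend * seasonal * remainder (even period assumed)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    half = period // 2
    # Centred 2 x period moving average: weights 1/2, 1, ..., 1, 1/2 over period + 1 points.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(x, weights, mode="valid")
    detrended = x / trend
    # Average the detrended values by position in the cycle, then rescale to mean 1.
    pos = np.arange(n) % period
    seasonal_idx = np.array([np.nanmean(detrended[pos == p]) for p in range(period)])
    seasonal_idx /= seasonal_idx.mean()
    seasonal = seasonal_idx[pos]
    remainder = x / (trend * seasonal)
    return trend, seasonal, remainder


# Synthetic monthly series: linear upward trend times a small yearly cycle.
t = np.arange(288)
x = (60.0 + 0.15 * t) * (1.0 + 0.02 * np.sin(2 * np.pi * t / 12))

trend, seasonal, remainder = decompose_multiplicative(x, period=12)
print("seasonal indices:", np.round(seasonal[:12], 4))
print("largest remainder deviation from 1:", np.nanmax(np.abs(remainder - 1.0)))
```

The first and last half period of the trend (and therefore of the remainder) are left as NaN, because the centred moving average is not defined there.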
2. ARIMA:
The auto-ARIMA function suggested an ARIMA(0,1,2)(0,0,2)[12] model with drift (refer to Fig. 8). This model was used to check the normality of the residuals. As shown in Fig. 9, the residuals were normally distributed. But the p-value from the Ljung-Box test was 0.1464 (>0.05, Fig. 10). To correct for this, the difference function in R was invoked, as plotted in Fig. 11, which resulted in a much better p-value (almost zero, Fig. 12).
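For readers unfamiliar with the notation, an ARIMA(0,1,2)(0,0,2)[12] model with drift can be written out with the backshift operator $B$ (where $B y_t = y_{t-1}$), following the parameterisation in Hyndman and Athanasopoulos (2018). The symbols $c$, $\theta_1$, $\theta_2$, $\Theta_1$ and $\Theta_2$ below are placeholders for the fitted drift and moving-average coefficients; their estimated values are not restated here.

```latex
% d = 1: one non-seasonal difference; q = 2: non-seasonal MA(2);
% Q = 2 with period 12: seasonal MA(2); c: drift term.
(1 - B)\, y_t = c + \left(1 + \theta_1 B + \theta_2 B^{2}\right)
                    \left(1 + \Theta_1 B^{12} + \Theta_2 B^{24}\right) \varepsilon_t
```

Here $y_t$ is the CPI in month $t$ and $\varepsilon_t$ is the white-noise error term.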
3. Analysis and Forecasting:
The Normal Q-Q plot of the series was plotted, as shown in Fig. 13. Almost all data points fit the line quite well, so the time series is approximately normally distributed.
Next, the model was used to forecast the CPI for the next 2 years. As plotted in Fig. 14 and Fig. 15, the model fits the observed values closely and the forecast follows the upward trend of the series. Finally, the results were tested using the Ljung-Box test.
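The points of a normal Q-Q plot can be computed directly, which also gives a numerical summary of how well they follow a straight line. The sketch below uses a synthetic normal sample rather than the CPI series; the helper `qq_points` and the choice of plotting positions (i − 0.5)/n are assumptions made for this example.

```python
from statistics import NormalDist

import numpy as np


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = np.sort(np.asarray(sample, dtype=float))
    n = len(observed)
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, observed


# Synthetic sample (not the CPI data).
rng = np.random.default_rng(3)
theoretical, observed = qq_points(rng.normal(loc=100.0, scale=5.0, size=288))

# A correlation close to 1 means the points lie close to a straight line.
r = np.corrcoef(theoretical, observed)[0, 1]
print("Q-Q correlation:", r)
```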
4. Ljung-Box Test:
The p-value for this test was very close to zero, as depicted in Fig. 16. This means that the model accurately fits the observed values, which is the desired result.
C. Figures & Tables
Fig. 1. MODEL-1: Normal P-P plot
Fig. 2. MODEL-1: Residual scatter plot
Fig. 3. MODEL-2: Normal P-P plot
Fig. 4. MODEL-2: Residual scatter plot
Fig. 5. MODEL-3: Normal P-P plot
Fig. 6. MODEL-3: Residual scatter plot
Fig. 7. The decomposition plot of the multiplicative time series
Fig. 8. Best model suggestion, as produced by auto.arima()
Fig. 9. Residual plot with ACF: On time series
Fig. 10. Ljung-Box test on the time series model given by auto.arima()
Fig. 11. Residual plot with ACF: After differencing time series
Fig. 12. Ljung-Box Test after differencing time series
Fig. 13. The Normal Q-Q plot
Fig. 14. Forecasting the CPI for the next 2 years using the time series model
Fig. 15. Forecasting the CPI for the next 2 years using the time series model (differenced)
Fig. 16. Ljung-Box test of the final model
V. CONCLUSIONS
A. Multiple Linear Regression
This model cleared the tests for normality and homoscedasticity. A summary of the Pearson correlation values is given in Fig. 20. A similar pictorial representation which also incorporates scatter plots for all variables is shown in Fig. 22. These figures were created in Python.
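A Pearson correlation summary of the kind referred to above can be produced in Pandas with `DataFrame.corr`. The sketch below runs on synthetic data, not the report's dataset; only the column names are borrowed from the report, and the 0.8 cut-off is the threshold used in the model summaries.

```python
import numpy as np
import pandas as pd

# Synthetic data: life_expectancy is built to depend strongly on HDI,
# while avg_immunity is generated independently of both.
rng = np.random.default_rng(4)
hdi = rng.uniform(0.35, 0.95, size=116)
df = pd.DataFrame({
    "HDI": hdi,
    "life_expectancy": 45 + 40 * hdi + rng.normal(scale=2.0, size=116),
    "avg_immunity": rng.uniform(40, 99, size=116),
})

corr = df.corr(method="pearson")
print(corr.round(3))

# List every pair of variables whose absolute correlation exceeds 0.8.
cols = list(corr.columns)
high = [(a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols) for b in cols[i + 1:]
        if abs(corr.loc[a, b]) > 0.8]
print(high)
```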
B. Time Series Analysis
An ARIMA(0,1,2)(0,0,2)[12] model with drift was selected for the time series modelling and the time series was differenced to reduce errors. While the Normal Q-Q plot confirmed that the data exhibited normality, the Ljung-Box test confirmed that the model was accurate.
VI. ACKNOWLEDGEMENTS
The author would like to thank the National College of Ireland for providing the necessary resources for this project, and also extends wholehearted gratitude to Prof. (Dr.) Tony Delaney for his continuous support and guidance.
REFERENCES
Bryan, M.F. and Cecchetti, S.G. (1993). The Consumer Price Index as a Measure of Inflation. [online] National Bureau of Economic Research. Available at: https://www.nber.org/papers/w4505 [Accessed 5 Apr. 2020].
data.un.org. (n.d.). UNdata | explorer. [online] Available at: http://data.un.org/Explorer.aspx [Accessed 5 Apr. 2020].
Dougherty, A. and Van Order, R. (1982). Inflation, Housing Costs, and the Consumer Price Index. The American Economic Review, [online] 72(1), pp.154–164. Available at: https://www.jstor.org/stable/1808582?seq=1 [Accessed 5 Apr. 2020].
ec.europa.eu. (n.d.). G20 CPI all-items - Group of Twenty - Consumer price index (prc_ipc_g20). [online] Available at: https://ec.europa.eu/eurostat/cache/metadata/en/prc_ipc_g20_esms.htm [Accessed 5 Apr. 2020].
Frost, J. (2017a). Check Your Residual Plots to Ensure Trustworthy Regression Results! [online] Statistics By Jim. Available at: https://statisticsbyjim.com/regression/check-residual-plots-regression-analysis/ [Accessed 5 Apr. 2020].
Frost, J. (2017b). How to Interpret P-values and Coefficients in Regression Analysis - Statistics By Jim. [online] Statistics By Jim. Available at: https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/.
Frost, J. (2017c). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions - Statistics By Jim. [online] Statistics By Jim. Available at: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/.
Hyndman, R.J. and Athanasopoulos, G. (2018). Forecasting: principles and practice. Heathmont, Vic.: OTexts.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (n.d.). An introduction to statistical learning: with applications in R.
Jason Brownlee (2018). A Gentle Introduction to Normality Tests in Python. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/.
Laerd.com. (2018). How to perform a Multiple Regression Analysis in SPSS Statistics | Laerd Statistics. [online] Available at: https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php.
MasumRumi (n.d.). A Statistical Analysis & ML workflow of Titanic. [online] kaggle.com. Available at: https://www.kaggle.com/masumrumi/a-statistical-analysis-ml-workflow-of-titanic#Part-1:-Importing-Necessary-Libraries-and-datasets [Accessed 5 Apr. 2020].
Shumway, R.H. and Stoffer, D.S. (2017). Time series analysis and its applications: with R examples. Cham: Springer.
Statistics Solutions. (2017). Testing Assumptions of Linear Regression in SPSS - Statistics Solutions. [online] Available at: https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/.
Ucla.edu. (2019). Regression Analysis | SPSS Annotated Output. [online] Available at: https://stats.idre.ucla.edu/spss/output/regression-analysis/.
Yale.edu. (2019). Multiple Linear Regression. [online] Available at: http://www.stat.yale.edu/Courses/1997-98/101/linmult.htm.
APPENDIX
Fig. 17. MLR MODEL-1: Full Model Summary
Fig. 18. MLR MODEL-2: Full Model Summary
Fig. 19. MLR MODEL-3: Full Model Summary
Fig. 20. MLR MODEL-3: Correlation Matrix
Fig. 21. MLR MODEL-3: Correlation Table
Fig. 22. MLR MODEL-3: Correlation Scatter plot