ARIMA Forecasting Vocabulary
Stationarity, Autocorrelation, Differencing
Accurately forecasting costs, sales, user growth, patient readmission, etc. is an important step in providing directors with actionable information. This can be difficult to model by hand or in Excel. In addition, traditional methods like moving averages might not provide enough insight into the various trends and seasonality that occur in real-life data sets.
Models like ARIMA and ETS give analysts the ability to predict more accurately and robustly by considering multiple factors like seasonality and trend. Even better, languages like R and Python make it much easier for analysts and data teams to avoid all the work they would usually have to do by hand. This can reduce the time to develop a model by more than half and increase accuracy. However, prior to using the ARIMA model in any programming language, it is very important that data scientists and analysts develop a good understanding of the statistical concepts that allow the ARIMA model to work.
Concepts like stationarity, autocorrelation and differencing are just a few of the key vocabulary terms that data teams need to understand in order to develop better models. Here are some of those definitions.
Stationarity
An important concept when using the ARIMA model and many other time series models is stationarity.
A stationary time series is one whose mean, variance and covariance stay consistent over time. Put simply, this means the time series is somewhat predictable.
However, there is one major problem. The Statistics How To blog uses a quote from Thomson (1994) that nicely sums it up:
“Experience with real-world data, however, soon convinces one that both stationarity and Gaussianity are fairy tales invented for the amusement of undergraduates.”
Most time series are “non-stationary”. This means that over time the series changes in its mean, variance and/or covariance. Non-stationary time series are very difficult to predict because they often have other components, like white noise and stochastic trends, influencing their output. Some examples of processes (time series) that are non-stationary are random walks, random walks with drift and deterministic trends.
A random walk refers to a process or time series in which each value is equal to the last period's value plus some form of stochastic (white noise) component. This component is inconsistent and non-systematic.
Adding drift to a random walk refers to adding a constant component, depicted as “α”. Visually this can cause the appearance of a positive or negative trend. Stocks are sometimes used as an example because their price starts at the previous day's last price and then moves from that position.
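The random walk described above, with and without drift, can be sketched in a few lines of Python. This is only an illustration: the function name random_walk, the seed, and the drift value of 0.5 are assumptions made for the example, and the white noise is assumed to be standard normal.

```python
import random
from itertools import accumulate


def random_walk(n_steps, drift=0.0, start=0.0, seed=0):
    """Simulate y_t = drift + y_(t-1) + e_t, where e_t is white noise."""
    rng = random.Random(seed)
    # Each step is the constant drift plus a random (white noise) shock.
    steps = [drift + rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    # The value at each period is the start plus the running total of steps,
    # which is the same as "last period's value plus the new step".
    return [start + total for total in accumulate(steps)]


walk = random_walk(500)                        # alpha = 0, no drift
walk_with_drift = random_walk(500, drift=0.5)  # alpha = 0.5
```

Because both series use the same seed, they share the same random shocks. The only difference between them is the drift accumulating period after period, which is what produces the appearance of a trend.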
Although a random walk with drift and a deterministic trend can look very similar, there is a distinction. A random walk is regressed on the last period's value, whereas a deterministic trend is based on time. Typically the growth is constant over time, with some form of white noise component added. A deterministic trend is closely tied to trend stationarity, which occurs when the trend component can be pulled out of a time series and the component left behind is stationary.
Trends can cause a problem in basic forecasting because they will often cause the model to underpredict. For instance, if the method being used is the moving average method, then the average will often underestimate the next value, even when using the seasonal variation of the moving average, because of the constant increase.
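A small worked example shows the underprediction. The sales figures below are hypothetical (a series that grows by exactly 10 units each period), and moving_average_forecast is a name chosen for this sketch rather than a library function.

```python
def moving_average_forecast(values, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = values[-window:]
    return sum(recent) / len(recent)


# Hypothetical sales that grow by 10 units every period: 100, 110, ..., 210.
sales = [100 + 10 * t for t in range(12)]

forecast = moving_average_forecast(sales, window=4)  # mean of 180, 190, 200, 210
actual_next = 100 + 10 * 12                          # the trend continues to 220
shortfall = actual_next - forecast
```

Here the forecast is 195 while the next value on the trend is 220. The average of past values always lags behind a series that keeps increasing, so the forecast falls short by 25.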
This is where the ARIMA model's components come in.
Autocorrelation
Autocorrelation in time series forecasting refers to the correlation between an observation and another observation in the same time series. These earlier observations are called lags, and autocorrelation can occur between the current observation and the previous lag or even lags several months and/or years prior.
To give an example, imagine that one year fishermen drastically overfished the salmon population during fishing season. More than likely, the next year's salmon season would be influenced by the current year. The total count of salmon caught would probably be much lower because of the overfishing. This would be an example of two lags that are a year apart but autocorrelated, because one influences the output of the other.
This is an important concept in ARIMA modeling because it influences how many previous observations are considered in the final ARIMA(0,0,0) model. This starts to get into the math side, as it determines how many previous lags should be considered and also what coefficient each of those previous lags will be multiplied by.
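To make the idea of correlation between lags concrete, here is a minimal sketch of the usual sample autocorrelation calculation in plain Python. The function name autocorrelation and the quarterly catch figures are assumptions for illustration only, not real data.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at a positive `lag`."""
    n = len(values)
    mean = sum(values) / n
    # Total variation of the series around its mean.
    denominator = sum((v - mean) ** 2 for v in values)
    # Co-variation between each observation and the one `lag` periods earlier.
    numerator = sum(
        (values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


# Hypothetical quarterly salmon catch with the same pattern repeating each year.
catch = [20, 35, 50, 35] * 6

same_quarter_last_year = autocorrelation(catch, lag=4)  # strongly positive
two_quarters_ago = autocorrelation(catch, lag=2)        # strongly negative
```

In this made-up series, a quarter looks very much like the same quarter one year earlier and very unlike the quarter six months earlier. Patterns like this are what guide the choice of how many previous lags a model should consider.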
Stochastic
Stochastic is a term that can be very confusing if you are accustomed to dealing with the cleanliness of algebra. Typically, if you put the same set of parameters into a process or function you get the same output.
For instance, if you have x = 2 and the equation x + 2 = y, then you know the output will always be 4.
With a stochastic process, the parameters inserted into the system could be the same, yet the output is somewhat “random”. The outputs of a stochastic process will often follow some distribution, such as the normal distribution, but each individual output is nonetheless random. It becomes difficult to accurately predict future values when stochastic variables are involved. Often, this variable shows up as an error term in the final ARIMA equation.
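The contrast can be sketched in Python using the same x + 2 example. The function names and the choice of a standard normal random component are assumptions made for this illustration.

```python
import random


def deterministic(x):
    """Same input, same output, every time."""
    return x + 2


def stochastic(x, rng):
    """Same input, but a normally distributed random shock is added."""
    return x + 2 + rng.gauss(0.0, 1.0)


rng = random.Random(42)  # seeded only so the example is repeatable
deterministic_outputs = [deterministic(2) for _ in range(5)]  # 4 every time
stochastic_outputs = [stochastic(2, rng) for _ in range(5)]   # scattered around 4
```

The stochastic outputs cluster around 4, but no two calls return the same number even though the input never changes.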
Differencing
When working with data that is non-stationary, one way to attempt to create a stationary data set is to use differencing. Differencing can help stabilize the mean and remove stochastic trends. It is very similar to taking the derivative. Now, instead of focusing on the actual output, the model focuses on the change in the process.
Differencing involves taking the current value and the previous value and finding the difference. Thus, instead of working with the final dollar amount or count, you are now working with the delta. This can eliminate some of the non-constant factors, such as a changing mean. This process of differencing can be done multiple times (of course with limitations) to help make the data stationary. This is symbolized in the ARIMA(0,0,0) notation by a 1 in place of the second 0, so the end result will look like ARIMA(0,1,0). This means the data set was differenced once. If it were differenced twice, the model would be written ARIMA(0,2,0).
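Differencing once and twice can be sketched as follows. The helper name difference and the revenue figures are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
def difference(values, order=1):
    """Apply first differencing `order` times: each pass keeps y_t - y_(t-1)."""
    result = list(values)
    for _ in range(order):
        # Pair each value with the one before it and keep only the change.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Hypothetical revenue whose growth speeds up every period.
revenue = [100, 104, 110, 118, 128, 140]

first_difference = difference(revenue)            # [4, 6, 8, 10, 12]
second_difference = difference(revenue, order=2)  # [2, 2, 2, 2]
```

The first difference still trends upward, while the second difference is constant. In ARIMA notation these correspond to the middle number being 1 and 2 respectively. Note that each round of differencing shortens the series by one observation.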
Differencing is only one of the possible transformations that could be used to help transition the data set into a stationary data set. It is the simplest to implement.
Before getting started with R and the ARIMA model, it is important to understand the statistical concepts that are utilized by the tools. This will help when developing models because you will have a much easier time tweaking the model parameters and data sets when you get the output. In addition, it gives analysts and data scientists the ability to better explain the output to their directors, as well as explain any variances that might occur. Once a team has developed a solid ARIMA model, it is much easier to move into driver-based models because analysts can start to focus on the random noise that is often caused by outside factors like new products, overtime, new employees, etc.