The authors have declared that no competing interests exist.
The COVID-19 pandemic has had a profound impact on global health and economies. As the pandemic continues to spread and evolve, accurate forecasting of its spread is essential for the effective management of healthcare systems and the development of sound policies. In this paper, we first summarize the COVID-19 pandemic in the United States state by state. We then use temporal data on the spread of the coronavirus from January 18, 2020 to January 29, 2023. Finally, we model the evolution of the COVID-19 outbreak and forecast it for selected states using ARIMA and other time series forecasting models.
In the United States, annual community outbreaks of coronavirus infections typically occur during late fall and winter. There may be variation in the timing of outbreaks between regions and between communities in the same region.
In the early months of the pandemic, cases and deaths were heavily concentrated in the metropolitan areas of New York, New Orleans, Boston, and Detroit, along with other major cities. Overall, urban areas with more ethnically and racially diverse populations were initially affected more than less diverse areas.
There are various approaches to forecasting the spread of COVID-19, including statistical models, machine learning models, and hybrid models. Statistical models are built on mathematical and statistical theory and include regression analysis, time series analysis, and ARIMA models. Machine learning models, such as neural networks and support vector machines, learn patterns in the data and use algorithms to make predictions. Hybrid models combine elements of statistical and machine learning models and can be more accurate than either approach alone.
In recent years, there has been growing interest in using deep learning techniques for COVID-19 forecasting. Deep learning is a type of machine learning that uses artificial neural networks with multiple layers to model complex patterns in data. One example of a deep learning model is the Long Short-Term Memory (LSTM) network, which has been used to make predictions about the spread of COVID-19.
Another recent development is the use of ensemble models, which combine the predictions of multiple models to make a final prediction. Ensemble models can be more accurate than individual models because they are able to capture the strengths of different models and account for uncertainty in the data.
Over the last 35 years there has been considerable information accumulated about forecasting techniques and how these methods are applied in a wide variety of settings. Conflicting results are very common when performing advanced forecasting competitions between different methods. As forecasting tasks can vary by many dimensions, it is unlikely that one method will be better than all others for all forecasting scenarios. What we require from a forecasting method are consistently sensible forecasts, and these should be frequently evaluated against the task at hand.
Despite the progress made in the development of COVID-19 forecasting models, there are still many challenges to overcome. One major challenge is the limited availability of high-quality data, which can lead to inaccurate predictions. In addition, the rapid evolution of the pandemic makes it difficult to develop models that remain both accurate and relevant over time.
Another challenge is the difficulty of accounting for external factors, such as government policies and social behavior, that can have a significant impact on the spread of COVID-19. For example, a sudden increase in testing can result in a spike in the number of confirmed cases, which can make it difficult to accurately predict the future spread of the disease.
The COVID-19 pandemic has had a profound impact on the United States, with different states experiencing different levels of impact. The impact of the pandemic on each state has been influenced by factors such as population density, demographics, lockdown policy, and local response measures.
States in the Northeast region of the U.S., such as New York, New Jersey, and Massachusetts, were among the hardest hit by the pandemic in the early stages of the outbreak. These states have large urban areas with high population densities, which facilitated the spread of the virus. As a result, these states have some of the highest numbers of confirmed cases and deaths in the country.
States in the Midwest region, such as Illinois, Michigan, and Ohio, have also been significantly impacted by the pandemic, although experiences within these states varied widely. For example, some rural communities were relatively spared, while larger urban areas saw a more significant impact.
States in the South region, such as Florida, Texas, and Georgia, have also been impacted by the pandemic, although the impact has varied widely across the region. Some states in the South, such as Florida and Texas, have experienced a large number of confirmed cases and deaths, while others, such as Georgia, have experienced a more moderate impact.
The situation with the COVID-19 pandemic in the western region of the United States varied by state but was serious overall. Some of the worst-affected states in the West were California, Oregon, Arizona, and Washington. Many states implemented measures to slow the spread of the virus. By 2021, vaccines were being distributed in the US, which offered hope for controlling the spread of the virus.
The data for the ongoing COVID-19 outbreak in the United States were collected from the Centers for Disease Control and Prevention. The columns of this dataset include the total number of weekly cases, weekly deaths, and weekly test volume for all states, recorded on a weekly basis from January 29, 2020 to January 18, 2023. The total cases per 100,000 population allow for comparisons between areas with different population sizes.
In this paper, we collected weekly cases and weekly deaths for five states in the United States: New York, California, Texas, New Jersey, and Florida. We then compared the states and carried out a simulation analysis to identify better forecasting models.
We use several criteria to evaluate the performance of the forecasting models. Forecasting accuracy can be measured with a number of metrics, including the mean absolute error (MAE), root mean squared error (RMSE), mean absolute percentage error (MAPE), mean percentage error (MPE), and mean absolute scaled error (MASE). Comparisons between forecasting models are based on these criteria, where P_{t} is the predicted value at time t, Z_{t} is the observed value at time t, and N is the number of predictions.
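As an illustration of how these criteria can be computed, the following is a minimal sketch using the standard textbook definitions, not necessarily the exact formulas used in this study. It assumes the error is defined as Z_{t} − P_{t} and that MASE is scaled by the mean absolute error of a naive forecast on a training series; the function names and the `training` and `season` arguments are hypothetical. It also shows one way infinite percentage errors can arise: whenever an observed value Z_{t} is zero.

```python
import math


def _percent(error, observed):
    """Return 100 * error / observed; +/-inf (or nan) when observed is 0."""
    if observed != 0:
        return 100.0 * error / observed
    return math.nan if error == 0 else math.copysign(math.inf, error)


def forecast_errors(observed, predicted, training, season=1):
    """Compute MAE, RMSE, MPE, MAPE and MASE.

    observed  -- Z_t, observed values over the forecast horizon
    predicted -- P_t, predicted values over the same horizon
    training  -- in-sample series used to scale MASE (assumption)
    season    -- lag of the naive benchmark forecast (assumption)
    """
    n = len(observed)
    errors = [z - p for z, p in zip(observed, predicted)]
    pct = [_percent(e, z) for e, z in zip(errors, observed)]
    # Mean absolute error of the naive forecast Z_t = Z_{t-season}.
    naive = [abs(training[t] - training[t - season])
             for t in range(season, len(training))]
    scale = sum(naive) / len(naive)
    mae = sum(abs(e) for e in errors) / n
    return {
        "MAE": mae,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MPE": sum(pct) / n,
        "MAPE": sum(abs(p) for p in pct) / n,
        "MASE": mae / scale,
    }


# Hypothetical numbers, for illustration only.
result = forecast_errors(observed=[10, 20, 40],
                         predicted=[12, 18, 40],
                         training=[1, 3, 6, 10])
print(result)
```

Because MPE and MAPE divide by the observed value, weeks with zero observed counts make them infinite, whereas the scale-based measures (MAE, RMSE, MASE) remain finite.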
However, it is important to note that forecasting accuracy can be influenced by a range of factors, including the quality and quantity of available data, the choice of forecasting method, and the inherent unpredictability of some phenomena. As a result, it is often difficult to achieve high levels of forecasting accuracy in practice, and forecasters must continually refine their methods and adjust their expectations based on new data and insights.
Many time series models assume stationarity, but many applied time series, particularly those arising in economics and business, are non-stationary. With respect to the class of covariance stationary processes, non-stationary time series can occur in many different ways. They could have non-constant means µ_{t}, time-varying second moments such as a non-constant variance σ_{t}^{2}, or both of these properties. In this section, we explain the construction of a very useful class of homogeneous non-stationary time series models, the autoregressive integrated moving average (ARIMA) models. Differencing and variance-stabilizing transformations are used to connect the stationary and non-stationary time series models.
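To make the role of these transformations concrete, here is a minimal sketch on a hypothetical series (not the study data). It assumes a logarithm as the variance-stabilizing transformation and first differencing, (1 − L), as the differencing step; the helper name `difference` is illustrative.

```python
import math


def difference(series, lag=1):
    """Apply (1 - L^lag): return series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical counts growing by 50% per week: both the mean and the
# spread increase over time, so the raw series is non-stationary.
counts = [100.0 * 1.5 ** t for t in range(6)]

# The log transform turns multiplicative growth into a linear trend.
log_counts = [math.log(c) for c in counts]

# First differencing then removes the linear trend, leaving a constant.
growth = difference(log_counts)
print(growth)
```

For weekly data with an annual pattern, the same helper with `lag=52` would correspond to one seasonal difference.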
Many models used in practice are of the simple ARIMA type, which has a long history and was formalized in Box and Jenkins. ARIMA stands for Autoregressive Integrated Moving Average, and an ARIMA(p, d, q) model for a series y_{t} can be written as
ϕ(L)(1 − L)^{d} y_{t} = η(L) u_{t},
where L is the lag operator, ϕ(L) and η(L) are the autoregressive and moving average polynomials of orders p and q, and u_{t} is white noise, usually Normally distributed as u_{t} ~ N(0, σ^{2}). The stationarity and invertibility conditions are simply that the roots of ϕ(L) and η(L), respectively, lie outside the unit circle.
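The root condition can be checked directly for low orders. The sketch below assumes an AR(2) polynomial written as ϕ(L) = 1 − ϕ_{1}L − ϕ_{2}L^{2} with hypothetical coefficients; the function names are illustrative and the quadratic formula is used in place of a general root finder.

```python
import cmath


def ar2_roots(phi1, phi2):
    """Roots of 1 - phi1*z - phi2*z**2 = 0 (phi2 must be non-zero)."""
    # Rewrite as phi2*z**2 + phi1*z - 1 = 0 and apply the quadratic formula.
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return ((-phi1 + disc) / (2 * phi2), (-phi1 - disc) / (2 * phi2))


def is_stationary(phi1, phi2):
    """True when both roots lie strictly outside the unit circle."""
    return all(abs(z) > 1 for z in ar2_roots(phi1, phi2))


print(is_stationary(0.5, 0.3))   # roots outside the unit circle
print(is_stationary(1.5, -0.5))  # one root exactly on the unit circle
```

The second case factors as (1 − L)(1 − 0.5L), so it contains a unit root and would call for differencing (d = 1) before fitting a stationary model.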
Since we also take the seasonal pattern into account, even if it is weak, we should also examine the seasonal ARIMA process. This model is built by adding seasonal terms to the non-seasonal ARIMA model described above. One shorthand notation for the model is ARIMA(p, d, q)(P, D, Q)_{m}, where
(p, d, q): non-seasonal part;
(P, D, Q)_{m}: seasonal part;
P = seasonal AR order, D = seasonal differencing, Q = seasonal MA order;
m: the number of observations before the next year starts, i.e. the seasonal period.
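The seasonal part consists of terms similar to the non-seasonal components, but with backshifts of the seasonal period. For instance, we take the ARIMA(p, d, q)(P, D, Q)_{m} model for weekly data (m = 52). Without differencing operations, this process for a series y_{t} can be formally written as:
Φ(L^{52}) ϕ(L) y_{t} = Θ(L^{52}) η(L) u_{t},
where Φ and Θ denote the seasonal AR and MA polynomials of orders P and Q.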
A seasonal ARIMA model incorporates both nonseasonal and seasonal factors in a multiplicative fashion.
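The multiplicative structure can be seen by expanding the product of a non-seasonal and a seasonal lag polynomial. The following sketch uses hypothetical AR coefficients (0.5 and 0.3) and weekly data with m = 52; the helper names are illustrative and are not taken from any forecasting library.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists [c0, c1, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def seasonal_poly(coef, m):
    """Coefficient list of the seasonal polynomial 1 - coef * L^m."""
    return [1.0] + [0.0] * (m - 1) + [-coef]


m = 52                                 # weekly data, annual seasonality
nonseasonal_ar = [1.0, -0.5]           # 1 - 0.5 L        (hypothetical)
seasonal_ar = seasonal_poly(0.3, m)    # 1 - 0.3 L^52     (hypothetical)

combined = poly_mul(nonseasonal_ar, seasonal_ar)
nonzero = {lag: c for lag, c in enumerate(combined) if c != 0.0}
print(nonzero)
```

The product has terms at lags 0, 1, 52 and 53. The cross term at lag 53 is what distinguishes the multiplicative form from simply adding a lag-52 term to a non-seasonal model.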
Dynamic Harmonic Regression (DHR) is a statistical modeling technique used for time series data analysis. It is a type of regression model that accounts for the seasonality and non-stationarity of the data. The model combines the strengths of regression analysis and time series decomposition, making it a useful tool for predicting future values based on historical trends.
First, we considered regression models of the form
y_{t} = T_{t} + C_{t} + S_{t} + ϵ_{t}.
The system is composed of four components: trend (T), a sustained cyclical component (C) with a period different from that of the seasonality, seasonal (S), and white noise (ϵ_{t}).
The measured values of y are the output (observation) series of a system of stochastic state space equations, which can then be decomposed to allow estimation of the four components.
For such time series, we prefer a harmonic regression approach, in which the seasonal pattern is modelled using Fourier terms and short-term time series dynamics are handled by an ARIMA error. The DHR model consists of two parts: a regression component and a harmonic component. The regression component models the underlying linear relationship between the independent variables and the dependent variable, while the harmonic component models the seasonal patterns in the data. The harmonic component uses trigonometric functions, namely sine and cosine, to capture these patterns:
y_{t} = a + Σ_{j=1}^{K} [α_{j} sin(2πjt/m) + β_{j} cos(2πjt/m)] + ϵ_{t},
where m is the seasonal period, a is an intercept, K is the number of sine and cosine pairs, α_{j} and β_{j} are regression coefficients, and ϵ_{t} is modeled as a non-seasonal ARIMA process.
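The Fourier regressors that multiply α_{j} and β_{j} can be generated as in the following minimal sketch. It assumes K pairs of sine and cosine terms and a time index starting at t = 1; the function name `fourier_terms` and the choice K = 2 are illustrative, and the ARIMA error model itself is not fitted here.

```python
import math


def fourier_terms(n_obs, m, K):
    """Return n_obs rows of 2K Fourier regressors for seasonal period m.

    Row t (t = 1..n_obs) holds sin(2*pi*j*t/m), cos(2*pi*j*t/m)
    for j = 1..K, in that order.
    """
    rows = []
    for t in range(1, n_obs + 1):
        row = []
        for j in range(1, K + 1):
            angle = 2.0 * math.pi * j * t / m
            row.extend([math.sin(angle), math.cos(angle)])
        rows.append(row)
    return rows


# Two years of weekly data (m = 52) with K = 2 harmonics (assumed values).
X = fourier_terms(n_obs=104, m=52, K=2)
print(len(X), len(X[0]))
```

A matrix of this kind would typically be passed as external regressors to an ARIMA fitting routine, with K chosen by comparing an information criterion or the forecast error measures across candidate values.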
One of the advantages of the DHR model is that it can handle non-stationary time series data, which are common in many real-world applications. The model can account for changes in the mean and variance of the data over time, making it a useful tool for analyzing data with trends and seasonality.
Another advantage of the DHR model is its ability to handle multi-seasonal patterns in the data. For example, the model can handle monthly, quarterly, and yearly patterns in the data. This makes the DHR model a useful tool for analyzing complex time series data with multiple seasonal patterns.
We selected the best model by minimizing the forecasting error criteria. A variety of forecasting methods often apply to any particular risk scenario. Researchers and governments use multiple forecasting methods that can perform well at different phases of a pandemic, choosing among them so as to best exploit the available historical data and degree of market knowledge. The key is to pick the most effective and flexible forecasting models, blend their best features, and shift between them as needed to keep forecast accuracy at its peak. This research paper delves into the details of ten forecasting methods, including why, when, and how they should be used to realize the greatest overall improvements in forecast accuracy.
Since the start of the pandemic, 1,106,824 people in the U.S. have died from COVID-19. In the last week, Florida reported the highest number of new deaths with 444, followed by California with 273.
As of February 4, 2023, the Centers for Disease Control and Prevention (CDC) reports 102,447,438 cases of COVID-19 in the United States.
California has over 11 million cases, followed by Texas with over 8 million, and Florida with over 7 million.


State    Model                         RMSE      MAE       MPE         MAAPE    MASE
CA       DHR                           0.286885  0.20837   −Inf        Inf      0.22897
CA       ARIMA(2, 1, 3)(0, 1, 1)_{52}  0.29194   0.179239  −0.0357796  3.05504  0.19697
Texas    DHR                           0.25773   0.19498   0.17058     3.6576   0.1609
Texas    ARIMA(1, 1, 2)(0, 1, 1)_{52}  0.28103   0.169164  −0.29621    3.1259   0.13962
NYC      -                             0.49885   0.330884  −Inf        Inf      0.326175
NYC      -                             0.58931   0.31319   −0.81929    6.97477  0.30874
NJ       DHR                           0.4103    0.29194   −Inf        Inf      0.25853
NJ       ARIMA(1, 1, 1)(0, 1, 1)_{52}  0.4723    0.27888   0.436753    6.3331   0.24697
Florida  DHR                           0.36644   0.26314   0.49788     4.9343   0.29491
Florida  ARIMA(0, 1, 1)(0, 1, 1)_{52}  0.40849   0.23651   0.99494     4.1269   0.2651
At a per capita level, the daily average of new cases in the last week was highest in New Jersey and Alabama.
For most of the state data, DHR performed better than ARIMA in terms of RMSE, while ARIMA attained lower MAE and MASE values. New York and California contain some of the world's major metropolitan areas. The trend analysis shows rapid growth in deaths, and the prediction study shows a stable rise in weekly cases.




[Table residue: forecast accuracy measures (RMSE, MAE, MAAPE, MASE) for DHR. Values as extracted, in order: 0.2695, 0.26953, −0.12463, 0.23193, −0.0634, 0.2471, 0.27682.]










New York and California were two of the states hardest hit by the COVID-19 pandemic in the United States. In early 2020, New York became the epicenter of the outbreak in the US, with a significant number of cases and deaths. The state implemented strict measures to try to control the spread of the virus, including lockdowns and mandatory mask-wearing, which helped to bring the situation under control. As of early 2023, the state has administered millions of vaccine doses, and the number of new cases and deaths has declined significantly.
Similarly, California experienced a surge in cases and deaths in the early stages of the pandemic, prompting the state to implement a series of measures to control the spread, including stay-at-home orders and mask mandates. As of early 2023, California has also administered millions of vaccine doses, and the number of new cases and deaths has declined significantly. However, like many other parts of the world, both states have had to navigate the ongoing challenges posed by the COVID-19 pandemic, including new variants of the virus and the need for continued vigilance and public health measures.
better able to account for the dynamic and evolving nature of the pandemic
The objective of providing these statistical techniques is to enable governments and the public to make informed decisions regarding the COVID-19 pandemic. Ultimately, a summary of the various existing forecasting models can inform the development of an appropriate forecasting model that describes the inherent features of the series.
The author would like to thank Dr. Olusegun Michael Otunuga of the College of Science and Math and Dr. Hinton Romana of the writing center at Augusta University for their comments and constructive suggestions. Several stimulating discussions and comments allowed me to develop original ideas and improve my paper.