After thoroughly exploring the data and its time-series visualizations, I will now try my hand at forecasting methods, predicting demand for the three states and three product categories separately. We will discuss the various approaches I tried and develop one modelling approach for forecasting product sales demand. Let's get started!

  • As mentioned earlier, this data contains 42,000 time series to forecast. I couldn't do that with my limited computing capabilities.
  • So I started from the top. A single time series for the whole dataset seemed like a reasonable place to start, but with more than 60 million rows, my computer crashed due to low memory.
  • Then I thought of combining all states within each product category. Again, my RAM ran out.
  • So instead of going one by one, I started thinking about how a company would actually look at these forecasts.
  • My hypothesis is that the forecasts are consumed at several different levels:
    • For example, people at HQ will look at all states and products, both combined and separately.
    • At a regional HQ, the concerned employees will look at a single state's data, the forecasts for all three product categories, and also each store's data so that they can distribute the necessary products.
    • At the store level, the store managers will look at item-level demand weekly, monthly, and yearly.
    • With this thought process and my limited computing ability, I decided to forecast at the level of the 3 states and 3 product categories. For each state and product combination, I collected the demand/sales for each day.
  • Hence, I decided to forecast the total daily demand/sales of all items in each product category in each state, i.e. a forecast for Texas and food products, Texas and hobbies products, and so on.
  • This gave me a 9-time-series forecasting problem, and luckily my computer managed to handle it with enough RAM to spare.

 

METHOD


Having studied various forecasting methods in my supply chain course, I explored approaches like the Naïve method, Moving Average, Exponential Smoothing, and ARIMA.

 

A brief introduction to the above-mentioned methods:

  • NAIVE APPROACH: It simply forecasts the next day’s sales as the current day’s sales.
  • MOVING AVERAGE: Here I calculate the mean sales over the previous 30 days and use it as the next day's forecast.
  • EXPONENTIAL SMOOTHING: Exponential smoothing uses a weighted average in which the weights decay the further back in time a period lies. The previous time steps are exponentially weighted and summed to generate the forecast. This method depends heavily on the trend and seasonality in the data.
  • ARIMA [Auto Regressive Integrated Moving Average]: While exponential smoothing models are based on a description of trend and seasonality in data, ARIMA models aim to describe the correlations in the time series.
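To make these baselines concrete, here is a minimal sketch of the first three methods on a synthetic daily demand series (the series, the 30-day window, and the smoothing factor are illustrative assumptions, not values from my actual runs):

    import numpy as np
    import pandas as pd

    # hypothetical daily demand series (synthetic data, purely for illustration)
    dates = pd.date_range('2016-01-01', periods=120, freq='D')
    daily_demand = pd.Series(100 + 10 * np.sin(np.arange(120) / 7) + 5 * np.random.rand(120), index=dates)

    # NAIVE: tomorrow's forecast is simply today's observed demand
    naive_forecast = daily_demand.shift(1)

    # MOVING AVERAGE: forecast is the mean of the previous 30 days
    moving_avg_forecast = daily_demand.shift(1).rolling(30).mean()

    # SIMPLE EXPONENTIAL SMOOTHING: recursive weighted average whose weights decay by (1 - alpha) per step back
    alpha = 0.3
    ses_forecast = daily_demand.ewm(alpha=alpha, adjust=False).mean().shift(1)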

After exploring those models and not being able to get good results, I started looking at other methods like XGBoost, CatBoost, LightGBM, Prophet and LSTM. I was not able to achieve better results with XGBoost or CatBoost either. LSTMs require neural-network expertise to get good predictions, and since I was not completely familiar with them, I stuck with LightGBM.

LIGHT GBM


  • A gradient boosting framework that uses tree-based learning algorithms.
  • LightGBM grows trees leaf-wise, while other gradient boosting algorithms grow level-wise; the leaf with the maximum delta loss is chosen to grow.
  • LightGBM can be sensitive to overfitting on small datasets, and while there is no hard threshold on the number of rows, it works best on large data.
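As a quick, self-contained sketch (on synthetic data, not the M5 data) of how these leaf-wise growth controls appear in the API:

    import numpy as np
    import lightgbm as lgb

    # synthetic regression data, purely for illustration
    X = np.random.rand(5000, 10)
    y = 3 * X[:, 0] + X[:, 1] ** 2 + 0.1 * np.random.rand(5000)

    # num_leaves caps the leaf-wise growth of each tree, while min_child_samples
    # (the sklearn-API alias of min_data_in_leaf) guards against overfitting
    model = lgb.LGBMRegressor(num_leaves=31, min_child_samples=50,
                              learning_rate=0.05, n_estimators=200)
    model.fit(X, y)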

DATA PREPARATION


  • First, after loading the files, I created new datasets from the provided data for each state and product combination. This gave me 9 new datasets, in line with the 9 time series to forecast.
     states = ['CA','TX','WI']
     products = ['FOODS','HOBBIES','HOUSEHOLD']
        
     # one dataframe per (product category, state) pair, in the order of the names on the left
     (df_foods_ca, df_foods_tx, df_foods_wi,
      df_hobbies_ca, df_hobbies_tx, df_hobbies_wi,
      df_hh_ca, df_hh_tx, df_hh_wi) = (
         df_train_sales_new[(df_train_sales_new['cat_id'] == p) & (df_train_sales_new['state_id'] == s)]
         for p in products for s in states
     )
    
  • After creating the 9 datasets, I performed pandas operations to mold the data so that a single dataframe contained the respective sell price, date of sale, demand of each item_id, event names and event types, and SNAP days.
  • I also created additional feature variables, like a weekend/weekday flag, month, and year, and converted the events and SNAP days to categorical features.
      #helper functions
      import pandas as pd
      from sklearn import preprocessing   # LabelEncoder
      # reduce_mem_usage is the Kaggle memory-downcasting helper described further below

      def create_data(df):
          # wide-to-long: one row per (item, day) with the unit sales in 'demand'
          df_melt = pd.melt(df, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'd', value_name = 'demand')
          df_melt = reduce_mem_usage(df_melt)

          # attach the calendar information (dates, events, SNAP days) via the day key 'd'
          df_melt = pd.merge(df_melt, df_new_cal, how = 'left', left_on = ['d'], right_on = ['d'])
          df_melt = reduce_mem_usage(df_melt)

          #d_1 to d_1913
          # attach the weekly sell prices
          data = df_melt.merge(
              df_prices, on=["store_id", "item_id", "wm_yr_wk"], how='left')
          data = reduce_mem_usage(data)

          # days without an event get the placeholder 'unknown'
          nan_features = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
          for feature in nan_features:
              data[feature].fillna('unknown', inplace = True)

          # label-encode the id and event columns
          col = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id',
              'event_name_1', 'event_name_2', 'event_type_1', 'event_type_2']
          encoder = preprocessing.LabelEncoder()
          for c in col:
              data[c] = encoder.fit_transform(data[c])

          data = reduce_mem_usage(data)

          # in the M5 calendar, wday 1 and 2 are Saturday and Sunday
          data['is_weekend'] = (data['wday'] <= 2).astype(int)

          #create date-time features columns
          date_features = {
              "wday": "weekday",
              "week": "weekofyear",
              "month": "month",
              "quarter": "quarter",
              "year": "year",
              "mday": "day",
          }

          # reuse a column if it already exists, otherwise derive it from the date
          # (note: .dt.weekofyear was removed in pandas 2.0; .dt.isocalendar().week is the modern equivalent)
          for date_feat_name, date_feat_func in date_features.items():
              if date_feat_name in data.columns:
                data[date_feat_name] = data[date_feat_name].astype("int16")
              else:
                data['date'] = pd.to_datetime(data["date"])
                data[date_feat_name] = getattr(
                    data["date"].dt, date_feat_func).astype("int16")

          data = reduce_mem_usage(data)
          return data
    
  • After preparing the necessary feature variables, I started thinking about which features would matter most for a store/company forecasting demand for products like foods and household goods. As explored in the previous posts, the trends and the weekly and monthly seasonality played an important role in deciding which features to add or explore.
  • How did I look at it? Week-on-week same-day patterns are more likely to repeat if the prior week was similar too. So it makes sense to give the model not only the same day from the previous week, but also the mean of the whole week leading up to that day, as a "hint" of how normal that week was.
  • So I took 7-day and 28-day lags of the sales/demand, and also rolling means of those lags over 7-, 14-, and 28-day windows, as implemented in create_features below.
    • lag_7: demand shifted 7 steps downwards.
    • lag_28: demand shifted 28 steps downwards.
    • rmean_7_7: rolling mean over a 7-day window of the column lag_7. The first value appears at index 13, because a mean that includes a NaN is NaN.
    • rmean_7_14: rolling mean over a 14-day window of the column lag_7. The first value appears at index 20, the first time the window holds 14 non-NaN values.
    • rmean_7_28: rolling mean over a 28-day window of the column lag_7. The first value appears at index 34, the first time the window holds 28 non-NaN values.
    • rmean_28_7: rolling mean over a 7-day window of the column lag_28. The first value appears at index 34, the first time the window holds 7 non-NaN values.
    • rmean_28_14: rolling mean over a 14-day window of the column lag_28. The first value appears at index 41, the first time the window holds 14 non-NaN values.
    • rmean_28_28: rolling mean over a 28-day window of the column lag_28. The first value appears at index 55, the first time the window holds 28 non-NaN values.
  def create_features(df):
      # lag features: the demand series shifted by 7 and 28 days, per item id
      dayLags = [7, 28]
      lagSalesCols = [f"lag_{dayLag}" for dayLag in dayLags]
      for dayLag, lagSalesCol in zip(dayLags, lagSalesCols):
          df[lagSalesCol] = df[["id", "demand"]].groupby("id")[
              "demand"].shift(dayLag)

      reduce_mem_usage(df)

      # rolling means of each lag column over 7-, 14- and 28-day windows, named rmean_{lag}_{window}
      windows = [7, 14, 28]
      for window in windows:
          for dayLag, lagSalesCol in zip(dayLags, lagSalesCols):
              df[f"rmean_{dayLag}_{window}"] = df[["id", lagSalesCol]].groupby(
                  "id")[lagSalesCol].transform(lambda x: x.rolling(window).mean())

      return df
  • The most important thing I had to keep in mind while preparing the datasets was RAM usage. I had to convert the columns to the smallest possible datatypes and avoid object dtypes in the dataframe; converting columns to int8 or float16 where possible reduced RAM usage considerably.
  • Thanks to several posts on Kaggle, this is available as a ready-made reduce_mem_usage function that is straightforward to use.
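For reference, here is a minimal sketch of such a downcasting helper, in the spirit of the versions shared on Kaggle (the originals are more thorough and also report how much memory was saved):

    import numpy as np

    def reduce_mem_usage(df):
        # downcast each numeric column to the smallest dtype that can hold its values
        for col in df.columns:
            col_type = df[col].dtype
            if np.issubdtype(col_type, np.integer):
                c_min, c_max = df[col].min(), df[col].max()
                for dtype in (np.int8, np.int16, np.int32, np.int64):
                    if c_min >= np.iinfo(dtype).min and c_max <= np.iinfo(dtype).max:
                        df[col] = df[col].astype(dtype)
                        break
            elif np.issubdtype(col_type, np.floating):
                # float16/float32 lose some precision but are fine for unit sales and prices
                for dtype in (np.float16, np.float32):
                    if df[col].abs().max() <= np.finfo(dtype).max:
                        df[col] = df[col].astype(dtype)
                        break
        return df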

 

TRAINING AND VALIDATION DATASET


  • After developing the 9 datasets with all the features and variables, it was time to prepare the training and validation sets. As I had data until 06/19/2016, I used the data up to 04/22/2016 as the training set and the following 28 days (until 05/22/2016) as the validation set. As this is an experimental forecast and not the competition submission, I wanted the model to learn as much as it could, and I wanted to predict (test data) for 100 days.
  • I defined the categorical feature columns and some columns that would not be useful for the model to learn from, and then split the dataset.
     cat_features = ['item_id', 'store_id', 'cat_id', 'dept_id', 'state_id', 
                     'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
                       
        
     useless_cols = ["id", "date", "demand", "d", "wm_yr_wk", "weekday"]
        
     train_cols = train_data.columns[~ train_data.columns.isin(useless_cols)]  # columns for training.
     label_col = ['demand']
        
     train_start_date = '2014-04-23'
     train_end_date = '2016-04-22'
     valid_start_date = '2016-04-23'
     valid_end_date = '2016-05-22'
        
        
     train_dates = (train_data['date'] > train_start_date) & (train_data['date'] <= train_end_date)
     df_train = train_data.loc[train_dates]
     x_train = df_train[train_cols]
     y_train = df_train['demand']
        
     valid_dates = (train_data['date']>= valid_start_date) & (train_data['date'] <= valid_end_date)
     df_valid = train_data.loc[valid_dates]
     x_valid = df_valid[train_cols]
     y_valid = df_valid['demand']
        
     %%time
        
     # build the LightGBM datasets (assumes `import lightgbm as lgb`)
     train_set = lgb.Dataset(x_train, y_train, categorical_feature = cat_features, free_raw_data=False)
     val_set = lgb.Dataset(x_valid, y_valid, categorical_feature = cat_features, free_raw_data=False)
        
    

 

MODELING


  • Before running the model, the most important thing to consider is the LightGBM parameters.
  • There are control parameters like the maximum tree depth, the minimum number of records a leaf may have (min_data_in_leaf), feature_fraction, bagging_fraction, early_stopping_round, and lambda regularization.
  • There are core parameters like the application (regression, classification, multiclass), boosting type (gradient boosting, random forest, gradient-based one-side sampling, multiple additive regression trees), the number of boosting iterations, and the learning rate.
  • The most important parameter is the metric. As the Kaggle competition used a weighted error metric (WRMSSE), I decided to use root mean squared error (RMSE) as a simpler stand-in.
  • Below are the parameters I selected.
     params = {
               'boosting_type': 'gbdt',
               'objective': 'poisson',
               'metric': 'rmse',
               'subsample': 0.5,
               'subsample_freq': 1,
               'learning_rate': 0.01,
               'num_leaves': 2**11-1,
               'min_data_in_leaf': 2**12-1,
               'feature_fraction': 0.5,
               'max_bin': 100,
               'n_estimators': 1400,
               'boost_from_average': False,
               'verbose': -1,
           } 
    
  • I had to test some of these parameters; for example, the poisson objective gave me better results than tweedie, and a learning rate of 0.02 did not give good results (higher RMSE). After some permutations and some great guidance from Kaggle users, I settled on the parameters above.
  • After selecting the parameters, I started modelling.
    %%time
      
    # note: in lightgbm >= 4.0 use callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)]
    # instead of the early_stopping_rounds / verbose_eval keyword arguments used here
    m_lgb = lgb.train(params, train_set, num_boost_round = 1500, early_stopping_rounds = 50, valid_sets = [val_set], verbose_eval = 50)
    
  • I used early stopping because many of the datasets finished training well below 1500 boosting rounds with no further improvement in RMSE.
  • Save the model
    m_lgb.save_model("hh_tx_model_poisson.lgb")
    
  • After training the model for each state and product, I plotted the feature importance graph and looked at the most important features for every dataset.
    lgb.plot_importance(m_lgb, figsize=(10, 12))
    
  • It is very important to look at the feature importance plot. My inference was that item_id, sell_price, and week played an important role in all the models, while the rolling lag means and lag features varied in importance across the datasets.

 

Evaluation and Prediction


  • The main metric I focused on while developing the model was RMSE.

    • XGBoost and CatBoost had somewhat higher RMSE.
    • The Naïve method and ARIMA couldn't provide better predictions, as these datasets have multiple explanatory variables (prices, events, SNAP days) that those methods cannot use directly.
    • After some fine-tuning, I chose the parameters above and got my best model.
    • Keeping in mind my computing resources and the balance between model development time and accuracy, I proceeded to generate predictions and check how they looked.
    • The root mean squared errors tell us that the models forecast the average sales/demand within roughly 1.5 to 2.5 units of the real sales (across the 9 time-series models). The lowest RMSE was for hobbies and the highest was for foods.
  • I tested the model on 05/23/2016 to 06/19/2016 and also predicted 100 days ahead, up to 08/30/2016.

    m_lgb = lgb.Booster(model_file='foods_ca_model_poisson.lgb')
      
    start_date = '2016-02-14'
    end_date = '2016-05-22'
    test_start_date = '2016-05-23'
    test_end_date = '2016-06-19'
    forecast_start_date = '2016-06-20'
    forecast_end_date = '2016-08-30'
      
      
    needed_col = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'd', 'demand', 'date', 'wm_yr_wk', 'weekday', 'wday', 'month', 'year', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2',   'snap_CA', 'snap_TX', 'snap_WI', 'week', 'quarter', 'mday',   'sell_price', 'is_weekend']
             
             
    previous_dates_forecast = (train_data['date'] >= start_date) & (train_data['date'] <= forecast_end_date)
    forecast_previous_data = train_data.loc[previous_dates_forecast]
    forecast_previous_data = forecast_previous_data[needed_col]
    forecast_previous_data
    
    from datetime import datetime, timedelta

    alpha = 1.2      # small scaling ("magic") multiplier applied to the raw predictions
    max_lags = 70    # days of history needed to build the 28-day lag and 28-day rolling-mean features
      
    df_test = forecast_previous_data.copy()
      
    fday = datetime(2016, 5, 23)   # first day to predict
      
    # recursive day-by-day forecasting: each day's prediction is written back into
    # 'demand' so that the lag/rolling features of later days can use it
    for tdelta in range(0, 100):
        day = fday + timedelta(days=tdelta)
        print('Predicting for:', day)
        test_data = df_test[(df_test.date >= day - timedelta(days=max_lags)) & (df_test.date <= day)].copy()
        create_features(test_data)   # adds the lag / rolling-mean columns in place
        test_data = test_data.loc[test_data.date == day , train_cols]
        df_test.loc[df_test.date == day, "demand"] = alpha*m_lgb.predict(test_data)
          
    df_test_sub = df_test.loc[df_test.date >= fday, ["id", "demand"]].copy()
      
      
    # pivot the long predictions into one row per id with columns F1 ... F100
    cols = [f"F{i}" for i in range(1,101)]
      
    df_test_sub["F"] = [f"F{rank}" for rank in df_test_sub.groupby("id")["id"].cumcount()+1]
    df_test_sub = df_test_sub.set_index(["id", "F" ]).unstack()["demand"][cols].reset_index()
    df_test_sub.fillna(0., inplace = True)
    df_test_sub.sort_values("id", inplace = True)
    df_test_sub.reset_index(drop=True, inplace = True)
      
    df_test_sub.to_csv('hh_tx_forecast.csv')
    
  • The model managed to predict with an accuracy of about 55%.

  • As this model focuses on predicting the total demand for foods, household, and hobbies items in each state, the total demand predictions were pretty close to the actual data over the first 28 days of test data.

  • As the model predicted further into the 100-day horizon, I found that it stopped predicting well after a certain date. Sometimes the prediction overshot, forecasting double the demand. Clearly this cannot be right, as no previous data shows such a doubling within so short a time.

  • As we forecast further out into the future, it is natural for us to become less confident in our values. This is reflected in the predicted values below.

Prediction and Visualization for the 100-Day Forecast

  • The graphs below show the predictions for each product in the three states.

    #hide_input
    from plotly.offline import download_plotlyjs,init_notebook_mode,plot
    import plotly.graph_objs as go
    from plotly.subplots import make_subplots
    '''
    df_time_series_ca['Date']=df_time_series_ca.index
    df_time_series_ca['Date'] = pd.to_datetime(df_time_series_ca['Date'])
    df_time_series_ca = df_time_series_ca.reset_index(drop=True)
      
    df_time_series_tx['Date']=df_time_series_tx.index
    df_time_series_tx['Date'] = pd.to_datetime(df_time_series_tx['Date'])
    df_time_series_tx = df_time_series_tx.reset_index(drop=True)
      
    df_time_series_wi['Date']=df_time_series_wi.index
    df_time_series_wi['Date'] = pd.to_datetime(df_time_series_wi['Date'])
    df_time_series_wi = df_time_series_wi.reset_index(drop=True)
    '''
      
      
    fig = make_subplots(rows=3, cols=1, subplot_titles=("California", "Texas", "Wisconsin"))
      
    fig.add_trace(go.Scatter(x=df_time_series_ca.index, 
                             y=df_time_series_ca.Hobbies,
                             mode = 'markers',
                             showlegend=True,
                             name='Hobbies (actual)'),
                  row=1, col=1)
      
    fig.add_trace(go.Scatter(x=df_time_series_ca_pred.Date,
                             y=df_time_series_ca_pred.Hobbies, 
                             mode = 'markers',showlegend=True,
                             name='Hobbies (forecast)'),
                  row=1, col=1)
      
    fig.add_trace(go.Scatter(x=df_time_series_tx.index, 
                             y=df_time_series_tx.Hobbies,
                             mode = 'markers',
                             showlegend=True,
                             name='Hobbies (actual)'),
                  row=2, col=1)
      
    fig.add_trace(go.Scatter(x=df_time_series_tx_pred.Date,
                             y=df_time_series_tx_pred.Hobbies, 
                             mode = 'markers',showlegend=True,
                             name='Hobbies (forecast)'),
                  row=2, col=1)
      
    fig.add_trace(go.Scatter(x=df_time_series_wi.index, 
                             y=df_time_series_wi.Hobbies,
                             mode = 'markers',
                             showlegend=True,
                             name='Hobbies (actual)'),
                  row=3, col=1)
      
    fig.add_trace(go.Scatter(x=df_time_series_wi_pred.Date,
                             y=df_time_series_wi_pred.Hobbies, 
                             mode = 'markers',showlegend=True,
                             name='Hobbies (forecast)'),
                  row=3, col=1)
      
      
    fig.update_layout(height=1000, title_text="Hobbies Product Sales in each state")
    fig.update_xaxes(
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1m", step="month", stepmode="backward"),
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all")
            ])
        )
    )
      
    fig.write_html("/content/drive/My Drive/projects/m5-forecasting/Hobbies_products_each_state.html")
    fig.show()
      
    

 

FOODS PREDICTION

 

Food predictions are way off. The predictions for the first 30 days are good, but after that they diverge from the normal pattern, especially for Texas and California.

 

HOBBIES PREDICTION

 

Similarly, the hobbies product predictions follow the normal trend for the first 25-35 days, but then the trend increases drastically.

 

HOUSEHOLD PREDICTION

 

The same goes for the household product predictions in the three states.

FUTURE DEVELOPMENT:

  • Modelling with neural networks [LSTM] to see if better accuracy can be achieved.
  • Trying a different pre-processing approach and tuning more hyperparameters to see if performance changes.
  • Examining other time series and combining various models into a single forecast.
  • Checking for combinations among the roughly 30,000 individual time series.
  • Looking into more trends and seasonality across the products, and preparing combinations that follow similar trends and seasonality to train the models on.
  • Modelling with other methods like Prophet, forecasting with uncertainty bounds, and forecasting with external data sources, such as data for Texas, where Hurricane Harvey affected sales.

And Finally

I want to thank the many practitioners and users on the Kaggle discussion forums for sharing their thoughts and directions, which helped me understand the nitty-gritty of time-series prediction and forecast modelling.