# Import packages
import os
import sys
import fbprophet
import datetime
import holidays
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.dates as mdates
import matplotlib.pyplot as pyplot
from datetime import date
from statistics import mode
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
from matplotlib import pyplot as plt
from statsmodels.tsa import seasonal
from statsmodels.tsa import stattools
from search_sampler import SearchSampler
from pandas.plotting import autocorrelation_plot
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')
Brand: Netflix
Freebase ID: /m/017rf_
Source: https://www.wikidata.org
In section 1, I use `statsmodels.tsa` to decompose the time series.
# Read in Netflix.googletrends.csv as a DataFrame
Netflix = pd.read_csv('Netflix.googletrends.csv',parse_dates=['period'])
# Convert column of 'period' into DateTime format.
Netflix['period'] = Netflix['period'].dt.date
# Set the column of 'period' as the index of DataFrame.
Netflix.set_index('period',inplace=True)
# Have a glance
Netflix.head(5)
'''
Set up date locators and a date formatter from the matplotlib.dates library
(YearLocator(), MonthLocator(), DateFormatter()) to control the time axis of the plots.
'''
years = mdates.YearLocator() # Locate years by matplotlib.dates
months = mdates.MonthLocator() # Locate months by matplotlib.dates
yearmonth = mdates.DateFormatter('%m,%y') # Format dates as mm,yy
# Plot of the general trend in search interest
fig, ax = pyplot.subplots(figsize=(40,8))
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(yearmonth)
ax.title.set_text('General Trends')
ax.plot(Netflix.index,Netflix['value'])
plt.show()
'''
Decomposing the time series by using 'seasonal.seasonal_decompose' from 'statsmodels.tsa'
library.
'''
decompose_model = seasonal.seasonal_decompose(Netflix['value'].tolist(), freq=365, model='additive')
dir(decompose_model)
'''
A standard time series equation can be expressed as $x_t = f_t + s_t + c_t + e_t$, which
is a sum of the trend, seasonal, cyclical, and irregular components in that order.
Plot of 'API Value', 'Trend', 'Seasonality', 'Residual' versus time.
'''
fig, (ax1, ax2, ax3, ax4) = pyplot.subplots(4, figsize=(15,15))
ax1.plot(Netflix.index, Netflix['value'])
ax2.plot(Netflix.index, decompose_model.trend)
ax3.plot(Netflix.index, decompose_model.seasonal)
ax4.plot(Netflix.index, decompose_model.resid)
ax1.title.set_text('API Value')
ax2.title.set_text('Trend')
ax3.title.set_text('Seasonality')
ax4.title.set_text('Residual Plot')
plt.show()
What is the overall trend? Is the search interest for your brand steadily growing?
According to the API Value and Trend plots above, the overall trend is downward from the beginning of 2014 through the end of 2017. Netflix search interest peaks in mid-2014 and mid-2015 and reaches its trough at the beginning of 2017. In general, we observe a downward tendency in the trend.
For a closer look at seasonality, we can narrow the window to a single year so that the long-run trend does not dominate the picture.
What does the seasonality look like for this brand?
The seasonality plot shows that Netflix reaches its maximum search interest at the beginning and end of each year. There is also a smaller peak around the summer. The seasonality is driven by consumers: search interest in Netflix depends on how much spare time consumers have.
# Add a new column of 'seasonal' component from decompose_model into the dataframe.
Netflix['seasonal'] = decompose_model.seasonal
# Have a glance
Netflix.head(5)
# Start analyzing the seasonality trend in relatively short run, one-year range.
startdate = pd.to_datetime("2015-01-01").date()
enddate = pd.to_datetime("2015-12-31").date()
oneyear = Netflix.loc[startdate:enddate]
# Plot of Seasonality in 2015
fig, ax = pyplot.subplots(figsize=(15,8))
pyplot.plot(oneyear.index, oneyear.seasonal)
ax.title.set_text('Seasonality in One-year Range')
plt.show()
The plot above shows the seasonality over a one-year range. With this clearer view, we can see that Google search interest peaks at the beginning and end of the year. Search interest declines from January to May, warms up a little during the summer, is unstable from September to November, and starts climbing again in December.
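As a quick numeric cross-check on that reading, the sketch below ranks the months of the 2015 window by their average seasonal component; the monthly grouping is my own addition, not part of the original analysis.
# Sketch: rank months in 2015 by their average seasonal component (illustrative cross-check)
monthly_seasonal = oneyear['seasonal'].groupby(pd.to_datetime(oneyear.index).month).mean()
print(monthly_seasonal.sort_values(ascending=False))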
In section 2, I build forecasting models with `fbprophet`: first a baseline model, then a model that includes holiday effects.
Netflix = pd.read_csv('Netflix.googletrends.csv',parse_dates=['period'])
prophetdataframe = pd.DataFrame() # Create a new DataFrame in the format Prophet expects
prophetdataframe['ds'] = Netflix['period'] # Datestamp
prophetdataframe['y'] = Netflix['value'] # Value
# Have a glance
prophetdataframe.head(5)
# Fit the model by fbprophet.
m = Prophet()
m.fit(prophetdataframe)
# Create a new dataframe for next one year with dates
future = m.make_future_dataframe(periods = 365)
# Have a glance
future.tail()
'''
New DataFrame generated by the fbprophet prediction, including columns for trend, yhat,
additive terms, weekly, yearly, multiplicative terms (empty), and their uncertainty ranges.
'''
forecast = m.predict(future)
forecast.tail()
# Model Diagnostics
erroranalytics = m.predict(prophetdataframe)
erroranalytics['value'] = prophetdataframe['y'] # attach the actual values
erroranalytics['residuals'] = erroranalytics['value'] - erroranalytics['yhat'] # actual minus predicted
erroranalytics['absoluteresiduals'] = abs(erroranalytics['residuals']) # absolute error
# Have a glance
erroranalytics.head()
# Visualization
fig1 = m.plot(forecast)
# component visualizations
fig2 = m.plot_components(forecast)
# sum of absolute residuals
totalerror = erroranalytics['absoluteresiduals'].sum()
print(totalerror)
# MAE
print(totalerror/len(erroranalytics))
# Meaning: on average, the model is off by about 1,300 units of Google search interest per day.
erroranalytics['value'].describe()
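To put that error in context, a rough sketch: express the MAE as a share of a typical search-interest value. Using the median from `describe()` above as the reference point is my own choice, not part of the original analysis.
# Sketch: MAE as a percentage of the median observed value (illustrative only)
mae = totalerror / len(erroranalytics)
print('MAE as a share of the median value: %.1f%%' % (100 * mae / erroranalytics['value'].median()))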
goodholidays = [] # new list for holiday names
for date, name in sorted(holidays.US(years=2014).items()): # take 2014 as an example
    goodholidays.append(name)
    print(date, name)
# new DataFrame
sales = pd.DataFrame(columns=['holiday','ds','lower_window','upper_window'])
for year in range(2014,2021): # major holidays from 2014 to 2020
    for date, name in sorted(holidays.US(years=year).items()):
        lower_window = 0 # reset the 'lower_window'
        upper_window = 0 # reset the 'upper_window'
        if name in goodholidays:
            dayofweek = date.weekday()
            # add additional dates to current holidays
            if name == "New Year's Day":
                lower_window = -1
            if name == "Thanksgiving":
                upper_window = 3
            if name == "Christmas Day":
                lower_window = -5
                upper_window = 6
            if name == "Labor Day":
                lower_window = -3
            if name == "Independence Day":
                upper_window = 3
            sales.loc[len(sales)] = [name, date, lower_window, upper_window]
'''
Include the Super Bowl as an additional custom holiday, since it is a major TV event that
can shift viewing and search behavior.
'''
superbowls = pd.DataFrame({
'holiday': 'Superbowl',
'ds': pd.to_datetime(['2014-02-02', '2015-02-01', '2016-02-07',
'2017-02-05', '2018-02-04', '2019-02-03',
'2020-02-02']),
'lower_window': 0,
'upper_window': 1,
})
superbowls['ds'] = superbowls['ds'].dt.date
sales = pd.concat([sales, superbowls],ignore_index=True)
sales = sales.sort_values(['ds'], ascending=[True]).reset_index(drop=True)
# Have a glance
sales.head()
adv_prophetdataframe = pd.DataFrame()
adv_prophetdataframe['ds'] = Netflix['period']
adv_prophetdataframe['y'] = Netflix['value']
adv_prophetdataframe.head()
adv_m = Prophet(holidays=sales)
adv_m.add_country_holidays(country_name='US')
adv_m.fit(adv_prophetdataframe)
# list the holiday names the model was trained with
print(adv_m.train_holiday_names)
'''
Create new DataFrame for prediction on next one year
'''
adv_future = adv_m.make_future_dataframe(periods = 365)
adv_forecast = adv_m.predict(adv_future)
adv_erroranalytics = adv_m.predict(adv_prophetdataframe)
adv_erroranalytics['value'] = adv_prophetdataframe['y']
adv_erroranalytics['residuals'] = adv_erroranalytics['value'] - adv_erroranalytics['yhat']
adv_erroranalytics['absoluteresiduals'] = abs(adv_erroranalytics['residuals'])
adv_erroranalytics.head()
adv_totalerror = adv_erroranalytics['absoluteresiduals'].sum()
print(adv_totalerror)
print(adv_totalerror/len(adv_erroranalytics))
How accurate does the model appear to be?
The accuracy of the model is represented here by the average absolute prediction error (MAE). A lower average error indicates higher accuracy. In this case, the forecast that includes holiday effects has a lower average prediction error (1127.7 versus 1297.8), so it is the more accurate of the two.
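The comparison above is an in-sample error. As a sketch, the `cross_validation` and `performance_metrics` helpers imported at the top (but not otherwise used here) could give an out-of-sample comparison; the `initial`, `period`, and `horizon` values below are illustrative assumptions.
# Sketch: out-of-sample comparison with Prophet's built-in diagnostics (window choices are assumptions)
cv_basic = cross_validation(m, initial='730 days', period='90 days', horizon='180 days')
cv_holiday = cross_validation(adv_m, initial='730 days', period='90 days', horizon='180 days')
print('baseline MAE:', performance_metrics(cv_basic)['mae'].mean())
print('holiday model MAE:', performance_metrics(cv_holiday)['mae'].mean())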
fig, ax = pyplot.subplots(figsize=(15,8))
ax.plot(adv_erroranalytics['ds'].dt.to_pydatetime(), adv_erroranalytics['absoluteresiduals'], 'k.', color='#0072B2')
ax.title.set_text('Residual Plot')
plt.show()
Are the residuals consistent across time? If not, why do you think it varies?
Based on the residual plot above, the residuals are not consistent across time. They tend to be larger at the beginning and end of each year and smaller elsewhere. I think the variation comes from holiday effects and outliers: even though holiday effects are included in the forecast, the swings in search interest around those periods are still too large to be fully captured.
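A small sketch to support that reading: list the dates with the largest absolute residuals and check whether they cluster around the turn of the year. The choice of the top ten is arbitrary.
# Sketch: the ten dates where the holiday-aware model misses by the most
worst = adv_erroranalytics.nlargest(10, 'absoluteresiduals')[['ds', 'value', 'yhat', 'residuals']]
print(worst)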
In section 3, I use the Prophet component plots to identify the projected high- and low-interest times.
# create forecast plot
fig1 = m.plot(forecast)
# create component plot
fig2 = adv_m.plot_components(adv_forecast)
What are the projected high interest times?
Based on the weekly component plot, Saturday and Sunday are the times when Google search interest in Netflix is high. The yearly plot indicates that January is also a high-interest time. These projected high-interest times suggest consumers have more free time to watch Netflix during weekends and around the New Year. Since some people have a summer break, summer also appears as a relatively active search period.
What are the projected low interest times?
From the same plots, search interest drops sharply on Monday and remains low from Tuesday to Thursday, then rises again on Friday and stays high over the weekend. On the annual scale, May tends to have the lowest search interest: interest decreases from January to February and keeps fluctuating from September to November. These patterns suggest consumers are busy during the weekdays and have less time to watch Netflix, and that people are also busy in the run-up to major holidays.
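The component plots can also be read numerically. As a sketch (my own addition), average the weekly and yearly components of the forecast DataFrame by day of week and by month:
# Sketch: numeric view of the weekly and yearly components from the Prophet forecast
adv_forecast['dayofweek'] = adv_forecast['ds'].dt.day_name()
adv_forecast['month'] = adv_forecast['ds'].dt.month
print(adv_forecast.groupby('dayofweek')['weekly'].mean().sort_values(ascending=False))
print(adv_forecast.groupby('month')['yearly'].mean().sort_values(ascending=False))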
In section 4, I combine Netflix ad-spend data with the Google Trends data and test for correlation and Granger causality.
# load adspend data
adspend = pd.read_csv('netflix.adspend.csv', encoding="ISO-8859-1")
# perform data cleaning
adspend['TIME PERIOD'] = adspend['TIME PERIOD'].str.replace("WEEK OF ","",regex=True)
adspend['TIME PERIOD'] = adspend['TIME PERIOD'].str.replace(" \(B\)", "",regex=True)
adspend.columns = adspend.columns.str.replace(" DOLS \(000\)", "",regex=True)
# Ignore the last 5 rows to avoid NA.
adspend = adspend[:-5]
adspend.tail()
# display all the columns names
list(adspend)
# convert time period to datetime format.
adspend['FIXED TIME'] = pd.to_datetime(adspend['TIME PERIOD'])
# aggregate total ad spend by time
adspendnoproducts = adspend.pivot_table(index = 'FIXED TIME', aggfunc = np.sum)
adspendnoproducts.head()
list(adspendnoproducts)
adspendnoproducts.index
# plot total ads spend
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, figsize =(15,15), sharex = 'all')
ax1.plot(adspendnoproducts.index, adspendnoproducts['CABLE TV'])
ax2.plot(adspendnoproducts.index, adspendnoproducts['NETWORK TV'])
ax3.plot(adspendnoproducts.index, adspendnoproducts['SPOT TV'])
ax4.plot(adspendnoproducts.index, adspendnoproducts['TOTAL'])
ax1.title.set_text('CABLE TV')
ax2.title.set_text('NETWORK TV')
ax3.title.set_text('SPOT TV')
ax4.title.set_text('TOTAL')
plt.show()
# set period as index
Netflix.set_index('period', inplace=True)
# resample the data and calculate the average
Netflixweekly = Netflix.resample('W-MON', closed='left', label='left').mean()
Netflixweekly.head()
# join weekly Netflix ad spends with aggregated Google trends data.
Netflixweeklydata = adspendnoproducts.join(Netflixweekly)
Netflixweeklydata.head()
# make a visual comparison of the weekly average Netflix ad spend data.
dimensions = (10, 10)
fig, (ax1, ax2) = plt.subplots(2, figsize = dimensions, sharex = 'all')
ax1.plot(Netflixweeklydata.index, Netflixweeklydata['TOTAL'])
ax2.plot(Netflixweeklydata.index, Netflixweeklydata['NETWORK TV'])
ax1.title.set_text('Total')
ax2.title.set_text('NETWORK TV')
plt.show()
Run a correlation test between total Netflix ad spend and the weekly average Google Trends values. The result shows only a very weak correlation. To get a more meaningful correlation, we first make the data stationary.
# perform correlation test
Netflixweeklydata['TOTAL'].corr(Netflixweeklydata['value'])
# According to the result, these two factors are correlated at a very small level.
# We need to convert the data to stationary format for time series analysis.
Calculate first (one-week) differences of `TOTAL` and store the values in a column called `TOTALdiff`. Also calculate first differences of `value` and store them in a column called `valuediff`.
# calculate first (weekly) differences for Google search interest and ad spend
Netflixweeklydata['TOTALdiff'] = Netflixweeklydata['TOTAL'].diff(1)
Netflixweeklydata['valuediff'] = Netflixweeklydata['value'].diff(1)
# calculate the correlation
Netflixweeklydata['TOTALdiff'].corr(Netflixweeklydata['valuediff'])
Netflixweeklydata.to_csv('adspend.googletrends.weekly.csv')
# read in data
adspend_trends = pd.read_csv('adspend.googletrends.weekly.csv', parse_dates = ['FIXED TIME'])
# set fixed time as index
adspend_trends.set_index('FIXED TIME', inplace = True)
list(adspend_trends)
# remove the helper columns created earlier and inspect the column names again
adspend_trends = adspend_trends.drop(columns = ['onediffvalue', 'Unnamed: 0', 'TOTALdiff', 'valuediff'], errors = 'ignore')
list(adspend_trends)
# use ADF to check the significance of stationarity
for acolumn in list(adspend_trends):
    adf_result = stattools.adfuller(adspend_trends[acolumn])
    print(adf_result[1])
adspend_trends_diff = pd.DataFrame()
# calculate first (weekly) differences to see if we can get a stationary dataset
for acolumn in list(adspend_trends):
    columnnames = "%s_diff" % (acolumn)
    adspend_trends_diff[columnnames] = adspend_trends[acolumn].diff(1)
adspend_trends_diff.head()
for acolumn in list(adspend_trends_diff):
    adf_result = stattools.adfuller(adspend_trends_diff[acolumn].iloc[1:])
    print(adf_result[1])
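For readability, a small sketch that labels each differenced series as stationary or not at the 5% level; the 0.05 threshold is the usual convention, used here as an assumption.
# Sketch: flag stationarity of each differenced series at the 5% level
for acolumn in list(adspend_trends_diff):
    pval = stattools.adfuller(adspend_trends_diff[acolumn].iloc[1:])[1]
    print('%-25s p=%.4f -> %s' % (acolumn, pval, 'stationary' if pval < 0.05 else 'not stationary'))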
list(adspend_trends_diff)
# calculate correlation between ads spends variables and Google search interests
correlationlist = []
for acolumn in list(adspend_trends_diff):
    if "value" not in acolumn:
        corr = adspend_trends_diff[acolumn].corr(adspend_trends_diff['value_diff'])
        print(acolumn, corr)
        correlationlist.append(corr)
# calculate the mean of these correlations.
np.mean(correlationlist)
# calculate the correlation again by using the original data.
correlationlist = []
for acolumn in list(adspend_trends):
    if "value" not in acolumn:
        corr = adspend_trends[acolumn].corr(adspend_trends['value'])
        print(acolumn, corr)
        correlationlist.append(corr)
np.mean(correlationlist)
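As a sketch, the raw and differenced correlations can also be collected into one table for a side-by-side view; the table construction below is my own addition.
# Sketch: side-by-side table of raw vs. differenced correlations per ad-spend channel
channels = [c for c in list(adspend_trends) if 'value' not in c]
corr_table = pd.DataFrame({
    'raw': [adspend_trends[c].corr(adspend_trends['value']) for c in channels],
    'differenced': [adspend_trends_diff[c + '_diff'].corr(adspend_trends_diff['value_diff']) for c in channels],
}, index=channels)
print(corr_table.sort_values('differenced', ascending=False))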
# generate heatmap for correlation
f, ax = plt.subplots(figsize = (10, 8))
corr = adspend_trends_diff.corr()
sns.heatmap(corr, mask = np.zeros_like(corr, dtype = bool),
            cmap = sns.diverging_palette(220, 10, as_cmap = True),
            square = True, ax = ax)
plt.show()
# perform granger causality test
# set the measurement period
numofweeks = 12
significantlags = []
for acolumn in list(adspend_trends_diff):
    if 'value' not in acolumn:
        testframe = adspend_trends_diff[['value_diff', acolumn]]
        testframe = testframe.iloc[1:]
        results = stattools.grangercausalitytests(testframe, numofweeks, verbose = False)
        for week in range(1, numofweeks + 1):
            # set level of significance
            if results[week][0]['params_ftest'][1] < 0.05:
                print('%s is significant at %s weeks' % (acolumn, week))
                significantlags.append(week)
Do any advertising expenditures drive Google search interest?
Yes, I think `CABLE TV` and `NEWSPAPER` spend can drive Google search interest. Moreover, `NAT SPOT RADIO`, `NETWORK RADIO`, and `SYNDICATION` also show longer-term effects on search interest. According to the correlation test between differenced ad spend and differenced Google search interest, `CABLE TV` and `NEWSPAPER` have the highest positive correlations with search interest, which suggests that increasing spend in these two categories might also increase search interest; they appear to be effective channels for advertising Netflix. As cable TV's most significant competitor, Netflix can show its advantages through a direct comparison on TV.
The Granger causality test does not measure correlation directly; instead, it indicates which lags are significant, meaning advertising expenditures can still drive search interest N weeks after the ads are seen.
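A quick sketch to summarize the test output: count how often each lag shows up as significant across channels (the use of `collections.Counter` is my own addition).
# Sketch: distribution of significant lags across all ad-spend channels
from collections import Counter
print(Counter(significantlags).most_common())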
fig, ax = pyplot.subplots(figsize=(15,8))
ax.plot(adspend_trends_diff.index, adspend_trends_diff['TOTAL_diff'])
ax.plot(adspend_trends_diff.index, adspend_trends_diff['value_diff'])
ax.title.set_text('Time Series Visualizations of Ad Spend and Google Search Interest')
ax.legend(('Ad Spend','Google Search Interest'))
plt.show()
Create a time series visualization that overlays both ad spend and Google search interest, to see if they visually align.
According to the plot above, the differenced Google search interest does broadly align with the differenced ad spend.
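Beyond eyeballing the overlay, a rolling correlation gives a rough sense of how the alignment changes over time; the 12-week window below is an arbitrary choice for illustration.
# Sketch: 12-week rolling correlation between differenced ad spend and differenced search interest
rolling_corr = adspend_trends_diff['TOTAL_diff'].rolling(12).corr(adspend_trends_diff['value_diff'])
fig, ax = pyplot.subplots(figsize=(15, 4))
ax.plot(rolling_corr.index, rolling_corr)
ax.title.set_text('12-week Rolling Correlation: Ad Spend vs Search Interest')
plt.show()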
# The most common lag in the list is 5 weeks.
mode(significantlags)
modelingdataset = adspend_trends_diff
modelingdataset.tail()
# created a 5-week lagged version of data
# attach 5 more weeks to the end of dataframe
date = pd.to_datetime('2018-10-08')
modelingdataset.loc[date] = np.nan
date = pd.to_datetime('2018-10-15')
modelingdataset.loc[date] = np.nan
date = pd.to_datetime('2018-10-22')
modelingdataset.loc[date] = np.nan
date = pd.to_datetime('2018-10-29')
modelingdataset.loc[date] = np.nan
date = pd.to_datetime('2018-11-05')
modelingdataset.loc[date] = np.nan
modelingdataset.tail()
# shift each ad spend column down by 1 to 5 rows to create lagged versions
for acolumn in list(modelingdataset):
    if 'value' not in acolumn:
        # get lags up to 5 weeks
        for alag in range(1,6):
            columnname = '%s_lag%s' % (acolumn, alag)
            modelingdataset[columnname] = modelingdataset[acolumn].shift(alag)
# sort the dataset columns by name
modelingdataset.sort_index(axis=1, inplace = True)
modelingdataset.head()
modelingdataset.to_csv('google_adspend_differenced_lagged_fordatarobot.csv')
In section 5, I fit a simple feed-forward neural network and a LassoLarsCV regression on the lagged ad-spend features to see how well they predict the differenced Google search interest.
# load final modeling data
alldata = pd.read_csv('google_adspend_differenced_lagged_fordatarobot.csv')
# drop rows contain empty values
alldata = alldata.iloc[6:]
alldata = alldata.iloc[:-5]
alldata['FIXED TIME'] = pd.to_datetime((alldata['FIXED TIME']))
Perform neural network modeling based on the GitHub instructions.
# Gradient descent algorithms perform better if the variables are within the range [-1, 1].
# The `value_diff` variable is min-max scaled to bound the transformed variable within [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
alldata['scaled_value_diff'] = scaler.fit_transform(np.array(alldata['value_diff']).reshape(-1, 1))
# split data in two parts - train set and validation set
# The neural network is trained on the train set.
# This means computation of the loss function, back propagation and weights updated by a gradient descent algorithm is done on the train set.
split_date = datetime.datetime(year=2017, month=11, day=10, hour=0)
df_train = alldata.loc[alldata['FIXED TIME']<split_date]
df_val = alldata.loc[alldata['FIXED TIME']>=split_date]
print('Shape of train:', df_train.shape)
print('Shape of test:', df_val.shape)
# reset index
df_train.reset_index(drop=True, inplace=True)
df_val.reset_index(drop=True, inplace=True)
# plot train times series of the standardized value_diff
plt.figure(figsize=(15, 6))
g = sns.lineplot(data=df_train['scaled_value_diff'], color='b')
g.set_title('Time series of Google Trends data differences in train set')
g.set_xlabel('Index')
g.set_ylabel('Google Trends data differences')
# plot validation times series of the standardized value_diff
plt.figure(figsize=(15, 6))
g = sns.lineplot(data=df_val['scaled_value_diff'], color='r')
g.set_title('Time series of Google Trends data differences in validation set')
g.set_xlabel('Index')
g.set_ylabel('Google Trends data differences')
plt.show()
def makeXy(ts, nb_timesteps):
    X = []
    y = []
    for i in range(nb_timesteps, ts.shape[0]):
        X.append(list(ts.loc[i-nb_timesteps:i-1]))
        y.append(ts.loc[i])
    X, y = np.array(X), np.array(y)
    return X, y
# generate regressors (X) and target variable (y) for train and validation sets
X_train, y_train = makeXy(df_train['scaled_value_diff'], 7)
print('Shape of train arrays:', X_train.shape, y_train.shape)
X_val, y_val = makeXy(df_val['scaled_value_diff'], 7)
print('Shape of validation arrays:', X_val.shape, y_val.shape)
# initialize model (Sequential, Dense, and GaussianNoise come from Keras)
from keras.models import Sequential
from keras.layers import Dense, GaussianNoise
model = Sequential()
model.add(Dense(7, activation='relu', kernel_initializer='uniform', input_shape=(7,)))
# add Gaussian noise during training for regularization
model.add(GaussianNoise(0.01))
model.add(Dense(16, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
# compile the model
# note: the target is continuous in [-1, 1], so a linear or tanh output with an MSE loss would
# usually be more appropriate than sigmoid + binary cross-entropy; kept as in the original run
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x=X_train, y=y_train, batch_size=201, epochs=100, verbose=1, validation_data=(X_val, y_val), shuffle=True)
# predict the results
preds = model.predict(X_val)
# calculate R^2
r2 = r2_score(df_val['scaled_value_diff'].loc[7:], preds)
print('R-squared for the validation set:', round(r2,4))
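The sigmoid output with binary cross-entropy above treats a continuous target like a classification problem, which likely contributes to the poor fit. As a sketch (an assumption about a more suitable setup, not the model used in the results reported at the end), a regression-style configuration with a tanh output and MSE loss would look like this:
# Sketch: a regression-oriented alternative (tanh output, MSE loss) for the same inputs
reg_model = Sequential()
reg_model.add(Dense(16, activation='relu', kernel_initializer='uniform', input_shape=(7,)))
reg_model.add(Dense(8, activation='relu', kernel_initializer='uniform'))
reg_model.add(Dense(1, activation='tanh', kernel_initializer='uniform'))
reg_model.compile(loss='mean_squared_error', optimizer='adam')
reg_model.fit(X_train, y_train, batch_size=16, epochs=100, verbose=0, validation_data=(X_val, y_val))
print('regression-style R^2:', round(r2_score(y_val, reg_model.predict(X_val)), 4))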
alldata.set_index(alldata['FIXED TIME'], inplace=True)
alldata = alldata.drop(columns=['FIXED TIME'])
listofallpredictors = []
for i in list(alldata):
    if "value" not in i:
        listofallpredictors.append(i)
predictors = alldata[listofallpredictors]
target = alldata['value_diff']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
model = LassoLarsCV(cv=10, precompute=False)
model = model.fit(pred_train.values, tar_train.values)
predictors_model = pd.DataFrame(listofallpredictors)
predictors_model.columns = ['label']
predictors_model['coeff'] = model.coef_
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('test data MSE')
print(test_error)
rsquared_test = model.score(pred_test, tar_test)
print('test data r-square')
print(rsquared_test)
Assess the degree to which your advanced approach outperforms the methods taught in class.
Neither approach fits my data well. The LassoLarsCV model has an R^2 of -0.067 and the neural network has an R^2 of -9.784. A negative R^2 means the fitted model performs worse than simply predicting the mean of the target.
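As a final sketch to make the negative R^2 concrete (my own addition): a constant prediction equal to the training-set mean scores close to zero on the same test split, so both models do worse than that trivial baseline.
# Sketch: R^2 of a constant, training-mean prediction on the Lasso test split
baseline = np.full(len(tar_test), tar_train.mean())
print('constant-mean baseline R^2:', round(r2_score(tar_test, baseline), 4))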