Intro
This is a short yet handy guide on how to analyze time series data in a few clicks using modern Python libraries.
The tech stack for this guide consists of:
- duckdb and pandas for data processing
- Facebook's prophet for data exploration and forecasting
- plotly for data visualization
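If you are starting from scratch, the whole stack can be installed from PyPI (the package names below are the PyPI ones; causalimpact and statsmodels are only needed for the CausalImpact wrapper and the side sketches further down):
Code
# Run once per environment; versions intentionally left unpinned
%pip install duckdb pandas numpy prophet plotly scikit-learn causalimpact statsmodels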
Code
import duckdb as db
import pandas as pd
import numpy as np
orders = pd.read_csv('orders_data.csv')
orders.sample()
| | Created Date | Country | City | Restaurant ID | Restaurant Name | Order State | Cancel Reason | Cuisine | Platform | Products in Order | Order Value € (Gross) | Delivery Fee | Delivery Time | Order ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44255 | 10.01.2020 | Portugal | Lisbon | 6167 | H3 Armazéns do Chiado | delivered | NaN | Mediterranean | ios | 1 | €9.55 | 0.0 | 14.7 | 379143844 |
Are there any discernible patterns or seasonality present?
Usually it’s good to start any time series analysis by tackling four key questions (a quick decomposition, sketched right after this list, helps with the first two):
- Is there any trend?
- Is there seasonality, and if so, which type: additive or multiplicative?
- Are there any explicit change points we should consider?
- Are there outliers? Get rid of them!
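Before modeling anything, a quick decomposition gives a first answer to the trend and seasonality questions. Below is a minimal sketch assuming statsmodels is installed; the synthetic series is just a stand-in for any daily series with a DatetimeIndex.
Code
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
# Synthetic daily series: linear trend plus a weekend bump, standing in for real data
idx = pd.date_range("2020-01-01", periods=59, freq="D")
series = pd.Series(100 + np.arange(59) + 10 * (idx.dayofweek >= 4), index=idx)
# period=7 targets a day-of-week component; re-run with model="multiplicative" to compare
seasonal_decompose(series, model="additive", period=7).plot()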
Data
If you have never experienced duckdb
before, I totally recommend trying it, as it boosts your efficiency by allowing simultaneous usage of the best features of Python and SQL
Code
df = db.query("""
SELECT strptime("Created Date", '%d.%m.%Y') as date
,"City" as city
,count(*) as orders
FROM
orders o
WHERE "Order State" = 'delivered'
GROUP BY 1, 2
""").to_df()
pivot = df.pivot(index='date', columns='city', values='orders')
pivot.head()
| date | Accra | Lisbon |
|---|---|---|
| 2020-01-01 | 124 | 882 |
| 2020-01-02 | 126 | 1104 |
| 2020-01-03 | 76 | 1110 |
| 2020-01-04 | 169 | 784 |
| 2020-01-05 | 198 | 788 |
Code
print(f"We have data for {pivot.shape[0]} consecutive days")
We have data for 59 consecutive days
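The word “consecutive” is worth a sanity check, since everything below assumes daily granularity without gaps; a one-line sketch:
Code
# A gapless daily index has exactly as many entries as a full date_range over its span
assert pd.date_range(pivot.index.min(), pivot.index.max(), freq="D").size == pivot.shape[0]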
Libraries
I’m using my own time series model wrappers to explore and forecast the data. They are built upon popular libraries from big tech companies, in this particular case Facebook Prophet
Code
import datetime
import re
from typing import Optional, List, Tuple
from causalimpact import CausalImpact as CI  # used by the CausalImpact wrapper below
from plotly import express, graph_objects
from plotly.subplots import make_subplots
from prophet import Prophet as FP
from prophet.utilities import regressor_coefficients
from sklearn.preprocessing import MinMaxScaler
Code
class ModelWrapper:
"""
Custom Wrapper for popular models for TimeSeries analysis, including but not limited to:
- Google Causal Impact https://pypi.org/project/causalimpact/
- Facebook Prophet https://facebook.github.io/prophet/
Parameters
----------
    df: DataFrame containing the target variable with `y` name and the time frame variable with `date` name
start_date: the start date of experiment with the format YYYY-MM-DD
test_days: the number of last days in which the experiment was run
end_date: the end date of experiment with the format YYYY-MM-DD
date: timeframe column name
y: target column name
Attributes
----------
explore: the chart exploring pre-experiment correlations
    show: visualization of the trained model, please note that you need to run the model first
"""
def __init__(
self,
df: pd.DataFrame,
start_date: Optional[str] = None,
test_days: Optional[int] = None,
end_date: Optional[str] = None,
date: str = "date",
y: str = "y"
) -> None:
self.df = df.copy()
self.date = date
self.y = y
if start_date:
self.start_dt = datetime.datetime.strptime(start_date, "%Y-%m-%d").date()
elif test_days:
if pd.core.dtypes.common.is_datetime64_dtype(self.df[self.date].dtype):
self.start_dt = self.df[self.date].max().date() - datetime.timedelta(days=test_days)
else:
self.start_dt = datetime.datetime.strptime(self.df[self.date].max(), "%Y-%m-%d").date() - datetime.timedelta(days=test_days)
else:
raise ValueError("You must specify start_date or test_days variable")
self.end_dt = datetime.datetime.strptime(end_date, "%Y-%m-%d").date() if end_date else None
self.x = list(self.df.columns.difference([self.date, self.y]).values)
self._preprocess()
def _preprocess(self):
self.df[self.date] = pd.to_datetime(self.df[self.date]).dt.date
self.df = self.df[[self.y, self.date] + self.x].set_index(self.date).sort_index()
for column in self.df.columns[self.df.dtypes == "object"]:
try:
self.df[column] = self.df[column].astype(float)
except ValueError:
raise ValueError("All DataFrame columns except Date must be numeric")
    @staticmethod
    def _save(figure, title: str) -> None:
        # Build a filesystem-safe lowercase file name from the chart title
        figure.write_html(
            re.sub(r'[-:<>|/\*\?"\\ ]+', '_', title.lower()) + ".html"
        )
def explore(
self,
scale: bool = True,
title: str = "pre-experiment correlation",
x_label: Optional[str] = None,
y_label: Optional[str] = None,
width: int = 900,
height: int = 600,
save: bool = False,
) -> None:
"""
        Plot the dynamics of the pre-experiment correlation
Parameters
----------
scale: whether the samples should be scaled to [0, 1] interval
title: the title of the chart
x_label: label for X axis
y_label: label for Y axis
width: the width of the chart
height: the height of the chart
save: whether you want to save the chart as HTML
"""
data = self.df.copy()
if scale:
scaler = MinMaxScaler(feature_range=(0, 1))
data[data.columns] = scaler.fit_transform(data)
corr = data[data.index < self.start_dt].corr().iloc[0, 1:]
data = pd.melt(
data,
value_vars=list(data.columns),
var_name="variable",
value_name="value",
ignore_index=False,
)
chart_title = title + f"""<br><sup>{', '.join(
[f"{feature}: {round(value, 2)}"
for feature, value in zip(self.x, corr)]
)}</sup></br>"""
figure = express.line(
data,
x=data.index,
y="value",
color="variable",
height=height,
width=width,
title=chart_title,
)
figure.add_vline(x=self.start_dt, line_width=2, line_dash="dash", line_color="white")
if self.end_dt:
figure.add_vline(x=self.end_dt, line_width=2, line_dash="dash", line_color="white")
figure.update_traces(hovertemplate="%{y}")
figure.update_xaxes(title_text=x_label if x_label else "Date")
figure.update_yaxes(title_text=y_label if y_label else "Scaled Axis" if scale else "Original Axis")
figure.update_layout(
title={
"x": 0.5,
},
legend={
"x": 0.05,
"y": 1.05,
"orientation": "h",
"title": None
},
hovermode="x",
template="plotly_dark",
xaxis=dict(hoverformat="%a, %b %d, %Y"),
)
if save:
self._save(figure, title)
figure.show()
@staticmethod
def _add_chart(
figure,
data: pd.DataFrame,
titles: list,
y: str,
name: str,
row: int,
actual: bool = False,
y_label: Optional[str] = None,
) -> None:
figure.add_trace(
graph_objects.Scatter(
x=data.index,
y=data[y],
name=name,
hovertemplate="%{y}",
line={"color": "white"},
legendgroup=f"{row}",
legendgrouptitle={"text": titles[row]},
connectgaps=True,
),
row=row,
col=1,
)
figure.add_trace(
graph_objects.Scatter(
x=data.index,
y=data[y + "_upper"],
name="Upper bound",
hovertemplate="%{y}",
line={"color": "deepskyblue", "width": 0.5},
legendgroup=f"{row}",
connectgaps=True,
),
row=row,
col=1,
)
figure.add_trace(
graph_objects.Scatter(
x=data.index,
y=data[y + "_lower"],
name="Lower bound",
hovertemplate="%{y}",
fill="tonexty",
line={"color": "deepskyblue", "width": 0.5},
legendgroup=f"{row}",
connectgaps=True,
),
row=row,
col=1,
)
figure.update_yaxes(title_text="" if not y_label else "% Effect" if row == 4 else y_label)
if actual:
figure.add_trace(
graph_objects.Scatter(
x=data.index,
y=data["response"],
name="Actual",
hovertemplate="%{y}",
line={"color": "red"},
legendgroup=f"{row}",
connectgaps=True,
),
row=row,
col=1,
)
else:
figure.add_hline(y=0, line_width=1, line_color="white", row=row, col=1)
def show(
self,
keep_n_prior_days: Optional[int] = None,
title: str = "Causal Impact",
x_label: Optional[str] = None,
y_label: Optional[str] = None,
width: int = 900,
height: int = 600,
save: bool = False,
) -> None:
"""
Plot the trained model results
Parameters
----------
keep_n_prior_days: specify the exact number of pre-experiment days you want to keep, skip if you want to show all the time frame
title: the title of the chart
x_label: label for X axis
y_label: label for Y axis
width: the width of the chart
height: the height of the chart
save: whether you want to save the chart as HTML
"""
try:
if keep_n_prior_days:
data = self.result[
self.result.index > self.start_dt - datetime.timedelta(days=keep_n_prior_days)
]
else:
data = self.result.iloc[1:]
except AttributeError:
raise AttributeError("To show the results run the model first, use .run() method")
titles = [
"Model Overview",
"Actual vs Expected",
"Effect Size: Actual - Expected",
"Cumulative Effect",
"Effect Relative to Expected",
]
if isinstance(self, CausalImpact):
figure = make_subplots(
rows=4, cols=1, shared_xaxes=True, subplot_titles=titles[1:]
)
for y, name, row in zip(
["point_pred", "point_effect", "cum_effect", "rel_effect"],
["Expected Values", "Effect Size", "Cumulative Effect", "Relative Effect"],
range(1, 5)
):
self._add_chart(
figure,
data,
titles,
y=y,
name=name,
row=row,
actual=(row == 1),
y_label=y_label
)
elif isinstance(self, Prophet):
row = 1
figure = make_subplots()
self._add_chart(
figure,
data,
titles,
y="yhat",
name="Expected Value",
row=row,
actual=True,
y_label=y_label
)
figure.update_xaxes(title_text=x_label if x_label else "Date", row=row, col=1)
figure.add_vline(x=self.start_dt, line_width=2, line_dash="dash", line_color="white")
if self.end_dt:
figure.add_vline(x=self.end_dt, line_width=2, line_dash="dash", line_color="white")
figure.update_layout(
title={
"x": 0.5,
"text": title,
},
width=width,
height=height,
hovermode="x",
template="plotly_dark",
legend={
"x": 0.0,
"y": -0.2,
"orientation": "h",
"groupclick": "toggleitem",
"traceorder": "grouped",
},
xaxis=dict(hoverformat="%a, %b %d, %Y"),
)
if save:
self._save(figure, title)
figure.show()
Code
class CausalImpact(ModelWrapper):
def run(
self,
nseasons: int = 7,
season_duration: int = 1,
alpha: float = 0.05,
**kwargs
) -> pd.DataFrame:
"""
Run causal impact analysis
Parameters
----------
nseasons: Period of the seasonal components.
In order to include a seasonal component, set this to a whole number greater than 1.
For example, if the data represent daily observations, use 7 for a day-of-week component.
This interface currently only supports up to one seasonal component.
        season_duration: Duration of each season, i.e., number of data points each season spans.
            For example, to add a day-of-week component to data with daily granularity, use nseasons=7 and season_duration=1.
            To add a day-of-week component to data with hourly granularity, use nseasons=7 and season_duration=24.
alpha : Desired tail-area probability for posterior intervals. Defaults to 0.05, which will produce central 95% intervals
Other Parameters
----------------
**kwargs : model_args variables, available options:
ndraws: number of MCMC samples to draw.
More samples lead to more accurate inferences. Defaults to 1000.
nburn: number of burn in samples.
            This specifies how many of the initial samples will be discarded. Defaults to 10% of ndraws.
standardize_data: whether to standardize all columns of the data before fitting the model.
This is equivalent to an empirical Bayes approach to setting the priors.
It ensures that results are invariant to linear transformations of the data.
prior_level_sd: prior standard deviation of the Gaussian random walk of the local level.
Expressed in terms of data standard deviations. Defaults to 0.01.
A typical choice for well-behaved and stable datasets with low residual volatility after regressing out known predictors.
When in doubt, a safer option is to use 0.1, as validated on synthetic data,
although this may sometimes give rise to unrealistically wide prediction intervals.
dynamic_regression: whether to include time-varying regression coefficients.
In combination with a time-varying local trend or even a time-varying local level,
this often leads to overspecification, in which case a static regression is safer. Defaults to FALSE.
"""
data = self.df.copy().reset_index()
prior = [
data.index.min(),
int(data[data[self.date] < self.start_dt].index.max())
]
posterior = [
int(data[data[self.date] >= (self.end_dt if self.end_dt else self.start_dt)].index.min()),
data.index.max()
]
data.drop(columns=[self.date], inplace=True)
self.ci = CI(
data,
prior,
posterior,
model_args={
"nseasons": nseasons,
"season_duration": season_duration,
**kwargs
},
alpha=alpha,
)
self.ci.run()
def summary(self, format: str = "summary") -> None:
"""
Print the summary for Causal Impact Analysis model
Parameters
----------
format: can be 'summary' to return a table or 'report' to return a natural language description
"""
try:
self.ci.summary(format)
except AttributeError:
raise AttributeError("To get the summary run the model first, use .run() method")
self.result = self.ci.inferences.set_index(self.df.index)
for suffix in ["", "_lower", "_upper"]:
self.result["rel_effect" + suffix] = self.result["point_effect" + suffix] / self.result["point_pred"]
Code
class Prophet(ModelWrapper):
def run(
self,
growth: str = "linear",
weekly_seasonality: bool = True,
monthly_seasonality: bool = True,
yearly_seasonality: bool = True,
seasonality_mode: str = "additive",
country_holidays: Optional[str] = None,
outliers: Optional[List[Tuple[str]]] = None,
floor: Optional[int] = None,
cap: Optional[int] = None,
alpha: float = 0.05,
**kwargs,
) -> pd.DataFrame:
"""
Run time-series forecasting
Parameters
----------
growth: String 'linear', 'logistic' or 'flat' to specify a linear, logistic or flat trend.
weekly_seasonality: Fit weekly seasonality.
monthly_seasonality: Fit monthly seasonality.
yearly_seasonality: Fit yearly seasonality.
seasonality_mode: 'additive' (default) or 'multiplicative'.
        country_holidays: country code (e.g. 'RU') of the country whose holidays are to be considered
        outliers: list of time intervals (start date, end date) with the format YYYY-MM-DD where there are outliers
        floor: minimum allowed value for the target variable. It's particularly useful with "logistic" growth type
        cap: maximum allowed value for the target variable. It's particularly useful with "logistic" growth type
alpha: 1 - width of the uncertainty intervals provided for the forecast.
Other Parameters
----------------
**kwargs : model_args variables, to reveal the whole list follow the Prophet documentation, for example:
n_changepoints: Number of potential changepoints to include. Not used
if input `changepoints` is supplied. If `changepoints` is not supplied,
then n_changepoints potential changepoints are selected uniformly from
the first `changepoint_range` proportion of the history.
changepoint_range: Proportion of history in which trend changepoints will
be estimated. Defaults to 0.8 for the first 80%. Not used if
`changepoints` is specified.
changepoint_prior_scale: Parameter modulating the flexibility of the
automatic changepoint selection. Large values will allow many
changepoints, small values will allow few changepoints.
"""
        data = self.df.copy().reset_index().rename(columns={self.date: "ds", self.y: "y"}).sort_values("ds")
        # Prophet's make_future_dataframe returns datetime64 "ds"; align dtypes for the joins below
        data["ds"] = pd.to_datetime(data["ds"])
        if cap:
            data["cap"] = cap
        if floor:
            data["floor"] = floor
        # Copy the train slice so the outlier masking below doesn't trigger a SettingWithCopyWarning
        train, test = data[data["ds"] < self.start_dt].copy(), data[data["ds"] >= self.start_dt]
self.model = FP(
growth=growth,
weekly_seasonality=weekly_seasonality,
yearly_seasonality=yearly_seasonality,
seasonality_mode=seasonality_mode,
interval_width=1-alpha,
**kwargs,
)
if monthly_seasonality:
self.model.add_seasonality(name="monthly", period=30.5, fourier_order=5)
if country_holidays:
self.model.add_country_holidays(country_name=country_holidays)
if outliers:
for pair in outliers:
train.loc[
(train["ds"] > datetime.datetime.strptime(pair[0], "%Y-%m-%d").date()) &
(train["ds"] < datetime.datetime.strptime(pair[1], "%Y-%m-%d").date()),
"y"] = None
for feature in self.x:
self.model.add_regressor(feature)
self.model.fit(train)
future = self.model.make_future_dataframe(periods=test.shape[0])
self.result = self.model.predict(future.set_index("ds").join(data.set_index("ds")).reset_index())
self.result["ds"] = self.result["ds"].dt.date
        self.result = self.result.set_index("ds").join(data.assign(ds=data["ds"].dt.date)[["ds", "y"]].set_index("ds")).rename(columns={"y": "response"})
def summary(
self,
width: int = 900,
height: int = 600,
save: bool = False,
) -> pd.DataFrame:
"""
Plot the regressors statistics: Coefficients, Impact and Impact Share
The estimated beta coefficient for each regressor roughly represents the increase
in prediction value for a unit increase in the regressor value.
Note that the coefficients returned are always on the scale of the original data
In addition the credible interval for each coefficient is also returned,
which can help identify whether each regressor is “statistically significant”.
On the basis of `seasonality_mode` the model looks like:
Additive: y(t) ~ trend(t) + seasonality(t) + beta * regressor(t)
Multiplicative: y(t) ~ trend(t) * ( 1 + seasonality(t) + beta * regressor(t) )
        Therefore, the incremental impacts are:
        Additive: increasing the value of the regressor by a unit leads to an increase in y(t) by beta units
        Multiplicative: increasing the value of the regressor by a unit leads to an increase in y(t) by beta * trend(t) units
        The Impact is the product incremental_impact(t) * regressor(t), and finally, Share is the percentage of the absolute Impact
Parameters
----------
width: the width of the chart
height: the height of the chart
save: whether you want to save the chart as HTML
"""
try:
data = regressor_coefficients(self.model)
except AttributeError:
raise AttributeError("To get the summary run the model first, use .run() method")
last_day_data, last_day_result = self.df.iloc[-1, :], self.result.iloc[-1, :]
data["incremental_impact"] = data["coef"] * data["regressor_mode"].apply(lambda x: last_day_result["trend"] if x == "multiplicative" else 1)
data["impact"] = data.apply(lambda x: x["incremental_impact"] * last_day_data[x["regressor"]], axis=1)
data["share"] = round(100 * np.abs(data["impact"]) / np.sum(np.abs(data["impact"])), 2)
def plot_bar(data, y, title):
figure = express.bar(
data,
x="regressor",
y=y,
color="regressor",
color_discrete_sequence=express.colors.sequential.Jet,
)
figure.add_hline(y=0, line_width=1, line_color="white")
figure.update_xaxes(title_text=None)
figure.update_layout(
title={
"x": 0.5,
"text": title,
},
width=width,
height=height,
hovermode="x",
template="plotly_dark",
showlegend=False,
)
if save:
self._save(figure, title)
figure.show()
for y, title in zip(["coef", "impact"], ["Coefficients", "Impact"]):
plot_bar(data.sort_values(by="coef", ascending=False), y, title)
pie = express.pie(
data.sort_values(by="coef", ascending=False),
color_discrete_sequence=express.colors.sequential.Jet,
values="share",
names="regressor",
color="regressor",
)
pie.update_layout(
title={
"x": 0.5,
"text": "Impact Share",
},
width=width,
height=height,
hovermode="x",
template="plotly_dark",
showlegend=False,
)
pie.update_traces(
marker=dict(
line=dict(color="black", width=3)
)
)
if save:
self._save(pie, "Impact Share")
pie.show()
Model
Sure, you could spend more time trying different models like moving averages and seasonal ARIMA, eventually evaluating them and choosing the best one, but here is a short way to get a pretty good baseline model that is fast and robust enough.
Code
model = Prophet(pivot.reset_index(), y="Lisbon", date="date", test_days=7)
model.explore(scale=True, title="Correlation: Lisbon vs Accra", save=True)
The default implemented strategy saves plotly
figures as html
files so they are viewable outside of the IPython environment.
However, if one is interested in a more uniform way of storage, here is a function to convert the saved html
into json
without rebuilding the Plotly figure from scratch.
Code
import plotly
import json
import re
def plotly_html_to_json(name: str) -> None:
    with open(f"{name}.html", "r", encoding="utf-8") as file:
        html = file.read()
    # The figure spec lives in the Plotly.newPlot(...) call near the end of the file,
    # so scanning the last 64 KiB is enough
    call_arg_str = re.findall(r"Plotly\.newPlot\((.*)\)", html[-2**16:])[0]
    call_args = json.loads(f'[{call_arg_str}]')
    # newPlot arguments are (div_id, data, layout, config); keep data and layout only
    plotly_json = {'data': call_args[1], 'layout': call_args[2]}
    figure = plotly.io.from_json(json.dumps(plotly_json))
    figure.write_json(f"{name}.json")
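For example, the correlation chart saved earlier (its file name is derived from the chart title by _save) can be converted like this:
Code
# "Correlation: Lisbon vs Accra" becomes correlation_lisbon_vs_accra.html
plotly_html_to_json("correlation_lisbon_vs_accra")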
Key findings:
- There is a steadily growing trend for both cities
- The seasonality differs between the cities:
  - additive weekly seasonality for Lisbon (Thursday and Friday are the max, Sunday is the min)
  - for Accra it’s not so explicit, and the pattern is different (the maximum is often reached on weekends, and Friday is the min)
- The change points differ between the cities:
  - for Lisbon, since the beginning of February the increasing trend has been replaced by a flat one
  - for Accra the change is more subtle, and given the difference in absolute numbers, the confidence in such a trend is much lower
- There are outliers:
  - for Lisbon, Feb 26 is enormously high; the reason could be the national holiday, Carnival, on Feb 25
  - in addition, Feb 14 is not a public holiday but a nice time to have a romantic dinner at home by candlelight
  - for Accra, Feb 7 is enormously low and could be an object for detailed analysis (covered a bit below)
  - moreover, Feb 26 is the maximum for Accra as well, so maybe a promo took place in both cities
  - the last data point should be the minimum for Accra but still looks too low; the dataset might not include all the orders until the end of the day
Outliers Investigation
To interpret the outliers, one of the possible options is to explore the delivery success rate
Code
db.query("""
WITH temp AS (
SELECT "Created Date"
,"City"
,"Order State"
,COUNT("Order ID") as "Orders Count"
FROM
orders o
WHERE "City" in ('Accra', 'Lisbon')
AND "Created Date" IN (
'06.02.2020',
'07.02.2020',
'08.02.2020',
'26.02.2020',
'27.02.2020',
'28.02.2020'
)
GROUP BY ALL
),
totals AS (
SELECT "Created Date"
,"City"
,SUM("Orders Count") AS "Orders Total"
FROM temp
GROUP BY 1, 2
)
SELECT temp.*, 100 * ROUND("Orders Count"/"Orders Total", 4) AS "Share"
FROM temp
JOIN totals
USING ("Created Date", "City")
ORDER BY ALL
""").to_df()
| | Created Date | City | Order State | Orders Count | Share |
|---|---|---|---|---|---|
| 0 | 06.02.2020 | Accra | delivered | 229 | 91.60 |
| 1 | 06.02.2020 | Accra | failed | 11 | 4.40 |
| 2 | 06.02.2020 | Accra | rejected | 10 | 4.00 |
| 3 | 06.02.2020 | Lisbon | delivered | 1855 | 98.78 |
| 4 | 06.02.2020 | Lisbon | failed | 1 | 0.05 |
| 5 | 06.02.2020 | Lisbon | rejected | 22 | 1.17 |
| 6 | 07.02.2020 | Accra | delivered | 161 | 86.10 |
| 7 | 07.02.2020 | Accra | failed | 23 | 12.30 |
| 8 | 07.02.2020 | Accra | rejected | 3 | 1.60 |
| 9 | 07.02.2020 | Lisbon | delivered | 1725 | 99.19 |
| 10 | 07.02.2020 | Lisbon | failed | 4 | 0.23 |
| 11 | 07.02.2020 | Lisbon | rejected | 10 | 0.58 |
| 12 | 08.02.2020 | Accra | delivered | 264 | 95.65 |
| 13 | 08.02.2020 | Accra | failed | 4 | 1.45 |
| 14 | 08.02.2020 | Accra | rejected | 8 | 2.90 |
| 15 | 08.02.2020 | Lisbon | delivered | 1246 | 99.12 |
| 16 | 08.02.2020 | Lisbon | failed | 6 | 0.48 |
| 17 | 08.02.2020 | Lisbon | rejected | 5 | 0.40 |
| 18 | 26.02.2020 | Accra | delivered | 355 | 96.47 |
| 19 | 26.02.2020 | Accra | failed | 6 | 1.63 |
| 20 | 26.02.2020 | Accra | rejected | 7 | 1.90 |
| 21 | 26.02.2020 | Lisbon | delivered | 2307 | 98.84 |
| 22 | 26.02.2020 | Lisbon | failed | 8 | 0.34 |
| 23 | 26.02.2020 | Lisbon | rejected | 19 | 0.81 |
| 24 | 27.02.2020 | Accra | delivered | 300 | 92.31 |
| 25 | 27.02.2020 | Accra | failed | 13 | 4.00 |
| 26 | 27.02.2020 | Accra | rejected | 12 | 3.69 |
| 27 | 27.02.2020 | Lisbon | delivered | 2201 | 98.79 |
| 28 | 27.02.2020 | Lisbon | failed | 6 | 0.27 |
| 29 | 27.02.2020 | Lisbon | rejected | 21 | 0.94 |
| 30 | 28.02.2020 | Accra | delivered | 206 | 85.83 |
| 31 | 28.02.2020 | Accra | failed | 17 | 7.08 |
| 32 | 28.02.2020 | Accra | rejected | 17 | 7.08 |
| 33 | 28.02.2020 | Lisbon | delivered | 2030 | 98.88 |
| 34 | 28.02.2020 | Lisbon | failed | 7 | 0.34 |
| 35 | 28.02.2020 | Lisbon | rejected | 16 | 0.78 |
If you’re a fan of pandas
, it may of course be easier to apply similar operations within the familiar package, and it may even seem the more Pythonic way to do so. But believe me, once you start using duckdb
, it turns out that some operations, like complex joins and window functions, are just more native to SQL, and you naturally perform them faster there; work smarter, not harder: if you’re reinventing the wheel by writing SQL-like transformations in pandas, nobody beyond your inner perfectionist will reward it.
Pandas alternative
Code
temp = orders[
    orders["City"].isin(["Accra", "Lisbon"]) &
    orders["Created Date"].isin([
        '06.02.2020', '07.02.2020', '08.02.2020',
        '26.02.2020', '27.02.2020', '28.02.2020',
    ])
].groupby(["Created Date", "City", "Order State"]).agg({"Order ID": "count"})
temp.join(
temp.groupby(["Created Date", "City"]).sum(), on=["Created Date", "City"], rsuffix=" Total"
).apply(
lambda x: 100 * round(x["Order ID"] / x["Order ID Total"], 4), axis=1
)
Insight: a quick analysis for Feb 7 and Feb 28 shows that the percentage of delivered orders in Accra is lower than usual, while for Lisbon there is no such drop; that is one potential reason for the outliers
Make a forecast of the number of orders expected in the following 4 weeks
The correlation between the two cities is not high enough (the Pearson coefficient is below 0.5) to leverage it by providing the other city as a predictor
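That claim is easy to verify directly from the pivot:
Code
# Pearson correlation between the two cities' daily order counts
print(round(pivot["Lisbon"].corr(pivot["Accra"]), 2))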
Portugal (Lisbon)
Fit
Code
model = Prophet(pivot["Lisbon"].reset_index(), y="Lisbon", date="date", test_days=7)
Incorporate all the logic described within the previous point:
- linear trend
- weekly additive seasonality
- Carnival and Valentine's Day as outliers
Last week as a validation time frame
Code
model.run(
growth='linear',
yearly_seasonality=False,
monthly_seasonality=False,
seasonality_mode='additive',
country_holidays='PT',
outliers=[("2020-02-13", "2020-02-15")]
)
model.show(
title="Lisbon",
y_label="Orders, #",
save=True,
)
Summary: the model fits well; the last week expectedly exceeds the upper bound due to the holiday
Predict
Code
lisbon = model.model.make_future_dataframe(periods=36).join(pivot["Lisbon"], on="ds")
model = Prophet(lisbon, y="Lisbon", date="ds", test_days=28)
model.run(
growth='linear',
yearly_seasonality=False,
monthly_seasonality=False,
seasonality_mode='additive',
country_holidays='PT',
outliers=[("2020-02-13", "2020-02-15")]
)
model.show(
title="Lisbon Prognosis",
y_label="Orders, #",
save=True,
)
Ghana (Accra)
The same algorithm for Accra
Fit
Code
model = Prophet(pivot["Accra"].reset_index(), y="Accra", date="date", test_days=7)
Incorporate all the logic described within the previous point:
- linear trend
- weekly additive seasonality
- Feb 7 & Feb 28 as outliers
Last week as a validation time frame
Code
model.run(
growth='linear',
yearly_seasonality=False,
monthly_seasonality=False,
seasonality_mode='additive',
outliers=[("2020-02-06", "2020-02-08"), ("2020-02-27", "2020-02-29")]
)
model.show(
title="Accra",
y_label="Orders, #",
save=True,
)
Summary: the model fits well, but the last week falls below the lower bound of the CI; as explored earlier, one of the reasons is the drop in the delivery success rate. Anyway, let's take this week into account for the March forecast.
Predict
Code
accra = model.model.make_future_dataframe(periods=36).join(pivot["Accra"], on="ds")
model = Prophet(accra, y="Accra", date="ds", test_days=28)
model.run(
growth='linear',
yearly_seasonality=False,
monthly_seasonality=False,
seasonality_mode='additive',
outliers=[("2020-02-06", "2020-02-08"), ("2020-02-27", "2020-02-29")]
)
model.show(
title="Accra Prognosis",
y_label="Orders, #",
save=True,
)
Several Notes:
- Lisbon:
  - Valentine's Day: Feb 14 was an unusual day, and my initial idea seems to be right based upon the cuisines specific to this date. Burgers, the main cuisine, were not so popular that day, but there were spikes for group meals and “romantic” food, namely Japanese (sushi) & Italian (pizza), as well as breakfast, ice cream and desserts
  - Average Check: during the top weekdays, in addition to the number of orders, the average order value grows: not only do more people order, but each person orders more
- Accra:
  - Delivered Percentage: first of all, the delivered percentage for Accra is generally lower than for Lisbon (statistical significance is checked in the sketch below); secondly, a more accurate look at the dynamics confirms there were indeed some problems on Feb 7 (driver: ios failed) & Feb 28 (driver: android rejected)
  - Delivery Time: there is a positive trend in delivery time; one of the reasons could be the launch of new suppliers, but it's impossible to check precisely using the given time frame only
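The significance check mentioned above can be sketched with a two-proportion z-test; statsmodels is assumed to be available, and the test covers the whole dataset rather than single days.
Code
from statsmodels.stats.proportion import proportions_ztest
# Delivered orders vs total orders per city
counts = orders.groupby("City")["Order State"].agg(
    delivered=lambda s: (s == "delivered").sum(), total="count"
).loc[["Accra", "Lisbon"]]
# One-sided test: is Accra's delivered share smaller than Lisbon's?
stat, p_value = proportions_ztest(counts["delivered"], counts["total"], alternative="smaller")
print(f"z = {stat:.2f}, p = {p_value:.4f}")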
Conclusion
Instead of a summary: if you don't have enough time to go into too much detail (it doesn't mean that one can solve only well-formulated problems; sometimes it only means that a person indeed doesn't have much time), use this high-level approach to explore your data and get immediate insights about its structure. After all, an arbitrary amount of time can be spent on researching the data, but business never wants to wait.