Deep Learning with R for Beginners

Preparing the data for our models

Now that we have downloaded and validated the data, we can use it to create a dataset for our binary classification and regression model tasks. We want to be able to predict which customers will visit the shop in the next two weeks for the binary classification task, and how much they will spend in the next two weeks for the regression task. The Chapter4/prepare_data.R script transforms the raw transactional data into a format suitable for machine learning. You need to run the code to create the dataset for the models, but you do not have to understand exactly how it works. Feel free to skip ahead if you want to focus on the deep learning model building.

We need to transform our data into a format suitable for the prediction tasks: a single row for each instance we want to predict. The columns will include fields that are features (X variables) and a field that is the target value (Y variable). Because we want to predict whether or not a customer returns and how much they spend, our dataset will have a single row per customer, with feature columns and target columns.

The first step is to find the cut-off date that separates the data used to build the features (X) from the data used to build the target (Y). The code finds the last transaction date in the data and then subtracts 13 days from it; this is the cut-off date. We want to predict which customers will spend in our shops on or after the cut-off date, based on what happens before it. The data before the cut-off date will be used to create our X, or feature, variables, and the sales data on or after the cut-off date will be used to create our Y, or target, variables. The following is that part of the code:

library(readr)
library(reshape2)
library(dplyr)

source("import.R")

# step 1, merge files
import_data(data_directory,bExploreData=0)

# step 2, group and pivot data
fileName <- paste(data_directory,"all.csv",sep="")
fileOut <- paste(data_directory,"predict.csv",sep="")
df <- read_csv(fileName,col_types = cols(.default = col_character()))

# convert spend to numeric field
df$SPEND<-as.numeric(df$SPEND)

# group sales by date. we have not converted the SHOP_DATE to date
# but since it is in yyyymmdd format,
# then ordering alphabetically will preserve date order
sumSalesByDate <- df %>%
  group_by(SHOP_WEEK, SHOP_DATE) %>%
  summarise(sales = sum(SPEND))

# we want to get the cut-off date to create our data model
# this is the last date and go back 13 days beforehand
# therefore our X data only looks at everything from start to max date - 13 days
# and our Y data only looks at everything from max date - 13 days to end (i.e. 14 days)
max(sumSalesByDate$SHOP_DATE)
[1] "20080706"
sumSalesByDate2 <- sumSalesByDate[order(sumSalesByDate$SHOP_DATE),]
datCutOff <- as.character(sumSalesByDate2[(nrow(sumSalesByDate2)-13),]$SHOP_DATE)
datCutOff
[1] "20080623"
rm(sumSalesByDate,sumSalesByDate2)
If this code does not run, the most probable reason is that the source data was not saved in the correct location. The dataset must be unzipped into a directory called dunnhumby/in under the code folder, that is, at the same level as the chapter folders.

The last date in our data is 20080706, which is July 6th, 2008, and the cut-off date is June 23rd, 2008. Although we have data going back to 2006, we will only use sales data from 2008; data older than six months is unlikely to influence a future customer sale. The task is to predict whether a customer will return between June 23rd, 2008 and July 6th, 2008 based on their activity before June 23rd, 2008.
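If you want to sanity-check this date arithmetic, you can reproduce the cut-off with R's Date class. This is just a minimal sketch; the script itself takes the 14th row from the end of the daily sales summary, which gives the same result when every date has sales:

# parse the yyyymmdd string, step back 13 days,
# and format the result back into yyyymmdd form
lastDate <- as.Date("20080706", format = "%Y%m%d")
format(lastDate - 13, "%Y%m%d")
[1] "20080623"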

We now need to create features from our data. We will use the spend broken down by department code, that is, the PROD_CODE_40 field. We could simply group the sales by this department code, but that would give spend in January 2008 the same weight as spend in June 2008, and we would like to incorporate some time factor into our feature columns. Instead, we will create features from a combination of the department code and the week, which allows our models to place more importance on recent activity. First, we group by customer code, week, and department code, and create the fieldName column. We then pivot this data to create our features (X) dataset. The cell values in this dataset are the sales for that customer (row) and that week-department code combination (column). Here is an example of how the data is transformed for two customers: Table 4.2 shows the sales spend by week and the PROD_CODE_40 field, and Table 4.3 then uses a pivot to create a dataset that has a single row per customer, with the aggregated fields as columns and the spend as values:

Table 4.2: Summary of sales by customer code, department code, and week
Table 4.3: Data from Table 4.2 after transformation
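To make the pivot concrete, here is a small toy example of the same group-then-pivot pattern. The customer codes, weeks, department codes, and spend values are all made up purely for illustration:

library(reshape2)  # for dcast (already loaded above)

# toy data in the shape of Table 4.2: one row per customer, week, and department
toyData <- data.frame(
  CUST_CODE = c("CUST0001", "CUST0001", "CUST0002"),
  SHOP_WEEK = c("200815", "200816", "200816"),
  PROD_CODE_40 = c("D00001", "D00002", "D00001"),
  sales = c(5.50, 3.25, 10.00)
)
toyData$fieldName <- paste(toyData$PROD_CODE_40, toyData$SHOP_WEEK, sep="_")

# pivot to the shape of Table 4.3: one row per customer,
# one column per department-week combination, spend as the values
toyWide <- dcast(toyData, CUST_CODE ~ fieldName, value.var="sales")
toyWide[is.na(toyWide)] <- 0
toyWide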

Here is the code that does this transformation:

# we are going to limit our data here to year 2008 only
# group data and then pivot it
sumTemp <- df %>%
  filter((SHOP_DATE < datCutOff) & (SHOP_WEEK >= "200801")) %>%
  group_by(CUST_CODE, SHOP_WEEK, PROD_CODE_40) %>%
  summarise(sales = sum(SPEND))
sumTemp$fieldName <- paste(sumTemp$PROD_CODE_40, sumTemp$SHOP_WEEK, sep="_")
df_X <- dcast(sumTemp, CUST_CODE ~ fieldName, value.var="sales")
df_X[is.na(df_X)] <- 0

The target (Y) variable is a flag indicating whether the customer visited the shop in the weeks from 200818 to 200819. We group the data after the cut-off date by customer and sum the sales; these form the basis of our Y values. We then join the X and Y datasets, keeping all rows on the X side by doing a left-join. Finally, we create a 1/0 flag for our binary classification task. When we are done, we have 3933 records in our dataset: 1560 customers who did not return and 2373 customers who did. We finish by saving our file for the model-building. The following code shows these steps:

# y data just needs a group to get sales after cut-off date
df_Y <- df %>%
  filter(SHOP_DATE >= datCutOff) %>%
  group_by(CUST_CODE) %>%
  summarise(sales = sum(SPEND))
colnames(df_Y)[2] <- "Y_numeric"

# use left join on X and Y data, need to include all values from X
# even if there is no Y value
dfModelData <- merge(df_X,df_Y,by="CUST_CODE", all.x=TRUE)
# set binary flag
dfModelData$Y_categ <- 0
dfModelData[!is.na(dfModelData$Y_numeric),]$Y_categ <- 1
dfModelData[is.na(dfModelData$Y_numeric),]$Y_numeric <- 0
rm(df,df_X,df_Y,sumTemp)

nrow(dfModelData)
[1] 3933
table(dfModelData$Y_categ)
0 1
1560 2373

# shuffle data
dfModelData <- dfModelData[sample(nrow(dfModelData)),]

write_csv(dfModelData,fileOut)

We used the sales data to create our feature fields, but there were some customer attributes that we ignored for this task, including Customers Price Sensitivity and Customers Lifestage. The main reason we did not use these fields was to avoid data-leakage. Data-leakage occurs when building prediction models: some of your fields have values that will not be available, or will be different, when the dataset is created in production. These fields could cause data-leakage because we do not know when they were set; it could have been when the customer signed up, or it could be a process that runs nightly. If they were set after our cut-off date, these fields could unfairly predict our Y variables.

For example, Customers Price Sensitivity has the values Less Affluent, Mid Market, and Up Market, which are probably derived from what the customer purchases. Therefore, using this field in a churn-prediction task would result in data-leakage if it was updated after the cut-off date used to create the dataset for our prediction models. A value of Up Market for Customers Price Sensitivity could be strongly linked to return spend, but it would then be a summary of the very behavior we are trying to predict. Data-leakage is one of the main reasons models perform worse in production than in development: the model was trained on data that is linked to the Y variable but will not be available in reality. You should always check for data-leakage in time-series tasks and ask yourself whether any field (especially lookup attributes) could have been modified after the date used to create the model data.
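As an informal check, you could cross-tabulate a suspect lookup attribute against the binary target; a suspiciously strong association is a prompt to find out when the attribute was last updated. The following is only a sketch: it assumes the attribute is exposed as a CUST_PRICE_SENSITIVITY column (the column name may differ in your extract) and that the raw transactional data frame df is still in memory, so it would need to run before the rm() call above:

# one row per customer with their price-sensitivity attribute
# (column name is an assumption; check your extract)
custAttr <- df %>%
  select(CUST_CODE, CUST_PRICE_SENSITIVITY) %>%
  distinct()
checkLeak <- merge(dfModelData[, c("CUST_CODE", "Y_categ")],
                   custAttr, by="CUST_CODE")
# a very uneven split across Y_categ for any one value would be a
# warning sign that the attribute summarises post-cut-off behavior
table(checkLeak$CUST_PRICE_SENSITIVITY, checkLeak$Y_categ)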