Deep Learning with R for Beginners

Data download and exploration

When you go to the preceding link, there are a few different data options; the one we will use is called Let's Get Sort-of-Real. This dataset contains over two years of data from a fictional retail loyalty scheme. The data consists of purchases that are linked by basket ID and customer code, that is, we can track transactions by customers over time. There are a number of options here, including the full dataset, which is 4.3 GB zipped and over 40 GB unzipped. For our first models, we will use the smallest dataset and download the data titled All transactions for a randomly selected sample of 5,000 customers; this is 1/100th the size of the full database.

I wish to thank dunnhumby for releasing this dataset and for granting us permission to use it. One of the problems in deep learning, and machine learning in general, is the lack of large-scale, real-life datasets that people can practice their skills on. When a company makes the effort to release such a dataset, we should appreciate that effort and not use the dataset outside the terms and conditions specified. Please take the time to read the terms and conditions, and use the dataset for personal learning purposes only. Remember that any misuse of this dataset (or datasets released by other companies) will make companies more reluctant to release other datasets in the future.

Once you have read the terms and conditions and downloaded the dataset to your computer, unzip it into a directory called dunnhumby/in under the code folder. Ensure the files are unzipped directly under this folder and not into a sub-directory; if your unzip tool creates one, move the files up afterward. The data files are in comma-delimited (CSV) format, with a separate file for each week of data, and they can be opened and viewed using a text editor. We will use some of the fields in Table 4.1 for our analysis:

Table 4.1: Partial data dictionary for transactional dataset
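Before going further, it is worth confirming that the files load correctly. Here is a minimal sketch, assuming the files were unzipped to dunnhumby/in and that the weekly files match a transactions_*.csv naming pattern (adjust the path and pattern to what you actually see on disk):

library(readr)
library(dplyr)

data_dir <- "dunnhumby/in"
csv_files <- list.files(data_dir, pattern = "^transactions_.*\\.csv$",
                        full.names = TRUE)
length(csv_files)   # should be 117 weekly files

# Read one week to inspect its structure before loading everything
week1 <- read_csv(csv_files[1])
glimpse(week1)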

The data stores details of customer transactions. Every unique item that a person purchases in a shopping trip is represented by one line, and all items in a transaction share the same BASKET_ID field. A transaction can also be linked to a customer using the CUST_CODE field. A PDF is included in the ZIP file if you want more information on the field types.
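To make this concrete, here is a short sketch of inspecting that structure, assuming the csv_files vector from the loading sketch above; the dplyr calls are one possible approach, not the book's exact code:

library(readr)
library(dplyr)

# Combine all weekly files into one data frame of transaction lines
tx <- bind_rows(lapply(csv_files, read_csv))

# All items bought in one shopping trip share a BASKET_ID
tx %>%
  filter(BASKET_ID == first(BASKET_ID)) %>%
  select(BASKET_ID, CUST_CODE, PROD_CODE)

# Baskets are linked back to customers through CUST_CODE
tx %>%
  group_by(CUST_CODE) %>%
  summarise(n_baskets = n_distinct(BASKET_ID))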

We are going to use this dataset for a churn prediction task, that is, predicting which customers will return in the next x days (those who do not are said to have churned). Churn prediction is used to find customers who are in danger of leaving; companies running shopping loyalty schemes, mobile phone subscriptions, TV subscriptions, and so on use it to help retain their customer base. For most companies that rely on recurring revenue, it is more cost-effective to spend resources on keeping existing customers than on acquiring new ones, because the cost of acquiring new customers is high. Also, the longer it has been since a customer left, the harder it is to win them back, so there is only a small window of time in which to send special offers that may entice them to stay.

As well as the binary classification model, we will build a regression model, which will predict the amount that a person will spend in the next 14 days. Fortunately, we can build a single dataset that is suitable for both prediction tasks.
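To see how one dataset can serve both tasks, here is a hypothetical sketch of deriving the two targets, assuming the SHOP_DATE (yyyymmdd) and SPEND fields from the data dictionary; the cutoff date is illustrative and this is not the book's actual feature engineering:

library(dplyr)

cutoff <- as.Date("2008-06-01")   # hypothetical split date
tx2 <- tx %>%
  mutate(date = as.Date(as.character(SHOP_DATE), format = "%Y%m%d"))

# Spend per customer in the 14 days after the cutoff
window_spend <- tx2 %>%
  filter(date > cutoff, date <= cutoff + 14) %>%
  group_by(CUST_CODE) %>%
  summarise(spend_14d = sum(SPEND))

# Customers absent from the window did not return (a real setup would
# also restrict to customers who were active before the cutoff)
targets <- tx2 %>%
  distinct(CUST_CODE) %>%
  left_join(window_spend, by = "CUST_CODE") %>%
  mutate(spend_14d = coalesce(spend_14d, 0),      # regression target
         returned  = as.integer(spend_14d > 0))   # classification target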

The data was supplied as 117 CSV files (ignore time.csv, which is a lookup file). The first step is to perform some basic data exploration to verify that the data downloaded successfully, and then to run some basic data quality checks. This is an important first step in any analysis: especially when you are using an external dataset, you should run some validation checks on the data before creating any machine learning models. The Chapter4/0_Explore.Rmd script creates a summary file and does some exploratory analysis of the data. This is an RMD file, so it needs to be run from RStudio. For brevity, and because this book is about deep learning and not data processing, I will include just some of the output and plots from this script rather than reproducing all the code. You should also run the code in this file to ensure the data was imported correctly, although it may take a few minutes the first time it runs. Here are some summaries of the data from that script:

Number of weeks we have data: 117.
Number of transaction lines: 2,541,019.
Number of transactions (baskets): 390,320.
Number of unique customers: 5,000.
Number of unique products: 4,997.
Number of unique stores: 761.
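For reference, here is a sketch of how summaries like these might be computed, assuming the combined tx data frame from earlier and the STORE_CODE field listed in the data dictionary (0_Explore.Rmd may do this differently):

library(dplyr)

# The number of weeks is simply the number of weekly files: length(csv_files)
tx %>%
  summarise(transaction_lines = n(),
            baskets   = n_distinct(BASKET_ID),
            customers = n_distinct(CUST_CODE),
            products  = n_distinct(PROD_CODE),
            stores    = n_distinct(STORE_CODE))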

If we compare this to the website and the PDF, it all looks in order. We have over 2.5 million records, with data for 5,000 customers across 761 stores. The data-exploration script also creates some plots to give us a feel for the data. Figure 4.3 shows sales over the 117 weeks; sales vary from day to day (the line is not flat), and there are no gaps that would indicate missing data. There are also seasonal patterns, with large peaks toward the end of each calendar year, namely the holiday season:

Figure 4.3: Sales plotted over time.

The plot in Figure 4.3 shows that the data has been imported successfully. The data looks consistent and is what we expect from a retail transaction file: we do not see any gaps, and there is seasonality.
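A Figure 4.3-style plot can be reproduced in a few lines; this sketch again assumes the SHOP_DATE and SPEND fields, while the actual plot is produced by Chapter4/0_Explore.Rmd:

library(dplyr)
library(ggplot2)

# Total sales per day across all stores
tx %>%
  mutate(date = as.Date(as.character(SHOP_DATE), format = "%Y%m%d")) %>%
  group_by(date) %>%
  summarise(sales = sum(SPEND)) %>%
  ggplot(aes(x = date, y = sales)) +
    geom_line() +
    labs(x = "Date", y = "Sales")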

For each item a person purchases, there is a product code (PROD_CODE) and four department codes (PROD_CODE_10, PROD_CODE_20, PROD_CODE_30, and PROD_CODE_40). We will use these department codes in our analysis; the code in Chapter4/0_Explore.Rmd creates a summary for them. We want to see how many unique values there are for each department code, whether the codes represent a hierarchy (each code has at most one parent), and whether there are repeated codes:

PROD_CODE: Number of unique codes: 4997. Number of repeated codes: 0.
PROD_CODE_10: Number of unique codes: 250. Number of repeated codes: 0.
PROD_CODE_20: Number of unique codes: 90. Number of repeated codes: 0.
PROD_CODE_30: Number of unique codes: 31. Number of repeated codes: 0.
PROD_CODE_40: Number of unique codes: 9.
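Here is a sketch of how such a check might be written, assuming the combined tx data frame; a "repeated" code here is one that appears under more than one parent in the next level up:

library(dplyr)

# Unique codes at each level of the product hierarchy
sapply(tx[, c("PROD_CODE", "PROD_CODE_10", "PROD_CODE_20",
              "PROD_CODE_30", "PROD_CODE_40")], n_distinct)

# Repeated codes: PROD_CODE_10 values with more than one PROD_CODE_20 parent
tx %>%
  distinct(PROD_CODE_10, PROD_CODE_20) %>%
  count(PROD_CODE_10) %>%
  filter(n > 1) %>%
  nrow()   # 0 means every PROD_CODE_10 has exactly one parent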

We have 4,997 unique product codes and four levels of department codes. The department codes go from PROD_CODE_10, which has 250 unique codes, to PROD_CODE_40, which has 9 unique codes. This is a product department hierarchy, where PROD_CODE_40 is the primary category and PROD_CODE_10 is the lowest department level. Each code in PROD_CODE_10, PROD_CODE_20, and PROD_CODE_30 has only one parent; that is, there are no repeated codes, so a department code belongs to only one super-category. We are not given a lookup file to say what these codes represent, but an example of a product code hierarchy might look something like this:

PROD_CODE_40 : Chilled goods
PROD_CODE_30 : Dairy
PROD_CODE_20 : Fresh Milk
PROD_CODE_10 : Full-fat Milk
PROD_CODE : Brand x Full-fat Milk

To get a sense of these department codes, we can also plot the number of unique product codes purchased over time, as shown in Figure 4.4. This plot is also created in Chapter4/0_Explore.Rmd:

Figure 4.4: Unique product codes purchased by date

Note that for this graph, the y axis shows the number of unique product codes purchased, not sales. This data also looks consistent; there are some peaks and dips, but they are not as pronounced as in Figure 4.3, which is as expected.
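As before, here is a minimal sketch of a Figure 4.4-style plot, again assuming the SHOP_DATE field:

library(dplyr)
library(ggplot2)

# Number of distinct product codes purchased each day
tx %>%
  mutate(date = as.Date(as.character(SHOP_DATE), format = "%Y%m%d")) %>%
  group_by(date) %>%
  summarise(unique_products = n_distinct(PROD_CODE)) %>%
  ggplot(aes(x = date, y = unique_products)) +
    geom_line() +
    labs(x = "Date", y = "Unique product codes")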