mysqlmysql-workbenchmarket-basket-analysis

Creating a MySQL market basket analysis


For the final project of my data class, I'm attempting to create a market basket analysis in MySQL (using Workbench 6.3 CE and Server 5.7) from a database of 900,000-plus transactions.

Here is the sketch of a table I've put together to get me what I need (some of the stuff is for later queries)...

Table:

sales

Columns:

pos_trans_id INT(11) 
basketId VARCHAR(45) 
productNumber VARCHAR(25) 
productDescription VARCHAR(255) 
categoryCode VARCHAR(25) 
categoryDescription VARCHAR(255) 
subcategoryCode VARCHAR(25) 
subcategoryDescription VARCHAR(255) 
quantity INT(11) 
purchaseAmt DECIMAL(12,2) 
dateOfSale INT(11)

This is the query I whipped together to try to make my own version of a market basket analysis:

SELECT purchaseone.productNumber, purchaseone.productDescription, purchasetwo.productNumber, purchasetwo.productDescription, purchaseone.basketId
FROM 
    (SELECT DISTINCT productNumber, productDescription, basketId
    FROM sales) AS purchaseone
        JOIN
        (SELECT DISTINCT productNumber, productDescription, basketId
        FROM sales) AS purchasetwo
        ON 
        (
            purchaseone.basketId = purchasetwo.basketId AND
            purchaseone.productNumber != purchasetwo.productNumber AND
            purchaseone.productNumber < purchasetwo.productNumber
            );

Problem I'm running into is the query will run for hours and hours and hours and give me nothing in the end, or it'll take so long to visualize in Workbench that it's useless.

Does anybody have any idea if I'm doing something wrong here or if there's something I should to to speed things up? Thanks in advance.

*Edit: Here's the extra context you asked for. *

Full raw data file

Two custom tables I've created so far:

CREATE TABLE `sales` (\n `pos_trans_id` int(11) NOT NULL DEFAULT \'0\', \n `basketId` varchar(45) DEFAULT NULL, \n `productNumber` varchar(25) DEFAULT NULL, \n `productDescription` varchar(255) DEFAULT NULL, \n `categoryCode` varchar(25) DEFAULT NULL, \n `categoryDescription` varchar(255) DEFAULT \'0\', \n `subcategoryCode` varchar(25) DEFAULT NULL, \n `subcategoryDescription` varchar(255) DEFAULT \'0\', \n `quantity` int(11) DEFAULT NULL, \n `purchaseAmt` decimal(12,2) DEFAULT NULL, \n `dateOfSale` int(11) DEFAULT NULL \n) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE 'date' (\n `dateKey` int(11) NOT NULL, \n `dayOfTheMonth` int(2) DEFAULT NULL, \n `dayOfTheWeek` int(1) DEFAULT NULL, \n PRIMARY KEY (`dateKey`) \n) \n ENGINE=InnoDB DEFAULT CHARSET=latin1


Solution

  • From your query, it appears you are trying to produce a "also purchased" or "purchased with" type of analysis. The query below should run significantly faster. It eliminates the sub-queries in your query and simplifies the task quite a bit.

    SELECT 
        a.productNumber, 
        a.productDescription, 
        b.productNumber, 
        b.productDescription, 
        a.basketId
    FROM `purchaseone` a
    LEFT JOIN `purchaseone` b
    ON a.basketId = b.basketId AND
        a.productNumber != b.productNumber AND
        a.productNumber < b.productNumber
    GROUP BY a.basketId, a.productNumber, b.productNumber