For the final project of my data class, I'm attempting to create a market basket analysis in MySQL (using Workbench 6.3 CE and Server 5.7) from a database of 900,000-plus transactions.
Here is the sketch of a table I've put together to get me what I need (some of the stuff is for later queries)...
Table:
sales
Columns:
pos_trans_id INT(11)
basketId VARCHAR(45)
productNumber VARCHAR(25)
productDescription VARCHAR(255)
categoryCode VARCHAR(25)
categoryDescription VARCHAR(255)
subcategoryCode VARCHAR(25)
subcategoryDescription VARCHAR(255)
quantity INT(11)
purchaseAmt DECIMAL(12,2)
dateOfSale INT(11)
This is the query I whipped together to try to make my own version of a market basket analysis:
SELECT purchaseone.productNumber, purchaseone.productDescription, purchasetwo.productNumber, purchasetwo.productDescription, purchaseone.basketId
FROM
(SELECT DISTINCT productNumber, productDescription, basketId
FROM sales) AS purchaseone
JOIN
(SELECT DISTINCT productNumber, productDescription, basketId
FROM sales) AS purchasetwo
ON
(
purchaseone.basketId = purchasetwo.basketId AND
purchaseone.productNumber != purchasetwo.productNumber AND
purchaseone.productNumber < purchasetwo.productNumber
);
Problem I'm running into is the query will run for hours and hours and hours and give me nothing in the end, or it'll take so long to visualize in Workbench that it's useless.
Does anybody have any idea if I'm doing something wrong here or if there's something I should to to speed things up? Thanks in advance.
*Edit: Here's the extra context you asked for. *
Full raw data file
Two custom tables I've created so far:
CREATE TABLE `sales`
(\n `pos_trans_id` int(11) NOT NULL DEFAULT \'0\',
\n `basketId` varchar(45) DEFAULT NULL,
\n `productNumber` varchar(25) DEFAULT NULL,
\n `productDescription` varchar(255) DEFAULT NULL,
\n `categoryCode` varchar(25) DEFAULT NULL,
\n `categoryDescription` varchar(255) DEFAULT \'0\',
\n `subcategoryCode` varchar(25) DEFAULT NULL,
\n `subcategoryDescription` varchar(255) DEFAULT \'0\',
\n `quantity` int(11) DEFAULT NULL,
\n `purchaseAmt` decimal(12,2) DEFAULT NULL,
\n `dateOfSale` int(11) DEFAULT NULL
\n)
ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE 'date'
(\n `dateKey` int(11) NOT NULL,
\n `dayOfTheMonth` int(2) DEFAULT NULL,
\n `dayOfTheWeek` int(1) DEFAULT NULL,
\n PRIMARY KEY (`dateKey`)
\n)
\n ENGINE=InnoDB DEFAULT CHARSET=latin1
From your query, it appears you are trying to produce a "also purchased" or "purchased with" type of analysis. The query below should run significantly faster. It eliminates the sub-queries in your query and simplifies the task quite a bit.
SELECT
a.productNumber,
a.productDescription,
b.productNumber,
b.productDescription,
a.basketId
FROM `purchaseone` a
LEFT JOIN `purchaseone` b
ON a.basketId = b.basketId AND
a.productNumber != b.productNumber AND
a.productNumber < b.productNumber
GROUP BY a.basketId, a.productNumber, b.productNumber