I have a project where I want to find association rules between goods in the shopping cart. To do this, I use ML Services (Python) in SQL Server, and I use the mlxtend
library to find the association rules. The problem I have is that the fpgrowth
function apparently uses a lot of memory, to the point where it stops working and raises errors. As far as possible, I do the data preprocessing in SQL Server to be more efficient.
Part of the code:
-- =============================================
-- Author: Me
-- Create date: 2021-01-01
-- Description: Association Rule
-- =============================================
--pip install mlxtend==0.16.0 --no-cache-dir
CREATE PROCEDURE [Shopping].[CalculateAssociationRule]
@CompanyID UNIQUEIDENTIFIER,
@confidence DECIMAL(7,6) =0.5 --DEFAULT
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Declare Variable
DECLARE @input_query NVARCHAR(MAX) = N'SELECT STRING_AGG(CAST(Int_StuffID AS VARCHAR(MAX)), '','') AS [list] FROM ##DocDetails GROUP BY Int_DocShoppingID'
DECLARE @Pattern VARCHAR(150)=N'[0-9]+\.?[0-9,.]*'
-- Declare Python Code
DECLARE @pyScript NVARCHAR(max)='
# import lib
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori,fpgrowth
from mlxtend.frequent_patterns import association_rules
#Change Data Shape to TranEncoder
#dataset=[data.SID.tolist() for id, data in data.groupby("TID")]
l=data[''list''].tolist()
del data
dataset=[i.split('','') for i in l]
del l
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
del dataset
#Run Association Rule (FPG) Algorithm
Result =pd.DataFrame(columns=[''antecedents'',''consequents'',''antecedent support'',''consequent support'',''support'',''confidence'',''lift'' ,''leverage'',''conviction''])
frequent_itemsets = fpgrowth(df, min_support=0.02, use_colnames=True)
if not frequent_itemsets.empty:
    Result = association_rules(frequent_itemsets, metric="confidence", min_threshold='+CAST(@confidence AS VARCHAR(max))+')
del frequent_itemsets
Result[''antecedents''] = Result[''antecedents''].astype(str)
Result[''consequents''] = Result[''consequents''].astype(str)
#Result Output
OutputDataSet=Result
'
EXECUTE sys.sp_execute_external_script
@language = N'python37'
,@script = @pyScript
,@input_data_1 = @input_query
,@input_data_1_name = N'data'
--WITH result sets (([antecedents] VARCHAR(max),[consequents] VARCHAR(max),[antecedent support] VARCHAR(max),[consequent support] VARCHAR(max),[support] VARCHAR(max),[confidence] VARCHAR(max),[lift] VARCHAR(max),[leverage] VARCHAR(max),[conviction] VARCHAR(max)));
END
Python code:
# import lib
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori,fpgrowth
from mlxtend.frequent_patterns import association_rules
#Change Data Shape to TranEncoder
#dataset=[data.SID.tolist() for id, data in data.groupby("TID")]
l=data['list'].tolist()
del data
dataset=[i.split(',') for i in l]
del l
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
del dataset
#Run Association Rule (FPG) Algorithm
Result =pd.DataFrame(columns=['antecedents','consequents','antecedent support','consequent support','support','confidence','lift' ,'leverage','conviction'])
frequent_itemsets = fpgrowth(df, min_support=0.02, use_colnames=True)
if not frequent_itemsets.empty:
    Result = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
del frequent_itemsets
Result['antecedents'] = Result['antecedents'].astype(str)
Result['consequents'] = Result['consequents'].astype(str)
#Result Output
OutputDataSet=Result
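Independent of mlxtend, the encoded matrix can be shrunk before it is ever built: an item whose overall frequency is below min_support can never appear in any frequent itemset (downward closure), so such items can be filtered out during preprocessing, either in pandas or directly in the SQL query. A sketch with made-up (TID, SID) toy data matching the column aliases used in @input_query:

```python
import pandas as pd

# Toy stand-in for the (TID, SID) rows that @input_query could return.
data = pd.DataFrame({
    "TID": [1, 1, 2, 2, 3, 3, 4],
    "SID": ["a", "b", "a", "c", "a", "b", "d"],
})

min_support = 0.5
n_transactions = data["TID"].nunique()

# In how many distinct transactions does each item appear?
item_tx_counts = data.groupby("SID")["TID"].nunique()

# Items below min_support can never be part of a frequent itemset
# (downward closure), so drop them before one-hot encoding.
keep = item_tx_counts[item_tx_counts / n_transactions >= min_support].index
pruned = data[data["SID"].isin(keep)]
```

The same pruning can be pushed into SQL Server with a GROUP BY/HAVING on item counts, which keeps the rows from ever reaching the Python process.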
Error: MemoryError
Msg 39004, Level 16, State 20, Line 3
A 'python37' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
STDOUT message(s) from external script:
2021-12-08 10:49:22.38 Error: Python error: <class 'MemoryError'>: File "<string>", line 16, in <module>
File "C:\Program Files\Python37\lib\site-packages\mlxtend\frequent_patterns\fpgrowth.py", line 72, in fpgrowth
File "C:\Program Files\Python37\lib\site-packages\mlxtend\frequent_patterns\fpcommon.py", line 38, in generate_itemsets
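The traceback points inside fpgrowth/generate_itemsets, but the dense boolean DataFrame handed to fpgrowth is typically the largest allocation in this pipeline. TransactionEncoder.transform accepts sparse=True, and mlxtend documents support for pandas sparse DataFrames; a minimal sketch of the memory difference using only pandas/scipy (the matrix sizes and density are made-up toy values):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# A mostly-empty basket matrix, like the one TransactionEncoder produces:
# 1000 toy transactions x 500 toy items, roughly 1% of cells True.
rng = np.random.default_rng(0)
dense = rng.random((1000, 500)) < 0.01

# Dense boolean frame: one byte per cell, regardless of sparsity.
df_dense = pd.DataFrame(dense)

# Sparse frame stores only the True cells; this is what
# transform(dataset, sparse=True) wrapped via from_spmatrix gives you.
df_sparse = pd.DataFrame.sparse.from_spmatrix(csr_matrix(dense))

dense_bytes = df_dense.memory_usage(deep=True).sum()
sparse_bytes = df_sparse.memory_usage(deep=True).sum()
```

In the procedure above this would mean `te_ary = te.fit(dataset).transform(dataset, sparse=True)` followed by `pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)`; whether fpgrowth accepts the sparse frame should be verified against the installed mlxtend version (0.17.0 here).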
Tools used in the project:
1. SQL Server 2019 CU14
2. Python 3.7.9 (external language in SQL Server)
3. Python libraries and versions:
Package Version
--------------- -------
cycler 0.11.0
fonttools 4.28.3
joblib 1.1.0
kiwisolver 1.3.2
matplotlib 3.5.0
mlxtend 0.17.0
numpy 1.21.4
packaging 21.3
pandas 1.3.4
Pillow 8.4.0
pip 21.3.1
pyodbc 4.0.32
pyparsing 3.0.6
python-dateutil 2.8.2
pytz 2021.3
scikit-learn 1.0.1
scipy 1.7.3
setuptools 47.1.0
setuptools-scm 6.3.2
six 1.16.0
threadpoolctl 3.0.0
tomli 1.2.2
------------------------
4. 16 GB RAM & 12 CPU cores
Is there a solution to do this more efficiently and avoid errors?
To prevent the low-memory error, the Resource Governor can be configured to grant external scripts a larger share of memory:
ALTER EXTERNAL RESOURCE POOL [default] WITH (max_memory_percent=94, AFFINITY CPU = AUTO)
GO
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO