Tags: python, sql-server, pandas, sql-server-2019, sql-server-ml-services

Reading the results of my Python GET request in SQL Server Machine Learning Services


I am putting together a simple Python script using the Machine Learning Services feature in SQL Server. The goal is for my SQL Server to call an external API from a Python script and read the results into variables.

I have successfully built the GET request; I am just struggling to feed the output of the API back into the database.

What I have so far:

EXECUTE sp_execute_external_script @language = N'Python', 
@script = N'

import requests 
import pandas as pd
URL = "https://api.website.io/verify?email=dave@davidson.com&apikey=test_44fbcce85ba0f270273f4452bea2311de50f9c"
r = requests.get(url = URL)
data = r.text
print(data)
'
, @input_data_1 = N''
WITH RESULT SETS(([Col1] varchar(MAX) NOT NULL));

So print(data) gives me the following output in SSMS:

{"success":false,"message":"Invalid API key"}

But I don't want to print it, I want to read it into SQL. I don't really care how it comes out, so for now I've just specified a single column named "Col1" to hold the output, but I eventually need to load "False" and "Invalid API key" into two variables in my SQL Server Stored Procedure, so that I can use those values in the next step. If I have to parse those out of the JSON manually then I'm fine with that, but ideally they would come out as separate columns in the SQL output.
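For reference, in plain Python the two values I'm after are just the parsed fields of that response, e.g. with the standard json module:

```python
import json

# Sample response body from the API (copied from the output above)
data = '{"success":false,"message":"Invalid API key"}'

parsed = json.loads(data)      # parse the JSON text into a Python dict
success = parsed["success"]    # -> False
message = parsed["message"]    # -> "Invalid API key"
```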

I've tried using r.json() in various permutations, but I get a lot of different errors:

For SQL Server Machine Learning Services, I understand that I need to create a variable named "OutputDataSet", and that it needs to be a pandas DataFrame, so I need to convert the JSON data held in "r" into a pandas DataFrame. However, everything I try on that front gives me cryptic error messages.

so for example:

OutputDataSet = pd.DataFrame.from_dict(r.json(), orient="index")

gives: "Error in execution."

or

OutputDataSet = pd.DataFrame.from_dict(r.text, orient="index")

gives "AttributeError: 'str' object has no attribute 'values'"

Is it the syntax to convert JSON to a DataFrame that I'm not getting? Or is there an extra step or library needed to convert the output from the Requests lib into something that the Pandas lib will accept?


Solution

  • After much trial-and-error I have a solution that works:

I have an InputData table in my database; my stored proc reads all rows from InputData, runs them through the API using Python, then returns the API results into a temp table, which I can use like any other table in SQL.

    In my "InputData" table I have the following columns:

    Reference - My company's proprietary reference number so I can tell which company record this API data relates to.

APIURL - The pre-configured URL for my API request. This field contains all of the parameters for the API request for this Reference, plus the API key for this API; essentially the entire HTTP API request is in here. I build this URL in an earlier SQL step, since it's awkward for me to do it in Python and there's no real need to in this instance.

JSONDATA - An empty nvarchar(MAX) column; every record is NULL here. Sort of a cheat's way to do it: by passing this NULL value as a column in our input data, we don't have to specify or name it in our output data. Since an empty column in a SQL table costs little to nothing, this is the way I've chosen to simplify getting the output data out of Python and into SQL.

    -- Define query
    DECLARE @sqlscript NVARCHAR(MAX);
        SET @sqlscript = N'SELECT Reference, APIURL, JSONData FROM dbo.InputData';
    
    -- Define Python script
    DECLARE @pscript NVARCHAR(MAX);
        SET @pscript = N'
    import requests                                  # Import the "requests" library that provides the requests.get command we will use to query the API (supports SSL, https etc)
    InputDF = InputDataSet                           # The input data is exposed to Python as "InputDataSet" (the default name in SQL Python)
    OutputDF = InputDF.copy()                        # Copy the input data to a new dataframe named "OutputDF" (a plain assignment would only alias InputDF, not copy it)
    for row in InputDF.itertuples():                 # Iterate over every row in InputDF using the "itertuples" function
        URL = row.APIURL                             # Read the pre-built API URL for this case/email address from the input row
        r = requests.get(url = URL)                  # Use the "get" command from the requests library to send our request to the API using that URL
        OutputDF.at[row.Index, "JSONData"] = r.text  # Store the JSON response text in the "JSONData" column for this row (.at takes the column label, not its position)
    OutputDataSet = OutputDF                         # Return OutputDF as "OutputDataSet" (the default output name in SQL Python), which returns the data as a query output that we INSERT into #Results
    ';
     
    -- Declare temp table to hold results
      DROP TABLE IF EXISTS #Results
    CREATE TABLE #Results (Reference varchar(10) NOT NULL PRIMARY KEY CLUSTERED,
                           APIURL varchar(1000) NOT NULL,
                           JSONOutput nvarchar(MAX) NULL)
     
    -- Run procedure using Python script and T-SQL query and Insert results to #Results
     INSERT #Results
       EXEC sp_execute_external_script
            @language = N'Python',
            @script = @pscript,
            @input_data_1 = @sqlscript;  
    

After running the above, I have a temp table named #Results containing the reference number and the API's JSON response for each individual request that was sent.

I can then run JSON queries on that temp table to extract the specific fields I need. Since I'm no longer querying the API at that point, I can run as many SELECT queries as I want, while only ever hitting the external API once per record.
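As a footnote on the original single-request question: the conversion I was missing is that a single JSON object needs to be wrapped in a list before handing it to pandas, so the keys become columns of a one-row frame. A minimal sketch, with the response mocked rather than fetched from a live API:

```python
import pandas as pd

# Mocked response body -- in the real script this would be r.json()
payload = {"success": False, "message": "Invalid API key"}

# Wrapping the dict in a list yields one row with the keys as column names,
# which is the tabular shape sp_execute_external_script expects in OutputDataSet
OutputDataSet = pd.DataFrame([payload])

print(OutputDataSet.columns.tolist())  # ['success', 'message']
```

With a matching WITH RESULT SETS clause (or an INSERT into a two-column table), "success" and "message" then come out as separate columns on the SQL side.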