excelvb.netssisscript-taskexcel-web-query

SSIS: Programmatically create new Excel file based on web query results - how do I save the web query table names?


What am I trying to do?

I have a list of URLs that I want to scrape using Excel's Web Query functionality. I'm trying to completely automate the process, so I'm developing an SSIS package that calls a Script Task for each URL. The Script Task creates a new Excel workbook with a worksheet, activates the worksheet, adds a QueryTable connection, refreshes the QueryTable to get the data, using XlWebSelectionType.xlAllTables. It then saves the workbook and closes the workbook and the Excel application.

What technologies am I utilizing?

What's the problem?

While the script task does save all the data from the tables on the web page, it puts them all into the single worksheet, and does not save the table names. So while my data is correctly grouped in the worksheet, I have no way of knowing which "group" of data corresponds to which table.

What do I want to do about it?

Ideally I would want each QueryTable table to be saved into its own Worksheet, with the table name set as the Worksheet name. Barring that, I need a way to save the table name with the corresponding data. Adding it as a new column in the QueryTable would be the best in this scenario.

What do I have so far?

Here's the main part of the script:

Public Sub Main()
    Dim URL As String = Dts.Variables("User::URL").Value.ToString()
    Dim FileName As String = Dts.Variables("User::FileName").Value.ToString()
    Dim xlNone As XlWebFormatting = XlWebFormatting.xlWebFormattingNone
    Dim Format As XlFileFormat = XlFileFormat.xlCSVWindows
    Dim ScrapeStatus As Integer = 1

    Dim excel As New Microsoft.Office.Interop.Excel.ApplicationClass

    With excel
        .SheetsInNewWorkbook = 1
        .DisplayAlerts = False
    End With

    Dim wb As Microsoft.Office.Interop.Excel.Workbook = excel.Workbooks.Add()

    With wb
        .Activate()
        .Worksheets.Select(1)
    End With

    Try

        Dim rnStart As Range = wb.ActiveSheet.Range("A1:Z100")
        Dim qtQtrResults As QueryTable = wb.ActiveSheet.QueryTables.Add(Connection:="URL;" + URL, Destination:=rnStart)

        With qtQtrResults
            .BackgroundQuery = False
            .WebFormatting = xlNone
            .WebSelectionType = XlWebSelectionType.xlAllTables
            .Refresh()
        End With

        excel.CalculateUntilAsyncQueriesDone()
        wb.SaveAs(FileName)

        wb.Close()
        excel.Quit()
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel)
        GC.Collect()
        GC.WaitForPendingFinalizers()
        Dts.TaskResult = ScriptResults.Success

    Catch ex As Exception

        Dts.Variables("User::Error").Value = ex.Message.ToString()
        wb.Saved = True
        wb.Close()
        excel.Quit()
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel)
        GC.Collect()
        GC.WaitForPendingFinalizers()
        Dts.TaskResult = ScriptResults.Failure

    End Try

End Sub

What results am I getting?

For the URL http://athletics.chabotcollege.edu/information/directory/home#directory, if I use the Web Query functionality while inside Excel, I get the following to select from: enter image description here All the table names are displayed

However, when I pull all tables via the Script Task, I end up with a worksheet that looks similar to this: enter image description here

Other info

I should also note that while most of the web pages have a similar structure, not all are the same. So I can't assume every page will have the same table names, or structure the tables in the same way. My solution needs to be dynamic and flexible.


Solution

  • By changing .WebSelectionType = XlWebSelectionType.xlAllTables to .WebSelectionType = XlWebSelectionType.xlEntirePage I'm able to capture the "names" of the tables. They are actually aria-title values inside the parent <section> tag of each table. It's ugly, but it does return the strings I'm looking for.

    I ended up saving both the xlAllTables and xlEntirePage QueryTables as text files. Then I split the xlAllTables file into separate chunks for each table, and then search the xlEntirePage text file for the "string" that represents the table, and I copy the preceding line, which has the title. I then save the table text as a new file with the copied title as the filename. It's very hacky, but it did what I needed it to do.