What am I trying to do?
I have a list of URLs that I want to scrape using Excel's Web Query functionality. To automate the process completely, I'm developing an SSIS package that calls a Script Task for each URL. The Script Task creates a new Excel workbook with one worksheet, activates the worksheet, adds a QueryTable connection, and refreshes the QueryTable (using XlWebSelectionType.xlAllTables) to pull the data. It then saves the workbook and closes both the workbook and the Excel application.
What technologies am I utilizing?
SQL Server Integration Services (SSIS) with a Script Task written in VB.NET, using the Microsoft.Office.Interop.Excel library and Excel's Web Query (QueryTable) functionality.
What's the problem?
While the script task does save all the data from the tables on the web page, it puts them all into the single worksheet, and does not save the table names. So while my data is correctly grouped in the worksheet, I have no way of knowing which "group" of data corresponds to which table.
What do I want to do about it?
Ideally I would want each QueryTable table to be saved into its own Worksheet, with the table name set as the Worksheet name. Barring that, I need a way to save the table name with the corresponding data. Adding it as a new column in the QueryTable would be the best in this scenario.
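One direction I'm considering for the "one worksheet per table" ideal is adding a separate QueryTable per table, using xlSpecifiedTables together with the QueryTable.WebTables property, and pointing each at its own worksheet. This is only a sketch: the table count is a placeholder (a real version would need to discover how many tables the page has), and the worksheet names here are positional because Excel only exposes real names for explicitly named tables.

```vb
' Sketch: one worksheet per table. WebTables takes a comma-separated
' list of table indices or names when WebSelectionType is
' xlSpecifiedTables. The loop bound of 5 is an assumption; a real
' implementation would discover the table count per page first.
For tableIndex As Integer = 1 To 5
    Dim ws As Worksheet = CType(wb.Worksheets.Add(), Worksheet)
    Dim qt As QueryTable = ws.QueryTables.Add( _
        Connection:="URL;" + URL, Destination:=ws.Range("A1"))
    With qt
        .BackgroundQuery = False
        .WebFormatting = XlWebFormatting.xlWebFormattingNone
        .WebSelectionType = XlWebSelectionType.xlSpecifiedTables
        .WebTables = tableIndex.ToString()   ' pull only this one table
        .Refresh()
    End With
    ' Positional placeholder name; the page's real table names
    ' aren't available through this API.
    ws.Name = "Table" + tableIndex.ToString()
Next
```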
What do I have so far?
Here's the main part of the script:
Public Sub Main()
    Dim URL As String = Dts.Variables("User::URL").Value.ToString()
    Dim FileName As String = Dts.Variables("User::FileName").Value.ToString()
    Dim xlNone As XlWebFormatting = XlWebFormatting.xlWebFormattingNone
    Dim Format As XlFileFormat = XlFileFormat.xlCSVWindows   ' note: not passed to SaveAs below
    Dim ScrapeStatus As Integer = 1

    Dim excel As New Microsoft.Office.Interop.Excel.ApplicationClass
    With excel
        .SheetsInNewWorkbook = 1
        .DisplayAlerts = False
    End With

    Dim wb As Microsoft.Office.Interop.Excel.Workbook = excel.Workbooks.Add()
    With wb
        .Activate()
        .Worksheets.Select(1)
    End With

    Try
        ' Anchor the web query at the top-left of the active sheet.
        Dim rnStart As Range = wb.ActiveSheet.Range("A1:Z100")
        Dim qtQtrResults As QueryTable = wb.ActiveSheet.QueryTables.Add( _
            Connection:="URL;" + URL, Destination:=rnStart)
        With qtQtrResults
            .BackgroundQuery = False
            .WebFormatting = xlNone
            .WebSelectionType = XlWebSelectionType.xlAllTables
            .Refresh()
        End With
        excel.CalculateUntilAsyncQueriesDone()

        wb.SaveAs(FileName)
        wb.Close()
        excel.Quit()
        ' Release the workbook as well as the application, or an
        ' orphaned EXCEL.EXE process can linger after the task ends.
        System.Runtime.InteropServices.Marshal.ReleaseComObject(wb)
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel)
        GC.Collect()
        GC.WaitForPendingFinalizers()
        Dts.TaskResult = ScriptResults.Success
    Catch ex As Exception
        Dts.Variables("User::Error").Value = ex.Message.ToString()
        wb.Saved = True   ' mark as saved so Close() doesn't prompt
        wb.Close()
        excel.Quit()
        System.Runtime.InteropServices.Marshal.ReleaseComObject(wb)
        System.Runtime.InteropServices.Marshal.ReleaseComObject(excel)
        GC.Collect()
        GC.WaitForPendingFinalizers()
        Dts.TaskResult = ScriptResults.Failure
    End Try
End Sub
What results am I getting?
For the URL http://athletics.chabotcollege.edu/information/directory/home#directory, if I use the Web Query functionality from inside Excel, I get a selection dialog in which all the table names are displayed.
However, when I pull all tables via the Script Task, I end up with a single worksheet containing all the tables' data with no table names attached.
Other info
I should also note that while most of the web pages have a similar structure, not all are the same. So I can't assume every page will have the same table names, or structure the tables in the same way. My solution needs to be dynamic and flexible.
By changing .WebSelectionType = XlWebSelectionType.xlAllTables to .WebSelectionType = XlWebSelectionType.xlEntirePage, I'm able to capture the "names" of the tables. They are actually aria-title values inside the parent <section> tag of each table. It's ugly, but it does return the strings I'm looking for.
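To pull those strings programmatically rather than eyeballing the xlEntirePage output, something like the following might work. This is a sketch: the aria-title attribute name and the <section> wrapper come from the one page above and may differ on other pages, and a regex over raw HTML is brittle compared to a real HTML parser.

```vb
Imports System.Net
Imports System.Text.RegularExpressions

' Sketch: download the raw HTML and collect the aria-title values
' from the <section> tags that wrap each table.
Dim html As String
Using client As New WebClient()
    html = client.DownloadString(URL)
End Using

Dim titles As New List(Of String)
For Each m As Match In Regex.Matches(html, "<section[^>]*aria-title=""([^""]*)""")
    titles.Add(m.Groups(1).Value)
Next
```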
I ended up saving both the xlAllTables and xlEntirePage QueryTables as text files. Then I split the xlAllTables file into separate chunks, one per table, searched the xlEntirePage text file for the string that represents each table, and copied the preceding line, which holds the title. I then saved each table's text as a new file with the copied title as the filename. It's very hacky, but it did what I needed it to do.
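For reference, the core of that hack looks roughly like this. The file names are placeholders, and it assumes two things that held for my pages but may not hold everywhere: tables in the xlAllTables dump are separated by blank lines, and each chunk's first line appears verbatim in the xlEntirePage dump one line below its title.

```vb
Imports System.IO

' Sketch of the split-and-match hack. File names are placeholders.
Dim allTables As String() = File.ReadAllText("AllTables.txt") _
    .Split(New String() {vbCrLf & vbCrLf}, StringSplitOptions.RemoveEmptyEntries)
Dim entirePage As String() = File.ReadAllLines("EntirePage.txt")

For Each chunk As String In allTables
    Dim firstLine As String = chunk.Split(New String() {vbCrLf}, _
        StringSplitOptions.None)(0)
    ' Find the chunk's first line in the full-page dump; the line
    ' immediately before it holds the table's title.
    Dim idx As Integer = Array.IndexOf(entirePage, firstLine)
    If idx > 0 Then
        Dim title As String = entirePage(idx - 1).Trim()
        File.WriteAllText(title + ".txt", chunk)
    End If
Next
```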