I have a webpage which has various tables on it. These tables are Javascript components, not just pure HTML tables. I need to process the text of this webpage (somewhat similar to screen scraping) with a Delphi program (Delphi 10.3).
I do a Ctrl-A/Ctrl-C to select all the webpage and copy everything to the clipboard. If I paste this into a TMemo
component in my program, I am only getting text outside the table. If I paste into MS Word, I can see all the content, including the text inside the table.
I can paste this properly into TAdvRichEditor
(3rd party), but it takes forever, and I often run out of memory. This leads me to believe that I need to directly read the clipboard with an HTML clipboard format.
I set up a clipboard HTML format. When I inspect the clipboard contents, I get what looks like all Kanji characters.
How do I read the contents of the clipboard when the contents are HTML?
In a perfect world, I would like ONLY the text, not the HTML itself, but I can strip that out later. Here is what I am doing now...
On initialization.. (where CF_HTML
is a global variable)
CF_HTML := RegisterClipboardFormat('HTML Format');
then my routine is...
function TMain.ClipboardAsHTML: String;
var
Data: THandle;
Ptr: PChar;
begin
Result := '';
with Clipboard do
begin
Open;
try
Data := GetAsHandle(CF_HTML);
if Data <> 0 then
begin
Ptr := PChar(GlobalLock(Data));
if Ptr <> nil then
try
Result := Ptr;
finally
GlobalUnlock(Data);
end;
end;
finally
Close;
end;
end;
end;
** ADDITIONAL INFO - When I copy from the webpage... I can then inspect the contents of the Clipboard buffer using a free tool called InsideClipBoard. It shows that the clipboard contains 1 entry, with 5 formats: CT_TEXT
, CF_OEMTEXT
, CF_UNICODETEXT
, CF_LOCALE
, and 'HTML Format'
(with Format ID of 49409). Only 'HTML Format'
contains what I am looking for.... and that is what I am trying to access with the code that I have shown.
The HTML format is documented here. It is placed on the clipboard as UTF-8 encoded text, and you can extract it like this.
{$APPTYPE CONSOLE}
uses
System.SysUtils,
Winapi.Windows,
Vcl.Clipbrd;
procedure Main;
var
CF_HTML: Word;
Data: THandle;
Ptr: Pointer;
Error: DWORD;
Size: NativeUInt;
utf8: UTF8String;
Html: string;
begin
CF_HTML := RegisterClipboardFormat('HTML Format');
Clipboard.Open;
try
Data := Clipboard.GetAsHandle(CF_HTML);
if Data=0 then begin
Writeln('HTML data not found on clipboard');
Exit;
end;
Ptr := GlobalLock(Data);
if not Assigned(Ptr) then begin
Error := GetLastError;
Writeln('GlobalLock failed: ' + SysErrorMessage(Error));
Exit;
end;
try
Size := GlobalSize(Data);
if Size=0 then begin
Error := GetLastError;
Writeln('GlobalSize failed: ' + SysErrorMessage(Error));
Exit;
end;
SetString(utf8, PAnsiChar(Ptr), Size - 1);
Html := string(utf8);
Writeln(Html);
finally
GlobalUnlock(Data);
end;
finally
Clipboard.Close;
end;
end;
begin
try
Main;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
Readln;
end.