i am trying to fetch some text from HTML string from a table. the text are like this and i am trying to get the output
ID,TEXT
1,<font face="Microsoft Sans Serif" size="8" color="#000000">ABC 123</font>
2,<font face="Microsoft Sans Serif" size="8" color="#000000">1 ETG</font>
3,<font face="Microsoft Sans Serif" size="8" color="#000000">Gatelys PERC</font>
4,<font face="Microsoft Sans Serif" size="8" color="#000000">Gml TK 144</font>
ID,TEXT
1,ABC 123
2,1 ETG
3,Gatelys PERC
4,Gml TK 144
I have tried this but not getting right output.
select ID,regexp_substr(TEXT, '[^>]+',1,2) from TABLE
where TEXT is not null;
If your HTML is always valid XHTML then you should use XML functions to parse the string:
SELECT id,
XMLQUERY('*/text()' PASSING XMLTYPE(text) RETURNING CONTENT) AS text
FROM table_name;
Which, for the sample data:
CREATE TABLE table_name (id, text) AS
Select 1, '<font face="Microsoft Sans Serif" size="8" color="#000000">ABC 123</font>' FROM DUAL UNION ALL
Select 2, '<font face="Microsoft Sans Serif" size="8" color="#000000">1 ETG</font>' FROM DUAL UNION ALL
Select 3, '<font face="Microsoft Sans Serif" size="8" color="#000000">Gatelys PERC</font>' FROM DUAL UNION ALL
Select 4, '<font face="Microsoft Sans Serif" size="8" color="#000000">Gml TK 144</font>' FROM DUAL UNION ALL
Select 5, '<font attr=">">XYZ</font>' FROM DUAL;
Outputs:
ID | TEXT |
---|---|
1 | ABC 123 |
2 | 1 ETG |
3 | Gatelys PERC |
4 | Gml TK 144 |
5 | XYZ |
If you want to naively remove all tags using string functions then you can use:
SELECT id,
REGEXP_REPLACE(text, '<.*?>') AS text
FROM table_name;
Which, for your sample data, outputs:
ID | TEXT |
---|---|
1 | ABC 123 |
2 | 1 ETG |
3 | Gatelys PERC |
4 | Gml TK 144 |
5 | ">XYZ |
Note, it will fail if you have attributes inside a tag that contain a >
character.
If you want a more comprehensive query then:
SELECT id,
REGEXP_REPLACE(text, q'{<([^"']+?|".*?"|'.*?')+?>}') AS text
FROM table_name;
Which outputs the same as the XML query.