sqloraclesubstrregexp-substrinstr

Substr instr help needed in sql oracle


i am trying to fetch some text from HTML string from a table. the text are like this and i am trying to get the output

ID,TEXT
1,<font face="Microsoft Sans Serif" size="8" color="#000000">ABC 123</font>
2,<font face="Microsoft Sans Serif" size="8" color="#000000">1 ETG</font>
3,<font face="Microsoft Sans Serif" size="8" color="#000000">Gatelys PERC</font>
4,<font face="Microsoft Sans Serif" size="8" color="#000000">Gml TK 144</font>

ID,TEXT
1,ABC 123
2,1 ETG
3,Gatelys PERC
4,Gml TK 144

I have tried this but not getting right output.

select ID,regexp_substr(TEXT, '[^>]+',1,2) from TABLE
where TEXT is not null;

Solution

  • If your HTML is always valid XHTML then you should use XML functions to parse the string:

    SELECT id,
           XMLQUERY('*/text()' PASSING XMLTYPE(text) RETURNING CONTENT) AS text
    FROM   table_name;
    

    Which, for the sample data:

    CREATE TABLE table_name (id, text) AS
      Select 1, '<font face="Microsoft Sans Serif" size="8" color="#000000">ABC 123</font>' FROM DUAL UNION ALL
      Select 2, '<font face="Microsoft Sans Serif" size="8" color="#000000">1 ETG</font>' FROM DUAL UNION ALL
      Select 3, '<font face="Microsoft Sans Serif" size="8" color="#000000">Gatelys PERC</font>' FROM DUAL UNION ALL
      Select 4, '<font face="Microsoft Sans Serif" size="8" color="#000000">Gml TK 144</font>' FROM DUAL UNION ALL
      Select 5, '<font attr=">">XYZ</font>' FROM DUAL;
    

    Outputs:

    ID TEXT
    1 ABC 123
    2 1 ETG
    3 Gatelys PERC
    4 Gml TK 144
    5 XYZ

    If you want to naively remove all tags using string functions then you can use:

    SELECT id,
           REGEXP_REPLACE(text, '<.*?>') AS text
    FROM   table_name;
    

    Which, for your sample data, outputs:

    ID TEXT
    1 ABC 123
    2 1 ETG
    3 Gatelys PERC
    4 Gml TK 144
    5 ">XYZ

    Note, it will fail if you have attributes inside a tag that contain a > character.

    If you want a more comprehensive query then:

    SELECT id,
           REGEXP_REPLACE(text, q'{<([^"']+?|".*?"|'.*?')+?>}') AS text
    FROM   table_name;
    

    Which outputs the same as the XML query.

    fiddle