I have a table that has a column that holds an html string that may or may not be well formed (tried to use xmltable route and it didn't work) which is why I am trying to use the following sql syntax. I am trying to create a query that uses REGEXP_SUBSTR, LEVEL and CONNECT BY LEVEL based on examples I found online but I am not able to get the results I would expect:
I created an example query below:
with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')
By running this query, I would expect 7 rows to come back:
ID | Contents | Data_Jump |
---|---|---|
1 | Client Servic | 111 |
1 | Customer Servic | 112 |
1 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Customer Servic | 112 |
Instead its coming back with 22 rows.
ID | Contents | Data_Jump |
---|---|---|
1 | Client Servic | 111 |
1 | Customer Servic | 112 |
1 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Customer Servic | 112 |
1 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Client Servic | 111 |
1 | Customer Servic | 112 |
1 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Customer Servic | 112 |
1 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
2 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
I don't know where the problem is that I am getting duplicates. On the actual production table there are a lot of rows so the query just spins and never returns. I suspect its because a lot of duplicates are being generated. Can anyone tell me how to fix the above query so that I can try to transport it onto my actual query?
Thanks,
You need to include the ID in the connect-by clause; but for this to wotk you also need to include a non-deterministic function call; e.g.:
AND PRIOR id = id
AND PRIOR dbms_random.value IS NOT NULL
So
with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')
AND PRIOR id = id
AND PRIOR dbms_random.value IS NOT NULL
ID | CONTENTS | DATA_JUMP |
---|---|---|
1 | Client Servic | 111 |
1 | Customer Servic | 112 |
1 | Tecnical Servic | 113 |
2 | Client Servic | 111 |
2 | Customer Servic | 112 |
2 | Tecnical Servic | 113 |
2 | Other Servic | 114 |
Or you could use recursive subquery factoring instead of a hierarchical query, which is maybe easier to understand:
WITH rcte (id, html_string, lvl, contents, data_jump) AS (
SELECT ID, html_string, 1,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, 1, NULL, 1),
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
FROM qry
UNION ALL
SELECT ID, html_string, lvl + 1,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl + 1, NULL, 1),
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1)
FROM rcte
WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl + 1, NULL, 1) IS NOT NULL
OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1) IS NOT NULL
)
SELECT id, contents, data_jump
FROM rcte
ORDER BY id, lvl
Or even, preserving the last character of the 'contents' by removing the stray .
from the regex as @Gary_W pointed out:
with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
),
rcte (id, html_string, lvl, contents, data_jump) AS (
SELECT ID, html_string, 1,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, 1, NULL, 1),
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
FROM qry
UNION ALL
SELECT ID, html_string, lvl + 1,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl + 1, NULL, 1),
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1)
FROM rcte
WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl + 1, NULL, 1) IS NOT NULL
OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl + 1, NULL, 1) IS NOT NULL
)
SELECT id, contents, data_jump
FROM rcte
ORDER BY id, lvl
ID | CONTENTS | DATA_JUMP |
---|---|---|
1 | Client Service | 111 |
1 | Customer Service | 112 |
1 | Tecnical Service | 113 |
2 | Client Service | 111 |
2 | Customer Service | 112 |
2 | Tecnical Service | 113 |
2 | Other Service | 114 |