sqlregexgoogle-bigqueryre2

re2 bigquery regex remove last period from each word in a string


I want the period removed from each word in a senetence but only if the word is greater then a length of 1. Remove eriod from Spain. but not P. I have tried the following. It words on the first word Spain. but has the undesired effect of replacing U.S.A. with US.A and not U.S.A There could be any combination of countries or country codes ending in a period.

WITH
  t1 AS (
  SELECT
    'Spain. S.S. Spain.' sentence,
    'Spain S.S Spain' expected
  UNION ALL
  SELECT
    'U.S.A. S.S. Spain.',
    'U.S.A S.S Spain'
  UNION ALL
  SELECT
    'P. SMITH S.S. Spain.',
    'P. SMITH S.S Spain' )
SELECT
  sentence,
  REGEXP_REPLACE(sentence,r'(\w+)(\.)(.+)','\\1\\3') AS actual,
  expected
FROM
  t1;

Regex Pattern

enter image description here


Solution

  • We can replace (\S+)\. with the first capture group:

    SELECT sentence, REGEXP_REPLACE(sentence, r'(\S{2,})\.', r'\1') AS expected
    FROM t1;