bash awk

Recombine data with multiple different delimiters


I have a set of records like this:

https://example.com:description of the site/application:category
http://example.com:description of the site/application:category
android://package name:description of the site/application:category
android://package name|description of the site/application|category

I want to split the data into 3 columns:

URL Description Category
https://example.com description of the site/application category
http://example.com description of the site/application category
android://package name description of the site/application category
android://package name description of the site/application category

As I understand it, I need a regex that ignores the first ":" and also treats "|" as a second delimiter.

I tried this expression, but the output is incorrect:

cat * | awk -F["|"][:] '{print $1,$2, $3}'

Solution

  • Using any awk:

    $ awk -F'[:|]' -v OFS='\t' '{sub(/:/,RS); sub(RS,":",$1)} 1' file
    https://example.com     description of the site/application     category
    http://example.com      description of the site/application     category
    android://package name  description of the site/application     category
    android://package name  description of the site/application     category
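
The one-liner above is terse, so here is the same program spread out with comments, as a sketch of how it can be read. The wrapper name split_first_form is hypothetical and only there so the program can be fed from a pipe.

```shell
# split_first_form is a hypothetical wrapper; it reads records on stdin.
split_first_form() {
  awk -F'[:|]' -v OFS='\t' '
    {
      sub(/:/, RS)       # swap the first ":" (the one in "://") for a newline, which
                         # cannot occur inside a record; awk then re-splits $0
      sub(RS, ":", $1)   # put the ":" back inside $1; changing a field rebuilds $0 with OFS
    }
    1                    # print the rebuilt record
  '
}

printf '%s\n' \
  'https://example.com:description of the site/application:category' \
  'android://package name|description of the site/application|category' |
  split_first_form
```

Because the "://" colon is hidden before the record is split, only the real delimiters are left for -F'[:|]' to act on.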
    

    or, if the OFS character can't be present in the URL in the input:

    $ awk -F'[:|]' -v OFS='\t' '{$1=$1; sub(OFS,":")} 1' file
    https://example.com     description of the site/application     category
    http://example.com      description of the site/application     category
    android://package name  description of the site/application     category
    android://package name  description of the site/application     category
    

    Set OFS to something other than \t as you see fit.
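
For instance, a minimal sketch that uses ";" as the output separator and also prints the header row shown in the question. The function name to_columns and its separator argument are assumptions made for illustration.

```shell
# to_columns is a hypothetical helper: $1 is the output separator, records come from stdin.
to_columns() {
  awk -F'[:|]' -v OFS="$1" '
    BEGIN { print "URL", "Description", "Category" }   # header row, joined with OFS
    { sub(/:/, RS); sub(RS, ":", $1) }                  # same edit as in the first command
    1
  '
}

printf '%s\n' \
  'https://example.com:description of the site/application:category' \
  'android://package name|description of the site/application|category' |
  to_columns ';'
```

The header is printed with commas between its parts, so it picks up whatever OFS was passed in.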

    Please read the POSIX spec to learn what bracket expressions such as the ones you used, ["|"][:], and the one I used, [:|], mean.
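
As a rough illustration of the difference: the shell removes the double quotes from -F["|"][:], so awk receives [|][:], which is two bracket expressions in a row and matches a "|" immediately followed by a ":". The helper count_fields below is hypothetical and just reports NF for one record.

```shell
# count_fields is a hypothetical helper: $1 = FS regex, $2 = one record.
count_fields() {
  printf '%s\n' "$2" | awk -F"$1" '{ print NF }'
}

count_fields '[:|]'   'a:b|c'   # prints 3: one set holding ":" and "|", split at either
count_fields '[|][:]' 'a:b|c'   # prints 1: no "|" directly followed by ":" in the record
count_fields '[|][:]' 'a|:b'    # prints 2: the two-character sequence "|:" is present
```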

    Having said that, I suspect the OP's real input probably looks something like this (where additional :s or |s can appear in the URL and/or description, but no literal blanks can be in the URL):

    $ cat file
    https://example.com:description of : the site/application:category
    http://example.com:description: of the site/application:category
    android://package%20name:description of the site/application:category
    android://package%20name|description of the site/application|category
    android://package_name:17:something:description of the :huge: site/application:category
    

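    and then you can get the output you want using the following sed script (using a sed that has -E to enable EREs, e.g. GNU and BSD seds; GNU sed expands the \t in the replacement to a tab, other seds may need a literal tab character there instead):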

    $ sed -E 's/([^ ]+)[:|]([^ ].*)[:|]/\1\t\2\t/' file
    https://example.com     description of : the site/application   category
    http://example.com      description: of the site/application    category
    android://package%20name        description of the site/application     category
    android://package%20name        description of the site/application     category
    android://package_name:17:something     description of the :huge: site/application      category
    

    or using any sed:

    $ sed 's/\([^ ]*\)[:|]\([^ ].*\)[:|]/\1\t\2\t/' file
    https://example.com     description of : the site/application   category
    http://example.com      description: of the site/application    category
    android://package%20name        description of the site/application     category
    android://package%20name        description of the site/application     category
    android://package_name:17:something     description of the :huge: site/application      category
    

    Those sed commands assume that a description contains at least 1 blank and doesn't start with a : or with word:word. If that's not the case, then there is no way to separate a description from a URL given what we know so far about the input.
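
If your sed prints a literal t where a tab was expected, one option is to let the shell supply the tab character. This is a sketch of the BRE command above with that change; the function name split_records and the variable tab are hypothetical.

```shell
# split_records is a hypothetical wrapper around the BRE command; it reads stdin.
split_records() {
  tab=$(printf '\t')   # a real tab character, so sed never has to interpret \t
  sed "s/\([^ ]*\)[:|]\([^ ].*\)[:|]/\1${tab}\2${tab}/"
}

printf '%s\n' \
  'https://example.com:description of : the site/application:category' \
  'android://package_name:17:something:description of the :huge: site/application:category' |
  split_records
```

Inside double quotes the shell leaves \( \) \1 and \2 untouched and only expands ${tab}, so sed sees the same expression as before with real tabs in the replacement.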