regex bash awk text-editor text-processing

How to extract the text around a given cursor position in the current line, up to specified boundary characters on both sides?


Examples of boundary characters: "", '', (), space, and ^$ (start and end of the line, used when no other boundary characters are specified explicitly). Boundary characters should be easily configurable.

The "", '', () boundary characters should take priority over the space boundary character. If no boundary characters are found in the line, ^$ (start and end of the line) should be treated as the boundaries.

The text inside the boundary characters is a URI (directory path, file path, HTTP address, etc.), but it is not required to validate the URI against any pattern. Let us assume that any text can appear inside the boundary characters.

Redundant space characters should be trimmed.
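
To make the priority rule concrete, here is a minimal bash sketch, not a full solution. It assumes the line is already trimmed and that the cursor offset (0 = before the first character) lies strictly inside the text rather than touching a delimiter. PAIRS and extract_between are hypothetical names. A quoted candidate that starts or ends with a space is rejected; that is a heuristic assumed here so that the gap between two quoted URIs is not picked up.

```shell
#!/bin/bash
# Hypothetical sketch: delimiter pairs in decreasing priority, one "<open><close>" per entry.
PAIRS=('""' "''" '()')

# extract_between LINE POS
# POS is the cursor offset between characters (0 = before the first character).
extract_between() {
    local line=$1 pos=$2 pair open close head tail candidate left right
    head=${line:0:pos}   # text to the left of the cursor
    tail=${line:pos}     # text to the right of the cursor
    for pair in "${PAIRS[@]}"; do
        open=${pair:0:1}
        close=${pair:1:1}
        # The opener must appear left of the cursor and the closer right of it.
        if [[ $head != *"$open"* || $tail != *"$close"* ]]; then
            continue
        fi
        # Keep what follows the last opener and what precedes the first closer.
        candidate=${head##*"$open"}${tail%%"$close"*}
        # Assumed heuristic: a candidate with a leading or trailing space is
        # the gap between two quoted items, not a quoted item itself.
        if [[ $candidate == ' '* || $candidate == *' ' ]]; then
            continue
        fi
        printf '%s\n' "$candidate"
        return 0
    done
    # Lowest priority: spaces, with line start and end as implicit boundaries.
    left=${head##*" "}
    right=${tail%%" "*}
    printf '%s\n' "$left$right"
}

extract_between "'~/te st'" 4        # prints: ~/te st
extract_between '~ "$HOME" ~' 6      # prints: $HOME
```

Adding a new delimiter pair only requires another entry in PAIRS; the order of the entries is the priority order.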

Below are examples of lines of text with cursor positions and boundary characters, and what should be extracted from them.

The | symbol in the examples shows the current cursor position. It is not present in the actual lines; only its position number is known. I use the | character in the examples for visual clarity, and show it only where it affects the result (otherwise the cursor can be anywhere in the line).

Examples of lines with cursor positions and boundary characters, the text that should be retrieved from them (after ->), and a comment (after #):

(line with cursor -> extracted text  # comment)
~ -> ~  # cursor can be anywhere in the line, result should be the same, "~" extracted
 ~  -> ~  # the same but with trimmed spaces, "~" extracted

 ~/.local/share/applications  -> ~/.local/share/applications  # spaces trimmed, "~/.local/share/applications" extracted
 /home  -> /home
 
'~/te st' -> ~/te st  # result should be the same no matter where the cursor is located

|"~/te st1" '~/te st2' -> ~/te st1  # border case: cursor touches boundary character, "~/te st1" extracted
"~/te st1"| '~/te st2' -> ~/te st1  # border case, same result
"~/te st1" |'~/te st2' -> ~/te st2  # border case, "~/te st2" extracted
"~/te st1" '|~/te st2' -> ~/te st2
"~/te st1" '~/te st2|' -> ~/te st2
"~/te st1" '~/te st2'| -> ~/te st2  # border case, "~/te st2" extracted

https://stackoverflow.com/ -> https://stackoverflow.com/
  https://stackoverflow.com   -> https://stackoverflow.com
  'https://stackoverflow.com'   -> https://stackoverflow.com

~ /etc 1 /etc 2 '/home' 3 /etc 4 '/home' 5 "/media" 6 "~/te st" 7 ~ "$HO|ME" ~  -> $HOME  # many URIs in one line, "$HOME" extracted

Later in processing, the tilde sign ~ and environment variables like $HOME will be expanded, but this can be excluded from the scope of the current question. Trimming of spaces can also be excluded to keep the question simpler.

Below I present my own script with my current solution. I do not like it because it duplicates code for each set of boundary characters and is therefore not easily configurable. Nevertheless, it works as described above (except that () boundary characters are not supported yet), and performance is good since the string is traversed only once. It would probably be better to rewrite it using function calls.

But maybe a better solution already exists? As I understand it, this must be a very typical task for text editors and text processing.


Update

Use Cases

Thanks to Renaud Pacalet for the clarifying questions. I am adding a description of the use cases to answer the question: what should be returned for "a'b(c|c)b'a", "aaa"bbb" and similar?

This script is for the Kate text editor, to let it open links the way Notepad++ or Vim do, but in a much more flexible way (unfortunately Kate does not "see" links and cannot open them out of the box, but it has an External Tools option that allows text files to be handled by external scripts, such as bash scripts).

Imagine a text file that has lines with URIs as described in the examples for this question. The user puts the cursor over some URI and presses a hotkey; the URI is extracted and opened by the corresponding program (a web browser for http links, a file manager for directories). On Linux, xdg-open is used for this.

So cases like "a'b(c|c)b'a" or "aaa|"bbb" are simply malformed lines in the user's text file and can be considered out of scope for these use cases. Anything may be returned in such cases: for example, cc or aaa. The user will instantly notice that the URI is not written correctly and fix it.


extract-url-and-xdg-open.sh
#!/bin/bash

# extracts URI based on cursor position from strings like:
# 1 "/media" 2 '/home' 3 /etc 4 '/home' 5 "/media" 6

LOG_TO_FILE=0  # 0 is disabled, 1 is enabled
LOG_FILE_NAME="extract-url-and-xdg-open.log"
LOG_TO_CONSOLE=0  # 0 is disabled, 1 is enabled

get_tmp_dir() {
    virtual_tmp_dir="/mnt/tmpfs"
    if [ -d "$virtual_tmp_dir" ]; then
        tmp_dir="$virtual_tmp_dir"
    else
        tmp_dir="/tmp"
    fi
    echo "$tmp_dir"
}

log(){
    if [  "$LOG_TO_FILE" -eq 0 ] && [ "$LOG_TO_CONSOLE" -eq 0 ]; then
        return
    fi

    msg="$(date +%Y-%m-%dT%H:%M:%S): $1"

    if [ "$LOG_TO_FILE" -eq 1 ]; then
        log_file="$(get_tmp_dir)/$LOG_FILE_NAME"
        # printf '%s' "$msg" >> "$logfile"
        echo -e "$msg" >> "$log_file"
    fi
    if [ "$LOG_TO_CONSOLE" -eq 1 ]; then
        echo -e "$msg"
    fi
}

uri_with_env_vars_expanded=

BORDER_CHARS="\"'()"

#extracts uri from file, line number (from 0), cursor position (from 0)
extract_uri() {
    filepath="$1"
    linenumber="$2"
    cursorposition="$3"

    log "filepath: $filepath"
    log "linenumber: $linenumber"
    log "cursorposition: $cursorposition"

    linenumber=$((linenumber+1))

    line=$(sed "${linenumber}q;d" "$filepath")

    log "line: $line"

    extract_uri_from_string2 "$line" "$cursorposition"
}

# checking URI pattern correctness
validate_uri() {
    local uri=$1
    local length=${#uri}
    local first_letter=${uri:0:1}
    local last_letter=${uri:$length-1:1}

    log "validating uri: $uri"
    log "first_letter: $first_letter"
    log "last_letter: $last_letter"

    if [[ $first_letter = " " || $last_letter = " " ]]; then
        log "uri is not valid"
        return 1
    else
        log "uri is valid"
        return 0
    fi
}

extract_uri_from_string() {
    start_time=$(date +%s%3N)

    line="$1"
    cursorposition="$2"

    length=${#line}

    log "line: $line"
    log "line length: $length"
    log "cursorposition: $cursorposition"

    ###########################################
    ### trim spaces, moving cursor accordingly ###

    prefix_spaces=${line%%[![:space:]]*}
    prefix_spaces_length=${#prefix_spaces}
    log "prefix_spaces_length: $prefix_spaces_length"

    suffix_spaces=${line##*[![:space:]]}
    suffix_spaces_length=${#suffix_spaces}
    log "suffix_spaces_length: $suffix_spaces_length"

    line="${line#"${line%%[![:space:]]*}"}"  # remove leading whitespace characters
    line="${line%"${line##*[![:space:]]}"}"  # remove trailing whitespace characters
    log "line after trimming spaces: $line"

    length=${#line}
    log "line length after trimming spaces: $length"

    if (( $prefix_spaces_length > 0 ));  then
        (( cursorposition = cursorposition - prefix_spaces_length ))
        log "cursor position moved to the left on prefix_spaces_length amount: $prefix_spaces_length, new value: $cursorposition"

        if (( cursorposition < 0 )); then
            log "cursor position after trimmed spaces is beyond line in the left side - setting it to 0"
            cursorposition=0
        fi
    fi

    if (( cursorposition > length )); then
        log "cursor position is beyond line length - shrinking it to the line length amount: $length"
        cursorposition=$length
    fi

    ###########################################
    ### moving cursor to support cases when cursor touches boundary chars ###

    if [ "$cursorposition" -eq 0 ]; then  # when cursor is at the beginning of line touches border chars: |'/home'

        first_char="${line:0:1}"
        log "first_char: $first_char"

        if [[ "$BORDER_CHARS" == *"$first_char"* ]]; then  # check that first char is one of the border chars
            (( cursorposition++ ))  # after cursor moved: '|/home'
            log "cursor position moved 1 char forward because it points on the beginning of line and next char is a border char"
        fi
    elif [ "$cursorposition" -eq $length ]; then  # when cursor is in the end of line: '/home'|

        last_char="${line:$length-1:1}"
        log "last_char: $last_char"

        if [[ "$BORDER_CHARS" == *"$last_char"* ]]; then
            (( cursorposition-- ))  #  after cursor moved: '/home|'
            log "cursor position moved 1 char backwards because it points on the end of line and touches border char"
        fi
    else
        char_at_cursor="${line:$cursorposition:1}"
        log "char at current cursor position: $char_at_cursor"

        if [[ "$BORDER_CHARS" == *"$char_at_cursor"* ]] ; then  # start |'/home' end

            previous_char="${line:$cursorposition-1:1}"
            log "previous_char: $previous_char"

            if [ "$previous_char" = " " ]; then
                (( cursorposition++ ))  # after moving cursor: start '|/home' end
                log "cursor position moved 1 char forward because it points on border char and the previous char is space"
            fi
        elif [ "$char_at_cursor" = " " ]; then  # start '/home'| end

            previous_char="${line:$cursorposition-1:1}"
            log "previous_char: $previous_char"

            if [[ "$BORDER_CHARS" == *"$previous_char"* ]]; then
                (( cursorposition-- ))  # after moving cursor: start '/home|' end
                log "cursor position moved 1 char backward because it points on space and the previous char is border char"
            fi
        fi
    fi

    ###########################################
    ### search boundary chars in the left side from cursor ###

    left_singlequote_position=
    left_doublequote_position=
    left_space_position=

    #left direction: search in range [current-position-1 .. 0]
    for (( i=$cursorposition-1; i>=0; i-- )); do
        currchar=${line:$i:1}
        log "[left direction] current index: $i ; char: $currchar"
        if [[ $currchar = '"' && -z $left_doublequote_position ]]; then
            left_doublequote_position=$i
            log "found \"" 
        fi
        if [[ $currchar = "'" && -z $left_singlequote_position ]]; then
            left_singlequote_position=$i
            log "found '" 
        fi
        if [[ $currchar = " " && -z $left_space_position ]]; then
            left_space_position=$i
            log "found space" 
        fi
    done

    log "left_doublequote_position $left_doublequote_position"
    log "left_singlequote_position: $left_singlequote_position"
    log "left_space_position: $left_space_position"

    ###########################################
    ### search boundary chars in the right side from cursor ###

    right_singlequote_position=
    right_doublequote_position=
    right_space_position= 

    #right direction: search in range [current-position .. length-1]
    for (( i=$cursorposition; i<$length; i++ )); do
        currchar=${line:$i:1}
        
        log "[right direction] current index: $i ; char: $currchar"
        
        if [[ $currchar = '"' && -z $right_doublequote_position ]]; then
            right_doublequote_position=$i
            log "found \"" 
        fi
        if [[ $currchar = "'" && -z $right_singlequote_position ]]; then
            right_singlequote_position=$i
            log "found '" 
        fi
        if [[ $currchar = " " && -z $right_space_position ]]; then
            right_space_position=$i
            log "found space" 
        fi
    done

    log "right_doublequote_position $right_doublequote_position"
    log "right_singlequote_position: $right_singlequote_position"
    log "right_space_position: $right_space_position"

    ###########################################
    ### selecting one uri based on priority of boundary chars ###

    uri_between_doublequotes=
    if [[ -n $left_doublequote_position && -n $right_doublequote_position ]]; then
        ((length_between_doublequotes=right_doublequote_position-left_doublequote_position-1))
        log "length_between_doublequotes: $length_between_doublequotes"
        uri_between_doublequotes=${line:$left_doublequote_position+1:$length_between_doublequotes}
        log "uri_between_doublequotes: $uri_between_doublequotes"

        if ! validate_uri "$uri_between_doublequotes"; then
            uri_between_doublequotes=
            log "uri_between_doublequotes is not valid, clearing it, uri_between_doublequotes: $uri_between_doublequotes"
        fi
    fi

    uri_between_singlequotes=
    if [[ -n $left_singlequote_position && -n $right_singlequote_position ]]; then
        ((length_between_singlequotes=right_singlequote_position-left_singlequote_position-1))
        log "length_between_singlequotes: $length_between_singlequotes"
        uri_between_singlequotes=${line:$left_singlequote_position+1:$length_between_singlequotes}
        log "uri_between_singlequotes: $uri_between_singlequotes"

        if ! validate_uri "$uri_between_singlequotes"; then
            uri_between_singlequotes=
            log "uri_between_singlequotes is not valid, clearing it, uri_between_singlequotes: $uri_between_singlequotes"
        fi
    fi

    #if space char borders not found, ^$ (start and end of line) are considered as space borders
    
    if [[ -n $left_space_position ]]; then  # [ -n $left_space_position ] without quoting empty var is not working properly 
        left_space_or_before_start_position=$left_space_position
    else
        # if space border is not found, start of line (^), position before first char (-1), is used as border char
        left_space_or_before_start_position=-1
        log "space border is not found, instead start of line is used as a border, left_space_or_before_start_position: $left_space_or_before_start_position"
    fi
    
    if [[ -n $right_space_position ]]; then 
        right_space_or_length_position=$right_space_position
    else
        # if space border is not found, end of line ($), position after last char (line length) is used as border char
        right_space_or_length_position=$length
        log "space border is not found, instead end of line is used as a border, right_space_or_length_position: $right_space_or_length_position"
    fi

    ((length_between_spaces_or_length=right_space_or_length_position-left_space_or_before_start_position-1))
    log "length_between_spaces_or_length: $length_between_spaces_or_length"
    uri_between_spaces_or_line=${line:$left_space_or_before_start_position+1:$length_between_spaces_or_length}
    log "uri_between_spaces_or_line: $uri_between_spaces_or_line"

    ########

    uri=  # vars clean up is needed to make this code rerunnable for tests
    if [ -n "$uri_between_singlequotes" ]; then
        uri=$uri_between_singlequotes
        log "uri_between_singlequotes case"
    elif [ -n "$uri_between_doublequotes" ]; then
        uri=$uri_between_doublequotes
        log "uri_between_doublequotes case"
    elif [ -n "$uri_between_spaces_or_line" ]; then
        uri=$uri_between_spaces_or_line
        log "uri_between_spaces_or_line case"
    fi

    log "uri: $uri"

    ###########################################
    #### ~ and env vars expansion ###

    uri_with_tilde_expanded="${uri/#\~/$HOME}"
    log "uri_with_tilde_expanded: $uri_with_tilde_expanded"
    uri_with_env_vars_expanded=$(echo "$uri_with_tilde_expanded" | envsubst)
    log "uri_with_env_vars_expanded: $uri_with_env_vars_expanded"


    ###########################################

    end_time=$(date +%s%3N)
    duration_ms=$((end_time - start_time))

    log "Execution time in ms: $duration_ms"

    ###########################################
}

extract_uri_from_string2() {
    start_time=$(date +%s%3N)

    line="$1"
    cursorposition="$2"

    length=${#line}

    log "line: $line"
    log "line length: $length"
    log "cursorposition: $cursorposition"
    ((cursorposition++ ))  # for awk cursor position must start from 1, not from 0
    #awk -f foo.awk -v d="\"\"''()  " -v n=1 -v p=5 <<< "'~/te st'"
    uri=$(awk -f /usr/local/bin/foo.awk -v d="\"\"''()" -v n=1 -v p="$cursorposition" <<< "$line")
    log "uri: $uri"

    ###########################################
    #### ~ and env vars expansion ###

    uri_with_tilde_expanded="${uri/#\~/$HOME}"
    # uri_with_env_vars_expanded=$(echo "$uri_with_tilde_expanded" | envsubst)
    uri_with_env_vars_expanded=$(envsubst <<< "$uri_with_tilde_expanded")
    log "uri_with_env_vars_expanded: $uri_with_env_vars_expanded"
}

extract_and_open() {
    extract_uri "$@"

    if [ -n "$uri_with_env_vars_expanded" ]; then
        log "calling xdg-open"
        xdg-open "$uri_with_env_vars_expanded"
    fi
}

if [[ "$0" == *"test"* ]]; then  # do not run "$@" if called from test file
   return
fi

"$@"

Unit tests:

extract-url-and-xdg-open-test.sh
#!/bin/bash

source extract-url-and-xdg-open.sh  # import for tested script

logt(){  # log version for tests
    msg="$(date +%Y-%m-%dT%H:%M:%S): $1"

    echo -e "$msg"

    if [ "$LOG_TO_FILE" -eq 1 ]; then
        log_file="$(get_tmp_dir)/$LOG_FILE_NAME"
        # printf '%s' "$msg" >> "$logfile"
        echo -e "$msg" >> "$log_file"
    fi
}

#1 string
#2 substring or char to search index
strindex() {
  x="${1%%"$2"*}"
  [[ "$x" = "$1" ]] && echo -1 || echo "${#x}"
}

# cursor position for Kate editor
# The position is before a character, between chars, or after the last char, not under a char
# cursor marked with | symbol
# For "|'/home'" cursor position is 0
# For "'|/home'" cursor position is 1
# For "'/home'|" cursor position is 7
get_cursor_position() {
    string=$1
    position=$(strindex "$string" "|")
    echo "$position"
}

delete_cursor_char() {
    string=$1
    string=${string/|/}
    echo "$string"
}

<<'###BLOCK-COMMENT'
tests:
'/home'
  '/home'  
~
  ~  
  ~/.local/share/applications  
 ~ /etc 1 /etc 2 '/home' 3 /etc 4 '/home' 5 "/media" 6 "~/scripts" 7 ~ "$HOME" ~  
"/home/guest" '/home/guest'
'~/scripts'
###BLOCK-COMMENT
tests() {
    local test_start_time=$(date +%s%3N)
   
    local failed_tests=0
    local all_tests=0
    #local tested_func=${1:-"extract_uri_from_string"}

    logt "Running tests... for: $tested_func"
    
    test01_two_char_uri_with_spaces

    # cursor_posit<=$test_length: from 0 to length inclusive, because the cursor in the editor is between chars, not over them

    #test
    test_line="'/t'"; expected_result="/t"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test01ts: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="\"/t\""; expected_result="/t"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test01td: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done
   
    #test
    test_line="'/home'"; expected_result="/home"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test01h: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="\"$HOME\""; expected_result="/home/$USER"
    for (( cursor_posit=0; cursor_posit<=${#test_line}; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test02: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="  '/home'  "; expected_result="/home"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test03: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    visual_test_line="|'/home'"; expected_result="/home" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test04: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    visual_test_line="/h|ome"; expected_result="/home" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test05: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    visual_test_line="'/home'|"; expected_result="/home" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test06: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    test_line="~"; expected_result="/home/$USER"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "TestA06: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result; got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="  ~  "; expected_result="/home/$USER"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test07: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result; got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="  ~/.local/share/applications  "; expected_result="/home/$USER/.local/share/applications"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test08: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result; got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="\"~/scripts\""; expected_result="/home/$USER/scripts"
    test_length=${#test_line}
    for (( cursor_posit=0; cursor_posit<=$test_length; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test09: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    test_line="\"/home/guest\" '/home/config'"; 
    expected_result="/home/guest"
    for (( cursor_posit=0; cursor_posit<=13; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test10: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done
    expected_result="/home/config"
    for (( cursor_posit=14; cursor_posit<=${#test_line}; cursor_posit++ )); do
        $tested_func "$test_line" "$cursor_posit"
        if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
            logt "Test11: $test_line; cursor_posit: $cursor_posit; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
            (( failed_tests++ ))
        fi
        (( all_tests++ ))
    done

    #test
    visual_test_line="  ~ |/etc 1 /etc 2 '/home'  3 /etc 4 '/home' 5 \"/media\" 6 \"~/scripts\" 7 ~ \"$HOME\" ~  "; 
    expected_result="/etc" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test12: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    visual_test_line="  ~ /etc 1 /etc 2 '/home'  3 /etc 4 '/home' 5 \"/media\" 6 \"~/scripts\" 7 ~ \"$HOME\" 8  ~|  "; 
    expected_result="/home/$USER" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test13: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    visual_test_line="  ~ /etc 1 '/home'| 2 \"/media\" 3 \"~/scripts\" 4 ~ \"$HOME\" 5 ~  "; 
    expected_result="/home" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test14: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    visual_test_line="  ~ /etc 1 /etc 2 '/home' 3 /etc 4 '/home' 5 \"/media\" 6 \"~/scr|ipts\" 7 ~ \"$HOME\" ~  "; 
    expected_result="/home/$USER/scripts" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test16: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ ))

    #test
    visual_test_line="  ~ /etc 1 /etc 2 '/home' 3 /etc 4 '/home' 5 \"/media\" 6 \"~/scripts\" 7 ~ \"|$HOME\" ~  "; 
    expected_result="/home/$USER" 
    cursorposition=$(get_cursor_position "$visual_test_line"); test_line=$(delete_cursor_char "$visual_test_line")
    $tested_func "$test_line" "$cursorposition"
    if [  "$uri_with_env_vars_expanded" != "$expected_result" ]; then
        logt "Test17: $visual_test_line; Expected result: $expected_result, got: $uri_with_env_vars_expanded"
        (( failed_tests++ ))
    fi
    (( all_tests++ )) 

    # printing results
    if [ $failed_tests -eq 0 ]; then
        logt "All tests passed"
    else
        logt "Number of failed tests: $failed_tests"
    fi
    logt "Total run tests amount: $all_tests"

    local test_end_time=$(date +%s%3N)
    local test_duration_ms=$(( test_end_time - test_start_time ))
    local test_duration_sec=$( bc <<< "scale = 2; ($test_end_time - $test_start_time) / 1000" )
    logt "Execution time in ms: $test_duration_ms"
    logt "Execution time in sec: $test_duration_sec"
}

tested_func="extract_uri_from_string"

logt "tested_func: $tested_func"

tests


Solution

  • You could try awk (tested with GNU awk).

    Search delimiters to the left and right of integer position

    $ cat foo.awk
    #!/usr/bin/env awk
    
    function lastindex(s, c,    r) {
      r = index(s, c)
      if(r != 0) return r + lastindex(substr(s, r + 1), c)
      return 0
    }
    
    NR == n {
      h = substr($0, 1, p - 1); c = substr($0, p, 1); t = substr($0, p + 1)
      for(i = 1; i < length(d); i += 2) {
        c0 = substr(d, i, 1); c1 = substr(d, i + 1, 1)
        i0 = lastindex(h, c0); i1 = index(t, c1)
        if(i0 && (c == c1)) { # opening in head, closing at position
          s = substr(h, i0 + 1)
          exit
        }
        if((c == c0) && i1) { # opening at position, closing in tail
          s = substr(t, 1, i1 - 1)
          exit
        }
        if(i0 && i1) { # opening in head, closing in tail
          s = substr(h, i0 + 1) c substr(t, 1, i1 - 1)
          exit
        }
      }
      s = $0; exit
    }
    
    END {
      sub(/^[[:space:]]*/, "", s)
      sub(/[[:space:]]*$/, "", s)
      print s
    }
    

    Then, if your file is named file, the line of interest is line 15 (counting from 1 for the first line), the position is 72 (counting from 1 for the first character), and your delimiter pairs, in decreasing order of precedence, are "", '', () and a pair of spaces:

    $ awk -f foo.awk -v d="\"\"''()  " -v n=15 -v p=72 file
    

    Or, if you want to pass the text line directly (but you'll have to properly escape quotes, if needed):

    $ awk -f foo.awk -v d="\"\"''()  " -v n=1 -v p=5 <<< "'~/te st'"
    

    Explanations:

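    The script uses 3 variables passed on the command line: d (the string of delimiter pairs, in decreasing order of precedence), n (the line number, counted from 1) and p (the character position, counted from 1).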

    The lastindex custom function returns the index of the last occurrence of character c in string s or 0 if there is none. The index built-in function does the same with the first occurrence.
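
    For comparison, the same two lookups can be sketched in bash with parameter expansion. This is only an illustration: first_index and last_index are hypothetical helper names, and they follow the 1-based convention of awk's index (0 when the character is absent).

```shell
#!/bin/bash
# first_index STRING CHAR: 1-based index of the first CHAR in STRING, 0 if absent.
first_index() {
    local s=$1 c=$2 prefix
    [[ $s == *"$c"* ]] || { echo 0; return; }
    prefix=${s%%"$c"*}        # drop the longest suffix that starts with CHAR
    echo $(( ${#prefix} + 1 ))
}

# last_index STRING CHAR: 1-based index of the last CHAR in STRING, 0 if absent.
last_index() {
    local s=$1 c=$2 prefix
    [[ $s == *"$c"* ]] || { echo 0; return; }
    prefix=${s%"$c"*}         # drop the shortest suffix that starts with CHAR
    echo $(( ${#prefix} + 1 ))
}

first_index "a/b/c" "/"   # prints 2
last_index  "a/b/c" "/"   # prints 4
last_index  "abc"   "/"   # prints 0
```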

    The script processes only line number n (condition NR == n).

    We split the line into 3 parts: head (h) from character 1 to character p-1, the character at position p (c), and tail (t) from character p+1 to the end.
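
    As an illustration of this split, the bash sketch below cuts a line the same way with substring expansion. The split_at name is hypothetical, and p is counted from 1 as in the awk script.

```shell
#!/bin/bash
# split_at LINE P: print head, character at P, and tail, each in brackets.
split_at() {
    local line=$1 p=$2
    local h="${line:0:p-1}"   # characters 1 .. p-1
    local c="${line:p-1:1}"   # character at position p
    local t="${line:p}"       # characters p+1 .. end
    printf '[%s][%s][%s]\n' "$h" "$c" "$t"
}

split_at "'~/te st'" 5   # prints ['~/t][e][ st']
```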

    Then, for each pair of opening/closing characters (c0 and c1, extracted from string variable d):

    1. We check if the opening character (c0) is in h and the closing character (c1) at position p. If they are, we set the result string s to the part of h between last c0 (excluded) and the end, and we exit.

    2. Else we check if the opening character (c0) is at position p and the closing character (c1) in t. If they are we set the result string s to the part of t between beginning and first c1 (excluded), and we exit.

    3. Else we check if the opening character (c0) is in h and the closing character (c1) in t. If they are we set the result string s to the concatenation of:

      • the part of h between last c0 (excluded) and the end,
      • the character at position (c),
      • the part of t between beginning and first c1 (excluded)

      and we exit.

    After trying all opening/closing pairs, if we did not exit yet, we set the result string s to the whole line and we exit.

    The END block runs after exit. We trim leading and trailing spaces from s with function sub, and we print.
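
    To make rule 3 concrete, here is a small trace for a single delimiter pair. It is only a sketch: between is a hypothetical wrapper, and this lastindex is one possible way to honour the contract described above, so the function in the actual script may be written differently.

    ```shell
    # Rule 3 for one delimiter pair: opening character in the head,
    # closing character in the tail. between() and this lastindex() are
    # illustrative assumptions, not copied from the original script.
    between() {  # usage: between LINE POSITION OPEN CLOSE
      printf '%s\n' "$1" | awk -v p="$2" -v c0="$3" -v c1="$4" '
        function lastindex(s, c,    i) {
          for (i = length(s); i > 0; i--)
            if (substr(s, i, 1) == c) return i
          return 0
        }
        {
          h = substr($0, 1, p - 1); c = substr($0, p, 1); t = substr($0, p + 1)
          a = lastindex(h, c0)   # last opening delimiter before p, or 0
          b = index(t, c1)       # first closing delimiter after p, or 0
          if (a > 0 && b > 0) print substr(h, a + 1) c substr(t, 1, b - 1)
        }'
    }

    between "'~/te st'" 5 "'" "'"   # prints ~/te st
    ```

    For this input, h is '~/t, c is e and t is " st'" (with a leading space), so a is 1 and b is 4, and the three pieces ~/t, e and " st" are concatenated. Nothing is printed when the pair is not found on both sides, which is where the real script moves on to the next pair.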

    Search matching pairs of delimiters

    One drawback of the previous solution is that non-matching pairs of delimiters are not properly detected. For instance, with input 'a'b'c' and position 4 the output is b, while probably only a or c would be a valid result.

    Another possibility would be to find all matching pairs and check if the position is in the substring they delimit.

    In the following we consider only the non-space delimiters (-v d="\"\"''()") but the last resort rule is slightly modified: if the position does not fall between a balanced pair of delimiters (delimiters included), then the word (contiguous sequence of non-space characters) at the position is printed, unless the character at the position is a space, in which case nothing is printed. Moreover, if the specified position is less than 1 or greater than the length of the line, nothing is printed.

    #!/usr/bin/env awk
    
    function find(s, p, c,    q) {
      q = index(substr(s, p), c)
      return (q == 0) ? 0 : p - 1 + q
    }
    
    function trim(s) {
      sub(/^[[:space:]]*/, "", s)
      sub(/[[:space:]]*$/, "", s)
      return s
    }
    
    NR == n {
      if(p < 1 || p > length($0)) exit
      for(i = 1; i < length(d); i += 2) {
        c0 = substr(d, i, 1); c1 = substr(d, i + 1, 1)
        y = 0
        while(x = find($0, y + 1, c0)) {
          if(y = find($0, x + 1, c1)) {
            if(x <= p && p <= y) {
              print trim(substr($0, x + 1, y - x - 1))
              exit
            }
          } else break
        }
      }
      c = substr($0, p, 1)
      if(c ~ /[[:space:]]/) exit
      h = substr($0, 1, p - 1); t = substr($0, p + 1)
      sub(/.*[[:space:]]+/, "", h); sub(/[[:space:]]+.*/, "", t)
      print h c t
    }
    

    The find function returns the index of the first occurrence of character c in string s at or after position p, or 0 if there is none.
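
    The pair scanning done with find can be sketched in isolation as follows. The find function is the one from the script; pairs is a hypothetical wrapper that only lists the positions of the balanced pairs for one opening/closing character, without looking at the cursor position.

    ```shell
    # List the balanced pairs found for one delimiter pair, as open-close
    # positions. pairs is a hypothetical name used for illustration.
    pairs() {  # usage: pairs LINE OPEN CLOSE
      printf '%s\n' "$1" | awk -v c0="$2" -v c1="$3" '
        function find(s, p, c,    q) {
          q = index(substr(s, p), c)
          return (q == 0) ? 0 : p - 1 + q
        }
        {
          y = 0; sep = ""
          # next opening delimiter after the previous pair, then its closing one
          while ((x = find($0, y + 1, c0)) && (y = find($0, x + 1, c1))) {
            printf "%s%d-%d", sep, x, y
            sep = " "
          }
          print ""
        }'
    }

    pairs "'a'b'c'" "'" "'"   # prints 1-3 5-7
    ```

    With input 'a'b'c' the pairs are 1-3 and 5-7, so position 4 (the b) lies in neither of them and the last resort rule applies instead of printing b.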

    Shortest distance first

    The 2 previous versions assume that the position is an integer value and designates one character in the line. Some editors have modes where the position of the cursor is between two characters, like gvim in insert mode, for instance.

    This third version supports integer positions and also non-integer positions, e.g., 2.5 for the position between the second and third characters, or 0.5 for the position on the left of the first character. It computes the distance between the position and all substrings delimited by delimiter pairs, selects the smallest distance and, if it is less than or equal to 1, prints the corresponding substring, with a left-first priority in case of equality. If no delimited substring is close enough, it does the same with space-delimited words.

    Warning: it uses GNU awk extensions (patsplit, the fourth parameter of patsplit and split, and strongly typed regexp constants of the form @/.../) and will not work with other awk implementations without modifications.

    #!/usr/bin/env awk
    
    function distance(p, start, end) {
      if(p < start) return start - p
      if(start <= p && p <= end) return 0
      return p - end
    }
    
    function trim(s) {
      sub(/^[[:space:]]*/, "", s)
      sub(/[[:space:]]*$/, "", s)
      return s
    }
    
    NR == n {
      ndelim = split(d, delim, "")
      dmin = length($0) + 2
      for(i = 1; i < ndelim; i += 2) {
        delim[i] = ((delim[i] ~ @/[\\\-\^\]]/) ? "\\" : "") delim[i]
        delim[i + 1] = ((delim[i + 1] ~ @/[\\\-\^\]]/) ? "\\" : "") delim[i + 1]
        nwords = patsplit($0, words, "[" delim[i] "][^" delim[i + 1] "]*[" delim[i + 1] "]", seps)
        if(nwords == 0) continue
        start = length(seps[0]) + 1
        for(j = 1; j <= nwords; j++) {
          wlen = length(words[j])
          end = start + wlen - 1
          x = distance(p, start, end)
          if(x < dmin) {
            word = substr($0, start + 1, end - start - 1)
            dmin = x
          }
          start += wlen + length(seps[j])
        }
      }
      if(dmin <= 1) {
        print trim(word)
        exit
      }
      nwords = split($0, words, " ", seps)
      start = length(seps[0]) + 1
      for(j = 1; j <= nwords; j++) {
        wlen = length(words[j])
        end = start + wlen - 1
        x = distance(p, start, end)
        if(x < dmin) {
          word = substr($0, start, end - start + 1)
          dmin = x
        }
        start += wlen + length(seps[j])
      }
      if(dmin <= 1) {
        print trim(word)
        exit
      }
    }
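
    As a rough illustration of the distance rule with a non-integer position, the following sketch evaluates the same three cases as the distance function, using plain awk only. dist is a hypothetical wrapper name, and the span boundaries are worked out by hand for one of the lines from the question.

    ```shell
    # Distance from cursor position p to the span [start, end], following
    # the same three cases as the distance function of the script.
    # dist is a hypothetical wrapper name used for illustration.
    dist() {  # usage: dist POSITION START END
      awk -v p="$1" -v s="$2" -v e="$3" 'BEGIN {
        if (p < s) print s - p        # cursor left of the span
        else if (p <= e) print 0      # cursor inside the span
        else print p - e              # cursor right of the span
      }'
    }

    # Line: "~/te st1" '~/te st2'   (quoted spans at 1-10 and 12-21)
    dist 10.5 1 10    # prints 0.5
    dist 10.5 12 21   # prints 1.5
    ```

    With the cursor between characters 10 and 11, the first span is at distance 0.5 and the second at 1.5, so ~/te st1 is selected. At position 11.5 the distances are swapped and ~/te st2 is selected, and at the integer position 11 both distances are 1, so the left-first priority picks ~/te st1.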