Regular expression loop makes duplicate matches due to similar words. How to avoid?


I have a regex problem that I have tried to make as general as possible, although my code is written in MATLAB.

INFO:

LipidData is a 68×2 table with a name column (Lipid subclass name) and a Short column. The Short values are strings such as LPC, PC, AC4PIM2, SHexCer, SQDG and many more. The LipidData table is not going to change, whereas foundpattern may vary depending on the real input data it comes from.

foundpattern is an N×4 table; in my example N is 7. The only relevant column here is the first one, called ISDs, which contains the strings to check (for reproducibility you can copy just that column as a cell array). Here are both MATLAB tables:

INPUT:

>> LipidData

LipidData =

 68×2 table

                Lipid subclass name                       Short   
___________________________________________________    ___________

{'Diacylated phosphatidylinositol monomannoside'                  }    {'Ac2PIM1'    }
{'Diacylated phosphatidylinositol dimannoside'                    }    {'Ac2PIM2'    }
{'Triacylated phosphatidylinositol dinomannoside'                 }    {'Ac3PIM2'    }
{'Tetraaacylated phosphatidylinositol dimannoside'                }    {'AC4PIM2'    }
{'Anacardic Acid'                                                 }    {'ACar'       }
{'Acetylglucose andrographolide'                                  }    {'AcylGlcADG' }
{'Bis[monoacylglycero]phosphates'                                 }    {'BMP'        }
{'Cholesteryl esters'                                             }    {'CE'         }
{'Ceramide'                                                       }    {'Cer'        }
{'Ceramide alpha-hydroxy fatty acid-dihydrosphingosine'           }    {'CerADS'     }
{'Ceramide alpha-hydroxy fatty acid-phytospingosine'              }    {'CerAP'      }
{'Ceramide beta-hydroxy fatty acid-sphingosine'                   }    {'CerAS'      }
{'Ceramide beta-hydroxy fatty acid-dihydrosphingosine'            }    {'CerBDS'     }
{'Ceramide beta-hydroxy fatty acid-sphingosine'                   }    {'CerBS'      }
{'Ceramide Esterified omega-hydroxy fatty acid-dihydrosphingosine'}    {'CerEODS'    }
{'Ceramide Esterified omega-hydroxy fatty acid-sphingosine'       }    {'CerEOS'     }
{'Ceramide non-hydroxyfatty acid-dihydrosphingosine'              }    {'CerNDS'     }
{'Ceramide non-hydroxyfatty acid-phytospingosine'                 }    {'CerNP'      }
{'Ceramide non-hydroxyfatty acid-sphingosine'                     }    {'Cer_NS'     }
{'Ceramide phosphate'                                             }    {'CerP'       }
{'Cholesterol'                                                    }    {'Cholesterol'}
{'Cardiolipins'                                                   }    {'CL'         }
{'Diacyl/alkylglycerides'                                         }    {'DG'         }
{'Digalactosyldiacylglycerols'                                    }    {'DGDG'       }
{'1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine'         }    {'DGTS'       }
{'Ether Oxygenated Phosphatidylcholines'                          }    {'EtherOxPC'  }
{'Ether Oxygenated Phosphatidylethanolamines'                     }    {'EtherOxPE'  }
{'Ether-linked Phosphatidylcoline'                                }    {'EtherPC'    }
{'Ether-linked Phosphatidylethanolamine'                          }    {'EtherPE'    }
{'Fatty Acids'                                                    }    {'FA'         }
{'Fatty acid ester of hydroxyl fatty acid'                        }    {'FAHFA'      }
{'Glucuronosyldiacylglycerol'                                     }    {'GlcADG'     }
{'GM3 Ganglioside'                                                }    {'GM3'        }
{'Hidroxy Bis[monoacylglycero]phosphates'                         }    {'HBMP'       }
{'Hexosylceramide alpha-hydroxy fatty acid-phytospingosine'       }    {'HexCerAP'   }
{'Hexosylceramide non-hydroxyfatty acid-dihydrosphingosine'       }    {'HexCerNDS'  }
{'Hexosylceramide non-hydroxyfatty acid-sphingosine'              }    {'HexCer_NS'  }
{'Lyso 1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine'    }    {'DGTS'       }
{'Lyso Phosphatidic acids'                                        }    {'LPA'        }
{'Lyso Phosphatidylcholines'                                      }    {'LPC'        }
{'Lyso Phosphatidylethanolamines'                                 }    {'LPE'        }
{'Lyso Phosphatidylglycerols'                                     }    {'LPG'        }
{'Lyso Phosphatidylinositols'                                     }    {'LPI'        }
{'Lyso Phosphatidylserines'                                       }    {'LPS'        }
{'Monoacyl/alkylglycerides'                                       }    {'MG'         }
{'Monogalactosyldiacylglycerols'                                  }    {'MGDG'       }
{'Oxygenated Cardiolipins'                                        }    {'OxCL'       }
{'Oxygenated Fatty Acids'                                         }    {'OxFA'       }
{'Oxygenated Phosphatidic acids'                                  }    {'OxPA'       }
{'Oxygenated Phosphatidylcholines'                                }    {'OxPC'       }
{'Oxygenated Phosphatidylethanolamines'                           }    {'OxPE'       }
{'Oxygenated Phosphatidylglycerols'                               }    {'OxPG'       }
{'Oxygenated Phosphatidylinositols'                               }    {'OxPI'       }
{'Oxygenated Phosphatidylserines'                                 }    {'OxPS'       }
{'Oxygenated Triacyl/alkylglycerides'                             }    {'OxTG'       }
{'Phosphatidic acids'                                             }    {'PA'         }
{'Phosphatidylbutyl alcohol'                                      }    {'PBtOH'      }
{'Phosphatidylcholines'                                           }    {'PC'         }
{'Phosphatidylethanolamines'                                      }    {'PE'         }
{'Phosphatidyletanol'                                             }    {'PEtOH'      }
{'Phosphatidylglycerols'                                          }    {'PG'         }
{'Phosphatidylinositols'                                          }    {'PI'         }
{'Phosphatidylmethanol'                                           }    {'PMeOH'      }
{'Phosphatidylserines'                                            }    {'PS'         }
{'Sulfatides hexosyl ceramide'                                    }    {'SHexCer'    }
{'Sphingomyelines'                                                }    {'SM'         }
{'Sulfoquinovosyl diacylglycerols'                                }    {'SQDG'       }
{'Triacyl/alkylglycerides'                                        }    {'TG'         }


>> foundpattern

foundpattern =

7×4 table

           ISDs                 tR      Standard desv      RSD  
__________________________    ______    _____________    _______

{'18:1 (d7) MG'          }      1.34       0.020418       1.5238
{'18:1(d7) LPC'          }    1.5868      0.0056024      0.35305
{'18:1 (d9) SM'          }    6.8999        0.08336       1.2081
{'15:0-18:1(d7) PC'      }     7.989       0.072533      0.90791
{'15:0-18:1(d7) DG'      }    12.085       0.097445      0.80631
{'15:0-18:1 (d7)-15:0 TG'}    17.487       0.029701      0.16984
{'Cholesterol (d7)'      }    18.247       0.032275      0.17687

The problem arises when the regular expression built from the LipidData value PC is compared with the foundpattern value {'18:1(d7) LPC'}: it produces a match, and I don't know how to avoid that. I only need to find Short values that appear as exact, whole words within foundpattern.ISDs. The same problem would appear, hypothetically, if foundpattern contained Cer_NS, which would match not only the LipidData value Cer_NS but also Cer.

I believed that turning each value into a group (wrapping it in parentheses, as in the code below) would solve this, but the Short values are slightly modified versions of one another, hence the repetition. I know I am missing something, but I don't know what.

Is there any way to avoid the repeated matches? As the OUTPUT shows, the Codes cell array should have only 7 entries instead of 8.

CODE:

Codes={}
for j=1:size(LipidData,1)
  expression=strcat("(",char(LipidData{j,2}),")");
  for i=1:size(foundpattern,1)
    if regexp(char(foundpattern{i,1}),expression) ~= 0
      disp(foundpattern{i,1})
      disp(LipidData{j,2})
      Codes{end+1}=LipidData{j,2};
    end
  end
end

OUTPUT:

>> Codes

Codes =

1×8 cell array

Columns 1 through 6

{1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}

Columns 7 through 8

{1×1 cell}    {1×1 cell}

>> for i=1:size(Codes,2)
Codes{i}
end

ans =

  1×1 cell array

  {'Cholesterol'}


ans =

  1×1 cell array

  {'DG'}


ans =

  1×1 cell array

  {'LPC'}


ans =

  1×1 cell array

  {'MG'}


ans =

  1×1 cell array

  {'PC'}


ans =

  1×1 cell array

  {'PC'}


ans =

  1×1 cell array

  {'SM'}


ans =

  1×1 cell array

  {'TG'}

>> 

Solution

  • You need

    expression=strcat('\<(', regexptranslate('escape', char(LipidData{j,2})),')\>')
    

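    \< matches the start of a word and \> matches the end of a word, so PC can no longer match inside LPC. Because _ counts as a word character, Cer likewise no longer matches inside Cer_NS. regexptranslate('escape', char(LipidData{j,2})) escapes any special regex metacharacters in the Short value so that it is matched literally in the pattern.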
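
As a rough illustration of how the pattern above behaves inside the loop from the question, here is a minimal sketch. The variables `shorts` and `isds` are hypothetical stand-ins for `LipidData{:,2}` and `foundpattern{:,1}`, and the entry `'Cer_NS (d7)'` is invented to cover the Cer/Cer_NS case mentioned in the question; it is not part of the original data.

```matlab
% Hypothetical stand-ins for the Short column and the ISDs column.
shorts = {'Cer'; 'Cer_NS'; 'LPC'; 'PC'};
isds   = {'18:1(d7) LPC'; '15:0-18:1(d7) PC'; 'Cer_NS (d7)'};  % last entry is made up

codes = {};
for j = 1:numel(shorts)
    % Escape regex metacharacters, then require a word boundary on each side.
    expression = ['\<', regexptranslate('escape', shorts{j}), '\>'];
    for i = 1:numel(isds)
        % regexp returns [] when there is no match, so test with isempty.
        if ~isempty(regexp(isds{i}, expression, 'once'))
            codes{end+1} = shorts{j}; %#ok<AGROW>
        end
    end
end

disp(codes)   % expected to list Cer_NS, LPC and PC once each
```

With these inputs, `PC` should match only `'15:0-18:1(d7) PC'` and `Cer` should match nothing, so each ISD contributes exactly one code. Testing with `isempty` rather than `~= 0` is a stylistic choice here; it avoids comparing an array of start indices against a scalar.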