html replace find git-bash

How to find and remove a code block across multiple HTML files in subdirectories


Let's say I'm working on a project where I need to remove a specific block of SEO-related code from hundreds of index.html files located in various subdirectories of a root directory. The project structure looks something like this:

    /root-directory/
        folder1/
            index.html
        folder2/
            index.html
        ...

Problem:

Each index.html file contains a block of SEO-related meta tags enclosed between two specific HTML comments. Here’s an example of the code block I need to remove:

`
<!-- All in One SEO Pro 4.7.6 - aioseo.com -->
<meta name="description" content="Sample description">
<meta name="keywords" content="sample, keywords">
<!-- All in One SEO Pro -->`

The comments that mark the start and end of the block are:

Start comment:

`<!-- All in One SEO Pro 4.7.6 - aioseo.com -->`

End comment:

`<!-- All in One SEO Pro -->`

Challenges:

  1. The content (meta tags) between these comments varies across files.

  2. The title and other attributes in the meta tags are unique in each file.

  3. The version number in the start comment (4.7.6) may change in other files.

What I Want to Do:

I need to:

  1. Find all index.html files within the root directory and its subdirectories.

  2. Target and remove only that block of code (including the start and end comments) in each file.

I've even tried Visual Studio Code's Find in Files and Replace with a regular expression, but it didn't work; it kept reporting that no matches were found.


Solution

  • Here are the commands to:

    1. Find all index.html files within the root directory and its subdirectories.

      grep --include "*.html" -rl '<!-- All in One SEO Pro -->' .
      
    2. Remove the entire block of code (including the comments) between the start and end comments in each file.

      # macOS (BSD sed): -i needs a backup-suffix argument; '' means "no backup"
      sed -i '' '/<!-- All in One SEO Pro 4.7.6 - aioseo.com -->/,/<!-- All in One SEO Pro -->/d' /path/to/index.html
      
      # GNU sed (Linux, Git Bash): -i takes no separate argument
      sed -i '/<!-- All in One SEO Pro 4.7.6 - aioseo.com -->/,/<!-- All in One SEO Pro -->/d' /path/to/index.html
      
      # The start pattern matches version 4.7.6 only; replace 4.7.6 with [0-9.]* to match any version number
      

    This will modify several files, so I suggest backing up your files before running it.
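One way to follow the backup advice, and to cope with the changing version number mentioned in the question, is sketched below. It is a sketch under assumptions rather than a tested recipe for your project: the scratch file, the `demo_dir`, `demo_file`, `start`, and `end` names, and the `[0-9.]*` version pattern are all illustrative and not part of the original commands. The `-i.bak` form (suffix attached, no space) is accepted by both GNU sed and the BSD sed on macOS, and leaves an `index.html.bak` copy next to each edited file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch file standing in for one of the real index.html files.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/index.html"
cat > "$demo_file" <<'EOF'
<head>
<title>Sample</title>
<!-- All in One SEO Pro 3.1.6 - aioseo.com -->
<meta name="description" content="Sample description">
<meta name="keywords" content="sample, keywords">
<!-- All in One SEO Pro -->
</head>
EOF

# Assumption: [0-9.]* stands in for the version number, so 3.1.6, 4.7.6, etc. all match.
start='<!-- All in One SEO Pro [0-9.]* - aioseo\.com -->'
end='<!-- All in One SEO Pro -->'

# Dry run: -n with the p command prints only the lines that would be deleted.
sed -n "/$start/,/$end/p" "$demo_file"

# Real run: -i.bak edits in place and keeps the original as index.html.bak.
sed -i.bak "/$start/,/$end/d" "$demo_file"

# Show the edited file.
cat "$demo_file"
```

Running the dry run first makes it easy to confirm that only the SEO block is selected before anything is changed; once the results look right, the `.bak` files can be deleted.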

    Method 1: use xargs to execute sed

    # I'm using macOS; remove the '' after sed -i if you're not on a Mac
    
    grep --include "*.html" -rl '<!-- All in One SEO Pro -->' . |
        xargs -I % sed -i '' '/<!-- All in One SEO Pro 4.7.6 - aioseo.com -->/,/<!-- All in One SEO Pro -->/d' %
    

    Method 2: use for loop

    # create an array of files (relies on word splitting, so paths must not contain spaces)
    MATCHING_FILES=($(grep --include "*.html" -rl '<!-- All in One SEO Pro -->' .))
    
    for i in "${MATCHING_FILES[@]}"; do
        echo "Editing $i"
        # GNU sed
        # sed -i '/<!-- All in One SEO Pro 4.7.6 - aioseo.com -->/,/<!-- All in One SEO Pro -->/d' "$i"
        # macOS
        # sed -i '' '/<!-- All in One SEO Pro 4.7.6 - aioseo.com -->/,/<!-- All in One SEO Pro -->/d' "$i"
    done
    

    Uncomment the sed command that matches your operating system.
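A possible third variant, not part of the two methods above, uses `find` so that only files actually named index.html are touched and paths containing spaces are handled safely. Treat it as a sketch under assumptions: the scratch tree, the `make_page` helper, and the `[0-9.]*` version pattern are made up for illustration. The first `-exec` acts as a filter (it succeeds only when the start comment is present), and the second runs one `sed` process per matching file, keeping a `.bak` backup.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: writes a sample index.html with the given plugin version.
make_page() {
    mkdir -p "$2"
    printf '%s\n' '<head>' \
        "<!-- All in One SEO Pro $1 - aioseo.com -->" \
        '<meta name="description" content="Sample description">' \
        '<!-- All in One SEO Pro -->' \
        '</head>' > "$2/index.html"
}

# Hypothetical scratch tree standing in for /root-directory/.
root=$(mktemp -d)
make_page 4.7.6 "$root/folder1"
make_page 3.1.6 "$root/folder 2"

# Assumption: [0-9.]* matches whatever version number appears in the start comment.
start='<!-- All in One SEO Pro [0-9.]* - aioseo\.com -->'
end='<!-- All in One SEO Pro -->'

# Edit only index.html files that contain the start comment, one sed per file.
find "$root" -type f -name 'index.html' \
    -exec grep -q "$start" {} \; \
    -exec sed -i.bak "/$start/,/$end/d" {} \;

# List any index.html that still contains the marker (prints nothing when all are clean).
grep -rl --include='index.html' 'aioseo.com' "$root" || true
```

To use it on a real project, replace `"$root"` with the path to the root directory and drop the scratch-tree setup.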