
I'd like to build a site map generator in a single line of bash code


My static site generator has a /pages directory with a bunch of source files. The names of those source files need to be prepended with my website URL and then concatenated into a single file (either .txt or .xml).

Here's what I have so far:

find ./pages -name '*.js' \( -exec echo "$FILE"/{} \; -o -print \)

This command prints the names of the files with the extra pages directory up front like this:

/pages/index.js
/pages/articles/article-title.js
/pages/about/index.js
/pages/about/team.js
...

I'm not fantastic with bash. How do I edit each line to include https://www.example.com in front of each line, removing /pages?

Also, I'll need to remove the word index anywhere it appears. /pages/about/index.js should become https://www.example.com/about for example, and /pages/about/team.js should become https://www.example.com/about/team

Bonus

A list of URLs in a .txt file is an acceptable sitemap and I'm happy with that, but if we want to go beyond, we can produce an XML file that has last modified dates.

date -r pages/about.js +"%Y-%m-%d" | tee test.xml

This command writes the correct modified date, but I'd have to get it in this final format:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com</loc>
    <lastmod>2023-03-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
    <lastmod>2023-03-12</lastmod>
  </url>
</urlset>

Solution

  • find ./pages -name '*.js' \( -exec echo "$FILE"/{} \; -o -print \)
    

    This command prints the names of the files with the extra pages directory up front like this:

    /pages/index.js
    /pages/articles/article-title.js
    /pages/about/index.js
    /pages/about/team.js
    ...
    

    No, it doesn't. Provided that variable FILE has not been set, that command will produce output lines of this form:

    /./pages/index.js
    

    If FILE had been set to a non-null string, then the output would differ even more from what you say. To produce output in the form you show, I suppose that you were actually running this similar command:

    find pages -name '*.js' \( -exec echo "$FILE"/{} \; -o -print \)
    

    And given that the $FILE part is doing nothing, and the -o -print is relevant only in the unlikely event that echo returns a failure status, a simpler way to achieve the same thing would be

    find pages -name '*.js' -exec echo /{} \;
    

    Since you want to modify the beginnings of the output lines, however, it's not particularly useful to prepend a slash, and since -print is the default, I would start with just

    find pages -name '*.js'
    

    Then, sed is one of the typical tools for modifying lines of a file. It looks like you want something along these lines to replace the leading pages of each result line with the first part of the corresponding URL:

    find pages -name '*.js' | sed 's|^pages|https://www.example.com|'
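
The question also asks for the .js extension and a trailing index to be dropped, which the substitution above does not yet do. A minimal sketch of one way to extend it with further sed expressions follows. The function name sitemap_urls and its two parameters are hypothetical, and it assumes that index only needs removing when it is the final path component, that file names contain no newlines, and that the base URL contains none of the characters |, & or \.

```shell
# Hypothetical helper (not from the original post): print one URL per *.js
# file found under a directory.
#   $1 = source directory (assumed default: pages)
#   $2 = base URL         (assumed default: https://www.example.com)
sitemap_urls() {
  dir=${1:-pages}
  base=${2:-https://www.example.com}
  # Run find from inside the directory so every result starts with "./".
  ( cd "$dir" && find . -name '*.js' ) |
    sed -e 's|^\.||' \
        -e 's|\.js$||' \
        -e 's|/index$||' \
        -e "s|^|$base|" |
    LC_ALL=C sort   # sorting is optional; it just makes the output stable
}
```

Called as `sitemap_urls pages https://www.example.com > sitemap.txt`, this would write the plain-text sitemap. The order of the sed expressions matters: the extension has to go before `/index$` can match at the end of the line.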
    

    There's no one-liner for producing the XML format you describe using common shell utilities. It would be possible to write a script to generate that, but I leave it as an exercise.
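
As a possible starting point for that exercise, here is a rough sketch of such a script, written as a shell function. The name sitemap_xml and its parameters are hypothetical. It reuses the `date -r FILE` form from the question, which is assumed to be supported by the local date implementation, and it assumes the directory argument has no trailing slash, file names contain no newlines, and the resulting URLs need no XML escaping.

```shell
# Hypothetical helper (not from the original post): print an XML sitemap
# with one <url> entry per *.js file found under a directory.
#   $1 = source directory, no trailing slash (assumed default: pages)
#   $2 = base URL (assumed default: https://www.example.com)
sitemap_xml() {
  dir=${1:-pages}
  base=${2:-https://www.example.com}
  printf '%s\n' '<?xml version="1.0" encoding="UTF-8"?>'
  printf '%s\n' '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
  find "$dir" -name '*.js' | LC_ALL=C sort | while IFS= read -r file; do
    path=${file#"$dir"}   # drop the leading directory name
    path=${path%.js}      # drop the extension
    path=${path%/index}   # drop a trailing /index
    # date -r prints the file's modification time, as in the question.
    printf '  <url>\n    <loc>%s%s</loc>\n    <lastmod>%s</lastmod>\n  </url>\n' \
      "$base" "$path" "$(date -r "$file" +%Y-%m-%d)"
  done
  printf '%s\n' '</urlset>'
}
```

It could be run as `sitemap_xml pages https://www.example.com > sitemap.xml`. The parameter expansions do the same trimming as the sed substitutions, but per file, so that the original file name is still available for the date lookup.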