html regex notepad++ calibre

Using Notepad++ find and replace with regular expressions


I have an HTML menu file, extracted by a CHM decoder, that contains a list of HTML pages:

(7,0,"Icons Used in This Book","final/pref04.html");
(8,0,"Command Syntax Conventions","final/pref05.html");
(9,0,"Introduction","final/pref06.html");
(10,0,"Part I: Introduction and Overview of Service","final/part01.html");
(11,10,"Chapter 1. Overview","final/ch01.html");
(12,11,"Technology Motivation","final/ch01lev1sec1.html");

From this I want to create a 'table of contents' file for Calibre (an HTML file that contains links to all the other files in the desired order). The final file should look like this:

<a href="final/pref04.html">Icons Used in This Book</a><br/>
<a href="final/pref05.html">Command Syntax Conventions</a><br/>
.
.
.

So I need to remove the numeric prefixes with a regular expression, wrap each entry in an <a> tag with an href attribute to make it a hyperlink, and swap the positions of the URL and the title. Can anyone show how to do this with Notepad++?


Solution

  • I think this will do it. I'm on a Mac, so I can't test it in Notepad++, but it works in Dreamweaver. It assumes each entry sits on its own line.

    Find:

    \(.*?"(.*?)","(.*?)".*
    

    Replace:

    <a href="$2">$1</a><br/>
    

    File:

    (7,0,"Icons Used in This Book","final/pref04.html");
    (8,0,"Command Syntax Conventions","final/pref05.html");
    (9,0,"Introduction","final/pref06.html");
    (10,0,"Part I: Introduction and Overview of Service","final/part01.html");
    (11,10,"Chapter 1. Overview","final/ch01.html");
    (12,11,"Technology Motivation","final/ch01lev1sec1.html");
    

    After Replace All:

    <a href="final/pref04.html">Icons Used in This Book</a><br/>
    <a href="final/pref05.html">Command Syntax Conventions</a><br/>
    <a href="final/pref06.html">Introduction</a><br/>
    <a href="final/part01.html">Part I: Introduction and Overview of Service</a><br/>
    <a href="final/ch01.html">Chapter 1. Overview</a><br/>
    <a href="final/ch01lev1sec1.html">Technology Motivation</a><br/>
    

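    If your editor lets . match newlines (in Notepad++, the ". matches newline" checkbox), the trailing .* will run past the end of the line; change it to .*?\n so that each match stops at the first newline. Because that newline is then part of the match, you may also want to add a newline to the replacement to keep one link per line.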

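    I should probably explain the regex as well, in case you want to modify it.

    The leading \ escapes the ( so the regex matches a literal parenthesis instead of opening a group. The first .*? then skips everything up to the first ": . matches any single character, * repeats the preceding item zero or more times, and the trailing ? makes the repetition lazy, so it matches as few characters as possible and stops at the first " rather than the last. Each (.*?) captures the text between a pair of quotes, and the literal "," between the two groups matches the separator between the title and the URL. The final .* consumes the rest of the line so that it is replaced as well. The parentheses around each .*? store the matched text in $1 and $2, numbered in the order the groups appear in the regex, which is why the replacement puts $2 (the URL) in the href and $1 (the title) in the link text.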
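
If you would rather check the pattern outside an editor, here is a minimal sketch in Python. It is not part of the original answer and makes a few assumptions: Python's re module writes backreferences as \1 and \2 instead of $1 and $2, the names PATTERN, REPLACEMENT and menu are made up for illustration, and re.DOTALL stands in for an editor option that lets . match newlines. It applies the same Find and Replace strings, shows the lazy versus greedy difference described above, and shows why the .*?\n variant is needed once . can cross line boundaries.

```python
import re

# Same expression as the Find field above (raw string keeps the backslash).
PATTERN = re.compile(r'\(.*?"(.*?)","(.*?)".*')

# Same replacement, written with Python-style backreferences (\1, \2).
REPLACEMENT = r'<a href="\2">\1</a><br/>'

# A few lines from the menu file in the question.
menu = (
    '(7,0,"Icons Used in This Book","final/pref04.html");\n'
    '(8,0,"Command Syntax Conventions","final/pref05.html");\n'
    '(11,10,"Chapter 1. Overview","final/ch01.html");'
)

# By default . does not match a newline, so each match ends at the end of
# its own line and the line breaks are left untouched.
toc = PATTERN.sub(REPLACEMENT, menu)
print(toc)

# Lazy vs greedy: .*? stops at the first closing quote, .* at the last one.
line = menu.splitlines()[0]
lazy = re.search(r'"(.*?)"', line).group(1)
greedy = re.search(r'"(.*)"', line).group(1)
print(lazy)
print(greedy)

# If . is allowed to match newlines, the trailing .* swallows the rest of
# the file, so only one link survives.
swallowed = re.sub(PATTERN.pattern, REPLACEMENT, menu, flags=re.DOTALL)

# Ending the pattern with .*?\n stops each match at the first newline; the
# newline is consumed, so the replacement puts it back. This variant needs
# every entry, including the last, to end with a newline.
fixed = re.sub(
    r'\(.*?"(.*?)","(.*?)".*?\n',
    REPLACEMENT + r'\n',
    menu + '\n',
    flags=re.DOTALL,
)
print(fixed, end='')
```

The lazy search returns only the title, while the greedy one runs through to the last quote on the line, which is why the original pattern uses .*? inside the groups.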