pythonregexkindle

Regex not specific enough


So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (it's usually information about the book title, author, page number, etc.). I thought it was functional but sometimes there would random be periods (.) on certain lines of the output. At first I thought the program was just buggy but then I realized that the regex I'm using to match the books title and author was also matching any sentence that ended in brackets.

This is the code for the regex that I'm using to detect the books title and author

titleRegex = re.compile('(.+)\((.+)\)')

Example

In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted

Here is the unformatted text file that goes into my program

The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.

Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?

I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)

import re
titleRegex = re.compile('(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')

regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]

string = open("/Users/devinnagami/myclippings.txt").read()

for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString

finalText = newString.split('             ')

with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)

Solution

  • There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.

    For instance:

    quoteInfoRegex = re.compile(
        r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" + 
        r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" + 
        r"\n" + 
        r"(?P<quote>.*?)\n", flags=re.MULTILINE)
    
    for m in quoteInfoRegex.finditer(data):
        print(m.groupdict())
    

    This will pull out each line of the text, and parse it, knowing that the book title is the first line after the equals, and the quote itself is below that.