python encoding compression google-diff-match-patch

Efficient way to store tuples of diff info without redundancy


I have this main text: How can I run java script from a local folder?

Calling diff.diff_main(diff(), "How can I run java script from a local folder?", "How can I run Javascript from a local folder?")

returns [(0, 'How can I run '), (-1, 'j'), (1, 'J'), (0, 'ava'), (-1, ' '), (0, 'script from a local folder?')]

This is not a big problem with a short string like this one, but it is with bigger strings of around 40,000 characters, which are common in my application. I chose this short string for clarity and readability. I'm looking for a way to store text positions (from start position to end position) instead of the actual text for the unchanged segments. The positions will later be matched against the original text.

For example, instead of [(0, 'How can I run '), (-1, 'j'), (1, 'J'), (0, 'ava'), (-1, ' '), (0, 'script from a local folder?')] I would have [(0, '0,14'), (-1, 'j'), (1, 'J'), (0, '15,18'), (-1, ' '), (0, '19,44')]

The positions encoded in the tuples are decoded against the original text: 0,14 means position 0 to 14, i.e. How can I run ; 15,18 means position 15 to 18 in the original text, i.e. ava; and so on.

Each segment can be retrieved later like this: originaltext[0:14]

I have tried the following, and it gets very close:

a=[(0, 'How can I run '), (-1, 'j'), (1, 'J'), (0, 'ava'), (-1, ' '), (0, 'script from a local folder?')]

b='How can I run java script from a local folder?'

result={}

positioncount = 0
for x, y in enumerate(a):
    if y[0] == 0:
        if positioncount == 0:
            result[x]={y[0]:len(y[1])}
            positioncount+=len(y[1])
        else:
            result[x]={y[0]:(len(y[1])+positioncount,len(y[1]))}
    else:
        result[x]={y[0]:y[1]}
        positioncount-=len(y[1])

But printing result gives me {0: {0: 14}, 1: {-1: 'j'}, 2: {1: 'J'}, 3: {0: (15, 3)}, 4: {-1: ' '}, 5: {0: (38, 27)}}, which is not correct. It should give {0: {0: 14}, 1: {-1: 'j'}, 2: {1: 'J'}, 3: {0: (15, 18)}, 4: {-1: ' '}, 5: {0: (19, 44)}}

What am I doing wrong here? Is there any way to do this right? If you have an alternative, I'm glad to take it. Thanks!


Solution

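  • Why do you create dictionaries with running indexes as keys? A plain list is enough. Keep one running position in the original text and advance it only for unchanged and deleted segments (v <= 0), since insertions do not consume any characters of the original. Try this: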

    a=[(0, 'How can I run '), (-1, 'j'), (1, 'J'), (0, 'ava'), (-1, ' '), (0, 'script from a local folder?')]
    
    b='How can I run java script from a local folder?'
    
    result = []
    
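    position = 0  # current index into the original text b
    for v, txt in a:
        if v == 0:
            # unchanged segment: store its (start, end) slice of b instead of the text
            result.append((0, (position, position+len(txt))))
        else:
            # insertion or deletion: keep the text itself
            result.append((v, txt))
        # unchanged and deleted segments consume characters of the original
        if v<=0:
            position += len(txt)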
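
To check that the position-based form loses nothing, here is a minimal round-trip sketch. The function names encode, decode and new_text are hypothetical helpers made up for this illustration; they are not part of diff-match-patch. It assumes the diff list describes b exactly, so that every unchanged segment is a slice of b. Note that with this sample the last range comes out as (19, 46) rather than (19, 44), because the original string is 46 characters long.

```python
def encode(diffs):
    """Replace the text of unchanged segments with (start, end) slices of the original."""
    result = []
    position = 0  # current index into the original text
    for op, text in diffs:
        if op == 0:
            result.append((0, (position, position + len(text))))
        else:
            result.append((op, text))
        if op <= 0:
            # unchanged and deleted segments consume characters of the original
            position += len(text)
    return result


def decode(encoded, original):
    """Expand (start, end) slices back into text, giving the usual diff list."""
    diffs = []
    for op, payload in encoded:
        if op == 0:
            start, end = payload
            diffs.append((0, original[start:end]))
        else:
            diffs.append((op, payload))
    return diffs


def new_text(encoded, original):
    """Rebuild the modified text: keep unchanged segments and insertions, skip deletions."""
    parts = []
    for op, payload in encoded:
        if op == 0:
            start, end = payload
            parts.append(original[start:end])
        elif op == 1:
            parts.append(payload)
    return "".join(parts)


a = [(0, 'How can I run '), (-1, 'j'), (1, 'J'), (0, 'ava'), (-1, ' '),
     (0, 'script from a local folder?')]
b = 'How can I run java script from a local folder?'

encoded = encode(a)
print(encoded)
# [(0, (0, 14)), (-1, 'j'), (1, 'J'), (0, (15, 18)), (-1, ' '), (0, (19, 46))]
print(decode(encoded, b) == a)  # True
print(new_text(encoded, b))     # How can I run Javascript from a local folder?
```

Deleted segments are also present in the original text, so they could be stored as positions in the same way. Only inserted text really has to be kept verbatim.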