python mallet

IndexError: list index out of range in Python Script


I'm new to Python, so I apologize if this question has already been answered. I've used this script before and it worked, so I'm not at all sure what is wrong.

I'm trying to transform a MALLET output document into a long list of file, topic, and weight rows rather than a wide list of documents, topics, and weights.

Here's what the original CSV I'm trying to convert looks like, with 30 topics in it (it's a text file called mb_composition.txt):

0   file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Abizaid.txt    6.509147794508226E-6    1.8463345214533957E-5   3.301298069640119E-6    0.003825178550032757    0.15240841618294929 0.03903974304065183 0.10454783676528623 0.1316719812119471  1.8018057013225344E-5   4.869261713020613E-6    0.0956868156114931  1.3521101623203115E-5   9.514591058923748E-6    1.822741355900598E-5    4.932324961835634E-4    2.756817586271138E-4    4.039186874601744E-5    1.0503346606335033E-5   1.1466132458804392E-5   0.007003443189848799    6.7094360963952E-6  0.2651753488982284  0.011727025879070194    0.11306132549594633 4.463460490946615E-6    0.0032751230536005056   1.1887304822238514E-5   7.382714572306351E-6    3.538808652077042E-5    0.07158823129977483
1   file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Jeffrey,%20Jim%20-%20Chk5-%20ASC%20-%20FINAL%20-%20Sept%202017.docx.txt    4.296636200313062E-6    1.218750594272488E-5    1.5556725986514498E-4   0.043172816021532695    0.04645757277949794 0.01963429696910822 0.1328206370818606  0.116826297071711   1.1893574776047563E-5   3.2141605637859693E-6   0.10242945223692496 0.010439315937573735    0.2478814493196687  1.2031769351093548E-5   0.010142417179693447    2.858721603853616E-5    2.6662348272204834E-5   6.9331747684835E-6  7.745091995495631E-4    0.04235638910274044 4.428844900369446E-6    0.0175105406405736  0.05314379308820005 0.11788631730736487 2.9462944350793084E-6   4.746133386282654E-4    7.846714475661223E-6    4.873270616886766E-6    0.008919869163605806    0.02884824479155971

And here's the python script I'm trying to use to convert it:

infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')

outfile.write('file,topicnum,weight\n')
for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    #outfile.write(fn[46:] + ",")
    for i in range(0,59):
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')

I'm running this in the terminal with python reshape.py and I get this error:

Traceback (most recent call last):
  File "reshape.py", line 12, in <module>
    outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
IndexError: list index out of range

Any idea what I'm doing wrong here? I can't seem to figure it out and am frustrated because I know I've used this script many times before with success! If it helps, I'm on Mac OS X with Python 2.7.10.


Solution

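  • The problem is that the loop assumes more values per line than the file contains. range(0, 59) runs 59 times and each pass reads two entries, so the script expects 118 values after the filename, while the sample lines have only 30. With 30 values, topics[i*2] raises the IndexError as soon as i reaches 15.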

    If you just want to write out every pair of values on a line, define the range by the number of values actually present on that line instead of a hard-coded count:

    for i in range(len(topics) // 2):
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
    

    Stated more Pythonically, it would look something like this:

    # Group the topics into tuple-pairs for easier management
    paired_topics = [tuple(topics[i:i+2]) for i in range(0, len(topics), 2)]
    # Iterate the paired topics and print them each on a line of output
    for topic in paired_topics:
        outfile.write(fn[46:] + ',' + ','.join(topic) + '\n')
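
A minimal sketch of the same reshaping as a single function, for anyone who wants to try it without the input file. The name long_rows and its paired flag are hypothetical, not part of the original script. With paired=True it assumes the columns after the filename alternate topic, weight, as the script above does. The sample rows in the question look like they hold one weight per topic instead (30 values for 30 topics), so paired=False treats the column position as the topic number; that reading of the file is an assumption.

```python
def long_rows(line, paired=True):
    """Turn one tab-separated composition line into (file, topic, weight) rows.

    Hypothetical helper: 'paired' selects which column layout is assumed.
    """
    # Strip the trailing newline so it does not end up inside the last value
    tokens = line.rstrip('\n').split('\t')
    fn = tokens[1]
    values = tokens[2:]
    if paired:
        # Layout assumed: topic, weight, topic, weight, ...
        # Step two at a time and stop before an unpaired trailing value
        return [(fn, values[i], values[i + 1])
                for i in range(0, len(values) - 1, 2)]
    # Layout assumed: one weight per topic, in topic order
    return [(fn, str(topic), weight) for topic, weight in enumerate(values)]


if __name__ == '__main__':
    sample = "0\tfile:/a.txt\t0.5\t0.25\t0.25\n"
    for row in long_rows(sample, paired=False):
        print(','.join(row))
```

Each returned tuple can be written out with outfile.write(','.join(row) + '\n'). The sketch keeps the full file path; slice it as the original script does if only part of the name is wanted.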