python nlp pyldavis

Python extracting contents from list


I am putting together a text analysis script in Python using pyLDAvis, and I am trying to turn one of the outputs into something cleaner and easier to read. The function that returns the top 5 words for each of 4 topics gives a list that looks like:

    [(0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
     (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
     (2, '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
     (3, '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"')]

Ideally, I want to turn this into a dataframe where the first row contains the first word of each topic along with its score, and the columns alternate between a word and its score, i.e.:

r1col1 is 'de', r1col2 is 0.008, r1col3 is 'sas', r1col4 is 0.009, etc, etc.

Is there a way to extract the contents of the list and separate the values given the format it is in?


Solution

  • Assuming the output is consistent with your example, it should be fairly straightforward. The list contains 2-tuples whose second element is a string, and Python offers plenty of operations on strings.

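    s.split(" + ") returns a list of the terms in the string s, split along the ' + ' separator. (Avoid naming the variable str, since that shadows the built-in type.)

    To then extract the word and the score from each term, you could use Python's 're' module for matching regular expressions. Note the raw strings, the escaped '.', and the capture group that leaves out the surrounding quotes:

    score = re.search(r'\d+\.?\d*', term)

    word = re.search(r'"(.*)"', term)

    You then use .group() to get the matches:

    float(score.group())

    word.group(1)

    You could also simply use split again, this time along '*', to separate the two parts. The returned list is in order: the score first, then the word, which still carries its quotes.

    score, word = term.split('*')

    word = word.strip('"')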
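
Putting the pieces together, here is a minimal sketch of the whole extraction, assuming every topic string follows the score*"word" pattern from the question. The names TERM, parse_topic, rows and columns are made up for illustration, and the regular expression is just one way to capture both parts in a single pass.

```python
import re

# The list from the question, copied verbatim.
topics = [
    (0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
    (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
    (2, '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
    (3, '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"'),
]

# One term looks like 0.008*"de": capture the number and the quoted word.
TERM = re.compile(r'(\d*\.?\d+)\*"([^"]*)"')


def parse_topic(topic_string):
    """Return [(word, score), ...] in the order the terms appear."""
    return [(word, float(score)) for score, word in TERM.findall(topic_string)]


parsed = [parse_topic(topic_string) for _, topic_string in topics]

# Row i holds the i-th word of every topic, each followed by its score.
# zip(*parsed) assumes all topics list the same number of terms.
rows = []
for terms_at_rank in zip(*parsed):
    row = []
    for word, score in terms_at_rank:
        row.extend([word, score])
    rows.append(row)

# Hypothetical column labels: one word/score pair per topic.
columns = []
for topic_id, _ in topics:
    columns.extend([f"topic{topic_id}_word", f"topic{topic_id}_score"])

print(rows[0])
```

If pandas is installed, pandas.DataFrame(rows, columns=columns) should then give the layout asked for in the question, with 'de' and 0.008 in the first two columns of the first row, followed by 'sas' and 0.009.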