python nltk text-mining

Python NLTK dispersion plot has its y (vertical) axis in backwards / reversed order


Since last month, NLTK's dispersion_plot seems to draw the y (vertical) axis in reversed order on my machine. This is probably related to the software versions I have (I am on a school virtual machine).

versions: nltk 3.8.1 matplotlib 3.7.2 Python 3.9.13

code:

from nltk.draw.dispersion import dispersion_plot
words=['aa','aa','aa','bbb','cccc','aa','bbb','aa','aa','aa','cccc','cccc','cccc','cccc']
targets=['aa','bbb', 'f', 'cccc']
dispersion_plot(words, targets)


expected: aa is present at the beginning, and cccc at the end. actual: it's backwards! Also notice that f should be completely absent; instead, bbb is absent.

conclusion: Y axis is backwards.


Solution

  • I found the source code for nltk.draw.dispersion and it seems there is a mistake in it.

    def dispersion_plot(text, words, ignore_case=False, title="Lexical Dispersion Plot"):
        """
        Generate a lexical dispersion plot.
    
        :param text: The source text
        :type text: list(str) or iter(str)
        :param words: The target words
        :type words: list of str
        :param ignore_case: flag to set if case should be ignored when searching text
        :type ignore_case: bool
        :return: a matplotlib Axes object that may still be modified before plotting
        :rtype: Axes
        """
    
        try:
            import matplotlib.pyplot as plt
        except ImportError as e:
            raise ImportError(
                "The plot function requires matplotlib to be installed. "
                "See https://matplotlib.org/"
            ) from e
    
        word2y = {
            word.casefold() if ignore_case else word: y
            for y, word in enumerate(reversed(words))  # <--- HERE
        }
        xs, ys = [], []
        for x, token in enumerate(text):
            token = token.casefold() if ignore_case else token
            y = word2y.get(token)
            if y is not None:
                xs.append(x)
                ys.append(y)
    
        _, ax = plt.subplots()
        ax.plot(xs, ys, "|")
        ax.set_yticks(list(range(len(words))), words, color="C0")  # <--- HERE
        ax.set_ylim(-1, len(words))
        ax.set_title(title)
        ax.set_xlabel("Word Offset")
        return ax
    
    
    
    if __name__ == "__main__":
        import matplotlib.pyplot as plt
    
        from nltk.corpus import gutenberg
    
        words = ["Elinor", "Marianne", "Edward", "Willoughby"]
        dispersion_plot(gutenberg.words("austen-sense.txt"), words)
        plt.show()
    

    It calculates word2y using reversed(words)

    for y, word in enumerate(reversed(words))
    

    but later it calls ax.set_yticks() with words, where it should use reversed(words)

    ax.set_yticks(list(range(len(words))), words, color="C0")
    

    (or it should calculate word2y without using reversed()).

    I added # <--- HERE in code above to show these places.
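
To make the mismatch concrete, here is a minimal sketch in plain Python (no plotting) that mirrors the two marked lines, using the targets from the question. The names tick_labels and mislabelled are only for illustration and do not appear in NLTK.

```python
targets = ['aa', 'bbb', 'f', 'cccc']

# Mirrors the first marked line: y positions come from the reversed list,
# so the last target gets y=0 and the first target gets the highest y.
word2y = {word: y for y, word in enumerate(reversed(targets))}

# Mirrors the second marked line: tick i is labelled targets[i],
# i.e. the labels are NOT reversed.
tick_labels = dict(enumerate(targets))

# For each word, the label that ends up next to the row where it is drawn.
mislabelled = {word: tick_labels[y] for word, y in word2y.items()}

for word, label in mislabelled.items():
    print(f"{word!r} is drawn on the row labelled {label!r}")
```

With these targets, the (empty) row for f ends up labelled bbb, and the marks for bbb end up on the row labelled f, which matches what the question describes.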

    It may need to be reported as an issue.

    For now you can take the ax that the function returns and call set_yticks with the reversed labels to correct it.
    In your code it will be targets instead of words

    ax = dispersion_plot(words, targets)
    
    ax.set_yticks(list(range(len(targets))), list(reversed(targets)), color="C0")
    

    Full working code

    import matplotlib.pyplot as plt
    from nltk.draw.dispersion import dispersion_plot
    
    words = ['aa','aa','aa','bbb','cccc','aa','bbb','aa','aa','aa','cccc','cccc','cccc','cccc']
    targets = ['aa','bbb', 'f', 'cccc']
    
    ax = dispersion_plot(words, targets)
    # workaround: relabel the ticks in the order the rows were actually plotted
    ax.set_yticks(list(range(len(targets))), list(reversed(targets)), color="C0")
    
    plt.show()
    



    EDIT: It seems this problem was reported a few months ago, and reversed() has already been added to the code on GitHub, so it will probably work in the next version.

    dispersion plot not working properly · Issue #3133 · nltk/nltk

    dispersion plot not working properly by Apros7 · Pull Request #3134 · nltk/nltk
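
One caveat with the relabelling workaround: once a fixed NLTK is installed, it would reverse the labels again and make the plot wrong in the other direction. If the same script has to run against different NLTK versions, one option is to draw the plot without the NLTK helper. Below is a rough sketch of that idea, not NLTK's own code: dispersion_points and plot_dispersion are hypothetical names, and the plotting part simply follows the same matplotlib calls as the source shown above, with positions and labels taken from the same reversed list.

```python
def dispersion_points(text, targets, ignore_case=False):
    """Return (xs, ys, labels), where labels[i] names the row drawn at y == i."""
    norm = (lambda s: s.casefold()) if ignore_case else (lambda s: s)
    # Reverse once and use the SAME list for positions and labels,
    # so the first target ends up on the top row with its own label.
    labels = list(reversed(targets))
    word2y = {norm(word): y for y, word in enumerate(labels)}
    xs, ys = [], []
    for x, token in enumerate(text):
        y = word2y.get(norm(token))
        if y is not None:
            xs.append(x)
            ys.append(y)
    return xs, ys, labels


def plot_dispersion(text, targets, ignore_case=False,
                    title="Lexical Dispersion Plot"):
    """Draw the plot with matplotlib directly and return the Axes."""
    import matplotlib.pyplot as plt  # imported here so the helper above has no dependencies

    xs, ys, labels = dispersion_points(text, targets, ignore_case)
    _, ax = plt.subplots()
    ax.plot(xs, ys, "|")
    ax.set_yticks(list(range(len(labels))))
    ax.set_yticklabels(labels, color="C0")
    ax.set_ylim(-1, len(labels))
    ax.set_title(title)
    ax.set_xlabel("Word Offset")
    return ax


words = ['aa', 'aa', 'aa', 'bbb', 'cccc', 'aa', 'bbb',
         'aa', 'aa', 'aa', 'cccc', 'cccc', 'cccc', 'cccc']
targets = ['aa', 'bbb', 'f', 'cccc']
xs, ys, labels = dispersion_points(words, targets)
```

To show the figure, call ax = plot_dispersion(words, targets) and then plt.show() after importing matplotlib.pyplot as plt. With the data from the question, aa should be on the top row, cccc on the bottom row, and the row for f should stay empty.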