
Macro to generate bigrams in Writer


How do I generate bigrams using LibreOffice Basic?

I can do that in Python like this...

import nltk
from nltk.tokenize import word_tokenize

# Write one bigram per line to mygram1.txt
with open("mytext.txt") as f, open("mygram1.txt", "w") as out:
    for text in f:
        tokens = word_tokenize(text)
        bigrm = nltk.bigrams(tokens)
        print(*map(' '.join, bigrm), sep='\n', file=out)

But I need a macro that I can run in LibreOffice Writer. I do not want to use Python.


Update:

Just like bigrams, NLTK has a trigrams method that I call using nltk.trigrams. And if I need four- or five-grams, there is everygrams!

import nltk
from nltk import everygrams

# everygrams(tokens, 4, 4) yields only the four-grams
with open("/home/ubuntu/mytext.txt") as f, open("myfourgram1.txt", "w") as out:
    for text in f:
        tokens = nltk.word_tokenize(text)
        for gram in everygrams(tokens, 4, 4):
            print(" ".join(gram), file=out)

Is this possible in LibreOffice Basic?


Solution

  • You could replicate the behaviour of your Python code by reusing the code from my answer to your previous question (Can you Print the wavy lines generated by Spell check in writer?). Strip out everything related to spell checking, generating alternatives and sorting, which makes the macro considerably shorter, and change the line that inserts the results into the new document so that it inserts pairs of words. Instead of keeping your input text in a .txt file, you would put it into a Writer document, and the results would appear in a new Writer document.

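    It should look something like the listing below, which also includes the helper function IsWordSeparator(). Unlike word_tokenize, it splits on whitespace only, so punctuation stays attached to the words, and word pairs carry over from one paragraph to the next.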

    Option Explicit
    
    Sub ListBigrams
    
        Dim oSource As Object 
        oSource = ThisComponent
    
        Dim oSourceCursor As Object
        oSourceCursor = oSource.getText.createTextCursor()
        oSourceCursor.gotoStart(False)
        oSourceCursor.collapseToStart()
    
        Dim oDestination As Object
        oDestination = StarDesktop.loadComponentFromURL( "private:factory/swriter",  "_blank", 0, Array() )
    
        Dim oDestinationText as Object
        oDestinationText = oDestination.getText()
    
        Dim oDestinationCursor As Object
        oDestinationCursor = oDestinationText.createTextCursor()
    
        Dim sParagraph As String, sPreviousWord As String, sThisWord As String
        Dim i As Long, nWordStart As Long, nWordEnd As Long, nChar As Long
        Dim bFirst as Boolean
        
        sPreviousWord = ""
        bFirst = true
    
        Do
            oSourceCursor.gotoEndOfParagraph(True)
            sParagraph = oSourceCursor.getString() & " " 'It is necessary to add a space to the end of
            'the string otherwise the last word of the paragraph is not recognised.
            
            nWordStart = 1
            nWordEnd = 1
            
            For i = 1 to Len(sParagraph)
            
                nChar = ASC(Mid(sParagraph, i, 1))
                
                If IsWordSeparator(nChar) Then   '1

                    ' A separator ends the current word, if one is in progress
                    If nWordEnd > nWordStart Then   '2

                        sThisWord = Mid(sParagraph, nWordStart, nWordEnd - nWordStart)

                        ' The very first word has no predecessor, so nothing is written for it
                        If bFirst Then
                            bFirst = False
                        Else
                            oDestinationText.insertString(oDestinationCursor, sPreviousWord & " " & sThisWord & Chr(13), False)
                        End If

                        sPreviousWord = sThisWord

                    End If   '2

                    ' The next word starts after this separator
                    nWordEnd = nWordEnd + 1
                    nWordStart = nWordEnd
                Else
                    nWordEnd = nWordEnd + 1
                End If    '1
    
            Next i
    
        Loop While oSourceCursor.gotoNextParagraph(False)
    
    End Sub
    
    '----------------------------------------------------------------------------
    
    ' OOME Listing 360. 
    Function IsWordSeparator(iChar As Long) As Boolean
    
        ' Horizontal tab \t 9
        ' New line \n 10
        ' Carriage return \r 13
        ' Space   32
        ' Non-breaking space   160     
    
        Select Case iChar
        Case 9, 10, 13, 32, 160
            IsWordSeparator = True
        Case Else
            IsWordSeparator = False
        End Select    
    End Function
    

    Although this would be easier to do in Python, as Jim K suggested, the Basic approach makes it easier to distribute the functionality to users, since they would not have to install Python and the NLTK library (which is not straightforward).
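
As for the update about trigrams and longer n-grams: the macro only remembers one previous word, but the same idea should extend to a window of the last n words. Below is a minimal sketch of that sliding-window logic. It is written in Python only so that it can be tried quickly, and it deliberately avoids NLTK. The names split_words and ngrams are illustrative, not part of any library, and the separator set is an assumption that simply mirrors IsWordSeparator().

```python
from collections import deque

# Assumption: the same separators that IsWordSeparator() accepts
# (tab, new line, carriage return, space, non-breaking space).
SEPARATORS = {"\t", "\n", "\r", " ", "\u00a0"}


def split_words(text):
    """Split text on the separator characters, dropping empty pieces."""
    words, current = [], []
    for ch in text:
        if ch in SEPARATORS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:  # last word when the text does not end in a separator
        words.append("".join(current))
    return words


def ngrams(words, n):
    """Yield each run of n consecutive words as a space-joined string."""
    window = deque(maxlen=n)  # keeps only the last n words seen
    for word in words:
        window.append(word)
        if len(window) == n:
            yield " ".join(window)


sample = "the quick brown fox jumps"
print(list(ngrams(split_words(sample), 2)))
print(list(ngrams(split_words(sample), 4)))
```

Porting this to the macro would presumably mean replacing sPreviousWord with an array holding the last n words and writing a line once the array is full. That port is not shown or tested here.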