r xml ms-word officer

Export data from Word to R, including the formatting of each character read


I have a Word document (.docx) with some coloured characters. I am trying to export this data into a data frame and want to retain the font colour as well. The colours carry important information, so I would like the output to state the colour of each character read. Are there any R packages that would help me read this?

I have tried converting it into XML, but have had no luck trying to retrieve the text based on the font color. I have also tried the officer package but unfortunately, it doesn't read the font colors.

Sample input would be a docx with characters such as O, % and 8 in different colours, some of them bold or underlined (shown as an image in the original post).

Sample output could look something like:

Character   Underline   Bold   Color
O           No          Yes    Red
%           Yes         Yes    Black
8           Yes         Yes    Green

OR

Red character positions: 1
Green character positions: 3
Underlined character positions: 2, 3
Bold character positions: 1, 2, 3

Solution

  • Note: my test document is about pigs, hence the variable names.

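    library(xml2)
    
    # a .docx is a zip archive; the body text lives in word/document.xml
    pigsin <- read_xml(unz(file.choose(), "word/document.xml"))
    
    # every run (w:r) that contains a text node (w:t), converted to nested R lists
    text_nodeset <- pigsin |> xml2::xml_find_all("//w:r[w:t]") |> as_list()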
    

    This gives you a list of all runs (w:r elements) in the document that contain text. Then iterate over them to extract the text and formatting values, e.g.:

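    lapply(text_nodeset, 
           FUN = \(x) {
             # x is one run: x$t holds its text, x$rPr its formatting properties
             data.frame(
               # split the run's text into single characters
               chars = strsplit(unlist(x$t), "")[[1]],
               # TRUE when the run carries the property element (direct formatting only);
               # underline could be tested the same way via x$rPr$u
               italic = !is.null(x$rPr$i),
               bold = !is.null(x$rPr$b),
               # hex colour from the w:val attribute, "-" when the run sets no colour
               colour = if (is.null(x$rPr$color)) "-" else attr(x$rPr$color, "val")
             )
           }) |> dplyr::bind_rows()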
    

    gives

       chars italic  bold colour
    1      P   TRUE FALSE      -
    2      i   TRUE FALSE FF0000
    3      g   TRUE FALSE      -
    4      P  FALSE FALSE      -
    5      A  FALSE FALSE      -
    6      G  FALSE FALSE FF0000
    7      P  FALSE  TRUE      -
    8      o  FALSE  TRUE      -
    9      g  FALSE  TRUE      -
    10     P  FALSE  TRUE 00B050
    11     U  FALSE  TRUE 00B050
    ...
    (# for my silly toy file)
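
If you would rather have the "positions" style of output from the question, the data frame can be summarised with base R. This is only a sketch: the `runs` data frame below is typed in by hand from the first rows of the output above so that the example runs on its own, and the `colour_names` lookup is a hypothetical mapping (Word stores hex codes, not names), so extend it with whatever colours your document actually uses.

```r
# Toy data: the first rows of the table shown above, typed in by hand.
runs <- data.frame(
  chars  = c("P", "i", "g", "P", "A", "G", "P", "o", "g", "P", "U"),
  italic = c(TRUE, TRUE, TRUE, rep(FALSE, 8)),
  bold   = c(rep(FALSE, 6), rep(TRUE, 5)),
  colour = c("-", "FF0000", "-", "-", "-", "FF0000",
             "-", "-", "-", "00B050", "00B050")
)

# Hypothetical lookup from hex codes to readable names (an assumption,
# not something stored in the docx); unmapped codes become NA.
colour_names <- c("FF0000" = "Red", "00B050" = "Green", "-" = "Default")
runs$colour_name <- unname(colour_names[runs$colour])

# Row numbers (character positions) grouped by colour name
positions_by_colour <- split(seq_len(nrow(runs)), runs$colour_name)

# Positions of bold and italic characters
bold_positions   <- which(runs$bold)
italic_positions <- which(runs$italic)

positions_by_colour$Red
bold_positions
```

Positions here are simply row numbers in the data frame, so they count every character extracted from a text run, including spaces.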