I have a Word document (.docx) containing some coloured characters. I am trying to export its contents into a data frame while retaining the font colour of each character. The colours carry important information, so I would like the output to state the colour of each character that is read. Are there any R packages that would help me do this?
I have tried converting the document into XML, but have had no luck retrieving the text based on the font colour. I have also tried the officer package, but unfortunately it doesn't read font colours.
Sample input would be a docx with characters like this:
Sample output could look something like:
Character  Underline  Bold  Color
O          No         Yes   Red
%          Yes        Yes   Black
8          Yes        Yes   Green
OR
Red character positions: 1
Green character positions: 3
Underline character positions: 2, 3
Bold character positions: 1, 2, 3
Note: my test document is about pigs, hence the variable names.
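library(xml2)
# A .docx file is a zip archive; the body text lives in word/document.xml
pigsin <- read_xml(unz(file.choose(), "word/document.xml"))
# Select every run (w:r) that contains text (w:t) and convert it to nested R lists
text_nodeset <- pigsin |> xml2::xml_find_all("//w:r[w:t]") |> as_list()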
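This gives you a list of all the runs (w:r elements) in the document that contain text. A run is a stretch of text sharing the same formatting, which is stored in its w:rPr child. Then iterate over the runs to extract the text and the formatting values, e.g.: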
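lapply(text_nodeset,
       FUN = \(x) {
         data.frame(
           # split the run's text into single characters, one row each
           chars  = strsplit(unlist(x$t), "")[[1]],
           # [[ ]] matches names exactly, so <w:iCs/> is not mistaken for <w:i/>
           italic = !is.null(x$rPr[["i"]]),
           bold   = !is.null(x$rPr[["b"]]),
           # <w:color w:val="RRGGBB"/> holds the font colour as a hex code
           colour = if (is.null(x$rPr[["color"]])) "-" else attr(x$rPr[["color"]], "val")
         )
       }) |> dplyr::bind_rows()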
gives
   chars italic  bold colour
1      P   TRUE FALSE      -
2      i   TRUE FALSE FF0000
3      g   TRUE FALSE      -
4      P  FALSE FALSE      -
5      A  FALSE FALSE      -
6      G  FALSE FALSE FF0000
7      P  FALSE  TRUE      -
8      o  FALSE  TRUE      -
9      g  FALSE  TRUE      -
10     P  FALSE  TRUE 00B050
11     U  FALSE  TRUE 00B050
...
(# for my silly toy file)
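To get closer to the output asked for in the question (underline, colour names, character positions), the same idea can be extended roughly as follows. This is only a sketch under a few assumptions: the inline XML is a hypothetical stand-in for word/document.xml, the colour_names lookup is made up and needs extending for your own colours, and a run with no w:color element is reported as "Black" even though Word really treats it as the automatic or inherited colour. Underline is stored in the run properties as a w:u element, so it can be tested the same way as bold.

```r
library(xml2)

# Hypothetical three-run document body mirroring the sample in the question
doc <- read_xml(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:p>',
  '<w:r><w:rPr><w:b/><w:color w:val="FF0000"/></w:rPr><w:t>O</w:t></w:r>',
  '<w:r><w:rPr><w:b/><w:u w:val="single"/></w:rPr><w:t>%</w:t></w:r>',
  '<w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="00B050"/></w:rPr><w:t>8</w:t></w:r>',
  '</w:p></w:body></w:document>'
))

runs <- doc |> xml_find_all("//w:r[w:t]") |> as_list()

# Assumed lookup from hex codes to names; codes not listed here come out as NA
colour_names <- c(FF0000 = "Red", `00B050` = "Green")

rows <- lapply(runs, \(x) {
  hex <- attr(x$rPr[["color"]], "val")   # NULL when the run sets no colour
  data.frame(
    chars     = strsplit(unlist(x$t), "")[[1]],
    # presence test only: <w:u w:val="none"/> would also count as underlined
    underline = !is.null(x$rPr[["u"]]),
    bold      = !is.null(x$rPr[["b"]]),
    # assumption: a run without <w:color> is reported as "Black"
    colour    = if (is.null(hex)) "Black" else unname(colour_names[hex])
  )
})
res <- do.call(rbind, rows)
res

# Character positions per colour, then positions of underlined and bold characters
split(seq_len(nrow(res)), res$colour)
which(res$underline)
which(res$bold)
```

The positions are counted over the rows of res, i.e. over all characters found in text runs in document order, so they only match positions in the visible text if nothing else (tabs, line breaks, fields) sits between the runs.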