I performed text mining on files that I am preparing for publication right now. There are several XML files that contain text within segments (see basic example below). Due to copyright restrictions, I have to make sure that the files that I am going to publish do not contain the whole text while someone who has the texts should be able to 'reconstruct' the files. To make sure that one can still perform basic text mining (= count lengths), the segment length should not change. Therefore I am looking for a way to replace every word except for the first and last one in all segments with dummy / placeholder text.
Basic example:
Input:
<text>
<div>
<seg xml:id="A">Lorem ipsum dolor sit amet</seg>
<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>
</div>
</text>
Output:
<text>
<div>
<seg xml:id="A">Lorem blank blank blank amet</seg>
<seg xml:id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
There is rapply
to recursively replace values in a nested list:
Let be data.xml
containing your input.
library(tidyverse)
library(xml2)
read_xml("data.xml") %>%
as_list() %>%
rapply(how = "replace", function(x) {
tokens <-
x %>%
str_split(" ") %>%
simplify()
n_tokens <- length(tokens)
c(
tokens[[1]],
rep("blank", n_tokens - 2),
tokens[[n_tokens]]
) %>%
paste0(collapse = " ")
}) %>%
as_xml_document() %>%
write_xml("data2.xml")
Output file data2.xml:
<?xml version="1.0" encoding="UTF-8"?>
<text>
<div>
<seg id="A">Lorem blank blank blank amet</seg>
<seg id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>