
Extract bullets from Word Document in R


I have a Microsoft Word document that contains several bullets and nested bullets (sub-bullets), with up to three levels of nesting. I have been using the officer package in R to read the text from the document, which I then plan to insert into a database. I can extract all the text successfully, but I can't figure out how to extract the bullets themselves. Each bullet and its nesting level carries important context for the text I need to extract, yet officer seems to strip or ignore the bullets. Is there a way to use officer to extract the bullets in addition to the text, or is there another R package that will retrieve them?

I realize I could probably write a custom function to parse the XML structure of the Word document and obtain the bullets from there, but I'd rather not dig into those details and reinvent a wheel that others may have already built.

Thanks.


Solution

  • Well, shortly after asking this question, I discovered the docx_summary function in officer. Its output includes a column called level, which indicates the bullet nesting level. I think I should be able to use this to accomplish what I'm trying to do. Sorry for answering my own question, but I figured this might be useful to others trying to do the same thing. The one thing I wish it had is a way to determine exactly which symbol is used for the bullet. I can work around that, but if anyone knows how to extract the symbol used at each bullet level, that would be greatly appreciated.
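
A minimal sketch of that approach, in case it helps. The helper name bullet_outline and the file name my_document.docx are made up for illustration, and the code assumes docx_summary returns the columns content_type, text, level and num_id, with level set to NA for paragraphs that are not part of a list (worth checking against your installed version of officer).

```r
# Hypothetical helper: keep only list paragraphs from a docx_summary()-style
# data frame and build an indented outline from the `level` column.
bullet_outline <- function(summary_df) {
  # Paragraph rows only (the summary can also contain table cells)
  paras <- summary_df[summary_df$content_type == "paragraph", , drop = FALSE]
  # Assumption: non-list paragraphs have NA in `level`
  bullets <- paras[!is.na(paras$level), c("doc_index", "level", "num_id", "text")]
  if (nrow(bullets) == 0) {
    bullets$outline <- character(0)
    return(bullets)
  }
  # Indent relative to the shallowest level found, so the result does not
  # depend on whether levels start at 0 or 1
  depth <- bullets$level - min(bullets$level)
  bullets$outline <- paste0(strrep("  ", depth), "- ", bullets$text)
  bullets
}

# Usage; runs only if officer is installed and the placeholder file exists
path <- "my_document.docx"
if (requireNamespace("officer", quietly = TRUE) && file.exists(path)) {
  doc <- officer::read_docx(path)
  out <- bullet_outline(officer::docx_summary(doc))
  cat(out$outline, sep = "\n")
}
```

The level and num_id values can go into the database alongside the text. As far as I can tell, the summary does not include the bullet symbol itself, so this does not solve that part.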