php ms-word phpword phpoffice

Read an MS Word document with PHPWord


I have installed and set up PHPWord in PhpStorm (my IDE). I am trying to read the line "Learn from yesterday, live for today, hope for tomorrow..." from a Word document titled 'helloWorld.docx' using PHPWord.


This is my code to load and read the document so far:

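<?php

// Load PHPWord's bundled autoloader.
require_once 'PHPWord/bootstrap.php';

// Create a reader for .docx (Word2007) files and load the document.
$objReader = \PhpOffice\PhpWord\IOFactory::createReader("Word2007");
$phpWord = $objReader->load("helloWorld.docx");

// Fetch the first section (index 0).
$section = $phpWord->getSection(0);

// var_dump() prints directly and returns nothing, so no echo is needed.
var_dump($section);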

OUTPUT:

/usr/bin/php7.2 /home/wade/PhpstormProjects/getWord/readDoc.php
object(PhpOffice\PhpWord\Element\Section)#21 (21) {

["container":protected]=>
  string(7) "Section"
  ["style":"PhpOffice\PhpWord\Element\Section":private]=>
  object(PhpOffice\PhpWord\Style\Section)#22 (32) {
    ["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
    string(8) "portrait"
    ["paper":"PhpOffice\PhpWord\Style\Section":private]=>
    object(PhpOffice\PhpWord\Style\Paper)#14 (8) {
      ["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
      array(7) {
        ["A3"]=>
        array(3) {
          [0]=>
          int(297)
          [1]=>
          int(420)
          [2]=>
          string(2) "mm"
        }
        ["A4"]=>
        array(3) {
          [0]=>
          int(210)
          [1]=>
          int(297)
          [2]=>
          string(2) "mm"
        }
        ["A5"]=>
        array(3) {
          [0]=>
          int(148)
          [1]=>
          int(210)
          [2]=>
          string(2) "mm"
        }
        ["B5"]=>
        array(3) {
          [0]=>
          int(176)
          [1]=>
          int(250)
          [2]=>
          string(2) "mm"
        }
        ["Folio"]=>
        array(3) {
          [0]=>
          float(8.5)
          [1]=>
          int(13)
          [2]=>
          string(2) "in"
        }
        ["Legal"]=>
        array(3) {
          [0]=>
          float(8.5)
          [1]=>
          int(14)
          [2]=>
          string(2) "in"
        }
        ["Letter"]=>
        array(3) {
          [0]=>
          float(8.5)
          [1]=>
          int(11)
          [2]=>
          string(2) "in"
        }
      }
      ["size":"PhpOffice\PhpWord\Style\Paper":private]=>
      string(2) "A4"
      ["width":"PhpOffice\PhpWord\Style\Paper":private]=>
      float(11905.511811024)
      ["height":"PhpOffice\PhpWord\Style\Paper":private]=>
      float(16837.795275591)
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["aliases":protected]=>
      array(0) {
      }
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
    string(15) "11905.511811024"
    ["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
    string(15) "16837.795275591"
    ["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
    string(4) "1440"
    ["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
    string(4) "1440"
    ["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
    string(4) "1440"
    ["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
    string(4) "1440"
    ["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
    string(1) "0"
    ["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
    string(3) "720"
    ["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
    string(3) "720"
    ["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
    NULL
    ["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
    int(1)
    ["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
    string(3) "720"
    ["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
    NULL
    ["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
    NULL
    ["borderTopSize":protected]=>
    NULL
    ["borderTopColor":protected]=>
    NULL
    ["borderTopStyle":protected]=>
    NULL
    ["borderLeftSize":protected]=>
    NULL
    ["borderLeftColor":protected]=>
    NULL
    ["borderLeftStyle":protected]=>
    NULL
    ["borderRightSize":protected]=>
    NULL
    ["borderRightColor":protected]=>
    NULL
    ["borderRightStyle":protected]=>
    NULL
    ["borderBottomSize":protected]=>
    NULL
    ["borderBottomColor":protected]=>
    NULL
    ["borderBottomStyle":protected]=>
    NULL
    ["styleName":protected]=>
    NULL
    ["index":protected]=>
    NULL
    ["aliases":protected]=>
    array(0) {
    }
    ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
    bool(false)
  }
  ["headers":"PhpOffice\PhpWord\Element\Section":private]=>
  array(0) {
  }
  ["footers":"PhpOffice\PhpWord\Element\Section":private]=>
  array(0) {
  }
  ["footnoteProperties":"PhpOffice\PhpWord\Element\Section":private]=>
  NULL
  ["elements":protected]=>
  array(4) {
    [0]=>
    object(PhpOffice\PhpWord\Element\TextRun)#34 (18) {
      ["container":protected]=>
      string(7) "TextRun"
      ["paragraphStyle":protected]=>
      object(PhpOffice\PhpWord\Style\Paragraph)#35 (34) {
        ["aliases":protected]=>
        array(1) {
          ["line-height"]=>
          string(10) "lineHeight"
        }
        ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(6) "Normal"
        ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(0) ""
        ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(true)
        ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        int(0)
        ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        array(0) {
        }
        ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["contextualSpacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["bidi":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["textAlignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["suppressAutoHyphens":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["borderTopSize":protected]=>
        NULL
        ["borderTopColor":protected]=>
        NULL
        ["borderTopStyle":protected]=>
        NULL
        ["borderLeftSize":protected]=>
        NULL
        ["borderLeftColor":protected]=>
        NULL
        ["borderLeftStyle":protected]=>
        NULL
        ["borderRightSize":protected]=>
        NULL
        ["borderRightColor":protected]=>
        NULL
        ["borderRightStyle":protected]=>
        NULL
        ["borderBottomSize":protected]=>
        NULL
        ["borderBottomColor":protected]=>
        NULL
        ["borderBottomStyle":protected]=>
        NULL
        ["styleName":protected]=>
        NULL
        ["index":protected]=>
        NULL
        ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
        bool(false)
      }
      ["elements":protected]=>
      array(1) {
        [0]=>
        object(PhpOffice\PhpWord\Element\Text)#41 (18) {
          ["text":protected]=>
          string(134) "&quot;Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning.&quot; (Albert Einstein)"
          ["fontStyle":protected]=>
          object(PhpOffice\PhpWord\Style\Font)#43 (28) {
            ["aliases":protected]=>
            array(1) {
              ["line-height"]=>
              string(10) "lineHeight"
            }
            ["type":"PhpOffice\PhpWord\Style\Font":private]=>
            string(4) "text"
            ["name":"PhpOffice\PhpWord\Style\Font":private]=>
            string(15) "Times New Roman"
            ["hint":"PhpOffice\PhpWord\Style\Font":private]=>
            NULL
            ["size":"PhpOffice\PhpWord\Style\Font":private]=>
            int(20)
            ["color":"PhpOffice\PhpWord\Style\Font":private]=>
            NULL
            ["bold":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["italic":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["underline":"PhpOffice\PhpWord\Style\Font":private]=>
            string(4) "none"
            ["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
            bool(false)
            ["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
            NULL
            ["scale":"PhpOffice\PhpWord\Style\Font":private]=>
            NULL
            ["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
            NULL
            ["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
            NULL
            ["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
            object(PhpOffice\PhpWord\Style\Paragraph)#42 (34) {
              ["aliases":protected]=>
              array(1) {
                ["line-height"]=>
                string(10) "lineHeight"
              }
              ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              string(6) "Normal"
              ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              NULL
              ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              string(0) ""
              ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              NULL
              ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              NULL
              ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              NULL
              ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              bool(true)
              ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              bool(false)
              ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              bool(false)
              ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
              bool(false)

The full output is too long to post, but the string I'm looking for appears in this snippet if you scroll down a ways (it is the ["text"] property of the Text element).

My primary question is: is there a way to find this string without using var_dump and searching through the massive output?


Solution

  • Textual information is stored in [text] properties, which in turn are nested inside [elements] properties. To see the text you are looking for, dump the object in your browser and search for these property names with the browser's "find in page" function.

    These two properties are protected, so to access or extract them directly you will have to make them public.

    For where these properties are defined within the PHPWord library, see: https://stackoverflow.com/a/50989007/8510094

    Once you have made them public, you can strip away the outer layers of the object you received, one at a time, until you reach the object in which the [elements]->[text] properties sit just one layer down the 'tree'.

    So, the algorithm is: 1) find the [text] properties, 2) note the path to the object holding them, 3) strip away the higher-level objects and arrays level by level, 4) arrive at an object in which the [elements]->[text] properties are at the second level, 5) gather all the values of the [text] properties into, say, an array.

    Be careful with foreach loops, recursive functions, and the like that walk the entire object to reach the text. The resulting object is enormous because it carries all the style and paragraph data, so iterating over, flattening, or reducing the whole multidimensional structure can exhaust your memory or time limits.

    Alternatively, you can change the PHPWord library files so that the object you get when loading your Word file into PHPWord no longer carries unnecessary properties and values (styles, paragraph info, etc.).

    PhpSpreadsheet implements a way to read only the actual data from Excel files (stripped of formatting, style info, etc.). PHPWord also declares a $readDataOnly property, but it stops there and, for some reason, does not implement a mechanism for reading only the actual textual data.
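
As a possible alternative that leaves the library files untouched, the sketch below relies on public getters instead of the protected properties. It assumes that container elements (Section, TextRun) expose their children through getElements() and that text elements expose getText(). The name collectText() is a hypothetical helper, not part of PHPWord. It walks only the element tree, not the style data, and it does not descend into tables.

```php
<?php
// collectText() is a hypothetical helper, not part of PHPWord.
// It walks an element tree depth-first and gathers every text string.
function collectText($element, array &$out = []): array
{
    if (method_exists($element, 'getElements')) {
        // Container (e.g. Section, TextRun): recurse into its children.
        foreach ($element->getElements() as $child) {
            collectText($child, $out);
        }
    } elseif (method_exists($element, 'getText')) {
        $text = $element->getText();
        if (is_string($text) && $text !== '') {
            $out[] = $text;
        } elseif (is_object($text)) {
            // Some elements may return a nested container instead of a string.
            collectText($text, $out);
        }
    }
    return $out;
}

// Usage with PHPWord (assumes $phpWord was loaded as in the question):
// foreach ($phpWord->getSections() as $section) {
//     print_r(collectText($section));
// }
```

The helper is duck-typed on purpose, so it never touches style objects and can be tried on small stand-in objects before running it on a real document. The strings may still contain entities such as &quot;, as in the dump above; passing them through html_entity_decode() is one way to clean them up.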