Strange result when using iText's PdfAcroForm.GetAllFormFields to read form data


I have a file, strangeform.pdf, and I use the following code to read the form data from it:

using System;
using System.Text;
using iText.Forms;
using iText.Forms.Fields;
using iText.Kernel.Pdf;

// Open the PDF for reading; the using declaration closes the document when the program ends.
using PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"strangeform.pdf"));
PdfAcroForm form = PdfFormCreator.GetAcroForm(pdfDoc, true);
StringBuilder stringBuilder = new StringBuilder();

// Collect "name,value" for every field that has a non-empty value.
foreach (var field in form.GetAllFormFields())
{
    string value = field.Value.GetValueAsString();
    if (!string.IsNullOrEmpty(value))
        stringBuilder.AppendLine($"{field.Key},{value}");
}
Console.WriteLine(stringBuilder.ToString());
Console.ReadKey();

This produces the following output:

T36,楊宏章
T37,hello
T50-2,123楊宏章
T50-1,楊宏章123ab

The last line of the output, T50-1,楊宏章123ab, does not match the appearance of that field in strangeform.pdf (see image below); I expected T50-1,12楊宏章123.

[screenshot of strangeform.pdf]

What is going on? Is this an issue of itext or strangeform.pdf?

PS: iText version is 8.0.3


Solution

  • As K J already indicated in a comment, common PDF processors that allow editing form fields also show "楊宏章123ab" as soon as you select the field in question. So that value is also associated with the field in some way, and iText is not alone in seeing it.

    The background is that a form field in a PDF has an internal machine-readable value and any number of widgets (most often one) that show the value on a document page. Each widget can either contain instructions for drawing the value (in a so-called appearance content stream) or rely on the viewer to create a visualization on the spot.

    These two representations of the field value, the machine-readable one and the visible one, may get out of sync for different reasons, by error or even by design. As a result, different PDF processors can return different values for the same field when reading the document.

    This is the case here. Have a look at the internal structure:

    [screenshot of the PDF's internal structure]

    In the tree view at the top you see the name of the field, T50-1 (the T value), and its machine-readable value, 楊宏章123ab.

    In the text view at the bottom you see the associated appearance stream. The argument of the text-showing operator Tj in it, <00310032694a5b8f7ae0003100320033>, consists of the character codes 0031, 0032, 694a, 5b8f, 7ae0, 0031, 0032, and 0033. The encoding defined for the font object in use is UniCNS-UTF16-H, so those hex character codes are UTF-16BE codes; in particular, 0031 means '1', 0032 means '2', and 0033 means '3'. The appearance stream therefore draws 12楊宏章123, the text you expected, while the stored field value is 楊宏章123ab.
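
    The decoding described above can be reproduced with a few lines of plain C#, without iText. This is only a sketch: it assumes .NET 5 or later (for Convert.FromHexString), and it assumes that interpreting the Tj operand as UTF-16BE is valid because the font's encoding is UniCNS-UTF16-H. The variable names are made up for this illustration.

    ```csharp
    using System;
    using System.Text;

    // Operand of the Tj operator, copied from the appearance stream quoted above.
    string hex = "00310032694a5b8f7ae0003100320033";

    // Turn the hex digits into raw bytes (Convert.FromHexString needs .NET 5+).
    byte[] codes = Convert.FromHexString(hex);

    // Assumption: with the UniCNS-UTF16-H encoding, every two bytes form one
    // UTF-16BE code unit, so the bytes can be decoded as big-endian UTF-16.
    string shown = Encoding.BigEndianUnicode.GetString(codes);

    // Make sure the console can print the CJK characters.
    Console.OutputEncoding = Encoding.UTF8;
    Console.WriteLine(shown); // prints 12楊宏章123
    ```

    The result is the text drawn by the widget, which differs from the stored field value 楊宏章123ab.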

    Regarding your question

    What is going on? Is this an issue of itext or strangeform.pdf?

    this is an issue with the PDF file in question; iText simply extracts the machine-readable value.
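
    If you want to check a field for this kind of mismatch yourself, one option is to print the stored value next to the raw appearance stream of each widget. The following is a rough sketch, not a definitive implementation: it assumes iText 8.0.x for .NET and .NET 5 or later (for Encoding.Latin1), it looks only at the normal appearance (the N entry of AP) when that entry is a single stream, and it prints the stream's bytes as they are instead of interpreting them.

    ```csharp
    using System;
    using System.Text;
    using iText.Forms;
    using iText.Forms.Fields;
    using iText.Kernel.Pdf;
    using iText.Kernel.Pdf.Annot;

    using PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"strangeform.pdf"));

    // Pass false so that no form is created if the document has none.
    PdfAcroForm form = PdfFormCreator.GetAcroForm(pdfDoc, false);
    PdfFormField field = form?.GetField("T50-1");
    if (field == null)
    {
        Console.WriteLine("Field T50-1 not found.");
        return;
    }

    // The machine-readable value, which is what GetAllFormFields reports.
    Console.WriteLine($"Value: {field.GetValueAsString()}");

    foreach (PdfWidgetAnnotation widget in field.GetWidgets())
    {
        // AP/N is the normal appearance. It may also be a dictionary of
        // states (for example for check boxes), which this sketch skips.
        PdfStream appearance = widget.GetAppearanceDictionary()?.GetAsStream(PdfName.N);
        if (appearance == null)
            continue;

        // Latin1 maps every byte to exactly one character, so the content
        // stream operators (including the Tj operand) stay readable.
        Console.WriteLine(Encoding.Latin1.GetString(appearance.GetBytes()));
    }
    ```

    For the file discussed here, the printed stream should contain the Tj operand quoted above, which can then be compared with the printed value.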