I have a file, strangeform.pdf, and I use the following code to read its form field values:
using System;
using System.Text;
using iText.Forms;
using iText.Forms.Fields;
using iText.Kernel.Pdf;

// Open the PDF for reading and get its AcroForm (created if it is missing).
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"strangeform.pdf"));
PdfAcroForm form = PdfFormCreator.GetAcroForm(pdfDoc, true);
StringBuilder stringBuilder = new StringBuilder();
var data = form.GetAllFormFields();
foreach (var field in data)
{
    // Output only fields with a non-empty value, as "name,value".
    string value = field.Value.GetValueAsString();
    if (!string.IsNullOrEmpty(value))
        stringBuilder.AppendLine($"{field.Key},{value}");
}
pdfDoc.Close();
Console.WriteLine(stringBuilder.ToString());
Console.ReadKey();
I got the following output:
T36,楊宏章
T37,hello
T50-2,123楊宏章
T50-1,楊宏章123ab
The last line of the output above, T50-1,楊宏章123ab, does not match the appearance of the field in strangeform.pdf (see the image below); I expected T50-1,12楊宏章123.
What is going on? Is this an issue with iText or with strangeform.pdf?
PS: iText version is 8.0.3
As K J already indicated in a comment, common PDF processors that allow editing form fields also show "楊宏章123ab" as soon as you select the field in question. Thus, that value somehow also is associated with the field and iText is not alone in seeing it.
The background is that in PDFs, form fields have an internal machine-readable value and any number of widgets (most often one) that show the value on a document page. These widgets can contain instructions (in a so-called appearance content stream) on how to draw the value, or they can rely on the viewer to create a visualization on the spot.
These two representations of the field value, the machine-readable one and the visible one, may get out of sync for different reasons, by error or even by design. As a result, different PDF processors may return different values for the field when reading the document, depending on which representation they use.
This is the case here. Have a look at the internal structure:
In the top tree view you see the name T50-1 (T value) and the machine-readable value 楊宏章123ab of the field.
In the text view at the bottom you see the associated appearance stream. The argument of the text-showing operator Tj therein, <00310032694a5b8f7ae0003100320033>, consists of the character codes 0031, 0032, 694a, 5b8f, 7ae0, 0031, 0032, and 0033. The defined encoding of the font object used is UniCNS-UTF16-H, so those hex character codes are UTF-16BE codes; in particular, 0031 means '1', 0032 means '2', and 0033 means '3', while 694a, 5b8f, and 7ae0 are 楊, 宏, and 章. The appearance stream therefore draws 12楊宏章123, which differs from the machine-readable value 楊宏章123ab.
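To double-check the decoding by hand, the hex string from the appearance stream can be interpreted as UTF-16BE with a few lines of plain .NET code. This is only a minimal sketch: the class and method names (AppearanceText, DecodeUtf16BeHex) are made up for illustration, it does not use iText at all, and it assumes the hex string has already been copied out of the content stream without its angle brackets.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of iText.
public static class AppearanceText
{
    // Interprets a PDF hex string operand (angle brackets removed)
    // as UTF-16BE, which is what the UniCNS-UTF16-H encoding uses
    // for its character codes.
    public static string DecodeUtf16BeHex(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of hex digits.", nameof(hex));

        // Convert each pair of hex digits into one byte.
        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);

        // BigEndianUnicode is the .NET name for UTF-16BE.
        return Encoding.BigEndianUnicode.GetString(bytes);
    }
}
```

Calling AppearanceText.DecodeUtf16BeHex("00310032694a5b8f7ae0003100320033") returns 12楊宏章123, i.e. the text the widget displays, not the 楊宏章123ab that is stored as the field value.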
Considering your question, "What is going on? Is this an issue of itext or strangeform.pdf?": this is therefore an issue of the PDF file in question, while iText simply extracts the machine-readable value.