imagemagick tesseract ghostscript pgm leptonica

Ghostscript does not write resolution in header of pgm files


I want to convert many PDF files to PGM images to prepare them for OCR with tesseract. This is the command I use:

gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}

It runs as part of a parallel job with GNU parallel.

However, when tesseract reads these pgm files, it prints warnings such as:

Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 243

I assume that either leptonica or tesseract cannot read the header of the pgm files, or that ghostscript does not write the correct resolution into the header despite -r300 being set in the command above.

identify -verbose test.pgm prints:

Image: test.pgm
  Format: PGM (Portable graymap format (gray scale))
  Mime type: image/x-portable-greymap
  Class: DirectClass
  Geometry: 3296x2320+0+0
  Units: Undefined
  Colorspace: Gray
  Type: Grayscale
  Endianess: Undefined
  Depth: 8-bit
  Channel depth:
    Gray: 8-bit
  Channel statistics:
    Pixels: 7646720
    Gray:
      min: 0  (0)
      max: 255 (1)
      mean: 240.457 (0.942969)
      standard deviation: 54.5946 (0.214096)
      kurtosis: 14.187
      skewness: -4.00412
      entropy: 0.180208
  Colors: 172
  Histogram:
    194600: (  0,  0,  0) #000000 gray(0)
     20412: (  2,  2,  2) #020202 gray(2)
     19483: (  4,  4,  4) #040404 gray(4)
     15476: (  6,  6,  6) #060606 gray(6)
     14901: (  8,  8,  8) #080808 gray(8)
     13859: ( 10, 10, 10) #0A0A0A gray(10)
     13083: ( 12, 12, 12) #0C0C0C gray(12)
     12134: ( 14, 14, 14) #0E0E0E gray(14)
      5848: ( 16, 16, 16) #101010 gray(16)
      5505: ( 17, 17, 17) #111111 gray(17)
      5186: ( 18, 18, 18) #121212 gray(18)
      5091: ( 19, 19, 19) #131313 gray(19)
      4624: ( 20, 20, 20) #141414 gray(20)
      4552: ( 21, 21, 21) #151515 gray(21)
      4329: ( 22, 22, 22) #161616 gray(22)
      4130: ( 23, 23, 23) #171717 gray(23)
      3910: ( 24, 24, 24) #181818 gray(24)
      3543: ( 25, 25, 25) #191919 gray(25)
      3473: ( 26, 26, 26) #1A1A1A gray(26)
      3113: ( 27, 27, 27) #1B1B1B gray(27)
      2928: ( 28, 28, 28) #1C1C1C gray(28)
      3020: ( 29, 29, 29) #1D1D1D gray(29)
      2718: ( 30, 30, 30) #1E1E1E gray(30)
      2398: ( 31, 31, 31) #1F1F1F gray(31)
      4602: ( 32, 32, 32) #202020 gray(32)
      3704: ( 34, 34, 34) #222222 gray(34)
      3347: ( 36, 36, 36) #242424 gray(36)
      2865: ( 38, 38, 38) #262626 gray(38)
      2357: ( 40, 40, 40) #282828 gray(40)
      1961: ( 42, 42, 42) #2A2A2A gray(42)
      1692: ( 44, 44, 44) #2C2C2C gray(44)
      1332: ( 46, 46, 46) #2E2E2E gray(46)
       642: ( 48, 48, 48) #303030 gray(48)
       545: ( 49, 49, 49) #313131 gray(49)
       482: ( 50, 50, 50) #323232 gray(50)
       416: ( 51, 51, 51) #333333 gray(51)
       435: ( 52, 52, 52) #343434 gray(52)
       475: ( 53, 53, 53) #353535 gray(53)
       313: ( 54, 54, 54) #363636 gray(54)
       291: ( 55, 55, 55) #373737 gray(55)
       293: ( 56, 56, 56) #383838 gray(56)
       254: ( 57, 57, 57) #393939 gray(57)
       265: ( 58, 58, 58) #3A3A3A gray(58)
       197: ( 59, 59, 59) #3B3B3B gray(59)
       220: ( 60, 60, 60) #3C3C3C gray(60)
       211: ( 61, 61, 61) #3D3D3D gray(61)
       168: ( 62, 62, 62) #3E3E3E gray(62)
       232: ( 63, 63, 63) #3F3F3F gray(63)
       295: ( 64, 64, 64) #404040 gray(64)
       363: ( 66, 66, 66) #424242 gray(66)
       411: ( 68, 68, 68) #444444 gray(68)
       237: ( 70, 70, 70) #464646 gray(70)
       153: ( 72, 72, 72) #484848 gray(72)
       140: ( 74, 74, 74) #4A4A4A gray(74)
       760: ( 76, 76, 76) #4C4C4C gray(76)
       260: ( 78, 78, 78) #4E4E4E gray(78)
        22: ( 80, 80, 80) #505050 gray(80)
        42: ( 81, 81, 81) #515151 gray(81)
        44: ( 82, 82, 82) #525252 gray(82)
        32: ( 83, 83, 83) #535353 gray(83)
        57: ( 84, 84, 84) #545454 gray(84)
        49: ( 85, 85, 85) #555555 gray(85)
        75: ( 86, 86, 86) #565656 gray(86)
        12: ( 87, 87, 87) #575757 gray(87)
        54: ( 88, 88, 88) #585858 gray(88)
        28: ( 89, 89, 89) #595959 gray(89)
       369: ( 90, 90, 90) #5A5A5A gray(90)
       124: ( 91, 91, 91) #5B5B5B gray(91)
        57: ( 92, 92, 92) #5C5C5C gray(92)
        14: ( 93, 93, 93) #5D5D5D gray(93)
         9: ( 94, 94, 94) #5E5E5E gray(94)
        16: ( 95, 95, 95) #5F5F5F gray(95)
        60: ( 96, 96, 96) #606060 gray(96)
         7: ( 98, 98, 98) #626262 gray(98)
        19: (100,100,100) #646464 gray(100)
        49: (102,102,102) #666666 gray(102)
        17: (104,104,104) #686868 gray(104)
         6: (106,106,106) #6A6A6A gray(106)
         8: (108,108,108) #6C6C6C gray(108)
        18: (110,110,110) #6E6E6E gray(110)
         3: (113,113,113) #717171 gray(113)
         8: (116,116,116) #747474 gray(116)
         1: (124,124,124) #7C7C7C gray(124)
         1: (132,132,132) #848484 gray(132)
         5: (138,138,138) #8A8A8A gray(138)
         3: (142,142,142) #8E8E8E gray(142)
         1: (145,145,145) #919191 gray(145)
         1: (146,146,146) #929292 gray(146)
         3: (147,147,147) #939393 gray(147)
         1: (148,148,148) #949494 gray(148)
         6: (149,149,149) #959595 gray(149)
         1: (150,150,150) #969696 gray(150)
        33: (151,151,151) #979797 gray(151)
         3: (152,152,152) #989898 gray(152)
         1: (154,154,154) #9A9A9A gray(154)
        20: (155,155,155) #9B9B9B gray(155)
        13: (156,156,156) #9C9C9C gray(156)
         3: (157,157,157) #9D9D9D gray(157)
        36: (158,158,158) #9E9E9E gray(158)
        19: (159,159,159) #9F9F9F gray(159)
        57: (160,160,160) #A0A0A0 gray(160)
        41: (162,162,162) #A2A2A2 gray(162)
        33: (164,164,164) #A4A4A4 gray(164)
        85: (166,166,166) #A6A6A6 gray(166)
       128: (168,168,168) #A8A8A8 gray(168)
        66: (170,170,170) #AAAAAA gray(170)
       160: (172,172,172) #ACACAC gray(172)
       136: (174,174,174) #AEAEAE gray(174)
       204: (176,176,176) #B0B0B0 gray(176)
        34: (177,177,177) #B1B1B1 gray(177)
        59: (178,178,178) #B2B2B2 gray(178)
       134: (179,179,179) #B3B3B3 gray(179)
       165: (180,180,180) #B4B4B4 gray(180)
       120: (181,181,181) #B5B5B5 gray(181)
       371: (182,182,182) #B6B6B6 gray(182)
       132: (183,183,183) #B7B7B7 gray(183)
       104: (184,184,184) #B8B8B8 gray(184)
        77: (185,185,185) #B9B9B9 gray(185)
        92: (186,186,186) #BABABA gray(186)
       423: (187,187,187) #BBBBBB gray(187)
       949: (188,188,188) #BCBCBC gray(188)
       237: (189,189,189) #BDBDBD gray(189)
       135: (190,190,190) #BEBEBE gray(190)
       253: (191,191,191) #BFBFBF gray(191)
       436: (192,192,192) #C0C0C0 gray(192)
       419: (194,194,194) #C2C2C2 gray(194)
       510: (196,196,196) #C4C4C4 gray(196)
       590: (198,198,198) #C6C6C6 gray(198)
       811: (200,200,200) #C8C8C8 gray(200)
       990: (202,202,202) #CACACA gray(202)
      1246: (204,204,204) #CCCCCC gray(204)
      1539: (206,206,206) #CECECE gray(206)
       982: (208,208,208) #D0D0D0 gray(208)
       928: (209,209,209) #D1D1D1 gray(209)
       977: (210,210,210) #D2D2D2 gray(210)
      1134: (211,211,211) #D3D3D3 gray(211)
      1275: (212,212,212) #D4D4D4 gray(212)
      1444: (213,213,213) #D5D5D5 gray(213)
      2085: (214,214,214) #D6D6D6 gray(214)
      1814: (215,215,215) #D7D7D7 gray(215)
      2432: (216,216,216) #D8D8D8 gray(216)
      2223: (217,217,217) #D9D9D9 gray(217)
      2416: (218,218,218) #DADADA gray(218)
      2834: (219,219,219) #DBDBDB gray(219)
      2602: (220,220,220) #DCDCDC gray(220)
      2849: (221,221,221) #DDDDDD gray(221)
      3317: (222,222,222) #DEDEDE gray(222)
      3842: (223,223,223) #DFDFDF gray(223)
      8282: (224,224,224) #E0E0E0 gray(224)
      9852: (226,226,226) #E2E2E2 gray(226)
     11147: (228,228,228) #E4E4E4 gray(228)
     13932: (230,230,230) #E6E6E6 gray(230)
     16318: (232,232,232) #E8E8E8 gray(232)
     18434: (234,234,234) #EAEAEA gray(234)
     20005: (236,236,236) #ECECEC gray(236)
     24433: (238,238,238) #EEEEEE gray(238)
     12657: (240,240,240) #F0F0F0 gray(240)
     12821: (241,241,241) #F1F1F1 gray(241)
     13388: (242,242,242) #F2F2F2 gray(242)
     14835: (243,243,243) #F3F3F3 gray(243)
     14060: (244,244,244) #F4F4F4 gray(244)
     15687: (245,245,245) #F5F5F5 gray(245)
     15796: (246,246,246) #F6F6F6 gray(246)
     17834: (247,247,247) #F7F7F7 gray(247)
     20159: (248,248,248) #F8F8F8 gray(248)
     22424: (249,249,249) #F9F9F9 gray(249)
     22594: (250,250,250) #FAFAFA gray(250)
     19611: (251,251,251) #FBFBFB gray(251)
     20156: (252,252,252) #FCFCFC gray(252)
     25883: (253,253,253) #FDFDFD gray(253)
   6419695: (254,254,254) #FEFEFE gray(254)
    413243: (255,255,255) #FFFFFF gray(255)
  Rendering intent: Undefined
  Gamma: 0.454545
  Matte color: grey74
  Background color: white
  Border color: srgb(223,223,223)
  Transparent color: none
  Interlace: None
  Intensity: Undefined
  Compose: Over
  Page geometry: 3296x2320+0+0
  Dispose: Undefined
  Iterations: 0
  Compression: Undefined
  Orientation: Undefined
  Properties:
    comment:  Image generated by GPL Ghostscript (device=pgmraw)

    date:create: 2019-01-30T09:01:26+01:00
    date:modify: 2019-01-30T08:53:11+01:00
    signature: 9cac1c1189c8b785b107c2cec0ab88b3931f2f38a9c462a8fc3fac95b370b89f
  Artifacts:
    verbose: true
  Tainted: False
  Filesize: 7646790B
  Number pixels: 7646720
  Pixels per second: 21.8478MP
  User time: 0.240u
  Elapsed time: 0:01.349
  Version: ImageMagick 7.0.8-24 Q16 x86_64 2019-01-18 https://imagemagick.org

As far as I can tell, no resolution is indicated. If this is the problem, how can I make gs write the dpi into the header?


Solution

  • The PGM format doesn't have a resolution.

    In fact the "resolution" of a bitmap file is somewhat moot anyway. If I take an image 300x300 pixels, and I draw it one inch square on a piece of paper, then the resolution of the image on the media is 300 dpi. If I take the same image, and draw it two inches square, then the resolution of the image on the media is 150 dpi.

    Same image, so what's the resolution?

    In any event, since the PGM format doesn't allow you to specify the resolution, you can't make Ghostscript write the resolution in the header.
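
To make this concrete, here is a minimal Python sketch of what a binary PGM header holds: a magic number, optional comments, width, height and the maximum gray value, with no field for dpi. The SAMPLE_HEADER bytes are an assumption, reconstructed from the comment, geometry and depth that identify reports in the question, not read from the actual file. The names read_pgm_header and physical_size_inches are hypothetical helpers made up for this illustration.

```python
import io

# Assumed reconstruction of the header Ghostscript wrote for test.pgm, based on
# the comment, geometry and depth shown by identify; not read from the file.
SAMPLE_HEADER = (
    b"P5\n"
    b"# Image generated by GPL Ghostscript (device=pgmraw)\n"
    b"3296 2320\n"
    b"255\n"
)


def read_pgm_header(stream):
    """Parse a binary (P5) PGM header.

    Returns (width, height, maxval, header_length_in_bytes).
    """
    tokens = []
    consumed = 0
    while len(tokens) < 4:
        token = b""
        while True:
            ch = stream.read(1)
            if not ch:
                raise ValueError("unexpected end of data inside the PGM header")
            consumed += 1
            if ch == b"#":
                # A comment runs to the end of the line; skip it entirely.
                while ch and ch not in (b"\n", b"\r"):
                    ch = stream.read(1)
                    consumed += len(ch)
                ch = b"\n"  # treat the whole comment as whitespace
            if ch.isspace():
                if token:
                    break  # whitespace ends the current token
                continue  # skip leading whitespace
            token += ch
        tokens.append(token)
    magic, width, height, maxval = tokens
    if magic != b"P5":
        raise ValueError("not a binary PGM file")
    return int(width), int(height), int(maxval), consumed


def physical_size_inches(width_px, height_px, dpi):
    """Size on the medium if the pixels are drawn at the given dpi."""
    return width_px / dpi, height_px / dpi


if __name__ == "__main__":
    width, height, maxval, header_len = read_pgm_header(io.BytesIO(SAMPLE_HEADER))
    print(width, height, maxval, header_len)    # 3296 2320 255 70
    # Header plus one byte per pixel (maxval 255) gives the total file size.
    print(header_len + width * height)          # 7646790
    # The same 300x300 pixels drawn at two different sizes.
    print(physical_size_inches(300, 300, 300))  # (1.0, 1.0)
    print(physical_size_inches(300, 300, 150))  # (2.0, 2.0)
```

If the header assumption holds, its 70 bytes plus 3296 x 2320 pixel bytes add up to exactly the 7646790 bytes that identify reports as the file size, so nothing else is stored in the file. The resolution therefore has to reach the consumer some other way; recent Tesseract versions, for instance, accept a --dpi option on the command line.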