I want to convert many PDF files to PGM images to prepare them for OCR with tesseract. This is the command I use:
gs -dQUIET -dINTERPOLATE -dSAFER -dBATCH -dNOPAUSE -dPDFSETTINGS=/ebook -dNumRenderingThreads=4 -sDEVICE=pgmraw -r300 -dTextAlphaBits=4 -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dOverrideICC -o {.}/{.}-%03d.pgm {}
as part of a parallelized job with GNU parallel.
However, when tesseract reads these PGM files, it prints warnings such as the following:
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 243
I assume that either leptonica or tesseract cannot read the header of the PGM files, or that ghostscript does not write the correct resolution into their header despite -r300 being set in the above command.
identify -verbose test.pgm
prints:
Image: test.pgm
Format: PGM (Portable graymap format (gray scale))
Mime type: image/x-portable-greymap
Class: DirectClass
Geometry: 3296x2320+0+0
Units: Undefined
Colorspace: Gray
Type: Grayscale
Endianess: Undefined
Depth: 8-bit
Channel depth:
Gray: 8-bit
Channel statistics:
Pixels: 7646720
Gray:
min: 0 (0)
max: 255 (1)
mean: 240.457 (0.942969)
standard deviation: 54.5946 (0.214096)
kurtosis: 14.187
skewness: -4.00412
entropy: 0.180208
Colors: 172
Histogram:
194600: ( 0, 0, 0) #000000 gray(0)
20412: ( 2, 2, 2) #020202 gray(2)
19483: ( 4, 4, 4) #040404 gray(4)
15476: ( 6, 6, 6) #060606 gray(6)
14901: ( 8, 8, 8) #080808 gray(8)
13859: ( 10, 10, 10) #0A0A0A gray(10)
13083: ( 12, 12, 12) #0C0C0C gray(12)
12134: ( 14, 14, 14) #0E0E0E gray(14)
5848: ( 16, 16, 16) #101010 gray(16)
5505: ( 17, 17, 17) #111111 gray(17)
5186: ( 18, 18, 18) #121212 gray(18)
5091: ( 19, 19, 19) #131313 gray(19)
4624: ( 20, 20, 20) #141414 gray(20)
4552: ( 21, 21, 21) #151515 gray(21)
4329: ( 22, 22, 22) #161616 gray(22)
4130: ( 23, 23, 23) #171717 gray(23)
3910: ( 24, 24, 24) #181818 gray(24)
3543: ( 25, 25, 25) #191919 gray(25)
3473: ( 26, 26, 26) #1A1A1A gray(26)
3113: ( 27, 27, 27) #1B1B1B gray(27)
2928: ( 28, 28, 28) #1C1C1C gray(28)
3020: ( 29, 29, 29) #1D1D1D gray(29)
2718: ( 30, 30, 30) #1E1E1E gray(30)
2398: ( 31, 31, 31) #1F1F1F gray(31)
4602: ( 32, 32, 32) #202020 gray(32)
3704: ( 34, 34, 34) #222222 gray(34)
3347: ( 36, 36, 36) #242424 gray(36)
2865: ( 38, 38, 38) #262626 gray(38)
2357: ( 40, 40, 40) #282828 gray(40)
1961: ( 42, 42, 42) #2A2A2A gray(42)
1692: ( 44, 44, 44) #2C2C2C gray(44)
1332: ( 46, 46, 46) #2E2E2E gray(46)
642: ( 48, 48, 48) #303030 gray(48)
545: ( 49, 49, 49) #313131 gray(49)
482: ( 50, 50, 50) #323232 gray(50)
416: ( 51, 51, 51) #333333 gray(51)
435: ( 52, 52, 52) #343434 gray(52)
475: ( 53, 53, 53) #353535 gray(53)
313: ( 54, 54, 54) #363636 gray(54)
291: ( 55, 55, 55) #373737 gray(55)
293: ( 56, 56, 56) #383838 gray(56)
254: ( 57, 57, 57) #393939 gray(57)
265: ( 58, 58, 58) #3A3A3A gray(58)
197: ( 59, 59, 59) #3B3B3B gray(59)
220: ( 60, 60, 60) #3C3C3C gray(60)
211: ( 61, 61, 61) #3D3D3D gray(61)
168: ( 62, 62, 62) #3E3E3E gray(62)
232: ( 63, 63, 63) #3F3F3F gray(63)
295: ( 64, 64, 64) #404040 gray(64)
363: ( 66, 66, 66) #424242 gray(66)
411: ( 68, 68, 68) #444444 gray(68)
237: ( 70, 70, 70) #464646 gray(70)
153: ( 72, 72, 72) #484848 gray(72)
140: ( 74, 74, 74) #4A4A4A gray(74)
760: ( 76, 76, 76) #4C4C4C gray(76)
260: ( 78, 78, 78) #4E4E4E gray(78)
22: ( 80, 80, 80) #505050 gray(80)
42: ( 81, 81, 81) #515151 gray(81)
44: ( 82, 82, 82) #525252 gray(82)
32: ( 83, 83, 83) #535353 gray(83)
57: ( 84, 84, 84) #545454 gray(84)
49: ( 85, 85, 85) #555555 gray(85)
75: ( 86, 86, 86) #565656 gray(86)
12: ( 87, 87, 87) #575757 gray(87)
54: ( 88, 88, 88) #585858 gray(88)
28: ( 89, 89, 89) #595959 gray(89)
369: ( 90, 90, 90) #5A5A5A gray(90)
124: ( 91, 91, 91) #5B5B5B gray(91)
57: ( 92, 92, 92) #5C5C5C gray(92)
14: ( 93, 93, 93) #5D5D5D gray(93)
9: ( 94, 94, 94) #5E5E5E gray(94)
16: ( 95, 95, 95) #5F5F5F gray(95)
60: ( 96, 96, 96) #606060 gray(96)
7: ( 98, 98, 98) #626262 gray(98)
19: (100,100,100) #646464 gray(100)
49: (102,102,102) #666666 gray(102)
17: (104,104,104) #686868 gray(104)
6: (106,106,106) #6A6A6A gray(106)
8: (108,108,108) #6C6C6C gray(108)
18: (110,110,110) #6E6E6E gray(110)
3: (113,113,113) #717171 gray(113)
8: (116,116,116) #747474 gray(116)
1: (124,124,124) #7C7C7C gray(124)
1: (132,132,132) #848484 gray(132)
5: (138,138,138) #8A8A8A gray(138)
3: (142,142,142) #8E8E8E gray(142)
1: (145,145,145) #919191 gray(145)
1: (146,146,146) #929292 gray(146)
3: (147,147,147) #939393 gray(147)
1: (148,148,148) #949494 gray(148)
6: (149,149,149) #959595 gray(149)
1: (150,150,150) #969696 gray(150)
33: (151,151,151) #979797 gray(151)
3: (152,152,152) #989898 gray(152)
1: (154,154,154) #9A9A9A gray(154)
20: (155,155,155) #9B9B9B gray(155)
13: (156,156,156) #9C9C9C gray(156)
3: (157,157,157) #9D9D9D gray(157)
36: (158,158,158) #9E9E9E gray(158)
19: (159,159,159) #9F9F9F gray(159)
57: (160,160,160) #A0A0A0 gray(160)
41: (162,162,162) #A2A2A2 gray(162)
33: (164,164,164) #A4A4A4 gray(164)
85: (166,166,166) #A6A6A6 gray(166)
128: (168,168,168) #A8A8A8 gray(168)
66: (170,170,170) #AAAAAA gray(170)
160: (172,172,172) #ACACAC gray(172)
136: (174,174,174) #AEAEAE gray(174)
204: (176,176,176) #B0B0B0 gray(176)
34: (177,177,177) #B1B1B1 gray(177)
59: (178,178,178) #B2B2B2 gray(178)
134: (179,179,179) #B3B3B3 gray(179)
165: (180,180,180) #B4B4B4 gray(180)
120: (181,181,181) #B5B5B5 gray(181)
371: (182,182,182) #B6B6B6 gray(182)
132: (183,183,183) #B7B7B7 gray(183)
104: (184,184,184) #B8B8B8 gray(184)
77: (185,185,185) #B9B9B9 gray(185)
92: (186,186,186) #BABABA gray(186)
423: (187,187,187) #BBBBBB gray(187)
949: (188,188,188) #BCBCBC gray(188)
237: (189,189,189) #BDBDBD gray(189)
135: (190,190,190) #BEBEBE gray(190)
253: (191,191,191) #BFBFBF gray(191)
436: (192,192,192) #C0C0C0 gray(192)
419: (194,194,194) #C2C2C2 gray(194)
510: (196,196,196) #C4C4C4 gray(196)
590: (198,198,198) #C6C6C6 gray(198)
811: (200,200,200) #C8C8C8 gray(200)
990: (202,202,202) #CACACA gray(202)
1246: (204,204,204) #CCCCCC gray(204)
1539: (206,206,206) #CECECE gray(206)
982: (208,208,208) #D0D0D0 gray(208)
928: (209,209,209) #D1D1D1 gray(209)
977: (210,210,210) #D2D2D2 gray(210)
1134: (211,211,211) #D3D3D3 gray(211)
1275: (212,212,212) #D4D4D4 gray(212)
1444: (213,213,213) #D5D5D5 gray(213)
2085: (214,214,214) #D6D6D6 gray(214)
1814: (215,215,215) #D7D7D7 gray(215)
2432: (216,216,216) #D8D8D8 gray(216)
2223: (217,217,217) #D9D9D9 gray(217)
2416: (218,218,218) #DADADA gray(218)
2834: (219,219,219) #DBDBDB gray(219)
2602: (220,220,220) #DCDCDC gray(220)
2849: (221,221,221) #DDDDDD gray(221)
3317: (222,222,222) #DEDEDE gray(222)
3842: (223,223,223) #DFDFDF gray(223)
8282: (224,224,224) #E0E0E0 gray(224)
9852: (226,226,226) #E2E2E2 gray(226)
11147: (228,228,228) #E4E4E4 gray(228)
13932: (230,230,230) #E6E6E6 gray(230)
16318: (232,232,232) #E8E8E8 gray(232)
18434: (234,234,234) #EAEAEA gray(234)
20005: (236,236,236) #ECECEC gray(236)
24433: (238,238,238) #EEEEEE gray(238)
12657: (240,240,240) #F0F0F0 gray(240)
12821: (241,241,241) #F1F1F1 gray(241)
13388: (242,242,242) #F2F2F2 gray(242)
14835: (243,243,243) #F3F3F3 gray(243)
14060: (244,244,244) #F4F4F4 gray(244)
15687: (245,245,245) #F5F5F5 gray(245)
15796: (246,246,246) #F6F6F6 gray(246)
17834: (247,247,247) #F7F7F7 gray(247)
20159: (248,248,248) #F8F8F8 gray(248)
22424: (249,249,249) #F9F9F9 gray(249)
22594: (250,250,250) #FAFAFA gray(250)
19611: (251,251,251) #FBFBFB gray(251)
20156: (252,252,252) #FCFCFC gray(252)
25883: (253,253,253) #FDFDFD gray(253)
6419695: (254,254,254) #FEFEFE gray(254)
413243: (255,255,255) #FFFFFF gray(255)
Rendering intent: Undefined
Gamma: 0.454545
Matte color: grey74
Background color: white
Border color: srgb(223,223,223)
Transparent color: none
Interlace: None
Intensity: Undefined
Compose: Over
Page geometry: 3296x2320+0+0
Dispose: Undefined
Iterations: 0
Compression: Undefined
Orientation: Undefined
Properties:
comment: Image generated by GPL Ghostscript (device=pgmraw)
date:create: 2019-01-30T09:01:26+01:00
date:modify: 2019-01-30T08:53:11+01:00
signature: 9cac1c1189c8b785b107c2cec0ab88b3931f2f38a9c462a8fc3fac95b370b89f
Artifacts:
verbose: true
Tainted: False
Filesize: 7646790B
Number pixels: 7646720
Pixels per second: 21.8478MP
User time: 0.240u
Elapsed time: 0:01.349
Version: ImageMagick 7.0.8-24 Q16 x86_64 2019-01-18 https://imagemagick.org
As far as I can tell, no resolution is indicated. If this is the problem, how can I make gs write the dpi into the header?
The PGM format doesn't have a resolution.
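To make that concrete, here is a minimal Python sketch of a reader for the binary PGM (P5) header. The function name `read_pgm_header` is hypothetical, and the sample header bytes are an assumption: they are reconstructed from the comment and geometry that `identify` reports in the question, not dumped from the actual file. The point is that the header carries a magic number, a width, a height, and a maximum gray value, and nothing else.

```python
import io


def read_pgm_header(stream):
    """Parse a binary PGM (P5) header and return (width, height, maxval)."""

    def next_token():
        # Tokens are separated by whitespace; '#' starts a comment that
        # runs to the end of the line.
        token = b""
        while True:
            ch = stream.read(1)
            if not ch:
                raise ValueError("unexpected end of header")
            if ch == b"#":
                stream.readline()  # discard the rest of the comment line
                if token:
                    return token
                continue
            if ch.isspace():
                if token:
                    return token
                continue
            token += ch

    if next_token() != b"P5":
        raise ValueError("not a binary PGM file")
    width, height, maxval = (int(next_token()) for _ in range(3))
    return width, height, maxval


# Assumed header, reconstructed from the identify output in the question.
header = (
    b"P5\n"
    b"# Image generated by GPL Ghostscript (device=pgmraw)\n"
    b"3296 2320\n"
    b"255\n"
)

dims = read_pgm_header(io.BytesIO(header))
print(dims)         # (3296, 2320, 255)
print(len(header))  # 70
```

If the reconstruction is right, those 70 bytes would account exactly for the difference between the reported Filesize (7646790B) and the pixel count (7646720), which leaves no room for a resolution field.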
In fact the "resolution" of a bitmap file is somewhat moot anyway. If I take an image 300x300 pixels, and I draw it one inch square on a piece of paper, then the resolution of the image on the media is 300 dpi. If I take the same image, and draw it two inches square, then the resolution of the image on the media is 150 dpi.
Same image, so what's the resolution?
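The arithmetic behind that example, as a small Python sketch. The helper names `dpi_on_media` and `physical_size_inches` are made up for illustration. The second helper runs the relation the other way: given the 3296x2320 pixel geometry from the question and the -r300 used when rendering, it gives the page size that resolution implies. That 300 has to come from outside the file, because the PGM itself only stores the pixel counts.

```python
def dpi_on_media(pixels, inches):
    """Resolution that results from drawing `pixels` across `inches` of media."""
    return pixels / inches


def physical_size_inches(width_px, height_px, dpi):
    """Physical size implied by a pixel geometry at an externally supplied dpi."""
    return width_px / dpi, height_px / dpi


# The same 300-pixel-wide image drawn at two different sizes.
print(dpi_on_media(300, 1))  # 300.0
print(dpi_on_media(300, 2))  # 150.0

# Geometry from the question, with the dpi taken from -r300 (not from the file).
width_in, height_in = physical_size_inches(3296, 2320, 300)
print(round(width_in, 2), round(height_in, 2))  # 10.99 7.73
```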
In any event, since the PGM format doesn't allow you to specify the resolution, you can't make Ghostscript write the resolution in the header.