I use tika-core and tika-parsers-standard-package (version 2.9.0).
I want to parse a PDF file. When I process it, I see that Tika correctly identifies the type (application/pdf).
I expected this type to be handled by PDFParser, but the file is actually processed by EmptyParser.
Next, I decided to check whether Tika had created a PDFParser at all and whether it was registered.
I stepped into the CompositeParser class and watched how types and parsers are added (the MediaTypeRegistry registry and List<Parser> parsers variables).
PDFParser is not added to the parsers list.
So when Tika processes a PDF file, it cannot find PDFParser and cannot parse the file.
Then I checked which types and parsers were actually registered, and PDFParser really is missing:
org.apache.tika.parser.microsoft.chm.ChmParser@3ef5992e = application/vnd.ms-htmlhelp
org.apache.tika.parser.mail.RFC822Parser@681e1dee = message/rfc822
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/vnd.visio
org.apache.tika.parser.feed.FeedParser@6d16750 = application/atom+xml
org.apache.tika.parser.image.ImageParser@14ffa24d = image/x-xcf
org.apache.tika.parser.microsoft.WMFParser@c962c51 = image/wmf
org.apache.tika.parser.audio.MidiParser@44a2894a = audio/midi
org.apache.tika.parser.mat.MatParser@49449d69 = application/x-matlab-data
org.apache.tika.parser.external.CompositeExternalParser@20c0a3c6 = video/x-msvideo
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/deflate64
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-powerpoint.slideshow.macroenabled.12
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.presentationml.slide
org.apache.tika.parser.iwork.IWorkPackageParser@20f3b603 = application/vnd.apple.keynote
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.spreadsheet-template
org.apache.tika.parser.mp4.MP4Parser@576ae06d = audio/mp4
org.apache.tika.parser.xliff.XLIFF12Parser@14307e14 = application/x-xliff+xml
org.apache.tika.parser.wordperfect.QuattroProParser@29ce813f = application/x-quattro-pro; version=9
org.apache.tika.parser.epub.EpubParser@6b7e49f1 = application/x-ibooks+zip
org.apache.tika.parser.apple.PListParser@6b97b57b = application/x-plist
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-word.document.macroenabled.12
org.apache.tika.parser.iwork.iwana.IWork13PackageParser@7c40a6c3 = application/vnd.apple.unknown.13
org.apache.tika.parser.mp4.MP4Parser@576ae06d = application/mp4
org.apache.tika.parser.audio.MidiParser@44a2894a = application/x-midi
org.apache.tika.parser.feed.FeedParser@6d16750 = application/rss+xml
org.apache.tika.parser.html.HtmlParser@21da170d = text/html
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-visio.template
org.apache.tika.parser.csv.TextAndCSVParser@38f2e306 = text/csv
org.apache.tika.parser.image.ImageParser@14ffa24d = image/vnd.microsoft.icon
org.apache.tika.parser.mp4.MP4Parser@576ae06d = video/quicktime
org.apache.tika.parser.mp4.MP4Parser@576ae06d = video/mp4
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/zlib
org.apache.tika.parser.wacz.WACZParser@6e18b50e = application/x-wacz
org.apache.tika.parser.iwork.IWorkPackageParser@20f3b603 = application/vnd.apple.numbers
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/x-archive
org.apache.tika.parser.microsoft.ooxml.xwpf.ml2006.Word2006MLParser@66846f20 = application/vnd.ms-word2006ml
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/x-tika-msoffice
org.apache.tika.parser.font.AdobeFontMetricParser@49c15a25 = application/x-font-adobe-metric
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-visio.drawing
org.apache.tika.parser.mp4.MP4Parser@576ae06d = video/x-m4v
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/java-archive
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/sldworks
org.apache.tika.parser.http.HttpParser@625a36d6 = application/x-httpresponse
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/x-tika-ooxml-protected
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-excel.sheet.macroenabled.12
org.apache.tika.parser.image.HeifParser@4e64d8cd = image/heif
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-word.template.macroenabled.12
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/vnd.ms-outlook
org.apache.tika.parser.image.HeifParser@4e64d8cd = image/heic
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-compress
org.apache.tika.parser.dwg.DWGParser@bdc534e = image/vnd.dwg
org.apache.tika.parser.code.SourceCodeParser@480f2062 = text/x-groovy
org.apache.tika.parser.mp4.MP4Parser@576ae06d = video/3gpp
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.spreadsheetml.template
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = model/vnd.dwfx+xps
org.apache.tika.parser.external.CompositeExternalParser@20c0a3c6 = video/mpeg
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.chart-template
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/zip
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.text-master
org.apache.tika.parser.odf.FlatOpenDocumentParser@7bd04333 = application/vnd.oasis.opendocument.tika.flat.document
org.apache.tika.parser.image.PSDParser@139a5e34 = image/vnd.adobe.photoshop
org.apache.tika.parser.image.ImageParser@14ffa24d = image/gif
org.apache.tika.parser.executable.ExecutableParser@7ab5292a = application/x-sharedlib
org.apache.tika.parser.asm.ClassParser@4b880a2e = application/java-vm
org.apache.tika.parser.image.WebPParser@32b801b7 = image/webp
org.apache.tika.parser.microsoft.activemime.ActiveMimeParser@7392922a = application/x-activemime
org.apache.tika.parser.indesign.IDMLParser@203d190b = application/vnd.adobe.indesign-idml-package
org.apache.tika.parser.html.HtmlParser@21da170d = application/vnd.wap.xhtml+xml
org.gagravarr.tika.OggParser@2b6fbe58 = video/ogg
org.apache.tika.parser.apple.AppleSingleFileParser@5928a73b = application/applefile
org.apache.tika.parser.audio.AudioParser@3ecc6df6 = audio/x-aiff
org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser@76f1ce65 = application/vnd.ms-spreadsheetml
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/msword
org.apache.tika.parser.iwork.iwana.IWork13PackageParser@7c40a6c3 = application/vnd.apple.numbers.13
org.apache.tika.parser.apple.PListParser@6b97b57b = application/x-bplist-memgraph
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-java-pack200
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.image-template
org.apache.tika.parser.warc.WARCParser@20ba239d = application/warc
org.apache.tika.parser.microsoft.rtf.RTFParser@3cb759ce = application/rtf
org.apache.tika.parser.image.BPGParser@61c93073 = image/bpg
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.text
org.apache.tika.parser.microsoft.TNEFParser@19b358c9 = application/x-tnef
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-xz
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-powerpoint.template.macroenabled.12
org.apache.tika.parser.image.ImageParser@14ffa24d = image/vnd.wap.wbmp
org.apache.tika.parser.crypto.Pkcs7Parser@686426a9 = application/pkcs7-mime
org.apache.tika.parser.executable.ExecutableParser@7ab5292a = application/x-executable
org.apache.tika.parser.executable.ExecutableParser@7ab5292a = application/x-coredump
org.apache.tika.parser.microsoft.JackcessParser@6e1b6b42 = application/x-msaccess
org.apache.tika.parser.iwork.iwana.IWork18PackageParser@4679945a = application/vnd.apple.numbers.18
org.apache.tika.parser.csv.TextAndCSVParser@38f2e306 = text/plain
org.apache.tika.parser.image.ImageParser@14ffa24d = image/png
org.apache.tika.parser.microsoft.pst.OutlookPSTParser@450fe94c = application/vnd.ms-outlook-pst
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/x-cpio
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/x-tika-msworks-spreadsheet
org.apache.tika.parser.iwork.IWorkPackageParser@20f3b603 = application/vnd.apple.pages
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-xpsdocument
org.gagravarr.tika.VorbisParser@c97ce2a = audio/ogg
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-visio.template.macroenabled.12
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/x-tar
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.presentation-template
org.apache.tika.parser.apple.PListParser@6b97b57b = application/x-bplist
org.apache.tika.parser.image.ImageParser@14ffa24d = image/x-jbig2
org.apache.tika.parser.dbf.DBFParser@39e3f581 = application/x-dbf
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-excel.template.macroenabled.12
org.apache.tika.parser.mbox.MboxParser@33326da3 = application/mbox
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.formula
org.apache.tika.parser.microsoft.chm.ChmParser@3ef5992e = application/chm
org.apache.tika.parser.microsoft.OldExcelParser@9cda700 = application/vnd.ms-excel.workspace.3
org.apache.tika.parser.microsoft.OldExcelParser@9cda700 = application/vnd.ms-excel.workspace.4
org.apache.tika.parser.image.BPGParser@61c93073 = image/x-bpg
org.apache.tika.parser.xliff.XLZParser@3b99ce8d = application/x-xliff+zip
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.wordprocessingml.template
org.apache.tika.parser.iwork.IWorkPackageParser@20f3b603 = application/vnd.apple.iwork
org.apache.tika.parser.image.HeifParser@4e64d8cd = image/heic-sequence
org.apache.tika.parser.microsoft.OldExcelParser@9cda700 = application/vnd.ms-excel.sheet.2
org.apache.tika.parser.microsoft.OldExcelParser@9cda700 = application/vnd.ms-excel.sheet.3
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-powerpoint.presentation.macroenabled.12
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-brotli
org.apache.tika.parser.dif.DIFParser@1df07b82 = application/dif+xml
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/vnd.ms-excel
org.apache.tika.parser.microsoft.OldExcelParser@9cda700 = application/vnd.ms-excel.sheet.4
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/x-tika-ole-drm-encrypted
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/vnd.ms-project
org.apache.tika.parser.dgn.DGN8Parser@3d0f52ea = image/vnd.dgn; version=8
org.apache.tika.parser.epub.EpubParser@6b7e49f1 = application/epub+zip
org.apache.tika.parser.sas.SAS7BDATParser@591e17ec = application/x-sas-data
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-snappy
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.text-template
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.presentationml.presentation
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-visio.stencil
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-visio.stencil.macroenabled.12
org.apache.tika.parser.apple.PListParser@6b97b57b = application/x-bplist-webarchive
org.apache.tika.parser.xml.DcXMLParser@2c0657e4 = application/xml
org.apache.tika.parser.odf.FlatOpenDocumentParser@7bd04333 = application/vnd.oasis.opendocument.flat.presentation
org.apache.tika.parser.image.ImageParser@14ffa24d = image/bmp
org.apache.tika.parser.wordperfect.WordPerfectParser@32b4058d = application/vnd.wordperfect; version=6.x
org.apache.tika.parser.html.HtmlParser@21da170d = application/xhtml+xml
org.apache.tika.parser.crypto.Pkcs7Parser@686426a9 = application/pkcs7-signature
org.apache.tika.parser.wordperfect.WordPerfectParser@32b4058d = application/vnd.wordperfect; version=5.1
org.apache.tika.parser.wordperfect.WordPerfectParser@32b4058d = application/vnd.wordperfect; version=5.0
org.apache.tika.parser.code.SourceCodeParser@480f2062 = text/x-java-source
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.sun.xml.writer
org.apache.tika.parser.audio.AudioParser@3ecc6df6 = audio/basic
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.formula-template
org.apache.tika.parser.tmx.TMXParser@75eb9933 = application/x-tmx
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-powerpoint.addin.macroenabled.12
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/vnd.ms-powerpoint
org.apache.tika.parser.crypto.TSDParser@5ce35115 = application/timestamped-data
org.apache.tika.parser.code.SourceCodeParser@480f2062 = text/x-c++src
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.presentationml.template
org.apache.tika.parser.apple.PListParser@6b97b57b = application/x-bplist-itunes
org.apache.tika.parser.image.TiffParser@42c4941a = image/tiff
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-excel.addin.macroenabled.12
org.apache.tika.parser.microsoft.xml.WordMLParser@50ce79a2 = application/vnd.ms-wordml
org.apache.tika.parser.executable.ExecutableParser@7ab5292a = application/x-object
org.apache.tika.parser.html.HtmlParser@21da170d = application/x-asp
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/x-mspublisher
org.apache.tika.parser.hwp.HwpV5Parser@79455089 = application/x-hwp-v5
org.apache.tika.parser.pkg.RarParser@3f68dd4d = application/x-rar-compressed
org.apache.tika.parser.image.HeifParser@4e64d8cd = image/heif-sequence
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.graphics-template
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.wordprocessingml.document
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-powerpoint.slide.macroenabled.12
org.apache.tika.parser.microsoft.TNEFParser@19b358c9 = application/vnd.ms-tnef
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.text-web
org.apache.tika.parser.mif.MIFParser@59eaeafe = application/x-maker
org.apache.tika.parser.microsoft.EMFParser@7aca8b08 = image/emf
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-bzip
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.graphics
org.apache.tika.parser.iptc.IptcAnpaParser@29d85dc0 = text/vnd.iptc.anpa
org.apache.tika.parser.iwork.iwana.IWork18PackageParser@4679945a = application/vnd.apple.keynote.18
org.apache.tika.parser.microsoft.OfficeParser@1e7ad0d1 = application/x-tika-msoffice-embedded; format=ole10_native
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/x-arj
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-lzma
org.apache.tika.parser.mp4.MP4Parser@576ae06d = video/3gpp2
org.apache.tika.parser.mp3.Mp3Parser@50a37f03 = audio/mpeg
org.apache.tika.parser.iwork.iwana.IWork13PackageParser@7c40a6c3 = application/vnd.apple.keynote.13
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-lz4
org.apache.tika.parser.odf.FlatOpenDocumentParser@7bd04333 = application/vnd.oasis.opendocument.flat.spreadsheet
org.apache.tika.parser.audio.AudioParser@3ecc6df6 = audio/vnd.wave
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.presentation
org.apache.tika.parser.mif.MIFParser@59eaeafe = application/vnd.mif
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/x-7z-compressed
org.apache.tika.parser.image.JXLParser@5e10d2f3 = image/jxl
org.apache.tika.parser.executable.ExecutableParser@7ab5292a = application/x-msdownload
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.chart
org.apache.tika.parser.image.JpegParser@70ac80f2 = image/jpeg
org.apache.tika.parser.image.ICNSParser@1bf1e57a = image/icns
org.gagravarr.tika.VorbisParser@c97ce2a = audio/vorbis
org.gagravarr.tika.OggParser@2b6fbe58 = application/ogg
org.apache.tika.parser.xml.DcXMLParser@2c0657e4 = image/svg+xml
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-excel.sheet.binary.macroenabled.12
org.apache.tika.parser.warc.WARCParser@20ba239d = application/warc+gz
org.apache.tika.parser.microsoft.onenote.OneNoteParser@50e1da60 = application/onenote; format=one
org.apache.tika.parser.video.FLVParser@8be1dc5 = video/x-flv
org.apache.tika.parser.microsoft.MSOwnerFileParser@1c01b730 = application/x-ms-owner
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/gzip
org.apache.tika.parser.pkg.PackageParser@171760b0 = application/x-tika-unix-dump
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.spreadsheet
org.apache.tika.parser.iwork.iwana.IWork18PackageParser@4679945a = application/vnd.apple.pages.18
org.apache.tika.parser.odf.OpenDocumentParser@588ff48 = application/vnd.oasis.opendocument.image
org.apache.tika.parser.pkg.CompressorParser@1d89e72b = application/x-bzip2
org.apache.tika.parser.iwork.iwana.IWork13PackageParser@7c40a6c3 = application/vnd.apple.pages.13
org.apache.tika.parser.xml.FictionBookParser@84f4bff = application/x-fictionbook+xml
org.apache.tika.parser.odf.FlatOpenDocumentParser@7bd04333 = application/vnd.oasis.opendocument.flat.text
org.apache.tika.parser.executable.ExecutableParser@7ab5292a = application/x-elf
org.apache.tika.parser.csv.TextAndCSVParser@38f2e306 = text/tsv
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.ms-visio.drawing.macroenabled.12
org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4db1d509 = application/vnd.openxmlformats-officedocument.presentationml.slideshow
org.apache.tika.parser.microsoft.chm.ChmParser@3ef5992e = application/x-chm
org.apache.tika.parser.font.TrueTypeParser@5623a24c = application/x-font-ttf
org.apache.tika.parser.prt.PRTParser@2e0bcd78 = application/x-prt
tika 2.9.0
This list does not contain the required PDFParser, although it should be there by default: after the build I can see tika-parser-pdf-module-2.9.0.jar in my lib directory.
With Tika 1.25 the list looks similar, except that it does contain PDFParser.
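One way to check whether the parser class can be created at all, independently of how Tika registers parsers, is to load and instantiate it by name. The sketch below uses only JDK reflection; ParserLoadProbe and probe are hypothetical names chosen for illustration, not part of Tika. As far as I understand, a parser whose class fails to load may simply be left out of the list without any message, so a direct attempt like this can reveal the underlying error.

```java
import java.lang.reflect.InvocationTargetException;

final class ParserLoadProbe {

    /** Tries to load the class and call its no-arg constructor; returns "OK" or the failure. */
    static String probe(String className) {
        try {
            Class<?> type = Class.forName(className);
            type.getDeclaredConstructor().newInstance();
            return "OK";
        } catch (InvocationTargetException e) {
            // The constructor itself threw; report the real cause.
            return "constructor failed: " + e.getCause();
        } catch (ReflectiveOperationException | LinkageError e) {
            // ClassNotFoundException, NoClassDefFoundError and similar end up here.
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        // Class name of the PDF parser shipped in tika-parser-pdf-module.
        System.out.println(probe("org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

Run it with the same classpath as the application (for example the contents of the lib directory). A result other than OK should name the class or package that could not be resolved.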
My pom.xml file with Tika 2.9.0:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-bom</artifactId>
            <version>2.9.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <type>jar</type>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
    </dependency>
    ...
</dependencies>
How can I add or activate PDFParser? It is supposed to work out of the box by default.
Thank you in advance.
I had not noticed that I had added a pdfbox dependency (2.0.21) to my pom.xml, and that version was missing a package that Tika needs.
So an error occurred because this library had been added manually.
When I saw that Tika 2.9.0 already uses pdfbox 2.0.29, I removed my own dependency, so that the new 2.0.29 is loaded instead of the old 2.0.21.
After that, Tika was able to create a PDFParser.
So check whether your own dependencies override the ones that Tika brings in.
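To see which jar a class is actually loaded from at runtime, a small check like the following may help. It is only a sketch using the JDK's ProtectionDomain and CodeSource; JarLocator and locate are hypothetical names, and org.apache.pdfbox.pdmodel.PDDocument is used here simply as a well-known pdfbox class.

```java
import java.security.CodeSource;

final class JarLocator {

    /** Returns the location (usually a jar URL) the class was loaded from. */
    static String locate(String className) {
        try {
            // Load without running static initializers; only the origin matters here.
            Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null
                    ? "(JDK / bootstrap class path)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(locate("org.apache.pdfbox.pdmodel.PDDocument"));
        System.out.println(locate("org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

If the printed pdfbox jar has a version other than the one Tika expects (2.0.29 in my case), some dependency is overriding it. Running mvn dependency:tree -Dincludes=org.apache.pdfbox should show where that version comes from.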