json dita cloud-document-ai

Has anyone used Google Document AI to convert a PDF into DITA files?


I am playing with Google Document AI, but I am unsure what the possibilities are. Has anyone created a model that can read a PDF and split it into appropriate DITA topics, or into separate JSON files for each identified DITA topic? Any tips or help are appreciated.


Solution

  • Slight clarification for https://stackoverflow.com/a/76021683/6216983


    The general Document Splitter processor isn't recommended for production use cases.

    Instead, use the Custom Document Splitter (currently requires allowlisting), the Procurement Splitter & Classifier, or the Lending Splitter & Classifier, depending on the type of document.

    Splitters identify page boundaries, but do not actually split the input document for you.

    You can use the Document AI Toolbox SDK to split the original PDF based on the identified page boundaries.
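
    To make the "boundaries, not files" point concrete, here is a minimal sketch of reading the page ranges out of a splitter result yourself. It assumes the result has been saved in the Document JSON form, where each entity carries a `pageAnchor.pageRefs` list; the `page_ranges` helper and the `sample` data are hypothetical and only illustrate the shape. The Toolbox SDK does this for you, so treat it as an illustration rather than a replacement.

```python
from typing import Any


def page_ranges(document: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one {"type", "pages"} record per entity the splitter emitted."""
    ranges = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # Assumption: in the JSON form, page indices are 0-based, serialized
        # as strings, and omitted entirely when the value is 0.
        pages = sorted(int(ref.get("page", 0)) for ref in refs)
        if pages:
            ranges.append({"type": entity.get("type", ""), "pages": pages})
    return ranges


# Hypothetical splitter output: a two-page invoice followed by a one-page receipt.
sample = {
    "entities": [
        {"type": "invoice", "pageAnchor": {"pageRefs": [{}, {"page": "1"}]}},
        {"type": "receipt", "pageAnchor": {"pageRefs": [{"page": "2"}]}},
    ]
}

print(page_ranges(sample))
```

    Each record can then drive whatever PDF library you prefer to write the corresponding pages to a new file.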

    Document AI doesn't currently have built-in support for DITA topics. If you can provide more context for the use case, I can report this as a feature request to the product development team.
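
    Until then, the DITA step has to be a post-processing stage of your own. As a rough sketch, assuming you have already pulled a title and paragraph text for one section out of the processed document, a minimal generic DITA topic can be written with the standard library. The `to_dita_topic` helper and its arguments are hypothetical, and deciding where one topic ends and the next begins is left to your own logic.

```python
import xml.etree.ElementTree as ET

# Public identifier for a generic DITA topic; the system identifier may need
# adjusting to match the DTD location used by your DITA toolchain.
DOCTYPE = '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">'


def to_dita_topic(topic_id: str, title: str, paragraphs: list[str]) -> str:
    """Build a minimal DITA topic: a title plus one <p> per paragraph."""
    # topic_id must be a valid XML ID (no spaces, must not start with a digit).
    topic = ET.Element("topic", {"id": topic_id})
    ET.SubElement(topic, "title").text = title
    body = ET.SubElement(topic, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    # ElementTree escapes &, < and > in text content automatically.
    xml = ET.tostring(topic, encoding="unicode")
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{DOCTYPE}\n{xml}\n'


print(to_dita_topic("install_overview", "Installation overview",
                    ["Unpack the archive.", "Run the installer."]))
```

    The same per-section data could be dumped with `json.dumps` instead if separate JSON files per topic are the goal.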