Tags: javascript, node.js, google-cloud-platform, stream, google-cloud-dlp

How to use GCP DLP with a file stream


I'm working with Node.js and GCP Data Loss Prevention (DLP) to redact sensitive data from PDFs before I display them. GCP has great documentation on this.

Essentially, you pull in the Node.js library and run this:

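const fs = require('fs');
const DLP = require('@google-cloud/dlp');
const dlp = new DLP.DlpServiceClient();

// projectId, filepath, fileTypeConstant, minLikelihood, infoTypes and
// imageRedactionConfigs are assumed to be defined elsewhere.
// readFileSync already returns a Buffer, so it can be base64-encoded directly.
const fileBytes = fs.readFileSync(filepath).toString('base64');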

// Construct image redaction request
const request = {
  parent: `projects/${projectId}/locations/global`,
  byteItem: {
    type: fileTypeConstant,
    data: fileBytes,
  },
  inspectConfig: {
    minLikelihood: minLikelihood,
    infoTypes: infoTypes,
  },
  imageRedactionConfigs: imageRedactionConfigs,
};

// Run image redaction request
const [response] = await dlp.redactImage(request);
const image = response.redactedImage;

Normally, I'd get the file as a buffer and pass it to the DLP function as above. But I'm no longer getting our files as buffers: since many files are very large, we now get them from FilesStorage as streams, like so:

return FilesStorage.getFileStream(metaFileInfo1, metaFileInfo2, metaFileInfo3, fileId)
  .then(stream => {
    return {fileInfo, stream};
  });

Is it possible to perform DLP image redaction on a stream instead of a buffer? If so, how? I've found other questions that say you can stream with ByteContentItem, and GCP's own documentation mentions "streams". But I've tried passing the stream returned from .getFileStream into the byteItem['data'] property above, and it doesn't work.


Solution

  • Chunking the stream into buffers of an appropriate size is likely the best approach here. There are several ways to build a buffer from a stream.

    Potentially relevant: Convert stream into buffer?

    (A native stream interface would be a good feature request; it just isn't available yet.)
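
    As a rough illustration of the buffering approach, here is a minimal sketch. It assumes the stream returned by FilesStorage.getFileStream is a standard Node.js Readable (so it supports async iteration). The helper names streamToBuffer and redactImageFromStream, and the maxBytes guard, are hypothetical and not part of the DLP library. The data is base64-encoded before sending, mirroring the snippet in the question.

```javascript
// Collect a Readable stream into a single Buffer.
// maxBytes is a hypothetical safety cap so a huge file fails early
// instead of exhausting memory; pick a value that suits your limits.
async function streamToBuffer(stream, maxBytes = Infinity) {
  const chunks = [];
  let total = 0;
  for await (const chunk of stream) {
    // Chunks are Buffers unless the stream is in string mode.
    const buf = Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk);
    total += buf.length;
    if (total > maxBytes) {
      throw new Error(`Stream exceeds ${maxBytes} bytes`);
    }
    chunks.push(buf);
  }
  return Buffer.concat(chunks, total);
}

// Hypothetical wrapper: `dlp` is assumed to be a DlpServiceClient and
// `baseRequest` the request from the question, minus byteItem.data.
async function redactImageFromStream(dlp, stream, baseRequest) {
  const buffer = await streamToBuffer(stream);
  const [response] = await dlp.redactImage({
    ...baseRequest,
    byteItem: {
      ...baseRequest.byteItem,
      data: buffer.toString('base64'),
    },
  });
  return response.redactedImage;
}
```

    With the code from the question, this could be called as redactImageFromStream(dlp, stream, request) once getFileStream resolves. Note that the whole file still ends up in memory, so this only changes how the bytes are obtained, not the size of the request itself.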