postgresqlperlfilehandleperl-io

How to subclass IO::Handle to properly get a low level file handle without having a file or memory?


I have an app which accesses a PostgreSQL database and needs to read some large binary data out of it depending on some needed processing. This might be hundreds of MB or even some GB of data. Please no discussion about using file systems instead or such, it's the way it is now.

That data is simply files of various types, e.g. it might be a Zip container or some other kind of archive. Some of the needed processing is list the contents of the Zip, maybe even extract some members for further processing, maybe hash the stored data... In the end the data is read multiple times, but written only once to store it.

All of the Perl libs I use are able to work with file handles, some with IO::Handle, others with IO::String or IO::Scalar, some others only with low level file handles. So what I've done is create a subclass of IO::Handle and IO::Seekable which acts like a wrapper for the corresponding methods around DBD::Pg. In the CTOR I create a connection to the database, open some provided LOID for reading and store the handle provided by Postgres in the instance. My own handle object is then forwarded to whoever is able to work with such a file handle and can directly read and seek within the blob provided by Postgres.

The problem is libs which use low level file handles or low level file handle operations on IO::Handle. Digest::MD5 seems to be one, Archive::Zip another one. Digest::MD5 croaks and tells me that no handle has been provided, Archive::Zip on the other hand tries to create a new, own handle from mine, calls IO::Handle::fdopen and fails in my case.

sub fdopen {
    @_ == 3 or croak 'usage: $io->fdopen(FD, MODE)';
    my ($io, $fd, $mode) = @_;
    local(*GLOB);

    if (ref($fd) && "".$fd =~ /GLOB\(/o) {
    # It's a glob reference; Alias it as we cannot get name of anon GLOBs
    my $n = qualify(*GLOB);
    *GLOB = *{*$fd};
    $fd =  $n;
    } elsif ($fd =~ m#^\d+$#) {
    # It's an FD number; prefix with "=".
    $fd = "=$fd";
    }

    open($io, _open_mode_string($mode) . '&' . $fd)
    ? $io : undef;
}

I guess the problem is the low level copy of the handle, which removes my own instance, so there's no instance anymore having my database connection and all that stuff.

So, is it even possible in my case to provide some IO::Handle which successfully can be used wherever a low level file handle is expected?

I mean, I don't have a real file handle, I have an object only where method calls are wrapped to their corresponding Postgres methods, for which a database handle is needed and such. All of that data needs to be stored somewhere, the wrapping needs to be done etc.

I tried to do what others are doing, like IO::String, which additionally uses tie for example. But in the end that use case is different, because Perl is able to create a real low level file handle to some internal memory on its own. Something which is not supported at all in my case. I need to keep my instance around, because only that knows of the handle to the database etc.

Using my handle like an IO::Handle by calling method read and such works like expected, but I would like to take it a bit further and be more compatible to whoever doesn't expect to work on IO::Handle objects. Much like IO::String or File::Temp can be used as low level file handles.

package ReadingHandle;

use strict;
use warnings;
use 5.10.1;

use base 'IO::Handle', 'IO::Seekable';

use Carp ();

sub new
{
  my $invocant  = shift || Carp::croak('No invocant given.');
  my $db        = shift || Carp::croak('No database connection given.');
  my $loid      = shift // Carp::croak('No LOID given.');
  my $dbHandle  = $db->_getHandle();
  my $self      = $invocant->SUPER::new();

    *$self->{'dbHandle'}  = $dbHandle;
    *$self->{'loid'}      = $loid;
  my $loidFd              = $dbHandle->pg_lo_open($loid, $dbHandle->{pg_INV_READ});
    *$self->{'loidFd'}    = $loidFd;

  if (!defined($loidFd))
  {
    Carp::croak("The provided LOID couldn't be opened.");
  }

  return $self;
}

sub DESTROY
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  $self->close();
}

sub _getDbHandle
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return *$self->{'dbHandle'};
}

sub _getLoid
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return *$self->{'loid'};
}

sub _getLoidFd
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return *$self->{'loidFd'};
}

sub binmode
{
  my $self = shift || Carp::croak('The method needs to be called with an instance.');

  return 1;
}

sub close
{
  my $self      = shift || Carp::croak('The method needs to be called with an instance.');
  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();

  return $dbHandle->pg_lo_close($loidFd);
}

sub opened
{
  my $self    = shift || Carp::croak('The method needs to be called with an instance.');
  my $loidFd  = $self->_getLoidFd();

  return defined($loidFd) ? 1 : 0;
}

sub read
{
  my $self    = shift || Carp::croak('The method needs to be called with an instance.');
  my $buffer  =\shift // Carp::croak('No buffer given.');
  my $length  = shift // Carp::croak('No amount of bytes to read given.');
  my $offset  = shift || 0;

  if ($offset > 0)
  {
    Carp::croak('Using an offset is not supported.');
  }

  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();

  return $dbHandle->pg_lo_read($loidFd, $buffer, $length);
}

sub seek
{
  my $self    = shift || Carp::croak('The method needs to be called with an instance.');
  my $offset  = shift // Carp::croak('No offset given.');
  my $whence  = shift // Carp::croak('No whence given.');

  if ($offset < 0)
  {
    Carp::croak('Using a negative offset is not supported.');
  }
  if ($whence != 0)
  {
    Carp::croak('Using a whence other than 0 is not supported.');
  }

  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();
  my $retVal    = $dbHandle->pg_lo_lseek($loidFd, $offset, $whence);
     $retVal    = defined($retVal) ? 1 : 0;

  return $retVal;
}

sub tell
{
  my $self      = shift || Carp::croak('The method needs to be called with an instance.');
  my $dbHandle  = $self->_getDbHandle();
  my $loidFd    = $self->_getLoidFd();
  my $retVal    = $dbHandle->pg_lo_lseek($loidFd);
     $retVal    = defined($retVal) ? $retVal : -1;

  return $retVal;
}

1;

Solution

  • There is a way around this, but it is a bit weird. Your requirements are basically threefold, if I'm reading your code and comments correctly:

    1. Work like a normal file handle/IO::Handle object as much as possible, make the fact that it's not a real file invisible to the user.
    2. Work with Archive::Zip, which is implemented mostly in regular Perl, and which calls the IO::Handle::fdopen code you posted, which fails to duplicate the handle since it's not a real handle.
    3. Work with Digest::MD5, which is implemented in XS using PerlIO. Since tie-based tricks and perl in-memory "fake" filehandles are not usable at that level, it's tricker than 2.

    You can achieve all three of those by using PerlIO layers with PerlIO::via. The code is similar to what you'd write with tie (implement some required behavior methods). Additionally, you can harness the "open variable as file" functionality of open and the pre-rolled IO::Seekable + IO::Handle functionality of IO::File to simplify achieving requirement 1 above (make it usable in Perl code in the same way normal IO::Handle objects are).

    Below is a sample package that does what you need. It has a few caveats:

    Package:

    package TiedThing;
    
    use strict;
    use warnings;
    use parent "IO::File";
    
    our @pushargs;
    sub new {
        my ( $class, $args ) = @_;
        # Build a glob to be used by the PerlIO methods. This does two things:
        # 1. Gets us a place to stick a shared hashref so PerlIO methods and user-
        # -defined object methods can manipulate the same data. They must use the
        # {args} glob field to do that; new fields written will .
        # 2. Unifies the ways of addressing that across custom functions and PerlIO
        # functions. We could just pass a hashref { args => $args } into PUSHED, but
        # then we'd have to remember "PerlIO functions receive a blessed hashref,
        # custom functions receive a blessed glob" which is lame.
        my $glob = Symbol::gensym();
        *$glob->{args} = $args;
        local @pushargs = ($glob, $class);
        my $self = $class->SUPER::new(\my $unused, "<:via($class)");
        *$self->{args} = $args;
        return $self;
    }
    
    sub custom {
        my $self = shift;
        return *$self->{args}->{customvalue};
    }
    
    sub PUSHED { return bless($pushargs[0], $pushargs[1]); }
    
    sub FILL { return shift(@{*$_[0]->{args}->{lines}}); }
    
    1;
    

    Example usage:

    my $object = TiedThing->new({
        lines => [join("\n", 1..9, 1..9)],
        customvalue => "custom!",
    });
    say "can call custom method: " . $object->custom;
    say "raw read with <>: " . <$object>;
    my $buf;
    read($object, $buf, 10);
    say "raw read with read(): " . $buf;
    undef $buf;
    $object->read($buf, 10);
    say "OO read via IO::File::read (end): " . $buf;
    my $checksummer = Digest::MD5->new;;
    $checksummer->addfile($object);
    say "Md5 read: " . $checksummer->hexdigest;
    my $dupto = IO::Handle->new;
    # Doesn't break/return undef; still not usable without implementing
    # more state sharing inside the object.
    say "Can dup handle: " . $dupto->fdopen($object, "r");
    
    my $archiver = Archive::Zip->new;
    # Dies, but long after the fdopen() call. Can be fixed by implementing more
    # PerlIO methods.
    $archiver->readFromFileHandle($object);