oracle-databaseperlutf-8dbi

How do I handle Unicode with DBD::Oracle?


The perl DBI documentation says this :

Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.

DBD::Sqlite with parameter (sqlite_unicode => 1), or DBD::Pg with parameter (pg_enable_utf8 => -1) -- which is the default -- indeed do such conversions.

With DBD::Oracle (v1.83, NLS_LANG='FRENCH_FRANCE.UTF8') it is not so : if non-Unicode strings are passed to INSERT or UPDATE statements, the driver does not upgrade them automatically to utf8.

Here is my test suite. Variants for SQLite and Pg succeed, but this Oracle variant fails :

use utf8;
use strict;
use warnings;
use Test::More;
use SQL::Abstract::More;
use Scalar::Util   qw/looks_like_number/;
use DBI;

my @DBI_CONNECT_ARGS = @ARGV;
my ($table, $key_col, $val_col) = qw/TST_UTF8 KEY VAL/; # assuming this table is already created


binmode $_, ':utf8' for *STDERR, *STDOUT;

# strings for tests
my %str;
$str{utf8}        = "il était une bergère";                             # has flag utf8 because of 'use utf8'
$str{native}      = $str{utf8}; utf8::downgrade($str{native});          # without flag utf8
$str{wide_chars}  = "il était une bergère♥♡";                          # chars > 256 - cannot be a native string (\x{2665}\x{2661})
$str{named_chars} = "il \N{LATIN SMALL LETTER E WITH ACUTE}tait une "   # identical to string 'wide_chars'
                  . "berg\N{LATIN SMALL LETTER E WITH GRAVE}re"
                  . "\N{BLACK HEART SUIT}\N{WHITE HEART SUIT}";


# check that test strings meet expectations
ok  utf8::is_utf8($str{utf8}),          "perl string with utf8 flag";
ok !utf8::is_utf8($str{native}),        "perl string without utf8 flag, (native chars ... latin1)";
is $str{utf8}, $str{native},            "strings 'utf8' and 'native' have different encodings but represent the same chars";
ok utf8::is_utf8($str{wide_chars}),     "string with wide chars must have utf8 flag";
ok utf8::is_utf8($str{named_chars}),    "string with named wide chars must have utf8 flag";
is $str{wide_chars}, $str{named_chars}, "named chars are identical to chars from perl source";



my $dbh   = DBI->connect(@DBI_CONNECT_ARGS);
my $sqlam = SQL::Abstract::More->new;
my ($sql, @bind);


# suppress records from previous run
my @k = keys %str;
($sql, @bind) = $sqlam->delete(-from => $table, -where => {$key_col => {-in => \@k}});
my $del = $dbh->do($sql, {}, @bind);
note "DELETED $del records";

# insert strings via bind values
while (my ($key, $val) = each %str) {
  ($sql, @bind) = $sqlam->insert(-into => $table, -values => {$key_col => $key, $val_col => $val});
  my $ins = $dbh->do($sql, {}, @bind);
  note "INSERT via bind $key: $ins";
}

# read data back
($sql, @bind) = $sqlam->select(-from    => $table,
                               -columns => [$key_col, $val_col],
                               -where   => {$key_col => {-in => \@k}});
my $rows = $dbh->selectall_arrayref($sql, {}, @bind);
my %str_from_db = map {@$_} @$rows;

# check round trip
is_deeply \%str_from_db, \%str, 'round trip with bind values';

# suppress again
($sql, @bind) = $sqlam->delete(-from => $table, -where => {$key_col => {-in => \@k}});
$del = $dbh->do($sql, {}, @bind);
note "DELETED $del records";

# insert strings via raw sql
while (my ($key, $val) = each %str) {
  my $ins = $dbh->do("INSERT INTO $table($key_col, $val_col) VALUES ('$key', '$val')");
  note "INSERT via raw SQL $key: $ins";
}

# check round trip
is_deeply \%str_from_db, \%str, 'round trip with raw SQL';

As a workaround, I added some callbacks for automatic upgrading of native strings; with this addition the tests pass :

   $dbh->{Callbacks}{prepare} = sub {
      # warn "PREPARE : upgrading stmt: $_[1]\n";
      utf8::upgrade($_[1]);
      return;
    };
    $dbh->{Callbacks}{ChildCallbacks}{execute} = sub {
      # warn "EXECUTE: ";
      foreach my $i (1 .. $#_) {
        if ($_[$i] && ! ref $_[$i] && ! looks_like_number(($_[$i]))) {
          # warn "upgrading $i : $_[$i];";
          utf8::upgrade($_[$i]);
        }
      }
      print STDERR "\n";
      return;
    };

If I understand properly the DBI spec, this automatic upgrade should be performed by the DBD::Oracle driver, not by the application code. Or am i missing something ?


Solution

  • (cross-posted from https://github.com/perl5-dbi/DBD-Oracle/issues/161)

    I studied the situation deeper and came to the conclusion that even if DBD::Oracle did automatically upgrade native strings, as DBD:SQLite or DBD::Pg do, it would not completely resolve the problem.

    Perl utf8::upgrade assumes an iso-8859-1 charset; so if this is performed automatically, it is not convenient when the native charset is something different, like for example Windows-1252. So I wrote a new CPAN module that lets client explicitly specify what is the charset to decode from : https://metacpan.org/pod/DBIx::AutoUpgrade::NativeStrings .

    Since I no longer consider that DBD::Oracle should absolutely do the upgrade, I'll close the issue.