I want my programs to be able to share their data, or even reuse those data on their next run. For big applications, that might mean using a database server such as Postgres or MariaDB, but then every application needs access to the database, whether that's permission to use it or being online to reach it. There are plenty of books that cover those solutions, but not many cover the other situations.
In this chapter, I cover the lightweight solutions that don’t require a database server or a central resource. Instead, I can store data in regular files and pass those around liberally. I don’t need to install a database server, add users, create a web service, or keep everything running. My program output can become the input for the next program in a pipeline.
Perl-specific formats produce data that only makes sense to a single programming language and is practically useless to other programming languages. That's not to say that some other programmer can't read it, just that they might have to do a lot of work to create a parser to understand it.
The pack built-in takes data and turns it into a single string according to a template that I provide. It's similar to sprintf, although, as the pack name suggests, the output string uses space as efficiently as it can:
#!/usr/bin/perl
# pack.pl
my $packed = pack( 'NCA*', 31415926, 32, 'Perl' );

print 'Packed string has length [' . length( $packed ) . "]\n";
print "Packed string is [$packed]\n";
The string that pack creates in this case is shorter than just stringing together the characters that make up the data, and certainly not as easy to read for humans:
Packed string has length [9]
Packed string is [öˆ Perl]
The format string NCA* has one (Latin) letter for each of the rest of my arguments to pack, with an optional modifier, in this case the *, after the last letter. My template tells pack how I want to store my data. The N treats its argument as a network-order unsigned long. The C treats its argument as an unsigned char, and the A treats its argument as an ASCII character. After the A I use a * as a repeat count to apply it to all the characters in its argument. Without the *, I would only pack the first character in Perl.
Once I have my packed string, I can write it to a file, send it over a socket, or do anything else I can do with a chunk of data. When I want to get back my data, I use unpack with the same template string:
my( $long, $char, $ascii ) = unpack( "NCA*", $packed );

print <<"HERE";
Long:  $long
Char:  $char
ASCII: $ascii
HERE
As long as I’ve done everything correctly, I get back the data I started with:
Long:  31415926
Char:  32
ASCII: Perl
There are many other formats I can use in the template string, including many sorts of number formats and storage. If I want to inspect a string to see exactly what's in it, I can unpack it with the H format to turn it into a hex string. I don't have to unpack the string in $packed with the same template I used to create it:
my $hex = unpack( "H*", $packed );
print "Hex is [$hex]\n";
I can now see the hex values for the individual bytes in the string:
Hex is [01df5e76205065726c]
Since I can control the length of the packed string through its template, I can pack several data together to form a record for a flat file database that I can access randomly. Suppose my record comprises the ISBN, title, and author for a book. I can use three different A formats, giving each a length specifier. For each length, pack will either truncate the argument if it is too long or pad it with spaces if it's shorter:
#!/usr/bin/perl
# isbn-record.pl

my( $isbn, $title, $author ) =
    ( '144939311X', 'Mastering Perl', 'brian d foy' );

my $record = pack( "A10 A20 A20", $isbn, $title, $author );
print "Record: [$record]\n";
The record is exactly 50 characters long, no matter which data I give it:
Record: [144939311XMastering Perl      brian d foy         ]
When I store this in a file along with several other records, I always know that the next 50 bytes is another record. The seek built-in puts me in the right position, and I can read an exact number of bytes with sysread:
open my $fh, '<', 'books.dat' or die ...;

seek $fh, 50 * $ARGV[0], 0;      # move to the right record
sysread $fh, my( $record ), 50;  # read the next record
The unpack built-in is handy for reading binary formats quickly. Here's a bit of code from the Image::Info distribution that reads the header of a Windows bitmap (BMP) file. It reads 54 bytes from the source and unpacks them as a series of shorts and longs, choosing a template based on whether this perl supports little-endian pack modifiers. Further on, the subroutine uses even more unpacks:
package Image::Info::BMP;

use constant _CAN_LITTLE_ENDIAN_PACK => $] >= 5.009002;

sub process_file {
    my($info, $source, $opts) = @_;
    my(@comments, @warnings, @header, %info, $buf, $total);

    read($source, $buf, 54) == 54
        or die "Can't reread BMP header: $!";
    @header = unpack(
        (_CAN_LITTLE_ENDIAN_PACK
            ? "vVv2V2Vl<v2V2V2V2"
            : "vVv2V2V2v2V2V2V2"),
        $buf);
    $total += length($buf);
    ...;
}
With almost no effort I can serialize Perl data structures as (mostly) human-readable text. The Data::Dumper module, which comes with Perl, turns its arguments into Perl source code in a way that I can later turn back into the original data. I give its Dumper function a list of references to stringify:
#!/usr/bin/perl
# data-dumper.pl
use Data::Dumper qw(Dumper);

my %hash = qw( Fred Flintstone Barney Rubble );
my @array = qw(Fred Barney Betty Wilma);

print Dumper( \%hash, \@array );
The program outputs text that represents the data structures as Perl code:
$VAR1 = {
          'Barney' => 'Rubble',
          'Fred' => 'Flintstone'
        };
$VAR2 = [
          'Fred',
          'Barney',
          'Betty',
          'Wilma'
        ];
I have to remember to pass it references to hashes or arrays; otherwise, Perl passes Dumper a flattened list of the elements and Dumper won't be able to preserve the data structures. If I don't like the variable names, I can specify my own. I give Data::Dumper->new an anonymous array of the references to dump and a second anonymous array of the names to use for them:
#!/usr/bin/perl
# data-dumper-named.pl
use Data::Dumper qw(Dumper);

my %hash = qw( Fred Flintstone Barney Rubble );
my @array = qw(Fred Barney Betty Wilma);

my $dd = Data::Dumper->new(
    [ \%hash, \@array ],
    [ qw(hash array) ]
    );

print $dd->Dump;
I can then call the Dump method on the object to get the stringified version. Now my references have the names I gave them:
$hash = {
          'Barney' => 'Rubble',
          'Fred' => 'Flintstone'
        };
$array = [
           'Fred',
           'Barney',
           'Betty',
           'Wilma'
         ];
The stringified version isn't the same as what I had in the program, though. I had a hash and an array before, but now I have scalars that hold references to those data types. If I prefix my names with an asterisk in my call to Data::Dumper->new, Data::Dumper stringifies the data with the right names and types:
my $dd = Data::Dumper->new(
    [ \%hash, \@array ],
    [ qw(*hash *array) ]
    );
The stringified version no longer has references:
%hash = (
          'Barney' => 'Rubble',
          'Fred' => 'Flintstone'
        );
@array = (
           'Fred',
           'Barney',
           'Betty',
           'Wilma'
         );
I can then read these stringified data back into the program or even send them to another program. It's already Perl code, so I can use the string form of eval to run it. I've saved the previous output in data-dumped.txt, and now I want to load it into my program. By using eval in its string form, I execute its argument in the same lexical scope. In my program I define %hash and @array as lexical variables but don't assign anything to them. Those variables get their values through the eval, and strict has no reason to complain:
#!/usr/bin/perl
# data-dumper-reload.pl
use strict;

my $data = do {
    if( open my $fh, '<', 'data-dumped.txt' ) { local $/; <$fh> }
    else { undef }
    };

my %hash;
my @array;

eval $data;

print "Fred's last name is $hash{Fred}\n";
Since I dumped the variables to a file, I can also use do. We covered this partially in Intermediate Perl, although in the context of loading subroutines from other files. We advised against it then because require or use work better for that. In this case we're reloading data, and the do built-in has some advantages over eval. For this task, do takes a filename, and it can search through the directories in @INC to find that file. When it finds the file, it updates %INC with the path to it. This is almost the same as require, but do will reparse the file every time, whereas require or use only do that the first time. They both set %INC so they know when they've already seen the file and don't need to do it again. Unlike require or use, do doesn't mind returning a false value, either. If do can't find the file, it returns undef and sets $! with the error message. If it finds the file but can't read or parse it, it returns undef and sets $@. I modify my previous program to use do:
#!/usr/bin/perl
# data-dumper-reload-do.pl
use strict;
use Data::Dumper;

my $file = "data-dumped.txt";

print "Before do, \$INC{$file} is [$INC{$file}]\n";

{
no strict 'vars';
do $file;

print "After do, \$INC{$file} is [$INC{$file}]\n";
print "Fred's last name is $hash{Fred}\n";
}
When I use do, I lose out on one important feature of eval. Since eval executes the code in the current context, it can see the lexical variables that are in scope. Since do can't do that, it's not strict safe and it can't populate lexical variables.
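If I still want strict in force, one workaround (a minimal sketch of my own, not from the chapter) is to declare package variables with our. The code that do compiles runs outside my lexical scope, but it lands in the same package, so package variables give it somewhere visible to put the data:

#!/usr/bin/perl
# data-dumper-reload-do-our.pl (hypothetical name)
use strict;
use warnings;

# Package variables are visible to the code that do() compiles,
# so I don't need a no strict 'vars' block.
our( %hash, @array );

my $result = do 'data-dumped.txt';
die "Could not reload data: $@ $!" unless defined $result;

print "Fred's last name is $hash{Fred}\n";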
I find the dumping method especially handy when I want to pass around data in email. One program, such as a CGI program, collects the data for me to process later. I could stringify the data into some format and write code to parse that later, but it's much easier to use Data::Dumper, which can also handle objects. I use my Business::ISBN module to parse a book number, then use Data::Dumper to stringify the object so I can use it in another program. I save the dump in isbn-dumped.txt:
#!/usr/bin/perl
# data-dumper-object.pl
use Business::ISBN;
use Data::Dumper;

my $isbn = Business::ISBN->new( '0596102062' );

my $dd = Data::Dumper->new(
    [ $isbn ],
    [ qw(isbn) ]
    );

open my $fh, '>', 'isbn-dumped.txt'
    or die "Could not save ISBN: $!";

print $fh $dd->Dump();
When I read the object back into a program, it's like it's been there all along, since Data::Dumper outputs the data inside a call to bless:
$isbn = bless( {
                 'country' => 'English',
                 'country_code' => '0',
                 'publisher_code' => 596,
                 'valid' => 1,
                 'checksum' => '2',
                 'positions' => [
                                  9,
                                  4,
                                  1
                                ],
                 'isbn' => '0596102062',
                 'article_code' => '10206'
               }, 'Business::ISBN' );
I don't need to do anything special to make it an object, but I still need to load the appropriate module to be able to call methods on it. Just because I can bless something into a package doesn't mean that package exists or has anything in it:
#!/usr/bin/perl
# data-dumper-object-reload.pl
use Business::ISBN;

my $data = do {
    if( open my $fh, '<', 'isbn-dumped.txt' ) { local $/; <$fh> }
    else { undef }
    };

my $isbn;
eval $data;

# Add your own error handling

print "The ISBN is ", $isbn->as_string, "\n";
The Data::Dumper module might not be enough for every task, and there are several other modules on CPAN that do the same job a bit differently. The concept is the same: turn data into text and later turn the text back into data. I can try to dump an anonymous subroutine with Data::Dumper:
use Data::Dumper;

my $closure = do {
    my $n = 10;
    sub { return $n++ }
    };

print Dumper( $closure );
I don't get back anything useful, though. Data::Dumper knows it's a subroutine, but it can't say what it does:
$VAR1 = sub { "DUMMY" };
The Data::Dump::Streamer module can handle this situation to a limited extent:
use Data::Dump::Streamer;

my $closure = do {
    my $n = 10;
    sub { return $n++ }
    };

print Dump( $closure );
Since Data::Dump::Streamer serializes all of the code references in the same scope, all of the variables to which they refer show up in the same scope. There are some ways around that, but they may not always work:
my ($n);
$n = 10;
$CODE1 = sub {
           return $n++;
         };
If I don't like the variables Data::Dumper has to create, I might want to use Data::Dump, which simply creates the data:
#!/usr/bin/perl
use Business::ISBN;
use Data::Dump qw(dump);

my $isbn = Business::ISBN->new( '144939311X' );

print dump( $isbn );
The output is almost just like that from Data::Dumper, although it is missing the $VARn stuff:
bless({
  article_code   => 9311,
  checksum       => "X",
  common_data    => "144939311X",
  group_code     => 1,
  input_isbn     => "144939311X",
  isbn           => "144939311X",
  prefix         => "",
  publisher_code => 4493,
  type           => "ISBN10",
  valid          => 1,
}, "Business::ISBN10")
When I eval this, I won't create any variables. I have to store the result of the eval to use the data. The only way to get back my object is to assign the result of eval to $isbn:
#!/usr/bin/perl
# data-dump-reload.pl
use Business::ISBN;

my $data = do {
    if( open my $fh, '<', 'data-dump.txt' ) { local $/; <$fh> }
    else { undef }
    };

my $isbn = eval $data;

# Add your own error handling

print "The ISBN is ", $isbn->as_string, "\n";
There are several other modules on CPAN that can dump data, so if I don’t like any of these formats I have many other options.
The Storable module is one step up from the human-readable data dumps of the last section. The output it produces might be human-decipherable, but in general it's not for human eyes. The module is mostly written in C, and part of that C exposes the architecture on which I built perl: in some cases the byte order of the data depends on the underlying architecture. On a big-endian machine I'll get different output than on a little-endian machine. I'll get around that in a moment.
The store function serializes the data and puts it in a file. Storable treats problems as exceptions (meaning it tries to die rather than recover), so I wrap the call to its functions in eval and look at the eval error variable $@ to see if something serious went wrong. More minor problems, such as output errors, don't die and return undef, so I check that too and find the error in $! if it was related to the system (for instance, I couldn't open the output file):
#!/usr/bin/perl
# storable-store.pl
use Business::ISBN;
use Storable qw(store);

my $isbn = Business::ISBN->new( '0596102062' );

my $result = eval { store( $isbn, 'isbn-stored.dat' ) };

if( defined $@ and length $@ ) { warn "Serious error from Storable: $@" }
elsif( not defined $result )   { warn "I/O error from Storable: $!" }
When I want to reload the data I use retrieve. As with store, I wrap my call in eval to catch any errors. I also add another check in my if structure to ensure I got back what I expected, in this case a Business::ISBN object:
#!/usr/bin/perl
# storable-retrieve.pl
use Business::ISBN;
use Storable qw(retrieve);

my $isbn = eval { retrieve( 'isbn-stored.dat' ) };

if( defined $@ and length $@ )
    { warn "Serious error from Storable: $@" }
elsif( not defined $isbn )
    { warn "I/O error from Storable: $!" }
elsif( not eval { $isbn->isa( 'Business::ISBN' ) } )
    { warn "Didn't get back Business::ISBN object\n" }

print "I loaded the ISBN ", $isbn->as_string, "\n";
To get around this machine-dependent format, Storable can use network order, which is architecture-independent and is converted to the local order as appropriate. For that, Storable provides the same function names with a prepended n. Thus, to store the data in network order, I use nstore. The retrieve function figures it out on its own, so there is no nretrieve function. In this example, I also use Storable's functions to write directly to filehandles instead of a filename. Those functions have fd in their names:
my $result = eval { nstore( $isbn, 'isbn-stored.dat' ) };

open my $fh, '>', $file or die "Could not open $file: $!";
my $result = eval { nstore_fd $isbn, $fh };

my $result = eval { nstore_fd $isbn, \*STDOUT };
my $result = eval { nstore_fd $isbn, \*SOCKET };

$isbn = eval { fd_retrieve(\*SOCKET) };
Now that you've seen filehandle references as arguments to Storable's functions, I need to mention that it's the data from those filehandles that Storable affects, not the filehandles themselves. I can't use these functions to capture the state of a filehandle or socket that I can magically use later. That just doesn't work, no matter how many people ask about it on mailing lists.
The Storable module can also freeze data into a scalar. I don't have to store it in a file or send it to a filehandle; I can keep it in memory, although serialized. I might store that in a database or do something else with it. To turn it back into a data structure, I use thaw:
#!/usr/bin/perl
# storable-thaw.pl
use Business::ISBN;
use Data::Dumper;
use Storable qw(nfreeze thaw);

my $isbn = Business::ISBN->new( '0596102062' );

my $frozen = eval { nfreeze( $isbn ) };
if( $@ ) { warn "Serious error from Storable: $@" }

my $other_isbn = thaw( $frozen );
# XXX: error handling

print "The ISBN is ", $other_isbn->as_string, "\n";
This has an interesting use. Once I serialize the data it’s completely disconnected from the variables in which I was storing it. All of the data are copied and represented in the serialization. When I thaw it, the data come back into a completely new data structure that knows nothing about the previous data structure.
Before I show this copying, I'll show a shallow copy, in which I copy the top level of the data structure but the lower levels are the same references. This is a common error in copying data: I think I have distinct copies, only to discover later that a change to the copy also changes the original.
I'll start with an anonymous array that comprises two other anonymous arrays. I want to look at the second value in the second anonymous array, which starts as Y. I look at that value in the original and the copy, before and after I make a change in the copy. I make the shallow copy by dereferencing $AoA and using its elements in a new anonymous array. Again, this is the naïve approach, but I've seen it quite a bit and have probably even done it myself a couple or fifty times:
#!/usr/bin/perl
# shallow-copy.pl

my $AoA = [
    [ qw( a b ) ],
    [ qw( X Y ) ],
    ];

# make the shallow copy
my $shallow_copy = [ @$AoA ];

# Check the state of the world before changes
show_arrays( $AoA, $shallow_copy );

# Now, change the shallow_copy
$shallow_copy->[1][1] = "Foo";

# Check the state of the world after changes
show_arrays( $AoA, $shallow_copy );
print "\nOriginal: $AoA->[1]\nCopy: $shallow_copy->[1]\n";

sub show_arrays {
    foreach my $ref ( @_ ) {
        print "Element [1,1] is $ref->[1][1]\n";
        }
    }
When I run the program, I see from the output that the change to $shallow_copy also changes $AoA. When I print the stringified version of the reference for the corresponding elements in each array, I see that they are actually references to the same data:
Element [1,1] is Y
Element [1,1] is Y
Element [1,1] is Foo
Element [1,1] is Foo

Original: ARRAY(0x790c9320)
Copy: ARRAY(0x790c9320)
To get around the shallow copy problem, I can make a deep copy by freezing and immediately thawing, and I don't have to do any work to figure out the data structure. Once the data are frozen, they no longer have any connection to the source. I use nfreeze to get the data in network order, just in case I want to send it to another machine:
use Storable qw(nfreeze thaw);

my $deep_copy = thaw( nfreeze( $isbn ) );
This is so useful that Storable provides the dclone function to do it in one step:
use Storable qw(dclone);

my $deep_copy = dclone $isbn;
Storable is much more interesting and useful than I've shown in this section. It can also handle file locking and has hooks to integrate it with classes so I can use its features for my objects. See the Storable documentation for more details.
The Clone::Any module by Matthew Simon Cavalletto provides the same functionality through a facade to several different modules that can make deep copies. With Clone::Any's unifying interface, I don't have to worry about which module I actually use or which is installed on a remote system (as long as one of them is):
use Clone::Any qw(clone);

my $deep_copy = clone( $isbn );
Storable has a couple of huge security problems related to Perl's (and Perl programmers'!) trusting nature.
If you look in Storable.xs, you'll find a couple of instances of a call to load_module. Depending on what you're trying to deserialize, Storable might load a module without you explicitly asking for it. When Perl loads a module, it can run code right away. I know a file with the right name loads, but I don't know if it's the code I intend.
My Perl module can define serialization hooks that replace the default behavior of Storable with my own. I can serialize the object myself and give Storable the octets it should store. Along with that, I can take the octets from Storable and recreate the object myself. Perhaps I want to reconnect to a resource as I rehydrate the object.
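As a rough sketch of how those hooks look (the class, its fields, and the reconnection step are my own invention; check the Storable documentation for the exact calling conventions):

package My::Resource;

# STORABLE_freeze returns the octets Storable should record for this
# object; here I save only the hostname and skip the live connection,
# which can't be serialized.
sub STORABLE_freeze {
    my( $self, $cloning ) = @_;
    return $self->{host};
    }

# STORABLE_thaw receives an empty, already-blessed object and the
# octets from STORABLE_freeze, then rebuilds the object, reconnecting
# to the resource as it goes. _connect is a stand-in for whatever
# re-establishes the resource.
sub STORABLE_thaw {
    my( $self, $cloning, $serialized ) = @_;
    $self->{host}   = $serialized;
    $self->{handle} = My::Resource->_connect( $self->{host} );
    return;
    }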
Storable notes the presence of a hook with a flag set in the serialization string. As it deserializes and notices that flag, it loads that module to look for the corresponding STORABLE_thaw method.
The same thing happens for classes that overload operators. Storable sets a flag, and when it notices that flag it loads the overload module too. It might need it when it recreates the objects.
It doesn't really matter whether a module actually defines hooks or uses overload. The only thing that matters is that the serialized data sets those flags. If I store my data through the approved interface, bugs aside, I should be fine. If I want to trick Storable, though, I can make my own data and set whatever bits I like. If I can get you to load a module, I'm one step closer to taking over your program.
I can also construct a special hash serialization that tricks perl into running a method. If I use the approved interface to serialize a hash, I know that a key is unique and will only appear in the serialization once.
Again, I can muck with the serialization myself to construct something that Storable would not make itself. I can make a Storable string that repeats a key in a hash:
#!/usr/bin/perl
# storable-dupe-key.pl
use v5.14;

use Storable qw(freeze thaw);
use Data::Printer;

say "Storable version ", Storable->VERSION;

package Foo {
    sub DESTROY { say "DESTROY while exiting for ${$_[0]}" };
    }

my $data;

my $frozen = do {
    my $pristine = do {
        local *Foo::DESTROY = sub {
            say "DESTROY while freezing for ${$_[0]}" };
        $data = {
            'key1' => bless( \ (my $n = 'abc'), 'Foo' ),
            'key2' => bless( \ (my $m = '123'), 'Foo' ),
            };
        say "Saving...";
        freeze( $data );
        };

    $pristine =~ s/key2/key1/r;
    };

my $unfrozen = do {
    say "Retrieving...";
    local *Foo::DESTROY = sub {
        say "DESTROY while inflating for ${$_[0]}" };
    thaw( $frozen );
    };

say "Done retrieving, showing hash...";
p $unfrozen;
say "Exiting next...";
In the first do block, I create a hash with key1 and key2, both of which point to scalar references I've blessed into Foo. I freeze that and immediately change the serialization to replace the literal key2 with key1. I can do that because I know things about the serialization and how the keys show up in it. The munged version ends up in $frozen.
When I want to thaw that string, I create a local version of DESTROY to watch what happens. In the output I see that while inflating, Storable handles one instance of key1, then destroys it when it handles the next one. At the end I have a single key in the hash:
% perl storable-dupe-key.pl
Storable version 2.41
Saving...
Retrieving...
DESTROY while inflating for abc
Done retrieving, showing hash...
Exiting next...
DESTROY while exiting for 123
DESTROY while exiting for 123
DESTROY while exiting for abc
\ {
    key1   Foo  {
        public methods (1) : DESTROY
        private methods (0)
        internals: 123
    }
}
If I haven't already forced Storable to load a module, I might be able to use the DESTROY method from a class that I know is already loaded. One candidate is the core module CGI.pm, which includes the CGITempFile class that tries to unlink a file when it cleans up an object:
sub DESTROY {
    my($self) = @_;
    $$self =~ m!^([a-zA-Z0-9_ \'\":/.\$\\~-]+)$! || return;
    my $safe = $1;             # untaint operation
    unlink $safe;              # get rid of the file
}
Although this method untaints the filename, remember that taint checking is not a prophylactic; it's a development tool. Untainting whatever I put in $$self isn't going to stop me from deleting a file, including, perhaps, the Storable file I used to deliver my malicious payload.
Booking.com, one of Perl's big supporters, developed a new serialization format that doesn't have some of Storable's problems. They still wanted to save some special Perl features in their format, including references, aliases, objects, and regular expressions. And although it started in Perl, they've set it up so it doesn't have to stay there. Best of all, they made it really fast.
The module use is simple. The encoders and decoders are separate, by design, but I only need to load Sereal to get Sereal::Encoder:
use Sereal;

my $data = ...;

my $encoder = Sereal::Encoder->new;
my $serealized = $encoder->encode( $data );
To go the other way, I use Sereal::Decoder:
my $decoder = Sereal::Decoder->new;
my $unserealized = $decoder->decode( $serealized );
In the previous section, I showed that Storable had a problem with duplicated keys. Here's the same program, but with Sereal:
#!/usr/bin/perl
# sereal-bad-key.pl
use v5.14;

use Sereal;
use Data::Printer;

say "Sereal version ", Sereal->VERSION;

package Foo {
    sub DESTROY { say "DESTROY while exiting for ${$_[0]}" };
    }

my $data;

my $frozen = do {
    my $pristine = do {
        local *Foo::DESTROY = sub {
            say "DESTROY while freezing for ${$_[0]}" };
        $data = {
            'key1' => bless( \ (my $n = 'abc'), 'Foo' ),
            'key2' => bless( \ (my $m = '123'), 'Foo' ),
            };
        say "Saving...";
        Sereal::Encoder->new->encode( $data );
        };

    $pristine =~ s/key2/key1/r;
    };

my $unfrozen = do {
    say "Retrieving...";
    local *Foo::DESTROY = sub {
        say "DESTROY while inflating for ${$_[0]}" };
    Sereal::Decoder->new->decode( $frozen );
    };

say "Done retrieving, showing hash...";
p $unfrozen;
say "Exiting next...";
The output shows that the DESTROY during the inflation isn't triggered. I wouldn't be able to trick CGITempFile into deleting a file like I could with Storable. Also, since Sereal doesn't support special per-class serialization and deserialization hooks, I won't be able to trick it into loading classes or running code.
Sereal, unlike the other serializers I have shown so far, makes a deliberate and conscious effort to create a small string. Imagine a data structure that is an array of hashes, where the keys in each hash are the same and only the values are different, along the lines of this illustrative structure (the particular fields are made up for the example):
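# each hash repeats the same keys; only the values change
my $records = [
    { name => 'Amelia', species => 'camel', born => 1991 },
    { name => 'Slomo',  species => 'camel', born => 1998 },
    { name => 'Wilma',  species => 'camel', born => 2002 },
    ];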
The Sereal specification includes a way for a later hash to reuse the string already stored for that key, so it doesn't have to store it again the way JSON or Storable would. I wrote a tiny benchmark to try this, comparing Data::Dumper, JSON, Storable, and Sereal::Encoder, using the defaults for each:
#!/usr/bin/perl
use v5.18;

use Data::Dumper qw(Dumper);
use Storable qw(nstore_fd dclone);
use Sereal::Encoder qw(encode_sereal);
use JSON qw(encode_json);

my $stores = {
    dumper   => sub { Dumper( $_[0] ) },
    jsoner   => sub { encode_json( $_[0] ) },
    serealer => sub { encode_sereal( $_[0] ) },
    storer   => sub {
        open my $sfh, '>:raw', \ my $string;
        nstore_fd( $_[0], $sfh );
        close $sfh;
        $string;
        },
    };

my $max_hash_count = 10;

my $hash;
my @keys   = get_keys();
my @values = get_values();
@{$hash}{ @keys } = @values;

my %lengths;
my @max;

foreach my $hash_count ( 1 .. $max_hash_count ) {
    my $data = [ map { dclone $hash } 1 .. $hash_count ];

    foreach my $type ( sort keys %$stores ) {
        my $string = $stores->{$type}( $data );
        my $length = length $string;

        $max[$hash_count] = $length if $length > $max[$hash_count];
        $max[0]           = $length if $length > $max[0]; # grand max

        if( 0 == $length ) {
            warn "$type: length is zero!\n";
            }

        push @{$lengths{$type}}, $length;
        }
    }

###########
# make a tab separated report with the normalized numbers
# for each method, in columns suitable for a spreadsheet
say join "\t", sort keys %$stores;

open my $per_fh,   '>:utf8', "$0-per.tsv"   or die "$!";
open my $grand_fh, '>:utf8', "$0-grand.tsv" or die "$!";

foreach my $index ( 1 .. $max_hash_count ) {
    say { $per_fh } join( "\t",
        map { $lengths{$_}[$index - 1] / $max[$index] } sort keys %$stores
        );
    say { $grand_fh } join( "\t",
        map { $lengths{$_}[$index - 1] / $max[0] } sort keys %$stores
        );
    }

# make some long keys
sub get_keys   { map { $0 . time() . $_ . $$ } ( 'a' .. 'f' ); }

# make some long values
sub get_values { map { $0 . time() . $_ . $$ } ( 'f' .. 'k' ); }
This program serializes an array of hashes, starting with an array that has one hash and going up to an array with ten hashes, all of them exactly the same but not references to each other (hence the dclone). This way, the keys and values for each hash should be repeated in the serialization.
To make the relative measures a bit easier to see, I keep track of the maximum string length for all serializations (the grand) and the per-hash-count maximum. I use those to normalize the numbers to create two graphs of the same data.
The first plot uses the grand normalization and shows linear growth in each serialization. For size, Data::Dumper does the worst, with JSON and Storable doing slightly better, mostly because they use much less whitespace. The size of the Sereal strings grows much more slowly.
Many people would be satisfied with that plot, but I like the one from the per normalization, where the numbers are normalized to the maximum string size for the same hash count. The Data::Dumper size is always the largest, so it always normalizes to exactly 1. JSON and Storable still normalize to almost the same number (to two decimal places) and look like a straight line. The Sereal curve is more interesting: it starts at the same point as JSON and Storable for one hash, when every serialization has to store the keys and values at least once, then drops dramatically and continues to drop, although more slowly, as the number of hashes increases.
But, as I explained in Chapter 6, all benchmarks have caveats. I've chosen a particular use case for this, but that does not mean you would see the same thing for another problem. If all the hashes had unique keys that no other hash stored, I expect that Sereal wouldn't show such significant space savings.
The next step up from Storable is tiny, lightweight databases. These don't require a database server but still handle most of the work of making the data available to my program. There are several facilities for this, but I'm only going to cover a couple of them. The concept is the same even if the interfaces and fine details are different.
Since at least Perl 3, I've been able to connect to DBM files, which are hashes stored on disk. In the early days of Perl, when the language and practice were much more unix-centric, DBM access was important since many system databases used that format. The DBM was a simple hash where I could specify a key and a value. I use dbmopen to connect a hash to the disk file, then use it like a normal hash. dbmclose ensures that all of my changes make it to the disk:
#!/usr/bin/perl
# dbmopen.pl

dbmopen %HASH, "dbm-open", 0644;

$HASH{'0596102062'} = 'Intermediate Perl';

while( my( $key, $value ) = each %HASH ) {
    print "$key: $value\n";
    }

dbmclose %HASH;
In modern Perl the situation is much more complicated. The DBM format branched off into several competing formats, each of which had their own strengths and peculiarities. Some could only store values shorter than a certain length, or only store a certain number of keys, and so on.
Depending on the compilation options of the local perl binary, I might be using any of these implementations. That means that although I can safely use dbmopen on the same machine, I might have trouble sharing the file between machines, since the next machine might have used a different DBM library.
None of this really matters because CPAN has something much better.
Much more popular today is DBM::Deep, which I use anywhere I would have previously used one of the other DBM formats. With this module, I can create arbitrarily deep, multilevel hashes or arrays. The module is pure Perl, so I don't have to worry about different library implementations, underlying details, and so on. As long as I have Perl, I have everything I need. It works without worry on a Mac, Windows, or unix, any of which can share DBM::Deep files with any of the others.
Joe Huckaby created DBM::Deep with both an object-oriented interface and a tie interface (see Chapter 17). The documentation recommends the object interface, so I'll stick to that here. With a single argument, the constructor uses it as a filename, creating the file if it does not already exist:
use DBM::Deep;

my $isbns = DBM::Deep->new( "isbns.db" );
if( $isbns->error ) {
    warn 'Could not create database: ' . $isbns->error . "\n";
    }

$isbns->{'1449393098'} = 'Intermediate Perl';
Once I have the DBM::Deep object, I can treat it just like a hash reference and use all of the hash operators. Additionally, I can call methods on the object to do the same thing. I can even set additional features, such as file locking and flushing, when I create the object:
#!/usr/bin/perl
use DBM::Deep;

my $isbns = DBM::Deep->new(
    file      => "isbn.db",
    locking   => 1,
    autoflush => 1,
    );
if( $isbns->error ) {
    warn 'Could not create database: ' . $isbns->error . "\n";
    }

$isbns->put( '1449393098', 'Intermediate Perl' );

my $value = $isbns->get( '1449393098' );
The module also handles objects based on arrays, which have their own set of methods. It has hooks into its inner mechanisms so I can define how it does its work.
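For instance, an array-backed database might look something like this rough sketch (the type argument and the array methods are my reading of the DBM::Deep documentation, so check them against the version you have):

use DBM::Deep;

# type => DBM::Deep->TYPE_ARRAY makes the top-level object behave
# like an array reference instead of a hash reference
my $books = DBM::Deep->new(
    file => "books.db",
    type => DBM::Deep->TYPE_ARRAY,
    );

$books->[0] = 'Learning Perl';
$books->[1] = 'Intermediate Perl';
$books->push( 'Mastering Perl' );

print 'There are ', $books->length, " books\n";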
By the time you read this book, DBM::Deep should already have transaction support, thanks to the work of Rob Kinyon, its current maintainer. I can create my object, then use the begin_work method to start a transaction. Once I do that, nothing happens to the data until I call commit, which writes all of my changes to the data. If something goes wrong, I just call rollback to get back to where I was when I started:
my $db = DBM::Deep->new( 'file.db' );

eval {
    $db->begin_work;

    ...;

    die q(Something didn't work) if $error;

    $db->commit;
    };

if( defined $@ and length $@ ) {
    $db->rollback;
    }
So far in this chapter I’ve used formats that are specific to Perl. Sometimes that works out, but more likely I’ll want something that I can exchange with other languages so I don’t lock myself into a particular language or tool. If my format doesn’t care about the language, I’ll have an easier time building compatible systems and integrating or switching technologies later.
In this section, I’ll show some other formats and how to work with them in Perl, but I’m not going to give a tutorial for each of them. My intent is to survey what’s out there and give you an idea when you might use them.
JavaScript Object Notation, or JSON, is a very attractive format for data interchange because I can have my Perl (or Ruby or Python or whatever) program send the data as part of a web request so a browser can use it easily and immediately. The format is actually valid JavaScript code; it's technically language-specific for that reason, but the value of a format a web browser understands is so high that most mainstream languages already have libraries for it.
A JSON data structure looks similar to a Perl data structure, although much simpler. Instead of => there's a :, and strings are double-quoted:
{
   "meta" : {
      "established" : 1991,
      "license" : "416d656c6961"
   },
   "source" : "Larry's Camel Clinic",
   "camels" : [
      "Amelia",
      "Slomo"
   ]
}
I created that JSON data with a tiny program. I started with a Perl data structure and turned it into the JSON form:
#!/usr/bin/perl
# json-data.pl
use v5.10.1;

use JSON;

my $hash = {
    camels => [ qw(Amelia Slomo) ],
    source => "Larry's Camel Clinic",
    meta   => {
        license     => '416d656c6961',
        established => 1991,
        },
    };

say JSON->new->pretty->encode( $hash );
To load that data into my Perl program, I need to decode it. Although the JSON specification allows for several Unicode encodings, the JSON module only handles UTF-8 text. I have to read that as raw octets, though:
#!/usr/bin/perl
# read-json.pl
use v5.10;

use JSON;

my $json = do {
    local $/;
    open my $fh, '<:raw', '/Users/Amelia/Desktop/sample.json';
    <$fh>;
    };

my $perl = JSON->new->decode( $json );

say "Camels are [ @{ $perl->{camels} } ]";
Going the other way is much easier. I give the module a data structure and get back the result as JSON:
#!/usr/local/perls/perl-5.18.1/bin/perl
# simple-json.pl
use v5.10;

use JSON;

my $hash = {
    camels => [ qw(Amelia Slomo) ],
    source => "Larry's Camel Clinic",
    meta   => {
        license     => '416d656c6961',
        established => 1991,
        },
    };

say JSON->new->encode( $hash );
The output is compact with minimal whitespace. If machines are exchanging data, they don’t need the extra characters:
{"camels":["Amelia","Slomo"],"meta":{"license":"416d656c6961","established":1991},"source":"Larry's Camel Clinic"}
The module has many options for the output to specify the encoding, the style, and other things I might want to control. If I'm sending my data to a web browser, I probably don't care if the output is easy for me to read. However, if I want to be able to read it easily, I can use the pretty option, as I did in my first example:
say JSON->new->pretty->encode( $hash );
The JSON module lists other options you might need. Read its documentation to see what else you can do.
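As one illustration (these particular settings are my own choice, not the chapter's): canonical sorts the hash keys so the output is stable between runs, and utf8 makes encode return UTF-8 octets and decode expect them:

use JSON;

# chainable option methods configure the encoder/decoder object
my $json = JSON->new->canonical->utf8;

my $octets = $json->encode( { camels => [ qw(Amelia Slomo) ] } );
my $perl   = $json->decode( $octets );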
CPAN has other JSON implementations, such as JSON::Syck. This is based on libsyck, a YAML parser (read the next section). Since some YAML parsers have some of the same problems that Storable has[6], you might want to avoid parsing JSON, which shouldn't create objects, with a parser that can.
YAML (YAML Ain't Markup Language) has the same idea as Data::Dumper, although it's more concise and easier to read. The YAML 1.2 spec (http://www.yaml.org/spec/1.2/spec.html) says, “There are hundreds of different languages for programming, but only a handful of languages for storing and transferring data.” That is, YAML aims to be much more than serialization.
YAML was popular in the Perl community when I wrote the first edition of this book, but JSON has largely eaten its lunch. Still, some parts of the Perl toolchain use it, and it does have some advantages over JSON. The META.yml file produced by various module distribution creation tools is YAML.
I write to a file that I give the extension .yml:
#!/usr/bin/perl
# yaml-dump.pl
use Business::ISBN;
use YAML qw(Dump);

my %hash = qw( Fred Flintstone Barney Rubble );
my @array = qw(Fred Barney Betty Wilma);
my $isbn = Business::ISBN->new( '144939311X' );

open my $fh, '>', 'dump.yml'
    or die "Could not write to file: $!\n";

print $fh Dump( \%hash, \@array, $isbn );
The output for the data structures is very compact, although still readable once I understand its format. To get the data back, I don't have to go through the shenanigans I experienced with Data::Dumper:
---
Barney: Rubble
Fred: Flintstone
---
- Fred
- Barney
- Betty
- Wilma
--- !!perl/hash:Business::ISBN10
article_code: 9311
checksum: X
common_data: 144939311X
group_code: 1
input_isbn: 144939311X
isbn: 144939311X
prefix: ''
publisher_code: 4493
type: ISBN10
valid: 1
YAML can preserve Perl data structures and objects because it has a way to label things (which is basically how Perl blesses a reference). This is something I couldn’t get (and don’t want) with plain JSON.
The YAML module provides a Load function to do it for me, although the basic concept is the same. I read the data from the file and pass the text to Load:
#!/usr/bin/perl
# yaml-load.pl
use Business::ISBN;
use YAML;

my $data = do {
    if( open my $fh, '<', 'dump.yml' ) { local $/; <$fh> }
    else { undef }
    };

my( $hash, $array, $isbn ) = Load( $data );

print "The ISBN is ", $isbn->as_string, "\n";
YAML isn't part of the standard Perl distribution, and it relies on several other noncore modules as well. Since it can create Perl objects, it has some of the same problems as Storable.
YAML has three common versions, and they aren’t necessarily compatible with each other. Parsers (and writers) target particular versions, which means that I’m likely to have a problem if I create a YAML file in one version and try to parse it as another.
YAML 1.0 allows unquoted dashes, -, as data, but YAML 1.1 and later do not. This caused problems for me when I created many files with an older dumper and tried to use a newer parser. The YAML::Syck module, based on libsyck, handles YAML 1.0 but not YAML 1.1.
YAML::LibYAML includes YAML::XS. Kirill Siminov's libyaml is arguably the best YAML implementation. The C library is written precisely to the YAML 1.1 specification. It was originally bound to Python and was later bound to Ruby. For most things, I stick with YAML::XS.
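Since YAML::XS provides the same basic Dump and Load functions, switching from YAML is mostly a matter of changing the use line. Here's a minimal sketch with plain data (by default, recent YAML::XS versions won't bless objects on load, so I leave objects out of it):

use YAML::XS qw(Dump Load);

my %hash  = qw( Fred Flintstone Barney Rubble );
my @array = qw( Fred Barney Betty Wilma );

# Dump and Load work much as they do in YAML.pm
my $yaml = Dump( \%hash, \@array );
my( $hash, $array ) = Load( $yaml );

print "Fred's last name is $hash->{Fred}\n";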
YAML::Tiny handles a subset of YAML 1.1 in pure Perl. Like the other ::Tiny modules, YAML::Tiny has no noncore dependencies, does not require a compiler to install, is backward compatible to Perl 5.004, and can be inlined into other modules if needed. If you aren't doing anything tricky, want a very small footprint, or want minimal dependencies, this module might be for you.
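As a quick sketch of its document-oriented interface (my own example; each YAML::Tiny object holds a list of documents):

use YAML::Tiny;

# create a YAML::Tiny object holding one document and write it out
my $yaml = YAML::Tiny->new( { Fred => 'Flintstone', Barney => 'Rubble' } );
$yaml->write( 'tiny.yml' );

# read it back; document 0 is the hash I stored
my $reloaded = YAML::Tiny->read( 'tiny.yml' );
print "Fred's last name is $reloaded->[0]{Fred}\n";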
The MessagePack format is similar to JSON, but smaller and faster. It's a binary serialization format (so it can be much smaller) that has implementations in most of the mainstream languages. It's like a cross-platform pack that's also smarter. The Data::MessagePack module handles it:
#!/usr/bin/perl
# message_pack.pl
use v5.10;

use Data::MessagePack;
use Data::Dumper;

my %hash = qw(
    Fred   Flintstone
    Barney Rubble
    Key    12345
    );

my $mp = Data::MessagePack->new;
$mp->canonical->utf8->prefer_integer if $needed;

my $packed = $mp->pack( \%hash );
say 'Length of packed is ', length $packed;

say Dumper( $mp->unpack( $packed ) );
The Data::MessagePack module comes with some benchmark programs (although remember what I wrote in Chapter 6):
% perl benchmark/deserialize.pl
-- deserialize
JSON::XS: 2.34
Data::MessagePack: 0.47
Storable: 2.41
              Rate storable     json       mp
storable   64577/s       --     -21%     -45%
json       81920/s      27%       --     -30%
mp        117108/s      81%      43%       --

% perl benchmark/serialize.pl
-- serialize
JSON::XS: 2.34
Data::MessagePack: 0.47
Storable: 2.41
              Rate storable     json       mp
storable   91897/s       --     -22%     -50%
json      118154/s      29%       --     -35%
mp        182044/s      98%      54%       --
By stringifying Perl data I have a lightweight way to pass data between invocations of a program and even between different programs. Slightly more complicated are binary formats, although Perl comes with the modules to handle that too. No matter which one I choose, I have some options before I decide that I have to move up to a full database server.
Programming the Perl DBI by Tim Bunce and Alligator Descartes covers the Perl Database Interface (DBI). The DBI is a generic interface to most popular database servers. If you need more than I covered in this chapter, you probably need DBI. I could have covered SQLite, an extremely lightweight, single-file relational database, in this chapter, but I access it through the DBI just as I would any other database, so I left it out. It's extremely handy for quick persistence tasks, though.
The BerkeleyDB module provides an interface to the BerkeleyDB library (http://www.oracle.com/us/products/database/berkeley-db/overview/index.htm), which provides another way to store data. Its use is somewhat complex, but it is very powerful.
Alberto Simões wrote “Data::Dumper and Data::Dump::Streamer” for The Perl Review 3.1 (Winter 2006).
Vladi Belperchinov-Shabanski shows an example of Storable in “Implementing Flood Control” for Perl.com: http://www.perl.com/pub/2004/11/11/floodcontrol.html.
Randal Schwartz has some articles on persistent data: “Persistent Data”, Unix Review, February 1999, http://www.stonehenge.com/merlyn/UnixReview/col24.html; “Persistent Storage for Data”, Linux Magazine, May 2003, http://www.stonehenge.com/merlyn/LinuxMag/col48.html; and “Lightweight Persistent Data”, Unix Review, July 2004, http://www.stonehenge.com/merlyn/UnixReview/col53.html.
The JSON website explains the data format, as does RFC 4627. JavaScript: The Definitive Guide has a good section on JSON. I also like the JSON appendix in JavaScript: The Good Parts.
Randal Schwartz wrote a JSON parser in a single regex for XXX.
The YAML website has links to all the YAML projects in different languages.
There’s a set of StackOverflow answers to “Should I use YAML or JSON to store my Perl data?” which discusses the costs and benefits of YAML, JSON, and XML.
Steffen Müller writes about Booking.com's development of Sereal.
The documentation for AnyDBM_File discusses the various implementations of DBM files.