"Linux Gazette...making Linux just a little more fun!"
Large Objects and Perl DBI
If this document changes, it will be available here:
http://genericbooks.com/Literature/Articles/2
.
Index:
- Resources
- Introduction
- Using Perl DBI
- Using Large Objects and Perl
- Using the database server with a webserver.
- Comments
Resources
I assume that you have DBI and DBD::Pg installed with Perl.
- man DBI
- man DBD::Pg
-
ftp://ftp.perl.com/CPAN/modules/by-category/07_Database_Interfaces/
- A search for "large" and "dbi" on one of the postgresql
newsgroups.
- http://www.postgresql.org/doxlist.html -- A chapter on Large Objects in the Programmer's
Guide for PostgreSQL.
- Two examples using psql which explain how to use large objects.
Introduction
I work at ZING Is Not GNU. We sell and distribute free books or literature.
Well, besides selling books inexpensively, we also want people to download
the books. This creates a problem when you need to take megabytes of data
and put them into a database. I have programmed putting large amounts
of data into databases before, but only 100k or less of data per transaction.
This was the first time I needed to put really large files into a database.
My biggest problem was the fact it really isn't
documented well when you use Perl. Also, I never needed to use
Large Objects with PostgreSQL before. I have used other database servers
for Large Objects, but not PostgreSQL.
What are Large Objects? Large Objects are "things" (text or binary data) which
cannot fit into a normal field in a table. For example, a 100 megabyte file
cannot really fit into a field in a table.
What were the technologies I was using? RedHat Linux, postgresql-6.5.3,
Perl 5 (with DBI, DBD::Pg, and Pg), and apache_1.3.9.
In general, how do you use Large Objects in PostgreSQL? You can save
large objects like "files" where the database server will let you input
a file and it will give you a number in return, and when you want to
retrieve the data, you use the number to export your data to a temporary
file on your hard drive. It is kind of weird. In order to extract a Large
Object, it sort of takes two steps. First, you copy the data to a file
on your computer, and then you can read it. The problem is, from my
point of view, is that data normally is read once. Here, it is read twice,
to make the file, and then to read the file. There isn't any way around this
(as far I as know), but I am just new to Large Objects in PostgreSQL (I have
used them in other database servers before though), so perhaps there
are better ways.
Using Perl DBI
PERL is a programming language used by many web and database professionals.
It took a couple of years, but now people are finally recognizing that
free and open software, like
PERL,
PHP, and
Python
can be used for commercial companies
as an alternative to bloated, inefficient, and unstable commercial
programming languages (which most of the time you don't have source code
for, and hence, you can never be sure what you are using truly does
everything it claims to do -- there is one company I am
talking about in particular). Being a contractor, I have noticed an
explosion of Perl jobs (it pleases me greatly to see the rise of ethically
clean software), and hence, I am sure this will be useful to other contractors
out there who would prefer to use PostgreSQL over other alternatives.
Perl DBI is a generic database interface for all database
servers or files. In order to connect to a specific database, you also
need a Perl DBD driver. For example, to connect to PostgreSQL, I use Perl
DBI and Perl DBD:Pg. Here is a brief example of how I print out
the first and last names from a PostgreSQL database using the table
"people".
#!/usr/bin/perl
use DBI;
use vars qw($dbh);
### zing is the database I am connecting to.
$dbh ||= DBI->connect("dbi:Pg:dbname=zing");
my $Command = "select first_name,last_name from people
sort by last_name,first_name";
my $sth = $dbh->prepare($Command);
my $Result = $sth->execute;
while (my @row_ary = $sth->fetchrow_array)
{print "<br> $row_ary[0] $row_ary[1]\n";}
The problem with PERL DBI is that the Large Object interface isn't
fined tuned yet, and you have to use the specific DBD driver for each
type of database. The problem with DBD::Pg is that Large Objects is
not well documented, and it took me a long time looking through
newsgroups to find the answer I was looking for. Eventually, I also
did manage to find
this note on Large Objects.
Using Large Objects and Perl
ONE WARNING: The reason why temporary files are actually a good thing is
for the following reason, for big files, if you suck the data all into
memory at once, Perl running under Apache using mod_perl will use up
a lot of system memory and it won't give it back (even though Perl will
reuse the memory itself). Imagine if you have 10 people downloading
10 megabyte files and your Perl script loads the files into memory
before it prints them (instead of just printing them line-by-line). Apache
will use 100 megabytes of system memory (actually Perl will) and it won't
give it back. This can be bad. In other Database servers using Large Objects,
I could load the data directly into memory. I would not want to do that
with really large files anyways. Read this
performance guide for more info.
Okay, how do we import and export large object from PostgreSQL?
This following example comes straight from the
this
PostgreSQL documentation. This example uses the program psql.
CREATE TABLE image (
name text,
raster oid
);
INSERT INTO image (name, raster)
VALUES ('beautiful image', lo_import('/etc/motd'));
SELECT lo_export(image.raster, "/tmp/motd") from image
WHERE name = 'beautiful image';
Now, we need to convert this into perl. Here is a perl script which
would do the exact same thing.
#!/usr/bin/perl
use vars qw($dbh);
$dbh ||= DBI->connect("dbi:Pg:dbname=zing");
my $Command = "INSERT INTO image (name, raster)
VALUES ('beautiful image', lo_import('/etc/motd'));";
my $sth = $dbh->prepare($Command);
$sth->execute();
$Command = "SELECT lo_export(image.raster, '/tmp/motd') from image
WHERE name = 'beautiful image'";
$sth = $dbh->prepare($Command);
$sth->execute();
Those two examples do the following. The first command loads the file
"/etc/motd" into a table. The second command takes the data in the table
and exports it to a file called "/tmp/motd". If you want to get the "oid"
of data in the table and export it to a file, you could also do this.
## This command will return a numeric "oid" number
$Command = "SELECT raster from image WHERE name = 'beautiful image'";
$sth = $dbh->prepare($Command);
$sth->execute();
my @row_ary = $sth->fetchrow_array;
my $Oid = $row_ary[0];
## This command will export the data with the "oid" to a file
$Command = "SELECT lo_export($Oid, '/tmp/motd') from image";
$sth = $dbh->prepare($Command);
$sth->execute();
For live examples, ZING at
www.genericboosk.com
has all of its Perl Scripts available for public viewing. Look for
scripts that view documents that are extracted from the database.
ZING now has setup scripts to let people upload and download documents,
so we should have real live examples floating around on the website.
Using the database server with a webserver.
Okay, what problems can you encounter when you use a webserver to
input/output the data from a database server? Well, the biggest problem
I saw, was that if the database server and the webserver run under two
different accounts (like "www" and "pgsql"), the webserver and database
server may have problem reading the others temporary files.
Here is a list of problems
and their solutions.
- The webserver cannot upload the file to PostgreSQL because it doesn't
have super user privileges to use lo_import and lo_export. Well, give the
webserver its own database server (same account), give the webserver
super user status with the database server, or look into client-side
commands that I never had a chance to look at yet, concerning lo_import
and lo_export.
- The webserver cannot delete the exported files once the database
server has exported them and the webserver doesn't need them anymore.
Make the webserver and database server run under the same account,
setup a "group" where the webserver can delete the database server
files, or setup a cron job to delete the files every once in a while and
save the files by the pid of the webserver so that they will tend
to get overwritten by new processes later. ZING just saves files by their pid
number so that the next time a document is pulled, it will overwrite the
previous one (which isn't needed anymore).
At ZING, we export the files by the pid number of the child process
of the webserver. When the webserver needs to export another document,
it hands the process to one of its children, and if the child already
has exported a document before, it will just overwrite the previous one
which makes it so we don't end up with tons of exported undeleted files.
Then, once an hour, we go through and delete any files older than 15 minutes.
It is not as elegant as I would like, but it works fine.
In general, it is a bad idea to let your webserver have super user
status or to have the webserver and database server run under the same
username. For security, you don't want your webserver having the power
to blow away the database server. Oh well.
Comments
I really don't like the way Large Objects are handled in PostgreSQL, or
any other database server I have used. There should be a way to treat
importing and exporting data from a database server as STDIN or STDOUT
where you can just grab the stuff line by line and not the whole darn
thing at once. I don't like the fact that we have to use intermediate files
to get to the data. It would be nice if we could choose to directly dump data
directly into memory or use temporary files if the data is too big to hold
in memory. CHOICE is the key word here.
What does this mean? Since PostgreSQL is the best
free and open database server out there, help the guys out by
becoming a developer for PostgreSQL! Perhaps there is a way to directly
dump Large Objects into memory using PostgreSQL, and if there is, write
a follow-up to this article and prove me wrong! Actually, being able to
get Large Objects one line at a time would really be cool.
Anyways, I will try to find better solutions, so
let me know if you hear
of any!
Large Objects have always been a pain for me. I have always wanted to
use Large Objects in PostgreSQL. Now that I had a reason, I finally
did it. Hopefully it will save other people headaches. If it saved you
from headaches, donate time or money to
ZING, or do something charitable
for some other cause! If we all do a little, it makes a big impact.
My next goal is to do handle Large Objects in PostgreSQL using the
programming languages PHP and Python. After that, I want to see if there
is a way to use Large Objects without being a super user.
After that, I want to compare this
to MySQL. For licensing reasons and since PostgreSQL has always been
100% free and open, I prefer PostgreSQL over MySQL. However I want to
compare and contrast them to help make PostgreSQL better. Recently, MySQL
has relaxed its license, but I am going to stick with PostgreSQL since
they have always had the best license.
Mark Nielsen works for
The Computer Underground as a
receptionist and as a
book binder at
ZING.
In his spare time, he does volunteer stuff, like writing
these documents for The Linux Gazette (and other magazines).
Copyright © 2000, Mark Nielsen
Published in Issue 50 of Linux Gazette, February 2000