Fun with PostgreSQL, Psycopg2 and Bytea arrays

Introduction

Despite being a great DBMS, PostgreSQL has a few wrinkles that can cause quite a bit of pain :-)

One such wrinkle is the insertion of database rows with Bytea arrays.

If you’re not dealing with PostgreSQL or don’t need to wrangle with Bytea arrays feel free to skip this article. If you are, however, this article may save you a lot of time and frustration.

The code

Since the correct number of backslashes in the code below is important but unlikely to be displayed properly in your web browser please view it syntax highlighted here or in plain format here.

The code below consists of roughly two sections:

  • the actual code of interest dealing with the backslash plague and the quoting orgy (prepareByteaString(), lines 49-82)
  • a test harness (lines 84-118) merely facilitating the testing of the function of interest

The problem on hand is that the PostgreSQL Array Value Input syntax requires that the Bytea array literal be enclosed in single quotes. Psycopg2 however quotes individual byte strings using single quotes as well.

Please note: a byte string or array corresponds to one Bytea variable. A Bytea array is hence an array of byte arrays or strings.

The resulting Bytea array literals are a mess and cause syntax errors when used in INSERT statements etc. The approach chosen is to “re-quote” the Bytea literals returned by the Psycopg2 Binary() function from single to double quotes, For more detail please see the comments on lines 68-80.

  36 # created: Thu Oct 25 21:35:50 2007
  37 __version__ = "$Id:$"
  38 # $HeadURL $
  39 
  40 import psycopg2 as ps2
  41 
  42 class PGT(object):
  43     """Utility functions for using a PostgreSQL database with python"""
  44 
  45     def __init__(self):
  46         """initialiser"""
  47         # super(NEW_CLASS, self).__init__()
  48 
  49     @staticmethod
  50     def prepareByteaString(byteaSeq):
  51         """
  52         Given a sequence of byte arrays this function prepares a properly
  53         quoted string to be used for inserting database rows with Bytea
  54         arrays.
  55 
  56         Given e.g. a table like the following
  57             Create table baex(byteaa Bytea ARRAY[16]);
  58         the resulting string ('bas') can be used as follows:
  59             cursor.execute("INSERT INTO baex(byteaa) values('{%s}')" % bas))
  60 
  61         Parameters:
  62         - byteaSeq: a sequence of byte arrays each corresponding to a Bytea
  63                     value in the database
  64         Returns:
  65         string: containing all the byte arrays from the 'byteaSeq' properly
  66                 quoted for utilisation in an INSERT statement
  67         """
  68         # in a first step
  69         #   1. quote all the byte arrays (using the psycopg2 Binary()
  70         #      function)
  71         #   2. strip away the single quotes on the left and right side
  72         #   3. escape any double quote characters with a backslash
  73         # The last step is necessary because we will use the double quote as
  74         # the quote/delimiter for the byte arrays
  75         baSeq = [str(ps2.Binary(ba))[1:-1].replace(r'"', r'\"') for ba in byteaSeq]
  76         # join the prep'ed byte arrays into a single string
  77         bas = "\"%s\"" % '","'.join(baSeq)
  78         # double the number of backslashes (needed because we're inserting a
  79         # Bytea array as opposed to a single Bytea)
  80         bas = bas.replace('\\', '\\\\')
  81         # done!
  82         return(bas)

The code below is only executed when you invoke the Python file directly but will not run if you import it.

  84 if __name__ == '__main__':
  85     ### TEST code ********************************************************
  86     import os, sys
  87     from random import random as rand
  88 
  89     # connect to test database
  90     db = ps2.connect("dbname='test' user='postgres'")
  91     cursor = db.cursor()
  92     # create test table
  93     cursor.execute('Create Table public.baex(id Serial, byteaa Bytea ARRAY[16])')

For test purposes I am connecting to my test database (line 90) and creating a test table (baex, line 93).

  95     sys.stdout.flush()
  96     print "\n******************** Bytea data generated: ********************"

Then I generate Bytea data for three database rows and insert it into the database (loop on lines 98-109).

  97     # generate 3 byte array sequences
  98     for rowId in range(1, 4):
  99         byteaSeq = []

Each row is populated with a Bytea array holding of up to seven elements. I am using the random() function to generate the bytes.

 100         # generate 1-7 random byte arrays
 101         for numOfSstrings in range(1, int(rand()*8)):
 102             bytea = ''.join([chr(int(rand()*256)) for x in range(int(rand()*5))])
 103             byteaSeq.append(bytea)
 104         print byteaSeq
 105         # get the INSERT string for the byte array sequence generated
 106         bas = PGT.prepareByteaString(byteaSeq)
 107         # insert the row into the table
 108         cursor.execute("INSERT INTO public.baex(id, byteaa) VALUES(%s, '{%s}')" \
 109                        % (rowId, bas))
 110     db.commit()
 111     sys.stdout.flush()

Once the data is inserted into the database I run psql to check whether everything worked properly..

 113     print "\n******************** Bytea data inserted: ********************"
 114     sys.stdout.flush()
 115     # show the data inserted
 116     os.system("psql -d test -U postgres -c 'SELECT * FROM public.baex'")

.. and finally drop the test table.

 117     cursor.execute('DROP TABLE public.baex')
 118     db.commit()

A few test runs

I have selected the two test runs below because they show some interesting (edge) cases.

The first one shows that single quotes are dealt with properly (last byte in last byte string of first row and first byte string of the third row)

******************** Bytea data generated: ********************
['R\xb7', '\xd7\xcd\xbb', 'u', "\xa3[\x97'"]
['\xfbT\x84g', '\xfa', '']
["'", '\x8d', '', '\xc5\n5', 'A', '\xa4P*']

******************** Bytea data inserted: ********************
 id |                    byteaa
----+-----------------------------------------------
  1 | {"R\\267","\\327\\315\\273",u,"\\243[\\227'"}
  2 | {"\\373T\\204g","\\372",""}
  3 | {',"\\215","","\\305\125",A,"\\244P*"}
(3 rows)

The second test run demonstrates that double quotes are handled correctly (fist byte of first byte string in first row)

******************** Bytea data generated: ********************
['"\xa5\x13', '\x9a`', '', '\xf4\x98']
[]
['u', '?7', '1\xff', '.\x16\xcf', '\xcbe}', '']

******************** Bytea data inserted: ********************
 id |                   byteaa
----+--------------------------------------------
  1 | {"\"\\245\23","\\232`","","\\364\\230"}
  2 | {""}
  3 | {u,?7,"1\\377",".\26\\317","\\313e}",""}
(3 rows)

Conclusion

Once you bite the bullet and invest the time to think about the problem, code the function and test it, it’s simple :-)