KEMBAR78
PostgreSQL for Schemaless Data | PDF | Json | Postgre Sql
100% found this document useful (1 vote)
292 views61 pages

PostgreSQL for Schemaless Data

PostgreSQL can be used as a schemaless database using three different data types: XML, hstore, and JSON. XML has been available since the 1990s but has performance issues for indexing. Hstore uses key-value pairs and has good indexing support via GiST and GIN indexes. JSON was recently added to the core in PostgreSQL 9.2 and has basic functionality, with more coming in 9.3. Based on benchmarks loading 1.78 million records and querying on the primary key, the relational and MongoDB approaches had the best performance, while hstore with GiST/GIN and XML had slower performance than the other options.

Uploaded by

Ivan Voras
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
292 views61 pages

PostgreSQL for Schemaless Data

PostgreSQL can be used as a schemaless database using three different data types: XML, hstore, and JSON. XML has been available since the 1990s but has performance issues for indexing. Hstore uses key-value pairs and has good indexing support via GiST and GIN indexes. JSON was recently added to the core in PostgreSQL 9.2 and has basic functionality, with more coming in 9.3. Based on benchmarks loading 1.78 million records and querying on the primary key, the relational and MongoDB approaches had the best performance, while hstore with GiST/GIN and XML had slower performance than the other options.

Uploaded by

Ivan Voras
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 61

PostgreSQL as a Schemaless Database.

Christophe Pettus PostgreSQL Experts, Inc. PgDay FOSDEM 2013

Welcome!
Im Christophe. PostgreSQL person since 1997. Consultant with PostgreSQL Experts, Inc. cpettus@pgexperts.com thebuild.com @xof on Twitter.

Whats on the menu?


What is a schemaless database? How can you use PostgreSQL to store
schemaless data?

How does do the various schemaless


options perform?

A note on NoSQL.
Worst. Term. Ever. Its true that all modern schemaless
databases do not use SQL, but

Neither did Postgres before it became


PostgreSQL. (Remember QUEL?) xed schema.

The dening characteristic is the lack of a

Schematic.
A schema is a xed (although mutable
over time) denition of the data. table to eld/column/attribute.

Database to schema (unfortunate term) to Individual elds can be optional (NULL). Adding new columns requires a schema
change.

Rock-n-Roll!
Schemaless databases store documents
rather than rows.

They have internal structure, but that structure is per document. No elds! No schemas! Make up whatever
you like!

We are not amused.


Culturally, very different from the glass
house data warehouse model. storage

Grew out of the need for persistent object and impatience with the (perceived)
limitations of relational databases and object-relational managers.

Let us never speak of this again. Theres a lot to talk about in schemaless vs
traditional relational databases.

But lets not. Todays topic: If you want to store And what can you expect?

schemaless data in PostgreSQL, how can you?

What is schemaless data?


Schemaless does not mean unstructured. Each document (=record/row) is a
hierarchical structure of arrays and keyvalue pairs. one of these

The application knows what to expect in and how to react if it doesnt get it.

PostgreSQL has you covered. Not one, not two, but three different
document types:

Lets see what theyve got.

XML hstore JSON

XML
It seemed like a good idea at the time.

XML
Been around since the mid-1990s. Hierarchical structured data based on
SGML.

Underlying technology for SOAP and a lot


of other stuff that was really popular for a while.

Still super-popular in the Java world.

XML, your dads document language. Can specify XML schemas using DTDs. No one does this. Can do automatic transformations of XML
into other markups using XSLT.

Lets not forget the most important use of


XML!

Only the masochistic do this.

Tomcat Conguration Files.


<Server port="8005" shutdown="SHUTDOWN" debug="0"> <Service name="Tomcat-Standalone"> <Connector className="org.apache.catalina.connector.http.HttpConnector" port="8080" minProcessors="5" maxProcessors="75" enableLookups="true" redirectPort="8443" acceptCount="10" debug="0" connectionTimeout="60000"/> <Engine name="Standalone" defaultHost="localhost" debug="0"> <Logger className="org.apache.catalina.logger.FileLogger" prefix="catalina_log." suffix=".txt" timestamp="true"/> <Realm className="org.apache.catalina.realm.MemoryRealm" /> <Host name="localhost" debug="0" appBase="webapps" unpackWARs="true"> <Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs" prefix="localhost_access_log." suffix=".txt" pattern="common"/> <Logger className="org.apache.catalina.logger.FileLogger" directory="logs" prefix="localhost_log." suffix=".txt" timestamp="true"/> <Context path="/examples" docBase="examples" debug="0" reloadable="true"> <Logger className="org.apache.catalina.logger.FileLogger" prefix="localhost_examples_log." suffix=".txt" timestamp="true"/> </Context> </Host> </Engine> </Service> </Server>

XML Support in PostgreSQL. Built-in type. Can handle documents up to 2 gigabytes. A healthy selection of XML operators. xpath in particular. Very convenient XML export functions. Great for external XML requirements.

XML Indexing.
There isnt any. Unless you build it yourself with an
expression index.

Functionality is great. Performance is well talk about this later.

hstore
The hidden gem of contrib/

hstore
A hierarchical storage type specic to
PostgreSQL.

Maps string keys to string values, or to other hstore values. Contrib module; not part of the
PostgreSQL core.

hstore functions
Lots and lots and lots of hstore functions. h->a (get value for key a). h?a (does h contain key a?). h@>a->2 (does key a contain 2?). Many others.

hstore indexing.
Can create GiST and GIN indexes over
hstore values. key.

Indexes the whole hierarchy, not just one Accelerates @>, ?, ?& and ?| operators. Can also build expression indexes.

JSON
All the cool kids are doing it.

JSON
JavaScript Object Notation. JavaScripts data structure declaration
format, turned into a protocol.

Dictionaries, arrays, primitive types. Originally designed to just be passed into


eval() in JavaScript.

Please dont do this.

JSON, the new hotness


The de facto standard API data format for
REST web services. programmers.

Very comfortable for Python and Ruby MongoDBs native data storage type.

JSON? Yeah, we got that.


JSON type in core as of 9.2. Validates JSON going in. And not much else right now. array_to_json, row_to_json. Lots more coming in 9.3 (offer subject to
committer approval).

JSON Indexing.
Expression indexing. Can also treat as a text string for strict
comparison

which is kind of a weird idea and Im


not sure why youd do that.

But the coolest part of JSON in core is!

PL/V8!
The V8 JavaScript engine from Google is
available as an embedded language. youd expect.

JavaScript deals with JSON very well, as Not part of core or contrib; needs to be
built and installed separately.

PL/V8 ProTips
Use the static V8 engine that comes with
PL/V8.

Function is compiled by V8 on rst use. Now that we got rid of SQL injection
attacks, we now have JSON injection attacks.

PL invocation overhead is non-trivial.

Schemaless Strategies
Create single-eld tables with only a
hierarchical type. an object API.

Wrap up the (very simple) SQL to provide Create indexes to taste Maybe extract elds if you need to JOIN. Prot!

CREATE OR REPLACE FUNCTION get_json_key(structure JSON, key TEXT) RETURNS TEXT AS $get_json_key$ var js_object = structure; if (typeof ej != 'object') return NULL; return JSON.stringify(js_object[key]); $get_json_key$ IMMUTABLE STRICT LANGUAGE plv8;

CREATE TABLE blog { post json } CREATE INDEX post_pk_idx ON blog((get_json_key(post, post_id)::BIGINT)); CREATE INDEX post_date_idx ON blog((get_json_key(post, post_date)::TIMESTAMPTZ));

But but but


PostgreSQL was not designed to be a
schemaless database.

Wouldnt it be better to use a bespoke Well, lets nd out!

database designed for this kind of data?

Some Numbers.
When all else fails, measure.

Schemaless Shootout!
A very basic document structure: id, name, company, address1, address2,
city, state, postal code.

address2 and company are optional


(NULL in relational version).

1,780,000 records, average 63 bytes each.

id 64-bit integer, all others text.

The Competitors!
Traditional relational schema. hstore (GiST and GIN indexes). XML JSON One column per table for these. MongoDB

Timing Harness.
Scripts written in Python. psycopg2 2.4.6 for PostgreSQL interface. pymongo 2.4.2 for MongoDB interface.

The Test Track.


This laptop. OS X 10.7.5. 2.8GHz Intel Core i7. 7200 RPM disk. 8GB (never comes close to using a fraction
of it).

Indexing Philosophy
For relational, index on primary key. For hstore, index using GiST and GIN (and
none).

For JSON and XML, expression index on


primary key.

For MongoDB, index on primary key. Indexes created before records loaded.

Your Methodology Sucks.


Documents are not particularly large. No deep hierarchies. Hot cache. Only one index. No joins. No updates.

The Sophisticated Database Tuning Philosophy. None. Stock PostgreSQL 9.2.2, from source. No changes to postgresql.conf Stock MongoDB 2.2, from MacPorts. Fire it up, let it go.

First Test: Bulk Load


Scripts read a CSV le, parse it into the
appropriate format, INSERT it into the database. parsing time.

We measure total load time, including (COPY will be much much much faster.) mongoimport too, most likely.

Records/Second
6000

4500

3000

1500

Relational

hstore

hstore (GiST) hstore (GIN)

XML

JSON

MongoDB

Observations.
No attempt made to speed up PostgreSQL. Synchronous commit, checkpoint tuning,
etc.

GIN indexes are really slow to build. The XML xpath function is probably the
culprit for its load time.

Next Test: Disk Footprint.


Final disk footprint once data is loaded. For PostgreSQL, reported database sizes
from the pg_*_size functions.

For MongoDB, reported by db.stats().

Disk Footprint in Megabytes


2250

Data

Index

1500

750

Relational

hstore

hstore (GiST) hstore (GIN)

XML

JSON

MongoDB

Observations.
GIN indexes are really big on disk. PostgreSQLs relational data storage is very
efcient.

MongoDB certain likes its disk space. padding factor was 1, so it wasnt that.

None of these records are TOAST-able.

Next Test: Query on Primary Key For a sample of 100 documents, query a
Results not fetched. For PostgreSQL, time of .execute()
method from Python. method.

single document based on the primary key.

For MongoDB, time of .fetch()

Fetch Time in Milliseconds


400

300

200

100

Relational

hstore

hstore (GiST) hstore (GIN)

XML

JSON

MongoDB

Fetch Time in Milliseconds (<100ms)


13

9.75

6.5

3.25

Relational

XML

JSON

MongoDB

Fetch Time in Milliseconds (>100ms)


400

300

200

100

hstore

hstore (GiST)

hstore (GIN)

Observations.
B-tree indexes kick ass. GiST and GIN not even in same league
for simple key retrieval.

Difference between relational, XML and


JSON is not statistically signicant. be super-performant. Huh.

Wait, I thought MongoDB was supposed to

Next Test: Query on Name


For a sample of 100 names, query all
documents with that name.

Results not fetched. Required a full-table scan (except for


hstore with GiST and GIN indexes).

Same timing methodology.

Fetch Time in Milliseconds


50000

37500

25000

12500

Relational

hstore

hstore (GiST) hstore (GIN)

XML

JSON

MongoDB

Fetch Time in Milliseconds (<500ms)


500

375

250

125

Relational

hstore

hstore (GiST)

hstore (GIN)

MongoDB

Fetch Time in Milliseconds (>500ms)


50000

37500

25000

12500

XML

JSON

Observations.
GiST and GIN accelerate every eld, not
just the primary key.

Wow, executing the accessor function on


each XML and JSON eld is slow. footprint hurts it here.

MongoDBs grotesquely bloated disk Not that theres anything wrong with that.

Now that we know this, what do we know?

Some Conclusions.
PostgreSQL does pretty well as a
schemaless database.

Build indexes using expressions on


commonly-queried elds full exibility.

or use GiST and hstore if you want GIN might well be worth it for other cases.

Some Conclusions, 2.
Avoid doing full-table scans if you need to
use an accessor function. to xpath or a PL.

Although hstores are not bad compared Seriously consider hstore if you have the
exibility.

Its really fast.

Flame Bait!
MongoDB doesnt seem to be more
performant than PostgreSQL. goodies.

And you still get all of PostgreSQLs Larger documents will probably continue to
favor PostgreSQL.

As will larger tables.

Fire Extinguisher.
You can nd workloads that prove any dBase II included.
data model, now and in the future. real-world volumes. data storage technology is the right answer.

Be very realistic about your workload and Test, and test fairly with real-world data in

Thank you!

thebuild.com @xof

You might also like