Python-CouchDB Training at PyCon PL 2012
Using CouchDB with Python

      Stefan Kögl
       @skoegl
What we will cover
●   What is CouchDB?
    –   Access from Python through couchdbkit
    –   Key-value Store Functionality
    –   MapReduce Queries
    –   HTTP API
●   When is CouchDB useful and when not?
    –   Multi-Master Replication
    –   Scaling up and down
●   Pointers to other resources, CouchDB ecosystem
What we won't cover

●   CouchApps – browser-based apps that are served by
    CouchDB
●   Detailed Security, Scaling and other operational issues
●   Other functionality that didn't fit
Training Modes
●   Code-Along
    –   Follow Examples, write your own code
    –   Small Scripts or REPL
●   Learning-by-Watching
    –   Example Application at
        https://github.com/stefankoegl/python-couchdb-examples
    –   Slides at
        https://slideshare.net/skoegl/couch-db-pythonpyconpl2012
    –   Use example scripts and see what happens
    –   Submit Pull-Requests!
Contents
●   Intro
    –   Contents
    –   CouchDB
    –   Example Application
●   DB Initialization
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
●   Complex MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
CouchDB
●   Apache Project
●   https://couchdb.apache.org/
●   Current Version: 1.2


●   Apache CouchDB™ is a database that uses JSON for
    documents, JavaScript for MapReduce queries, and regular
    HTTP for an API
Example Application
●   Lending Database
    –   Stores Items that you might want to lend
    –   Stores when you have lent what to whom
●   Stand-alone or distributed
●   Small Scripts that do one task each
●   Look at HTTP traffic
Contents
●   Intro
●   DB Initialization
    –   Setting Up CouchDB
    –   Installing couchdbkit
    –   Creating a Database
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
●   Complex MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
Getting Set Up: CouchDB
●   Provided by me (not valid anymore after the training)
●   http://couch.skoegl.net:5984/<yourname>
●   Authentication: username training, password training
●   Setup your DB_URL in settings.py


●   If you want to install your own
    –   Tutorials: https://wiki.apache.org/couchdb/Installation
    –   Ubuntu: https://launchpad.net/~longsleep/+archive/couchdb
    –   Mac, Windows: https://couchdb.apache.org/#download
Getting Set Up: couchdbkit
●   http://couchdbkit.org/
●   Python client library
# install with pip
pip install couchdbkit


# or from source
git clone git://github.com/benoitc/couchdbkit.git
cd couchdbkit 
sudo python setup.py install


# and then you should be able to import 
import couchdbkit
Contents
●   Intro
●   DB Initialization
    –   Setting Up CouchDB
    –   Installing couchdbkit
    –   Creating a Database
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
●   Complex MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
Creating a Database
●   What we have: a CouchDB server and its URL
    eg http://127.0.0.1:5984


●   What we want: a database there
    eg http://127.0.0.1:5984/myname


●   http://wiki.apache.org/couchdb/HTTP_database_API
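On the HTTP level, creating the database is a single PUT request to the database URL. A minimal sketch using only Python 3's standard library; the helper name build_create_db_request is made up for illustration, the URL and database name are the example values from above, and the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_create_db_request(server_url, db_name):
    """Build (but do not send) the PUT request that creates a database."""
    url = server_url.rstrip("/") + "/" + db_name
    return Request(url, method="PUT")


req = build_create_db_request("http://127.0.0.1:5984", "myname")
print(req.get_method(), req.full_url)
# Sending it with urllib.request.urlopen(req) would need a running server
# (and credentials, if the server requires them).
```

couchdbkit wraps this kind of request, so the sketch is mainly useful for reading the HTTP traffic in the log.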
A note on Debugging
●   Apache-style log files
●   Locally
    –   $ tail -f /var/log/couchdb/couch.log
●   HTTP
    –   http://127.0.0.1:5984/_log?bytes=5000
    –   http://wiki.apache.org/couchdb/HttpGetLog
Creating a Database
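# ldb-init.py
from restkit import BasicAuth
from couchdbkit import Database
from couchdbkit.exceptions import ResourceNotFound

from settings import DB_URL  # database URL, e.g. http://127.0.0.1:5984/myname

# HTTP basic authentication for every request to the server
auth_filter = BasicAuth('username', 'pwd')
db = Database(DB_URL, filters=[auth_filter])
server = db.server

# start from scratch: drop the database if it already exists
try:
    server.delete_db(db.dbname)
except ResourceNotFound:
    pass

db = server.get_or_create_db(db.dbname)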
Creating a Database
[Thu, 06 Sep 2012 16:44:30 GMT] [info] [<0.1435.0>] 127.0.0.1 - - DELETE /myname/ 200
[Thu, 06 Sep 2012 16:44:30 GMT] [info] [<0.1435.0>] 127.0.0.1 - - HEAD /myname/ 404
[Thu, 06 Sep 2012 16:44:30 GMT] [info] [<0.1440.0>] 127.0.0.1 - - PUT /myname/ 201
Contents
●   Intro
●   DB Initialization
●   Key-Value Store
    –   Modelling Documents
    –   Storing and Retrieving Documents
    –   Updating Documents
●   Simple MapReduce Queries
●   The _changes Feed
●   Complex MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
Key-Value Store
●   Core of CouchDB
●   Keys (_id): any valid JSON string
●   Values (documents): any valid JSON objects
●   Stored in B+-Trees
●   http://guide.couchdb.org/draft/btree.html
Modelling a Thing
●   A thing that we want to lend
    –   Name
    –   Owner
    –   Dynamic properties like
         ●   Description
         ●   Movie rating
         ●   etc
Modelling a Thing
●   In CouchDB documents are JSON objects
●   You can store any dict
    –   Wrapped in couchdbkit's Document classes for convenience


●   Documents can be serialized to JSON …
    mydict = mydoc.to_json()
●   … and deserialized from JSON
    mydoc = DocClass.wrap(mydict)
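The round trip between a document object and a plain dict can be illustrated without a database. The class MiniDocument below is a hypothetical stand-in written for this sketch; it only mimics the to_json() and wrap() idea and is not couchdbkit's implementation.

```python
import json


class MiniDocument(object):
    """Hypothetical stand-in for a document class; not couchdbkit's code."""

    def __init__(self, **fields):
        self._fields = dict(fields)
        # Tag each document with its type, unless the data already carries one.
        self._fields.setdefault("doc_type", type(self).__name__)

    def to_json(self):
        """Return a plain dict that can be serialized to JSON."""
        return dict(self._fields)

    @classmethod
    def wrap(cls, data):
        """Build an instance from a plain dict (e.g. parsed JSON)."""
        return cls(**data)


class Thing(MiniDocument):
    pass


original = Thing(owner="stefan", name="CouchDB The Definitive Guide")
as_text = json.dumps(original.to_json(), sort_keys=True)  # what travels over HTTP
restored = Thing.wrap(json.loads(as_text))
print(as_text)
```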
Modelling a Thing
# models.py
from couchdbkit import Database, Document, StringProperty


class Thing(Document):
    owner = StringProperty(required=True)
    name = StringProperty(required=True)


db = Database(DB_URL)
Thing.set_db(db)
Storing a Document
●   Document identified by _id
    –   Auto-assigned by Database (bad)
    –   Provided when storing the document (good)
    –   Think about lost responses: a retry with the same _id cannot create a duplicate
    –   couchdbkit provides the _id for us


●   couchdbkit adds property doc_type with value "Thing"
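Why a client-provided _id is the safer choice can be shown with a small simulation. The dict store and the function put_document are invented for this sketch and only imitate the status codes a PUT to a document URL would produce; no server is involved.

```python
import uuid

store = {}  # in-memory stand-in for a database: _id -> document


def put_document(doc_id, doc):
    """Simulate PUT /db/<doc_id>: create the document unless it already exists."""
    if doc_id in store:
        return 409   # conflict: the earlier attempt already arrived
    store[doc_id] = doc
    return 201       # created


doc_id = uuid.uuid4().hex          # ID chosen by the client before sending
doc = {"doc_type": "Thing", "owner": "stefan", "name": "robot"}

first = put_document(doc_id, doc)   # imagine this response gets lost ...
retry = put_document(doc_id, doc)   # ... so the client sends it again
print(first, retry, len(store))
```

With a server-assigned _id, the retry would have been stored as a second, independent document.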
Internal Storage
●   Database File /var/lib/couchdb/dbname.couch
●   B+-Tree of _id
●   Access: O(log n)
●   Append-only storage
●   Accessible in historic order (we'll come to that later)
Storing a Document
# ldb-new-thing.py
couchguide = Thing(owner='stefan',
                   name='CouchDB The Definitive Guide')
couchguide.publisher = "O'Reilly"
couchguide.to_json()
# {'owner': u'stefan', 'doc_type': 'Thing',
# 'name': u'CouchDB The Definitive Guide',
# 'publisher': u"O'Reilly"}


couchguide.save()


print couchguide._id
# 448aaecfe9bc1cde5d6564a4c93f79c2
Storing a Document
[Thu, 06 Sep 2012 19:40:26 GMT] [info] [<0.962.0>] 127.0.0.1 - - GET /_uuids?count=1000 200
[Thu, 06 Sep 2012 19:40:26 GMT] [info] [<0.962.0>] 127.0.0.1 - - PUT /lendb/8f14ef7617b8492fdbd800f1101ebb35 201
Retrieving a Document
●   Retrieve a Document by its _id
    –   Limited use
    –   Does not allow queries by other properties



# ldb-get-thing.py
thing = Thing.get(thing_id)
Retrieving a Document
[Thu, 06 Sep 2012 19:45:30 GMT] [info] [<0.962.0>] 127.0.0.1 - - GET /lendb/8f14ef7617b8492fdbd800f1101ebb35 200
Updating a Document
●   Optimistic Concurrency Control
●   Each Document has a revision
●   Each Operation includes revision
●   Operation fails if revision doesn't match
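The revision check can be sketched with an in-memory store. TinyStore and Conflict are names invented for this illustration, and revisions are simplified to plain integers; real CouchDB revisions are strings of the form N-hash.

```python
class Conflict(Exception):
    """Raised when the supplied revision is not the current one."""


class TinyStore(object):
    """In-memory illustration of revision checks; not how CouchDB stores data."""

    def __init__(self):
        self._docs = {}   # _id -> (revision number, document body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return dict(body, _id=doc_id, _rev=rev)

    def save(self, doc):
        doc_id = doc["_id"]
        current_rev = self._docs.get(doc_id, (0, None))[0]
        # The update is accepted only if it was based on the current revision.
        if doc.get("_rev", 0) != current_rev:
            raise Conflict("Document update conflict.")
        body = {k: v for k, v in doc.items() if k not in ("_id", "_rev")}
        self._docs[doc_id] = (current_rev + 1, body)
        doc["_rev"] = current_rev + 1


db = TinyStore()
db.save({"_id": "thing-1", "name": "robot"})

thing1 = db.get("thing-1")   # both clients read revision 1
thing2 = db.get("thing-1")

thing1["something"] = True
db.save(thing1)              # succeeds, document is now at revision 2

thing2["conflicting"] = "test"
try:
    db.save(thing2)          # still carries revision 1
    outcome = "saved"
except Conflict:
    outcome = "conflict"
print(outcome)
```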
Updating a Document
>>> thing1 = Thing.get(some_id)      >>> thing2 = Thing.get(some_id)
>>> thing1._rev                      >>> thing2._rev
'1-110e1e46bcde6ed3c2d9b1073f0b26'   '1-110e1e46bcde6ed3c2d9b1073f0b26'

>>> thing1.something = True
>>> thing1.save()                    >>> thing2._rev
>>> thing1._rev                      '1-110e1e46bcde6ed3c2d9b1073f0b26'
'2-3f800dffa62f4414b2f8c84f7cb1a1'   >>> thing2.conflicting = 'test'
                                     >>> thing2.save()
Success                              couchdbkit.exceptions.ResourceConflict:
                                     Document update conflict.
                                     Failed
Updating a Document
[Thu, 13 Sep 2012 06:16:52 GMT] [info] [<0.7977.0>] 127.0.0.1 - - GET /lendb/d46d311d9a0f64b1f7322d20721f9f1d 200
[Thu, 13 Sep 2012 06:16:55 GMT] [info] [<0.7977.0>] 127.0.0.1 - - GET /lendb/d46d311d9a0f64b1f7322d20721f9f1d 200
[Thu, 13 Sep 2012 06:17:34 GMT] [info] [<0.7977.0>] 127.0.0.1 - - PUT /lendb/d46d311d9a0f64b1f7322d20721f9f1d 201
[Thu, 13 Sep 2012 06:17:48 GMT] [info] [<0.7977.0>] 127.0.0.1 - - PUT /lendb/d46d311d9a0f64b1f7322d20721f9f1d 409
Contents
●   Intro
●   DB Initialization
●   Key-Value Store
●   Simple MapReduce Queries
    –   Create a View
    –   Query the View
●   The _changes Feed
●   Complex MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
Views
●   A specific "view" on (parts of) the data in a database
●   Indexed incrementally
●   Query is just reading a range of a view sequentially
●   Generated using MapReduce
MapReduce Views
●   Map Function
    –   Called for each document
    –   Has to be side-effect free
    –   Emits zero or more intermediate key-value pairs
●   Reduce Function (optional)
    –   Aggregates intermediate pairs
●   View Results stored in B+-Tree
    –   Incrementally pre-computed at query-time
    –   Queries are just O(log n) lookups
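The map and reduce steps described above can be imitated in a few lines of Python. This is only a sketch of the idea: the function names are invented here, real view functions are written in JavaScript (or another view server language), and Python's list ordering is not the same as CouchDB's view collation.

```python
def map_thing(doc):
    """Map step: yield zero or more (key, value) pairs per document."""
    if doc.get("doc_type") == "Thing":
        yield ([doc["owner"], doc["name"]], None)


def build_view(docs, map_fn):
    """Run the map function over every document and sort the rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def count(values):
    """Reduce step: aggregate the values of the selected rows."""
    return len(values)


docs = [
    {"doc_type": "Thing", "owner": "stefan", "name": "couchguide"},
    {"doc_type": "Thing", "owner": "marek", "name": "robot"},
    {"doc_type": "Lending", "owner": "stefan"},   # ignored by the map function
]
view = build_view(docs, map_thing)
total = count([value for _key, value in view])
print(view, total)
```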
List all Things
●   Implemented as MapReduce View
●   Contained in a Design Document
    –   Create
    –   Store
    –   Query
Create a Design Document
●   Regular document, interpreted by the database
●   Views Mapped to Filesystem by directory structure
    _design/<ddoc name>/views/<view name>/{map,reduce}.js

●   Written in JavaScript or Erlang
●   Pluggable View Servers
    –   http://wiki.apache.org/couchdb/View_server
    –   http://packages.python.org/CouchDB/views.html
    –   Lisp, PHP, Ruby, Python, Clojure, Perl, etc
Design Document

# _design/things/views/by_owner_name/map.js


function(doc) {
    if(doc.doc_type == "Thing") {
        emit([doc.owner, doc.name], null);
    }
}
Intermediate Results
Key                               Value
["stefan", "couchguide"]          null
["stefan", "Polish Dictionary"]   null
["marek", "robot"]                null
Design Document

# _design/things/views/by_owner_name/reduce.js


_count
Reduced Results
●   Result depends on group level

group_level=2
Key                               Value
["stefan", "couchguide"]          1
["stefan", "Polish Dictionary"]   1
["marek", "robot"]                1

group_level=1
Key                               Value
["stefan"]                        2
["marek"]                         1

group_level=0
Key                               Value
null                              3
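How the group level changes the reduced result can be reproduced in plain Python. The sketch below assumes the rows are already sorted by key (here in Python's ordering, which differs from CouchDB's collation) and uses a made-up helper reduce_grouped that counts rows per key prefix, like the _count reduce.

```python
from itertools import groupby

# Rows as the map step would emit them, already sorted by key.
rows = [
    (["marek", "robot"], None),
    (["stefan", "Polish Dictionary"], None),
    (["stefan", "couchguide"], None),
]


def reduce_grouped(rows, group_level):
    """Count rows per key prefix of length group_level (0 = one total)."""
    if group_level == 0:
        return [(None, len(rows))]

    def prefix(row):
        return row[0][:group_level]

    return [(key, len(list(group))) for key, group in groupby(rows, key=prefix)]


by_owner_and_name = reduce_grouped(rows, 2)
by_owner = reduce_grouped(rows, 1)
overall = reduce_grouped(rows, 0)
print(by_owner_and_name, by_owner, overall)
```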
Synchronize Design Docs
●   Upload the design document
●   _id: _design/<ddoc name>
●   couchdbkit syncs ddocs from filesystem


●   We'll need this a few more times
    –   Put the following in its own script
    –   or run
        $ ./ldb-sync-ddocs.py
Synchronize Design Docs
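# ldb-sync-ddocs.py
from restkit import BasicAuth
from couchdbkit import Database
from couchdbkit.loaders import FileSystemDocsLoader

from settings import DB_URL  # database URL configured in settings.py

auth_filter = BasicAuth('username', 'pwd')
db = Database(DB_URL, filters=[auth_filter])

# push every design document found below ./_design to the database
loader = FileSystemDocsLoader('_design')
loader.sync(db, verbose=True)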
View things/by_owner_name
  ●   Emitted key-value pairs
  ●   Sorted by key
      http://wiki.apache.org/couchdb/View_collation

  ●   Keys can be complex (lists, dicts)
  ●   Query
      http://127.0.0.1:5984/myname/_design/things/_view/by_owner_name?reduce=false

Key                               Value   _id (implicit)   Document (implicit)
["stefan", "couchguide"]          null                     {…}
["stefan", "Polish Dictionary"]   null                     {…}
Query a View


# ldb-list-things.py
from models import Thing

things = Thing.view('things/by_owner_name',
                    include_docs=True, reduce=False)

for thing in things:
    print thing._id, thing.name, thing.owner
Query a View – Reduced

# ldb-overview.py
owners = Thing.view('things/by_owner_name',
                    group_level=1)


for owner_status in owners:
    owner = owner_status['key'][0]
    count = owner_status['value']
    print owner, count
Break
From the Break
●   Filtering by Price
    –   startkey = 5
    –   endkey = 10
●   Structure: ddoc name / view name
    –   Logical Grouping
    –   Performance
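The price filter from the break amounts to reading a contiguous range of a sorted index. A minimal sketch, assuming a hypothetical view whose key is the price; the rows and the helper query_range are invented for illustration, and both ends of the range are treated as inclusive.

```python
from bisect import bisect_left, bisect_right

# Hypothetical view rows keyed by price, kept sorted by key like a view index.
rows = [(3, "pen"), (5, "mug"), (7, "book"), (10, "lamp"), (12, "robot")]
keys = [key for key, _value in rows]


def query_range(startkey, endkey):
    """Return the rows with startkey <= key <= endkey (both ends inclusive)."""
    lo = bisect_left(keys, startkey)
    hi = bisect_right(keys, endkey)
    return rows[lo:hi]


selected = query_range(5, 10)
print(selected)
```

Finding both ends takes a logarithmic number of steps; the rows in between are then read sequentially.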
Contents
●   Intro
●   DB Initialization
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
    –   Accessing the _changes Feed
    –   Lending Objects
●   Advanced MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
Changes Sequence
●   With every document update, a change is recorded
●   Local history, ordered by _seq value
●   Only the latest _seq of each document is kept
Changes Feed
●   List of all documents, in the order they were last modified
●   Possibility to
    –   React on changes
    –   Process all documents without skipping any
    –   Continue at some point with since parameter


●   CouchDB as a distributed, persistent MQ
●   http://guide.couchdb.org/draft/notifications.html
●   http://wiki.apache.org/couchdb/HTTP_database_API#Changes
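The behaviour of the changes sequence and the since parameter can be imitated in memory. ChangesLog is a class invented for this sketch; it only keeps the latest sequence number per document and is not how CouchDB stores its data.

```python
class ChangesLog(object):
    """In-memory illustration of a changes sequence; not CouchDB's storage."""

    def __init__(self):
        self._seq = 0
        self._last_seq = {}   # _id -> sequence number of its latest change

    def record(self, doc_id):
        """Register an update; it replaces the document's older entry."""
        self._seq += 1
        self._last_seq[doc_id] = self._seq

    def changes(self, since=0):
        """Return (seq, _id) pairs newer than `since`, ordered by seq."""
        entries = [(seq, doc_id) for doc_id, seq in self._last_seq.items()
                   if seq > since]
        return sorted(entries)


log = ChangesLog()
for doc_id in ["a", "b", "a", "c"]:   # document "a" is updated twice
    log.record(doc_id)

everything = log.changes()
checkpoint = everything[0][0]          # pretend we processed the first entry
remaining = log.changes(since=checkpoint)
print(everything, remaining)
```

Storing the last processed sequence number as a checkpoint is what lets a consumer stop and later continue without skipping documents.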
Changes Feed
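# ldb-changes-log.py
from couchdbkit import Consumer
from models import db, Thing, Lending

# map doc_type values to their document classes
DOC_CLASSES = {'Thing': Thing, 'Lending': Lending}
since = 0  # sequence number to continue from (0 = from the beginning)

def callback(line):
    seq = line['seq']
    doc = line['doc']

    # get obj according to doc['doc_type']; fall back to the raw dict
    doc_class = DOC_CLASSES.get(doc.get('doc_type'))
    obj = doc_class.wrap(doc) if doc_class else doc
    print seq, obj

consumer = Consumer(db)
consumer.wait(callback, since=since, include_docs=True)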
"Lending" Objects
●   Thing that is lent
●   Who lent it (ie who is the owner of the thing)
●   To whom it is lent
●   When it was lent
●   When it was returned
Modelling a "Lend" Object
# models.py
from datetime import datetime
from couchdbkit import Document, StringProperty, DateTimeProperty


class Lending(Document):
    thing = StringProperty(required=True)
    owner = StringProperty(required=True)
    to_user = StringProperty(required=True)
    lent = DateTimeProperty(default=datetime.now)
    returned = DateTimeProperty()


Lending.set_db(db)
Lending a Thing
# ldb-lend-thing.py
from models import Lending

lending = Lending(thing=thing_id,
                  owner=username,
                  to_user=to_user)
lending.save()
Returning a Thing
# ldb-return-thing.py
from datetime import datetime
from models import Lending

lending = Lending.get(lend_id)
lending.returned = datetime.now()
lending.save()
Contents
●   Intro
●   DB Initialization
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
●   Advanced MapReduce Queries
    –   Imitating Joins with "Mixed" Views
●   Replication
●   Additional Features and the Couch Ecosystem
Current Thing Status
●   View to get the current status of a thing
●   No Joins
●   We emit keys that group related rows together
Complex View
# _design/things/views/status/map.js


function(doc) {
    if(doc.doc_type == "Thing") {
        emit([doc.owner, doc._id, 1], doc.name);
    }
    if(doc.doc_type == "Lending") {
        if(doc.lent && !doc.returned) {
            emit([doc.owner, doc.thing, 2], doc.to_user);
        }
    }
}                                                         
                      
Intermediate View Results
Key                    Value
["stefan", 12345, 1]   "couchguide"
["stefan", 12345, 2]   ["someone", "2012-09-12"]
["marek", 34544, 1]    "robot"
Reduce Intermediate Results
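# _design/things/views/status/reduce.js


/* use with group_level = 2 */
function(keys, values, rereduce) {
    var i;
    if(rereduce) {
        /* values are earlier results; keys is null here */
        for(i = 0; i < values.length; i++) {
            if(values[i] == "lent") { return "lent"; }
        }
        return "available";
    }
    /* keys[i] is [emitted key, doc id]; marker 2 is a "Lending" row */
    for(i = 0; i < keys.length; i++) {
        if(keys[i][0][2] == 2) { return "lent"; }
    }
    return "available";
}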
Reduce Intermediate Results
●   Don't forget to synchronize your design docs!
●   Group Level: 2
●   Reduce Function receives the rows that share the same key prefix

Intermediate (not reduced)
Key                                   Value
["stefan", 12345, 1]                  "couchguide"
["stefan", 12345, 2]                  ["someone", "2012-09-12"]
["marek", 34544, 1]                   "robot"

Reduced
Key                                   Value
["stefan", 12345]                     "lent"
["marek", 34544]                      "available"
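The way a mixed view imitates a join can be traced in plain Python. This is a sketch under simplifying assumptions: the documents are invented sample data, the map logic mirrors the Thing and Lending markers 1 and 2, and grouping by the first two key elements stands in for group_level=2.

```python
from itertools import groupby

docs = [
    {"_id": "12345", "doc_type": "Thing", "owner": "stefan", "name": "couchguide"},
    {"_id": "34544", "doc_type": "Thing", "owner": "marek", "name": "robot"},
    {"_id": "l1", "doc_type": "Lending", "owner": "stefan", "thing": "12345",
     "to_user": "someone", "lent": "2012-09-12", "returned": None},
]


def map_status(doc):
    """Emit Thing rows (marker 1) and open Lending rows (marker 2)."""
    if doc["doc_type"] == "Thing":
        yield ([doc["owner"], doc["_id"], 1], doc["name"])
    elif doc["doc_type"] == "Lending" and doc["lent"] and not doc["returned"]:
        yield ([doc["owner"], doc["thing"], 2], doc["to_user"])


def group_key(row):
    """group_level=2: group by the first two key elements."""
    return row[0][:2]


rows = sorted((row for doc in docs for row in map_status(doc)),
              key=lambda row: row[0])
status = {}
for (owner, thing_id), group in groupby(rows, key=group_key):
    markers = [key[2] for key, _value in group]
    # A marker-2 row means there is an open Lending for this thing.
    status[(owner, thing_id)] = "lent" if 2 in markers else "available"
print(status)
```

Because the Thing row and its Lending rows share the key prefix [owner, thing id], they end up next to each other in the sorted index, which is what makes the grouping work.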
Get Status
# ldb-status.py
from models import Thing

things = Thing.view('things/status', group_level=2)

for result in things:
    owner = result['key'][0]
    thing_id = result['key'][1]
    status = result['value']
    print owner, thing_id, status
Contents
●   Intro
●   DB Initialization
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
●   Advanced MapReduce Queries
●   Replication
    –   Setting up filters
    –   Find Friends and Replicate from them
    –   Eventual Consistency and Conflicts
●   Additional Features and the Couch Ecosystem
Replication
●   Replicate Things and their status from friends
●   Don't replicate things from friends of friends
    –   we don't want to borrow anything from them
Replication
●   Pull replication
    –   Pull documents from our friends, and store them locally
●   There's also Push replication, but we won't use it


●   Goes through the source's _changes feed
●   Compares with local documents, updates or creates conflicts
Set up a Filter
●   A Filter is a JavaScript function that takes
    –   a document
    –   a request object
●   and returns
    –   true, if the document passes the filter
    –   false otherwise
●   A filter is evaluated at the source
Replication Filter
# _design/things/filters/from_friend.js


/* doc is the document, 
   req is the request that uses the filter */
function(doc, req)
{
    /* Allow only if entry is owned by the friend */
    return (doc.owner == req.query.friend);
}
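The effect of such a filter can be simulated in Python. The function below is only a stand-in for the JavaScript filter (real filters live in a design document and run at the source), and the sample documents and names are invented for this sketch.

```python
def from_friend(doc, query):
    """Pass only documents owned by the friend named in the query."""
    return doc.get("owner") == query.get("friend")


source_changes = [
    {"_id": "t1", "owner": "marek", "name": "robot"},
    {"_id": "t2", "owner": "anna", "name": "bike"},     # friend of a friend
    {"_id": "t3", "owner": "marek", "name": "camera"},
]

# The source applies the filter, so only matching documents are sent over.
replicated = [doc for doc in source_changes
              if from_friend(doc, {"friend": "marek"})]
replicated_ids = [doc["_id"] for doc in replicated]
print(replicated_ids)
```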
Replication
●   Sync design docs to your own database!


●   Find friends to borrow from
    –   Post your nickname and Database URL to
        http://piratepad.net/pycouchpl
    –   Pick at least two friends
Replication
●   _replicator database
●   Objects describe Replication tasks
    –   Source
    –   Target
    –   Continuous
    –   Filter
    –   etc
●   http://wiki.apache.org/couchdb/Replication
Replication
# ldb-replicate-friend.py
from restkit import BasicAuth
from couchdbkit import Database

auth_filter = BasicAuth(username, password)
db = Database(db_url, filters=[auth_filter])
replicator_db = db.server['_replicator']

# the replication task is itself a document in the _replicator database
replication_doc = {
    "source": friend_db_url,
    "target": db_url,
    "continuous": True,
    "filter": "things/from_friend",
    "query_params": {"friend": friend_name},
}
replicator_db[username + "-" + friend_name] = replication_doc
Replication
●   Documents should be propagated into own database
●   Views should contain both own and friends' things
Dealing with Conflicts
●   Conflicts are introduced by
    –   Replication
    –   "forcing" a document update
●   _rev calculated based on
    –   Previous _rev
    –   document content
●   Conflict when two documents have
    –   The same _id
    –   Distinct _rev
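Why independent edits end up with distinct revisions can be illustrated with a toy revision function. The helper next_rev below is an assumption made for this sketch: it hashes the previous revision together with the content, which captures the idea above but is not CouchDB's actual algorithm.

```python
import hashlib
import json


def next_rev(previous_rev, body):
    """Derive an illustrative revision ID from the previous one and the content."""
    generation = int(previous_rev.split("-")[0]) + 1 if previous_rev else 1
    payload = (previous_rev or "") + json.dumps(body, sort_keys=True)
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    return "%d-%s" % (generation, digest)


base = next_rev(None, {"name": "robot"})
# Two nodes edit the same revision independently and differently ...
rev_here = next_rev(base, {"name": "robot", "color": "red"})
rev_there = next_rev(base, {"name": "robot", "color": "blue"})
# ... while identical edits on both nodes yield the same revision.
rev_same = next_rev(base, {"name": "robot", "color": "red"})
print(rev_here != rev_there, rev_here == rev_same)
```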
Dealing with Conflicts
●   Select a Winner
●   Database can't do this for you
●   Automatic strategy selects a (temporary) winner
    –   Deterministic: always the same winner on each node
    –   leaves them in conflict state
●   View that contains all conflicts
●   Resolve conflict programmatically
●   http://guide.couchdb.org/draft/conflicts.html
●   http://wiki.apache.org/couchdb/Replication_and_conflicts
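A deterministic winner means every node picks the same revision from the same set of candidates, regardless of the order in which they arrived. A simplified sketch, assuming the rule is "highest generation first, then highest hash" and ignoring deleted revisions; pick_winner and the shortened revision strings are invented for illustration.

```python
def pick_winner(conflicting_revs):
    """Pick the same winner on every node: highest generation, then highest hash."""
    def sort_key(rev):
        generation, digest = rev.split("-", 1)
        return (int(generation), digest)
    return max(conflicting_revs, key=sort_key)


revs_on_node_a = ["2-b91bb8", "2-7051cb", "1-967a00"]
revs_on_node_b = ["1-967a00", "2-7051cb", "2-b91bb8"]   # same set, other order

winner_a = pick_winner(revs_on_node_a)
winner_b = pick_winner(revs_on_node_b)
print(winner_a, winner_b)
```

The losing revision is not discarded; it stays available so that the application can merge or drop it when it resolves the conflict.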
Contents
●   Intro
●   DB Initialization
●   Key-Value Store
●   Simple MapReduce Queries
●   The _changes Feed
●   Advanced MapReduce Queries
●   Replication
●   Additional Features and the Couch Ecosystem
    –   Scaling and related Projects
    –   Fulltext Search
    –   Further Reading
Scaling Up / Out
●   BigCouch
    –   Cluster of CouchDB nodes that appears as a single server
    –   http://bigcouch.cloudant.com/
    –   will be merged into CouchDB soon
●   refuge
    –   Fully decentralized data platform based on CouchDB
    –   Includes fork of GeoCouch for spatial indexing
    –   http://refuge.io/
Scaling Down
●   CouchDB-compatible Databases on a smaller scale
●   PouchDB
    –   JavaScript library http://pouchdb.com/
●   TouchDB
         ●   iOS: https://github.com/couchbaselabs/TouchDB-iOS
         ●   Android: https://github.com/couchbaselabs/TouchDB-Android
Fulltext and Relational Search
●   http://wiki.apache.org/couchdb/Full_text_search
●   CouchDB Lucene
    –   http://www.slideshare.net/martin.rehfeld/couchdblucene
    –   https://github.com/rnewson/couchdb-lucene
●   Elastic Search
    –   http://www.elasticsearch.org/
Operations Considerations
●   Append Only Storage
●   Your backup tools: cp, rsync
●   Regular Compaction needed
Further Features
●   Update Handlers: JavaScript code that carries out update in
    the database server
●   External Processes: use CouchDB as a proxy to other
    processes (eg search engines)
●   Attachments: attach binary files to documents
●   Update Validation: JavaScript code to validate doc updates
●   CouchApps: Web-Apps served directly by CouchDB
●   Bulk APIs: Several Updates in one Request
●   List and Show Functions: Transforming responses before
    serving them
Summing Up
●   Apache CouchDB™ is a database that uses JSON for
    documents, JavaScript for MapReduce queries, and regular
    HTTP for an API
●   couchdbkit is a Python library providing access to Apache
    CouchDB
Thanks!
              Time for Questions and Discussion



Stefan Kögl
stefan@skoegl.net
@skoegl
                                                  Downloads
   https://slideshare.net/skoegl/couch-db-pythonpyconpl2012
    https://github.com/stefankoegl/python-couchdb-examples
