Saturday, August 18, 2012

Another document stream problem

I mentioned that secondary indexes are associated with the primary table.  Whenever the primary table's data changes, BDB calls the associate callback to update the secondary index from the new primary row.  This is great, and it makes life much easier when using cursors, puts, and gets from BDB.
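
For anyone who hasn't seen it, the mechanism looks roughly like this in C. The row layout here (key in the first 16 bytes) is made up just to show the shape of the thing, not our actual schema:

#include <string.h>
#include <db.h>

/* BDB calls this on every put/delete against the primary so it can keep
 * the secondary index in sync.  The layout is hypothetical. */
static int
extract_key(DB *secondary, const DBT *pkey, const DBT *pdata, DBT *skey)
{
    memset(skey, 0, sizeof(*skey));
    skey->data = pdata->data;   /* key lives at the start of the row */
    skey->size = 16;
    return (0);
}

/* After opening both handles, one call wires them together: */
static int
wire_up(DB *primary, DB *indexByKey)
{
    return (primary->associate(primary, NULL, indexByKey, extract_key, 0));
}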

But there is a problem with this for the documents.db database: the callback reads the entire primary table row for you.  And like I said in one of the previous posts, we don't want to read the MBs of a document unless we have to.  So we cannot store the actual document data in the primary table's row.

There is a fairly easy solution to this problem.  We separate the primary document meta-metadata (etag, last modification, key, ID, document size, metadata size) from the document data itself.  That means adding more database sections, with the bulky document data split out into a second file, documents_data.db (and we'll separate out the metadata json object while we are at it).  This leaves us with the two files looking like:

documents.db

VERSION=3
format=bytevalue
database=data
type=btree
db_pagesize=512
HEADER=END
DATA=END

VERSION=3
format=bytevalue
database=indexByKey
type=btree
db_pagesize=512
HEADER=END
DATA=END

VERSION=3
format=bytevalue
database=indexByEtag
type=btree
db_pagesize=512
HEADER=END
DATA=END

documents_data.db

VERSION=3
format=bytevalue
database=documents
type=btree
db_pagesize=8192
HEADER=END
DATA=END

VERSION=3
format=bytevalue
database=metadatas
type=btree
db_pagesize=8192
HEADER=END
DATA=END

We have the main data rows, which are the meta-metadata; then the two indexes, key->id and etag->id; then the actual documents table, which maps id to document json; and finally the id-to-metadata-json table.
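
As a rough picture, the meta-metadata row could be as small as something like this (field names and widths are my guesses, not a real layout):

#include <stdint.h>

struct doc_meta {
    uint64_t id;             /* internal document id (index target)  */
    char     key[64];        /* document key (indexByKey source)     */
    char     etag[40];       /* etag (indexByEtag source)            */
    uint64_t last_modified;  /* last modification time               */
    uint32_t doc_size;       /* size of the json blob in documents   */
    uint32_t meta_size;      /* size of the json blob in metadatas   */
};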

This solves the problem, since we only associate indexByKey and indexByEtag with the primary data section, and its rows will be very small.
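
With rows that small, the two extractor callbacks become trivial pointer math into the struct sketched above (again hypothetical, and assuming <string.h> and <db.h> as before):

static int
by_key(DB *sec, const DBT *pkey, const DBT *pdata, DBT *skey)
{
    const struct doc_meta *m = pdata->data;

    memset(skey, 0, sizeof(*skey));
    skey->data = (void *)m->key;
    skey->size = sizeof(m->key);
    return (0);
}

static int
by_etag(DB *sec, const DBT *pkey, const DBT *pdata, DBT *skey)
{
    const struct doc_meta *m = pdata->data;

    memset(skey, 0, sizeof(*skey));
    skey->data = (void *)m->etag;
    skey->size = sizeof(m->etag);
    return (0);
}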

It also helps fix another problem that will creep up once we get to performance testing/verification.  BDB locks access at page granularity, so it's important to tune the page sizes so that we don't lock too many documents at once.  By separating the document json and metadata json from the primary key data, we can use a small page size for the index sections while keeping larger page sizes for the json object dumps.
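
In BDB the page size is set per database handle before it is opened, so the split makes this easy.  A sketch, assuming an already-opened DB_ENV *env and the sizes from the dumps above:

#include <db.h>

static int
open_stores(DB_ENV *env, DB **datap, DB **docsp)
{
    int ret;

    if ((ret = db_create(datap, env, 0)) != 0)
        return (ret);
    (*datap)->set_pagesize(*datap, 512);    /* tiny meta-metadata rows */
    if ((ret = (*datap)->open(*datap, NULL, "documents.db", "data",
        DB_BTREE, DB_CREATE, 0)) != 0)
        return (ret);

    if ((ret = db_create(docsp, env, 0)) != 0)
        return (ret);
    (*docsp)->set_pagesize(*docsp, 8192);   /* big json dumps */
    if ((ret = (*docsp)->open(*docsp, NULL, "documents_data.db",
        "documents", DB_BTREE, DB_CREATE, 0)) != 0)
        return (ret);

    return (0);
}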

I am hoping that this will also simplify some of the Stream handling, since we won't have to worry about the header parts when streaming the metadata json and document json.

And, for a bit of future-future protection: if we ever need to change the structure of documents.db (by adding a "number of times accessed" field or something like that), then we only have to modify the small primary row.  If the document json data were present in the same data section, we would have to do a bunch of shuffling data around in order to insert the new field into the header.

Ok, now I just have to do it...
