But there is a problem with this for the documents.db database: the callback is handed the entire primary table row. As I said in one of the previous posts, we don't want to read the MBs of a document unless we have to, so we cannot store the actual document data in the primary table's row.
There is a fairly easy solution to this problem: separate the primary document meta-metadata (etag, last modification, key, ID, document size, metadata size) from the document data. That means adding another database section for the document data (and we'll separate out the metadata json object while we are at it). This leaves us with a layout looking like:
documents.db

VERSION=3
format=bytevalue
database=data
type=btree
db_pagesize=512
HEADER=END
DATA=END
VERSION=3
format=bytevalue
database=indexByKey
type=btree
db_pagesize=512
HEADER=END
DATA=END
VERSION=3
format=bytevalue
database=indexByEtag
type=btree
db_pagesize=512
HEADER=END
DATA=END

documents_data.db

VERSION=3
format=bytevalue
database=documents
type=btree
db_pagesize=8192
HEADER=END
DATA=END
VERSION=3
format=bytevalue
database=metadatas
type=btree
db_pagesize=8192
HEADER=END
DATA=END
We have the main rows, which hold the meta-metadata, then the two indexes (key->id and etag->id), then the actual documents table, which maps id to document json, and finally the id to metadata json table.
This solves the problem since we will only associate indexByKey and indexByEtag with the primary, and the rows in data (the primary) will be very small.
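To make the layout concrete, here is a minimal in-memory sketch in Python, with plain dicts standing in for the five BDB btrees. The `PrimaryRow` fields, the `DocumentStore` class and its method names are all assumptions made up for illustration, not the real code; the only point is that the index extractors see nothing but the small primary row, and the big json is read only when someone asks for it.

```python
from collections import namedtuple

# Hypothetical primary row: only the small meta-metadata, never the json itself.
PrimaryRow = namedtuple(
    "PrimaryRow", "doc_id key etag last_modified doc_size meta_size"
)


def key_of(row):
    """Extractor for indexByKey: it is handed only the small primary row."""
    return row.key


def etag_of(row):
    """Extractor for indexByEtag: it is handed only the small primary row."""
    return row.etag


class DocumentStore:
    def __init__(self):
        self.data = {}           # id -> PrimaryRow (meta-metadata)
        self.index_by_key = {}   # key -> id
        self.index_by_etag = {}  # etag -> id
        self.documents = {}      # id -> document json
        self.metadatas = {}      # id -> metadata json
        self.document_reads = 0  # counts reads of the large payload

    def put(self, doc_id, key, etag, last_modified, document, metadata):
        old = self.data.get(doc_id)
        if old is not None:
            # Drop stale index entries before indexing the new row.
            self.index_by_key.pop(key_of(old), None)
            self.index_by_etag.pop(etag_of(old), None)
        row = PrimaryRow(
            doc_id, key, etag, last_modified, len(document), len(metadata)
        )
        self.data[doc_id] = row
        self.index_by_key[key_of(row)] = doc_id
        self.index_by_etag[etag_of(row)] = doc_id
        self.documents[doc_id] = document
        self.metadatas[doc_id] = metadata

    def info_by_key(self, key):
        # Touches only the index and the primary row, not the document json.
        return self.data[self.index_by_key[key]]

    def document_by_key(self, key):
        # The only path that reads the large payload.
        self.document_reads += 1
        return self.documents[self.index_by_key[key]]
```

Looking up `info_by_key("users/1")` answers questions like "what is the etag?" or "how big is the document?" without ever touching the documents table.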
It also helps fix another problem that will crop up once we get to performance testing/verification. BDB locks at page granularity, so it's important to tune the page sizes so that we don't lock too many documents at once. By separating out the document json and metadata json from the primary key data, we can keep the page size small for the index sections while using larger page sizes for the json object dumps.
I am hoping that this will also simplify some of the Stream handling, since we don't have to worry about the header parts when streaming the metadata json and document json.
It also gives us some future-proofing. If we ever need to change the structure of documents.db (by adding a "number of times accessed" field or something like that), then we only have to modify the primary row. If the document json data lived in the same data section, we would have to move a bunch of data around in order to insert the new field into the header.
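As a rough back-of-the-envelope check on the locking argument, the sketch below estimates how many rows sit behind a single page lock. It deliberately ignores BDB's page header and per-item overhead, and the 64-byte row size is a made-up figure, so treat the numbers as an upper bound rather than a measurement.

```python
def rows_per_page(page_size, row_size):
    """Upper bound on rows sharing one page (and therefore one page lock).

    Ignores page headers and per-item overhead, so the real number is lower.
    """
    if row_size <= 0:
        raise ValueError("row_size must be positive")
    return page_size // row_size


# Hypothetical 64-byte primary row on the two page sizes used above.
small_pages = rows_per_page(512, 64)    # 8 rows behind one lock
large_pages = rows_per_page(8192, 64)   # 128 rows behind one lock
```

With these assumed sizes, putting the small rows on 8192-byte pages would lock up to 128 documents at a time instead of 8, which is why the primary and index sections get the small page size.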
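Here is one way the primary row could be versioned so that a new field only touches the header, sketched with Python's `struct` module. The field widths, the 16-byte etag, the leading version byte and the `access_count` field are all assumptions for illustration; the actual row format is not decided in this post.

```python
import struct

# Hypothetical layouts (little-endian, no padding):
# version, id, etag (16 bytes), last modified, document size,
# metadata size, key length. The key bytes follow the header.
V1 = struct.Struct("<BQ16sqIIH")
V2 = struct.Struct("<BQ16sqIIHQ")  # v1 plus a 64-bit access counter


def pack_row(doc_id, etag, last_modified, doc_size, meta_size, key,
             access_count=0):
    """Always writes the newest (v2) layout."""
    key_bytes = key.encode("utf-8")
    header = V2.pack(2, doc_id, etag, last_modified, doc_size, meta_size,
                     len(key_bytes), access_count)
    return header + key_bytes


def unpack_row(raw):
    """Reads both v1 and v2 rows; v1 rows get access_count = 0."""
    version = raw[0]
    if version == 1:
        (_, doc_id, etag, last_modified, doc_size, meta_size,
         key_len) = V1.unpack_from(raw)
        access_count = 0
        offset = V1.size
    elif version == 2:
        (_, doc_id, etag, last_modified, doc_size, meta_size,
         key_len, access_count) = V2.unpack_from(raw)
        offset = V2.size
    else:
        raise ValueError("unknown row version: %d" % version)
    key = raw[offset:offset + key_len].decode("utf-8")
    return {
        "doc_id": doc_id, "etag": etag, "last_modified": last_modified,
        "doc_size": doc_size, "meta_size": meta_size, "key": key,
        "access_count": access_count,
    }
```

Upgrading a row is then just `unpack_row` followed by `pack_row` on a few dozen bytes; the document json and metadata json stay where they are.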
Ok, now I just have to do it...