Saturday, August 25, 2012

First linux test

First successful run of RavenDB on Linux today.  Compiled on Windows, run on Linux.  Man, C# is pretty neat.


Friday, August 24, 2012

BDB queues

BDB queues are said to support high concurrency since they lock records, not pages; if you only have 64-byte (or so) records and your page size is 512, then locking a page could lock a large number of records.  But I ran into a problem with them today...

We have multiple writers and multiple readers for the index stats table.  Writers update the stats from multiple indexing threads, while consumers check the status of the index using IsIndexStale.  We use snapshot isolation to prevent read locks during the IsIndexStale call, but here's the rub.

BDB queues don't support snapshot isolation.

So that's that.

Monday, August 20, 2012

Mapped results "table"

Most of the tables in RavenDB are very simple "storage dumps".  There is an ID, some row properties, and a data section.  Then there is typically at most one secondary key and one sorting key.  Take the documents table: the primary key is the internal document ID, the secondary index is the document key, and the sorting key is the etag.  These tables turn out to be easy to implement in BDB since it has built-in support for secondary key lookups.

The mapped results table is a little different since we look items up by more complicated keys and sort orders.  I will attempt to describe how this is done in BDB as clearly as I can, since I had quite a few misses before achieving the results I wanted.

The mapped results table

The mapped results table holds raven's mapped results (obviously).  I don't fully understand what all of the columns are used for (nor do I have to), since I'm not that familiar with the specifics of how raven pulls off map-reduce.  But here are the columns for the table:
  • primary key ID
  • view name
  • document key
  • reduce key
  • reduce key/view hash
  • document etag
  • timestamp
  • document data
Now, there are a few ways that we access the data in the table (writing all of these out will help us model our secondary indexes):
  1. get the most recent etag for a view
  2. get the document data for a view after a specified etag
  3. delete documents given a view
  4. get the document data given a view and a reduce key
  5. delete documents given a view and a document key
Once we make a slight change from the way the esent storage engine operates, most of these are easy to implement and involve a single secondary index combined with the primary row data.  The change we are going to make is to make the etag the primary key and remove the auto-incrementing ID.
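As a rough sketch of what that change means in code (mappedResultsTable and SerializeMappedResultRow are hypothetical stand-ins for the real table handle and row layout; the DbEntry/Put calls mirror the documents table code elsewhere in these posts):

//sketch of the change: the 16-byte etag becomes the primary key of the mapped results section;
//mappedResultsTable and SerializeMappedResultRow are hypothetical stand-ins, not the real code
Guid etag = uuidGenerator.CreateSequentialUuid();
var dkey = DbEntry.InOut(etag.ToByteArray());
var dvalue = DbEntry.InOut(SerializeMappedResultRow(view, documentKey, reduceKey, hash, timestamp, documentData));
mappedResultsTable.Put(transaction, ref dkey, ref dvalue);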

We can then add a secondary index for the view.  This will map every document etag to the view it's part of.  We sort the views in ascending order and the etags within this secondary index in reverse order.  We tell BDB that we want duplicates in the index and that we want them sorted:

this.indexByViewFile = env.CreateDatabase(DbCreateFlags.None);
indexByViewFile.PageSize = 512;                          //small pages: the index records are tiny
indexByViewFile.SetFlags(DbFlags.Dup | DbFlags.DupSort); //allow sorted duplicates per view
indexByViewFile.DupCompareFast = GuidCompareDesc;        //duplicates (etags) sorted newest first
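For what it's worth, the descending comparison itself is trivial; the exact delegate signature DupCompareFast expects belongs to the wrapper, so the byte-array form below is an assumption on my part, only the reversed comparison is the point.

//minimal sketch of a newest-first comparer over two 16-byte etag buffers; the real
//DupCompareFast delegate may hand us DbEntry structs rather than raw byte arrays
private static int GuidCompareDesc(byte[] left, byte[] right)
{
 for (int i = 0; i < 16; i++)
 {
  if (left[i] != right[i])
   return right[i].CompareTo(left[i]); //reversed so that larger etags sort first
 }
 return 0;
}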

Now we use this index to solve:
  1. open a cursor on indexByView, jump to the given view, then get the first primary key for the index.  Since the duplicates are stored in reverse order it will always be the most recent etag
  2. open a cursor on indexByView, jump to the given view, now continue to pull the duplicate secondary index records until we hit the stop etag.  Again since the duplicates are stored in reverse order, all of the keys we pull from the index until we hit the stop etag will be more recent than that stop etag.
  3. open a cursor on indexByView, jump to the given view, look at all duplicates and delete them
The next secondary index we need is based on the reduce key/view hash (a sketch of a possible hash helper follows below).  This secondary index will solve:
  1. open a cursor on indexByHash, jump to the given hash, then enumerate all of the duplicate keys.  This gives us all of the etags that match the reduce key and view.  And just like the esent engine, we also verify that the reduce key and view actually match, in case of hash clashes.
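For illustration only, the hash could be derived along these lines; the choice of MD5 and the separator are purely hypothetical (raven already computes a reduce key/view hash, and whatever it produces is what actually gets stored):

//hypothetical helper: derive a fixed-size hash from view + reduce key; on lookup we still
//compare the stored view and reduce key strings to guard against hash clashes
private static byte[] ComputeReduceKeyViewHash(string view, string reduceKey)
{
 using (var md5 = System.Security.Cryptography.MD5.Create())
 {
  return md5.ComputeHash(Encoding.Unicode.GetBytes(view + "\0" + reduceKey));
 }
}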
The final access pattern is slightly more complicated since we are trying to match two separate fields.  Of course BDB supports this; it's just a little more work.  We want to match by view and document key, and we already have the secondary index on view, so we need to add another secondary index on document key.  Then we use something called a join cursor in BDB.  This allows us to open two cursors and join them together to access the primary data wherever both cursors match.
  1. open a cursor on indexByView and jump to the given view.  Then open a cursor on indexByKey and jump to the given document key.  Now, with the primary row table, join these two cursors into another cursor.  Iterating over this cursor gives us the etags that match both the view and the document key.

using(var cursorView = indexByView.OpenCursor(txn))
using(var cursorKey = indexByKey.OpenCursor(txn))
{
  string currentView, currentKey;

  if(cursorView.JumpTo(view, out currentView) == ReadStatus.NotFound)
   return reduceKeys;
  if (currentView != view)
   return reduceKeys;

  if (cursorKey.JumpTo(documentId, out currentKey) == ReadStatus.NotFound)
   return reduceKeys;
  if (currentKey != documentId)
   return reduceKeys;

  using(var cursor = primaryTable.Join(new DbFileCursor[] { cursorView, cursorKey }, Db.JoinMode.None, 0))
  {
   var vKey = DbEntry.Out(new byte[16]);
   var vData = DbEntry.Out(new byte[128]);

   while (true)
   {
    var status = cursor.Get(ref vKey, ref vData, DbJoinCursor.GetMode.None, DbJoinCursor.ReadFlags.None);
    if (status == ReadStatus.NotFound)
     break;

     //the variable-length field sizes live in the row's fixed header; the offsets below are that layout's
     int viewLength = BitConverter.ToInt32(vData.Buffer, 40);
     int keyLength = BitConverter.ToInt32(vData.Buffer, 44);
     int reduceKeyLength = BitConverter.ToInt32(vData.Buffer, 48);
     string reduceKey = Encoding.Unicode.GetString(vData.Buffer, 56 + viewLength + keyLength, reduceKeyLength);

    reduceKeys.Add(reduceKey);
    deletes.Add(new Guid(vKey.Buffer));
   }
  }
}

Sunday, August 19, 2012

Making progress

Just an update on the progress ("are working" == the unit tests I can find for the operations are successful).
  • Documents are working
  • Transactions are working
  • Queues are working
  • Mapped indexes are working
The only things that seem to remain as far as full functionality are the map->reduce working set tables, the tasks tables and the attachment tables.

There's not much else interesting to report.  All of the extra operations involved are basically just copies of the existing access methods.

Saturday, August 18, 2012

Another document stream problem

I mentioned that secondary indexes are associated with the primary table.  Whenever the primary table data changes, BDB calls the associate callback to update the secondary index based on the new primary row.  This is great, and it makes life much easier when using cursors, puts and gets from BDB.

But there is a problem with this with respect to the documents.db database: the callback reads the entire primary table row for you.   But like I said in one of the previous posts, we don't want to read the MBs of a document unless we have to.  So we cannot store the actual document data in the primary table's row.

There is a fairly easy solution to this problem.  We separate the primary document meta-metadata (etag, last modification time, key, ID, document size, metadata size) from the document data.  That means moving the json dumps out into their own sections in a separate documents_data.db file (and we'll separate out the metadata json object while we are at it).  This leaves us with files looking like:

documents.db

VERSION=3
format=bytevalue
database=data
type=btree
db_pagesize=512
HEADER=END
DATA=END

VERSION=3
format=bytevalue
database=indexByKey
type=btree
db_pagesize=512
HEADER=END
DATA=END

VERSION=3
format=bytevalue
database=indexByEtag
type=btree
db_pagesize=512
HEADER=END
DATA=END

documents_data.db

VERSION=3
format=bytevalue
database=documents
type=btree
db_pagesize=8192
HEADER=END
DATA=END

VERSION=3
format=bytevalue
database=metadatas
type=btree
db_pagesize=8192
HEADER=END
DATA=END

We have the main rows, which are the meta-metadata; then the two indexes, key->id and etag->id; then the actual documents table, which maps id to document json; and finally the id to metadata json table.

This solves the problem since we will only associate indexByKey to primary and indexByEtag to primary, and the data rows will be very small.

It also helps fix another problem that will creep up once we get to performance testing/verification.  BDB locks all access based on page boundaries, so it's important to tune the page sizes so that we don't lock too many documents at once.  By separating the document json and metadata json from the primary key data, we can use a small page size for the index sections while keeping larger page sizes for the json object dumps.
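In code, that just means picking the page size per section before opening it.  A sketch, reusing the PageSize and Open calls shown elsewhere in these posts (the field names for the new documents_data.db section are mine):

//small pages for the index-heavy documents.db sections...
this.dataTableFile = env.CreateDatabase(DbCreateFlags.None);
dataTableFile.PageSize = 512;
this.dataTable = (DbBTree)dataTableFile.Open(null, "documents.db", "data", DbType.BTree,
  Db.OpenFlags.Create | Db.OpenFlags.ThreadSafe | Db.OpenFlags.AutoCommit, 0);

//...and 8k pages for the json dumps in documents_data.db
this.documentsFile = env.CreateDatabase(DbCreateFlags.None);
documentsFile.PageSize = 8192;
this.documents = (DbBTree)documentsFile.Open(null, "documents_data.db", "documents", DbType.BTree,
  Db.OpenFlags.Create | Db.OpenFlags.ThreadSafe | Db.OpenFlags.AutoCommit, 0);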

I am hoping that this will also simplify some of the Stream handling since we don't have to worry about the header parts when streaming metadata json and document jsons.

And, as future future-proofing: if we ever need to change the structure of documents.db (by adding a "number of times accessed" field or something like that) then we only have to modify the primary row.  If the document json data lived in the same data section, we would have to do a bunch of shuffling of data around in order to insert the new field into the header.

Ok, now I just have to do it...

Streaming the data out

Oren, the main ravendb guy, mentioned on the mailing list that we should be streaming the data to the database: "We have users that uses documents that are several MB in size, and we have to stream it to disk, instead of copying the data several times."

So the simple document adder function I mentioned a couple of posts ago won't cut it, since using MemoryStream.ToArray is a no-no for large documents.  BDB already supports using a byte buffer you provide, and it also supports partial writes.  Those are the two things required for us to support the Stream interface to the database.

Partial writes

Starting from square one with no documents in the database, we now need to be able to support that same simple PUT command without ever calling MemoryStream.ToArray on the document.  The main IDocumentStorageActions.AddDocument looks basically the same except for the streaming interface (I know you can see a MemoryStream.ToArray in there, which I just said was a no-no, but I'm hoping I'll be allowed to get away with it since the metadata shouldn't ever be very big).

Guid newEtag = uuidGenerator.CreateSequentialUuid();

using(var ms = new MemoryStream()) { metadata.WriteTo(ms); metadataBuffer = ms.ToArray(); }
using(Stream stream = new BufferedStream(database.DocumentTable.AddDocument(transaction, key, newEtag, System.DateTime.Now, metadataBuffer)))
{
 using (var finalStream = documentCodecs.Aggregate(stream, (current, codec) => codec.Encode(key, data, metadata, current)))
 {
  data.WriteTo(finalStream);
  stream.Flush();
 }
}

return newEtag;

The changes come from the DocumentTable's AddDocument function.  This function now needs to return a Stream object that can be used to partial write the document to the database.

public Stream AddDocument(Txn transaction, string key, Guid etag, DateTime dateTime, byte[] metadata)
{
 long docId, existingId;
 Guid existingEtag;

 //update or insert?
 if(!GetEtagByKey(transaction, key, out existingId, out existingEtag))
 {
  long lastId = 0;
  var vlastId = DbEntry.Out(new byte[8]);
  var vlastData = DbEntry.EmptyOut();
   using (var cursor = dataTable.OpenCursor(transaction, DbFileCursor.CreateFlags.None))
  {
   if (cursor.Get(ref vlastId, ref vlastData, DbFileCursor.GetMode.Last, DbFileCursor.ReadFlags.None) != ReadStatus.NotFound)
    lastId = BitConverter.ToInt64(vlastId.Buffer, 0);
  }
   docId = lastId + 1;
 }
 else
 {
  docId = existingId;
 }

 return new DocumentWriteStream(dataTable, transaction, docId, key,
   etag, dateTime, metadata);
}

Again, up to this point the function is basically the same; the new stuff comes in the DocumentWriteStream object, which implements the abstract Stream interface.

private class DocumentWriteStream : Stream
{
 private readonly DbBTree data;
 private readonly Txn transaction;
 private readonly int headerSize;
 private DbEntry idBuffer;
 private long position;

 public DocumentWriteStream(DbBTree data, Txn transaction, long docId, string key, Guid etag, DateTime dateTime, byte[] metadata)
 {
  this.data = data;
  this.transaction = transaction;
  this.idBuffer = DbEntry.InOut(BitConverter.GetBytes(docId));
  this.position = 0;

  //put out the header parts we do know (we need to do this first since every put will call the
  // secondary associate functions, so we need the associate keys available in the header)
  var header = new byte[documentBaseLength + key.Length * 2 + metadata.Length];
  Buffer.BlockCopy(etag.ToByteArray(), 0, header, 0, 16);
  Buffer.BlockCopy(BitConverter.GetBytes(dateTime.ToFileTime()), 0, header, 16, 8);
  Buffer.BlockCopy(BitConverter.GetBytes(key.Length * 2), 0, header, 24, 4);
  Buffer.BlockCopy(BitConverter.GetBytes(metadata.Length), 0, header, 28, 4);
  Buffer.BlockCopy(BitConverter.GetBytes(0), 0, header, 32, 4); //we don't know the document length yet
  Buffer.BlockCopy(Encoding.Unicode.GetBytes(key), 0, header, 36, key.Length * 2);
  Buffer.BlockCopy(metadata, 0, header, 36 + key.Length * 2, metadata.Length);
  this.headerSize = header.Length;

  var dvalue = DbEntry.InOut(header, 0, header.Length);
  data.Put(transaction, ref idBuffer, ref dvalue);
 }

We start the document streamer by writing out the same header information we were writing before: the etag, modification time, sizes, key and metadata.  The difference is that we don't know what to put for the document size yet, so we just fill that in when the stream is flushed.

The important thing to note is that we have to set up the secondary key values in the header since when we issue a BDB put, the secondary index lookup function is going to be called in order to get the secondary index key.  We need this in place before we start streaming the data.

public override bool CanRead { get { return false; } }
public override bool CanSeek { get { return false; } } 
public override bool CanWrite { get { return true; } } 

public override long Length { get { throw new NotImplementedException(); } }
public override long Position { get { return position; } set { position = value; } }
public override int Read(byte[] buffer, int offset, int count) { throw new NotImplementedException(); }
public override long Seek(long offset, SeekOrigin origin) { throw new NotImplementedException(); }
public override void SetLength(long value) { throw new NotImplementedException(); }

Now for the boilerplate write-only, non-seeking stream members.  And finally the actual methods we care about.

public override void Flush()
{
 var dvalue = DbEntry.InOut(BitConverter.GetBytes((int)position), 0, 4, 4, 32);
 data.Put(transaction, ref idBuffer, ref dvalue);
}

public override void Write(byte[] buffer, int offset, int count)
{
 var dvalue = DbEntry.InOut(buffer, offset, count, count, headerSize + (int)position);
 data.Put(transaction, ref idBuffer, ref dvalue);
 position += count;
}

The Write command simply takes the buffer it's given and uses the current stream position to tell BDB where to add the data (BDB handles extending the size of the record as we do the partial writes).  And finally, on Flush, we know the size of the document, so we can re-issue the put command to write the document size into the correct header spot.

So, all-in-all doing the streaming writes is not that complicated.  It turns out that doing streaming reads is a little harder since we have to worry about opening, managing and disposing of cursors around the streaming read accesses, but we'll leave that for next time.

Friday, August 17, 2012

Easy index searches

My next goal was to get Raven Studio up and running so that I could see the documents without relying on db_dump and curl.  The main URLs that rstudio accesses are /docs and /stats.  /stats is pretty easy to mock with null data, and /docs is the useful one for first looking at documents.

My first index

/docs ends up accessing IDocumentStorageActions.GetDocumentsByReverseUpdateOrder to get the most recently changed documents.  "Most recently changed" is determined by the etag of the document, so we need another index in our documents.db file that pulls the etag out as the secondary index value.


this.indexByEtagFile = env.CreateDatabase(DbCreateFlags.None);
this.indexByEtag = (DbBTree)indexByEtagFile.Open(null, "documents.db", "indexByEtag", DbType.BTree, 
  Db.OpenFlags.Create | Db.OpenFlags.ThreadSafe | Db.OpenFlags.AutoCommit, 0);
dataTable.Associate(indexByEtag, GetDocumentEtagForIndexByEtag,
  DbFile.AssociateFlags.Create);

unsafe private DbFile.KeyGenStatus GetDocumentEtagForIndexByEtag(DbFile secondary, ref DbEntry key, 
  ref DbEntry data, out DbEntry result)
{
 //extract the key for the secondary index of the document table
 var header = new DocumentHeader();
 Marshal.Copy(data.Buffer, 0, new IntPtr(&header), documentBaseLength);

 //the etag is the first 16 bytes of the fixed header
 var etagBuffer = new byte[16];
 Buffer.BlockCopy(data.Buffer, 0, etagBuffer, 0, etagBuffer.Length);

 result = DbEntry.InOut(etagBuffer);
 return DbFile.KeyGenStatus.Success;
}

This allows a sorted view of the etag values for all documents. Now to get the most recent list of changed documents we just use a BDB cursor to start pulling records from the end of the index.

Cursors

Cursors are how you move through a btree in BDB.  We can start a cursor at the beginning of an index or at the end, then move forwards or backwards; everything else is up to us.  So let's use the new etag index to get the documents with skipping and paging.

unsafe public IEnumerable<JsonDocument> GetDocumentsByReverseUpdateOrder(Txn transaction, int start, int take,
  Func<string, Guid, DateTime, byte[], byte[], JsonDocument> formDocument)
{
 var ret = new List<JsonDocument>();

 using(var cursor = indexByEtag.OpenCursor(transaction, DbFileCursor.CreateFlags.None))
 {
  for (int i = 0; i < start; i++)
  {
   if(cursor.PrevSkip() == ReadStatus.NotFound)
    return ret;
  }

  foreach(var docData in cursor.ItemsBackward(true, DbFileCursor.ReadFlags.None).Take(take))
   ret.Add(ExtractDocument(docData.Data.Buffer, formDocument));
 }

 return ret;
}

We open the cursor, then for the skips we just keep issuing previous calls without extracting any data.  Then we keep moving backwards, picking up the data until we have a complete page of results.  The C#/BDB interface makes it easy by giving us an iterator over the retrieved items.  Once we have a complete record, we can extract the data from it and push it to the action for generating a json document.

unsafe private JsonDocument ExtractDocument(byte[] rawData, 
  Func<string, Guid, DateTime, byte[], byte[], JsonDocument> formDocument)
{
 var header = new DocumentHeader();
 Marshal.Copy(rawData, 0, new IntPtr(&header), documentBaseLength);
 var key = Encoding.Unicode.GetString(rawData, documentBaseLength, header.KeySize);
 var metadata = new byte[header.MetadataSize];
 var document = new byte[header.DocumentSize];
 Buffer.BlockCopy(rawData, documentBaseLength + header.KeySize, metadata, 0, header.MetadataSize);
 Buffer.BlockCopy(rawData, documentBaseLength + header.KeySize + header.MetadataSize, document, 0, header.DocumentSize);

 return formDocument(key, header.Etag, DateTime.FromFileTime(header.LastModifiedFileTime), metadata, document);
}

ExtractDocument is as easy as it looks: it simply pulls the fields out of the data structure and sends them to the formDocument action that builds the json document.

That's pretty much it for the easy index queries.  Adding this and a couple of other fix-ups allows rstudio to run (with one error on start-up, accessing /databases, which I will talk about next time).

Document data structure

Let's talk about the documents table data structure.  The basic information stored for a document in raven is the ID, the key, the etag, the last modification time, the metadata json dump and the document json dump.  From the low-level view of BDB, that is two fixed length buffers (etag and last modification time) and three variable length buffers.  And since we are responsible for all storage in BDB, we need to handle all of this ourselves.

There are times in the raven storage engine protocol where only the document data is read, or only the metadata is read, or only the etag is checked.  BDB supports partial reads of data from the datafile, so we want to make sure we take advantage of that as well.
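For example, checking just the etag should only be a 16-byte read.  I haven't verified that the wrapper's partial DbEntry overload works for gets the way it does for the partial puts used elsewhere in these posts, or that the cursor GetMode has a Set mode, so treat this as the shape I'm aiming for rather than working code:

//sketch only: read the first 16 bytes (the etag) of the primary row without touching the json.
//GetMode.Set and the partial-read behaviour of this DbEntry overload are assumptions on my part.
Guid etag = Guid.Empty;
var dkey = DbEntry.InOut(BitConverter.GetBytes(docId)); //docId is the document's primary key
var dvalue = DbEntry.InOut(new byte[16], 0, 16, 16, 0); //ask for bytes 0..15 of the stored record
using (var cursor = dataTable.OpenCursor(transaction, DbFileCursor.CreateFlags.None))
{
 if (cursor.Get(ref dkey, ref dvalue, DbFileCursor.GetMode.Set, DbFileCursor.ReadFlags.None) != ReadStatus.NotFound)
  etag = new Guid(dvalue.Buffer);
}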

Where's the data

I am going to lay the document primary data section out as a fixed sized header followed by the three variable sized buffers.

[StructLayout(LayoutKind.Sequential, Pack = 0)]
unsafe private struct DocumentHeader
{
 public Guid Etag;
 public long LastModifiedFileTime;
 public int KeySize;
 public int MetadataSize;
 public int DocumentSize;
 //KEYDATA[KeySize]
 //METADATA[MetadataSize]
 //DOCUMENT[DocumentSize]
}

This is the structure for the raw data that will be stored in the data file.
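The code in these posts keeps referring to a documentBaseLength constant for the size of this fixed header.  I haven't shown where it comes from; the streaming write code copies the key in at byte offset 36 (16 + 8 + 4 + 4 + 4), so that's the value the offsets imply:

//assumed definition of the fixed header size used throughout as documentBaseLength;
//the write path offsets (etag at 0, file time at 16, three lengths at 24/28/32, key at 36) imply 36
private const int documentBaseLength = 36;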

Give me the data

Raven calls IDocumentStorageActions.AddDocument when it wants to add a document.  The old etag (if this is an update) is passed in along with the metadata and the actual document data (there are some ravendb transaction operations that I'm going to skip for now since I'm just trying to get a document into the database).

We want to forward this call down to our document table class.

Guid newEtag = uuidGenerator.CreateSequentialUuid();

using(var ms = new MemoryStream()) { data.WriteTo(ms); dataBuffer = ms.ToArray(); } 
using(var ms = new MemoryStream()) { metadata.WriteTo(ms); metadataBuffer = ms.ToArray(); }
database.DocumentTable.AddDocument(transaction, key, newEtag, SystemTime.UtcNow, dataBuffer, metadataBuffer);

This will give us the raw data we want to store in the database.

Store it


unsafe public void AddDocument(Txn transaction, string key, Guid etag, DateTime dateTime, byte[] data, byte[] metadata)
{
 DbEntry dkey;
 var keyBuffer = Encoding.Unicode.GetBytes(key);
 var dataBuffer = new byte[documentBaseLength + keyBuffer.Length + data.Length + metadata.Length];

 var header = new DocumentHeader
               {
                Etag = etag,
                LastModifiedFileTime = dateTime.ToFileTime(),
                KeySize = keyBuffer.Length,
                DocumentSize = data.Length,
                MetadataSize = metadata.Length
               };

 //find the existing document key
 var existingId = GetDocumentIdByKey(transaction, key);

 //update or insert?
 if(existingId == 0)
 {
  long lastId = 0;
  var vlastId = DbEntry.Out(new byte[8]);
  var vlastData = DbEntry.EmptyOut();

  using (var cursor = dataTable.OpenCursor(transaction, DbFileCursor.CreateFlags.None))
  {
   if (cursor.Get(ref vlastId, ref vlastData, DbFileCursor.GetMode.Last, DbFileCursor.ReadFlags.None)
      != ReadStatus.NotFound)
    lastId = BitConverter.ToInt64(vlastId.Buffer, 0);
  }
  dkey = DbEntry.InOut(BitConverter.GetBytes(lastId + 1));
 }
 else
 {
  dkey = DbEntry.InOut(BitConverter.GetBytes(existingId));
 }

 var offset = 0;
 Marshal.Copy(new IntPtr(&header), dataBuffer, offset, documentBaseLength); offset += documentBaseLength;
 Buffer.BlockCopy(keyBuffer, 0, dataBuffer, offset, keyBuffer.Length); offset += keyBuffer.Length;
 Buffer.BlockCopy(metadata, 0, dataBuffer, offset, metadata.Length); offset += metadata.Length;
 Buffer.BlockCopy(data, 0, dataBuffer, offset, data.Length);

 var dvalue = DbEntry.InOut(dataBuffer);
 dataTable.Put(transaction, ref dkey, ref dvalue);
}

That's quite a function call, but it's fairly simple when you break it down:

  1. Form the document header structure
  2. Search the secondary index for the document key and get the primary key (if it exists)
  3. If we found the primary key then this is an update, not an insert.
  4. If it's an insert then we need to generate a new primary key (again, we need to perform the auto-incrementing primary key)
  5. We find the highest current primary key by using a BDB cursor and jumping directly to the end.
  6. The new primary key is one more than that.
  7. If we are an update then we already have the primary key.
  8. Copy all of the header, key data, metadata and document data into a buffer.
  9. Put the data into the table.
That's pretty much it, except for the secondary key callback that occurs when the Put operation happens.  Remember, we are responsible for picking the secondary key out of the primary data field.

unsafe private DbFile.KeyGenStatus GetDocumentKeyForIndexByKey(DbFile secondary, ref DbEntry key, ref DbEntry data, out DbEntry result)
{
 //extract the key for the secondary index of the document table
 var header = new DocumentHeader();
 Marshal.Copy(data.Buffer, 0, new IntPtr(&header), documentBaseLength);

 var keyBuffer = new byte[header.KeySize];
 Buffer.BlockCopy(data.Buffer, documentBaseLength, keyBuffer, 0, keyBuffer.Length);

 result = DbEntry.InOut(keyBuffer);
 return DbFile.KeyGenStatus.Success;
}

This will pull the document key from the primary data field and return it to BDB for storage in the secondary index.  With a little more plumbing sprinkled throughout, we should be able to run the server, do a document put operation and see it in the documents.db file:

curl -X PUT http://localhost:8080/docs/bobs_address -d "{ FirstName: 'Bob', LastName: 'Smith', Address: '5 Elm St' }"

VERSION=3
format=bytevalue
database=data
type=btree
db_pagesize=8192
HEADER=END
 0100000000000000
 00000000000000000000000000000001e10574f8877ccd011800000005000000420000000000000062006f00620073005f0061006400640072006500730073000500000000420000000246697273744e616d650004000000426f6200024c6173744e616d650006000000536d69746800024164647265737300090000003520456c6d2053740000
DATA=END
VERSION=3
format=bytevalue
database=indexByKey
type=btree
db_pagesize=8192
HEADER=END
 62006f00620073005f006100640064007200650073007300
 0100000000000000
DATA=END

The dump displays each key followed by its value.  So in the primary data section we have a key of 1, with the document header followed by all of the variable data (you can see that the first 16 bytes are the etag 00000000-0000-0000-0000-000000000001).  Then in the secondary index you can see the unicode version of bobs_address as the key and the primary key 1 as the data.
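As a quick sanity check on reading the dump, a throwaway snippet can turn the first 16 bytes of the primary record back into the etag:

//decode the first 16 bytes of the dumped primary record back into the etag guid
string hex = "00000000000000000000000000000001"; //first 32 hex chars of the data line above
var bytes = new byte[16];
for (int i = 0; i < bytes.Length; i++)
 bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
Console.WriteLine(new Guid(bytes)); //prints 00000000-0000-0000-0000-000000000001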

We have successfully stored a document in the database, now let's try to get it back out.

How to CREATE INDEX

Like I said in the last post, BDB does not support indexes in the style of CREATE INDEX; you have to do them all manually.  So if we want our documents database primary key to be an auto-incrementing long and also want to be able to search by document key, then we need to maintain the secondary index ourselves.

We do this by creating another section in the documents.db data file.  This is a simple tree that maps the secondary key (our document name) to the primary key (our auto-incrementing long).

this.indexByKeyFile = env.CreateDatabase(DbCreateFlags.None);
this.indexByKey = (DbBTree)indexByKeyFile.Open(null, "documents.db", "indexByKey", DbType.BTree, Db.OpenFlags.Create
 | Db.OpenFlags.ThreadSafe | Db.OpenFlags.AutoCommit, 0);
dataTable.Associate(indexByKey, GetDocumentKeyForIndexByKey, DbFile.AssociateFlags.Create);

The same flags apply as for the primary key btree; the difference is the Associate call.  This call tells BDB that this section is a secondary index, and that any time we do a put operation on the primary we want to be informed.  The callback for the event is GetDocumentKeyForIndexByKey, and its job is to pull the secondary key out of the primary key's data section (we'll get to the code for that later since we can't write it without the primary table's data structure in place).

Of course, running the server again and taking another db_dump gets us a little closer.

VERSION=3
format=bytevalue
database=data
type=btree
db_pagesize=8192
HEADER=END
DATA=END
VERSION=3
format=bytevalue
database=indexByKey
type=btree
db_pagesize=8192
HEADER=END
DATA=END

We now have two sections of data in the file: the primary data section and the secondary index called indexByKey.

BDB???

I will readily admit that I am not a database expert, but I have had a fair amount of experience using various database engines: SQL Server, PostgreSQL, SQLite, MySQL, VistaDB, etc.  But when I started looking into BDB for the RavenDB storage engine, I got a little overwhelmed.  It all looked like Greek to me.  There's no fancy management studio, there's no SQL, there's hardly anything.  You are basically given a put and a get that search btree-style for byte arrays.  Everything else is up to you.  It's like comparing assembly language to C#.

Interfacing with BDB from C# is a daunting task.  BDB is not set up like a normal DLL where we can just import a bunch of simple C functions; it runs its API through a C struct of function pointers.  This poses a problem for C# imports in that we have to parse the structure for each of the function pointers and interop through those.  Luckily there is already a project dedicated to a C#/BDB interface: Berkeley DB for .NET.
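To give a flavor of why that's painful, the wrapper essentially has to fish each function pointer out of the native DB struct and marshal it into a callable delegate.  This is just the general P/Invoke pattern, not the actual Berkeley DB for .NET code:

//rough illustration of the pattern: read a function pointer from the native struct
//and turn it into a delegate we can call from C# (the field offset is version-specific)
[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
delegate int DbPutDelegate(IntPtr db, IntPtr txn, IntPtr key, IntPtr data, uint flags);

static DbPutDelegate GetPutFunction(IntPtr dbStructPtr, int putFieldOffset)
{
 IntPtr putFnPtr = Marshal.ReadIntPtr(dbStructPtr, putFieldOffset);
 return (DbPutDelegate)Marshal.GetDelegateForFunctionPointer(putFnPtr, typeof(DbPutDelegate));
}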

Let's create a data file

The first thing I set out to do was to create a BDB data file.  From the raven perspective, the first interesting activity of a storage engine is TransactionalStorage.Initialize.  This is where raven opens the "database" and creates the schema if there isn't one.

Everything in BDB starts with an environment.  Environments help with many of the BDB operations, but the main one that we are concerned with is transactions.  Setting up an environment is fairly simple; we just need to pass the correct options at initialization.

env = new Env(EnvCreateFlags.None);
env.ErrorStream = new MemoryStream();
env.Open(path, Env.OpenFlags.Create | Env.OpenFlags.InitTxn | Env.OpenFlags.InitLock | Env.OpenFlags.InitLog
 | Env.OpenFlags.InitMPool | Env.OpenFlags.ThreadSafe | Env.OpenFlags.Recover, 0);

The options are basically: create the environment if it isn't there, initialize the transaction and locking systems, turn on log files, enable the BDB memory buffer pool, make the environment thread-safe, and recover from the logs if we had a bad previous shutdown.

I am going to start with the document "table" setup, since the first operation that I want to accomplish with the new storage engine is to store a single document.  In order to create the document table, we need to open a BDB file.  A BDB data file is just a container for btrees; one physical file can hold multiple virtual trees.

this.dataTableFile = env.CreateDatabase(DbCreateFlags.None);
this.dataTable = (DbBTree)dataTableFile.Open(null, "documents.db", "data", DbType.BTree, Db.OpenFlags.Create
 | Db.OpenFlags.ThreadSafe | Db.OpenFlags.AutoCommit, 0);

First we create the database interface (the CreateDatabase method is poorly named).  Then we actually open the database file.  The file name is documents.db and it has one section in it called data.  It's a btree type file, and we want to create it, make it thread safe, and set it up to auto-commit any operation that doesn't have an explicit transaction (we will always pass an explicit transaction anyway).

Running this code, with proper Disposes thrown in throughout, will generate a file in your data directory called documents.db (along with a bunch of other files for the database log and transaction management).  We can use the BDB utility functions to take a peek into the file: db_dump Data\documents.db

VERSION=3
format=bytevalue
database=data
type=btree
db_pagesize=8192
HEADER=END
DATA=END

We have a btree file with an 8k page size and no data.  Hey, at least it's a start.

Intro to RavenDB BDB storage engine

I have set out on a quest to get a BDB storage engine working with RavenDB.  I am a big believer in C# running on Linux; we've been running production sites on Linux for years, and it's been a great experience.  Recently, we started a new project where the RavenDB document database seemed like a good fit.  After about a week or two of poking around with it, I discovered how great it is for certain applications.  Unfortunately, the only production storage engine is based on the Microsoft Esent database engine and thus does not run under Linux.

There has been some talk on the mailing lists about creating a new storage engine based on the Berkeley DB engine, which runs on both Windows and Linux.  And after tackling the facet optimization for RavenDB (which I will hopefully post about later), I have decided to attempt this feat.

I have very little experience with RavenDB and exactly zero experience with BDB.  There is no documentation for the RavenDB storage backend (except for the code), so a lot of the work is based on trial and error.  My starting point for RavenDB is the code, and for BDB it's this tutorial: http://pybsddb.sourceforge.net/reftoc.html.

You can track my progress on github.

Step1 - Get the code to compile

  1. Copy the Raven.Storage.Esent project to Raven.Storage.Bdb and rename the namespaces.
  2. Delete all of the code from every class and re-implement all interfaces to throw NotImplementedException.
  3. Add a bdb case to CreateTransactionalStorage so that we can create our new storage engine.
  4. Run and wait for the crash!
var storageEngine = SelectStorageEngine();
switch (storageEngine.ToLowerInvariant())
{
 case "esent":
  storageEngine = "Raven.Storage.Esent.TransactionalStorage, Raven.Storage.Esent";
  break;
 case "bdb":
  storageEngine = "Raven.Storage.Bdb.TransactionalStorage, Raven.Storage.Bdb";
  break;
 case "munin":
  storageEngine = "Raven.Storage.Managed.TransactionalStorage, Raven.Storage.Managed";
  break;
}
var type = Type.GetType(storageEngine);

Step2 - Get the server to run

Now that we have a server that at least uses our storage engine, it's time to actually get the server to start up.  Since I have no idea of the order of events for the storage engine, it's simply a matter of going back and forth between eliminating NotImplementedExceptions and running the server.

The way to eliminate the exceptions (without knowing what any of the functions really do yet) is to look at the return type.

  • If the function returns void then I just make a blank function
  • If the function returns an object then I return null
  • If the function returns IEnumerable, then I just yield break.
  • Some functions that have non-Esent code in them are copied from the previous storage engine
After a bunch of change/run sessions, we get a running server that does nothing.
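For illustration, the stubs end up looking something like this (the member names are made up for the example, not the exact storage interface):

//representative stubs following the rules above
public void Commit() { }                                                       //returns void: blank body
public JsonDocument DocumentByKey(string key) { return null; }                 //returns an object: null
public IEnumerable<JsonDocument> GetDocumentsAfter(Guid etag) { yield break; } //returns IEnumerable: yield break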


Well that concludes the first post on the BDB storage engine.  Next time we will actually look at getting some files to the hard drive.