Friday, August 17, 2012

Easy index searches

My next goal was to get Raven Studio up and running so that I could see the documents without relying on db_dump and curl.  The main url's that rstudio accesses are /docs and /stats.  /stats is pretty easy to mock with null data, and the /docs url is the useful one for first looking at documents.

My first index

/docs ends up accessing IDocumentStorageActions.GetDocumentsByReverseUpdateOrder to get the most recently changed documents.  The most recently changed documents is based on the etag of the document.  So we need another index in our documents.db file that pulls the etag as the secondary index value.


this.indexByEtagFile = env.CreateDatabase(DbCreateFlags.None);
this.indexByEtag = (DbBTree)indexByEtagFile.Open(null, "documents.db", "indexByEtag", DbType.BTree, 
  Db.OpenFlags.Create | Db.OpenFlags.ThreadSafe | Db.OpenFlags.AutoCommit, 0);
dataTable.Associate(indexByEtag, GetDocumentEtagForIndexByEtag,
  DbFile.AssociateFlags.Create);

unsafe private DbFile.KeyGenStatus GetDocumentEtagForIndexByEtag(DbFile secondary, ref DbEntry key, 
  ref DbEntry data, out DbEntry result)
{
 //extract the key for the secondary index of the document table
 var header = new DocumentHeader();
 Marshal.Copy(data.Buffer, 0, new IntPtr(&header), documentBaseLength);

 var etagBuffer = new byte[16];
 Buffer.BlockCopy(data.Buffer, 0, etagBuffer, 0, etagBuffer.Length);

 result = DbEntry.InOut(etagBuffer);
 return DbFile.KeyGenStatus.Success;
}

This allows a sorted view of the etag values for all documents. Now to get the most recent list of changed documents we just use a BDB cursor to start pulling records from the end of the index.

Cursors

Cursors are how you move through an btree in BDB.  We can start a cursor at the beginning of an index, or at the end, then move forward or backwards, everything else is up to us.  So let's use the new etag index to get the documents skipping and paging.

unsafe public IEnumerable<JsonDocument> GetDocumentsByReverseUpdateOrder(Txn transaction, int start, int take,
  Func<string, Guid, DateTime, byte[], byte[], JsonDocument> formDocument)
{
 var ret = new List<JsonDocument>();

 using(var cursor = indexByEtag.OpenCursor(transaction, DbFileCursor.CreateFlags.None))
 {
  for (int i = 0; i < start; i++)
  {
   if(cursor.PrevSkip() == ReadStatus.NotFound)
    return ret;
  }

  foreach(var docData in cursor.ItemsBackward(true, DbFileCursor.ReadFlags.None).Take(take))
   ret.Add(ExtractDocument(docData.Data.Buffer, formDocument));
 }

 return ret;
}

We open the cursor, then for the skips, we just keep issuing previous calls without extracting any data.  Then we keep moving backwards picking up the data until we have a complete page of results.  The C#/BDB interface makes it easy by giving us an iterator over the retrieved items.  Once we have a complete document, we can extract the data from it an push it to the action for generating a json document.

unsafe private JsonDocument ExtractDocument(byte[] rawData, 
  Func<string, Guid, DateTime, byte[], byte[], JsonDocument> formDocument)
{
 var header = new DocumentHeader();
 Marshal.Copy(rawData, 0, new IntPtr(&header), documentBaseLength);
 var key = Encoding.Unicode.GetString(rawData, documentBaseLength, header.KeySize);
 var metadata = new byte[header.MetadataSize];
 var document = new byte[header.DocumentSize];
 Buffer.BlockCopy(rawData, documentBaseLength + header.KeySize, metadata, 0, header.MetadataSize);
 Buffer.BlockCopy(rawData, documentBaseLength + header.KeySize + header.MetadataSize, document, 0, header.DocumentSize);

 return formDocument(key, header.Etag, DateTime.FromFileTime(header.LastModifiedFileTime), metadata, document);
}

The extract document is as easy as it looks, simply pulling the data from the data structure and sending it to the json document form-er.

That's pretty much it, for the easy index queries, adding this and a couple other fix-ups allows rstudio to run (with one error on start-up, accessing /databases, which I will talk about next time.

3 comments:

  1. One problem is here:

    for (int i = 0; i < start; i++)
    {
    if(cursor.PrevSkip() == ReadStatus.NotFound)
    return ret;
    }


    This is an O(N) operation.
    We had the same bug in Esent.
    Can you make it skip N results in one go?

    ReplyDelete
    Replies
    1. And, it looks like the current Esent engine does exactly the same thing.

      while (start > 0)
      {
      if (Api.TryMoveNext(session, Documents) == false)
      return Enumerable.Empty();
      start--;
      }

      Delete
  2. I've looked for that initially, but I'm not sure that it's possible. I will continue to try to see if it's possible.

    ReplyDelete