
Secondary Databases, or - How I Learned to Stop Worrying and Love the Cache

Pre-note: This is an old article from when I headed Isoscribe. I’m reposting it here because I thought it was somewhat cool.

Isoscribe runs on several databases.

We have the primary database, MongoDB, which stores all the user attributes, posts, blogs, and so on.

In addition, there are a number of databases we use to make our services faster, more reliable, and more secure.

Redis

Redis was the first of the bunch to be introduced to Isoscribe. We use it for session storage, which means we don’t store any session data client side — only a token that we reference. A few of the main reasons are as follows:

  1. It’s a lot faster, because your browser only sends us a single unique key instead of a whole host of data about you
  2. The session data lives on our servers, so it cannot be tampered with client side
  3. Tokens effectively cannot be forged, only duplicated
  4. Session information persists between servers

Let me explain that last one. There are three primary ways to store session data for a browser. They all involve cookies - little bits of data that are isolated per site, and sent with every request to the server. The first is to store all the data client-side. It can be encrypted or in plain text, but either way it could be forged. The second is to store the data on the server itself - but what if the user’s next request goes to a different server? Or that server goes offline? Isoscribe is all about fault tolerance and scalability, so this option doesn’t work either.

The last one is to use an external database, and it’s what we did. It makes our lives easy, because we only have to store a single key in a cookie (you can see your key in the t cookie on your browser), while all data is stored in an insanely fast database.
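The flow can be sketched roughly like this. This is a minimal sketch with a plain dict standing in for Redis; the `t` cookie name comes from the post, everything else is illustrative:

```python
import secrets

# Stand-in for Redis: a shared key-value store all servers can reach.
# In production this would be a Redis client, not a dict.
session_store = {}

def create_session(user_data: dict) -> str:
    """Generate an unguessable token; only the token goes into the cookie."""
    token = secrets.token_urlsafe(32)  # what ends up in the `t` cookie
    session_store[token] = user_data   # the actual data never leaves the server
    return token

def load_session(token: str):
    """Any server can resolve the cookie token back into session data."""
    return session_store.get(token)

token = create_session({"user": "jdoe", "blogs": ["example"]})
assert load_session(token) == {"user": "jdoe", "blogs": ["example"]}
assert load_session("forged-token") is None  # unknown tokens resolve to nothing
```

Because every server talks to the same store, it doesn’t matter which one handles the next request - point 4 above falls out for free.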

JanusGraph

We use JanusGraph, communicating over GraphQL, to store role and permission information. I won’t go super in-depth with it, as it is still in active development.

However, graph databases specialize in one thing that most other databases treat as an afterthought - relationships. Graph databases look more like webs or meshes. Every single “object” (say, a person) in the database has a set of “relationships” (father, mother, daughter, son, friend, coworker, etc.) with other “objects.” This makes traversing between multiple objects very easy.

Say you write for a blog, and you have a permission can_post set on your role. In a normal database, we would have to query your user, find your role, query your blog, find the role in the blog, query the role, and find the permission. These would all be separate queries, which is very time-intensive. Instead, using a graph database, we can go directly between these objects, without having to search for an object by primary key or run some other operation on the database. We can literally call `getFather` on an object, rather than `SELECT * FROM people WHERE id = (SELECT fatherID FROM people WHERE name = 'John Doe')`. I'll give you a hint - `getFather` is way nicer than the second option, and vastly cheaper.
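As a toy illustration of why traversal beats repeated lookups, here is a hypothetical in-memory graph - not JanusGraph’s actual API, just the shape of the idea:

```python
# Hypothetical graph node: each node holds direct references to its
# neighbors, so "traversal" is just following pointers - no index lookups.
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = {}  # relationship label -> list of neighboring nodes

    def add_edge(self, label, other):
        self.edges.setdefault(label, []).append(other)

    def out(self, label):
        return self.edges.get(label, [])

def can(user_node, permission):
    """Walk user -> role -> permission by following edges directly."""
    return any(
        p.name == permission
        for role in user_node.out("has_role")
        for p in role.out("grants")
    )

user, role, perm = Node("jdoe"), Node("writer"), Node("can_post")
user.add_edge("has_role", role)
role.add_edge("grants", perm)

assert can(user, "can_post")
assert not can(user, "can_delete")
```

Each hop is a pointer dereference rather than a fresh query, which is why the chain of lookups described above collapses into a single walk.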

Memcached

Isoscribe operates a cache in front of the main database, which allows us to store frequently accessed data, without “building” an object from a database response. All our caches operate in JSON, and are updated any time a change is made.

However, this cache is possibly my favorite part of this entire site. Let me walk you through it.

Memcached is a key-memory store. It’s similar to a key-value store, except instead of storing a structured “value,” it stores the raw in-memory bytes of the object. Essentially they’re the same thing, but it’s a nitpicky distinction I should probably mention.

We have keys that can be inferred from any given object. Because each object has a unique ID, we can request the object from the cache, which will return JSON, rather than querying the database and then building a JSON structure.

But this has a catch. We had a few issues with this cache where objects wouldn’t be properly updated. So, to combat this, we effectively use two kinds of updates: “lazy” and “active.”

When a user updates an object (say, edits a post), we regenerate the cache. However, due to how long it could take to rebuild a list of posts for a blog, for example, we don’t always do it immediately when the user requests.

Lazy updates

When a change is made, but the computations to rebuild the cache would take too long to attach them to the user’s request, we start a “job” to run that update. It is asynchronous to the request, so the user can be told that the update was made, while the cache is rebuilt. Usually, building the cache takes less than a second. However, time is precious with API calls.
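A sketch of the lazy path. The job queue and all names here are illustrative; in reality the worker runs asynchronously in the background rather than being drained inline:

```python
from queue import Queue

cache = {}            # stand-in for Memcached
rebuild_jobs = Queue()  # stand-in for the async job queue

def handle_edit(post_id, new_body, posts_db):
    """Write the change immediately; defer the expensive cache rebuild."""
    posts_db[post_id] = new_body
    rebuild_jobs.put(post_id)  # enqueue a job; the request returns right away
    return "update accepted"

def run_worker(posts_db):
    """Rebuild cached JSON views asynchronously from the user's request."""
    while not rebuild_jobs.empty():
        post_id = rebuild_jobs.get()
        cache[post_id] = {"id": post_id, "body": posts_db[post_id]}

db = {1: "old text"}
handle_edit(1, "new text", db)   # user is told the update was made...
run_worker(db)                   # ...while the cache catches up
assert cache[1]["body"] == "new text"
```

The user’s request only pays for the write and the enqueue; the rebuild cost lands on the worker instead of the API call.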

Active updates

These updates are made when, for example, a post is deleted. The user’s request waits while the cache empties or updates the necessary objects. The same thing happens when the cache is missing data: we do not pre-populate the cache with every object in our database, but instead only populate it when an object is requested. Over time the cache converges toward holding everything that actually gets requested, but starting empty lets us run smaller instances, which is cheaper.

If the requested object is not in the cache, we query the database, and if an object is found, generate the JSON view, store it in the cache, and return it to the user. It sounds complicated, but every one of those steps (save storing in the cache) would happen anyway without a cache.
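Put together, that read path is the classic cache-aside pattern. A minimal sketch with dicts standing in for Memcached and Mongo (all names are illustrative):

```python
import json

memcache = {}  # stand-in for Memcached: key -> serialized JSON string
database = {"post:1": {"id": 1, "title": "Hello"}}  # stand-in for MongoDB

def get_object(key):
    """Cache-aside read: serve from cache, else build from the DB and store."""
    cached = memcache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no DB query, no view building
    record = database.get(key)
    if record is None:
        return None                # not in the DB either
    view = json.dumps(record)      # "build" the JSON view once
    memcache[key] = view           # populate on demand, not up front
    return json.loads(view)

assert get_object("post:1") == {"id": 1, "title": "Hello"}
assert "post:1" in memcache        # the next read will be a cache hit
assert get_object("post:404") is None
```

Only the `memcache[key] = view` line is extra work compared to running without a cache, which is the point made above.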

Hope you liked this write-up on the Isoscribe database structure.

–E

This post is licensed under CC BY 4.0 by the author.