Hundreds of thousands of new updates

An abbreviated version of this post appears as a guest post on the Mechanical Turk blog today.

The primary learning activity on Grockit is problem solving, so it’s probably not surprising that Grockit has developed a large (and growing) library of problems to solve. What began as a few hundred problems has quickly turned into hundreds of thousands of questions, answers, explanations, and more. Over the past few years, we’ve developed an assortment of technologies and processes to make this possible. Over the past few weeks, we’ve been revamping this infrastructure in some powerful ways, and I wanted to share a few (technical) notes on what we did, how, and why.

When you’ve got tens or hundreds of people authoring problems, things inevitably get messy. Invisible markup affects question formatting in unexpected ways, mathematical expressions are entered in several different ways, questions get modified and revised by a sequence of different authors, and it becomes increasingly difficult to figure out who changed what, when, and how. In short, a fully-featured content management system (CMS) was in order. At the same time, however, we had made some meaningful customizations to our editor that a CMS wasn’t designed to support: AP exam question types, alignments to various state standards, images externally hosted on Amazon S3, special handling of long reading passages, open-ended responses to certain math problems, a combination of skill tags and taxonomies, and a slew of other requirements specific to the type of “content” that we needed to “manage.”

Ultimately, we decided to take a hybrid approach, using a customized application for the high-level structure and a version control system for low-level content management. The first challenge was building this hybrid application (an engineering problem), and the second challenge was moving from our existing system to the new one (a process problem). I’ll expand a bit on how we approached each.

Grockit’s new content editor banishes hidden HTML markup by replacing the standard rich-text editor with a plain-text editor that uses Markdown to signal formatting. Markdown was designed to keep things simple and intuitive, and that’s just what we were looking for. In order to know who changed what, when, and how, we rely on Git, a distributed version control system frequently used in software development, where many different people are changing different parts of a system over time. Beyond accounting for every last text edit, Git lets us run full-text searches over content, and even over past versions of content. Think “track-changes”, but on steroids.

Git is generally accessed from the file system, but we were looking for a simple web front-end for our system. The team at GitHub, an immensely popular website for collaboration on open-source software development, put together a fantastic tool named Gollum that we adopted and modified. Gollum was developed as a git-backed wiki (in the form of a Ruby Sinatra application) that uses Markdown for formatting, and incorporates the browser-ready MathJax engine for beautiful LaTeX typesetting of mathematical expressions in modern web browsers. For example:

\[ P\left(X_{vi} = 1 | \theta_v, \beta_i, \alpha_i, \gamma_i\right) = \gamma_i + \frac{1-\gamma_i}{1+e^{-\alpha_i(\theta_v - \beta_i)}} \]
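To make the "track-changes on steroids" idea concrete, here is a minimal sketch of searching both current content and its history with the stock Git command line. It is an illustration only: this post does not describe how our editor actually talks to Git, and the function names (`current_search_cmd`, `history_search_cmd`, `run_git`) and the repository path are hypothetical.

```python
import subprocess


def current_search_cmd(term):
    """Build a `git grep` command that searches the current checkout.

    -n prints line numbers, -i ignores case, -F treats the term as a
    literal string rather than a regular expression.
    """
    return ["git", "grep", "-n", "-i", "-F", "-e", term]


def history_search_cmd(term):
    """Build a `git log -S` ("pickaxe") command.

    It lists commits in which the number of occurrences of `term`
    changed, i.e. commits that added or removed that text.
    """
    return ["git", "log", f"-S{term}", "--oneline"]


def run_git(cmd, repo_path):
    """Run a Git command inside repo_path and return its output lines.

    check=False because `git grep` exits with status 1 when nothing
    matches, which is not an error for our purposes.
    """
    result = subprocess.run(
        cmd, cwd=repo_path, capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()
```

With a content repository checked out at some path, `run_git(history_search_cmd("quadratic"), "/path/to/content")` would list the commits that introduced or removed the word "quadratic", which is the kind of question ("who changed what, and when?") that a rich-text editor alone cannot answer.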

Loosely coupling the Gollum-derived editor with the Monarch-based application gave us a customized editor that supported change tracking and reverting, full-text search of current and past versions of content, much-simplified markup, more beautiful math, and a slew of other improvements and enhancements. This new system, however, assumes that text is in Markdown format. None of the existing questions, answers, and explanations were, and that left something of a challenge.

The challenge: convert each bit of content in the Grockit system from free-form HTML to a Grockit-flavored Markdown without losing the necessary visual styling, then verify that the conversion was done correctly, fix it if it wasn’t, and then deploy the approved version to the production system once ready. Then repeat, hundreds of thousands of times. Clearly, we needed an automated process. We know that automated doesn’t necessarily mean accurate, however, so we decided that a partially-automated, partially-manual process was the best way to ensure that Grockit questions, answers, and explanations would continue to be accurate. Here’s what we did:

The first step, an automated process to convert HTML to Gollum-ready Markdown (codename: Smeagol), got us started. Some of the changes were so minor that no manual verification was necessary, and the new version could be deployed to the production system immediately. For each of the remaining items, we used Amazon’s Mechanical Turk service to ask three different people whether the before and after (i.e., Smeagol and Gollum) content looked the same. If all three agreed that the conversion worked, we felt confident in switching to the Gollumnized Markdown. If not, we needed someone else to check and fix the change. For this, we used the Pivotal Tracker API to build an organized to-do list for a team of Grockit content authors to work through. Once a correction was saved, Grockit would start displaying the Gollumnized version. In our trial run with Algebra I in the Academy, 45,000 conversion quality ratings were submitted in the first hour alone! The result: a rolling process (without race conditions!) to update all of Grockit’s content, one field at a time, to a much cleaner, simpler, more trackable, more searchable, more flexible form moving forward.
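The flow above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Smeagol itself: the converter handles only a handful of tags, the "minor change" rule (the text came through unchanged) is a stand-in for whatever rule a real pipeline would use, and every name here (`MarkdownConverter`, `html_to_markdown`, `Route`, `route_item`) is hypothetical. The one rule taken directly from the process is that all three ratings must agree before a converted item is deployed without human correction.

```python
from enum import Enum
from html.parser import HTMLParser

REQUIRED_RATINGS = 3  # three independent ratings per item


class MarkdownConverter(HTMLParser):
    """Toy HTML-to-Markdown converter covering only a few tags."""

    MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])
        elif tag == "br":
            # Two trailing spaces plus a newline is a Markdown line break.
            self.parts.append("  \n")

    def handle_endtag(self, tag):
        if tag in self.MARKERS:
            self.parts.append(self.MARKERS[tag])
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html):
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()


class Route(Enum):
    DEPLOY = "deploy"    # switch production to the Markdown version
    PENDING = "pending"  # still waiting for ratings
    REVIEW = "review"    # add to the content authors' to-do list


def is_trivial_change(original_html, markdown):
    """Assumed rule: the conversion left the text exactly as it was."""
    return original_html.strip() == markdown.strip()


def route_item(original_html, markdown, ratings):
    """Decide what happens to one converted field.

    `ratings` is a list of booleans, one per rater, where True means
    "the before and after look the same".
    """
    if is_trivial_change(original_html, markdown):
        return Route.DEPLOY
    if len(ratings) < REQUIRED_RATINGS:
        return Route.PENDING
    return Route.DEPLOY if all(ratings) else Route.REVIEW
```

For instance, a field whose three ratings are `[True, False, True]` is routed to `Route.REVIEW`, while a plain-text field that survives conversion untouched is deployed without being rated at all. Because each field is routed independently, content can be switched over one field at a time as results come in.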