Building a time series engine is hard. Beyond the typical database management problems of data distribution, fault tolerance, and read/write scaling, you have the additional reporting challenge — how do you make it simple to query?
This article goes into depth on Mage, the backend that powers Parse.ly’s new analytics dashboard.
Why did we need Mage?
In the case of Parse.ly, we had unique challenges that stemmed from working with some of the largest media companies on the web. So, when we rethought our backend architecture in 2014, we had some tough technical requirements for that new architecture.
For example, how do we…
- store audience behavior data on millions of URLs at minute-by-minute resolution, while also allowing roll-ups over time?
- allow for an almost limitless number of metrics, segments, and time dimensions per URL?
- deal with an ever-evolving set of crawled site metadata classifying those URLs?
- correct errors that might occur in our data collection in a matter of minutes?
- perform site-wide roll-ups, and even multi-site and network-wide rollups?
- do per-metric benchmarking, both to the site and to all data on record?
- group and filter URLs arbitrarily while performing calculations on their representative metrics?
And, how do we do all of this while maintaining sub-second queries? And without completely breaking the bank with our cloud hosting provider, Amazon Web Services?
The Challenge of Our Data
First, let me illustrate the scale of our data. We ingest about 10,000 data points per second from web browsers across the web. Our data collection infrastructure receives data from 400 million unique browsers monthly, and their browsing activity involves over 10 billion page views per month to publisher sites. This results in hundreds of gigabytes of raw data per month, and terabytes of data in our archive.
We also continuously crawl publisher websites for metadata, and we’ve built a web page metadata cache which is, itself, over 150GB of ever-changing data, such as the full text of articles published, headlines, image thumbnail URLs, sections, authors, CMS tags, and more. We must continuously join this metadata to our analytics stream to produce the insights that we do.
As an illustration of the mere economic challenge of storing it all, I can observe that some of our largest customers (by traffic) add nearly $1,000/month to our Amazon CDN costs. This is the cost Amazon bills us merely to serve the tiny JavaScript tracker code to web browsers across the web. This commoditized cost is just the starting point for us, as from that moment, packets of user activity start to stream to servers that we operate in Amazon EC2.
Now that you have a sense of the challenges of building an advanced time series store for web analytics, I will discuss how we attacked this problem.
A Lambda, or Log-Oriented, Architecture
I wrote a little bit about Lambda and Log-Oriented Architecture in my prior posts on this blog, “Apache Storm: The Big Reference” and “Loving a Log-Oriented Architecture”. This hinted at two foundational pieces of data architecture here at Parse.ly: Apache Storm and Apache Kafka.
Apache Storm is used as a real-time stream processor for analytics data throughout; we wrote our own open source implementation of its Python protocol, which we originally called pystorm. We also wrote a Storm project management and test framework, called streamparse, for making Storm work smoothly with Python. The streamparse project is now a popular public GitHub project used by many companies and institutions that work with Apache Storm from Python. We currently run Apache Storm 0.8.2 in production, but are planning an upgrade to 0.9.x soon.
Pictured above: The first live demonstration of the streamparse open source library at PyData Silicon Valley 2014.
Apache Kafka is used as the data backbone of our architecture. We have used Kafka for several years and even wrote our own full-featured and high-performance client library in Python. It was originally based on an open source library by the Disqus team, called samsa. However, we are working on a new version of this library, now called pykafka, which we will release as open source soon. Though we still run Apache Kafka 0.7 in production, we are planning an upgrade to the recently released Apache Kafka 0.8.2 soon, also making use of our new open source work in this area.
We discussed this architecture in more detail in a talk at PyData 2014 Silicon Valley, recorded in video form here on YouTube: Real-Time Logs and Streams.
Immutable Events at the Core
In my post about log-oriented architecture, I wrote that the core principle that has taken hold about this design pattern is as follows:
A software application’s database is better thought of as a series of time-ordered immutable facts collected since that system was born, instead of as a current snapshot of all data records as of right now.
We made this principle core to our new system design.
Pictured above: Andrew (CTO) and Keith (Backend Lead) discuss Parse.ly’s log-oriented architecture.
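To make the principle concrete, here is a toy example (the field names are made up, not our actual schema) of treating pageviews as an append-only log of time-ordered facts, and deriving a "current snapshot" from that log rather than mutating counters in place:

```python
# Illustrative only: hypothetical event fields, not our real schema.
# Each pageview is an immutable fact; "current" metrics are derived by
# folding over the log, never by updating a stored counter in place.
from collections import Counter

event_log = [
    {"ts": "2015-03-01T12:00:03Z", "url": "http://example.com/post-1", "visitor": "a1"},
    {"ts": "2015-03-01T12:00:07Z", "url": "http://example.com/post-1", "visitor": "b2"},
    {"ts": "2015-03-01T12:00:09Z", "url": "http://example.com/post-2", "visitor": "a1"},
]

# A "snapshot" (page views per URL, right now) is just one possible view
# computed over the facts; other views can be derived, or re-derived, later.
page_views = Counter(event["url"] for event in event_log)
print(page_views.most_common())
```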
Data enters our system as raw Kafka topics, which contain the “firehose” of user activity from all of our publisher websites. This data does not follow a partitioning scheme, and since it is collected from servers across multiple availability zones, it is not even guaranteed to be in time order.
One of the first things that happens to this data is that it is backed up. A Python process automatically uploads all the raw logs to Amazon S3, where they are grouped by customer account and day, after applying rules related to our privacy policy, such as expunging IP addresses for certain publishers.
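A simplified sketch of that backup step follows; the bucket name, key layout, helper names, and scrubbing rule are illustrative (and it uses boto3 for brevity), not our production code:

```python
# Simplified sketch of the raw log backup step. Bucket name, key layout,
# and the scrub() rule are illustrative, not our production code.
import gzip
import json

import boto3

s3 = boto3.client("s3")

def scrub(event, expunge_ip=False):
    """Apply privacy-policy rules (e.g. expunging IPs) before archiving."""
    if expunge_ip:
        event.pop("ip_address", None)
    return event

def backup_raw_logs(events, account, day, expunge_ip=False):
    """Upload a batch of raw events to S3, grouped by customer account and
    day, as gzipped, newline-delimited JSON."""
    body = "\n".join(json.dumps(scrub(e, expunge_ip)) for e in events)
    key = "raw-logs/{account}/{day}/events.json.gz".format(account=account, day=day)
    s3.put_object(Bucket="example-raw-log-archive", Key=key,
                  Body=gzip.compress(body.encode("utf-8")))
```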
The data flows into Storm topologies via Python spouts. This code runs in our “writer” topology. Its spouts spread the data throughout our data processing cluster, where it is batched for performance reasons and written to a distributed data store that provides URL grouping and time ordering.
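A minimal sketch of the batching-and-writing step, using streamparse's BatchingBolt, is below. The class, tuple layout, grouping key, and write_batch_to_staging_store() helper are all hypothetical, and streamparse import paths vary between versions:

```python
# Hypothetical sketch of a "writer"-style batching bolt; not our actual code.
from streamparse import BatchingBolt  # in older streamparse: streamparse.bolt

def write_batch_to_staging_store(url, events):
    """Hypothetical stand-in for a write to the distributed staging store."""
    pass

class RawEventWriterBolt(BatchingBolt):
    """Batch incoming raw events and write them to the staging area."""

    ticks_between_batches = 1  # flush roughly once per tick tuple

    def group_key(self, tup):
        # Batch events by URL so each write lands in a single URL grouping.
        return tup.values[0]

    def process_batch(self, key, tups):
        events = [tup.values[1] for tup in tups]
        write_batch_to_staging_store(url=key, events=events)
```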
This distributed data store is primarily a staging area. It contains trailing 90 days of raw event data storage. Its primary purpose is to group and order — and then provide a mechanism to easily and continuously index (and re-index) the data in our time series database.
After data is ack’ed and written to this staging area, a new Kafka topic receives a signal that indicates that a URL’s analytics data need to be refreshed. This signals another Storm topology that it’s time to build up a time series view of this data.
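The signal itself is just a small message on a Kafka topic. A sketch using pykafka's published client API is below; the topic name and message format are made up for illustration:

```python
# Sketch of the "this URL needs reindexing" signal; topic name and
# message format are illustrative, not our actual wire format.
import json

from pykafka import KafkaClient

client = KafkaClient(hosts="kafka01:9092,kafka02:9092")
refresh_topic = client.topics[b"url-refresh-signals"]

def signal_url_refresh(apikey, url):
    """Tell the indexer topology that a URL's analytics need rebuilding."""
    message = json.dumps({"apikey": apikey, "url": url}).encode("utf-8")
    with refresh_topic.get_sync_producer() as producer:
        producer.produce(message)
```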
Building a Time Series Index
This indexing code runs in our “indexer” topology. It is this topology that watches for URLs that have changed, queries the staging area for the most recent time series data, performs a streaming join with our cached web crawl metadata, and writes time series records to our time series database.
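A rough sketch of such an indexer bolt follows; every helper function here is a hypothetical placeholder for our internal staging-area, crawl-metadata, and time series services:

```python
# Hypothetical sketch of an "indexer"-style bolt; the helpers below are
# placeholders, not real internal APIs.
from streamparse import Bolt  # in older streamparse: streamparse.bolt

def fetch_recent_events(url):
    """Placeholder: query the staging area for a URL's recent events."""
    return []

def lookup_crawl_metadata(url):
    """Placeholder: look up headline, authors, section, tags, etc."""
    return {}

def write_time_series_records(records):
    """Placeholder: write joined records to the time series database."""
    pass

class UrlIndexerBolt(Bolt):
    def process(self, tup):
        url = tup.values[0]
        events = fetch_recent_events(url)
        metadata = lookup_crawl_metadata(url)
        # Streaming join: attach crawl metadata to each time series record.
        records = [dict(event, **metadata) for event in events]
        write_time_series_records(records)
```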
Records are written at various granularities to allow for various degrees of query flexibility.
- Raw records are kept for 24 hours. This allows the best possible query flexibility, but since it’s several hundred gigabytes of data per day, it is not feasible to hold the data longer than this.
- “Minutely” records (5min samples) are kept for 30 days. This allows us to spot fast-moving real-time trends in data and draw fine-resolution timelines for individual URLs and posts. The grouping of data into 5min samples immediately reduces query flexibility, but it cuts down data storage considerably and lets us hold onto the data longer.
- “Hourly” records (1hour samples) are kept for as long as the publisher has paid for retention. In time series data modeling, we determined that 1hour samples tend to be the best trade-off between cost of storage and rollup capabilities. The hourly records still allow time-of-day parting and timezone adjustment, unlike daily records, which would be (of course) 24x cheaper. And rolling up from hourly to daily, with a good time series data store, is relatively cheap (a small sketch follows this list).
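That hourly-to-daily rollup is easy to see with a small stand-alone example. The sketch below uses pandas purely for illustration (it is not how our engine is implemented): hourly samples can be shifted into a publisher's timezone and then rolled up into local days, which daily records could never be re-cut to do.

```python
# Illustration only (pandas, not our engine): hourly samples keep enough
# resolution to be timezone-shifted before rolling up to daily totals.
import pandas as pd

hourly = pd.DataFrame(
    {"page_views": range(48)},
    index=pd.date_range("2015-03-01", periods=48, freq="H", tz="UTC"),
)

# Shift to the publisher's local timezone *before* rolling up, so that
# "a day" means the publisher's day rather than a UTC day.
local = hourly.tz_convert("America/New_York")
daily = local["page_views"].resample("D").sum()
print(daily)
```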
Data is also heavily sharded across a large cluster of machines, which allows for rapid query response times via in-memory caches and page caches spread across many solid-state disks. The sharding scheme involves month-grouping all the 1hour samples and day-grouping the 5min samples and raw events.
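Purely as an illustration of that grouping (this helper is not our actual implementation), a shard key for a record could be derived like this:

```python
# Illustrative shard-key helper: hourly samples are grouped by month;
# 5-minute samples and raw events are grouped by day.
from datetime import datetime

def shard_key(granularity, ts):
    """Return the shard grouping for a record at the given timestamp."""
    if granularity == "hourly":
        return ts.strftime("%Y-%m")      # month-grouped
    if granularity in ("5min", "raw"):
        return ts.strftime("%Y-%m-%d")   # day-grouped
    raise ValueError("unknown granularity: %r" % granularity)

print(shard_key("hourly", datetime(2015, 3, 2, 14)))   # -> "2015-03"
print(shard_key("raw", datetime(2015, 3, 2, 14, 37)))  # -> "2015-03-02"
```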
Aggregating the Time Series Data
Aggregation is supported on all of our data through a number of neat tricks. For example, members of our team have done work with probabilistic data structures before, and we make use of HyperLogLog to do efficient cardinality counts on unique visitors. We use fast sum, count, average, and percentile aggregations via queries that resemble “real-time map/reduce jobs” across our cluster. We support arbitrary filtering via a boolean query language that can reduce the consideration set of URLs on which data is being aggregated.
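As an illustration of the unique-visitor idea (using the third-party datasketch package here, which is not necessarily what we run in production), HyperLogLog sketches give approximate per-post counts and can be merged cheaply for site-wide rollups without re-reading raw visitor IDs:

```python
# Illustration with the third-party `datasketch` package; not necessarily
# the HLL implementation we use in production.
from datasketch import HyperLogLog

hll_post_a = HyperLogLog()
hll_post_b = HyperLogLog()

for visitor_id in ("a1", "b2", "c3"):
    hll_post_a.update(visitor_id.encode("utf-8"))
for visitor_id in ("b2", "d4"):
    hll_post_b.update(visitor_id.encode("utf-8"))

# Per-post unique visitor estimates, plus a site-wide rollup by merging
# sketches rather than re-touching raw visitor IDs.
hll_site = HyperLogLog()
hll_site.merge(hll_post_a)
hll_site.merge(hll_post_b)
print(hll_post_a.count(), hll_post_b.count(), hll_site.count())
```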
Integer values and timestamp values are efficiently stored thanks to run-length encoding. The storage cost of raw visitor IDs and repeated string metadata attributes is reduced by storing only their inverted index.
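As a toy illustration of why run-length encoding suits those columns, consider that within a shard many adjacent records share the same sample timestamp, so long runs collapse into a handful of (value, run length) pairs:

```python
# Toy run-length encoding example; a real column store does this at the
# storage layer, but the idea is the same.
from itertools import groupby

timestamps = [1425225600] * 5 + [1425229200] * 3 + [1425232800] * 4

def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

print(rle_encode(timestamps))
# -> [(1425225600, 5), (1425229200, 3), (1425232800, 4)]
```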
All Wired Together, It’s Magic
We are extremely pleased with the way Mage has turned out, and very excited to continue to hack on what must be one of the web’s largest and most useful time series databases. Understanding content performance across billions of page views and millions of unique visitors has been an eye-opening experience, and being able to deliver insights about this data to customers with sub-second latency has been a real “wow” factor for our product.
It has also allowed us to understand many more dimensions of our data than ever before. As my colleague, Toms, put it in a prior blog post about the thought that went into Parse.ly’s UX, our challenge “now lies in finding how to tell the million stories each dimension is able to provide”.
As a summary of some of the new metrics that Mage now supports:
- page views by minute & hour throughout
- visitors by minute & hour throughout
- engaged / reading time (aka “attention minutes”) by minute / hour throughout
- social shares by hour throughout (still in development)
- sorting data by time, visitors, views, or average engagement
- contextual metrics, e.g. time per visitor and time per page, throughout
- breakouts by device
- breakouts by visitor loyalty
- breakouts for multi-page articles
- breakout of traffic recirculation between pages
- benchmarking on all metrics, e.g. above/below site average, percent of site/rollup total
- rollup reporting across multiple sites
And we will only continue to add to these over time as we learn even better ways to understand content and audience.
Summary
- In 2014, we rebuilt Parse.ly’s backend atop a “lambda architecture”.
- Apache Storm and Apache Kafka are used as core data processing technologies.
- A generic content and audience time series data store, called Mage, was born.
- Mage translates analytics requests into distributed queries that return time series aggregates.
- This new backend now powers an AngularJS web application that is a thin client on a rich set of capabilities.
How Has Mage Amazed Us?
- Mage currently stores 2 billion time series records across our publishers, and is growing.
- It allows for horizontal scalability, easy rebuilds/backups, and multi-AZ data distribution.
- Its data sharding scheme satisfies all of our important queries.
- It can handle over 10,000 real-time writes per second from our data firehose.
- It can return sub-second analytics queries from hundreds of concurrent dashboard users.
Are you a Pythonista who is interested in helping us build Mage? Reach out to work@parsely.com.