When "Dumb Pipes" Get Too Smart

I wrote this in 2012, when CloudFlare (an almost awkwardly popular CDN) introduced a bug into websites they hosted using some of their "advanced features" via injected JavaScript that managed to not just break the website but lock up entire web browers by tickling a scheduler starvation bug in WebCore. This issue was active for the better part of a day, until I managed to teach people at CloudFlare what was wrong with their code. AFAIK, there was never any disclosure or post-mortem from them about the issue. All I managed to get from Cloudflare were replies on forums that insisted their somehow not noticing this serious of an issue was OK as it only applied to websites that opted in to this mechanism (as if they were "asking for it" ;P).

While some people (including a developer at Google's Project Zero) have claimed CloudFlare to have an "excellent reputation for transparency", I would instead claim that it is mostly the case that CloudFlare likes to look "triumphant": their reporting of issues is extremely detailed when it comes to wide-scale service and software bugs (ones that affect everyone, not just them), and their reporting of governmental data requests is absolutely admirable; but their reporting tends to talk up their involvement in helping others (as has been the case with Internet-scale routing issues and denial of service attacks), puts them at the center of PR that had little to do with them (as was the case with Heartbleed, where they even were assuring people that private key data could not be leaked: they were later proven wrong), and downplaying their own issues (as we saw yet again with what some are calling Cloudbleed).

In today's case, their article on the subject talks about how quickly they mitgated this issue, but it makes it seem like the issue is something that requires active attack and that they have no reports of such a thing happening. In fact, the bug with Cloudbleed is one that, as far as anyone understands, affects all of their customers, and any information once leaked (which could be done entirely passively) cannot simply be retracted: a lot has been said about Google going to heroic efforts to try to scrub their caches, but in reality people have continued to report finding more files that contain private information that were cached, even pages whose cache has been purged still show result snippets with private information, and there are numerous crawlers that people wouldn't contact. I can even say that at the time of my writing this paragraph (February 24th at 12:05AM US/Pacific), it is still possible to find examples in the wild using Google's cache.

As it stands, CloudFlare is attempting to spin the PR to be upset at Google for not clearing their caches more quickly, arguing that that is why they took so long to disclose this issue (which had started to irk people at Google's Project Zero); however, I maintain: the existence of some information in Google's cache is a lot less important than making certain that everyone who has been using CloudFlare knows their information has been leaked, particularly given that it isn't even truly the case that Google is at the center of this problem, nor is it the case that complete erasure from popular web caches would be sufficient. Regardless, what follows is my article from 2012, as a reminder that when "dumb pipes" start trying to be overly smart, particularly with a Web 2.0 mindset of "move fast and break things", serious bugs will be left in the wake.

Every morning (just like every afternoon, evening, and most of every night, which of course then skews the definition of morning ;P) I find myself on IRC (Internet Relay Chat), the mechanism used by most of the people involved in the iPhone jailbreak community—including jailbreakers, developers, and repositories—to, well, hang out.

While tons of things are talked about, I find that most of my time is actually spent dealing with things that are actively failing: anything from a political issue (an argument has broken out that needs moderation) to technical challenges (whether in something I actually have control over, or something I do not).

Today (2012/07/11), the first thing I was hit with was a conversation between Britta (who works at SaurikIT helping to organize community support) and Optimo (who works at BigBoss as an account manager) about a post on reddit's r/jailbreak by BM99 titled Cydia freezing on app purchases?(iPhone 4S iOS 5.1.1).

Normally, the result of these reports is "a bug is fixed in something no one sees the changes to", but today the issue actually ended up getting much deeper: the bug turned out to be in something none of us had control over, and in fact a tool that is commonly used in a wider development community.

I therefore decided to do a writeup on what the bug was. Then, as I'm the kind of person who doesn't really like to write things unless I can go "all the way", I kept track of enough of the process of finding the bug to do an extensive writeup on the process, which might be of interest to some readers.

Isolating the Culprit

Of course, the first step in attempting to find a bug like this is going to be to fully read through the bug report, so I am including the text from reddit here.

I've been trying to install the metroon theme for dreamboard on my phone and every time I go into the purchase section I get about 10 seconds before cydia freezes. When cydia freezes I can still use the home button and sleep button but nothing inside the app is responsive, even after waiting for 10 minutes nothing happens. I posted this here because I didn't know where else I could have posted, but please if you can help leave your responses as a comment to this post!

Interestingly, while we were all able to replicate the issue with the specific package Metroon, none of BigBoss' packages were affected. As it turns out, there were actually multiple things wrong with that package (all of which have now been reported to ModMyi), but none that would cause Cydia to entirely lock up.

While I had been investigating that package (discounting the other issues) and Optimo had been finding more examples, Britta had been searching Twitter. There she found people complaining about a few other packages hosted by ModMyi, as well as an intriguing conversation between @FranklyOnline (Peter Frank) and @modmyi.

-> Peter Frank: @modmyi Still problem.. Your modmyi-links hangs both Tweet and Mobile Safari on iOS 5.0.1 (iPad1) and 5.1.1 (3GS).

-> Peter Frank: @modmyi Disabling JavaScripts in Mobile Safari atleast stops hanging Safari when visiting modmyi. Bad Ad Banners using JavaScript??

-> ModMyi.com: @FranklyOnline our mobile version has no ads?

-> Peter Frank: @modmyi what ever it is, when disabling JavaScript your links work in Mobile Safari, otherwise, they hang/crashes both Safari and Twitter.

Lo and behold, the issue with ModMyi was not actually isolated to their hosting of Cydia package depictions (Cydia is pretty much a web browser, showing rich content provided, designed, and hosted by the repositories for the packages that they control); whatever the problem, it affected all of the properties being served by ModMyi.

Even more interestingly, that conversation implied that the core problem could not be one of ModMyi's advertisers (repositories can put any content they want on packages, and many of them choose to monetize their traffic using advertisements; sadly, these providers often end up causing other problems that then must be fixed).

After looking through ModMyi's depictions to see all of the different third-party JavaScripts that were being loaded, I proceeded to iteratively clear my cache and use /etc/hosts to block sets of possible candidates until I had managed to narrow down the culprit: CloudFlare, a "modern" CDN.

Round-Trip Time

To explain what CloudFlare is, it is important to first note the key problem with website performance: round-trip time. Even travelling at the speed of light (which will be slowed down as we are using electricity travelling through imperfect wires, and then further slowed down by intermediary computers and processing), to go halfway around the world and back takes 129ms.

While 129ms, a mere fraction of a second, may sound like a small amount of time, it is long enough to be easily noticeable; in fact, it is slow enough that I can clap my hands at about that rate (making it a rather physical time unit).

More importantly, even the simplest of websites will require numerous round-trips: one to look up the name "modmyi.com", one to setup a connection with their web server, and a third to get the actual page. Already, for users in (let's say) the Middle East, we are looking at a half a second just to get the intial content.

At this point, things just get worse: most websites contain numerous associated files, such as images, stylesheets, and scripts. Each of these files will require another round-trip to download (there is a technique called "pipelining" designed to mitigate this problem, but sadly it causes enough other issues that browsers are forced to avoid it).

Content Delivery Networks

One solution to this problem, of course, is to have your website hosted in multiple locations: even just having a second copy of your site available in a country on the other side of the globe (or maybe in another dense region of your users) goes a ways to help.

However, in addition to adding complexity to your operations, it is only feasible for a small (or even for a large) company to manage only so many offsite locations: maybe you have one or two remote servers, but your customers may be spread out among every continent.

Some people rightfully considered this issue to be a "business opportunity", and the idea of the Content Delivery Network, or CDN, was born: a service that does the legwork of building machines everywhere, and then intelligently managing which parts of your site can be served, from where, for whom.

One of the main deciding factors, then, while looking at a CDN is how many "points of presence" (PoPs) the network provides. Akamai, the largest, has over 1200; CDNetworks (the one I use) has over 100 (many fewer than Akamai, but more in the far East, as they are a company that started in Vietnam).

CDNs actually can offer benefits even if your content must eventually be served from your own servers anyway (maybe it is user-specific). For more information, I have in the past done a few writeups on Hacker News, #2823268 and #4204054.

CloudFlare's Pitch

CloudFlare is an oft cited example of this form of business (a CDN). That said, they are fairly small: they only have 14 PoPs, with seemingly no locations at all covering the Middle East (or Mexico, or Russia, or Africa). That said, they don't charge people for bandwidth, so they may be a good option for large sites that don't have any money ;P.

However, they consider their role to be providing more than just bandwidth and latency improvements at the networking level: their selling point is their ability to optimize your website, making modifications that will (hopefully ;P) improve its performance. In other words, rather than being a "dumb pipe", CloudFlare is "smart".

One way to visualize this difference is with a delivery service (one that you can dispatch with a simple phone call to any location to pick something up and then bring it to you): while it is advantageous in such a business to have more locations and more drivers, imagine if the items you ordered were actually improved in transit.

CloudFlare Preloader

One of the features that CloudFlare provides (for their Pro account level) is a "preloader": when your website is executed in the user's browser, CloudFlare additionally downloads a list of commonly-accessed files and makes certain, in the background, that the browser has each of them cached.

While maybe a nifty feature in some contexts, in the case of ModMyi the list of files seems to be a bunch of content for ModMyi's main website... the result being that millions of users of Cydia are wasting bandwidth and limited cache space downloading a number of files from ModMyi that are both a) large (as they are designed for desktop browsers) and b) useless.

Regardless, assuming it works correctly, a key thing to realize about the preloader is that to perform this magic it must modify your website: it needs to add JavaScript code to your site that will execute on the client's computer, making various server requests with the goal of downloading these files.

Back to Debugging

Alright, with that context out of the way, I can now return to the story of the actual lockup issue that was being seen with Cydia today. The first thing that I did was to replicate the problem on my device (which is exceptionally useful as I can install a larger/slower debug build of Cydia), and then use gdb to see where it got stuck.

(gdb) bt #0 0x309010d8 in __psynch_mutexwait () #1 0x31ed0c4a in pthread_mutex_lock () #2 0x32f752c6 in _WebTryThreadLock () #3 0x32f75292 in WebThreadLock () #4 0x329574a8 in -[UIWebBrowserView _webTouchEventsRecognized:] () #5 0x0000b5fc in $UIWebBrowserView$_webTouchEventsRecognized$ () #6 0x371c922a in -[NSObject performSelector:withObject:] () #7 0x32956da6 in -[UIWebTouchEventsGestureRecognizer _processTouches:withEvent:type:] () #8 0x329569e4 in -[UIWebTouchEventsGestureRecognizer touchesBegan:withEvent:] () #9 0x3281e024 in -[UIWindow _sendGesturesForEvent:] () #10 0x3281dddc in -[UIWindow sendEvent:] () #11 0x328044ec in -[UIApplication sendEvent:] () #12 0x32803d2c in _UIApplicationHandleEvent () #13 0x31377df2 in PurpleEventCallback () #14 0x3723e552 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ () #15 0x3723e4f4 in __CFRunLoopDoSource1 () #16 0x3723d342 in __CFRunLoopRun () #17 0x371c04dc in CFRunLoopRunSpecific () #18 0x371c03a4 in CFRunLoopRunInMode () #19 0x31376fcc in GSEventRunModal () #20 0x32832742 in UIApplicationMain () #21 0x000383ca in main ()

What we are seeing here is that the "main thread" (which on iOS is typically where all of the UI event processing happens) is stuck attempting to "lock" (get exclusive access to) the "web thread" (where a lot of text rendering and, more importantly, JavaScript events and code are executed).

The next question, of course, is "why isn't this working", and the answer is simply "the web thread is busy". Using gdb to again look at what is happening, this time on the web thread, we get a new stack trace.

(gdb) thread 3 [Switching to thread 3 (process 1071), "WebThread"] 0x3306f10a in WebCore::Frame::keepAlive () (gdb) bt #0 0x3306f10a in WebCore::Frame::keepAlive () #1 0x3306fa8c in WebCore::JSDOMWindowBase::globalExec () #2 0x330cdfc2 in WebCore::JSEventListener::handleEvent () #3 0x330cded2 in WebCore::EventTarget::fireEventListeners () #4 0x32fd5af2 in WebCore::EventTarget::fireEventListeners () #5 0x330e1ef2 in WebCore::EventTarget::dispatchEvent () #6 0x32fea5ba in WebCore::EventTarget::dispatchEvent () #7 0x33565950 in WebCore::MessagePort::dispatchMessages () #8 0x335d9276 in WebCore::ScriptExecutionContext::dispatchMessagePortEvents () #9 0x3325426a in WebCore::performTask () #10 0x322057ba in WTF::dispatchFunctionsFromMainThread () #11 0x3723ea62 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ () #12 0x3723e6c8 in __CFRunLoopDoTimer () #13 0x3723d29e in __CFRunLoopRun () #14 0x371c04dc in CFRunLoopRunSpecific () #15 0x371c03a4 in CFRunLoopRunInMode () #16 0x3301712e in RunWebThread () #17 0x31eddc1c in _pthread_start () #18 0x31eddad8 in thread_start ()

Of course, any of these functions could be causing our problem, so it is important to get a few more stack traces. The common story seems to be: it is stuck dispatching messages, and never seems to finish. This means that the web thread is constantly busy, and cannot do any work for the main thread.

To verify this, we can quickly attempt to work our way back up the stack. First, we turn on scheduler-locking (which allows us to watch just this one thread, to keep from being distracted by work happening elsewhere) and then use next to move through each frame.

(gdb) set scheduler-locking on (gdb) next Single stepping until exit from function <WebCore::Frame::keepAlive> which has no line number information. 0x3306fa8c in WebCore::JSDOMWindowBase::globalExec () (gdb) next ... (gdb) next Single stepping until exit from function <WebCore::EventTarget::dispatchEvent> which has no line number information. 0x33565950 in WebCore::MessagePort::dispatchMessages () (gdb) next Single stepping until exit from function <WebCore::MessagePort::dispatchMessages> which has no line number information.

The result is that we are now stuck: the function dispatchMessages is unable to exit for some reason, and has been forced to just sit around dispatching message events forever. As, eventually, someone will need access to this thread, if we can't get out of this function we can be assured the main thread will eventually get stuck.

Scheduler Hijacking and Starvation

As we are probably all curious why this happens (right? hopefully? ;P), we can be thankful that WebCore is (mostly) open source, and the version used on iOS is available from one of Apple's websites. In this case, we want to look at MessagePort.cpp.

void MessagePort::dispatchMessages() { ASSERT(started()); OwnPtr<MessagePortChannel::EventData> eventData; while ( m_entangledChannel && m_entangledChannel-> tryGetMessageFromRemote(eventData) ) { OwnPtr<MessagePortArray> ports = MessagePort::entanglePorts( *m_scriptExecutionContext, eventData->channels()); RefPtr<Event> evt = MessageEvent::create( ports.release(), eventData->message()); ExceptionCode ec = 0; dispatchEvent(evt.release(), ec); ASSERT(!ec); } }

As we can see in this code (which I reformatted slightly, but did not modify except to remove a comment), this function will not exit until there are no messages left to process: every time it is done dispatching an event, it actively attempts to get another message, and only when it cannot is it willing to allow other work to occur.

This is an example of a bad scheduler design: one that is capable of being hijacked (specifically, this scheduler is "unfair", allowing some of the threads that need access to this resource to "starve"). The result is that if you can make certain there are always messages to process, you can simply own the web thread forever.

Implementing Faster nextTick

Knowing what we are looking for (a message being posted) we can now pull apart CloudFlare's JavaScript to find the code that is actually causing the lockup. They use a minifier that somewhat obfuscates the code, but if we isolate what we care about and add back some whitespace, it is fairly readable.

(function(l) { var F = l.setTimeout; W = function() { if ("MessageChannel" in l) { var a = new MessageChannel, b = []; a.port1.onmessage = function() { b.shift()() }; return function(c) { b.push(c); a.port2.postMessage("") } } return function(a) { F(a, 0) } }(); var u = CloudFlare.define; u("cloudflare/utility", function() { var a = {}; a.nextTick = W; return a }); })(window);

What we have here is a function called nextTick being exported out of the cloudflare/utility package. The function's goal seems to be to take a block of code and delay executing it until "very soon after now", using either setTimeout with a timeout of 0 (classic) or taking advantage of an HTML5 feature: message channels.

Using message channels for this purpose seems to be "the new thing", with various libraries supporting it; we can even learn more about the reasoning from articles such as "window.nextTick" on sites like NonBlocking.io.

The general idea seems to be that if you are using setTimeout very quickly—such as to schedule parts of an animation—there is actually a minimal timeout of 4ms that many browsers enforce. While you can also get around this using standard HTML5 messages, these would affect the normal operation of the website CloudFlare is serving (something CloudFlare should be going to lengths to avoid).

However, with message channels, you can make your own private messaging port and then send it messages that no other scripts on the page can see: this isolates your effects from the rest of the website. As we have seen, messages are delivered very quickly: in fact, so quickly as to starve out anyone else attempting to run code. Some people, of course, may consider this a feature ;P.

Mysterious Mistake in Deferred Loop

Sadly, one of the common use cases for something like nextTick would be to first do work, and then schedule more work to be done on the next tick... exactly the behavior that would cause a lockup. Of course, if you do that forever, the assumption is that you are just burning CPU for no good reason, which would be silly, especially for CloudFlare.

We can therefore assume there is a bug somewhere, and we now need to figure out where. As we are already running in a modified web browser (Cydia), it was very easy for me to steal the requests for CloudFlare's JavaScript and make modifications, both to the script and to the browser, to let us get debugging information out even while the UI is locked up (adding a special window.cydia.log that sends data to a file).

Eventually, I was able to track the code using nextTick down to a script called "oracle.js" that is downloaded and cached by CloudFlare in HTML5's local storage, and then injected into the page when there doesn't seem to be any data cached (there are various reasons why they might have added this level of indirection which I have decided to not bother going into here).

With some help from a portable JavaScript stacktrace library I found (developed by Eric Wendelin), I was able to track down the use (and eventually the source file) that leads to the lockup.

// cdn-cgi/nexp/aav=1499794953/cloudflare/oracle.js resolvePrecacheList().then(function(list) { var index = 0; (function() { var next = arguments.callee; if (index === list.length) user.setCookie("CF_ORACLE", "PRECACHED"); else if (list[index]) loadResource(list[index]).then(next, next); else utility.nextTick(function() { next(); }); index++; })() }, function(error) { console.error(error); });

This code does some things that are weird (such as using arguments.callee, a deprecated feature that here is trivially replaced by naming the anonymous function in its enclosing scope, as that's the only possible path to call it anyway), but the real mistake is how it tracks its termination condition.

What the function does is wait for the list of entries to cache; once it is loaded, it iterates through the list item by item, loading each one. This iteration is performed by storing the array index in a local variable of the enclosing function, and relying on scheduled callbacks for each instance of the loop.

Therefore, index will either be "the end of the list" (in which case it sets a cookie and quits), "pointing at a valid entry" (in which case it will ask for that entry to be loaded, going back into the loop for either success or failure), or "pointing at nothing" (in which case it just skips to the next one).

Dynamic Type System Failure

Honestly, this code looks "correct": even though index++ will move the index one past the end of the list, by the time it gets to that point we already should no longer be calling this function. Barring a bug in the deferred execution library (maybe one that causes this function to get called "an extra time"), the loop should terminate.

That said, we know it doesn't work, so something else has to be wrong here, and it turns out that there is: the whole time we are staring at this (at least when I was: maybe the people reading this article know better) we have been analyzing it under the totally reasonable assumption that list is an array... what if it is, in fact, an object?!

function resolvePrecacheList() { var result = deferred.defer(); loader.ajax({ url: '/cdn-cgi/async/cf/uri/' + config.oracle + '.json', method: 'GET', complete: function(data) { data = json.parse(data); result.resolve(data.top_static || []); }, }); return result.promise; }

As we can see, the code for getting the list of files to precache (which I both reformatted and cut all of the error handling logic from, for brevity) parses a JSON object sent from the server: no other modifications are made locally except to pull out the top_static field. So, we now just go and pull the original data being parsed (as before, I have added some whitespace and redacted the massive center of the long list of files).

{"top_static":{ "0":"http:\/\/modmyi.com\/clientscript\/vbulletin_css\/style00130l\/forumhome-rollup.css", "1":"http:\/\/modmyi.com\/images\/misc\/tab-collapsed.png" "2":"http:\/\/modmyi.com\/info\/d\/custom\/pinstripes.png", "3":"http:\/\/modmyi.com\/images\/pagination\/last-right.png", ... "251":"http:\/\/modmyi.com\/images\/_ipadapps\/ipadapps-15.jpg", "252":"http:\/\/modmyi.com\/info\/d\/custom\/f-style.css" },"cuid":"1164acfaa217df6ae5fe85e3351038ec","zid":488773}

The problem is now fairly evident: the list of files to cache is actually an object, not an array, and thereby does not have length parameter. This means that the termination condition (index === list.length) of the deferred loop construct will never hit, and it will instead fall off the end and get stuck in a tight loop over nextTick, locking up the browser (due to the behavior of message dispatching in WebCore).

But... What Broke Today?

We might then assume that the problem today happened because CloudFlare changed their JSON serializer: that there is now a bug on the server that was not there yesterday that is causing this array to accidentally get serialized as if it were an object.

(This may seem far-fetched, but if the server is itself written in a fairly dynamic language, it might fail to detect what things are arrays and which are not; while developing Cycript, which has code for generating JavaScript from Objective-C data structures, I once ended up with similar output.)

However, while I was extracting the preloader code out of the local storage database, I found an older entry that had been previously cached under a slightly different filename. Weirdly, this older version is minified (unlike the newer code), but it is thankfully again fairly reasonable to read now that we've seen the original.

// cdn-cgi/nexp/v=615554213/cloudflare/oracle.js k().then(function(a) { var b=0; (function() { var d = arguments.callee; b === a.length || b > 100 ? c.setCookie("CF_ORACLE", "PRECACHED") : a[b] ? l(a[b]).then(d, d) : h.nextTick(function() { d() }), b++ })() }, function(a) { i.error(a) })

The key difference is that the old version of the code had an extra clause in its termination condition: || index > 100. This provides a fallback guarantee that the iteration will eventually stop, even if it had long since walked off the end of the list and was pulling blanks—a little inefficient for a bit, but it would terminate.

With this finding, it could very well be that the JSON encoder bug has been around for a long time, and that it simply was not symptomatic before until someone noticed that seemingly-unneeded check and removed it (which makes sense for various reasons, including "why are we sending people an array with 253 entries and then only using the first 100 of them?" ;P).

CloudFlare's Risk Profile

Now, given all of this, it occurs to me to wonder how many people who use CloudFlare really are thinking through the idea that they are not just using a normal CDN, or even a normal proxy service, but that their website's functionality is actually being modified in transit, and that at any point the people working on CloudFlare may introduce a bug.

Of course, that's true of everything: that's true of your hosting company, and is true of the power company supplying them power. However, as we can see here, CloudFlare is making dynamic code changes to websites, and is using new features such as HTML5 message channels and local storage: features that might not even be implemented correctly in the browser, and which you may be actively avoiding for your own website due to the results of your testing process.

So, once you sign up for this service, you may find that your website works on your development machines, that it continues to work on your staging systems, that it even works today on production... but that in the middle of the night your website may randomly be upgraded to something that interacts poorly with your users' browsers, and as it doesn't effect everyone, it goes unfixed until you complain.

Of course, the tradeoffs and benefits of this service may be valuable: I, for example, continue to use Amazon EC2 despite the occasional outage, as the things it has made possible in my development cycle have been game changing to what is otherwise such a small operation. Somehow, though, this seems much more intrusive.

Comments

(As I mostly use HN to comment on other peoples' sites, I will simply leave a link to this article's Hacker News submission if anyone feels the need to comment on mine.)