
Surprise! Mozilla can produce near-native performance on the Web

We put Mozilla’s JavaScript subset—asm.js—to the test.

Ars Staff
Credit: Aurich Lawson / Thinkstock

In a bid to make JavaScript run ever faster, Mozilla has developed asm.js. It’s a limited, stripped down subset of JavaScript that the company claims will offer performance that’s within a factor of two of native—good enough to use the browser for almost any application. Can JavaScript really start to rival native code performance? We’ve been taking a closer look.

The quest for faster JavaScript

JavaScript performance became a big deal in 2008. Prior to this, the JavaScript engines found in common Web browsers tended to be pretty slow. They were good enough for the basic scripting that the Web used at the time, but largely inadequate for those wanting to use the Web as a rich application platform.

In 2008, however, Google released Chrome with its V8 JavaScript engine. Around the same time, Apple brought out Safari 4 with its Nitro (née Squirrelfish Extreme) engine. These engines brought something new to the world of JavaScript: high performance achieved through just-in-time (JIT) compilation. V8 and Nitro would convert JavaScript into pieces of executable code that the CPU could run directly, improving performance by a factor of three or more.

Mozilla and Microsoft followed suit. Mozilla introduced TraceMonkey in Firefox 3.5 in 2009 and Microsoft released Chakra in 2011.

JIT compilation provided great scope for accelerating the performance of JavaScript programs, but it has its limits. The problem is JavaScript itself. The behavior of the language makes it hard to optimize. In languages such as C and C++, the behavior of a program is baked in when the program is compiled. Languages like Java and C# add a little more flexibility, but most of the time they share that same characteristic. The functions and data that make up a particular class are fixed when the program is compiled.

This isn’t true of JavaScript. In JavaScript, the way an object is meant to behave can change at more or less any time. A JIT engine could produce executable code to make an object behave one way, and then that object could be modified to invalidate the executable code. This means that the executable code has to be quite conservative to guard against this kind of modification. From time to time, bugs have cropped up that cause bad code to be generated.

Browser developers are, therefore, in a frustrating position. They want scripting engines that are faster to enable the browser to be used for a wider range of applications, but their efforts to improve performance are hamstrung by JavaScript itself. The language simply isn’t designed for high performance optimization.

Breaking the speed limit by changing the rules

This has all led to a number of efforts to change JavaScript itself. The first notable one is Google Dart. Google Dart is a scripting language that is aimed at the same kind of programs as JavaScript is currently used for, with syntax that is broadly familiar to JavaScript developers but without many of the traits that make JavaScript difficult to optimize.

Google’s original ambition was to have Dart integrated into the browser, using a Dart-specific engine where available or translating to JavaScript when not. Google also developed Dartium, a fork of its Chromium browser (Chromium being the open-source counterpart to Chrome) that includes the Dart engine.

As a practical matter, getting both Web and browser developers to embrace an all-new language with an all-new engine is an uphill struggle. JavaScript isn’t going to go away any time soon, so adding additional languages simply increases the complexity of browsers and spreads development resources thinner.

asm.js

Mozilla proposed an alternative. Rather than using an entirely new language, Mozilla defines a strict subset of JavaScript that it calls asm.js. The asm.js subset of JavaScript is very limited. It eschews, for example, JavaScript’s object-oriented constructs. As a result, it also eschews many of JavaScript’s hard-to-optimize dynamic capabilities.

Instead of using objects and classes, asm.js programs manipulate a large array representing “memory” in a manner not entirely dissimilar to the way C and C++ programs manipulate system memory. This does not mean that concepts such as objects and classes cannot be used. It means instead that they must be implemented and used by asm.js programs in the same way that C++ compilers implement and use them. In a C++ program, an object in memory is typically represented by the memory address of the class’s v-table (a table of all the functions belonging to the object’s class) followed by the storage for the object’s data. So too in asm.js: the memory array would contain, in consecutive elements, the array index of the v-table and then the object data.
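As a concrete (and entirely hypothetical) sketch, here is how a compiler might lay out a small object in a flat memory array using plain JavaScript. The names and layout are ours, for illustration only, not Emscripten's actual scheme:

```javascript
// A hypothetical "Point" object stored in a flat memory array, laid out
// the way a C++ compiler might lay it out: function-table index first,
// then the data members.
var HEAP32 = new Int32Array(1024);  // the "memory"
var nextFree = 0;                   // a trivial bump allocator

// The "v-table": an array of functions, indexed by number rather than
// by machine address.
var functions = [
  function magnitudeSquared(addr) {
    var x = HEAP32[addr + 1], y = HEAP32[addr + 2];
    return x * x + y * y;
  }
];

function newPoint(x, y) {
  var addr = nextFree;
  nextFree += 3;
  HEAP32[addr] = 0;      // slot 0: index into the function table
  HEAP32[addr + 1] = x;  // slot 1: data member x
  HEAP32[addr + 2] = y;  // slot 2: data member y
  return addr;           // an "object reference" is just an array index
}

var p = newPoint(3, 4);
var result = functions[HEAP32[p]](p);  // "virtual" dispatch via the table
```

Note that no JavaScript object ever exists here: the "object" is three consecutive array elements, and dispatch goes through a numeric index.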

asm.js also contains special hints to indicate which data types are being used. In traditional JavaScript, numbers can behave more or less like integers, or more or less like floating point numbers. The behavior changes depending on the operations being performed. For example, JavaScript will let you perform bitwise operations on floating point numbers by coercing those numbers into integers first. This coercion happens automatically and implicitly, meaning that JIT compilers cannot safely assume that a number is of one type or the other. asm.js uses explicit indicators to specify whether numbers (and operations on those numbers) should use integer-like behavior or floating point-like.
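The coercion is easy to demonstrate in ordinary JavaScript:

```javascript
// Bitwise operators silently coerce their operands to 32-bit integers:
var r1 = 2.9 | 0;         // 2: the fractional part is discarded
var r2 = -2.9 | 0;        // -2: truncation is toward zero, not flooring
var r3 = 0x80000000 | 0;  // -2147483648: wraps into the signed 32-bit range
```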

This representation is much lower level than that found in traditional JavaScript programs, but one important property remains: it's nonetheless still JavaScript. The big memory array uses the (relatively recently introduced) JavaScript Typed Arrays. These were originally created for WebGL, but they're now available in all modern browsers, including the WebGL-less Internet Explorer 10. The number type indicators similarly use JavaScript constructs. For example, to indicate that a number is an integer, asm.js uses "bitwise or with zero" (an operation that forces JavaScript to coerce to integer-like behavior, but which does not change the number's value).
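A minimal asm.js-style module (a sketch of ours, not code from Mozilla) shows the annotations in action. Because it's plain JavaScript, it runs unchanged in engines that have never heard of asm.js:

```javascript
function MiniModule(stdlib, foreign, heap) {
  "use asm";  // marker that lets asm.js-aware engines opt in

  function addInts(a, b) {
    a = a | 0;           // parameter annotation: a is integer-like
    b = b | 0;           // parameter annotation: b is integer-like
    return (a + b) | 0;  // return annotation: the result is an integer
  }

  function halve(x) {
    x = +x;              // parameter annotation: x is a double
    return +(x / 2.0);   // return annotation: the result is a double
  }

  return { addInts: addInts, halve: halve };
}

var mod = MiniModule(this, {}, new ArrayBuffer(0x10000));
var five = mod.addInts(2, 3);
var twoPointFive = mod.halve(5);
```

An asm.js-aware engine can compile `addInts` straight to integer machine instructions; any other engine just sees the coercions as ordinary, harmless JavaScript operators.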

The result is that, unlike Dart programs that need a Dart engine or explicit translation to JavaScript, asm.js programs already run in any browser. They’re just JavaScript programs, albeit weird JavaScript programs that don’t look like anything that a human would ever produce.

Fewer features mean better performance

Browsers that recognize and explicitly support asm.js can, however, take advantage of that knowledge to optimize more aggressively. An engine that knows about asm.js also knows that asm.js programs are forbidden from using many JavaScript features, so it can produce much more efficient code. Regular JavaScript JITs must include guards to detect dynamic behavior; asm.js JITs need no such guards, because asm.js forbids that behavior outright. This simpler model (no dynamic behavior, no memory allocation or deallocation, just a narrow set of well-defined integer and floating point operations) enables much greater optimization.

The fact that asm.js doesn't look like JavaScript any human would produce might seem like a problem. Precious few developers of native code programs use assembler, and asm.js is even more feature-deprived than most real assembly languages. Mozilla doesn't really intend for developers to write asm.js programs directly, however. Instead, the idea is that compilers use asm.js as the target, with programs themselves written in some other language.

That language is typically C or C++, and the compiler used to produce asm.js programs is another Mozilla project: Emscripten. Emscripten is a compiler based on the LLVM compiler infrastructure and the Clang C/C++ front-end. The Clang compiler reads C and C++ source code and produces an intermediate platform-independent assembler-like output called LLVM Intermediate Representation. LLVM optimizes the LLVM IR. LLVM IR is then fed into a backend code generator—the part that actually produces executable code. Traditionally, this code generator would emit x86 code. With Emscripten, it’s used to produce JavaScript.

Emscripten can be used in two modes. It can produce regular JavaScript and it can produce asm.js JavaScript. In both cases, the output would not be described as human-readable. Just as with asm.js, the regular JavaScript uses the basic concept of a big array to represent “memory” with operations performed on that array. It was the success of this approach that led to the development of asm.js: asm.js is a formalized set of rules for how this style of JavaScript should be written.

So that’s what asm.js is. The real question, however, is how fast does it go? We’ve built a number of common benchmarks using Emscripten to take a look.

The travails of benchmarking

Benchmarking is a tricky subject. There are lots of benchmarks out there, and through necessity we’ve excluded far more than we’ve included. The browser environment is limited in many ways; JavaScript programs have limited network connectivity, limited facilities for persistent storage, limited access to graphics and audio hardware, and so on. To that end, we focused on benchmarks that are reasonably self-contained and computationally focused.

First, a trio of classic benchmarks: Whetstone, Dhrystone, and LINPACK.

Whetstone is a floating point benchmark that can be traced back to 1972. It’s a synthetic benchmark (i.e. one that does not correspond directly to any useful program) that was originally designed to be representative of the kinds of scientific programs run at the UK’s National Physical Laboratory at the time. It performs a range of simple operations: addition and subtraction of arrays of numbers, trigonometric calculations, exponentiations, and square roots.

Dhrystone, the name a riff on Whetstone, is a synthetic integer benchmark first created in 1984. Just as Whetstone was produced by analyzing the instruction mix of floating point programs and replicating that in a benchmark, Dhrystone was based on the mix of integer instructions performed by typical programs of the day. It tests things like function calls, string manipulation, and array accessing.

Two versions of Dhrystone are in common use, the original version 1.x family and an updated version 2.x family from 1988. Dhrystone 2.x was designed to produce scores comparable to and compatible with 1.x, but it had code structured in such a way as to prevent certain compiler optimizations. Specifically, Dhrystone 2.x split the source code into a pair of files with some functions in one file and others in a second file. This prevented C compilers of the day from performing tricks such as eliminating the function calls and instead writing the function bodies directly at the point at which they’re called.

These techniques were not entirely successful in 1988; they’re almost wholly useless in 2013. Modern compilers contain facilities specifically to optimize functions even when they are split across multiple files. This gave us two options: either figure out the right pessimizations to add to the source code to prevent the compiler from doing the best job it could or acknowledge that real programs do in fact use compiler optimizations. As such, we used the (simpler) Dhrystone 1.x code and acknowledged that the compiler might subvert some aspects of the benchmark. This may make the results less useful if trying to compare to other machines, but when comparing different compilers on the same machine, it’s unimportant.

Dhrystone and Whetstone are both old, and this limits them in some ways. On modern computers, they’ll fit entirely within cache, for example. So they give no indication of the interplay between the processor and its memory subsystem. However, these tests remain in common use. The small size is in some way advantageous. Dhrystone is routinely used to gauge the performance of low-end processors, particularly those used in embedded systems, precisely because it is so small and simple that it can run on almost anything.

The original LINPACK benchmark is another old floating point test, dating back to 1979. Unlike Whetstone and Dhrystone, LINPACK performs meaningful calculations: specifically, it solves systems of linear equations, performing lots of computations on matrices of numbers.
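For a sense of the work involved, here is a toy JavaScript version of the kind of calculation LINPACK times: solving A·x = b by Gaussian elimination with partial pivoting. This is our own simplified sketch, not the benchmark's code:

```javascript
// Solve the linear system A x = b by Gaussian elimination with
// partial pivoting: a toy version of the work LINPACK measures.
function solve(A, b) {
  var n = b.length, i, j, k, t;
  for (k = 0; k < n; k++) {
    // Partial pivoting: bring the row with the largest |A[i][k]| to row k.
    var p = k;
    for (i = k + 1; i < n; i++)
      if (Math.abs(A[i][k]) > Math.abs(A[p][k])) p = i;
    t = A[k]; A[k] = A[p]; A[p] = t;
    t = b[k]; b[k] = b[p]; b[p] = t;
    // Eliminate the entries below the pivot.
    for (i = k + 1; i < n; i++) {
      var f = A[i][k] / A[k][k];
      for (j = k; j < n; j++) A[i][j] -= f * A[k][j];
      b[i] -= f * b[k];
    }
  }
  // Back substitution.
  var x = new Array(n);
  for (i = n - 1; i >= 0; i--) {
    var s = b[i];
    for (j = i + 1; j < n; j++) s -= A[i][j] * x[j];
    x[i] = s / A[i][i];
  }
  return x;
}

// 2x + y = 3 and x + 3y = 5 have the solution x = 0.8, y = 1.4.
var solution = solve([[2, 1], [1, 3]], [3, 5]);
```

The triple-nested elimination loop is why LINPACK scales so readily: doubling the matrix size multiplies the floating point work roughly eightfold.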

Unlike Dhrystone and Whetstone, LINPACK has scaled to modern systems. The matrices can simply be made ever larger to increase the problem. A variant of LINPACK, HPL, is commonly used on supercomputer clusters, and LINPACK performance is used to rank supercomputers in the Top500 list.

For all three of these benchmarks, a variety of near-identical C versions exist. The ones we used were taken from here. The code as downloaded contains minimal changes sufficient to build the software on typical PCs. We subsequently modified them to use different timing routines (more on this later), remove some non-standard low-level hardware probing, and remove unnecessary file I/O.

All three produce their own score that is intended to give some indication of "performance per second." For LINPACK, the result is simply a count of FLOPS, floating point operations per second. For Dhrystone, the result is VAX MIPS. This is an indication of the performance relative to the ancient DEC VAX 11/780 computer. The VAX 11/780 was notionally a machine capable of one million instructions per second (MIPS), and it could run the Dhrystone tests 1757 times per second. A score of 2 VAX MIPS in Dhrystone means, therefore, that a system is twice as fast as the VAX 11/780 at running Dhrystone. Whetstone similarly produces a count of WIPS, Whetstone Instructions Per Second.
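The conversion itself is trivial arithmetic; with a made-up measured rate, it looks like this:

```javascript
// VAX MIPS: Dhrystones per second divided by the VAX 11/780's rate of
// 1757 Dhrystones per second. The measured rate here is hypothetical.
var dhrystonesPerSecond = 17570000;
var vaxMips = dhrystonesPerSecond / 1757;
// vaxMips is 10000: ten thousand times the VAX 11/780's Dhrystone rate
```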

The next industry standard benchmark was the STREAM memory bandwidth benchmark. This benchmark shouldn’t tax the JavaScript engine particularly, but it should give some indication of how efficient the large memory array is when compared to just accessing memory directly. It uses four different subtests: one simply copies memory from one array to another; the others read one or more values, perform some kind of computation, and then write the result. It produces a set of four scores measured in megabytes per second, one for each sub-test.
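The four kernels are tiny loops over arrays. Written directly in JavaScript with typed arrays (roughly the shape that Emscripten's output takes; the array size here is shrunk for illustration), they look like this:

```javascript
// The four STREAM kernels over JavaScript typed arrays. Real STREAM uses
// arrays far too large to fit in cache; N here is tiny for illustration.
var N = 1000;
var a = new Float64Array(N).fill(1.0);
var b = new Float64Array(N).fill(2.0);
var c = new Float64Array(N);
var q = 3.0;                                         // scale factor

for (var i = 0; i < N; i++) c[i] = a[i];             // copy
for (var i = 0; i < N; i++) b[i] = q * c[i];         // scale
for (var i = 0; i < N; i++) c[i] = a[i] + b[i];      // add
for (var i = 0; i < N; i++) a[i] = b[i] + q * c[i];  // triad
```

Because each loop does almost no arithmetic per element, the scores are dominated by how quickly elements can be read from and written to the arrays.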

In all four of these well-known tests, higher scores indicate better performance.

Now for something written in this century

The remaining tests were taken or adapted from The Computer Language Benchmarks Game. This set of benchmarks includes a wide range of routines in a variety of languages that all solve the same set of problems. Some of them are loosely based on real-world problems (for example, performing regular expression matches against DNA sequences); others are purely synthetic (such as allocating and deallocating data structures). Anyone is free to submit source code that solves the various problems, and many of the benchmarks include both generic C++ routines and highly optimized platform-specific equivalents.

In general, we took the highest-ranked pure C++ routines for each test. Versions that used, for example, SSE functions were excluded, as were those with explicit multithreading. Code with implicit multithreading using the OpenMP API was usable, as OpenMP can simply be turned off to produce a single-threaded program.

We did not use all of the tests from the Benchmarks Game. Two of them, chameneos-redux and thread-ring, are explicitly multithreaded and hence not supported in the browser environment. One of them, k-nucleotide, is not explicitly multithreaded but only has multithreaded submissions.

Of the rest, we folded three of the tests—fasta, reverse-complement, and regex-dna—together to produce one combined test. The fasta test produces pseudo-random DNA sequences that use the FASTA file format. The reverse-complement test reads a DNA sequence and emits its reverse complement: the sequences of bases that the strand binds to. The regex-dna test searches DNA sequences for particular patterns.

The original Benchmark Game tests use fasta to produce an output file, which is then read by the reverse-complement and regex-dna tests. In the original tests, the reading and writing is done by redirecting input and output to and from files (e.g. regex-dna < fasta.txt).

This is somewhat awkward in a browser environment, so the tests were modified in two ways. First, all three tests were combined into one super-test that we've named fasta-combo. Second, the data was passed between the three test sections in memory rather than through files and redirected input.

The other tests used are binary-trees, which constructs a bunch of data structures and then gets rid of them; fannkuch-redux, which generates lists of numbers and re-orders them according to a certain set of rules; mandelbrot, which produces a portable bitmap of the well-known fractal the Mandelbrot set; meteor-contest, which counts the number of ways that some puzzle pieces can be assembled; n-body, which simulates the orbits of the Jovian planets; and spectral-norm, which calculates the square root of the largest eigenvalue of an infinite matrix.

All of these tests are measured by their execution time. As such, lower scores mean better performance.

The tests were also consistently modified so that they would record and display on-screen their own execution time. This was done more for convenience than idealism. Although UNIX machines are generally equipped with the oh-so-handy time program to measure how long something takes, Windows has no ready equivalent. Having “invasive” timing that’s integrated into the tests means that the timings will tend to be slight underestimates. For native programs, the timer won’t include the initial startup and initialization; for the asm.js programs, it won’t include the initial asm.js compilation. We felt that the convenience in using the tests on multiple platforms outweighed this slight inaccuracy.
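The wrapper involved is simple; here is a sketch (the names are ours) of the kind of self-timing the tests use:

```javascript
// Run a benchmark function and record its own elapsed time. Startup and
// (for asm.js) initial compilation happen before the timer starts, so
// the figure slightly underestimates total cost.
function timed(benchmark) {
  var start = Date.now();  // performance.now() where available
  var result = benchmark();
  var elapsedMs = Date.now() - start;
  return { result: result, elapsedMs: elapsedMs };
}

var run = timed(function () {
  var sum = 0;
  for (var i = 0; i < 1000000; i++) sum += i;
  return sum;
});
// run.result holds the benchmark's answer; run.elapsedMs its runtime
```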

Finally, a disclaimer of sorts. These are not the only benchmark programs that could be written. They may not be the best benchmark programs that could be written. The programs themselves might be improvable in ways that will meaningfully impact the results. Such modifications might change the winners and losers, such as they are. These benchmarks may not be even remotely representative of the workloads that you care about. These are not issues specific to these benchmarks. They’re issues endemic to all benchmarks, and ultimately, the only benchmark that truly matters is “how well does my specific application work?” In general, that is a question you will have to answer for yourself.

Get it on GitHub

For our testing, we built all the code (available on GitHub with a Visual Studio 2012 project and a GNU Make makefile) with clang 3.2 and Emscripten incoming@1ed2d7f to produce both conventional JavaScript and asm.js JavaScript. Visual Studio 2012 was used to produce native code with basically all the optimizations turned on (including auto-vectorization). To ensure compatibility with some of the source code, the November CTP preview version of the compiler was used. This shouldn’t alter the optimization at all, but it enables it to handle some C++ features used in some of the test programs. The clang-based compiler is the only option for using Emscripten; we used Visual Studio for the native binaries as it’s the standard choice on Windows, is widely used, generally has decent all-around performance, and generally produces working output. Clang on Windows is currently best described as a “work in progress.”

At the time of writing, the asm.js compiler was only available in Firefox Nightly builds. To that end, we used Nightly 2013-05-07 on 64-bit Windows 8. For comparison, we also used 32-bit Internet Explorer 10.0.4.

Emscripten can produce both standalone JavaScript files, suitable for use with environments such as node.js (a standalone wrapper for Chrome’s V8 engine) and jsshell (a standalone wrapper for Firefox’s JavaScript engine), and HTML files with the scripts embedded. The HTML output additionally provides access to various browser features such as Canvas and WebGL for 2D and 3D graphics, respectively. For most of the testing, we used the HTML output because we wanted to be able to compare to Internet Explorer, which has no tool comparable to node.js or jsshell.

We tried various versions of Chrome (stable, dev, and canary branches) but the tests would not run consistently in any of them, most likely due to the amount of memory they use. We assigned 512MB to the memory array used in JavaScript, as this was the smallest power-of-two size that allowed all the tests to run (asm.js requires its memory array size to be a power of two). Internet Explorer handled this with aplomb, but Chrome consistently produced sad tabs. Firefox worked most of the time but would need occasional restarts after complaining of out-of-memory errors.

The test hardware is an Intel Sandy Bridge Xeon E3-1275 with a 3.4GHz base clock and 3.8GHz turbo. For those unfamiliar with this model, it’s essentially equivalent to the Core i7-2600 but equipped with the faster GPU of the Core i7-2600K. The system has 16GB of 1333MHz RAM.

Crunching numbers

So enough talking. Let’s look at some numbers. We’ll take the four classic higher-is-better benchmarks first.

Bigger bars mean better performance.

Bigger bars mean better performance.

From the outset, some things are very clear. First, the plain JavaScript is surprisingly quick: 40 percent of the performance of native code, on average. Second, asm.js makes it a whole lot faster: 69 percent faster, on average, for overall performance of 68 percent of native. For people with fast computers like the test system, even with the performance hit incurred by using asm.js, the performance will tend to be well within the “acceptable” range.

The biggest gains were in the STREAM tests, with performance more than doubling for the scale, add, and triad subtests. This suggests that eliminating the checks and conversions when reading from the memory arrays is important to asm.js’s speed improvement.

The smallest gains were shown by the floating point intensive LINPACK and Whetstone tests. Whetstone barely improved from the switch to asm.js, gaining just seven percent. LINPACK did better, with a 38 percent improvement, but this is only slightly more than half the improvement seen by other tests. Based on the Whetstone result alone, one might interpret this as meaning that Firefox’s JavaScript engine is already close to the limit when it comes to floating point performance. However, it’s clear from the LINPACK test that this isn’t in fact the case. This was the worst performing test overall, providing just 26 percent of native performance.

And now the Benchmark Game tests.

Smaller bars mean better performance.

These results are much less favorable to the old JavaScript approach. On average, its runtime is 4.2 times the runtime of the native code. Conversely, these results really show the difference that asm.js can make: its average runtime is just 64 percent longer than the native code runtime. The asm.js versions of the code run in less than 40 percent of the time that the conventional JavaScript takes. For many tasks, this could be the difference between “too slow for the Emscripten approach to be useful” and “we can use Emscripten to deploy to the Web.”

These averages may even paint a pessimistic picture. The mandelbrot test demonstrates substantially worse performance than the others, clocking in at 3.9 times the runtime of the native code. It’s not immediately obvious why this should be the case. None of the operations it performs are unique to that test; it does a mix of file I/O, floating point arithmetic, and bitwise integer operations, but all of these can be found in other tests without showing any significant performance penalty.

Exclude mandelbrot and the asm.js runtime averages just 26 percent longer than the native runtime.

For single-threaded code performing a reasonable mix of tasks, asm.js performs remarkably well, at least in terms of raw speed.

Nothing’s ever easy, is it?

Unfortunately, the real situation is a bit more complex than that. Before anyone starts going out and using asm.js to write programs, or even evangelizing it to developers as Mozilla has arguably been doing recently, there are some important points to consider.

Most simply of all, the fast asm.js performance only exists in Firefox—and only in the unstable Nightly builds at that. This technology isn't in any stable version of Firefox, and it'll be some time before it is.

Other browsers don’t have special asm.js support at all. Google expressed some interest in adding asm.js optimizations to Chrome, but it hasn’t done so yet. Worse, our testing showed that the use of asm.js actually hurt performance in non-asm.js browsers.

Before, we compared native code to Firefox performance with and without asm.js. Let’s see what happens when adding Internet Explorer into the mix:

Bigger bars mean better performance.

Bigger bars mean better performance.

It’s clear that—for the kind of code that Emscripten produces, at least—Chakra is slower than Firefox’s engine almost across the board. Only the n-body test runs faster in Microsoft’s browser. This doesn’t mean that Chakra is necessarily slower in “real life”; the code produced by Emscripten is substantially unlike any normal code that browsers would run. It may not be altogether surprising that they don’t handle it well.

Smaller bars mean better performance.

But it’s also clear that, in many cases, asm.js is making a bad situation even worse. Average runtime goes from 6.4 times native without asm.js to 6.8 times native with it, a six percent deterioration.

As such, if developers want to write Web apps in a style that performs well in every browser then, for the time being at least, asm.js probably isn’t the optimal approach.

The Emscripten programs also seemed quite memory hungry. The biggest program we tested was probably the STREAM benchmark. This allocates about 240MB of memory for its test data. In spite of this, the native code version ran in a total of less than 256MB. asm.js requires the memory array to have a fixed power-of-two size, and there was hope that a 256MB array would be sufficient for it, too. However, it wasn’t, so we had to go up to the next size, 512MB. This combination of memory hunger and power-of-two sizing means that even a modest program could easily require a vast memory array.

There will be improvements in this regard sooner or later as at least one other browser vendor has expressed interest in asm.js. At its I/O conference last week, Google said that optimization work it had been doing led to a 2.4 times improvement in asm.js performance. With Google turning its attention to asm.js, it’s likely that the Chrome crash problems will also be tackled one way or another.

Platform limitations matter

Emscripten currently doesn’t provide a fully featured runtime environment. There are the obvious things it can’t do—full and unfettered network access, for example—due to the constraints of running in the browser. There are also some odd gaps in its libraries.

For example, the POSIX/UNIX standard function clock_gettime(), which is used to retrieve the system time (and, optionally, the time of various other clocks), exists in Emscripten, but it doesn’t work. It just sits at time zero forever. We didn’t find any serious, insurmountable issues here (for example, another time function, gettimeofday(), worked just fine), but the finding made clear that this is still very much a work in progress.

There are, however, some more serious omissions. Take threading for example. In some respects, Emscripten’s hands are tied. JavaScript, as traditionally used, is strictly serial. Only one piece of script code can ever be running at a time. A relatively new specification, Web Workers, grafts partial concurrency onto the JavaScript environment, but it’s very limited. Web Workers can’t share or modify each other’s data, so there are only limited facilities for communicating between Web Workers. It might technically be possible to somehow map multiple threads onto Web Workers, but there’s no obvious way in which it would be easy or efficient.

As a result, Emscripten can only be used to run single-threaded programs. One repercussion this had in our testing is that it tended to be a little kind to Emscripten. To keep things fair, we didn’t use any multithreading in the native code programs. But we could have, and some of the programs have multithreaded variants. Run those and the performance gap is considerably greater:

Smaller bars mean better performance.

The multithreaded binary-trees and mandelbrot tests are identical to their single-threaded counterparts, except, in each case, for a single line change that directs the compiler to use OpenMP for a single loop. This provides simple and straightforward parallelism with minimal developer effort. Even this simple change yields significant improvements: compared to the OpenMP programs, the asm.js versions take 2.87 times longer for binary-trees and 12.2 times longer for mandelbrot.

Similarly, Emscripten has no access to the SIMD (Single Instruction, Multiple Data) instruction sets such as SSE and AVX that modern processors offer because the Firefox scripting engine doesn’t know how to produce vector code. The native code compiler we tested did use SSE2, but in most cases was only able to use its scalar (Single Instruction, Single Data) capabilities. The two exceptions are LINPACK and STREAM: the simple loops in two of the LINPACK subtests were automatically vectorized, and all four of the loops in the STREAM subtests were vectorized.

Again, this can make a big difference. The Benchmark Game included an SSE3 version of the fannkuch-redux test. This is a more complex change than using OpenMP—it requires rewriting the entire program, more or less—but the performance change is significant:

Smaller bars mean better performance.

The SSE3 optimized version of the test ran in just 40 percent of the time of the generic version. In turn, this gave the asm.js version a much greater penalty: it took 3.7 times the time of the optimized native code.

We don’t have source code for multithreaded or SIMD versions of the classic tests immediately at hand, but closed source, binary-only versions do exist. While we’re a little wary of making such comparisons and certainly don’t think that the exact numbers should be treated as gospel, the results are at least illustrative of the performance that well-optimized native code can have when compared to single threaded cross-platform C and C++:

Bigger bars mean better performance.

Bigger bars mean better performance.

For those curious: Intel’s heavily optimized LINPACK can be downloaded here. The widely used benchmarking tool SiSoftware Sandra contains multithreaded, optimized versions of Whetstone (using SSE4) and Dhrystone (using SSE3). It also includes many memory benchmarks, one set of which (the floating point copy, scale, add, and triad quartet) is essentially an optimized version of STREAM.

A lot of programs are not particularly performance sensitive and are not particularly performance optimized. For these more generic programs, asm.js should fare pretty well. But if the original program is optimized in certain ways, such as using both thread-level and data-level (SIMD) parallelism, a substantial gap can open up between native code and asm.js. Software that depends on the likes of SSE2 and multithreading to achieve decent performance is liable to suffer severely when used with Emscripten, if it will even run at all.

Clearly, if asm.js is to be able to compete with this kind of program, some high performance solution must be developed: some kind of equivalent to shared memory multithreading. Such a change would, however, be enormously disruptive to the basic model of the Web browser. It would raise many hairy issues with regard to interaction between JavaScript programs and the rest of the webpage (or WebGL). There’s no easy solution here, so we wouldn’t expect any overnight improvement.

An annoying development model

We were fortunate in that the starting source code all essentially worked, so little debugging or other development work was necessary. That’s just as well because debugging support for Emscripten programs, whether using asm.js or not, is to a great extent non-existent.

The JavaScript programs that Emscripten produces are huge. The binary-tree test, for example, results in a 16,896-byte native code executable. Its JavaScript counterparts weigh in at 379,784 bytes for regular JavaScript and 667,207 bytes for asm.js. Aside from the download implications (though these can be mitigated through HTTP compression), these are simply huge JavaScript files, and it turns out that most browsers aren’t really built for files this size. Try to use your browser’s built-in debugging tools (whether in Firefox, Chrome, or Internet Explorer) and you’ll find that they get awfully slow, to the point of being unusable.

But maybe that doesn’t matter because you wouldn’t really want to debug the JavaScript anyway. The JavaScript emitted by Emscripten is of a comparable level of abstraction to native assembly code. It does have one nicety that assembly doesn’t—it has proper functions and function calls—but apart from that, it’s substantially impenetrable. If something goes wrong when running a program, there’s no good way of figuring out what or why from the JavaScript.
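To give a flavor of what that output looks like, here is a small hand-written sketch in the asm.js style. This is our own toy example, not actual Emscripten output:

```javascript
// Hand-written sketch of the asm.js style: every value is coerced to a
// known type (x|0 marks a 32-bit integer), and all "memory" lives in a
// single typed-array heap indexed by byte offsets.
function SumModule(stdlib, foreign, heap) {
  "use asm";
  var HEAP32 = new stdlib.Int32Array(heap);

  function sum(ptr, n) {
    ptr = ptr | 0;          // parameter type annotations
    n = n | 0;
    var i = 0;
    var total = 0;
    for (; (i | 0) < (n | 0); i = (i + 1) | 0) {
      // HEAP32 is indexed in 4-byte words, hence the >> 2
      total = (total + (HEAP32[(ptr + (i << 2)) >> 2] | 0)) | 0;
    }
    return total | 0;
  }

  return { sum: sum };
}

// Any JavaScript engine runs this as ordinary JS; an asm.js-aware engine
// can validate and compile the whole module ahead of time.
var heap = new ArrayBuffer(0x10000);
var mod = SumModule({ Int32Array: Int32Array }, {}, heap);
var view = new Int32Array(heap);
view[0] = 3; view[1] = 4; view[2] = 5;
console.log(mod.sum(0, 3)); // 12
```

The pervasive `|0` coercions and the explicit heap arithmetic are exactly what make the code both compilable ahead of time and, at any realistic scale, impenetrable to a human reader.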

A technique for addressing this, at least in part, is being developed. Source Maps should allow JavaScript debuggers to correlate between the compiled source and the thing that generated it. However, the debugging experience still, in practice, leaves an awful lot to be desired. Native code debuggers are complex and capable things, allowing easy transitions between source and assembly view, step-by-step execution with both source and assembly granularity, structured views of in-memory data, and much more besides. JavaScript debuggers are generally less mature, and asm.js makes this immaturity more acute.

Presently, your best bet is to debug native code using native debugging tools and hope that the Emscripten compilation works as it should. That may not be attractive in projects where you’re performing significant interoperation between portions of code compiled with Emscripten and conventional hand-written JavaScript.

Compilation is also very slow. A number of optimization steps are performed, with Emscripten bringing in node.js and, for non-asm.js builds, even Java to handle certain optimization tasks. In our testing, the Emscripten builds took between 10 and 50 times longer to compile than the native code ones. This again makes development and debugging a chore.

We also came across some minor bugs during development, wherein Emscripten would produce improper output. Emscripten developer and Mozilla engineer Alon Zakai was very helpful, producing a couple of timely fixes to address specific issues we came across.

Amazingly, asm.js actually works

asm.js works, and that surprised us. We expected better-than-JavaScript performance, but we assumed that Mozilla must have been cooking the books somehow with its claims of “within 2× native performance.” With some provisos around multithreading and SIMD code, though, the company was telling the truth. It really is that fast, and it can be very close to native performance.

In spite of that, it still may not be fast enough. Our testing PC isn’t the fastest in the world and is a processor generation behind, but it’s still pretty quick. Judging by the Steam Hardware Survey, even among gamers—a demographic that generally needs faster PCs than most—our system is within the top 15 percent. It can afford to have programs take 26 percent longer when they run, because for the most part, they’ll still run very fast.

So if a lowlier computer were used, would we reach the same conclusions? I still have systems where I find myself waiting for the computer to catch up; do I really want those waits to be 26 percent longer? (I don’t think so.) This goes double when the code is already performance sensitive, such as in a game engine. Games often push devices to the limit, and 26 percent worse execution time could easily be the difference between smooth graphics and jerky dropped frames.

It’s even worse on battery-powered devices. When a laptop, smartphone, or tablet is running a compute-intensive task for 26 percent longer, that also means that its CPU is running at full speed for 26 percent longer. That means that it takes 26 percent longer for the processor to drop down to its power saving mode. So not only do you have to wait; you get worse battery life, too.

On mobile devices, every cycle matters. Giving up 26 percent of them to run programs in the browser rather than natively is unappealing, and yet this is precisely what Mozilla wants us to do.

So while we might not be willing to give up native apps for asm.js Web apps, giving up traditional Web apps for asm.js Web apps is a different story. The asm.js output is faster than the conventional Emscripten output, and preliminary testing indicates that it can be quite a bit faster than hand-coded, human-readable JavaScript, too.

We don’t yet have equivalent versions of all the Benchmark Game tests written in hand-coded JavaScript, but we’ve tried a few, running them in jsshell rather than a browser.

Smaller bars mean better performance.

These results are quite varied: two of the idiomatic JavaScript programs are slower than their asm.js counterparts, and two are faster. But this is a useful comparison. The rules of the Benchmark Game for these four tests require not only that the tests compute the same results but also that they use the same algorithm. Both the C++ and hand-written JavaScript code are, therefore, good approximations of what a real developer would write when tackling these problems in the same way.

asm.js may fall a bit short of delivering truly native performance, but it might yet be valuable as a way of beating JavaScript itself.
