Arquero API Reference


Table Expressions

Most Arquero verbs accept table expressions: functions defined over table column values. For example, the derive verb creates new columns based on the provided expressions:

table.derive({
  raise: d => op.pow(d.col1, d.col2),
  'col diff': d => d.col1 - d['base col']
})

In the example above, the two arrow function expressions are table expressions. The input argument d represents a row of the data table, whose properties are column names. Table expressions can include standard JavaScript expressions and invoke functions defined on the op object, which, depending on the context, may include standard, aggregate, or window functions.

At first glance table expressions look like normal JavaScript functions… but hold on! Under the hood, Arquero takes a set of function definitions, maps them to strings, then parses, rewrites, and compiles them to efficiently manage data internally. From Arquero’s point of view, the following examples are all equivalent:

  1. function(d) { return op.sqrt(d.value); }
  2. d => op.sqrt(d.value)
  3. ({ value }) => op.sqrt(value)
  4. d => sqrt(d.value)
  5. d => aq.op.sqrt(d.value)
  6. "d => op.sqrt(d.value)"
  7. "sqrt(d.value)"

Examples 1 through 5 are function definitions, while examples 6 and 7 are string literals.
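The string-based pipeline described above can be pictured with a deliberately tiny sketch. This is not Arquero's actual parser: the compile function and the stand-in op object below are hypothetical, and the regular expression only recognizes single-parameter arrow functions with an expression body. It only aims to show why a function definition and its string form (examples 2 and 6) become interchangeable once both are reduced to source text.

```javascript
// Hypothetical stand-in for Arquero's op object; only sqrt is provided.
const op = { sqrt: Math.sqrt };

// Reduce a function or a string to source text, then build a new function
// from that text. Only "param => expression" arrow forms are recognized.
function compile(expr) {
  const source = String(expr).trim();
  const match = source.match(/^\(?\s*([A-Za-z_$][\w$]*)\s*\)?\s*=>\s*([\s\S]+)$/);
  if (!match) throw new Error(`Unsupported expression: ${source}`);
  const [, param, body] = match;
  // The generated function sees only the row and the op object.
  return new Function(param, 'op', `return (${body});`);
}

const fromFunction = compile(d => op.sqrt(d.value));
const fromString = compile('d => op.sqrt(d.value)');

console.log(fromFunction({ value: 9 }, op)); // 3
console.log(fromString({ value: 9 }, op));   // 3
```

Both inputs pass through String() first, so the compiled result depends only on the source text and not on how that text was supplied.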

Limitations

A number of JavaScript features are not allowed in table expressions, including internal function definitions, variable updates, and for loops. The only function calls allowed are those provided by the op object. (Why? Read below for more…) Most notably, parsed table expressions do not support closures. As a result, table expressions cannot access variables defined in the enclosing scope.
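The closure limitation follows from working with source text. The sketch below uses only plain JavaScript (no Arquero calls) and a hypothetical makeFilter factory. It shows that a function's source text names an enclosing variable but carries no binding for it, so a function rebuilt from that text cannot resolve the name.

```javascript
// A factory whose returned arrow function closes over a local variable.
function makeFilter() {
  const threshold = 5;
  return d => d.value < threshold;
}

const original = makeFilter();

// The source text mentions `threshold` by name but carries no binding for it.
const source = original.toString();

// A function rebuilt from that text is created in the global scope,
// where `threshold` is not defined.
const rebuilt = new Function('d', `return (${source})(d);`);

console.log(original({ value: 3 })); // true

let failure = null;
try {
  rebuilt({ value: 3 });
} catch (err) {
  failure = err; // a ReferenceError, since `threshold` cannot be resolved
}
console.log(failure !== null && failure.name); // ReferenceError
```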

To include external variables in a table expression, use the params() method to bind a parameter value to a table context. Parameters can then be accessed by including a second argument to a table expression; all bound parameters are available as properties of that argument (default name $):

table
  .params({ threshold: 5 })
  .filter((d, $) => d.value < $.threshold)
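One way to picture parameter binding, offered as a sketch rather than a description of Arquero's internals: the expression is compiled from source text, and the bound parameter object is supplied as the second argument on every call. The bindParams helper is hypothetical.

```javascript
// Hypothetical helper: compile expression source text and supply a fixed
// params object as the second argument on every call.
function bindParams(source, params) {
  const compiled = new Function('d', '$', `return (${source})(d, $);`);
  return d => compiled(d, params);
}

const rows = [{ value: 3 }, { value: 8 }];
const predicate = bindParams('(d, $) => d.value < $.threshold', { threshold: 5 });

console.log(rows.filter(predicate)); // [ { value: 3 } ]
```

Because the value travels as data rather than as a captured variable, the expression text stays free of references to the enclosing scope.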

To pass in a standard JavaScript function that will be called directly (rather than parsed and rewritten), use the escape() expression helper. Escaped functions do support closures and so can refer to variables defined in an enclosing scope. However, escaped functions do not support aggregate or window operations; they also sidestep internal optimizations and result in an error when attempting to serialize Arquero queries (for example, to pass transformations to a worker thread).

const threshold = 5;
table.filter(aq.escape(d => d.value < threshold))

Alternatively, for programmatic generation of table expressions, one can fall back to generating a string (rather than a proper function definition) and use that instead:

// note: threshold must safely coerce to a string!
const threshold = 5;
table.filter(`d => d.value < ${threshold}`)
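When the interpolated value is not a plain number, naive interpolation can produce broken or unintended source text. A minimal sketch of one mitigation, assuming the values are JSON-representable: the hypothetical literal helper below uses JSON.stringify to quote and escape values before they are embedded.

```javascript
// Hypothetical helper: turn a JavaScript value into literal source text.
// JSON.stringify quotes and escapes strings; finite numbers and booleans
// pass through unchanged.
function literal(value) {
  const text = JSON.stringify(value);
  if (text === undefined) throw new TypeError('Value has no literal form');
  return text;
}

const threshold = 5;
const name = 'O\'Brien "Jr"';

console.log(`d => d.value < ${literal(threshold)}`);
// d => d.value < 5
console.log(`d => d.name === ${literal(name)}`);
// d => d.name === "O'Brien \"Jr\""
```

Note that JSON.stringify turns NaN and Infinity into null, so such values need separate handling.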

Column Shorthands

Some verbs – including groupby(), orderby(), fold(), pivot(), and join() – accept shorthands such as column name strings. Given a table with columns colA and colB (in that order), the following are equivalent:

  1. table.groupby('colA', 'colB') - Refer to columns by name
  2. table.groupby(['colA', 'colB']) - Use an explicit array of names
  3. table.groupby(0, 1) - Refer to columns by index
  4. table.groupby(aq.range(0, 1)) - Use a column range helper
  5. table.groupby({ colA: d => d.colA, colB: d => d.colB }) - Explicit table expressions

Under the hood, all of these variants are reduced to table expressions.
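As a rough illustration of that reduction (a sketch, not Arquero's implementation), the hypothetical expandColumns helper below resolves names, indices, and arrays of either into the same set of expression strings. Range helpers and selection functions are left out.

```javascript
// Hypothetical helper: resolve names, indices, or arrays of either into
// an object of expression strings keyed by column name.
function expandColumns(columnNames, ...refs) {
  const expressions = {};
  for (const ref of refs.flat()) {
    const name = typeof ref === 'number' ? columnNames[ref] : ref;
    expressions[name] = `d => d[${JSON.stringify(name)}]`;
  }
  return expressions;
}

const columns = ['colA', 'colB'];

console.log(expandColumns(columns, 'colA', 'colB'));
console.log(expandColumns(columns, ['colA', 'colB']));
console.log(expandColumns(columns, 0, 1));
// each call yields { colA: 'd => d["colA"]', colB: 'd => d["colB"]' }
```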

Aggregate & Window Shorthands

For aggregate and window functions, use of the op object outside of a table expression allows the use of shorthand references. The following examples are equivalent:

  1. d => op.mean(d.value) - Standard table expression
  2. op.mean('value') - Shorthand table expression generator

The second example produces an object that, when coerced to a string, generates 'd => op.mean(d["value"])' as a result.
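The string coercion described above can be mimicked in a few lines. The mean function below is a hypothetical stand-in, not Arquero's op.mean. It returns an object whose toString method produces the expression source for the given column name.

```javascript
// Hypothetical stand-in for a shorthand generator such as op.mean.
function mean(field) {
  return {
    toString: () => `d => op.mean(d[${JSON.stringify(field)}])`
  };
}

console.log(String(mean('value'))); // d => op.mean(d["value"])
```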


Two-Table Expressions

For join verbs, Arquero also supports two-table expressions. Two-table expressions have an expanded signature that accepts two rows as input, one from the “left” table and one from the “right” table.

table.join(otherTable, (a, b) => op.equal(a.key, b.key))

The use of aggregate and window functions is not allowed within two-table expressions. Otherwise, two-table expressions have the same capabilities and limitations as normal (single-table) table expressions.

Bound parameters can be accessed by including a third argument:

table
  .params({ threshold: 1.5 })
  .join(otherTable, (a, b, $) => op.abs(a.value - b.value) < $.threshold)
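To make the two-row signature concrete, here is a naive nested-loop join in plain JavaScript. It is an illustration only and makes no claim about how Arquero evaluates joins. The nestedLoopJoin function, its optional params argument, and the sample rows are all hypothetical.

```javascript
// Naive nested-loop join: call the predicate for every (left, right) pair,
// passing an optional params object as the third argument.
function nestedLoopJoin(left, right, predicate, params = {}) {
  const out = [];
  for (const a of left) {
    for (const b of right) {
      if (predicate(a, b, params)) out.push({ ...a, ...b });
    }
  }
  return out;
}

const left = [{ key: 1, x: 'a' }, { key: 2, x: 'b' }];
const right = [{ key: 2, y: 'c' }, { key: 3, y: 'd' }];

console.log(nestedLoopJoin(left, right, (a, b) => a.key === b.key));
// [ { key: 2, x: 'b', y: 'c' } ]

// A third argument gives the predicate access to bound parameters.
const near = nestedLoopJoin(
  left, right,
  (a, b, $) => Math.abs(a.key - b.key) < $.threshold,
  { threshold: 1.5 }
);
console.log(near.length); // 3
```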

Two-Table Column Shorthands

Rather than writing explicit two-table expressions, join verbs can also accept column shorthands in the form of a two-element array: the first element of the array is either a string or string array with columns in the first (left) table, whereas the second element indicates columns in the second (right) table.

Given two tables – one with columns x, y and the other with columns u, v – the following examples are equivalent:

  1. table.join(other, ['x', 'u'], [['x', 'y'], 'v'])
  2. table.join(other, [['x'], ['u']], [['x', 'y'], ['v']])
  3. table.join(other, ['x', 'u'], [aq.all(), aq.not('u')])

All of these are in turn equivalent to using the following two-table expressions:

table.join(other, ['x', 'u'], {
  x: (a, b) => a.x,
  y: (a, b) => a.y,
  v: (a, b) => b.v
})
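The mapping from a two-element shorthand to explicit two-table expressions can be sketched as follows. The expandJoinValues helper is hypothetical and handles only strings and string arrays, not selection helpers such as aq.all() or aq.not().

```javascript
// Hypothetical helper: expand [leftColumns, rightColumns] into an object of
// two-table expression strings keyed by output column name.
function expandJoinValues([leftCols, rightCols]) {
  const values = {};
  for (const name of [leftCols].flat()) {
    values[name] = `(a, b) => a[${JSON.stringify(name)}]`;
  }
  for (const name of [rightCols].flat()) {
    values[name] = `(a, b) => b[${JSON.stringify(name)}]`;
  }
  return values;
}

const values = expandJoinValues([['x', 'y'], 'v']);
console.log(values);
// keys x and y read from the left row (a); key v reads from the right row (b)
```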

Why are only op functions supported?

Any function that is callable within an Arquero table expression must be defined on the op object, either as a built-in function or added via the extensibility API. Why is this the case?

As described earlier, Arquero table expressions can look like normal JavaScript functions, but are treated specially: their source code is parsed and new custom functions are generated to process data. This process prevents the use of closures, such as referencing functions or values defined externally to the expression.

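So why do we do this? Parsing expressions lets Arquero apply internal optimizations and serialize queries (for example, to pass transformations to a worker thread), and it supports goals such as portability and safety.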

Of course, one might wish to make different trade-offs. Arquero is designed to support common use cases while also being applicable to more complex production setups. This goal comes with the cost of more rigid management of functions. However, Arquero can be extended with custom variables, functions, and even new table methods or verbs! As starting points, see the params, addFunction, and addTableMethod functions to introduce external variables, register new op functions, or extend tables with new methods.

All that being said, not all use cases require portability, safety, etc. For such cases Arquero provides an escape hatch: use the escape() expression helper to apply a standard JavaScript function as-is, skipping any internal parsing and code generation.