<p>nnc, the deep learning framework from libccv. Liu Liu. https://libnnc.org/. Updated 2018-05-01.</p>
<p>NNC: Neural Network Collection. Liu Liu, 2018-04-28. https://libnnc.org/2018/04/28/nnc</p>
<h2>What’s NNC?</h2>
<p>NNC is the natural progression from <code class="highlighter-rouge">ccv_convnet</code>, which is a couple of years old now. <code class="highlighter-rouge">ccv_convnet</code>’s monolithic, single-path neural layer design never really felt right for more advanced network architectures.</p>
<p>NNC took some good ideas from more recent neural network frameworks and did a long re-think on how to achieve both efficiency and expressiveness. The design itself is layered. At the highest layer, you have ordinary neural network primitives that reflect real-world usage, such as Inception modules, LSTM, RNN and so on. At the lowest layer, depending on the infrastructure, it maps to allocated tensors on GPU, computations backed by CuDNN, and computation graphs driven with CUDA streams (or you can exchange that with CPU, Metal, and libdispatch). For the abstractions in between, there are trade-offs and constraints in accommodating both the library design and its usage.</p>
<p>In a few sentences, here is how the layered design works.</p>
<p>NNC starts with tensors, commands and streams, which closely map to low-level computation primitives. On top of that, a concrete computation graph can be constructed and executed directly. Above that, a symbolic graph can express everything about the concrete computation graph, without actual tensor and stream allocations. A dynamic graph can contain both the symbolic graph representation and the concrete computation graph, and thus carries out the computation immediately while retaining the ability to run a series of processes on top of the symbolic graph representation. Familiar primitives such as LSTM or RNN are then built on top of either the symbolic graph or the dynamic graph constructs.</p>
<p>There are roughly <strong>5 layers</strong> built on top of each other.</p>
<h2>1. Tensors, Commands and Streams</h2>
<p>At their most basic level, tensors are multi-dimensional arrays.</p>
<p>Commands are ops (operations) in other frameworks’ terminology.</p>
<p>Streams are the synchronization mechanism. Each command instance is executed serially on a given stream. Different command instances on different streams will be scheduled in parallel if the underlying infrastructure permits.</p>
<p>A command is identified by its <code class="highlighter-rouge">cmd</code> identifier. It processes a set of input tensors, and writes output to a set of output tensors. There is no limit on how many input tensors it can accept or how many output tensors it can write to.</p>
<p>A command can only have one set of attributes (recognized by NNC) specified. These attributes (such as whether this can be an <em>inplace</em> operation) help with symbolic processes. If you need to implement the same command but these attributes cannot hold, you need to rename the command to avoid invalid symbolic processes.</p>
<p>One command, however, can be backed by several <strong>backend</strong> implementations. Command backend implementors, other than those who implement <code class="highlighter-rouge">*_REF</code>, are free to support only specific cases of the input tensors (for example, a particular tensor layout, a specific tensor size (3x3?), or half precision numbers). But once a backend accepts the input, it follows exactly the command attributes specified above (for example, any backend that implements an <em>inplace</em> command will allow any part of its input to be overwritten by this command at any time while the command is executing, without affecting the correctness of the output).</p>
<p>At runtime, a command will select the appropriate backend based on the input type and execution time.</p>
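<p>Backend selection of this kind can be sketched in a few lines of C. This is only an illustration, not NNC’s code: <code class="highlighter-rouge">tensor_info_t</code>, <code class="highlighter-rouge">backend_t</code>, <code class="highlighter-rouge">select_backend</code> and the timing numbers are all hypothetical, and it assumes each backend’s execution time was measured beforehand.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tensor description: only the fields this sketch needs. */
typedef struct {
    int is_half_precision; /* 1 if the data is 16-bit floating point */
    int kernel_size;       /* e.g. 3 for a 3x3 kernel */
} tensor_info_t;

/* One backend implementation of a command (hypothetical layout). */
typedef struct {
    const char* name;
    int (*accepts)(const tensor_info_t* input); /* a backend may reject inputs */
    double measured_ms; /* assumed to come from an earlier trial run */
} backend_t;

static int accepts_any(const tensor_info_t* input) { (void)input; return 1; }
static int accepts_3x3_only(const tensor_info_t* input) { return input->kernel_size == 3; }
static int accepts_half_only(const tensor_info_t* input) { return input->is_half_precision; }

/* Return the fastest backend that accepts the input, or NULL if none does. */
static const backend_t* select_backend(const backend_t* backends, size_t count, const tensor_info_t* input)
{
    const backend_t* best = NULL;
    assert(backends != NULL && input != NULL);
    for (size_t i = 0; i < count; i++) {
        if (!backends[i].accepts(input))
            continue; /* this backend does not support the input */
        if (best == NULL || backends[i].measured_ms < best->measured_ms)
            best = &backends[i];
    }
    return best;
}

/* Example table: the reference backend accepts everything but is slowest. */
static const backend_t example_backends[] = {
    { "reference", accepts_any, 9.0 },
    { "fast_3x3", accepts_3x3_only, 2.0 },
    { "half_precision", accepts_half_only, 1.0 },
};
```

<p>With this table, a single precision 3x3 input would pick <code class="highlighter-rouge">fast_3x3</code>, while a 5x5 input falls back to the reference backend, which mirrors the role the text gives to the <code class="highlighter-rouge">*_REF</code> implementations.</p>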
<h2>2. Computation Graph</h2>
<p><strong>Computation graph</strong> expresses how the computation is carried out. The output tensors can be used as input for the next command, and so on. That is where <em>TensorFlow</em> got its name. At this layer, the <strong>computation graph</strong> knows the execution order (data dependencies) between command instances, and will schedule them on proper streams to ensure this execution order is respected. Tensors themselves are not associated with the execution order at this point.</p>
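<p>As a rough sketch of what respecting the execution order means, the C code below orders command instances so that each one runs only after everything it depends on. It uses Kahn’s algorithm on a small adjacency matrix; the names <code class="highlighter-rouge">exec_graph_t</code> and <code class="highlighter-rouge">exec_order</code> are hypothetical, and stream assignment is not modeled.</p>

```c
#include <assert.h>

#define MAX_NODES 16

/* A tiny dependency graph: edge[i][j] != 0 means command j consumes an
 * output of command i, so i must run before j. */
typedef struct {
    int node_count;
    int edge[MAX_NODES][MAX_NODES];
} exec_graph_t;

/* Fill order[] with a valid execution order (Kahn's algorithm).
 * Returns how many commands were ordered; a value smaller than
 * node_count means the graph contains a cycle. */
static int exec_order(const exec_graph_t* graph, int order[MAX_NODES])
{
    int pending[MAX_NODES]; /* unfinished dependencies per command */
    int done[MAX_NODES] = { 0 };
    int emitted = 0;
    assert(graph->node_count >= 0 && graph->node_count <= MAX_NODES);
    for (int j = 0; j < graph->node_count; j++) {
        pending[j] = 0;
        for (int i = 0; i < graph->node_count; i++)
            if (graph->edge[i][j])
                pending[j]++;
    }
    for (;;) {
        int progressed = 0;
        for (int i = 0; i < graph->node_count; i++) {
            if (done[i] || pending[i] != 0)
                continue;
            /* All inputs are ready: run command i, then release its consumers. */
            order[emitted++] = i;
            done[i] = 1;
            for (int j = 0; j < graph->node_count; j++)
                if (graph->edge[i][j])
                    pending[j]--;
            progressed = 1;
        }
        if (!progressed)
            break;
    }
    return emitted;
}
```

<p>Commands that become ready in the same pass have no dependency on each other, which is the situation where a scheduler could place them on different streams.</p>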
<p>A <strong>computation graph</strong> can contain a sub-graph, which is a <strong>computation graph</strong> itself. It is executed as a single command instance by the parent <strong>computation graph</strong>. As of now, a <em><code class="highlighter-rouge">while</code> type sub-graph</em> (for looping) and a <em><code class="highlighter-rouge">case..of</code> type sub-graph</em> (for branching) are supported.</p>
<p>A <strong>computation graph</strong> can be auto-tuned to find the best backend implementations that minimize the total execution time. There may be future optimizations to allow modifying the graph itself to do more aggressive tuning (such as including tensor conversions to trade between slower implementation and conversion + faster implementation).</p>
<p>In short, once you have a <strong>computation graph</strong>, the computation can be carried out naturally, because there are no extra assumptions about the execution environment and no more parameters or allocations need to be specified.</p>
<h2>3. Symbolic Graph</h2>
<p><strong>Symbolic graph</strong> expresses commands, the associated tensors and their execution orders (dependencies). This may sound very similar to the <strong>computation graph</strong> above, but there are several important differences:</p>
<ol>
<li>
<p>There is no concept of <em>stream</em> at this point, because the <strong>symbolic graph</strong> doesn’t carry out the actual computation, and <em>stream</em> can be determined purely by the execution order;</p>
</li>
<li>
<p>There is no tensor allocation. The <strong>symbolic graph</strong> uses the tensor metadata (layout, precision, dimensions, even which GPU it is associated with), but no actual allocation takes place until it is compiled to a <strong>computation graph</strong>;</p>
</li>
<li>
<p>There is no guarantee of a 1:1 mapping between the commands in the <strong>symbolic graph</strong> and the command instances in the <strong>computation graph</strong>.</p>
</li>
</ol>
<p>In fact, the <strong>symbolic graph</strong> doesn’t take tensors. It takes tensor symbols. The tensor symbol usage within the symbolic graph follows a strict <em>static single assignment (SSA)</em> rule: a tensor symbol can only be used as a command instance’s output once. This is important because by following <em>SSA</em>, potential data races are completely eliminated. Moreover, certain processes and the actual tensor allocation algorithm are much easier to implement with this assumption. With the <em>SSA</em> rule, the execution order (dependencies) can be generated trivially.</p>
<p>It may feel like the tensor metadata is over-specified. For example, why are precision, layout, or which GPU the tensor resides on relevant? Because tensor symbols have a many-to-1 mapping with the actual tensors. Specifications on the tensor symbol prevent processes on the <strong>symbolic graph</strong> from producing a tensor symbol that needs to be backed by conversions. Any conversions on the <strong>symbolic graph</strong> have to be explicit command instances.</p>
<p>With that in mind, however, you can take an <em>alias</em> of a tensor symbol, which is a sliced / reshaped tensor symbol from the original. It allows several operations to be zero effort on the actual <strong>computation graph</strong>. The <em>alias</em> itself still has to follow the same <em>SSA</em> rule, which means all the <em>aliases</em> and the original tensor symbol can only be written once (if two <em>aliases</em> used as outputs point to non-overlapping parts of the original tensor, the written-once rule is not violated).</p>
<p>Processes that can be carried out on the <strong>symbolic graph</strong> range from <em>automatic differentiation</em> to <em>common sub-expression elimination</em> (CSE) and <em>operator fusion</em> (a finer-grained set of commands is replaced by a combined command implementation).</p>
<p>When the actual computation is needed, a <strong>symbolic graph</strong> can be compiled to a <strong>computation graph</strong>. The compilation process can involve optimizations that were already possible on the graph before compilation (such as CSE). More importantly, this step performs additional optimization passes that will violate the <em>SSA</em> rule above. Currently, it will perform the following processes that are not available as pure optimization passes:</p>
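<p>To see why the written-once rule makes dependencies trivial to generate, consider the following C sketch. It is a simplified model with hypothetical names (<code class="highlighter-rouge">symbolic_cmd_t</code>, <code class="highlighter-rouge">find_writers</code>, <code class="highlighter-rouge">depends_on</code>): since every tensor symbol has at most one writer, a command depends exactly on the writers of its inputs.</p>

```c
#include <assert.h>

#define MAX_IO 4

/* Hypothetical command record: indices of the tensor symbols it reads
 * and the tensor symbols it writes. */
typedef struct {
    int inputs[MAX_IO];
    int input_count;
    int outputs[MAX_IO];
    int output_count;
} symbolic_cmd_t;

/* Record the unique writer of each symbol in writer[] (-1 when no command
 * writes it). Returns 1 if the written-once rule holds, 0 otherwise. */
static int find_writers(const symbolic_cmd_t* cmds, int cmd_count, int* writer, int symbol_count)
{
    for (int s = 0; s < symbol_count; s++)
        writer[s] = -1;
    for (int c = 0; c < cmd_count; c++)
        for (int k = 0; k < cmds[c].output_count; k++) {
            const int s = cmds[c].outputs[k];
            assert(s >= 0 && s < symbol_count);
            if (writer[s] != -1)
                return 0; /* second write to the same symbol */
            writer[s] = c;
        }
    return 1;
}

/* With one writer per symbol, `consumer` depends on `producer` exactly
 * when one of its inputs is written by `producer`. */
static int depends_on(const symbolic_cmd_t* cmds, const int* writer, int consumer, int producer)
{
    for (int k = 0; k < cmds[consumer].input_count; k++)
        if (writer[cmds[consumer].inputs[k]] == producer)
            return 1;
    return 0;
}
```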
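<p>The non-overlapping condition for aliases can be illustrated with a deliberately simplified model in which an alias is a contiguous 1-D slice of the original symbol. Real aliases can be multi-dimensional and reshaped; <code class="highlighter-rouge">alias_range_t</code> and <code class="highlighter-rouge">written_once</code> are hypothetical names for this sketch only.</p>

```c
#include <assert.h>

/* Hypothetical 1-D alias: the slice [offset, offset + length) of the
 * original tensor symbol. */
typedef struct {
    int offset;
    int length;
} alias_range_t;

/* Two half-open ranges overlap when each starts before the other ends. */
static int aliases_overlap(alias_range_t a, alias_range_t b)
{
    return a.offset < b.offset + b.length && b.offset < a.offset + a.length;
}

/* Returns 1 if all aliases used as outputs are pairwise disjoint, so
 * every element of the original is written at most once. */
static int written_once(const alias_range_t* written, int count)
{
    for (int i = 0; i < count; i++) {
        assert(written[i].length > 0);
        for (int j = i + 1; j < count; j++)
            if (aliases_overlap(written[i], written[j]))
                return 0;
    }
    return 1;
}
```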
<ol>
<li>
<p>In-place-safe command instances will operate on the same tensor symbol for inputs / outputs whenever possible (for example, <code class="highlighter-rouge">1.23 * x => y</code> will be re-written to <code class="highlighter-rouge">1.23 * x => x</code> if no other places use <code class="highlighter-rouge">x</code>);</p>
</li>
<li>
<p>Tensor allocation based on the liveness analysis for the tensor symbols. This step will generate the many-to-1 mapping between tensor symbols and the actual tensors;</p>
</li>
<li>
<p>Emit implicit commands for tensor initialization. Certain tensor symbols need to be initialized before use (zero init for now), and when to do so is impossible to know until tensor allocation has taken place. This is one reason why there is no 1:1 mapping between <strong>symbolic graph</strong> and <strong>computation graph</strong>.</p>
</li>
</ol>
<p>All of the above steps are carried out recursively for its <em><code class="highlighter-rouge">while</code> / <code class="highlighter-rouge">case..of</code> type sub-graphs</em> too.</p>
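<p>The many-to-1 mapping produced by liveness-based allocation can be sketched with a simple greedy scheme. This is not the algorithm NNC uses; it only assumes that each tensor symbol has a known first and last use (as command indices) and reuses a buffer once its previous occupant is dead. All names are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUFFERS 16

/* Liveness of one tensor symbol, expressed in command indices. */
typedef struct {
    int first_use; /* command that writes the symbol */
    int last_use;  /* last command that reads it */
    size_t size;   /* bytes required */
} symbol_live_t;

typedef struct {
    size_t size;    /* capacity of the buffer */
    int busy_until; /* last_use of the symbol currently stored in it */
} buffer_t;

/* Assign each symbol (sorted by first_use) to a buffer, reusing a buffer
 * once its previous occupant is no longer live. Writes the buffer index
 * of each symbol into assignment[] and returns the number of buffers. */
static int assign_buffers(const symbol_live_t* symbols, int count, int* assignment)
{
    buffer_t buffers[MAX_BUFFERS];
    int buffer_count = 0;
    for (int s = 0; s < count; s++) {
        int chosen = -1;
        assert(s == 0 || symbols[s - 1].first_use <= symbols[s].first_use);
        /* Strict comparison: a buffer still read by the command that writes
         * this symbol is not reused (no in-place rewrite in this sketch). */
        for (int b = 0; b < buffer_count && chosen < 0; b++)
            if (buffers[b].busy_until < symbols[s].first_use && buffers[b].size >= symbols[s].size)
                chosen = b;
        if (chosen < 0) {
            assert(buffer_count < MAX_BUFFERS);
            chosen = buffer_count++;
            buffers[chosen].size = symbols[s].size;
        }
        buffers[chosen].busy_until = symbols[s].last_use;
        assignment[s] = chosen;
    }
    return buffer_count;
}
```

<p>For a chain of four symbols where each one is consumed only by the next command, this ends up with two buffers backing four tensor symbols.</p>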
<h2>4. Dynamic Graph</h2>
<p><strong>Dynamic graph</strong> operates on concrete tensor instances. It takes input tensors, executes a command on them, and gets the outputs. From this perspective, it is very similar to the <strong>computation graph</strong>. The conceptual difference is that the <strong>computation graph</strong> carries out execution from a specification, while the <strong>dynamic graph</strong> forms a specification from the actual execution.</p>
<p>Thus, the <strong>dynamic graph</strong> will construct a <strong>symbolic graph</strong> as it executes. This enables the <strong>dynamic graph</strong> to perform the same kind of sophisticated optimization passes and analysis once needed (such as <em>automatic differentiation</em>).</p>
<p>Moreover, the <strong>dynamic graph</strong> implements a simple memoization mechanism. The tensors it uses will carry a hash, as will a specific command. The output tensors can be retrieved from the cache by the generated hash if possible, to avoid repetitive computations.</p>
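<p>A minimal sketch of such a hash-keyed cache is shown below. It uses a scalar as a stand-in for tensor contents and a fixed-size, direct-mapped table; the structure names, the table size and the hash mixing (which borrows the 64-bit FNV offset basis and prime) are assumptions for illustration, not how NNC implements it.</p>

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SLOTS 64

typedef struct {
    uint64_t hash; /* identifies how the value was produced */
    double value;  /* scalar stand-in for tensor contents */
} hashed_tensor_t;

typedef struct {
    int used;
    hashed_tensor_t output;
} cache_slot_t;

typedef struct {
    cache_slot_t slots[CACHE_SLOTS];
    int executions; /* counts how often a command actually ran */
} dynamic_cache_t;

/* Mix a command identifier with the hash carried by its input. */
static uint64_t mix_hash(uint64_t cmd, uint64_t input_hash)
{
    uint64_t h = 14695981039346656037ULL; /* 64-bit FNV offset basis */
    h = (h ^ cmd) * 1099511628211ULL;     /* 64-bit FNV prime */
    h = (h ^ input_hash) * 1099511628211ULL;
    return h;
}

/* Execute `fn` on `input` unless an output with the same hash is cached.
 * Equal hashes are treated as equal computations in this sketch. */
static hashed_tensor_t exec_memoized(dynamic_cache_t* cache, uint64_t cmd, double (*fn)(double), hashed_tensor_t input)
{
    assert(cache != 0 && fn != 0);
    const uint64_t hash = mix_hash(cmd, input.hash);
    cache_slot_t* slot = &cache->slots[hash % CACHE_SLOTS];
    if (slot->used && slot->output.hash == hash)
        return slot->output; /* cache hit: skip the computation */
    slot->used = 1; /* miss: compute and overwrite whatever was in the slot */
    slot->output.hash = hash;
    slot->output.value = fn(input.value);
    cache->executions++;
    return slot->output;
}

static double double_it(double x) { return 2 * x; }
```

<p>Because the output carries its own hash, it can be fed into the next command and looked up the same way.</p>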
<h2>5. Common Neural Network Primitives</h2>
<p>The goal here is to provide a set of <strong>common neural network primitives</strong> for modeling as well as parameter updates.</p>
<h2>Supplementary Materials</h2>
<h3>Toll-Free Bridging</h3>
<p><em>Toll-free bridging</em> here means that a <code class="highlighter-rouge">ccv_dense_matrix_t</code> struct, without any conversions at all, can be cast to a <code class="highlighter-rouge">ccv_nnc_tensor_t</code> struct and then used with nnc directly. The byte pattern is specifically arranged such that a 3 dimensional <code class="highlighter-rouge">ccv_nnc_tensor_t</code> can be cast back to <code class="highlighter-rouge">ccv_dense_matrix_t</code> as well. This allows seamless integration with the rest of the image processing primitives provided by ccv.</p>
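<p>The idea of arranging the byte pattern so that two structs are interchangeable can be shown with two made-up structs. These are not the real <code class="highlighter-rouge">ccv_dense_matrix_t</code> or <code class="highlighter-rouge">ccv_nnc_tensor_t</code> layouts; <code class="highlighter-rouge">matrix_t</code>, <code class="highlighter-rouge">tensor3_t</code> and their fields are hypothetical. The sketch checks the layout at compile time and reads one struct through the other via a union.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matrix and 3-D tensor structs with matching layouts. */
typedef struct {
    int type;
    int rows;
    int cols;
    int channels;
    float* data;
} matrix_t;

typedef struct {
    int type;
    int dim[3]; /* occupies the same bytes as rows, cols, channels */
    float* data;
} tensor3_t;

/* The bridging only works if the fields line up byte for byte. */
_Static_assert(sizeof(matrix_t) == sizeof(tensor3_t), "sizes must match");
_Static_assert(offsetof(matrix_t, rows) == offsetof(tensor3_t, dim), "dimensions must line up");
_Static_assert(offsetof(matrix_t, data) == offsetof(tensor3_t, data), "data pointers must line up");

/* A union lets either view be read without copying or converting. */
typedef union {
    matrix_t matrix;
    tensor3_t tensor;
} bridged_t;

static int tensor_element_count(const tensor3_t* tensor)
{
    assert(tensor != NULL);
    return tensor->dim[0] * tensor->dim[1] * tensor->dim[2];
}
```

<p>Writing <code class="highlighter-rouge">rows</code>, <code class="highlighter-rouge">cols</code> and <code class="highlighter-rouge">channels</code> through the matrix view and then reading <code class="highlighter-rouge">dim</code> through the tensor view yields the same three numbers, with no conversion step in between.</p>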
<h3>Automatic Differentiation</h3>
<p>nnc supports reverse-mode <em>automatic differentiation</em>. The implementation is simple enough because we enforce <em>SSA</em> throughout the <strong>symbolic graph</strong>.</p>
<p>Each command needs to implement its forward function, as well as its backward function. The backward function takes the input / output of its forward function, as well as the gradients (matching the output tensors), as its input. It outputs the gradients with respect to the input (matching the input tensors of the forward function).</p>
<p>When doing <em>automatic differentiation</em> on a <strong>symbolic graph</strong>, a backward command matching each forward command is created. The execution order (dependencies) is exactly reversed. <em>SSA</em> guarantees each tensor symbol is written once, which means the gradient w.r.t. that symbol only needs to be summed once as well.</p>
<p><em>Aliases</em> introduce some complexities to the implementation. Namely, because an alias can be used as input for follow-up commands, in the reverse pass the gradients w.r.t. different <em>aliases</em> need to be summed at some point. That means these gradients may need to be zero-initialized to avoid generating garbage results. This is done by setting a zero init property on the tensor symbol, which indicates that an implicit zero init command will be injected at <strong>symbolic graph</strong> compilation time.</p>
<p>The specific implementation also means taking second-order derivatives isn’t possible with nnc at this point. It will be possible in the future, however, once the backward function can be specified by a set of forward functions and we can then do command substitution on the <strong>symbolic graph</strong>.</p>
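<p>The forward / backward pairing and the reversed execution order can be illustrated on scalars. In the sketch below, each command has one input and one output, the forward pass runs the commands in order, and the backward pass walks them in reverse while summing gradients into the input symbols. The two operations and all names are hypothetical; they only stand in for real commands.</p>

```c
#include <assert.h>

typedef enum { OP_SQUARE, OP_SCALE } op_kind_t;

/* One scalar command: reads symbol `input`, writes symbol `output`. */
typedef struct {
    op_kind_t kind;
    double scale; /* used by OP_SCALE only */
    int input;
    int output;
} scalar_cmd_t;

/* Forward pass: execute the commands in order. */
static void run_forward(const scalar_cmd_t* cmds, int count, double* values)
{
    for (int i = 0; i < count; i++) {
        assert(cmds[i].input != cmds[i].output); /* a command never overwrites its input here */
        const double x = values[cmds[i].input];
        values[cmds[i].output] = cmds[i].kind == OP_SQUARE ? x * x : cmds[i].scale * x;
    }
}

/* Backward pass: execute the matching backward commands in reverse order.
 * grads[] must be zero everywhere except for the seed at the final output. */
static void run_backward(const scalar_cmd_t* cmds, int count, const double* values, double* grads)
{
    for (int i = count - 1; i >= 0; i--) {
        const double x = values[cmds[i].input];  /* forward input */
        const double g = grads[cmds[i].output];  /* gradient w.r.t. the output */
        const double local = cmds[i].kind == OP_SQUARE ? 2 * x : cmds[i].scale;
        grads[cmds[i].input] += local * g; /* sum contributions from every consumer */
    }
}
```

<p>For <code class="highlighter-rouge">z = 3 * (x * x)</code> at <code class="highlighter-rouge">x = 2</code>, seeding the gradient of <code class="highlighter-rouge">z</code> with 1 and running the backward pass leaves 12 in the gradient slot of <code class="highlighter-rouge">x</code>, matching the derivative <code class="highlighter-rouge">6 * x</code>.</p>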
<h3><code class="highlighter-rouge">while</code> Type Sub-Graph</h3>
<p>The <em><code class="highlighter-rouge">while</code> type sub-graph</em> is a special type of <strong>symbolic graph</strong> or <strong>computation graph</strong>. It expresses a generic loop structure with a custom evaluation function supplied.</p>
<p>The loop execution within a <em><code class="highlighter-rouge">while</code> type sub-graph</em> looks like this:</p>
<ol>
<li>The sub-graph starts the execution from a set of source command instances;</li>
<li>It proceeds either serially or in parallel until all evaluation command instances have executed. The subsequent command instances are put on hold;</li>
<li>The evaluation function is called, and depending on the result, the execution within the sub-graph will either abort (break), or continue until all the destination command instances have been reached and executed;</li>
<li>Once all destination command instances have been reached and executed, we start from step 1 again.</li>
</ol>
<p>For a <em><code class="highlighter-rouge">while</code> type symbolic sub-graph</em>, the obvious question is how the <em>SSA</em> rule plays out in the loop structure. We allow the sub-graph to specify that certain output tensor symbols carry over to the input tensor symbols in the next round, which practically makes these input tensor symbols parameters. The <em>compilation</em> step will handle this properly and allocate the input tensors at the same memory locations as the output tensors (there is a <code class="highlighter-rouge">ccv_nnc_tensor_multiview_t</code> workaround if the condition cannot be satisfied).</p>
<p>When doing <em>automatic differentiation</em>, a <code class="highlighter-rouge">ccv_nnc_tensor_tape_t</code> needs to be provided for the <em><code class="highlighter-rouge">while</code> type sub-graph</em> to record the outputs properly.</p>
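<p>The loop steps and the carry-over of outputs to the next round’s inputs can be sketched as follows. The sketch is a simplification under stated assumptions: no commands run before the evaluation function, the loop body is a single function, the state is a scalar, and an iteration cap is added for safety. All names are hypothetical.</p>

```c
#include <assert.h>

/* Loop state: one input symbol and the output that carries over to it. */
typedef struct {
    double input;
    double output;
} loop_state_t;

typedef int (*loop_eval_f)(const loop_state_t* state, int iteration);
typedef void (*loop_body_f)(loop_state_t* state);

/* Run the body while the evaluation function returns non-zero.
 * Returns the number of completed rounds. */
static int run_while(loop_state_t* state, loop_eval_f eval, loop_body_f body, int max_iterations)
{
    int iteration = 0;
    assert(state != 0 && eval != 0 && body != 0);
    while (iteration < max_iterations) {
        if (!eval(state, iteration))
            break; /* the evaluation function says stop: abort this round */
        body(state); /* the remaining commands, up to the destinations */
        state->input = state->output; /* carry the output over as the next input */
        iteration++;
    }
    return iteration;
}

/* Example evaluation function and body: keep doubling while below 100. */
static int below_100(const loop_state_t* state, int iteration) { (void)iteration; return state->input < 100; }
static void double_input(loop_state_t* state) { state->output = state->input * 2; }
```

<p>In this sketch the carry-over is an explicit copy. The text above describes avoiding that copy by placing the input and the output at the same memory location.</p>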
<h3><code class="highlighter-rouge">case..of</code> Type Sub-Graph</h3>
<p>The <em><code class="highlighter-rouge">case..of</code> type sub-graph</em> is another special type of <strong>symbolic graph</strong> or <strong>computation graph</strong>. It expresses a generic branch structure with a custom evaluation function supplied.</p>
<p>The <em><code class="highlighter-rouge">case..of</code> type sub-graph</em> contains several separate sub-graphs identified by indexes from 0 to n. It executes as follows:</p>
<ol>
<li>The evaluation function is called. If the result is >= 0, the sub-graph with that index is selected for execution; otherwise, jump to step 3;</li>
<li>The selected sub-graph is executed from beginning to end;</li>
<li>If the result is < 0, no sub-graph is executed.</li>
</ol>
<p>For a <em><code class="highlighter-rouge">case..of</code> type symbolic sub-graph</em>, if a tensor symbol is <em>written-once</em>, how do we proceed if all sub-graphs are skipped (in the typical case, if a sub-graph is executed, the tensor you want will presumably be written by a command in that sub-graph)? We allow you to specify, for these output tensor symbols, which symbol from the input can be supplied as a <em>replacement</em>. The <em>compilation</em> step will ensure a <code class="highlighter-rouge">ccv_nnc_tensor_multiview_t</code> is created to handle these cases.</p>
<p>When doing <em>automatic differentiation</em>, a <code class="highlighter-rouge">ccv_nnc_tensor_tape_t</code> needs to be provided for the <em><code class="highlighter-rouge">case..of</code> type sub-graph</em> to record the outputs properly.</p>
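<p>The branch selection and the replacement rule can be sketched with function pointers standing in for sub-graphs. This is a scalar toy model with hypothetical names: a negative selection skips every branch and hands back the input as the replacement for the output.</p>

```c
#include <assert.h>

/* A branch stands in for one sub-graph: it maps the input to an output. */
typedef double (*branch_f)(double input);

/* Run the branch picked by `selected` (the evaluation result). When the
 * result is negative, no branch runs and the input is supplied as the
 * replacement for the output. */
static double run_case_of(const branch_f* branches, int branch_count, int selected, double input)
{
    if (selected < 0)
        return input; /* all sub-graphs skipped */
    assert(selected < branch_count);
    return branches[selected](input);
}

static double negate(double x) { return -x; }
static double add_one(double x) { return x + 1; }

/* Example branch table for the sketch. */
static const branch_f example_branches[] = { negate, add_one };
```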
<h3>Limits and Constraints</h3>
<ol>
<li>
<p>A tensor supports up to 8 dimensions. This is defined in <code class="highlighter-rouge">CCV_NNC_MAX_DIM_ALLOC</code>.</p>
</li>
<li>
<p>A tensor’s dimension can only reach up to <code class="highlighter-rouge">INT_MAX</code>. That may be a limiting factor for some tensors if they need more than 8GiB (32-bit floating point assumed) on one dimension.</p>
</li>
<li>
<p>The limit on the number of input and output tensors is <code class="highlighter-rouge">INT_MAX</code>. To perform <em>automatic differentiation</em> properly, this number drops to <code class="highlighter-rouge">floor(INT_MAX / 3)</code>. However, for more than 64 parameters, internal heap allocations are required, which makes previously deterministic execution non-deterministic (it may take arbitrarily long depending on the <code class="highlighter-rouge">malloc</code> you use).</p>
</li>
<li>
<p>The allocated tensor size can go up to <code class="highlighter-rouge">min(UINT64_MAX, SIZE_MAX)</code>.</p>
</li>
<li>
<p>A computation can depend on no more than <code class="highlighter-rouge">2^16</code> other computations. This is determined by a core macro <code class="highlighter-rouge">CCV_NNC_GRAPH_VISIT</code>.</p>
</li>
<li>
<p>The sub-graph can go as deep as <code class="highlighter-rouge">2^(31 - 4)</code>, otherwise the outer-most while count tensor cannot be referenced by the inner-most sub-graph.</p>
</li>
<li>
<p>The maximum number of GPU devices per machine or NUMA nodes per machine is 4095. This is defined in <code class="highlighter-rouge">CCV_COMPUTE_DEVICE_ANY</code>.</p>
</li>
</ol>Liu LiuWhat’s NNC? NNC is the natural progression against ccv_convnet, which is a couple of years old now. ccv_convnet’s monolithic, single path neural layer design didn’t really feel right with more advanced network architectures. NNC took some good ideas from more recent neural network frameworks and did a long re-think on how to achieve both efficiency and expressiveness. The design itself is layered. At the highest layer, you have ordinary neural network primitives that reflect real-world usage such as Inception module, LSTM, RNN et al. At the lowest layer, depending on the infrastructure, it maps to allocated tensors on GPU, computations backed by CuDNN, and computation graphs driven with CUDA streams (or you can exchange that with CPU, Metal, and libdispatch). For the abstractions in between, there are trade-offs and constraints to accommodating both the library design and usage. In a few sentences, here is how the layered design works. NNC starts with tensors, commands and streams, which closely map to low level computation primitives. On top of that, a concrete computation graph can be constructed and executed directly. Above that, a symbolic graph can express everything about the concrete computation graph, without actual tensor and stream allocations. A dynamic graph can contain both symbolic graph representation and the concrete computation graph, thus, carries out the computation immediately while retain the ability to do series of processes on top of the symbolic graph representation. Familiar primitives such as LSTM or RNN then were built on top of either the symbolic graph or the dynamic graph constructs. There are roughly 5 layers built on top of each other. 1. Tensors, Commands and Streams Tensors are multi-dimensional arrays at its basic level. Commands are ops (operations) in other framework’s terminology. Streams are the synchronization mechanism. Each command instance executed serially on a given stream. 
Different command instances on different streams will be scheduled in parallel if the underlying infrastructure permits. A command is identified by its cmd identifier. It processes a set of input tensors, and write output to a set of output tensors. There is no limits on how many input tensors it can accept or how many output tensors it can write to. A command can only have one set of attributes (recognized by NNC) specified. These attributes (such as whether this can be an inplace operation) help on symbolic processes. If you find that you need to implement the same command but these attributes cannot be hold, you need to rename the command to avoid invalid symbolic processes. One command, however, can be backed by several backend implementations. Command backend implementors, besides the ones who implement *_REF free to only support specific cases of the input tensors (for example, a particular tensor layout, or a specific tensor size (3x3?), or half precision numbers). But once a backend accepts the input, it follows exactly the command attributes specified above (for example, any backend that implements a inplace command, will allow any parts of its input to be overwritten by this command at time while this command is executing without affecting the correctness of the output). At runtime, a command will select the appropriate backend based on the input type and execution time. 2. Computation Graph Computation graph expresses how the computation carries out. The output tensors can be used as input for the next command, so on and so forth. That is where TensorFlow got its name from. At this layer, computation graph knows the execution orders (data dependencies) between each command instances, and will schedule them on proper streams to ensure these execution orders are respected. Tensors themselves are not associated with the execution order at this point. A computation graph can contain a sub-graph, which is a computation graph itself. 
It is executed as a single command instance by the parent computation graph. As of now, a while type sub-graph (for looping) and a case..of type sub-graph (for branching) are supported. A computation graph can be auto-tuned to find the best backend implementations that minimize the total execution time. There may be future optimizations to allow modifying the graph itself to do more aggressive tuning (such as including tensor conversions to trade between slower implementation and conversion + faster implementation). In short, once you have a computation graph, the computation can be carried out naturally because there is no extra assumptions about execution environment and no more parameters or allocations need to be specified. 3. Symbolic Graph Symbolic graph expresses commands, the associated tensors and their execution orders (dependencies). This may sound very similar to the computation graph above, but there are several important differences: there is no concept of stream at this point, because the symbolic graph doesn’t carry out the actual computation, and stream can be determined purely by the execution order; there is no tensor allocation. Symbolic graph uses the tensor metadata (layout, precision, dimensions, even which GPU it is associated with), but no actual allocation took place until it is compiled to a computation graph; There is no 1:1 mapping guarantee about the commands in the symbolic graph with the command instances in the computation graph. In fact, symbolic graph doesn’t take tensors. It takes tensor symbols. The tensor symbol usage within the symbolic graph follows strict static single assignment (SSA) rule. It can only be used as a command instance’s output once. This is important because by following SSA, potential data races are completely eliminated. More over, certain processes and the actual tensor allocation algorithm are much easier to implement with this assumption. 
With SSA rule, the execution orders (dependencies) can be generated trivially. It may feel like the tensor metadata is over-specified. For example, why precision or layout, or which GPU it resides is relevant? Because tensor symbols have many to 1 mapping with the actual tensors. Specifications on the tensor symbol avoid processes on the symbolic graph resulting a tensor symbol that needs to be backed with conversions. Any conversions on the symbolic graph has to be explicit command instances. Having that in mind, however, you can take an alias of a tensor symbol, which is a sliced / reshaped tensor symbol from the original. It allows several operations to be zero effort on the actual computation graph. The alias itself still have to follow the same SSA rule, which means all the aliases and the original tensor symbol can only be written once (if two aliases as outputs point to non-overlapping parts of the original tensor, the written-once rule is not violated). Processes can be carried out on the symbolic graph ranging from automatic differentiation, to common sub-expression elimination (CSE), or operator fusion (finer-grained set of commands be replaced by a combined command implementation). When the actual computation is needed. A symbolic graph can be compiled to a computation graph. The compilation process can involve optimizations that previously already possible on the given computation graph (such as CSE). More importantly, this step performs additional optimization passes that will violate the SSA rule above. Currently, it will perform following processes that are not available as pure optimization passes: In-place safe command instance will operate on the same tensor symbol inputs / outputs whenever possible (for example, 1.23 * x => y will be re-written to 1.23 * x => x if no other places use x); Tensor allocation based on the liveness analysis for the tensor symbols. 
This step will generate the many to 1 mapping between tensor symbols with the actual tensors; Emit implicit commands for tensor initialization. Certain tensor symbols need to be initialized before use (zero init for now), which is impossible to know when until tensor allocation was taken place. This is one reason why there is no 1:1 mapping between symbolic graph and computation graph. All above steps are carried out recursively for its while / case..of type sub-graphs too. 4. Dynamic Graph Dynamic graph operates on concrete tensor instances. It took input tensors, executed a command on them, and took the outputs. From this perspective, it is very similar to the computation graph. The conceptual difference, is that the computation graph carries out execution from a specification, while dynamic graph forms a specification from the actual execution. Thus, dynamic graph will construct a symbolic graph along its execution. It enables the dynamic graph to perform the same kind of sophisticated optimization passes and analysis once needed (such as automatic differentiation) More over, dynamic graph implements a simple memorization mechanism. The tensors it uses will carry a hash, as well as a specific command. The output tensors can be retrieved from the cache by the generated hash if it is possible, to avoid repetitive computations. 5. Common Neural Network Primitives The goal here is to provide a set of common neural network primitives for modeling as well as parameter updates. Supplementary Materials Toll-Free Bridging Toll-free bridging here means that a ccv_dense_matrix_t struct, without any conversions at all, can be cast to a ccv_nnc_tensor_t struct and then used with nnc directly. The byte pattern is specifically arranged such that a 3 dimensional ccv_nnc_tensor_t can be cast back to ccv_dense_matrix_t vice versa. This allows seamless integration with the rest of image process primitives provided by ccv. 
Automatic Differentiation Automatic differentiation supported by nnc is its reverse mode. The implementation is simple enough because we enforced SSA throughout the symbolic graph. Each command need to implement its forward function, as well as its backward function. The backward function takes the input / output of the its forward function, as well as the gradients (matching the output tensors) as its input. It outputs the gradients with respect to the input (matching the input tensors of the forward function). When doing automatic differentiation, from its symbolic graph, a backward command matching each forward command is created. The execution order (dependencies) is exactly reverse. SSA guarantees each tensor symbol is written once, that means the gradient w.r.t. that symbol needs to only be summed once as well. alias introduced some complexities to the implementation. Namely, because an alias can be used as input for follow-up commands, its reverse suggests different gradients w.r.t. different aliases required to be summed at certain point. That means these gradients need to be potentially zero init to avoid generating garbage results. This is done by inserting zero init tensor symbol property, which indicated an implicit zero init command will be injected at symbolic graph compilation time. The specific implementation also means taking second order derivative isn’t possible with nnc at this point. It will be possible however in the future once the backward function can be specified by a set of forward functions and then we can do command substitution on the symbolic graph. while Type Sub-Graph The while type sub-graph is a special type of a symbolic graph or a computation graph. This is because it expresses a generic loop structure with custom evaluation function supplied. 
The loop execution within a while type sub-graph looks like this: The sub-graph starts the execution from a set of source command instances; It proceeds either serially or in parallel until all evaluation command instances executed. The subsequent command instances are on hold; The evaluation function is called, and depends on the result, the execution within the sub-graph will either abort (break), or continue, until all the destination command instances executed and reached; Once all destination command instances executed and reached, we will start from step 1. again. For while type symbolic sub-graph, the obvious question would be how SSA rule plays out in the loop structure. We allow in the sub-graph to specify certain output tensor symbols carry over to the input tensor symbols in the next round, practically made these input tensor symbols parameters. The compilation step will handle this properly and allocate the input tensors at the same memory locations as the output tensors (there are ccv_nnc_tensor_multiview_t workaround if the condition cannot be satisfied). When doing automatic differentiation, a ccv_nnc_tensor_tape_t need to be provided for the while type sub-graph to record the outputs properly. case..of Type Sub-Graph The case..of type sub-graph is another special type of a symbolic graph or a computation graph. It expresses a generic branch structure with custom evaluation function supplied. The case..of type sub-graph contains several separate sub-graphs identified by indexes from 0 to n: The evaluation function is called, if the result is >= 0, a sub-graph is selected for execution, otherwise, jump to step 3.; The selected sub-graph executed from beginning to end; If the result is < 0, no sub-graph executed. For case..of type symbolic sub-graph, if a tensor symbol is written-once, how to proceed if all sub-graphs skipped (in typical case, if a sub-graph executed, presumably, the tensor you want will be written by a command in that sub-graph)? 
We allow you to specify, for these output tensor symbols, which symbol from the input can be supplied as a replacement. The compilation step will ensure a ccv_nnc_tensor_multiview_t is created to handle these cases. When doing automatic differentiation, a ccv_nnc_tensor_tape_t needs to be provided for the case..of type sub-graph to record the outputs properly. Limits and Constraints A tensor itself supports up to 8 dimensions. This is defined in CCV_NNC_MAX_DIM_ALLOC. Each tensor dimension can only go up to INT_MAX. That may be a limiting factor for some tensors if they need more than 8GiB (32-bit floating point assumed) on one dimension. The limit on the number of input and output tensors is INT_MAX. To perform automatic differentiation properly, this number drops to floor(INT_MAX / 3). However, for more than 64 parameters, internal heap allocation is required, which makes previously deterministic execution non-deterministic (it may take arbitrarily long depending on the malloc you use). The allocated tensor size can go up to min(UINT64_MAX, SIZE_MAX). A computation can depend on no more than 2^16 other computations. This is determined by a core macro CCV_NNC_GRAPH_VISIT. The sub-graph can go as deep as 2^(31 - 4), otherwise the outer-most while count tensor cannot be referenced by the inner-most sub-graph. The maximum number of GPU devices per machine or NUMA nodes per machine is 4095. This is defined in CCV_COMPUTE_DEVICE_ANY.
The NNC Tensor Allocation Algorithm2018-04-27T00:00:00+00:002018-04-27T00:00:00+00:00https://libnnc.org/2018/04/27/nnc-alloc<p>Today, most neural network computations are organized as an acyclic graph, with each node representing a computation with a list of tensors (or multi-dimensional arrays) associated with it. Certain modifications are added to support control structures such as <em>if</em> or <em>do…while</em> loops.
Given that most implementations represent the acyclic graph in symbolic form (thus, no computation has been executed and no tensor has been allocated), a comprehensive and efficient allocation algorithm is desirable, and has been shown to improve not only space utilization but also speed (due to data locality).</p>
<p>The NNC tensor allocation algorithm is one take on the problem. It has the following properties:</p>
<ol>
<li>
<p>Treat each tensor as a region of memory, which enables reusing part of previously allocated memory;</p>
</li>
<li>
<p>Support loop structures: if there is a while loop, this allocation algorithm will handle tensor reuse properly, without introducing any extra data transfer operations;</p>
</li>
<li>
<p>Enable efficient memory reuse when branching.</p>
</li>
</ol>
<h2>Tensor Representation</h2>
<p>To simplify the description, we will assume tensors are represented as a contiguous memory region. An extension of this algorithm allows an “alias” representation for a given tensor, that is, one pointing to a sub-range of the memory region that the tensor represents.</p>
<p>Each tensor is represented symbolically. Static single assignment form is assumed, thus, a tensor can be assigned as a node output only once.</p>
<h2>Loop Representation</h2>
<p>A <em>do…while</em> loop is represented as a sub-graph of the existing graph. The sub-graph is represented as one node in the parent graph. All inputs for this while loop are captured as node inputs, and the outputs for this while loop are captured as node outputs. A condition is evaluated at the beginning of each round of the loop, and at the end of each round, inputs are updated with the new outputs based on the specification (that is, which input is replaced by which output). Although not proved mathematically, this should enable all types of <em>do…while</em> loop construction.</p>
<h2>The Problem Definition</h2>
<p>Before we get into the details of the NNC tensor allocation algorithm, let’s get the problem definition straight. Given a graph with the tensors and loops described above, the problem asks us to assign <code class="highlighter-rouge">n</code> tensors to one memory region buffer such that, for each node operation, the input tensors and the output tensors are assigned non-overlapping memory regions. Each of the <code class="highlighter-rouge">n</code> tensors also has an offset and a size that denote where within the buffer its memory region lies. We want to find an arrangement that makes the buffer as small as possible.</p>
<p>It is easy to see this problem is NP-Complete. Therefore, the challenge is to find a good enough approximation to the optimal solution.</p>
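<p>To make the objective concrete, here is a minimal sketch that checks whether a proposed assignment respects the non-overlap constraint and reports the resulting buffer size. It is an illustration only, not NNC code: the function names, the dictionary layout, and the example tensors are all assumptions.</p>

```python
def overlaps(a, b):
    """True if two (offset, size) regions share at least one byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def is_valid(assignment, interferes):
    """assignment maps tensor name -> (offset, size); interferes lists name pairs."""
    return not any(overlaps(assignment[a], assignment[b]) for a, b in interferes)


def buffer_size(assignment):
    """Smallest buffer that holds every assigned region."""
    return max(offset + size for offset, size in assignment.values())


# Hypothetical example: y interferes with both x and z, while z may reuse x's memory.
interferes = [("x", "y"), ("y", "z")]
assignment = {"x": (0, 64), "y": (64, 32), "z": (0, 48)}
print(is_valid(assignment, interferes), buffer_size(assignment))
```

<p>Here <code class="highlighter-rouge">z</code> sits on top of the bytes previously used by <code class="highlighter-rouge">x</code>, which is the partial reuse the allocation algorithm tries to find automatically.</p>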
<h2>The Core Algorithm</h2>
<p>Before stating the core algorithm, there are a few principles we want to follow, and hopefully you will find these principles make sense.</p>
<ol>
<li>
<p>Deterministic and reproducible allocations;</p>
</li>
<li>
<p>Handle common neural network structures well (ResNet, DenseNet, LSTM etc.);</p>
</li>
<li>
<p>Handle ties well, and have a well-reasoned way to break the tie.</p>
</li>
</ol>
<p>With these in mind, I will first discuss the basic structure of the algorithm, then some alternatives we could have taken and why we did not pursue them. Afterwards, I will discuss one important extension of this algorithm to support <em>do…while</em> loop construction.</p>
<h2>Basic Structure</h2>
<p>The algorithm models the above problem with a so-called interference graph, which is widely known from register allocation. In this graph, a node represents a tensor. If there is an edge between two tensors, these two tensors have to be allocated to non-overlapping memory regions.</p>
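<p>As a rough illustration, an interference graph can be derived from tensor live ranges. The sketch below assumes nodes run serially in the listed order and that two tensors interfere whenever their live ranges intersect; both are simplifying assumptions for this example, and the function names are hypothetical.</p>

```python
from itertools import combinations


def live_ranges(nodes):
    """nodes: list of (inputs, outputs) in execution order -> name -> (first, last) step."""
    ranges = {}
    for step, (inputs, outputs) in enumerate(nodes):
        for name in list(inputs) + list(outputs):
            first, _ = ranges.get(name, (step, step))
            ranges[name] = (first, step)
    return ranges


def interference_graph(nodes):
    """Return the set of (a, b) pairs, a < b, whose live ranges intersect."""
    ranges = live_ranges(nodes)
    edges = set()
    for a, b in combinations(sorted(ranges), 2):
        (a0, a1), (b0, b1) = ranges[a], ranges[b]
        if a0 <= b1 and b0 <= a1:
            edges.add((a, b))
    return edges


# A chain x -> y -> z -> u: x is dead once y is computed, so z may reuse x's memory.
nodes = [(["x"], ["y"]), (["y"], ["z"]), (["z"], ["u"])]
edges = interference_graph(nodes)
print(sorted(edges))
```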
<p>The interference graph captures the non-overlapping constraints; however, the partial reuse of tensors is under-specified by the interference graph. Thus, we have our first structure to represent the constraints, and now we need a second structure to represent the solution.</p>
<p>The second structure is an acyclic graph with edge weights (the allocation graph). It has one source node that represents the memory region buffer. Each directed edge carries a weight that represents an allocation of <code class="highlighter-rouge">x</code> bytes passed from the edge’s source node to its destination node. There is one dummy sink node that represents the buffer when its allocation is reclaimed. In this structure, two tensors can be connected only if they don’t interfere with each other, so that the destination tensor can reuse part of the memory region of the source tensor.</p>
<p>Based on the second structure, an iterative construction of the solution is to insert tensor nodes into the acyclic graph, with an infinite reservoir of outflow from the source node to the sink node, until all tensor nodes are inserted. At that point, the size of the buffer is the sum of the weights of the source node’s outgoing edges. Our algorithm is now reduced to candidate tensor selection when forming this graph structure.</p>
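<p>A small hand-built allocation graph may help here. In this sketch (the dictionary representation and the tensor names are assumptions), tensor <code class="highlighter-rouge">b</code> reuses all 64 bytes of <code class="highlighter-rouge">a</code> and takes the remaining 32 bytes directly from the source, so the buffer only needs 96 bytes rather than 160.</p>

```python
def inflow(edges, node):
    """Total bytes flowing into a node of the allocation graph."""
    return sum(w for (src, dst), w in edges.items() if dst == node)


def outflow(edges, node):
    """Total bytes flowing out of a node of the allocation graph."""
    return sum(w for (src, dst), w in edges.items() if src == node)


sizes = {"a": 64, "b": 96}  # a and b do not interfere
edges = {
    ("source", "a"): 64,  # a takes 64 bytes from the buffer
    ("a", "b"): 64,       # b reuses all of a's bytes once a is dead
    ("source", "b"): 32,  # b is larger, so the difference comes from the source
    ("b", "sink"): 96,    # everything is reclaimed at the end
}
total = outflow(edges, "source")
print(total)
```

<p>Every tensor node receives exactly its own size and passes the same amount on, which is what keeps the weights on the source node’s outgoing edges equal to the buffer size.</p>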
<h2>Candidate Selection</h2>
<p>A set of candidates is maintained for the graph insertion. Each candidate is a tuple of tensors (at most 3) that don’t interfere with each other. The candidate selection algorithm works like this:</p>
<ol>
<li>
<p>Go through all tensors that haven’t been inserted and find the tensor that has the largest number of edges in the interference graph. If multiple tensors have the same number of edges, add them all to the set of candidates;</p>
</li>
<li>
<p>For each candidate tensor found in step 1, try to find another tensor that doesn’t interfere with it and is larger than the candidate tensor. Make them a tuple and add it to the set of candidates;</p>
</li>
<li>
<p>Order the set of candidates first by the maximum size of the tensors in the tuple, then by the total number of edges that the tensors in the tuple have on the interference graph;</p>
</li>
<li>
<p>Go through the ordered set of candidates and try to find an edge on the allocation graph such that neither the source nor the destination of the edge interferes with any tensor in the tuple, and neither the source nor the destination is the dummy source or sink node. If such an edge is found, we have found our candidate;</p>
</li>
<li>
<p>If no candidate has such an edge in the allocation graph, we select the first candidate.</p>
</li>
</ol>
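<p>Steps 1 to 3 above can be sketched as follows. This is a simplified illustration rather than the NNC implementation: it only builds tuples of up to two tensors, it adds every qualifying partner in step 2, and it assumes both sort keys are descending. All names are hypothetical.</p>

```python
def select_candidates(sizes, edges, inserted):
    """Build and order candidate tuples for tensors that are not yet inserted."""
    degree = {t: 0 for t in sizes}
    neighbors = {t: set() for t in sizes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining = [t for t in sorted(sizes) if t not in inserted]
    top = max(degree[t] for t in remaining)
    # Step 1: every remaining tensor tied for the most interference edges.
    candidates = [(t,) for t in remaining if degree[t] == top]
    # Step 2: pair each of them with a larger tensor that does not interfere with it.
    for (t,) in list(candidates):
        for other in remaining:
            if other != t and other not in neighbors[t] and sizes[other] > sizes[t]:
                candidates.append((t, other))
    # Step 3: larger maximum size first, then more interference edges in total.
    candidates.sort(
        key=lambda c: (-max(sizes[t] for t in c), -sum(degree[t] for t in c))
    )
    return candidates


sizes = {"x": 64, "y": 32, "z": 48, "u": 128}
edges = [("x", "y"), ("y", "z"), ("u", "z")]
print(select_candidates(sizes, edges, inserted=set()))
```

<p>Steps 4 and 5 would then walk this ordered list against the allocation graph to pick the tuple that is actually inserted.</p>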
<h2>Insertion</h2>
<p>The selected tuple of tensors then needs to be inserted into the allocation graph. The insertion is straightforward.</p>
<ol>
<li>
<p>During selection, an edge may already have been picked. If not, we make a new edge from the dummy source node to the dummy sink node, with the maximum size among the tuple of tensors as its weight.</p>
</li>
<li>
<p>The tensors in the tuple are ordered by their order of access. The order must be available on the computation graph, otherwise these tensors would interfere with each other.</p>
</li>
<li>
<p>The weight of the previous edge is decreased by the maximum size among the tuple of tensors.</p>
</li>
<li>
<p>An edge from the previous edge’s source to the first tensor is inserted. The weight on the edge will be the size of the first tensor.</p>
</li>
<li>
<p>An edge from the first tensor to the second tensor is inserted. If the size of the second tensor is larger than that of the first tensor, the weight on the new edge will be the size of the first tensor, and another edge is inserted from the source to the second tensor with the difference as its weight. Otherwise, the weight on the new edge will be the size of the second tensor.</p>
</li>
<li>
<p>Similarly, for a third tensor, edges from the first tensor, the second tensor, or the source will be inserted with the respective weights.</p>
</li>
<li>
<p>Finally, edges from all tensors to the destination will be inserted with the remaining weights.</p>
</li>
</ol>
<p>Repeat the above until all tensors are connected in the allocation graph.</p>
<h2>Intuition</h2>
<p>Going with the tensor that has the most interference is a common greedy strategy in register allocation. It removes most of the uncertainty that we would otherwise need to branch over.</p>
<p>However, unlike register allocation, in tensor computation graphs there are fewer cases where one tensor spans a large chunk of the computation, especially in the inference stage. Thus, a lot of tensors will have an identical number of edges in the interference graph. For these cases, how to break the tie is crucial.</p>
<p>For our allocation algorithm, the allocation size is used as the tie-breaker. If allocation size is applied naively as the second sorting key, you may still find a lot of ties in tensor computation graphs. This is because tensors with similar life-spans tend to have similar usage and, thus, similar dimensionality. For a large class of neural networks, we found that by pairing up the tensor that has the most interference with a tensor of larger size (the two must not interfere with each other), we are more likely to reach the trivial solution.</p>
<h2>Loop</h2>
<p>Tensor allocation with loops requires a very specific definition of what a loop is. More broadly speaking, the types of control structure that a computation graph supports are directly relevant to the allocation algorithm. The loops we are specifically concerned with are the ones with one conditional statement to exit the loop (the traditional while-loop). For the NNC tensor allocation algorithm to work, a new construct, called the multi-view tensor, needs to be introduced. Alternatively, the algorithm introduced here is applicable to a specific kind of loop that contains multiple conditional exits and phi functions.</p>
<p>If you accept that some data transfer is required for a loop to work, loop handling for the tensor allocation algorithm is trivial. <strong>A loop can be considered as a sub-computation graph</strong>, and the same allocation algorithm can be applied to the sub-computation graph. When we reach the end of the graph and need to loop over again, data can then be transferred to the parameters.</p>
<p>For example, if you have:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(while (c < 5) { // c is the loop counter
y = Convolution(x, w, b)
})(x <= y) // This syntax means in the next loop, x will contain the content of y, you can think of this as x = Convolution(x, w, b), but such representation is forbidden in static single assignment form.
</code></pre></div></div>
<p>The tensor allocation algorithm is trivial if we accept that we need to transfer data from <code class="highlighter-rouge">y</code> to <code class="highlighter-rouge">x</code> every time. In this section, however, we will discuss how to completely eliminate such data transfer with a novel and generic tensor allocation scheme.</p>
<h2>Multi-view Tensor</h2>
<p>This is a special tensor with a nested structure. A leaf multi-view tensor can point to multiple memory regions based on the loop counter. In particular, a multi-view tensor can be configured with a repeat length. Its pointer is updated to the correct memory region prior to the actual computation in each round: <code class="highlighter-rouge">ptr = ptrs[loop_counter % repeat_length]</code>. There are some complications, such as the support for two types of multi-view tensors. Type I is the one described above. Type II has a special memory region that is only used when <code class="highlighter-rouge">loop_counter == 0</code>.</p>
<p>A multi-view tensor can point not only to memory regions, but also to a set of other multi-view tensors following the same semantics, hence the nested structure.</p>
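<p>The pointer selection for a leaf multi-view tensor can be sketched like this. The class is a toy stand-in, not <code class="highlighter-rouge">ccv_nnc_tensor_multiview_t</code>, and how Type II indexes its regions after the first round is an assumption made for this example; nesting is left out.</p>

```python
class MultiViewTensor:
    """Toy leaf multi-view tensor (hypothetical, not the NNC structure)."""

    def __init__(self, ptrs, first=None):
        self.ptrs = ptrs                # regions cycled through by the loop counter
        self.repeat_length = len(ptrs)
        self.first = first              # Type II only: region used when loop_counter == 0

    def resolve(self, loop_counter):
        if self.first is None:          # Type I
            return self.ptrs[loop_counter % self.repeat_length]
        if loop_counter == 0:           # Type II, first round
            return self.first
        # Assumption: after the first round, Type II cycles like Type I, shifted by one.
        return self.ptrs[(loop_counter - 1) % self.repeat_length]


type1 = MultiViewTensor(["A", "B"])
type2 = MultiViewTensor(["A", "B"], first="X")
print([type1.resolve(c) for c in range(4)])  # ['A', 'B', 'A', 'B']
print([type2.resolve(c) for c in range(4)])  # ['X', 'A', 'B', 'A']
```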
<h2>Loop with Efficient Tensor Allocation</h2>
<p>The above are all the constructs we need to implement an efficient tensor allocation algorithm (efficient here means no data transfer is required).</p>
<p>For each parameter, we first identify whether co-allocating it with the output that replaces it, in the same memory region, is sufficient. In some cases it is, so we can simply do that and then apply our tensor allocation algorithm to the sub-computation graph.</p>
<p>However, in some cases (like the contrived case we made above), it is not possible. For these, we need to <em>unroll</em> the loop.</p>
<p>For example, the above loop unrolled will be:</p>
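<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while (a < 5) { // a is the loop counter
z = Convolution(x, w, b)
d = a + 1 // the counter gets its own name so it no longer clashes with the bias b
if (d >= 5) exit // the extra conditional exit between the two unrolled rounds
y = Convolution(z, w, b)
c = d + 1
}(x <= y, a <= c)
</code></pre></div></div>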
<p>One extra conditional exit is added to make the loop semantically equivalent to the one we had before.</p>
<p>When the loop is unrolled as above, we can see that, for this particular case, <code class="highlighter-rouge">y</code> can now be co-allocated with <code class="highlighter-rouge">x</code> (they do not interfere with each other).</p>
<p>It can be proved that any loop can be unrolled into a form in which the parameters can be co-allocated. It is left as an exercise for readers to use this to tackle something like <code class="highlighter-rouge">x[c] = Convolution(x[c - 4], w, b)</code>, which requires accessing a variable from several rounds before.</p>
<p>Once a loop can co-allocate all its parameters after unrolling, we can apply the tensor allocation algorithm on the unrolled computation graph.</p>
<p>The allocation on the unrolled computation graph can then be used to create the multi-view tensors. The repeat length of a multi-view tensor corresponds to how many times we unrolled the loop. Each memory region points to the corresponding tensor on the unrolled computation graph as well.</p>
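<p>To see why this removes the data transfer, the sketch below runs the example loop twice: once with an explicit transfer from <code class="highlighter-rouge">y</code> to <code class="highlighter-rouge">x</code> each round, and once with two memory regions selected by the loop counter, as a multi-view tensor with repeat length 2 would do. The <code class="highlighter-rouge">step</code> function is only a stand-in for <code class="highlighter-rouge">Convolution</code>, and all names are assumptions.</p>

```python
def step(value):
    """Stand-in for Convolution(x, w, b); any pure function works for the sketch."""
    return 2 * value + 1


def run_with_copy(x0, rounds):
    """Naive loop: y is transferred back into x at the end of every round."""
    x, copies = x0, 0
    for _ in range(rounds):
        y = step(x)
        x = y
        copies += 1
    return x, copies


def run_unrolled(x0, rounds):
    """Two regions shared by x, z and y; the views alternate, so nothing is copied."""
    regions = [x0, None]
    for c in range(rounds):
        src = regions[c % 2]              # input view: ptrs[loop_counter % repeat_length]
        regions[(c + 1) % 2] = step(src)  # output view points at the other region
    return regions[rounds % 2], 0


print(run_with_copy(1, 5), run_unrolled(1, 5))
```

<p>Both variants compute the same value; the second one only changes which region each view points to in a given round.</p>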
<h2>Sub-Computation Graph</h2>
<p>Tensor allocation for a sub-computation graph generates a number of buffers and the size of each buffer. These are used as regular tensors in the parent computation graph. The whole allocation algorithm then becomes recursive.</p>
<h2>Conclusion</h2>
<p>I believe the above algorithm is the first to address the tensor allocation problem with partial memory reuse and loop efficiency in mind. This algorithm is also presented as an extensible framework that can be considered in the future to support more control structures.</p>Liu Liu
NNC Dynamic Graph Execution2018-04-26T00:00:00+00:002018-04-26T00:00:00+00:00https://libnnc.org/2018/04/26/nnc-dy<p>Frameworks such as <strong>PyTorch</strong> or <strong>TensorFlow Eager</strong> nowadays have dynamic graph support, which is a fancy word to describe when a computation is carried out while constructing the computation graph.</p>
<p>If <strong>dynamic graph execution</strong> were just about executing a command when issuing it, this would not be interesting. <strong>Dynamic graph execution</strong> in these frameworks also supports <em>automatic differentiation</em>. A good <strong>dynamic graph execution</strong> framework such as <strong>PyTorch</strong> enables easier debugging and more intuitive coding, and thus a quicker experimentation cycle.</p>
<p>That said, there are a few drawbacks when you support <strong>dynamic graph execution</strong> naively.</p>
<ol>
<li>Limited optimization opportunities. With <strong>dynamic graph execution</strong>, the framework lacks foresight, which makes optimizations such as <em>common sub-expression elimination</em> or <em>data layout optimization</em> hard to implement;</li>
<li>Unbounded memory usage. Since a <strong>dynamic graph execution</strong> engine needs to be able to differentiate arbitrary variables within the framework, a Wengert list (a tape) has to be kept. In many situations, trimming that list requires user attention, otherwise the memory usage will continue to grow.</li>
</ol>
<p>To work around 1., mixing <strong>static graph execution</strong> with <strong>dynamic graph execution</strong> is desirable. However, that poses its own set of problems: when a <strong>static graph</strong> contains a <strong>dynamic graph</strong>, and the <strong>static graph</strong> contains a loop structure, the tape for the <strong>static graph</strong> needs to cross into the <strong>dynamic graph</strong> to continue to work. When a <strong>dynamic graph</strong> contains a <strong>static graph</strong>, the Wengert list (the tape) of the <strong>dynamic graph</strong> needs to store not only the tensors, but also the <strong>static graph</strong> as a whole.</p>
<p>NNC’s <strong>dynamic graph execution</strong> design attempts to address the above problems with reasonable compromises. It borrows some good ideas from 10 years ago, when I first started to implement ccv.</p>
<h2>Naming The Variable</h2>
<p>Like in most frameworks, <strong>dynamic graph execution</strong> in NNC operates on variables. A <strong>dynamic graph</strong> executes a command on a set of input variables and writes the result to a set of output variables. Variables can be inspected at any time with <code class="highlighter-rouge">ccv_nnc_tensor_from_variable</code>. The underlying tensor may not be allocated when the variable is created. <code class="highlighter-rouge">ccv_nnc_tensor_variable_t</code> is an opaque structure and its inner workings shouldn’t be of interest to users.</p>
<h2>Tracing The Operation</h2>
<p>Frameworks such as <strong>PyTorch</strong> or <strong>TensorFlow Eager</strong> use a tape to record which operations are executed, along with their inputs and outputs. Reverse-mode <em>automatic differentiation</em> is implemented by walking back on the tape. This is simple to implement, and it makes higher-order gradients easier to support (by recording another tape while walking back on the existing tape). However, it also makes optimizations on the <em>automatic differentiation</em> pass difficult, because no data dependencies are specified. It is definitely possible to infer the data dependencies from the tape, and then employ optimizations or automatic parallelization. For a mature framework such as <strong>TensorFlow</strong>, that kind of work amounts to reimplementing some of the fundamental pieces of the software.</p>
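<p>To make the tape idea concrete, here is a minimal sketch of a Wengert list in plain Python. It is a conceptual illustration only, not how PyTorch, TensorFlow Eager or NNC implement it; the names <code class="highlighter-rouge">Var</code>, <code class="highlighter-rouge">tape</code>, <code class="highlighter-rouge">mul</code>, <code class="highlighter-rouge">add</code> and <code class="highlighter-rouge">backward</code> are hypothetical.</p>

```python
# Conceptual sketch of a tape (Wengert list); all names are hypothetical.

class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

# Each entry records one executed operation:
# (output variable, [(input variable, local derivative), ...])
tape = []

def mul(a, b):
    out = Var(a.value * b.value)
    tape.append((out, [(a, b.value), (b, a.value)]))
    return out

def add(a, b):
    out = Var(a.value + b.value)
    tape.append((out, [(a, 1.0), (b, 1.0)]))
    return out

def backward(result):
    result.grad = 1.0
    # Reverse mode: walk the tape from the latest operation to the first.
    for out, inputs in reversed(tape):
        for var, local in inputs:
            var.grad += out.grad * local

x = Var(3.0)
y = Var(4.0)
z = add(mul(x, y), x)  # z = x * y + x
backward(z)
```

<p>Note that the tape here is a flat list that grows with every executed operation and never shrinks on its own, which is the unbounded memory problem described earlier. It also stores no explicit dependency structure beyond execution order.</p>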
<p>NNC uses its <strong>symbolic graph</strong> (Level-3 APIs) to trace the operation. When a command is executed on a <strong>dynamic graph</strong>, we can figure out the data dependencies from its input variables (each input variable has a unique tensor symbol assigned). Even though the variables in the <strong>dynamic graph</strong> don’t follow the <em>static single assignment</em> (SSA) rule, the underlying tensors and tensor symbols do. Thus, through the normal execution of the <strong>dynamic graph</strong>, we form a full <strong>symbolic graph</strong> for later computation.</p>
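<p>The relationship between reusable variables and single-assignment symbols can be sketched as follows. This is a simplified illustration in Python under my own assumptions, not NNC’s data structures: the <code class="highlighter-rouge">Tracer</code> class, its methods and the symbol names <code class="highlighter-rouge">t0</code>, <code class="highlighter-rouge">t1</code>, and so on are all hypothetical.</p>

```python
import itertools

class Tracer:
    """Hypothetical tracer: variables may be overwritten, symbols never are."""

    def __init__(self):
        self._ids = itertools.count()
        self.binding = {}  # variable name -> symbol it currently refers to
        self.nodes = []    # (command, input symbols, output symbol)

    def _fresh(self):
        return "t%d" % next(self._ids)

    def variable(self, name):
        self.binding[name] = self._fresh()

    def run(self, command, inputs, output):
        # Resolve inputs to the symbols they point at right now.
        in_syms = tuple(self.binding[name] for name in inputs)
        # Every write gets a brand-new symbol, so symbols follow SSA
        # even when the same variable is written more than once.
        out_sym = self._fresh()
        self.binding[output] = out_sym
        self.nodes.append((command, in_syms, out_sym))
        return out_sym

tracer = Tracer()
tracer.variable("x")                # x is bound to t0
tracer.variable("w")                # w is bound to t1
tracer.run("mul", ["x", "w"], "x")  # x is overwritten, now bound to t2
tracer.run("relu", ["x"], "x")      # x is overwritten again, now bound to t3
```

<p>After these calls, <code class="highlighter-rouge">tracer.nodes</code> holds a small graph whose edges are explicit: the second command consumes <code class="highlighter-rouge">t2</code>, the output of the first, even though both commands wrote to the same variable.</p>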
<p>Upon <em>automatic differentiation</em>, no tape is used (or, put differently, the <strong>symbolic graph</strong> serves as an advanced tape). We simply leverage the ahead-of-time <em>automatic differentiation</em> system implemented in the <strong>symbolic graph</strong> to optimize, compile and schedule the actual computation. That means any optimization technique we implement on Level-2 or Level-3 APIs will be available to the <strong>dynamic graph</strong> as well.</p>
<h2>Optimizations (Not Ready)</h2>
<p>At this point, <strong>dynamic graph</strong> looks suspiciously like just another function dispatching mechanism. Ten years ago, when I started ccv, one of the motivations was to implement a function memoization technique. At that time, it was called <em>cached image processing</em>, and it worked around the issue that in a traditional computer vision pipeline, low-level feature extraction passes are often shared between different components (face detector, motion tracker etc.). In a <strong>symbolic graph</strong>, this is trivially implemented as <em>common sub-expression elimination</em> (CSE). CSE cannot be implemented in a <strong>dynamic graph</strong> because it cannot look ahead. However, the same memoization technique can be used to avoid duplicate computations.</p>
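<p>As a rough illustration of the memoization idea (not NNC’s implementation, which is marked as not ready), the sketch below caches results keyed by the command and its inputs, so that two components asking for the same low-level features trigger only one computation. The function names and the toy “feature extraction” are hypothetical, and the sketch assumes inputs are hashable and immutable.</p>

```python
calls = {"count": 0}  # counts how often the real computation runs
cache = {}            # (command name, inputs) -> cached result

def gradient_features(image):
    # Toy stand-in for a low-level feature extraction pass:
    # differences between neighbouring pixels of a 1-D "image".
    calls["count"] += 1
    return tuple(b - a for a, b in zip(image, image[1:]))

def memoized(command, *inputs):
    key = (command.__name__, inputs)
    if key not in cache:
        cache[key] = command(*inputs)
    return cache[key]

image = (1, 4, 9, 16)
face_detector_input = memoized(gradient_features, image)
motion_tracker_input = memoized(gradient_features, image)  # served from cache
```

<p>Unlike CSE, this needs no look-ahead: the duplicate is detected at the moment the second command is issued. The price is that cached results must be kept around, so an eviction policy becomes part of the design.</p>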
<p>In <strong>PyTorch</strong>, you need to specify <code class="highlighter-rouge">requires_grad</code> so that the framework knows which variables can be discarded to save memory. If this is not done carefully, the memory usage can grow unbounded. <strong>Dynamic graph</strong> here provides <code class="highlighter-rouge">ccv_nnc_tensor_variable_free</code>: when a tensor variable is freed, we release its memory when it is safe to do so. This method is meant to hook up with object finalization methods in host languages (C++’s destructor, Objective-C’s <code class="highlighter-rouge">dealloc</code>, <code class="highlighter-rouge">deinit</code> in Swift, <code class="highlighter-rouge">finalize</code> in Java, <code class="highlighter-rouge">tp_dealloc</code> in Python).</p>
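<p>The sketch below shows what hooking a release call into a host language’s finalization could look like, using Python’s standard <code class="highlighter-rouge">weakref.finalize</code> as a pure-Python stand-in for <code class="highlighter-rouge">tp_dealloc</code>. The <code class="highlighter-rouge">Pool</code> and <code class="highlighter-rouge">Variable</code> classes are hypothetical and do not call into NNC; in a real binding, the release callback is where <code class="highlighter-rouge">ccv_nnc_tensor_variable_free</code> would presumably be invoked.</p>

```python
import weakref

class Pool:
    """Hypothetical stand-in for memory owned by the framework."""

    def __init__(self):
        self.live = set()

    def allocate(self, handle):
        self.live.add(handle)

    def release(self, handle):
        self.live.discard(handle)

class Variable:
    """Hypothetical host-language wrapper around a framework variable."""

    def __init__(self, pool, handle):
        self.handle = handle
        pool.allocate(handle)
        # Call pool.release(handle) once this wrapper is garbage collected.
        # The callback must not reference self, or the wrapper would never die.
        self._finalizer = weakref.finalize(self, pool.release, handle)

pool = Pool()
a = Variable(pool, "a")
b = Variable(pool, "b")
del a  # on CPython, dropping the last reference runs the finalizer right away
```

<p>This sketch releases memory immediately, whereas the text above only promises release “when it is safe”: a freed variable whose tensor is still needed for a later gradient computation would have to stay alive on the framework side.</p>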
<p>Because the <strong>symbolic graph</strong> formed from <strong>dynamic graph execution</strong> contains the proper data dependencies, memory reduction techniques such as automatic binomial checkpointing can be implemented with a change of cache eviction policy. If we implement binomial checkpointing in the <strong>symbolic graph</strong> as an optimization pass, we can also leverage that upon <em>automatic differentiation</em> in the <strong>dynamic graph</strong>. The flexibility of sharing the same underlying infrastructure is very satisfying.</p>
<h2>Interoperability (Not Ready)</h2>
<p>There are some sticky issues with interoperability between a <strong>static graph</strong> (the <strong>symbolic graph</strong> we formed by hand) and a <strong>dynamic graph</strong>. The way they interoperate is through <code class="highlighter-rouge">CCV_NNC_CUSTOM_FORWARD</code> / <code class="highlighter-rouge">CCV_NNC_CUSTOM_BACKWARD</code> functions. When a <strong>static graph</strong> includes a <strong>dynamic graph</strong>, its tape needs to do the book-keeping for the <strong>dynamic graph</strong>. When a <strong>dynamic graph</strong> includes a <strong>static graph</strong>, it also needs to create a tape at that point for the execution. All of this implies significant changes to the <code class="highlighter-rouge">ccv_nnc_tensor_tape_t</code> implementation to accommodate these new requirements.</p>
<h2>Some Maybes</h2>
<p>One of the major reasons (or the reason) to use a <strong>dynamic graph</strong> is its unparalleled debuggability. You can inspect tensors as you go in the code. However, this ability can be retained even if the execution is separated from the forming of the <strong>dynamic graph</strong>. Your code can go a long way just forming computations, while the underlying execution is asynchronous. Synchronization happens only when you inspect these tensors, either to debug or, practically, to determine the control flow. This also offers a limited look-ahead ability to the <strong>dynamic graph</strong>, which enables more shared optimizations from Level-3 APIs. Implementing this is complicated. A synchronization point can easily turn into a deadlock point, and the interplay of a <strong>static graph</strong> inside a <strong>dynamic graph</strong> inside a <strong>static graph</strong> could be more delicate. In a world where we modify languages to extract a <strong>static graph</strong> (Swift for TensorFlow), the reason to have this kind of sophisticated <strong>dynamic graph</strong> implementation may be moot.</p>Liu Liu