Cyclone - Hardware Accelerated N-D Arrays for Java

Status: In progress
Author:
Created: Aug 10, 2024 01:13 PM
Tags: Blog Post, Java

Introduction

In the past, Java has often been overtaken by both higher- and lower-level programming languages when it comes to speed and hardware acceleration. Languages such as C++ and Python let developers interact with compute platforms such as OpenCL or CUDA. This has sparked a significant increase in capabilities in fields such as machine learning, computer vision, computer graphics and many more.
 
While I am glad that the world is moving towards easy integration of these platforms for certain programming languages, some languages such as Java have had a hard time adapting. Even though it has proved itself one of the most influential programming languages of the past decades, Java is often left behind by more flexible languages. The most prominent reason is that Java runs on the JVM (Java Virtual Machine), an abstraction layer on top of the hardware that runs the actual application. This makes Java extremely portable, but it also makes it quite difficult to gain direct access to hardware devices such as the GPU and their features. Luckily, in recent years, there has been a move towards supporting heterogeneous computing paradigms.
 
One way in which this is achieved is through the use of TornadoVM, which allows developers to leverage hardware capabilities while using the Java programming language. In the words of the developers themselves:
 
TornadoVM is an open-source plugin to Java Virtual Machines that allows programmers to automatically run Java programs on heterogeneous hardware. TornadoVM can currently execute on multi-core CPUs, dedicated GPUs […], integrated GPUs […], and FPGAs […].
 
While this is great, little to no effort has been made to write Java APIs that make it easy for developers to quickly start building high-performance, parallel applications.
 
This is where Cyclone comes into play. It is a library that aims to provide developers with an all-in-one framework for storing and manipulating tensor-like data structures in 1-D, 2-D and 3-D. The ultimate goal of Cyclone is to offer methods for working with these data structures that speed up the development process. In the upcoming sections, we will look at how to use Cyclone, run a simple benchmark against sequential and multi-threaded execution, and discuss some implementation details and future plans for the library.

Cyclone Library

Some Words of Caution

What Cyclone CANNOT Do

The promise of TornadoVM is being able to parallelize certain computation sequences, often called kernels. For data manipulation, however, this comes with a major limitation. One often wants to apply some relatively arbitrary operation to the data. For example, suppose we have a collection of floating-point values that we want to map to their squared values. In regular Java, one might write:
 
    // Use map to square values
    values.stream().map(value -> value * value);
 
Here, the function applied to each element is supplied by the caller, so from the library's point of view it is only known at runtime. This makes it impossible to offload the operation to a heterogeneous device, since it is not known at compile time what the kernel will look like. As a result, general utility functions such as map, mapIndexed, filter, and others cannot be properly offloaded to run on a heterogeneous device. This is why Cyclone provides many simple operations whose kernels are known at compile time.
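To make the distinction concrete, here is a minimal plain-Java sketch. It makes no TornadoVM or Cyclone calls, and the class and method names are hypothetical. The first method takes an arbitrary function, so its per-element work is decided by the caller; the second has a loop body that is fully fixed in the source, which is the shape of operation a library can prepare a kernel for ahead of time.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration only; these names are not part of Cyclone or TornadoVM.
final class KernelShapes {

    // Generic map: the per-element work is supplied by the caller,
    // so the library author cannot know the loop body in advance.
    static float[] map(float[] values, UnaryOperator<Float> function) {
        float[] result = new float[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = function.apply(values[i]);
        }
        return result;
    }

    // Fixed operation: the loop body is fully known when the library is written.
    static float[] square(float[] values) {
        float[] result = new float[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = values[i] * values[i];
        }
        return result;
    }
}
```

Both methods compute the same result for `v -> v * v`, but only `square` describes its computation without depending on a caller-provided function.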

Cyclone API

Cyclone is a library, which means it implements logic to work and interact with data (in this case on heterogeneous devices). Using the library implementation directly would be cumbersome and would not provide a great experience for developers. Therefore, Cyclone offers an extensive API to interact with the library. The goal is to abstract away much of the implementation logic and provide the user with a clean interface. This approach follows the Facade design pattern.
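As a rough illustration of the Facade pattern in this setting, the sketch below hides two internal helpers behind a single user-facing type. All names here are made up for the example and do not correspond to Cyclone's actual classes.

```java
import java.util.Arrays;

// Hypothetical sketch of the Facade pattern; none of these types exist in Cyclone.

// Internal detail: owns the raw storage.
final class RawStorage {
    final float[] data;

    RawStorage(int size) {
        this.data = new float[size];
    }
}

// Internal detail: low-level routines that operate on raw arrays.
final class RawKernels {
    static void fill(float[] data, float value) {
        Arrays.fill(data, value);
    }

    static void add(float[] data, float value) {
        for (int i = 0; i < data.length; i++) {
            data[i] += value;
        }
    }
}

// Facade: the single entry point a user works with.
final class FacadeBuffer {
    private final RawStorage storage;

    private FacadeBuffer(RawStorage storage) {
        this.storage = storage;
    }

    // Creates a buffer of the given size with every element set to value.
    static FacadeBuffer filled(int size, float value) {
        RawStorage storage = new RawStorage(size);
        RawKernels.fill(storage.data, value);
        return new FacadeBuffer(storage);
    }

    void add(float value) {
        RawKernels.add(storage.data, value);
    }

    float get(int index) {
        return storage.data[index];
    }
}
```

A caller only ever touches `FacadeBuffer`; how the data is stored and which routine performs the work can change without affecting user code.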
 
While it is definitely possible to implement buffers (see later) for custom types, this can quickly become cumbersome and confusing, due to the sheer number of operations that the default Cyclone buffer offers. This is also why Cyclone ships with a variety of types, such as those implemented by the Complex, Matrix, Vector and Quaternion interfaces.

Heterogeneous Vs. Sequential / Multi-threaded Computing

In simple use cases, Cyclone will not be faster than sequential or multi-threaded implementations. Operations such as simple arithmetic do so little work per element that any gain is outweighed by the overhead of transferring the data to an external device, performing the operation and transferring the result back. Cyclone can offload these operations, but for simple operations this only yields a significant performance gain when working with large arrays.
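A back-of-the-envelope cost model shows where the break-even point comes from. The sketch below is plain Java with hypothetical names, and every cost passed in is an assumed figure for illustration, not a measurement of Cyclone or TornadoVM. It treats host time as proportional to the element count, and device time as a fixed launch cost plus per-element transfer and compute costs.

```java
// Back-of-the-envelope cost model; all inputs are assumptions, not measurements.
final class OffloadModel {

    // Host time: n elements at a constant cost per element.
    static double hostNanos(long n, double hostNsPerElement) {
        return n * hostNsPerElement;
    }

    // Device time: fixed launch cost plus per-element transfer and compute cost.
    static double deviceNanos(long n, double launchNs,
                              double transferNsPerElement, double deviceNsPerElement) {
        return launchNs + n * (transferNsPerElement + deviceNsPerElement);
    }

    // Smallest element count for which the device is strictly faster,
    // or -1 if the per-element device cost is never lower than the host cost.
    static long breakEven(double launchNs, double transferNsPerElement,
                          double deviceNsPerElement, double hostNsPerElement) {
        double gainPerElement = hostNsPerElement - (transferNsPerElement + deviceNsPerElement);
        if (gainPerElement <= 0) {
            return -1;
        }
        return (long) Math.floor(launchNs / gainPerElement) + 1;
    }
}
```

Under this model, if transferring an element costs as much as processing it on the host, offloading never pays off regardless of array size. Otherwise the fixed launch cost is amortized only once the array is large enough.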
 

Cyclone Buffers

With these preliminaries out of the way, let’s finally take a look at what Cyclone can do. Cyclone works with buffers of different dimensions. A buffer can have one, two or three dimensions. In this blog post, we’ll stick to one-dimensional buffers to show the core operations. A buffer can easily be created using the CycloneBufferFactory:
 
    // The buffer can be parameterized by both the size (i.e. number of elements)
    // and the default value, which is optional.
    int size = 1024;
    float value = 1.0f;

    // Construct the buffer
    FloatBuffer1D buffer = CycloneBufferFactory.constructFloat1D(size, value);
 
Additionally, a buffer can be created from an array, list or set of values (for a set, the index each element ends up at is arbitrary, since sets are unordered).
 
    // Construct a buffer from an array
    float[] array = new float[size];
    FloatBuffer1D fromArray = CycloneBufferFactory.constructFloat1D(array);

    // Construct a buffer from a list
    List<Float> list = new ArrayList<>();
    FloatBuffer1D fromList = CycloneBufferFactory.constructFloat1D(list);

    // Construct a buffer from a set
    Set<Float> set = new HashSet<>();
    FloatBuffer1D fromSet = CycloneBufferFactory.constructFloat1D(set);
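Because a Set has no defined iteration order, the position of each element in the resulting buffer is not something to rely on. If a reproducible layout matters, one option is to impose an order before constructing the buffer. The helper below is a hypothetical plain-Java sketch (it is not part of Cyclone) that sorts a set into a float[], which could then be passed to the array-based factory method.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of Cyclone.
final class SetOrdering {

    // Copies a set into a primitive array in ascending order, so that the
    // element-to-index mapping is the same on every run.
    static float[] toSortedArray(Set<Float> values) {
        float[] array = new float[values.size()];
        int index = 0;
        for (float value : new TreeSet<>(values)) {
            array[index++] = value;
        }
        return array;
    }
}
```

Sorting is only one possible choice; a LinkedHashSet would instead preserve insertion order.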
 
As we will see, the buffer supports many different types of computation. Internally, all of the operations needed for these computations are implemented in a single class, AbstractBuffer. This class implements the CycloneBuffer<T> interface, which itself extends several interfaces such as ArithmeticBuffer<T> and AggregationBuffer<T>. This allows a buffer to be specialized so that only a limited subset of operations is exposed, which makes it easier to see which operations are available for a given purpose. Specializing a buffer is optional: every operation can also be called directly on the buffer.
 
    // The buffer can be specialized as an ArithmeticBuffer<T>, allowing operations
    // such as add(), sub(), mul(), div(), ...
    buffer.<ArithmeticBuffer<Float>>as().add(2.0f);

    // This is not necessary, as the operations from the ArithmeticBuffer<T> can also
    // be called directly on the buffer.
    buffer.add(2.0f);
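One way such a specialization mechanism could be built is with a generic method that returns the buffer typed as the requested interface. The sketch below is an assumption about the general shape, not Cyclone's actual implementation; the interface and class names are deliberately different from the real ones.

```java
// Hypothetical sketch: these types only mirror the shape described above
// and are not the actual Cyclone interfaces.
interface ArithmeticOps<T> {
    void add(T value);
}

interface AggregationOps<T> {
    T sum();
}

// The full buffer type combines every capability interface.
interface SketchBuffer<T> extends ArithmeticOps<T>, AggregationOps<T> {

    // Returns this buffer typed as the requested view. The cast is unchecked,
    // so requesting a view the buffer does not implement typically fails with
    // a ClassCastException where the result is used.
    @SuppressWarnings("unchecked")
    default <V> V as() {
        return (V) this;
    }
}

final class FloatSketchBuffer implements SketchBuffer<Float> {
    private final float[] data;

    FloatSketchBuffer(float... values) {
        this.data = values.clone();
    }

    @Override
    public void add(Float value) {
        for (int i = 0; i < data.length; i++) {
            data[i] += value;
        }
    }

    @Override
    public Float sum() {
        float total = 0.0f;
        for (float element : data) {
            total += element;
        }
        return total;
    }
}
```

With this shape, `buffer.<ArithmeticOps<Float>>as()` exposes only `add`, while calling `buffer.add(...)` or `buffer.sum()` directly still works because the concrete class implements every interface.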
 
This design decision is made for several reasons:
  • Specializing a buffer allows users to write more explicit code, as the full buffer implementation can be quite overwhelming due to the number of kernel operations Cyclone offers.
  • Specialization is not required, so the buffer can follow the common Fluent Interface design pattern. This allows multiple operations to be chained in one expression, increasing the expressivity of the library.
 
    // In most cases, multiple operations must be invoked separately
    buffer.add(2.0f);
    buffer.mul(0.5f);
    float result = buffer.sum();

    // With Cyclone, these operations can all be chained for a cleaner implementation
    float chained = buffer.add(2.0f).mul(0.5f).sum();
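Chaining like this generally works because each non-terminal operation returns the buffer itself. The class below is a minimal sketch of that idea with a hypothetical name. It mutates the data in place, which is an assumption; the text above does not say whether Cyclone modifies the buffer or returns a new one.

```java
import java.util.Arrays;

// Hypothetical fluent buffer; not Cyclone's implementation.
final class ChainableBuffer {
    private final float[] data;

    ChainableBuffer(int size, float value) {
        this.data = new float[size];
        Arrays.fill(this.data, value);
    }

    // Adds value to every element and returns this buffer to allow chaining.
    ChainableBuffer add(float value) {
        for (int i = 0; i < data.length; i++) {
            data[i] += value;
        }
        return this;
    }

    // Multiplies every element by value and returns this buffer to allow chaining.
    ChainableBuffer mul(float value) {
        for (int i = 0; i < data.length; i++) {
            data[i] *= value;
        }
        return this;
    }

    // Terminal operation: returns a plain value, which ends the chain.
    float sum() {
        float total = 0.0f;
        for (float element : data) {
            total += element;
        }
        return total;
    }
}
```

For example, `new ChainableBuffer(4, 1.0f).add(2.0f).mul(0.5f).sum()` turns each element into (1 + 2) * 0.5 = 1.5 and sums four of them, giving 6.0.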
 
THIS POST IS STILL A WIP - Check back later for more…