The YoGA Philosophy

YoGA aims to give the user the ability to work on the GPU from within Yorick. Thanks to Yorick's interpreted environment, high-level applications that run on the GPU can be built and debugged easily.

Because the GPU is a "device" attached to a "host", the performance of GPU applications tends to be limited by memory operations between the host and the GPU. Depending on the memory bandwidth of the GPU used, data transfers (host->GPU and GPU->host) as well as memory allocation can kill your acceleration factor. Hence GPU applications do not fit all needs, and codes have to be designed carefully to take these limitations into account.

Using GPUs pays off if you want to do large-scale computations in a constant memory space with very limited memory transfers between the host and the device. In this scenario, you can allocate memory space once, transfer the data to the device, run the intensive computation, transfer the result back, and then free the memory. This can be very effective, depending on the amount of data to transfer both ways.
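To make the trade-off concrete, here is a rough back-of-the-envelope model in C++. It is not part of YoGA: the function name, its parameters and the timings quoted below are hypothetical, and the model simply assumes that allocation and transfers are paid once while the kernel runs n_iter times.

```cpp
// Rough cost model (an assumption for illustration, not a YoGA routine).
// All times are in the same unit, e.g. milliseconds.
//   t_cpu      : time of one iteration on the host alone
//   t_kernel   : time of one iteration on the GPU, excluding memory traffic
//   t_alloc    : one-time device allocation (and free) cost
//   t_transfer : one-time host->GPU plus GPU->host transfer cost
//   n_iter     : number of iterations run between the two transfers
double effective_speedup(double t_cpu, double t_kernel,
                         double t_alloc, double t_transfer, int n_iter) {
    const double host_total = n_iter * t_cpu;
    const double gpu_total = t_alloc + t_transfer + n_iter * t_kernel;
    return host_total / gpu_total;
}
```

With made-up timings of 100 for the host, 5 for the kernel, 5 for allocation and 10 for transfers, a single iteration yields a factor of 5 instead of the raw kernel factor of 20, while 100 iterations on the same allocated data yield about 19.4.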

C++ implementation

To write YoGA we follow a very simple philosophy that stays as close as possible to this rule. Every feature in YoGA relies on a C++ template class: yoga_obj. This object includes an init method that allocates the memory space needed for the computation. It also includes both memory transfer methods (device2host and host2device) as well as a device2device method to transfer data internally on the GPU from one yoga_obj to another. Finally, the destructor frees the memory space allocated at init.
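The life cycle described above can be sketched as follows. This is a host-only mock, not the actual yoga_obj source: the class name mock_yoga_obj, its members and the round_trip helper are hypothetical, the "device" buffer is ordinary heap memory so that the code builds without CUDA, and the init step is folded into the constructor for brevity. A real implementation would presumably rely on cudaMalloc, cudaMemcpy and cudaFree instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the yoga_obj idea (NOT the real class).
template <typename T>
class mock_yoga_obj {
public:
    // "init": allocate the device-side space once, zero-initialized.
    explicit mock_yoga_obj(std::size_t n) : n_(n), d_data_(new T[n]()) {}

    // Destructor: release the space allocated at init.
    ~mock_yoga_obj() { delete[] d_data_; }

    // Each object owns its buffer, so copying is disabled.
    mock_yoga_obj(const mock_yoga_obj&) = delete;
    mock_yoga_obj& operator=(const mock_yoga_obj&) = delete;

    // Copy n_ elements from a host array into the "device" buffer.
    void host2device(const T* src) { std::copy(src, src + n_, d_data_); }

    // Copy n_ elements from the "device" buffer back to a host array.
    void device2host(T* dst) const { std::copy(d_data_, d_data_ + n_, dst); }

    // Copy from another object without going through the host.
    void device2device(const mock_yoga_obj& src) {
        if (src.n_ != n_) throw std::invalid_argument("size mismatch");
        std::copy(src.d_data_, src.d_data_ + n_, d_data_);
    }

    std::size_t size() const { return n_; }

private:
    std::size_t n_;
    T* d_data_;
};

// Usage: upload, copy between two objects, download.
std::vector<float> round_trip(const std::vector<float>& in) {
    mock_yoga_obj<float> a(in.size()), b(in.size());
    a.host2device(in.data());
    b.device2device(a);
    std::vector<float> out(in.size());
    b.device2host(out.data());
    return out;  // both buffers are freed here by the destructors
}
```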

This class also includes a number of computing methods, from cublas vector/matrix operations to cudpp scanning. This gives the user the ability to manipulate the data in the yoga_obj in a variety of ways. It also includes optional fields that are initialized and used when required (FFT plans, sort/scan plans, etc.).
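One way to picture the optional fields is lazy initialization: the plan does not exist until the first method that needs it is called. The sketch below is an assumption about how this could look, not the actual YoGA code. The names mock_plan and mock_obj_with_plan are hypothetical, the data lives in a host std::vector, and the scan is a plain serial loop standing in for a GPU scan.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical placeholder for an expensive resource (FFT plan, scan plan, ...).
struct mock_plan {
    std::size_t n;
    explicit mock_plan(std::size_t size) : n(size) {}
};

// Hypothetical object whose plan is an optional, lazily created field.
class mock_obj_with_plan {
public:
    explicit mock_obj_with_plan(std::size_t n) : data_(n, 0.0f) {}

    bool has_plan() const { return plan_ != nullptr; }

    // In-place inclusive prefix sum. The plan is created on the first call
    // only and reused by later calls.
    void scan() {
        if (!plan_) plan_.reset(new mock_plan(data_.size()));
        for (std::size_t i = 1; i < plan_->n; ++i) data_[i] += data_[i - 1];
    }

    std::vector<float>& data() { return data_; }

private:
    std::vector<float> data_;
    std::unique_ptr<mock_plan> plan_;  // empty until scan() is first called
};
```

Objects that never call scan() never pay for the plan, which keeps the common allocation path cheap.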

This scheme allows memory operations to be entirely separated from the rest of the computations. It is extensible: users can implement new methods in this framework with very little effort. Only the methods need to be implemented, since the object structure already provides memory space on the device to manipulate the data.

Yorick implementation

The Yorick implementation uses an opaque object that points to this C++ class: the Yoga Object. It is built using the standard API for interfacing Yorick packages to the interpreter. This way, persistent objects in GPU memory can be created and manipulated from Yorick. Wrappers are also associated with this object in Yorick to mimic the basic operations on Yorick variables (alloc / create, destroy / free, print, eval). Hence a Yoga Object can be manipulated in the same way as a standard Yorick variable. Allocation is done once, and destruction is handled either by the user when needed or by Yorick when terminating (minimal chances of a leak).

Additionally, device2host and host2device routines are provided to transfer data between a standard Yorick variable and a Yoga Object.

C wrappers intended to be launched from within Yorick have also been added. They wrap calls to the yoga_obj methods, using the content of the stack as arguments. These wrappers can be called as functions, in which case they create new Yoga Objects to store the result, or as subroutines, in which case they use pre-existing objects. They provide various mathematical functionalities. After object creation, the user can use these wrappers to build a fast sequence with no memory allocation, perform multiple complex operations on Yoga Objects on the GPU only, then transfer the result back (for display, for instance) and eventually (and optionally) deallocate. See the Practice YoGA page for details and a practical example.
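The difference between the two call styles can be illustrated with a small C++ sketch. It does not reproduce the actual wrappers: mock_buf, add, add_into, allocations_in_loop and the allocation counter are all hypothetical, and the buffers live in host memory. The point is only that the function style allocates a result on every call, while the subroutine style reuses an object created beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts buffer creations so the two call styles can be compared.
int mock_alloc_count = 0;

// Hypothetical stand-in for a GPU-resident object.
struct mock_buf {
    std::vector<float> v;
    explicit mock_buf(std::size_t n) : v(n, 0.0f) { ++mock_alloc_count; }
};

// Subroutine style: the result goes into a pre-existing object.
void add_into(mock_buf& out, const mock_buf& a, const mock_buf& b) {
    assert(a.v.size() == b.v.size() && out.v.size() == a.v.size());
    for (std::size_t i = 0; i < a.v.size(); ++i) out.v[i] = a.v[i] + b.v[i];
}

// Function style: a new object is created to hold the result.
mock_buf add(const mock_buf& a, const mock_buf& b) {
    mock_buf out(a.v.size());
    add_into(out, a, b);
    return out;
}

// A loop in subroutine style allocates only before the loop starts.
int allocations_in_loop(int n_iter) {
    mock_alloc_count = 0;
    mock_buf a(8), b(8), out(8);  // three allocations, done once
    for (int i = 0; i < n_iter; ++i) add_into(out, a, b);
    return mock_alloc_count;
}
```

In this mock, allocations_in_loop returns 3 whatever the iteration count, whereas calling add inside the loop would add one allocation per iteration.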

Updated by Damien Gratadour over 10 years ago · 1 revisions