Saturday, October 19, 2013

Java on Parallella


Many of our algorithms are implemented in a way that lets them execute many calculations simultaneously; this is called parallelization. In my short IT career I have seen a lot of erroneous thinking along the lines of: “If 8 simultaneous threads sped up my algorithm 6 times, then I'll just increase the number of threads and get even better results! Great!” Unfortunately, nope…
 
If more than N threads try to run in parallel on an N-core processor, it is physically impossible for all of them to run at once. In this case the operating system (most often) “cuts” your algorithm into pieces and keeps switching between them, giving the impression that we can create an arbitrary number of threads with impunity.
Unfortunately this switching (called context switching) comes with a cost:


[Figure: ideal multitasking vs. multitasking with context switching]


All of us can easily understand that context switching involves some time-consuming logic, but why the hell do tasks with context switching run a bit longer than the ideal case? Inter alia because a context switch flushes the TLB (http://en.wikipedia.org/wiki/Translation_lookaside_buffer) and trashes the warm CPU caches. That's why we should often take a look at the hardware and practise some “mechanical sympathy”.
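You can see this cost without any special tools. Below is a minimal sketch (plain JDK; the class name, work size and thread counts are mine, purely for illustration) that times the same CPU-bound work split across different numbers of threads. Once the thread count exceeds the number of cores, the total time stops improving and usually gets slightly worse:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OversubscriptionDemo {

    // CPU-bound dummy work; the iteration count is arbitrary, just big enough to measure.
    static long burn(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += i * 31 + (acc >>> 7);
        }
        return acc;
    }

    // Splits the same total amount of work across the given number of threads
    // and returns the wall-clock time in milliseconds.
    static long timeWith(int threads, long totalIterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long perThread = totalIterations / threads;
        Callable<Long> task = () -> burn(perThread);

        long start = System.nanoTime();
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            futures.add(pool.submit(task));
        }
        for (Future<Long> f : futures) {
            f.get();
        }
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        long work = 2_000_000_000L;
        for (int threads : new int[]{1, cores, 4 * cores, 16 * cores}) {
            System.out.printf("%4d threads -> %d ms%n", threads, timeWith(threads, work));
        }
    }
}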

So what we really want is more cores/processors, but companies like Intel sell us at most 8 cores at a reasonable price. These processors are fast, but sometimes we would rather have slightly slower cores and two or four times more of them. A few months ago I started looking for a cheap many-processor computer and I found Parallella (http://www.parallella.org/).

Parallella supercomputer

The Parallella computer costs $99 and contains:
- a 16-core Epiphany coprocessor (a 64-core version is on the way),
- a dual-core ARM Cortex-A9 host processor (with FPGA logic),
- 1 GB SDRAM,
- Gigabit Ethernet, 2x micro-USB, micro-HDMI, microSD.
Impressed? I was.


Epiphany architecture

The heart of Parallella is the Epiphany coprocessor. It consists of 16/64 RISC processors connected together by a mesh network-on-chip.


The main challenge in programming the Epiphany is to create a correct algorithm that offloads the host processor. You have multiple CPUs available, but currently no operating system and, more importantly, no memory fences (http://en.wikipedia.org/wiki/Memory_barrier). The memory access architecture is referred to as “weak ordering of loads and stores”. The guarantees are:
  • Load operations complete before the returned data is used by a subsequent instruction,
  • Load operations using data previously written use the updated values,
  • Store operations eventually propagate to their ultimate destination.

Why the hell am I writing about this? Because the Java Memory Model (http://www.slideshare.net/michalwarecki/java-memory-model-23207253) specifies strict ordering rules for monitors and the volatile keyword. E.g. x86 processors provide locked instructions which let us be sure that a sequential write and read of the same variable will be executed in that order.
Epiphany doesn't have such instructions, but if we want to be Java Language Specification (JLS) compatible we need to implement this somehow. These JLS requirements will be the hardest part to implement.
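To make this concrete, here is the textbook case (a Dekker-style handshake; the class and field names are mine, purely for illustration) where the JMM's rules for volatile forbid a reordering that “weak ordering of loads and stores” would otherwise allow. An Epiphany backend would have to emit some software substitute for the barrier that HotSpot emits on x86 after a volatile store:

public class VolatileHandshake {

    static volatile boolean flagA = false;
    static volatile boolean flagB = false;

    // Runs on thread 1.
    static void threadOne() {
        flagA = true;          // volatile store
        if (!flagB) {          // volatile load: must not appear to happen before the store above
            // ... thread 1 believes it is alone here ...
        }
    }

    // Runs on thread 2.
    static void threadTwo() {
        flagB = true;          // volatile store
        if (!flagA) {          // volatile load
            // ... thread 2 believes it is alone here ...
        }
    }

    // The JMM guarantees that at most one of the two "if" bodies can execute:
    // at least one thread must observe the other's flag as true. On x86 the
    // JIT enforces this with a locked instruction after each volatile store;
    // with only weak ordering, a load could overtake the preceding store and
    // both threads could read false.
}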

Epiphany SDK

Epiphany comes with a very good Software Development Kit which makes programming easier. It consists of:
  • Epiphany Multicore Development IDE (based on Eclipse),
  • C/C++ Compiler (E-GCC),
  • Assembler (E-AS),
  • Linker (E-LD),
  • Instruction set simulator (E-RUN),
  • Hardware Connection Server (E-SERVER),
  • Loader Utility (E-LOADER),
  • Debugger (E-DBG),
  • Utility libraries.
This is enough to create powerful programs.


Great but can I run Java on it?

Not yet, but we can help make it work. Java supports many architectures, but Epiphany is not popular enough to interest large companies like Oracle or Red Hat. I'm a big fan of Java and of all hardware novelties, so I decided to try implementing it on my own. Currently I'm working on an Epiphany assembly code generator based on the OpenJDK Graal JIT compiler. Another challenge is to create an efficient code dispatcher and a Java API which would make programming easier and safer; it has to be designed for heterogeneous architectures like Parallella. This is where the OpenJDK Sumatra project, started by AMD to support GPU/APU offload handling, comes in.
Implementing communication between the main Java process and the offloaded code through shared memory buffers will be pure pleasure :-)
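None of this exists yet, so the following is only a sketch of the direction I have in mind; every type and method here (SharedBuffer, Kernel, launch) is hypothetical. The idea is that the host JVM writes the arguments into a memory region visible to both the ARM host and the Epiphany cores, launches the compiled kernel, and reads the results back from the same buffer:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class OffloadSketch {

    /** Hypothetical: a memory region visible to both the ARM host and the Epiphany cores. */
    interface SharedBuffer {
        ByteBuffer bytes();
    }

    /** Hypothetical: a kernel produced by the (future) Graal Epiphany backend. */
    interface Kernel {
        void launch(SharedBuffer args, int cores);   // blocks until all cores have finished
    }

    /** Computes out[i] = ina[i] * inb[i] on the Epiphany, like the HSAIL example below. */
    static void multiply(Kernel kernel, SharedBuffer shared,
                         int[] ina, int[] inb, int[] out) {
        ByteBuffer buf = shared.bytes().order(ByteOrder.LITTLE_ENDIAN);

        // Host side: serialize the inputs into the shared buffer.
        buf.clear();
        buf.putInt(ina.length);
        for (int v : ina) buf.putInt(v);
        for (int v : inb) buf.putInt(v);

        // Offloaded side: each Epiphany core computes its slice of the output.
        kernel.launch(shared, 16);

        // Host side: read back the results written by the cores (4 bytes per int).
        buf.position(4 * (1 + 2 * ina.length));
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
    }
}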

OpenJDK Graal

The purpose of the Graal project is to expose the functionality needed to write dynamic compilers and interpreters in pure Java. Currently it is possible to generate assembly code for the following architectures:
  • AMD64,
  • Sparc,
  • PTX,
  • HSAIL.
I want to make it possible to generate Epiphany assembly code as well. People often ask me “why Graal?”, so the simplest answers are:
  • because it will be much quicker to implement such a code generator this way (why reinvent the wheel?),
  • because it has a very good architectural design,
  • because I can refer to the existing implementations,
  • because it is supported by Oracle and AMD.


Code example for HSAIL:
 
Java code:
public static void testMulThreeArrays(int[] out, int[] ina, int[] inb,
                                      int gid) {
    out[gid] = ina[gid] * inb[gid];
}


Generated HSAIL code:
kernel &run (  
 kernarg_u64 %_arg0, 
 kernarg_u64 %_arg1, 
 kernarg_u64 %_arg2 
 ) { 
 ld_kernarg_u64  $d6, [%_arg0]; 
 ld_kernarg_u64  $d2, [%_arg1]; 
 ld_kernarg_u64  $d1, [%_arg2]; 
 workitemabsid_u32 $s8, 0; 
                                    
@L0: 
 ld_global_s32 $s0, [$d2 + 12]; 
 cmp_ge_b1_u32 $c0, $s8, $s0; 
 cbr $c0, @L1; 
@L2: 
 ld_global_s32 $s0, [$d1 + 12]; 
 cmp_ge_b1_u32 $c0, $s8, $s0; 
 cbr $c0, @L3; 
@L4: 
 ld_global_s32 $s0, [$d6 + 12]; 
 cmp_ge_b1_u32 $c0, $s8, $s0; 
 cbr $c0, @L5; 
@L6: 
 cvt_s64_s32 $d0, $s8; 
 mul_s64 $d0, $d0, 4; 
 add_u64 $d2, $d2, $d0; 
 ld_global_s32 $s0, [$d2 + 16]; 
 cvt_s64_s32 $d3, $s8; 
 mul_s64 $d3, $d3, 4; 
 add_u64 $d1, $d1, $d3; 
 ld_global_s32 $s3, [$d1 + 16]; 
 mul_s32 $s3, $s3, $s0; 
 cvt_s64_s32 $d8, $s8; 
 mul_s64 $d8, $d8, 4; 
 add_u64 $d6, $d6, $d8; 
 st_global_s32 $s3, [$d6 + 16]; 
 ret; 
@L1: 
 ret; 
@L3: 
 ret; 
@L5: 
 ret; 
};


Cool! I'm not familiar with HSAIL, but it is very readable. Here is how to generate such code:
private void test(String snippet) {
        StructuredGraph graph = parse(snippet);
        HSAILCompilationResult compResult = HSAILCompilationResult
            .getHSAILCompilationResult(graph);
        compResult.dumpCompilationResult();
}


where snippet is the name of the method to compile (here it would be "testMulThreeArrays"). Nice and simple. I'll skip the parse implementation because it is not needed here.
The main internal components of the Graal compiler are:
  • architecture: this component defines what the hardware architecture (registers, etc.) looks like. For example, this is my draft implementation for Epiphany:
/**
 * Represents the Epiphany architecture.
 */
public class Epiphany extends Architecture {

    /**
     * Epiphany architecture does not provide any implicit memory barrier.
     */
    public static final int NO_MEM_BAR = 0x0000;

    /**
     * The general purpose registers can all be used in an integral context as well as in a floating-point context.
     * There is no distinction per FPU and CPU so CPU category is used.
     */
    public static final Register.RegisterCategory CPU = new Register.RegisterCategory("CPU");
    public static final Register.RegisterCategory FPU = new Register.RegisterCategory("FPU");

    //Argument / result / scratch registers
    public static final Register a1  = new Register(1,  1,  "r0",  CPU);
    public static final Register a2  = new Register(2,  2,  "r1",  CPU);
    public static final Register a3  = new Register(3,  3,  "r2",  CPU);
    public static final Register a4  = new Register(4,  4,  "r3",  CPU);

    //Registers variable
    public static final Register v1  = new Register(5,  5,  "r4",  CPU);
    public static final Register v2  = new Register(6,  6,  "r5",  CPU);
    public static final Register v3  = new Register(7,  7,  "r6",  CPU);
    public static final Register v4  = new Register(8,  8,  "r7",  CPU);
    public static final Register v5  = new Register(9,  9,  "r8",  CPU);
    public static final Register v6  = new Register(10, 10,  "r9",  CPU);
    public static final Register v7  = new Register(11, 11,  "r10",  CPU);

    //Variable Register / Frame Pointer
    public static final Register v8  = new Register(12, 12,  "r11",  CPU);

    //Intra - procedure call scratch register
    public static final Register pcs  = new Register(13, 13,  "r12",  CPU);

    //Stack Pointer
    public static final Register sp  = new Register(14, 14,  "r13",  CPU);

    //Link Register
    public static final Register lr  = new Register(15, 15,  "r14",  CPU);

    //General use
    public static final Register r15  = new Register(16, 16,  "r15",  CPU);
    public static final Register r16  = new Register(17, 17,  "r16",  CPU);
    public static final Register r17  = new Register(18, 18,  "r17",  CPU);
    public static final Register r18  = new Register(19, 19,  "r18",  CPU);
    public static final Register r19  = new Register(20, 20,  "r19",  CPU);
    public static final Register r20  = new Register(21, 21,  "r20",  CPU);
    public static final Register r21  = new Register(22, 22,  "r21",  CPU);
    public static final Register r22  = new Register(23, 23,  "r22",  CPU);
    public static final Register r23  = new Register(24, 24,  "r23",  CPU);
    public static final Register r24  = new Register(25, 25,  "r24",  CPU);
    public static final Register r25  = new Register(26, 26,  "r25",  CPU);
    public static final Register r26  = new Register(27, 27,  "r26",  CPU);
    public static final Register r27  = new Register(28, 28,  "r27",  CPU);

    //Reserved for constants
    public static final Register r28  = new Register(29, 29,  "r28",  CPU);
    public static final Register r29  = new Register(30, 30,  "r29",  CPU);
    public static final Register r30  = new Register(31, 31,  "r30",  CPU);
    public static final Register r31  = new Register(32, 32,  "r31",  CPU);

    //General use
    public static final Register r32  = new Register(33, 33,  "r32",  CPU);
    public static final Register r33  = new Register(34, 34,  "r33",  CPU);
    public static final Register r34  = new Register(35, 35,  "r34",  CPU);
    public static final Register r35  = new Register(36, 36,  "r35",  CPU);
    public static final Register r36  = new Register(37, 37,  "r36",  CPU);
    public static final Register r37  = new Register(38, 38,  "r37",  CPU);
    public static final Register r38  = new Register(39, 39,  "r38",  CPU);
    public static final Register r39  = new Register(40, 40,  "r39",  CPU);
    public static final Register r40  = new Register(41, 41,  "r40",  CPU);
    public static final Register r41  = new Register(42, 42,  "r41",  CPU);
    public static final Register r42  = new Register(43, 43,  "r42",  CPU);
    public static final Register r43  = new Register(44, 44,  "r43",  CPU);
    public static final Register r44  = new Register(45, 45,  "r44",  CPU);
    public static final Register r45  = new Register(46, 46,  "r45",  CPU);
    public static final Register r46  = new Register(47, 47,  "r46",  CPU);
    public static final Register r47  = new Register(48, 48,  "r47",  CPU);
    public static final Register r48  = new Register(49, 49,  "r48",  CPU);
    public static final Register r49  = new Register(50, 50,  "r49",  CPU);
    public static final Register r50  = new Register(51, 51,  "r50",  CPU);
    public static final Register r51  = new Register(52, 52,  "r51",  CPU);
    public static final Register r52  = new Register(53, 53,  "r52",  CPU);
    public static final Register r53  = new Register(54, 54,  "r53",  CPU);
    public static final Register r54  = new Register(55, 55,  "r54",  CPU);
    public static final Register r55  = new Register(56, 56,  "r55",  CPU);
    public static final Register r56  = new Register(57, 57,  "r56",  CPU);
    public static final Register r57  = new Register(58, 58,  "r57",  CPU);
    public static final Register r58  = new Register(59, 59,  "r58",  CPU);
    public static final Register r59  = new Register(60, 60,  "r59",  CPU);
    public static final Register r60  = new Register(61, 61,  "r60",  CPU);
    public static final Register r61  = new Register(62, 62,  "r61",  CPU);
    public static final Register r62  = new Register(63, 63,  "r62",  CPU);
    public static final Register r63  = new Register(64, 64,  "r63",  CPU);

    public static final Register[] gprRegisters = {
        r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27,
        r32, r33, r34, r35, r36, r37, r38, r39, r40, r41, r42, r43, r44, r45, r46, r47, r48,
        r49, r50, r51, r52, r53, r54, r55, r56, r57, r58, r59, r60, r61, r62, r63
    };

    public static final Register[] allRegisters = {
        a1, a2, a3, a4, v1, v2, v3, v4, v5, v6, v7, v8, pcs, sp, lr,
        r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31,
        r32, r33, r34, r35, r36, r37, r38, r39, r40, r41, r42, r43, r44, r45, r46, r47, r48,
        r49, r50, r51, r52, r53, r54, r55, r56, r57, r58, r59, r60, r61, r62, r63
    };

    public Epiphany() {
        /**
         * About machine code call displacement offset:
         * Epiphany has relative as well as absolute branching. Each offers a 16-bit and 32-bit opcodes.
         * For relative branch, the offset is either 8-bit or 24-bit (signed) accordingly.
         * It can also branch based on an absolute register value, where the offset can be 32-bit.
         */
        super("Epiphany", 4, ByteOrder.LITTLE_ENDIAN, false, allRegisters, NO_MEM_BAR, 1, r63.encoding + 1, 4);
    }

    @Override
    public boolean canStoreValue(Register.RegisterCategory category, PlatformKind platformKind) {
        if (!(platformKind instanceof Kind)) {
            return false;
        }
        Kind kind = (Kind) platformKind;
        if (category == CPU) {
            switch (kind) {
                case Boolean:
                case Byte:
                case Char:
                case Short:
                case Int:
                case Long:
                case Object:
                    return true;
            }
        } else if (category == FPU) {
            switch (kind) {
                case Float:
                case Double:
                    return true;
            }
        }
        return false;
    }

    @Override
    public PlatformKind getLargestStorableKind(Register.RegisterCategory category) {
        if (category == CPU) {
            return Kind.Long;
        } else if (category == FPU) {
            return Kind.Double;
        } else {
            return Kind.Illegal;
        }
    }
}
  • compiler: as the name implies, this component drives the code compilation,
  • ASM: generates architecture-specific assembly code (a toy sketch of this follows the list),
  • LIR: manages the low-level intermediate representation graph.
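To give a feel for what the ASM component actually has to produce, here is a toy emitter of textual Epiphany assembly. This is not Graal code; the mnemonic spellings are based on the Epiphany SDK examples and should be treated as approximate:

public class ToyEpiphanyAssembler {

    private final StringBuilder out = new StringBuilder();

    public void mov(String dst, int imm)              { emit("mov %s, #%d", dst, imm); }
    public void add(String dst, String a, String b)   { emit("add %s, %s, %s", dst, a, b); }
    public void ldr(String dst, String base, int off) { emit("ldr %s, [%s, #%d]", dst, base, off); }
    public void str(String src, String base, int off) { emit("str %s, [%s, #%d]", src, base, off); }
    public void rts()                                 { emit("rts"); }

    private void emit(String format, Object... args) {
        out.append("    ").append(String.format(format, args)).append('\n');
    }

    @Override
    public String toString() {
        return out.toString();
    }

    // Emits something like the body of out[i] = a[i] + b[i] for one element.
    public static void main(String[] args) {
        ToyEpiphanyAssembler asm = new ToyEpiphanyAssembler();
        asm.ldr("r0", "r1", 0);      // load a[i]
        asm.ldr("r2", "r3", 0);      // load b[i]
        asm.add("r0", "r0", "r2");   // r0 = a[i] + b[i]
        asm.str("r0", "r4", 0);      // store into out[i]
        asm.rts();                   // return to the caller
        System.out.print(asm);
    }
}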

OpenJDK Sumatra

This project is even more interesting. Its primary purpose is to make graphics processing units (GPUs) and accelerated processing units (APUs, and accelerators like Epiphany) usable from the JVM. It is so awesome that it gives me goosebumps :-).
Graphics cards have computational power measured in thousands of GFLOPS, so not using this power is just a waste of money. But a heterogeneous architecture is not so easy to program. The guys from AMD and Oracle want to make it easier to offload logic written in JVM-based languages to GPUs and APUs. Currently the code has to be redirected explicitly, but in the future such offload will be possible dynamically (i.e. using C2).
Example of redirecting Java code:
IntStream.range(0, players.length).parallel().forEach(n -> {
    Player p = players[n];
    if (p.getAb() > 0) {
        p.setBa((float) p.getHits() / (float) p.getAb());
    } else {
        p.setBa(0.0f);
    }
});
In this example, each calculation of a player's batting average can be executed in parallel on a different processing unit, where n is the unique identifier of the work item handled by that unit.

Summary

If you like what I do and would like to participate, please send me a message. It will be motivating!