Created with 42pag.es | edit →

extremeparallelass1

(10%) COMP453/553 Java Performance Lab Exercise

What will you learn?

Fast Racing CarThe goal of this lab exercise is to learn how to measure the performance of a (sequential) Java program, determine where the bottlenecks are (garbage collection overhead, memory leaks, excessive object allocation, excessive boxing, poor data structures, etc) and how to fix these bottlenecks to dramatically improve performance.

The goal of parallel programming is to get high performance, but there is no point adding parallelism if your starting program (a sequential program) is inefficient. So knowing how to write efficient sequential Java program, and fix their performance bottlenecks, is a necessary skill before you can write good parallel programs. Besides, making programs run fast is fun! Like solving a tricky puzzle or building a superfast racing car.

Dick's Problem

Dick works for a photo-processing company, and their development team has recently developed a new program that can process hundreds of photos with one command, doing a series of transformations such as rotation, resizing, and conversion to sepia colours, on a large collection of photos. At least, that is what they want. The program is designed to be run from the command line, with a sequence of photo edit commands, followed by a sequence of photo file names. The whole sequence of commands is applied to every photo. For example, if it is run with the following command line arguments:
   grayscale undo sepia half half save Eiffel.jpg shoes.jpg Fred.jpg Mary.jpg

it will process each of the four photos in turn, first converting them from colour photos into grayscale photos, then undoing that command and returning to the colour version (presumably someone wanted to undo a mistake), then converting it into sepia colours, then halving the size of the photo twice, then saving it under its orginal name with "_edited" appended to the end.

Unfortunately, the delivered program processes each photo quite slowly, and seems to get slower and slower as Dick runs it on more photos. The following graph shows four photos being processed (with 8 repeats of each photo, so that we can measure the average time). Note how the first photo is processed in about 2 seconds (we ignore the first few measurements where the Java HotSpot compiler is still optimising the code), the second photo averages 3 seconds, the third photo averages 5 seconds and the fourth photo averages more than 8 seconds.

Graph of Current Photo Tool Performance

These are all too slow - Dick needs to be able to process several photos per second. And it is totally unacceptable that it gets slower as more photos are processed. He needs to be able to process dozens or hundreds of photos at a consistent (fast!) speed.

Your Task

You are being employed as a consultant to fix the performance problems of this program. Your first task is to fix the slowdown problem, so that many photos can be processed without the speed reducing. After that, your second task is to measure the bottlenecks of the program and improve its performance so that it can process several photos per second. The more photos per second it can process, the higher your bonus payment will be, and the higher your mark will be for this lab exercise.

Start by downloading the program from the COMP553 Moodle page, and unzipping it. Then

  1. Rename the phototool folder to phototool_ yourid, where yourid is your Waikato University login name.
  2. Import the 'phototool' project into your Eclipse workspace. (File / Import / General / Existing Projects into Workspace / ...).
  3. In Eclipse, rename that 'phototool' project to 'phototool_yourid' so that when you submit your project, I can load all your projects into my Eclipse without any name clashes. (Right-click on the project in Eclipse and do 'Refactor / Rename...').
When you submit your lab exercise via Moodle, you will need to submit this whole folder.

Hints for Task 1: Fixing the Slowdown Problem

  1. The first thing to do with performance problems is always to start by accurately measuring the current (bad) performance, so that we know what to improve. This is non-trivial for Java programs, because the HotSpot compiler watches your program as it runs, and whenever a loop or method is called more than about 10,000 times, it briefly pauses the program, recompiles that method with aggressive optimization (inlining many other methods that it calls) and then continues running the program. So the program gets faster and faster for the first few seconds or whenever you start executing a new part of the program! There are also two different versions of the Java JDK compiler: the -client compiler is aimed at fast startup and small footprint (memory usage) so does not optimise the code very aggressively, while the -server compiler does more aggressive optimisation. Since this phototool application will be processing lots of photos, and performance is important, make sure you use the -server compiler (Add -server to the VM arguments).

    The most important technique for measuring the time taken by a chunk of Java code, is repeat that code at least 8 times (preferably 20 times or more), discard the first few measurements (while the HotSpot compiler is optimising the code), and then take the average of the remaining measurements. You should see the initial times being slower and the remaining times converging to a fairly consistent average value, like the first 8 measurements in the above graph. Choose the block of code you want to measure, and put the following Java loop around it so that you measure its speed about 10 times (This timing loop is already in PhotoTool.java. DON'T FORGET to remove it before you submit your code):

      for (int iter = 0; iter < 10; iter++) {
        final long time0 = System.nanoTime();
    
        // Here goes the code you want to measure the speed of...
    
        long time1 = System.nanoTime();
        System.out.println("Processed " + cmds.size() + " cmds in " + (time1 - time0) / 1E9 + " secs.");
      }
      

    Also, make sure that the block of code you are measuring takes a reasonable amount of time to execute - preferably at least one second - so that the execution time of the code is much larger than the cost of measuring the time.

    For this phototool program, I suggest that you put this timing loop just inside the main photo loop ( for (PhotoTool tool : photos)... ) inside the main method, so that you are measuring the time to process all the commands on each photo.

    When adding a timing loop like this, it is important that each iteration of the loop starts from the same state. In this case, the commands we are executing will have edited the photo and pushed edited versions of the current photo onto the photo tool stack, so we should undo this at the beginning (or end) of every iteration of the timing loop. So add the following code inside the beginning of your timing loop, to ensure that the only photo on the stack is the original input photo.

        while (tool.getStackSize() > 1) {
           tool.undo();
        }
      

    Then run the program on FOUR or more photos, with a command line like the following. (First export your program from Eclipse as a Runnable JAR File called phototool.jar . Or you can run it from within Eclipse, where the -server -Xmx2G arguments go into the VM arguments part of the Arguments tab of the Run Configurations dialog, and the sepia undo... arguments go in the Program Arguments area.):

      java -server -Xmx2G -jar phototool.jar grayscale undo sepia half half save Eiffel.jpg shoes.jpg Eiffel.jpg shoes.jpg
      
    Save the output messages into a file, so that you can refer to it later. You might like to graph the performance, by loading that file into a spreadsheet like Excel or LibreOffice Calc, and graphing the time column.
  2. Okay, so why is it getting slower for each photo? My first guess would be that it is using more and more memory, so garbage collection is taking longer and longer as the program runs. To see if this is happening, add the -verbose:gc flag to the Java VM arguments (eg. before the -Xmx2G) and rerun the program. Inspect the output lines from the Java garbage collector. I get output like this:
        ...
          processing Eiffel.jpg...                               // first photo
          [GC 873192K->833576K(1158720K), 0.2420645 secs]
          [Full GC 833576K->813211K(1527744K), 3.4733136 secs]
          [GC 968987K->928125K(1550464K), 0.2922131 secs]
          [GC 1112957K->1052310K(1556800K), 0.2408538 secs]
          Processed 5 cmds in 5.591178036 secs.
          [GC 1237142K->1167773K(1580224K), 0.2180290 secs]
          [GC 1376093K->1157477K(1580288K), 0.2168134 secs]
          [Full GC 1157477K->782068K(1676480K), 1.0994751 secs]
          Processed 5 cmds in 2.854708037 secs.
        ...
          Processed 5 cmds in 9.204385705 secs.                 // fourth photo
          [Full GC 1631229K->1382876K(1864192K), 3.3952885 secs]
          [Full GC 1615964K->1370786K(1864192K), 3.1776446 secs]
          Processed 5 cmds in 7.886384853 secs.
      
    Note the two different kinds of output lines: the GC lines come from the young generation garbage collector (which recycles young / short-lived objects), while the Full GC lines come from the more expensive full garbage collection that garbage collects the more long-lived objects as well. You can add the -XX:+PrintGCDetails if you want even more detail about each kind of garbage collection.

    Read this Java Tutorial on Verbose GC Output to understand these output messages better. Note that the last line above means that the Full garbage collector started with 1.615Gb used in the heap, and after the garbage collector had reclaimed all the unreachable objects, 1.370Gb of the heap area was still in use, out of a total area of 1.864Gb. So even though the full garbage collector is working hard and taking more than 3 seconds every time it runs, it is not able to reclaim much space. And as we use more memory and get closer to the 1.864 maximum size for this heap area, the garbage collectors will have to run more and more often, and our program will go slower and slower.

Hints for Task 2: Improving Performance

Okay, we now have a more usable program, which can process hundreds of photos without tiring. But it is still too slow! Here is your chance to impress your customers with how much faster you can make it go. And get more dollars (marks)!

Here are lots of ideas for possible performance improvements. Make sure you measure the performance of your program before and after trying each one, to see if it does really improve performance.

  • Ricardo Dominguez throws an uppercut on Rafael Ortiz, http://en.wikipedia.org/wiki/Boxing Stop boxing! Look for any places where 'boxed' objects are being used instead of primitive Java values. Like Double instead of double, or Integer instead of int. Boxed objects are created on the heap, so take up at least 24 bytes each and have to be allocated and then garbage collected, whereas primitive values are small and handled efficiently directly within the CPU. Look through the PhotoTool source code for places where boxing may be happening, and try to use primitives instead.

    Record your performance improvement, using the same command line as above.

  • End of the Spear Movie, http://www.digitalmission.us/SpearStory1.html Go primitive in your arrays! Replace arrays of objects by arrays of primitives, where possible. The main candidate here is that the Picture class stores each photo using arrays of Color objects, but each Color object is really just an 8-bit value (0..255) for Red, Green and Blue (some .png photos may have an additional 0..255 alpha channel to indicate transparency), so these three values could be packed into a single integer. In fact, the Color class and the BufferedImage class both have getRGB methods to return a pixel as a single integer. Of course, you will have to unpack the integer into its separate 8-bit fields when you want to get at the red, green or blue components. You can use the Java bit-shifting operators to do this, as follows (see the Wikipedia entry on Bit_shift if you've forgotten how bitwise operations work):
      int red = (rgb >> 16) & 0xFF;
      int green = (rgb >> 8) & 0xFF;
      int blue = rgb & 0xFF;
      
    Hint: the simplest and fastest way to store all the pixels as ints is to use a single int[] pixels variable - use the BufferedImage getRGB(...) method (the one with seven parameters) to get all the pixels. Then you can access pixel (x,y) as pixels[y * width + x] .

    Record your performance improvement, using the same command line as above.

  • Men's 20-km walk during the 2005 World Championships in Athletics in Helsinki, Finland.
            From http://en.wikipedia.org/wiki/Racewalking Watch your strides! Surprisingly, although the following two chunks of code both process every pixel in a photo, one can be dramatically faster than the other, depending upon whether your image, or 2D array, is using row-major order or column-major order. Generally, the faster one will be the one where the inner loop is accessing adjacent pixels in the array, because then you will be making good use of the CPU caches, which typically read in chunks of memory on each access (eg. 128 bytes per 'cache line'). So when you access one pixel, the next 31 adjacent pixels also get loaded into cache (if you use ints for pixels), so will be sitting there ready for the next inner loop iteration.
      // Column major order
      for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
           // code to process pixel (x,y)
        }
      }
    
      // Row major order
      for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
           // code to process pixel (x,y)
        }
      }
      

    Look for all the double loops in the program, try both options, and measure the performance to see which one is faster.

    Record your performance improvement, using the same command line as above.

  • Bocksbeutel shaped Wine Bottle.  From http://en.wikipedia.org/wiki/Wine_bottles Find the bottlenecks! Run with -Xprof (as a Java VM argument). This runs your Java program with a very lightweight profiler, which periodically interrupts the program and records which method is currently executing. At the end of the program, it prints out the names of the most frequently executed methods, and the percentage of time spent within each of those methods. Note that some infrequently-executed methods may have been executed by interpreting the Java bytecodes rather than compiling them to machine code, so you will get a separate report on the most heavily used interpreted bytecode methods, as well as a report on the most heavily used compiled methods. The latter is the most important, since the majority of the time is usually spent in compiled methods. Note that the HotSpot compiler will have inlined some small heavily-used methods, so their time will be included in the method that calls them. By the way, if you are interested in seeing which of your methods the HotSpot compiler is inlining, you can see lots of detail about this by adding the following JVM command line arguments:
      -XX:+PrintCompilation  -XX:+UnlockDiagnosticVMOptions  -XX:+PrintInlining
      

    Inspect the profiler output, and see if you can rewrite the most expensive method to be more efficient, but keep the existing functionalities i.e. grayscale, sepia, undo, half, show, and save. Record your performance improvement, using the same command line as above.

Useful References

Submission Checklist

  • Make sure that your program runs correctly (i.e. no run-time exceptions). I will be using command lines similar to the following to run your program on my machine to measure how many photos/second your program can process and giving some marks for the speedup you have achieved, but please do not modify the mainmethod to skip or combine commands for a better speedup as this is not the goal of this assignment.

    java -Xmx4G nz.ac.waikato.phototool.PhotoTool grayscale undo sepia half half Eiffel.jpg Eiffel2.jpg Eiffel3.jpg Eiffel4.jpg Shoes.jpg Shoes2.jpg Shoes3.jpg Shoes4.jpg

  • Make sure that your program still prints out the progress messages in the same format as the original program. For example:
        processing Eiffel.jpg...
        Processed 5 cmds in 7.340362628 secs.
        processing shoes.jpg...
        Processed 5 cmds in 1.84465942 secs.
        ...
        
  • Make sure your timing loop is deleted so that each photo is processed just once
  • Fill in the phototool/REPORT.txt file
  • Zip up your whole phototool project (or you can use tar czvf phototool_yourid.tar.gz phototool_yourid/ ) and submit it via Moodle.
  • Make sure you press the final SUBMIT button so that your assignment is sent for marking.

The submission deadline is Friday of Week 2 (11:55pm Friday 10 March 2017)

Marking Schemes

Total: 50 marks
1. Documentation 10 marks
2. Questions in report.txt 20 marks
3. Speedup 20 marks

[your text here]