TSEA81 - Datorteknik och realtidssystem

Notera att dessa sidor enbart finns i engelsk version

TSEA81 - Computer Engineering and Real-time Systems

Assignment 5 : Performance analysis

Introduction

In this assignment we will analyze the performance of the applications that you have written in assignment three and four. You should also improve the performance of these applications by modifying them to for example handle synchronization primitives more efficiently.

Requirements to pass

The requirements to pass this assignment are:

A modified, unoptimized version of the software (for both assignment 3 and assignment 4) where traveltimes are measured.
A modified, optimized (Performance optimizations for assignment 3 and Performance optimizations for assignment 4) version of the software (for both assignment 3 and assignment 4) where traveltimes are measured.
A task of your choice (Selection of tasks).
The source code shall be well structured and commented, and verbally described.
The performance analysis results shall be presented graphically, and verbally explained.

You may choose to present your results from this lab in a written report, if you don't find time to present your results during lab hours. See below of what to include in your lab report. Please note that due to the christmas and new years eve holidays it may take quite long to get feedback and approval of your report.

Presentation of lab results

The simplest way to graphically present data, is to print runtimes in the terminal window (using printf statement in the program) and then copy and paste those data into a spreadsheet of Libreoffice, and then create a graph (Insert/Object/Chart : XY (scatter)).

There need to be results from the unoptimized version and the optimized version, preferably in the same graph for easy comparison, for assignment 3 and 4 respectively. That is, one graph of each assignment. Also there need to be results from the task of your choice.

You should also show your modified source code and explain what you have done, and preferably be able to explain the results.

Modifications

Your first task is to analyze the performance of slightly modified versions of the applications written in assignment three and four. The reason these applications needs to be modified is that they are not very interesting targets for performance analysis due to the fact that most of the running time is most likely spent in calls to usleep() or pthread_cond_wait(). In order to create an interesting scenario you should modify the source code of both of the earlier applications in the following ways:

Modify the person thread/process so that the thread/process measures how long time it takes to travel (from the point of waiting on the floor until having arrived at the destination floor) for a configurable number of iterations, say 10000 iterations per person.
Remove or comment out all calls to usleep()
Remove or comment out all calls to draw_lift
Remove or comment out all unnecessary print statements within the "hot loop"
Automatically create MAX_N_PERSONS, preferably at startup, but possibly when a specific command is sent via the java gui, so that time measurement can be made for a varying number of persons, say 10, 20, 30 ... 70, 80.

Performance analysis

Once you have modified the applications in the manner described above, you can analyze the performance of the application under different conditions. At this point we are particularly interested in how well your solutions scales when the number of allowed persons in the system (MAX_N_PERSONS) changes.

Hint 1: You can assume that MAX_N_PERSONS will always be less than 300, although you will most likely be able to figure out whether your running time scales linearly or exponentially with the number of persons without having to try this extreme case. (That is, if you are approaching run times of over five minutes you should rethink your approach to this problem). Also, limitations in the message passing system might be reached when there are too many persons (somewhere between 70-80 persons) in the system, and you will get an error like "cannot allocate memory".

Hint 2: It has been found that a good value for the number of iterations to use throughout this experiment is 10000 per person as this avoids excessive running time while still taking long enough to allow for repeatable results. However, don't be afraid to change this if your running time is growing either exceedingly long or short in some of your test cases.

For the unoptimized and optimized solutions of assignment 3 and assignment 4, make several runs of the program, with 10 persons, 20 person, 30 persons and so on up to maybe 70-80 persons in the system. At each run, measure the time it takes for each individual to make 10000 journeys, and when all persons in the system have made their journeys add up the runtimes and divide with the number of persons in the system to get an average runtime for the current test case. The results are best presented graphically in a plot with the number of persons in the system on the x axis and the runtimes on the y axis.

Performance optimizations for assignment 3

The application you wrote in assignment 3 has a conceptually very simple synchronization model. A single mutex protects all shared data and a single condition variable is used to signal that the state of the shared data has changed. However, this leads to fairly obvious inefficiencies as all person threads will wake up due to such a broadcast, even though only some of those threads are waiting on the floor where the lift is currently located.

To improve this case it is possible to introduce more condition variables. Our suggestion is that you:

Add one condition variable for every floor (typically an array of condition variables). This condition variable will signal that the lift has arrived at a certain floor, thus only signaling to persons waiting at that floor.
Another array of condition variables can be used to signal to passengers on the lift itself that the lift has moved to a new floor, thus only signaling to passengers who shall exit at the current floor. It might seem possible to use the previous array of condition variables for this (and it is), but it is more efficient to let passengers exit the lift before persons enter the lift in a scenario when the lift is full, thus this second array of condition variables is needed.
Finally, a condition variable associated with the lift itself is needed, to signal that a person has entered or exited the lift.

You might think of other optimizations with condition variables, in cooperation with or instead of the points above. In that case have a discussion with the lab assistant before implementing, so that we can assess the feasibility of your approach.

Performance optimizations for assignment 4

The application you wrote in assignment 4 has overhead associated with the fact that each message needs to be copied from one process to another. Since these processes do not share the same memory space this context switch is conceptually a more expensive operation than the context switches between threads in assignment 3.

A possible improvement in this situation is to assume that persons can plan ahead and schedule more than one trip on the elevator at the same time.

Modify the program in such a way that a message from a person process to the lift process contains information about more than one trip. However, you need to ensure that the size of the message is less than 1024 bytes, as this is the maximum message size as set by messages.c. This will limit the number of journeys per message to maybe 100 journeys (slightly depending your implementation), in any case you will not be able to send all 10000 journeys by one message, but possibly 100 messages with 100 journeys each. Incidentally, the messages sent between the lift process and the uidraw process also needs to be less than 1024 bytes if you want to draw the lift at some points for debugging purposes.

Having removed all sleep functions, the liftmove_process will now be "spamming" the lift_process with messages about moving the lift, as fast as these messages can be produced. This is of course unnecessary. An extra optimization step (voluntary) might be to introduce a return message from the lift_process to the liftmove_process to acknowledge that the lift has been moved, and to let the liftmove_process wait for that return message before generating another message about moving the lift.

Selection of tasks

Select at least one of the following tasks. You may of course select several if you want.

Modify one (a lab group of one person) or both (a lab group of two persons) implementations (assignment 3 and 4) so that one person is classified as a VIP (i.e., the implementation should try to ensure that a VIP passenger waiting on a particular floor will always board the elevator if there is space in the elevator). (See hints below for how a thread can change its own priority.) How does this influence the overall traveltime of all persons? What happens when you add more and more VIPs? Is having all as VIPs the same as having no VIPs?
Various sanity checks of the lift implementation when running without a GUI. For example, Is the average travel time of each person similar? What is the standard deviation of the traveling time? What is the maximum and minimum traveling time? You might also want to include an analysis of the slow down that may or may not have been caused by the addition of code necessary to analyze the above characteristics.
Various sanity checks of pthreads and message queues. E.g., how long does it take to send a message from one process to another? How long does it take to send a broadcast on a condition variable to another thread? How about 40 threads? What about the performance of other synchronization primitives? (mutexes and semaphores?) Note that this data can be useful as a sanity check of your runtime if you are able to estimate roughly how many broadcasts or messages that will happen when you run your application.
How do your algorithms scale on machines with differing amounts of processors? (See below)
What happens with the performance if you change the algorithm of the pthreads based system so that the lift thread is responsible for more than it is in the original version (e.g., moving passengers from the current floor onto the lift itself)? (This will make the algorithm more similar to the algorithm used in the message passing qase.) Similarly, what will happen if you modify the algorithm in the message passing case so that it is more similar to the algorithm in the pthreads based application? (E.g. the person processes are responsible for sending a message to the lift process whenever they want to enter and exit the lift.)
Investigate assignment 4, use the same number of persons in the system (say 50 persons) but vary the number of journeys sent per message between 10 journeys, 20 journeys, 30 journeys and so on up to maybe 100 journeys per message. How does this solution scale? Are there an optimal number of journeys for a message?

It is quite likely that you will encounter some counter-intuitive results during this assignment. You might want to investigate the cause of these...

Helpful functionality

This section contain information about functionality that may be helpful to you when doing assignment 5

Overall considerations

There are many things that affect the performance when running an application. The available amount of memory, other processes consuming CPU cycles, the number of available cores, the clocking speed of the CPU etc.

To narrow down the sources that might disturb your measurements you can check and try a few things.

Check if something is consuming processing power on your computer. Use the command "top" to se the top applications using your CPU. If the %CPU field has a high percentage number then that process is consuming processing power. If that is one of your own processes you kan kill it by pressing the 'k' key and then enter the process PID. Exit the "top" command by pressing the 'q' key.
Check if someone else is logged in remotely on your computer. Use the "users" command to get a list of users on your computer. You can't kill other users (no, it's wrong), but at least you are aware of the fact that someone else also is using your computer and possibly affecting your measurements.
To get more consistent values during a measurement it might be useful to limit the number of available cores/processors to an application, since that number otherwise might vary (depending on load from other applications, the operating system etc) during your measurement. Try locking your applications to just one CPU (according to instructions below) and se if that gives you more stable values. Reducing the number of available CPUs will of course affect your applications total runtime, for better or worse.
Another way to get more consistent values is to make sure that all threads/processes are started at the same time using a barrier function. A barrier function can also be useful to collect data from all threads/processes to do post processing (average number calculations, maximum or minimum time, standard deviation etc), before releasing the thread/processes from the barrier for another batch of measurements.
During this lab it might be possible to get zombie processes (especially when using processes, not threads) as a result of the program having crashed or because of forked processes not having been terminated properly. You will most likely want to kill these zombies using the killall command in a fashion like "killall lift_messages".
At some point it might be useful to make copies of arrays. Using ordinary loops copying element by element is a waste of time and might affect your measurements. A better way is to use the memory copy "memcpy" function in C, in a fashion like "memcpy(destination_buffer, source_buffer, sizeof(element_type));"

Setting thread priorities

In order to set the priority of a certain thread you need to use the pthread_setschedparam function as shown in the following example:

// Set the real-time priority of the current thread to 5
// If you want to set the priority of another thread, specify the
// appropriate pthread_t instead of the pthread_self() function.
    struct sched_param p;
    p.sched_priority = 5;
    if(pthread_setschedparam(pthread_self(), SCHED_RR, &p) != 0){
         perror("Could not set the thread priority");
    }

The example code above will set the thread to real-time priority and the thread will now have priority over all normal threads and processes in the system. Since this would allow a malicious (or buggy) program to completely use up all available processor time a normal user is fortunately not allowed to run this command on the ISY computers in the normal computer labs. However, the computers in Muxen 2 have been reconfigured to allow you to set a real time priority of up to 10. If you want to investigate the impact of running various threads at different real time priorities on your own computer you need to add the following lines to /etc/security/limits.conf and logging out and logging in again:

* hard rtprio 10
* soft rtprio 10

(The example above works on Fedora and should probably work on all recent linux distributions.) You could of course also run your application as root, but that is something that should typically be avoided.

Timing Measurements

To measure the time it takes to run parts of your program you should use the gettimeofday function in Linux. The following piece of code will print the number of microseconds that a call to the function do_work() took to execute:

#include <sys/time.h>

// ...
    struct timeval starttime;
    struct timeval endtime;
    long long int timediff;
    gettimeofday(&starttime, NULL);
    do_work();
    gettimeofday(&endtime, NULL);
    timediff = (endtime.tv_sec*1000000ULL + endtime.tv_usec) -
               (starttime.tv_sec*1000000ULL + starttime.tv_usec);
    printf("  time difference: %lld\n", timediff);

Locking threads to a CPU

To investigate how well a parallel program scales when adding CPU:s you can artificially limit a program to use only some of the CPU:s present in the computer. To do this you can use the taskset command.

Example: To limit the lift_pthreads application to the first CPU, run the following command: taskset -c 0 ./lift_pthreads . Similarly, to limit the same command to only CPU 0 and 1, run the following command: taskset -c 0,1 ./lift_pthreads . (To list information about how many CPU:s you have in your system, run the command cat /proc/cpuinfo.)

What to include in your lab report (if you need to do one)

Your lab report need contain a more detailed description than you what need to show in the lab.

The lab report should be written in a formal manner (e.g. it should include an introduction, conclusion and, if appropriate, properly cited references). Please use the PDF-file format for the lab report. You should aim for a minimum of six pages for a group of two (and four pages for a single person), including figures and any relevant *short* pieces of source code. (Most lab reports will probably end up as about 10 pages, depending on the size of the figures.) The complete source code of all modified (or new) source files should also be included as a separate tar ball. Create a tar ball by going to the folder where your files are and run the following commands in the terminal:

	make clean
	tar -czf lab5source.tgz *

Check the contents of the tar ball before handing it in, using the command:

	tar -tf lab5source.tgz

I expect you to hand in one lab report per lab group. The lab report and source code (tar ball) should be handed in via mail to the course examiner.

Observe! The lab report and source code may be cross-referenced for plagiarism using Urkund.