Thready McParallel: the patron devil of programming

A fair while back, I had a call from someone who wanted to know if I could fit in some consultancy time to help solve some development problems. It turned out to be the least profitable work that I have ever done. I had an immediate hunch and, after just five minutes on the blower to their programmer, the problem was solved. Honestly, I am really, really shit at capitalism.

They were developing a client-server game for the web and had a terrible capacity issue with the server: one connection was good, two was OK but unreliable, and above three it was unusable. Given that they were aiming for many thousands, this was clearly not good and they were desperately worried: they had interfaced correctly with the server software that they had licensed in and couldn’t figure out what was going on.

Snake In Threads

Mr Snake here is wearing some nifty threads. His idea of threads, though, is somewhat different to mine

The server software that they had selected worked like you might expect: it handled connections, sent data backwards and forwards over the Internet and dealt with disconnections. All they had to do was tie into a callback that would fire every time something happened. They then processed whatever it was and returned back to the server software. Simples, right? Nope. The server software had to be able to deal with many thousands of connections and lots and lots of stuff going on at unpredictable times so, as is appropriate, it was heavily multi-threaded. It had a “pool” of threads that handled receiving data from clients over the Internet. When data was received, the user-specified callback was fired with a note to say “hey, I received this data from this person, so deal with it, sucker!”. Since the thread from the pool owned that data, it waited until the callback finished before freeing the memory and returning to a listening state. So, any guesses yet?

What’s a thread? Is it knitting related? Threads are like little mini-mes that you spawn so that you can do more than one thing at once. They get on with jobs whilst you do something else, allowing you to be super duper efficient. Problems arise if the mini-mes and you are doing jobs that require touching the same thing at the same time. Imagine this: three people engaged in a “sitting down visit” in the same toilet at the same time. That grimacing picture in your mind? That is multi-threaded programming: screw it up and the mess is… unpleasant.

Their callback was receiving and processing each packet as it arrived. Seems cool, right? Packet received, callback fired, do what needs to be done, return, pour brandy, sit in front of fire basking in a job superbly done. Well, not quite. Some of the jobs that were performed took a very long time to complete. Take those that involved database access: some took seconds to complete. Now, in a world where tens of thousands of events could occur every second, spending over a second doing stuff is simply outrageous. My guess is that their server software created a thread pool of approximately number_of_CPUs x 4 threads. For the dual-CPU, dual-core rack-mounted test server they had, that is about 16 threads. Here is how their system collapsed: with a few people online all moving around in the world doing stuff, there were more than a few database accesses going on. The moment that 16 of these jobs were in progress at the same time, ALL THE THREADS WERE BUSY. That meant that nobody was manning the phones. Furthermore, the more people who were connected, the worse it got: an exponential growth in crapness.

One of the problems that human beings have is misunderstanding what “a long time” is when it comes to computers. The chaps that were working on this web application were web programmers: Flash, JavaScript, Java and other such web stuff were their weapons of development, and when the server’s documentation said “don’t spend too much time in the callback” they interpreted this as “a few seconds is fine, right?”. In fact, “too much time” was any amount of time greater than a few hundred MICROseconds. They were spending several orders of magnitude too long processing their stuff.

Their programmer, not used to real-time asynchronous performance code, immediately understood the problem but dramatically underestimated the scale of the solution. His first answer? “Ooooooooh, OK, I will optimise the database stored procedures and that should do it!” Bzzzzzzzt, wrong answer. I explained that by fast, it had to be lightning fast. His callback needed to do this:

  1. Take copy of the data received
  2. Store that copy in a list of jobs to do ASAP
  3. Return RIGHT NOW

It should take microseconds. Now, the thread returns, goes back into the pool and can start answering phones again. Of course, you then end up with a rapidly growing list of jobs that you need to get done, but you can create a thread to deal with those. Most importantly, though, the long tasks need to be parallelised. Instead of doing database operations one at a time, jobs like that need to be fired off in their own threads so that they can TAKE AS LONG AS THEY WANT.

Parallel processing is a tough cookie for programmers to grasp. Humans are used to doing things sequentially even though we are massively distributed parallel processing systems internally. Here is a little example program that creates two threads each of which adds one to a memory location ten times. The threads are adding to the same memory location. The inexperienced would expect the answer to be 20. Indeed, mostly it is. Occasionally, though, it is 19, or 18 or even less. So many programmers make this mistake that I wonder what computer science graduates are actually taught these days beyond drinking beer, how to chat up the chicks and an encyclopaedic knowledge of Monty Python. So, the code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static int badVibes = 0;

void* PeacefulThread(void *argument)
{
  for (int i = 0; i < 10; ++i)
  {
    ++badVibes;
  }
  return NULL;
}

int main (int argc, char *argv[])
{
  pthread_t thread1;
  pthread_t thread2;

  // No error detection, don't try this at home, kids!
  pthread_create(&thread1, NULL, PeacefulThread, NULL);
  pthread_create(&thread2, NULL, PeacefulThread, NULL);

  // Wait for them to complete:
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);

  // Show result:
  printf("Yes, this number [%d] should be 20!\n", badVibes);
  return 0;
}

You can compile and run this code yourself if you have a Mac or a Linux machine. Run it from the command line and see what you get. Add more threads using the miracle of copy-paste and you will see less reliability in getting 20. The issue here is that incrementing is a read-modify-write operation. The two threads could perform the read at the same time (if you have two cores this is even more likely): so they both read a value, say, 5. They both then add one to it and get 6. Then they both store the result: 6. Oops. One should have made it 6 and the next made it 7 but WHICH ONE? Welcome to the nightmare, it gets so, so much worse.

Operating systems such as Linux, Unix and Windows have armies of special features1 that allow the programmer to defend against such issues. One such mechanism is that of critical sections: code that can only be executed by one thread at a time. They act as a nice locked door around things that would be bad to touch by two threads simultaneously: you can enter the toilet, lock the door, drop the kids off at the pool, unlock the door and leave. The chap outside can then pop in and do his business. The trick with these locks is for them to be in use for the absolute minimum amount of time. For example, if everyone taking a dump reads the newspaper whilst sitting on the throne, then the efficiency of the toilet drops dramatically. Efficient multi-threaded programming architectures work just like this: not locking is best; if you must lock, do it briefly and read the paper later. Take the above example program. It is dead easy to fix the issue with a critical section; here is the new code for the thread:

static pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;

void* PeacefulThread(void *argument)
{
  // Take the critical section:
  pthread_mutex_lock(&cs_mutex);

  for (int i = 0; i < 10; ++i)
  {
    ++badVibes;
  }

  // We are done, release the critical section:
  pthread_mutex_unlock(&cs_mutex);
  return NULL;
}

Neat, eh? The loop is now protected as a critical section! Of course, I have also completely disabled any benefit of having two threads working on the problem, as one blocks the other out until it is finished. Beginners playing with threads tend to end up in this position or at the opposite end: protection EVERYWHERE, and believe me, if you are desperate to visit the facilities, working your way through ten locked doors is not going to help. Furthermore, such over-zealous application of locks can lead to my favourite example: the deadlock. This is where you are waiting to enter a piece of code protected by a critical section but cannot, because the thread that has that section locked is itself waiting for a lock that you are holding. Between the possibility of nesting yourself into disaster, overdoing it and doing it in the wrong place, I will be chuffed to bits if your braiiiiins are hurting at this point.

Just one wafer thin additional problem

As well as the maze of mutexes, it is also common to see examples of threads where threads are not needed or, worse still, no threads where they are needed… and it was this that led to the second call they gave me a week or two later. After a while, the latency between client and server was building up to incredible levels – a few hours into a run, latency was measured in minutes. Furthermore, the longest they had been able to run the server with people connected without it failing was a day. Their detective work had led them to the problem and they wanted advice on the solution. The server failed because it ran out of memory, and it ran out of memory because the queued list of jobs to do from the thread pool was longer each time that it was processed. The only way of sorting the issue was for everyone to disconnect and then to wait for a few minutes whilst everything cleared.

They had installed a double-buffered type solution to their long list of jobs. They had two lists: one that was being filled by the callbacks whilst the other was being processed. This dealt with only half of the “long lock” issue mentioned above – only a tiny lock was required to swap over the lists which was superbly efficient. Their new issue was that they then dealt with the items in the list synchronously. One at a time. So, 100 database operations that would take a second each would therefore take a minute to process. In the meanwhile, the other “live” list had got real, real big. This would repeat with processing of a list that got longer each time because clearing a list took longer than filling it.

This betrayed a certain naivety about asynchronous multi-threaded software architecture that is shockingly common in software development. Getting into a true asynchronous parallel mindset is terribly difficult: you have to really think about how to break jobs down and how to effectively parallelise them. Some do need to be executed in sequence but others do not. Some can be completed in micro- or milliseconds, but some cannot. In the end, they created a clever sequencer for per-client database accesses and ended up with their own thread pools for processing long-duration jobs and then firing events when those jobs completed. The processing of the list became lightning quick and then a whole army of threads would go to work dealing with the database accesses, game logic and other such bits and pieces. When I last heard, their servers were running pretty much indefinitely and their biggest problem now is bugs in the licensed server software. Regrettably, it took as long to provide my advice as it has taken you to read this; total earnings, well, they had to have that on the house. However, I did get a nice note from the programmer thanking me for turning his brain into spaghetti. He said that all this stuff is like peering into a TARDIS2: it looks so small from the outside but one peers in the door and realises the full size of the nightmare!

It’s almost rocket science

Thinking asynchronously is hard. Designing software that works in a hugely parallel fashion is harder still. Putting locks in the wrong place and for the wrong durations can be catastrophic. The worst I had was a bug that took over six months to find in live servers. It turned out to be a lock command that I had copy-pasted from some other piece of code many months earlier, and I had it ONE LINE down from where it should be. The error, when it rarely occurred, caused a minor fault that snowballed. The eventual crash sprung up somewhere else in the code, perhaps minutes later. Debugging that, regardless of the data that you have, is as close to impossible as I believe you can get. In the end, I found the issue using old-fashioned debugging: lots of logging (printf for the win…) and exhaustive searching through “suspicious” code, i.e., I thoroughly checked my locks.

Good architecture and structure are the best defence against crap software in all cases: put down the right foundations and you can modify the building later without having to rip the whole thing down. This applies even more so to multi-threaded code. In these days of increased parallelism in CPUs, it is even more important to consider how to take advantage of this for performance and usability reasons. It is tough to retrofit, and it is tough to scale top-down slicing of jobs into threads to infinite CPUs, but nearly every application has some job that would traditionally have slapped a “please wait” on the screen that could be removed with good use of a background thread or two, even if it is just to display a groovy interactive display whilst something is happening.

A learning experience is one of those things that say, “You know that thing you just did? Don’t do that.”
– Douglas Adams.

Experience leads to good solutions: you make your mistakes, you learn and you move forwards (unless you already know everything, of course). Programs in all languages by all programmers show experience levels primarily in their architecture: I saw one so-called expert demonstrate that his experience was about 8 years shorter than he had suggested by creating an application component that looked like a check-list of C++ features. I had never seen so many redundant uses of templates, classes and inheritance before (or since) in my entire life. I have used that code as an example of “how not to program C++” ever since.

Incidentally, our example can be made bug free without any use of critical sections:

void* PeacefulThread(void* argument)
{
  // Do the calculation (yes, obviously it could return 10):
  int* pResult = (int*)argument;
  for (int i = 0; i < 10; ++i)
  {
    *pResult = *pResult + 1;
  }
  return NULL;
}

int main (int argc, char *argv[])
{
  int thread1Result = 0, thread2Result = 0;
  pthread_t thread1;
  pthread_t thread2;

  // No error detection, don't try this at home, kids!
  pthread_create(&thread1, NULL, PeacefulThread, (void*)&thread1Result);
  pthread_create(&thread2, NULL, PeacefulThread, (void*)&thread2Result);

  // Wait for them to complete:
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);

  // Show result:
  const int finalResult = thread1Result + thread2Result;
  printf("Yes, this number [%d] should be 20!\n", finalResult);
  return 0;
}

The small cost of adding the results together synchronously at the end still leaves us with the parallel benefits, and we all win. The code is simpler, less likely to go tits up, and we have no need to play pthread bingo with our feature use.

I will leave you with a final thought. Humans tend to chop problems into known slices when designing for multiple CPUs. This scales fine for one, two, four or even eight or sixteen CPUs but diminishing returns set in fast. The ultimate solution is a more bottom-up approach: create very large populations of small virtual computers that can combine their work to solve larger problems. This approach, if taken correctly, leads to software that can scale to an infinite number of CPU cores. It borrows its philosophy from biology: the most incredible distributed parallel processing system we know of. I humbly suggest that this is the future of software development.

One day your computer will contain life.

One day, your computer will be smarter than you.

… in the meanwhile, though, the smartest component of a computer is the person sitting in front of it. Relish this time, it’s temporary.


1 This article, as over-bloated, vast and tedious as it is, can merely touch at the surface of this stuff. One programmer I know reckons a life-time is probably not sufficient to grasp all the intricacies of multi-threaded architecture.

2 A Dr Who reference. It is supposed to stand for Time And Space Blah Blah Relative Whatever, but I like to think of it as “Twice As Roomy, Despite Its Size”

