vision scalability social networking revelation source

C++ Multithreading

Computers have to handle many different things at once, for example screen, keyboard, drives, database, internet.

These are best represented as communicating concurrent processes, with channels, as in Go routines. Even algorithms that are not really handling many things at once, but are doing a single thing, such as everyone’s sample program, the sieve of Eratosthenes, are cleanly represented as communicating concurrent processes with channels.

On the other hand, also, not quite so cleanly, represented by asynch await which makes for much lighter weight code, more cleanly interfaceable with C++.

Concurrency is not the same thing as parallelism.

A node.js program is typically thousands of communicating concurrent processes, with absolutely no parallelism, in the sense that node.js is single threaded, but a node.js program typically has an enormous number of code continuations, each of which is in effect the state of a concurrent communicating process. Lightweight threads as in Go are threads that on hitting a pause get their stack state stashed into an event handler and executed by event oriented code, so one can always accomplish the same effect more efficiently by writing directly in event oriented code.

And it is frequently the case that when you cleverly implement many concurrent processes with more than one thread of execution, so that some of your many concurrent processes are executed in parallel, your program runs slower, rather than faster.

C++ multithreading is written around a way of coding that in practice does not seem all that useful – parallel bitbashing. The idea is that you are doing one thing, but dividing that one thing up between several threads to get more bits bashed per second, the archetypical example being a for loop performed in parallel, and then all the threads join after the loop is complete.

The normal case however is that you want to manage a thousand things at once, for example a thousand connections to the server. You are not worried about how many millions of floating point operations per second, but you are worried about processes sitting around doing nothing while waiting for network or disk operations to complete.

For this, you need concurrent communicating processes, as in Go or event orientation as in node.js or nginx, node.js, not necessarily parallelism, which C++ threads are designed around.

The need to deal with many peers and a potentially enormous number of clients suggests multiprocessing in the style of Go and node.js, rather than what C++ multiprocessing is designed around, suggests a very large number of processes that are concurrent, but not all that parallel, rather than a small number of processes that are concurrent and also substantially parallel. Representing a process by a thread runs into troubles at around sixty four threads.

It is probably efficient to represent interactions between peers as threads, but client/peer are going to need either events or Go lightweight threads, and client/client interactions are going to need events.

Existing operating systems run far more than sixty four threads, but this only works because grouped into processes, and most of those processes inactive. If you have more than sixty four concurrently active threads in an active process, with the intent that half a dozen or so of those active concurrent threads will be actually executing in parallel, as for example a browser with a thread for each tab, and sixty four tabs, that active process is likely to be not very active.

Thus scaling Apache, whether as threads on windows or processes under Linux, is apt to die.

1 Need the solutions implemented by Tokio, Actix, Node.js and Go

Not the solutions supplied by the C++ libraries, because we are worrying about servers, not massive bit bashing.

Go routines and channels can cleanly express both the kind of problems that node.js addresses, and also address the kind of problem that C++ threads address, typically that you divide a task into a dozen subtasks, and then wait for them all to complete before you take the next step, which are hard to express as node.js continuations. Goroutines are a more flexible and general solution, that make it easier to express a wider range of algorithms concisely and transparently, but I am not seeing any mass rush from node.js to Go. Most of the time, it is easy enough to write in code continuations inside an event handler.

The general concurrent task that Google’s massively distributed database is intended to express is that you have a thousand tasks each of which generate a thousand outputs, which get sorted, and each of the enormous number of items that sort into the same equivalence group gets aggregated in a commutative operation, which can therefore be handled by any number of processes in any order, and possibly the entire sort sequence gets aggregated in an associative operation, which can therefore be handled by any number of processes in any order.

The magic in the Google massively parallel database is that one can define a a massively parallel operation on a large number of items in a database simultaneously, much as one defines a join in SQL, and one can define another massively parallel operation as commutative and or associative operations on the sorted output of such a massively parallel operation. But we are not much interested in this capability. Though something resembling that is going to be needed when we have to shard.

2 doing node.js in C++

Dumb idea. We already have the node.js solution in a Rust library.

Actix and Tokio are the (somewhat Cish) solutions. But Rust async is infamously hard. The borrow checker goes mad trying figure lifetimes in async

2.1 callbacks

In C, a callback is implemented as an ordinary function pointer, and a pointer to void, which is then cast to a data structure of the desired type.

What the heavy C++ machinery of std::function does is bind the two together and then do memory management after the fashion of std::string.

(but we probably need to do our own memory management, so need to write our own equivalent of std funcction supporting a C rather than C++ api)

And std::function, used correctly, should compile to the identical code merely wrapping the function pointer and the void pointer in a single struct – but you had better use compiler explorer to make sure that you are using it correctly.

Write a callback in C, an an std::function in c++, and make sure that the compiler generates what it should.

Ownership is going to be complicated – since after createing and passing a callback, we probably do not want ownership any more – the thread is going to return and be applied to some entirely different task. So the call that is passed the callback as an argument by reference uses move to ensure that when the std::function stack value in its caller pointing to the heap gets destroyed, it does not free the value on the heap, and then stashes the moved std::function in some safe place.

Another issue is that rust, python, and all the rest cannot talk to C++, they can only talk C. On the other hand, the compiler will probably optimise the std::function that consists of a lamda that is a call to function pointer and that captures a pointer to void.

Again, since compiler is designed for arcane optimization issues, have to see what happens in compiler explorer.

But rather than guessing about the compiler correctly guessing intent, make the callback a C union type implementing std variant in C, being a union of std:monostate a C callback taking no arguments, a C++ callback taking no arguments, C and C++ callbacks taking a void pointer argument, a c++ callback that is a pointer to method, and a C++ callback that is an std::function

In the old win32 apis, which were designed for C, and then retrofitted for C++ they would have a static member function that took an LPARAM, which was a pointer to void pointing at the actual object, and then the static member function would directly call the appropriate, usually virtual, actual member function.

Member function pointers have syntax that no one can wrap their brains around so people wrap them in layers of typedefs.

Sometimes you want to have indefinitely many data structures, which are dynamically allocated and then discarded.

Sometimes you want to have a single data structure that gets overwritten frequently. The latter is preferable when it suffices, since it means that asynch callback code is more like sync code.

In one case, you would allocate the object every time, and when does with it, discard it.

In the other case it would be a member variable of struct that hangs around and is continually re-used.

2.1.1 C compatibility

Node.js turns out to be be humanly comprehensible. Under the hood, it corresponds to what would require a vast amount of arcane syntax to do in C, which vast amount of arcane syntax is not humanly comprehensible, and has to be sugared away.

Bind the event and the relevant object together in a way that C can understand.

Then, in C++, or whatever language, you can then make that binding automatic and transparent to the programmer, so that he is writing code that looks like node.js code continuations.

This implies that a call that triggers the event machinery, and could provoke many events, for example a duplicated reply packet on the network and a timeout results in one and only one event being sent back to the handler that programmer defined, and he then has to decide which event it is is in his code continuation.

And when he does something that is going to trigger an event, which will eventually result in his code continuation being executed, that something has a thread affinity which may well differ from the currently executing thread.

So the library routine that he is calling, which will activate the event machinery to direct the event to its proper destination when it eventually happens, has to detect if it is in a thread with the wrong affinity, and if it is, instead of just executing the library code, has create an event to be handled by a particular thread and stash that event in that thread’s channel. Which will result in that thread doing something that will eventually result in the event that the programmer has written the code continuation to handle.

In the event of a thread affinity issue, the programmer code does not immediately do something that will result in the return event. Instead it creates an event that when it happens will result in another event, that the programmer actually cares about. And the programmer does not need to know that his code continuation got wrapped in another code continuation, and then unwrapped, that the code continuation he supplied got passed to one event, then passed to another, then passed back to him.

Thread affinity is not resolved in the event handling code, but in the event triggering code. An event does not have a thread affinity. When the code the programmer wrote to handle the event is executed, it winds up calling a library routine that triggers an event, and that library routine knows its proper thread affinity. If the library routine is being executed with the wrong affinity, it creates a wrapper event, and pushes the wrapper event, which has no thread affinity, into the channel of a thread with the right affinity, a channel that does have thread affinity.

At the bare iron level, when an event happens, it happens in a thread that is waiting on that event, for example the thread that is forever waiting for incoming packets, or the thread that is forever waiting for timeouts. And we do not want that to be able to mutate data that might be global to all threads. So if a single event could be triggered in two different threads, the thread waiting for data to arrive, and the thread waiting on timeouts, each thread has to hold an std::weak_ptr to that event, the weak pointer points to an std::shared_ptr in the event store for events that could be handled at the iron level by multiple threads , and the space allocated in the event store cannot be freed until the std::shared_ptr is destroyed. The std::weak_ptr gets invalidated when the shared_ptr is destroyed, but that might happen in another thread, so the current thread might have an invalid copy of the std::weak_ptr in cache, so the shared pointer has to be mutex protected, and the weak pointer checking access to that thread global shared pointer, because when it decides on validity, it is accessing data global to all threads.

So, the timer thread grabs exclusive access to the entire data structure that holds event objecst that could be subject iron level events happening in two threads. gets a copy of the original shared pointer, which points to data containing the original shared pointer, gets an ordinary pointer to the actual object in the event management code, makrks that object with its own thread id (which will tells any other thread to forget about this event, because it is about to be destroyed) and immediately releases the mutex, (because it now has exclusive access to this particular object within that data structure and copies the event data structure into a new event data structure that describes the timeout event, grabs the mutex for the event channelof the same thread that also handles data that has already been received and already stashed in a buffer, allocates itself some space in that channel, releases the mutex writes into that space, grabs the mutex again and marks the data ready for reading, releases that mutex grabs the mutex for the data structure of events global to more than one thread, frees the entire object thus destroying the shared pointer, thus invalidating all weak pointers pointing to this object for everyone.

We don’t want to be holding two mutexes at once, so we release the mutex for global events before we grab the mutex for the event handler channel, and release the mutex for the event handler channel before we destroyed the object. So while this is ongoing, another thread could successfully dereference the weak pointer, but would then see destruction was under way, and immediately release the mutex and abandon the data.

To hold mutexes only for a very short and deterministic time, we only manipulate pointers to buffers while under mutex, we dont do any buffer copying. Grab mutex, say you own this chunk of formerly free space, release mutex, write to that space, grab mutex again, say it is ready for reading.

And of course, test to see if this actually saves time. Which for copying small data structures it will not. Indeed, given how long an atomic operation takes, it is likely to be useful only with huge amounts of data, which an iron level event in the network thread might have, but an event in the timer thread certainly will not have.

So, if the timer thread successfully creates a temporary shared pointer, it then has to handle the event inside the mutex, and handle it fast

And of course, debugging multthreaded code is infamously hard. You need a harness that stresses the hell out of it, and continually mixes up what kind of stress it is subjected to, and which also records timing data indicative of delays resulting from thread contention in operations that should in theory be fast.

To debug all this arcane machinery, every object that might be subject to thread contention has to have a compile time option that gives it mutex protected magic number when it is constructed, erases that magic number when it is destroyed, and gives it the thread id of the thread that currently has access to it.

The code that calls the callback knows nothing about how the blob is structured. The event management code knows nothing about how the blob is structured. But the function pointer in the blob does know how the blob is structured.

This requires recursive definitions, which, as always, we handle in C by the non obvious method of forward definitions.

// p points to or into a blob of data containing a pointer to the callback
// and the data that the callback needs is in a position relative to the pointer
// that is known to the callback function.
  struct event;
  typedef extern "C" void (*eventCfunction)(event * pp);
  struct event
  {
     uint64_t actualRunTimeSize;
     eventCfunction p;
  };

// Within the actual function in the event handling code,
// one has to cast the pointer from its base type to its actual type
// that has the rest of the data, which the event despatcher code knows nothing of.
// on the other hand, a queue of events, a channel of events for a thread or thread group
// of a particular affinity, has the 
// static_cast<byte*>(p + p->actualRunTimeSize) point to the start of the next event
// and just past the last byte in this event, which information it does need to
// know.


// The event despatch code should not include the headers of the event handling code,
// as this would make possible breach of separation of responsibilities.
try{
(*(*pp).p)(pp);
}
catch(...){
    // log error and release event handler.
    // if an exception propagates from the event handling code into the event despatch code
    // it is programming error, a violation of separation of responsibilities
    // and the event despatch code cannot do anything with the error.
}

pp points into a blob that contains the data needed for handling the event when it happens, and a pointer to the code that will handle it when it happens, ptrEvent is a ptr to a struct containing the event.

But, that C code will be quite happy if given a class whose first field is a pointer to a C calling convention static member function that calls the next field, the next field being a lambda whose unnameable type is known to the templated object when it was defined, or if it is given a class whose first field is a pointer to a C calling convention static function that does any esoteric C++, or Rust, or Lua thing.

The runtime despatches an event to an object of type ReplyTo once and only once and then it is freed. Thus if for example the object is waiting for a packet that has a handle to it, or a timeout, and two such packets arrive it is only called with the first such packet, the next packet is silently discarded, and the timeout event cancelled or ignored.

The object of type RegardsTo has a longer lifetime, which the runtime does not manage. The runtime ensures that if two events reference the same RegardsTo object, they are handled serially, except for the case that the RegardsTo object is a nullpointer.

The next event referencing the same RegardsTo object goes into a queue waiting on completion of the previous message.

If an event has an InRegards to, but no InReply, it goes to the static default ReplyTo handler of the InRegards object, which gets called many times and whose lifetime is not managed by the runtime.

If a message should result in changes to many InRegards to objects one of them has to handle it, and then send messages to the others.

Code called by the async runtime must refrain from updating or reading data that could be changed by other code called by the asynch runtime unless the data is atomically changed from one valid state to another valid state. (For example a pointer pointing to the previous valid state is atomically updated to a pointer to a newly created valid state.)

In the normal case the ReplyTo callback is sticking to data that is in its ReplyTo and RegardsTo object.

When an RegardsTo object tells the runtime it has finalized, then the runtime will no longer do callbacks referencing it.

The finalization is itself an event, and results in a callback to another event data is a pointer to the finalized object and whose ReplyTo and RegardsTo objects are the objects that created the now finalized object.

Thus one RegardsTo object can spawn many such objects, and can read their contents when they finalise. (But not until they finalise)

2.2 Use Go

Throw up hands in despair, and provide an interface linking Go to secure Zooko ids, similar to the existing interface linking it to Quic and SSL.

This solution has the substantial advantage that it would then be relatively easy to drop in the existing social networking software written in Go, such as Gitea.

We probably don’t want Go to start managing C++ spawned threads, but the Go documentation seems to claim that when a Go heavyweight thread gets stuck at a C mutex while executing C code, Go just spawns another to deal with the lightweight threads when the lightweight threads start piling up.

When a C++ thread wants to despatch an event to Go, it calls a Go routine with a select and a default, so that the Go routine will never attempt to pause the C++ spawned thread on the assumption that it is a Go spawned thread. But it would likely be safer to call Goroutines on a thread that was originally spawned by Go.

2.3 doing it in C the C way

Processes represented as threads. Channels have a mutex. A thread grabs total exclusive ownership of a channel whenever it takes something out or puts something in. If a channel is empty or full, it then waits on a condition on the mutex, and when the other thread grabs the mutex and makes the channel ready, it notices that the other process or processes are waiting on condition, the condition is now fulfilled, and sends a notify_one.

Or, when the channel is neither empty nor full, we have an atomic spin lock, and when sleeping might become necessary, then we go to full mutex resolution.

Which implies a whole pile of data global to all threads, which will have to be atomically changed.

This can be done by giving each thread two buffers for this global data subject to atomic operations, and single pointer or index that points to the currently ruling global data set. (The mutex is also of course global, but the flag saying whether to use atomics or mutex is located in a data structure managed by atomics.)

When a thread wants to atomically update a large object (which should be sixty four byte aligned) it constructs a copy of the current object, and atomically updates the pointer to the copy, if the pointer was not changed while it was constructing. The object is immutable while being pointed at.

Or we could have two such objects, with the thread spinning if one is in use and the other already grabbed, or momentarily sleeping if an atomic count indicates other threads are spinning on a switch awaiting completion.

The read thread, having read, stores its read pointer atomically with memory_order_release, ored with the flag saying if it is going to full mutex resolution. It then reads the write pointer with memory_order_acquire, that the write thread atomically wrote with memory_order_release, and if all is well, keeps on reading, and if it is blocked, or the write thread has gone to mutex resolution, sets its mutex resolution flag and proceeds to mutex resolution. When it is coming out of mutex resolution, about to release the mutex, it clears its mutex resolution flag. The mutex is near the flags by memory location, all part of one object that contains a mutex and atomic variables.

So the mutex flag is atomically set when the mutex has not yet been acquired, but the thread is unconditionally going to acquire it, but non atomically cleared when the mutex still belongs to the thread, but is unconditionally going to release it.

If many read threads reading from one channel, then each thread has to memory_order_acquire the read pointer, and then, instead of memory_order_releaseing it, has to do an atomic_compare_exchange_weak_explicit, and if it changed while it was reading abort its reads and start over.

Similarly if many write threads writing to one channel, each write thread will have first spin lock acquire the privilege of being the sole write thread writing, or spin lock acquire a range to write to. Thus in the most general case, we have a spin locked atomic write state that specifies an area that has been written to, an area that is being written to, and an area that is available to be acquired for writing, a spin locked atomic read state, and mutex that holds both the write state and the read state. In the case of a vector buffer with multiple writers, the atomic states are three wrapping atomic pointers that go through the buffer in the same direction,

We would like to use direct memory addresses, rather than vector or deque addresses, which might require us to write our own vector or deque. See the thread safe deque, which however relies entirely on locks and mutexes, and whose extension to atomic locks is not obvious.

Suppose you are doing atomic operations, but some operations might be expensive and lengthy. You really only want to spin lock on amending data that is small and all in close together in memory, so on your second spin, the lock has likely been released.

Well, if you might need to sleep a thread, you need a regular mutex, but how are you going to interface spin locks and regular mutexes?

You could cleverly do it with notifies, but I suspect it is costly compared to just using a plain old vanilla mutex. Instead you have some data protected by atomic locks, and some data protected by regular old mutexes, and any time the data protected by the regular old mutex might change, you atomically flag a change coming up, and every thread then grabs the mutex in order to look amend or even look at the data, until on coming out of the mutex with the data, they see the flag saying the mutex protected data might change is now clear.

After one has flagged the change coming up, and grabbed the mutex, wha happens if another thread is cheerfully amending the data in a fast operation, having started before you grabbed the mutex? The other thread has to be able to back out of that, and then try again, this try likely to be with mutex resolution. But what if the other thread wants to write into a great big vector, and reallocations of the vector are mutex protected. And we want atomic operations so that not everyone has to grab the mutex every time.

Well, any time you want to do something to the vector, it fits or it does not. And if it does not fit, then mutex time. You want all threads to switch to mutex resolution, before any thread actually goes to work reallocating the vector. So you are going to have to use the costly notify pattern. “I am out of space, so going to sleep until I can use the mutex to amend the vector. Wake me up when last thread using atomics has stopped using atomics that directly reference memory, and has switched to reading the mutex protected data, so that I can change the mutex protected data.”

The std::vector documentation says that vector access is just as efficient as array access, but I am a little puzzled by this claim, as a vector can be moved, and specifically requests that you have a no throw move operation for optimization, and having a no copy is standard where it contains things that might have ownership. (Which leads to complications when one has containers of containers, since C++ is apt to helpfully generate a broken copy implementation.)

Which would suggest that vector access is through indirection, and indirects with threading create problems.

2.4 Wrapping this in C++ the C++ way

Our events are not the events the operating system provides. We get an operating system event in a thread waiting on that event, which then wraps it one of our events, and then stuffs it into the event input channel of the appropriate thread or thread group.

Which gets complicated if we are waiting on both the timer and the network adapter card, but I addressed that above, it is a C level problem, not a C++ level problem.

Our events get stuffed into the event input channel of a thread or thread group. And if our event handling code continuation is calling a library routine that does something that will eventually result in a callback, the library routine knows the thread affinity of the event it will end up generating.

But sometimes, often, the application programmer wants to specify the thread affinity of the continuation, for example an continuation that does read operations on the primary and permanent database should only happen in a thread that has a a read or read write database connection open, and a continuation that does read write access should only happen the one thread that has read write access to the primary database.

Sometimes, often, we have event handlers that want to read write to the primary database, and assemble messages that will be sent to to their counterparty.

And, we will want to know what response to that message is. Which can only be learned in a code continuation, a code continuation that probably does not need write access, or likely any any access, to the primary database. Code application that only handles messages has no need for any special thread affinity, since the event handling begins with the incoming message buffer that has already been allocated, and ends by generating the next message and stuffing that message into the event queue of the message sending thread.

We don’t want an instance of a class to ever be potentially used by two threads simultaneously, so we have to make sure that the operation that will result in the next event being pushed into an event channel always ends the code continuation. Which we can do by making sure that the template metacode for a code continuation generates the push into the event channel. The code continuation returns the parameters needed to generate the event and push it to the correct channel, and template generated code then actually constructs the event in the correct channel, adding to the event a reference to the object that represents the connection and conversation with the counterparty.

Unfortunately there are lots of wiggles on this idea. If the code takes a const reference to the object, for example we are just displaying information to the counterparty, obviously we could handle lots of events simultaneously.

Or an conversation object could contain another conversation object, giving the effect of subroutines.

The endless nesting of code continuations stops when the code continuation is a named method that you have defined for the class.

(All other references being default captured by value. All local values are destroyed, so get recreated by value so to use RAII you have to give the RAII object move semantics, like std::unique_ptr. To reference non local values, values that were not defined in the previous continuation, you have to explicitly capture them by const reference.)

If it is waiting for an event that is one of many possible events, then either when the first event arrives, all others must be discarded, or all others must be queued up until it asks for the next.

If the counterparty can send quite a few messages that pile up, then when the code continuation has to say for all messages in queue do the thing.

The code of the code continuation looks to the application programmer as if it is part of the message continuation, but actually of course it is complicated into the definition of the class whose instances each represent a conversation, and only the actual data is passed along as an object managed by the event handling code.

The instance initialization may well create a temp database, though this could result in a lot of empty temp databases sitting around on inactive connections, and it would be better to construct it in the first code continuation whose child events are going to need it, pass the pointer along by value, (as an std::unique that gets moved), which destroys the database on the code continuation that does not take the temp database as an argument.

Every thread will have in its its stack an std:unique_ptr that in one thread is the read write database connection, in some threads is a read only database connection, and in all the other threads is null. Similarly for compiled queries On a thread affinity group that has read only access to the database, the compiled write queries will be null, but the read queries will be recompiled for each connection. In a thread that does not have access to the database, all the pointers will exist, but they will all be null.

The continuation should happen on the same thread as triggered the event, if the thread is free, and if it is of the required class thread affinity. But if the previous event took place and needed to take place on the thread with read/write to database affinity, and the continuation does not need that affinity, much of the time the thread with read/write to database affinity is going to be busy.

If we cannot execute a continuation on the same thread as before, it get executed on the lowest numbered thread of the pool that has an empty input queue, failing that, if they are all busy, on whichever thread has the least number of things in its queue.

We only create a lot of threads with no affinity if we have a lot of hardware threads. We don’t want the pool of threads able to perform continuations to be larger than the number of hardware threads.

The channels are variables in the main program, which are initialized by the main program before any threads exist and this data remains unchanged as long as any thread exists so is globally readable, though adding anything to the channel or removing anything from the channel is going require its mutex, or mucking with atomics, but each thread will know where to find that mutex from the global data in the main program.

2.5 lightweight threads in C

A lightweight thread is just a thread where, whenever a lightweight thread needs to be paused by its heavyweight thread, the heavyweight thread stores the current stack state in the heap, and move on to deal with other lightweight threads that need to be taken care of. Which collection of preserved lightweight thread stack states amount to a pile of event handlers that are awaiting events, and having received events, are then waiting for a heavyweight thread to process that event handler.

Thus one winds up with what suspect it the Tokio solution, a stack that is a tree, rather than a stack.

Hence the equivalence between node.js and nginx event oriented programming, and Go concurrent programming.

3 costs

Windows 10 is limited to sixty four threads total. If you attempt to create more threads than that, it still works, but performance is apt to bite, with arbitrary and artificial thread blocking. Hence goroutines, that implement unofficial threads inside the official threads.

Linux will die with something like a thousand threads, which is not too bad.

Thread creation and destruction is fast, five to twenty microseconds, so thread pools do not buy you much, except that your memory is already going to be cached. Another source says 40 microseconds on windows, and fifty kilobytes per thread. So, a gigabyte of ram could have twenty thousand threads hanging around. Except that the windows thread scheduler dies on its ass.

There is a reasonable discussion of thread costs here

General message is that lots of languages have done it better, often immensely better, Go among them.

Checking the C++ threading libraries, they all single mindedly focus on the particular goal of parallelizing computationally intensive work. Which is not in fact terribly useful for anything you are interested in doing.

4 Atomics and mutexes

Atomics are costly, because they are the equivalent of a cache miss, and mutexes are costly, because they are an ease of use programmer interface built on top of atomics.

Using atomics directly is slightly faster, but not enough faster to matter much. Further, using atomics directly is a lot harder, and if you screw up the complexity, likely to be woefully inefficient.

The main cost of a mutex is usually the underlying atomic, and you are unlikely to reduce the number of atomics by using them directly.

If something is simple to accomplish with an atomic, then use an atomic. It is likely to be a modest efficiency improvement over a mutex. If clever programming is required, atomics will likely bite you.

In the use case I care about, likely one thread sitting on the network card, despatching and receiving messages from the network, one thread sitting on the database, despatching messages to and from the database, and one thread running a bunch of state machines, each state machine representing a single conversation with a sngle peer over the network, and the state machines waiting on state transitions caused by messages received from the database and network threads, and in response to these state transitions sending out packages to be sent by the network thread, or handled by the database thread.

Because we don’t want the network being ignored while waiting for the disk, nor the disk to be ignored while waiting for the network.

So, what is the efficient way to handle a message queue with one producer thread and one consumer thread? Seems easy to do with atomics, but if all input message queues are empty, you want the recipient thread to sleep on condition, which you cannot really do with atomics, and if an output message queue is full, you want the sending message thread to sleep on condition, with you cannot really do with atomics. (Life is a lot simple if your message buffer sizes are fixed at compile time.)

typedef enum memory_order {
    memory_order_relaxed,   // relaxed
    memory_order_consume,   // consume
    /* No one, least of all compiler writers, understands what
    "consume" does.
    It has consequences which are difficult to understand or predict,
    and which are apt to be inconsistent between architectures,
    libraries,  and compilers. */
    memory_order_acquire,   // acquire
    memory_order_release,   // release
    memory_order_acq_rel,   // acquire/release
    memory_order_seq_cst    // sequentially consistent
    /*  "sequentially consistent" interacts with the more commonly\
    used acquire and release in ways difficult to  understand or
    predict, and in ways  that  compiler and library writers
    disagree on. */
} memory_order;

I don’t think I understand how to use atomics correctly.

Atomic_compare_exchange_weak_explicit inside a while loop is a spin lock, and spin locks are complicated, apt to be inefficient, potentially catastrophic, and avoiding catastrophe is subtle and complex.

To cleanly express a concurrent algorithm you need a thousand communicating processes, as goroutines or node.js continuations, nearly all of which are sitting around waiting for the another thing to send them a message or be ready to receive their message, while atomics give you a fixed small number of threads all barreling full speed ahead. Whereupon you find yourself using spin locks.

Rather than moving data between threads, you need to move threads between data, between one continuation and the next.

Well, if you have a process that interacts with Sqlite, each thread has to have its own database connection, in which case it needs to be a pool of threads maybe you have a pool of database threads that do work received from a bunch of asynch tasks through a single fixed sized fifo queue, and send the results back through another fifo queue, with threads waking up when the queue gets more stuff in it, and going to sleep when the queue empties, with the last thread signalling “wake me up when there is something to do”, and pushback happening when buffer is full.

Go demonstrates that you can cleanly express algorithms as concurrent communicating processes using fixed size channels. An unbuffered channel is just a coprocess, with a single thread of execution switching between the two coprocesses, without any need for locks or atomics, but with a need for stack fixups. But Node.js seems to get by fine with code continuations instead of Go’s stack fixups.

A buffered channel is just a fixed size block of memory with alignment, size, and atomic wrapping read and write pointers.

Why do they need to be atomic?

So that the read thread can acquire the write pointer to see how much data is available, and release the read pointer so that the write thread can acquire the read pointer to see how much space is available, and conversely the write thread acquires the read pointer and releases the write pointer.And when write thread updates the write pointer it updates it after writing the data and does a release on the write pointer atomic, so that when the read thread does an acquire on the write pointer, all the data that the write pointer says was written will actually be there in the memory that read thread is looking at.

Multiple routines can send data into a single channel, and, with select, a single channel can receive data from any channels.

But, with go style programming, you are apt to have far more routines than actual hardware threads servicing them, so you are still going to need to sleep your threads, making atomic channels an optimization of limited value.

Your input buffer is empty. If you have one thread handling the one process for that input stream, going to have to sleep it. But this is costly. Better to have continuations that get executed when data is available in the channel, which means your channels are all piping to one thread, that then calls the appropriate code continuation. So how is one thread going to do a select on a thousand channels?

Well, we have a channel full of channels that need to be serviced. And when that channel empties, mutex.

Trouble is, I have not figured out how to have a thread wait on multiple channels. The C++ wait function does not implement a select. Well, it does, but you need a condition statement that looks over all the possible wake conditions. And it looks like all those wake conditions have to be on a single mutex, on which there is likely to be a lot of contention.

It seems that every thread grabs the lock, modifies the data protected by the lock, performs waits on potentially many condition variables all using the same lock and protected by the same lock, condition variables that look at conditions protected by the lock, then releases the lock immediately after firing the notify.

But it could happen that if we try to avoid unnecessarily grabbing the mutex, one thread sees the other thread awake, just when it is going to sleep, so I fear I have missed a spin lock somewhere in this story.

If we want to avoid unnecessary resort to mutex, we have to spin lock on a state machine that governs entry into mutex resolution. Each thread makes its decision based on the current state of channel and state machine, an does a Atomic_compare_exchange_weak_explicit to amend the state of the state machine. If the state machine has not changed, the decision goes through. If the state machine was changed, presumably by the other thread, it re-evaluates its decision and tries again.

Condition variables are designed to support the case where you have one thread or a potentially vast pool of threads waiting for work, but are not really designed to address the case where one thread is waiting for work from a potentially vast pool of threads, and I rather think I will have to handcraft a handler for this case from atomics and, ugh, dangerous spin loops implemented in atomics.

A zero capacity Go channel sort of corresponds to a C++ binary semaphore. A finite and small Go channel sort of corresponds to C++ finite and small semaphore. Maybe the solution is semaphores, rather than atomic variables. But I am just not seeing a match.

I notice that notifications seems to be built out of a critical section, with lots of grabbing a mutex and releasing a mutex, with far too much grabbing a mutex and releasing a mutex. Under the hood, likely a too-clever and complicated use of threads piling up on the same critical section. So maybe we need some spin state atomic state machine system that drops spinning threads to wait on a semaphore. Each thread on a channel drops the most recent state channel after reading, and most recent state after writing, onto an atomic variable.

But the most general case is many to many, with many processes doing a select on many channels. We want a thread to sleep if all the channels on which it is doing a select are blocked on the operation it wants to do, and we want processes waiting on a channel to keep being woken up, one at a time, as long a channel has stuff that processes are waiting on.

5 C++ Multithreading

std:aysnc is designed to support the case where threads spawn more threads if there is more work to do, and the pool of threads is not too large, and threads terminate when they are out of work, or do the work sequentially if doing it in parallel seems unlikely do yield benefits. C++ by default manages the decision for you.

Maybe the solution is to use threads where we need stack state, and continuations serviced by a single thread where we expect to handle one and only one reply. Node.js gets by fine on one thread and one database connection.

 #include &t;thread>
static_assert(__STDCPP_THREADS__==1, "Needs threads");
//  As thread resources have to be managed, need to be wrapped in
//  RAII
class ThreadRAII {
    std::thread & m_thread;
public:
//  As a thread object is moveable but not copyable, the thread obj
//  needs to be constructed inside the invocation of the ThreadRAII
//  constructor. */
    ThreadRAII(std::thread  & threadObj) : m_thread(threadObj){}
    ~ThreadRAII(){
        // Check if thread is joinable then detach the thread
        if(m_thread.joinable()){
            m_thread.detach();
            }
        }
    };

Examples of thread construction

    void foo(char *){
        …
    }

    class foo_functor
    {
    public:
        void operator()(char *){
            …
        }
    };


    int main(){
        ThreadRAII thread_one(std::thread (foo, "one"));
        ThreadRAII thread_two(
            std::thread (
                (foo_functor()),
                "two"
                )
            );
        const char three[]{"three"};
        ThreadRAII thread_lambda(
            std::thread(
                [three](){
                    …
                }
            )
        );
    }

C++ has a bunch of threading facilities that are designed for the case that a normal procedural program forks a bunch of tasks to do stuff in parallel, and then when they are all done, merges the results with join or promise and future, and then the main program does its thing.

This is not so useful when the main program is a event oriented, rather than procedural.

If the main program is event oriented, then each thread has to stick around for the duration, and has to have its own event queue, which C++ does not directly provide.

In this case threads communicate by posting events, and primitives that do thread synchronization (promise, future, join) are not terribly useful.

A thread grabs its event queue, using the mutex, pops out the next event, releases the mutex, and does its thing.

If the event queue is empty, then, without releasing it, the thread processing events waits on a condition variable. (which wait releases the mutex). When another thread grabs the event queue mutex and stuffs something into into the event queue, it fires the condition variable, which wakes up and restores the mutex of the thread that will process the event queue.

Mutexes need to construct RAII objects, one of which we will use in constructing the condition object.

reaction.la gpg key 154588427F2709CD9D7146B01C99BB982002C39F
This work is licensed under the Creative Commons Attribution 4.0 International License.