Tags: c++, multithreading, boost-asio, websocket++

Shouldn't I see a difference in CPU usage between a single-threaded and a multi-threaded websocketpp server?


I'm using a multithreaded websocketpp server that I configured like this:

Server::Server(int ep) {
    using websocketpp::lib::placeholders::_1;
    using websocketpp::lib::placeholders::_2;
    using websocketpp::lib::bind;

    Server::wspp_server.clear_access_channels(websocketpp::log::alevel::all);

    Server::wspp_server.init_asio();

    Server::wspp_server.set_open_handler(bind(&Server::on_open, this, _1));
    Server::wspp_server.set_close_handler(bind(&Server::on_close, this, _1));
    Server::wspp_server.set_message_handler(bind(&Server::on_message, this, _1, _2));

    try {
        Server::wspp_server.listen(ep);
    } catch (const websocketpp::exception &e){
        std::cout << "Error in Server::Server(int): " << e.what() << std::endl;
    }
    Server::wspp_server.start_accept();
}

void Server::run(int threadCount) {
    boost::thread_group tg;

    for (int i = 0; i < threadCount; i++) {
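        // every thread calls run() on the same underlying Asio io_service,
        // so connection handlers can be invoked from any of these threads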
        tg.add_thread(new boost::thread(
            &websocketpp::server<websocketpp::config::asio>::run,
            &Server::wspp_server));
        std::cout << "Spawning thread " << (i + 1) << std::endl;
    }

    tg.join_all();
}

void Server::updateClients() {
    /*
       run updates
    */
    for (websocketpp::connection_hdl hdl : Server::conns) {
        try {
            std::string message = "personalized message for this client from the ran update above";
            wspp_server.send(hdl, message, websocketpp::frame::opcode::text);
        } catch (const websocketpp::exception &e) {
            std::cout << "Error in Server::updateClients(): " << e.what() << std::endl;
        }
    }
}

void Server::on_open(websocketpp::connection_hdl hdl) {
    boost::lock_guard<boost::shared_mutex> lock(Server::conns_mutex);
    Server::conns.insert(hdl);

    //do stuff


    //when the first client connects, start the update routine
    if (conns.size() == 1) {
        Server::running = true;
        bool *run = &(Server::running);
        std::thread([run] () {
            while (*run) {
                auto nextTime = std::chrono::steady_clock::now() + std::chrono::milliseconds(15);
                Server::updateClients();
                std::this_thread::sleep_until(nextTime);
            }
        }).detach();
    }
}

void Server::on_close(websocketpp::connection_hdl hdl) {
    boost::lock_guard<boost::shared_mutex> lock(Server::conns_mutex);
    Server::conns.erase(hdl);

    //do stuff

    //stop the update loop when all clients are gone
    if (conns.empty())
        Server::running = false;
}

void Server::on_message(
        websocketpp::connection_hdl hdl,
        websocketpp::server<websocketpp::config::asio>::message_ptr msg) {
    boost::lock_guard<boost::shared_mutex> lock(Server::conns_mutex);

    //do stuff
}

I start the server with:

int port = 9000;
Server server(port);
server.run(/* number of threads */);

The only substantial per-client difference is in the message emission [wspp_server.send(...)]. Adding clients doesn't really add anything to the internal computation; it's only the number of messages to be emitted that grows.

My problem is that the CPU usage doesn't seem to be that much different whether I use 1 or more threads.

It doesn't matter whether I start the server with server.run(1) or server.run(4) (both on a dedicated 4-core CPU server). For a similar load, the CPU usage graph shows approximately the same percentage. I was expecting the usage to be lower with 4 threads running in parallel. Am I thinking of this the wrong way?

At some point, I got the sense that the parallelism really applies to the listening side rather than to the emission. So I tried wrapping each send in a new thread (which I detach) so that it runs independently of the sequence that triggers it, but it didn't change anything on the graph.
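Roughly, the attempt looked like this (a sketch of the idea, not the exact code I used):

// each send is fired from its own short-lived, detached thread so that the
// update loop doesn't wait for the emission to complete
for (websocketpp::connection_hdl hdl : Server::conns) {
    std::thread([hdl] () {
        try {
            std::string message = "personalized message for this client";
            Server::wspp_server.send(hdl, message, websocketpp::frame::opcode::text);
        } catch (const websocketpp::exception &e) {
            std::cout << "Error sending from detached thread: " << e.what() << std::endl;
        }
    }).detach();
}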

Am I not supposed to see a difference in the amount of work the CPU does? If I should, what am I doing wrong? Is there another step I'm missing to force the messages to be emitted from different threads?


Solution

  • "My problem is that the CPU usage doesn't seem to be that much different whether I use 1 or more threads."

    That's not a problem. That's a fact. It just means that the whole thing isn't CPU-bound, which should be quite obvious, since it's network IO. In fact, high-performance servers often dedicate only one thread to all IO tasks for exactly this reason (a small sketch at the end of this answer illustrates the point).

    "I was expecting the usage to be lower with 4 threads running in parallel. Am I thinking of this the wrong way?"

    Yes, it seems so. You don't expect to pay less if you split the bill 4 ways either.

    In fact, much like at the diner, you often end up paying more due to the overhead of splitting the load (cost/tasks). Unless you need more CPU capacity or lower reaction times than a single thread can deliver, a single IO thread is (obviously) more efficient because there is no scheduling overhead and no context-switch penalty.

    Another mental exercise:

    Background: What is the difference between concurrency, parallelism and asynchronous methods?
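
    To make the point concrete, here is a minimal, self-contained sketch (plain Boost.Asio with a timer standing in for network traffic, not the question's server): the threads spend almost all of their time blocked inside run() waiting for events, so the process shows roughly the same CPU usage whether one thread or four call run().

    // Sketch: several threads run the same io_context; while they are blocked
    // waiting for IO/timer events they consume no CPU, so 1 thread vs 4 threads
    // looks nearly identical on a CPU usage graph.
    #include <boost/asio.hpp>
    #include <boost/thread.hpp>
    #include <chrono>
    #include <functional>

    int main() {
        boost::asio::io_context io;

        // a 15 ms timer stands in for "a client needs an update" events
        boost::asio::steady_timer timer(io, std::chrono::milliseconds(15));
        std::function<void(const boost::system::error_code &)> tick =
            [&](const boost::system::error_code &) {
                // the cheap per-tick work (building/queueing messages) would go here
                timer.expires_after(std::chrono::milliseconds(15));
                timer.async_wait(tick);
            };
        timer.async_wait(tick);

        // change 4 to 1: the CPU graph barely moves, because the workload is
        // IO/timer bound, not CPU bound
        boost::thread_group tg;
        for (int i = 0; i < 4; ++i)
            tg.create_thread([&io] { io.run(); });
        tg.join_all();
    }

    The same reasoning applies to the websocketpp server above: extra threads only start to matter once the handlers actually need more CPU than a single core can provide.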