
Python multi-core benchmark in uvloop context

The existence of Cython combined with a clean concurrency model based on technologies such as gevent and pygolang could change the situation, if both can be tightly integrated into Cython's static cdef code rather than scattered as they are today.
  • Last Update: 2019-03-24
  • Version: 002
  • Language: en

The text below is based on my 2016 notes on Python, uvloop and Go, written after a colleague came back from the EuroPython conference. It shows that on the Python side, even though uvloop lightly wraps libuv, which is known to be fast, performance drops significantly as soon as pure Python code starts to be added to the request handler, and the end result is that the Python server becomes more than an order of magnitude slower than the Go version.

However, as of 2018 we still hope that the Python world can be improved for implementing modern networked servers, by bringing in a combination of Cython and coroutine/stackless-based approaches such as gevent and pygolang. All the pieces are there, but scattered. We just need a focal point where they can all be tightly integrated together.

Back from EuroPython 2016

It is good to hear that the Python world is trying to catch up. Having used Python for many years and played with Go in recent times, I know both, and I'd like to provide some feedback from my side. But before we begin, let me show you something:

https://lab.nexedi.com/kirr/misc/raw/e2306922/servers/t/pyuv_vs_go.png

Above are benchmarks for HTTP servers taken from the uvloop blog post, with the HTTP handlers modified to do some simple text processing instead of only 100% I/O. The graph shows that once we start adding Python code - even very simple code - to the server's request handling, performance is killed in the Python case.

Now here is the explanation:

Modern web servers are built around so-called reactors. These are specialized event loops which communicate with the kernel in efficient ways (solving the C10k problem) using mechanisms like epoll() on Linux, kqueue() on FreeBSD, etc. A reactor is basically a loop in which you subscribe for events on file descriptors and receive notifications via corresponding callbacks. On top of that, the system tries to keep every CPU loaded, which leads to M:N schemes where M threads (one per CPU) each handle many connections via event callbacks.
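To make the reactor idea concrete, below is a minimal sketch of such a loop built on Python's stdlib selectors module (which picks epoll/kqueue/... automatically); the echo handling is just for illustration:

    # minimal reactor sketch: subscribe for events on file descriptors
    # and dispatch to callbacks (selectors uses epoll/kqueue underneath)
    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(server):
        conn, _ = server.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, echo)  # subscribe + callback

    def echo(conn):
        data = conn.recv(4096)
        if data:
            conn.sendall(data)
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(('127.0.0.1', 25000))
    server.listen(128)
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:                         # the reactor loop itself
        for key, _ in sel.select():
            key.data(key.fileobj)       # invoke the registered callback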

libuv is a C library which wraps the OS-specific mechanisms for organizing reactors in a uniform way. uvloop wraps libuv via Cython and exposes its services to Python. Go has a network scheduler built into its runtime which does something similar to libuv.
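For reference, switching asyncio onto the libuv-based loop is just a couple of lines with uvloop's standard event-loop-policy API:

    # replace the default asyncio event loop with a libuv-backed one
    import asyncio
    import uvloop

    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    loop = asyncio.get_event_loop()     # now backed by libuv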

Now, the original MagicStack benchmarks (python, go) actually only receive a request and send back the payload. Since those two paths are all I/O and are handled ~99% by the reactors alone, what those benchmarks really measure is how well the underlying reactor libraries perform. I'm sure libuv is a good library, just as the Go runtime is well done and tuned, and the MagicStack benchmarks confirm that.

However, any real server has some control logic and actual work to do in its HTTP handlers, and that is not there in the MagicStack benchmarks. Once we start adding request-handling code, the workload is no longer pure I/O, even in cases where people tend to think performance should be I/O-bound.

In the Python case, executing pure-Python code is known to be slow. The fact that in the original MagicStack benchmarks performance dropped through the floor while an HTTP parser written in Python was used confirms this - performance recovered only when they actually switched to a C library to parse HTTP requests. For Python the trend is: whenever we need performance, we move that code to C and wrap it. But experience also shows that not all code can be localized so well, and slowness often remains scattered throughout the whole Python codebase. (I'm not considering here cases where we move everything to C and wrap only something like a top-level main(), because then there is nothing left on the Python side.)

So let's simulate, at least in part, being a real web server and do some work in the HTTP handler. As an example workload I chose to analyze the characters of the request path and see how close they are to the names of several of our websites. Here is the workload for the Python case (full source):

    def handle(self, request, response):
        parsed_url = httptools.parse_url(self._current_url)

        xc = XClassifier()
        for char in parsed_url.path:
            xc.nextChar(char)

        resp = b'%s:\t%d\n%s:\t%d\n%s:\t%d\n%s:\t%d\ntotal:\t%d\n' % (
                navytux, xc.nnavytux, nexedi, xc.nnexedi,
                lab, xc.nlab, erp5, xc.nerp5,
                xc.ntotal)

        response.write(resp)
        if not self._current_parser.should_keep_alive():
            self._transport.close()
        self._current_parser = None
        self._current_request = None


navytux = b'navytux.spb.ru'
nexedi  = b'www.nexedi.com'
lab     = b'lab.nexedi.com'
erp5    = b'www.erp5.com'

# whether character ch is close to string s.
# character is close to a string if it is close to any of characters in it
# character is close to a character if their distance <= 1
def isclose(ch, s):
    for ch2 in s:
        if abs(ch - ch2) <= 1:
            return True
    return False

class XClassifier:

    def __init__(self):
        self.nnavytux   = 0
        self.nnexedi    = 0
        self.nlab       = 0
        self.nerp5      = 0
        self.ntotal     = 0

    def nextChar(self, ch):
        if isclose(ch, navytux):
            self.nnavytux += 1
        if isclose(ch, nexedi):
            self.nnexedi  += 1
        if isclose(ch, lab):
            self.nlab     += 1
        if isclose(ch, erp5):
            self.nerp5    += 1

        self.ntotal += 1

and the same for Go (full source):

func handler(w http.ResponseWriter, r *http.Request) {
    xc := NewXClassifier()
    path := r.URL.Path
    for i := range path {
        xc.nextChar(path[i])
    }

    fmt.Fprintf(w, "%s:\t%d\n%s:\t%d\n%s:\t%d\n%s:\t%d\ntotal:\t%d\n",
                navytux, xc.nnavytux, nexedi, xc.nnexedi,
                lab, xc.nlab, erp5, xc.nerp5,
                xc.ntotal)
}

const (
    navytux = "navytux.spb.ru"
    nexedi  = "www.nexedi.com"
    lab     = "lab.nexedi.com"
    erp5    = "www.erp5.com"
)

func abs8(v int8) int8 {
    if v >= 0 {
        return v
    }
    return -v
}

// whether character ch is close to string s.
// character is close to a string if it is close to any of characters in it
// character is close to a character if their distance <= 1
func isclose(ch byte, s string) bool {
    for i := 0; i < len(s); i++ {
        ch2 := s[i]
        if abs8(int8(ch - ch2)) <= 1 {
            return true
        }
    }
    return false
}


type XClassifier struct {
    nnavytux int
    nnexedi  int
    nlab     int
    nerp5    int
    ntotal   int
}

func NewXClassifier() *XClassifier {
    return &XClassifier{}
}

func (xc *XClassifier) nextChar(ch byte) {
    if isclose(ch, navytux) {
        xc.nnavytux += 1
    }
    if isclose(ch, nexedi) {
        xc.nnexedi  += 1
    }
    if isclose(ch, lab) {
        xc.nlab     += 1
    }
    if isclose(ch, erp5) {
        xc.nerp5    += 1
    }

    xc.ntotal += 1
}

For every request we create a classifier object, and then for each character in the request path perform several method/function calls and lookups in the website-name strings. In the end the classification statistics are returned to the client:

$ curl -v http://127.0.0.1:25000/helloworld
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 25000 (#0)
> GET /helloworld HTTP/1.1
> Host: 127.0.0.1:25000
> User-Agent: curl/7.50.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Content-Length: 83
<
navytux.spb.ru: 5
www.nexedi.com: 10
lab.nexedi.com: 10
www.erp5.com:   10
total:  11
* Connection #0 to host 127.0.0.1 left intact

This is not a big workload - rather a small one - and it is of doubtful usefulness by itself, but it shows what starts to happen performance-wise when there is some non-trivial work in the handler. About performance: benchmarking was done via e.g.:

wrk -t 1 -c 40 -d 10 http://127.0.0.1:25000/helloworld

on the same machine (my old Core2-Duo notebook) with output like:

Running 10s test @ http://127.0.0.1:25000/helloworld
  1 threads and 40 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.20ms    2.16ms  31.72ms   90.58%
    Req/Sec    21.72k     1.24k   27.43k    89.00%
  216080 requests in 10.00s, 41.21MB read
Requests/sec:  21601.66
Transfer/sec:      4.12MB

For N(input characters) > 11, "helloworld" was simply repeated in the request path the needed number of times. For those interested, the actual benchmark runs and numbers are here.

At this point I'd like to ask you to look once again at the picture at the beginning of the post: when there is only 1 input character (only "/" in the path), Python and Go perform close to each other, though Python is already slower. For 11 input characters ("/helloworld" in the path) the difference is ~2.8x. For 101 input characters the difference is ~14x - Python becomes more than an order of magnitude slower.

So to me this once again shows that even with libuv integration, for any real use case, as long as there is Python code in the request handler, performance will be much slower compared to Go.

The Go case, on the other hand, shows that Go performs rather well, usually without wrapping anything, as most things can be and actually are written in Go itself. For example, ~90% of the Go runtime - in particular the libuv analogue, Go's network scheduler - is implemented in Go itself, which shows that the language can deliver close-to-native speed while remaining a high-level-enough language similar to Python.

I'd also like to add that MagicStack's benchmarks are not reasonable in that they set GOMAXPROCS=1. In simple words this means that while Go's support for multicore machines is very good, it was artificially limited to using only 1 CPU on the system - in other words, Go was artificially constrained to behave as if it had something like the GIL from the Python world. Without the GOMAXPROCS=1 setting, Go by default uses all available CPUs, and for cases where there is not much contention between handlers, performance usually scales close to linearly. Let me remind you that we are already using 8-CPU machines at Vifib, while Python practically cannot use more than 1 CPU in a single process because of the GIL.

Some thoughts on concurrency

I also want to add some words about Python's approach to concurrency and to building parallel servers. To me, with asyncio and async/await, a different world is being created for no good reason: every part of the software has to be adapted to async & co, and what emerges is callback spaghetti - yes, a somewhat mitigated one, but spaghetti still.

In Go, on the other hand, it is still the same serial world, with channels added through which goroutines can be connected. Each goroutine runs serially, but can send data over channels just like over a pipe. To me this is a significantly better and more uniform approach, even in terms of human thinking, so here Go adds not only performance but also productivity. And I can say this with confidence, because I worked with reactors and asynchronous programming for quite a while, sometimes even implementing my own reactors.

The thing is: a computer is already a state machine (it runs each assembly instruction by leveraging a state machine inside the CPU), and then we have a state machine in the OS to run several programs/threads. But state machines are harder for humans to understand and to implement than serial programming and communication. So what makes sense is to implement the state machine at the lowest possible level, and then give programmers the feeling that they have many serial processes plus adequate communication primitives.

The reactor is itself a state machine. Go implements it in the runtime and hides it, giving the user serial goroutines and channels, while other parties just throw the complexity of "asynchronicity" at developers. For me as a developer, the better approach for Python would be to actually integrate libuv deeply into its runtime and give developers green (= very cheap) threads and communication primitives. The sad thing is that Stackless (2) had been working in this direction for years (it started around the beginning of the 2000s - the same time the first GIL-removal patches started to appear), but despite this approach being with us for a long time, there is seemingly almost zero chance of it being merged into CPython, and people reinvent "cute" things (async/await, asyncio) and throw the twisted complexity of programming at developers.
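gevent already gives a taste of that model today: greenlets look like plain serial code, with no async/await spreading through the codebase. A minimal sketch (the hosts are just examples):

    # serial-looking code running concurrently on gevent green threads
    import gevent
    from gevent import socket   # cooperative sockets: blocking -> greenlet switch

    def fetch(host):
        # plain serial style; gevent's event loop switches greenlets
        # whenever this code blocks on I/O
        s = socket.create_connection((host, 80))
        s.sendall(b'GET / HTTP/1.0\r\nHost: ' + host.encode() + b'\r\n\r\n')
        data = s.recv(4096)
        s.close()
        return data

    jobs = [gevent.spawn(fetch, h) for h in ('www.nexedi.com', 'lab.nexedi.com')]
    gevent.joinall(jobs)
    for job in jobs:
        print(job.value[:20])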

Future directions

In 2016, I would have said that low performance and the lack of an adequate approach to concurrent programming sadly make Python a not-so-appropriate language for implementing loaded web servers today.

However, the existence of Cython, combined with a clean concurrency model based on technologies such as gevent and pygolang, could change the situation, if both can be tightly integrated into Cython's static cdef code rather than being all scattered as they are today.
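To illustrate the direction, here is a minimal sketch of Go-style concurrency in today's pure-Python pygolang API (the worker function is hypothetical); the hope expressed above is that such primitives could eventually live in static cdef code with little interpreter overhead:

    # goroutines + channels in Python via pygolang
    from golang import go, chan

    def worker(ch, n):
        # runs in its own lightweight task, serially from its own point of view
        ch.send(n * n)

    ch = chan()                  # unbuffered channel, like make(chan) in Go
    for n in range(4):
        go(worker, ch, n)        # spawn a "goroutine"; args are passed through

    results = [ch.recv() for _ in range(4)]
    print(sorted(results))       # -> [0, 1, 4, 9]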