Friday, October 16, 2009

jwhois-4.0 and 100% CPU usage on whois lookups

A few months ago I started to notice that every now and then a whois lookup would eat one of my CPUs. Running whois under strace quickly revealed something interesting:



$ strace whois $FOO
...
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(43), sin_addr=inet_addr("149.17.192.7")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(1024, NULL, [3], NULL, {75, 0})  = 1 (out [3], left {74, 974605})
getsockopt(3, SOL_SOCKET, SO_ERROR, [-1482919303840792576], [4]) = 0
...
read(3, 0x7fffeb6b97b0, 1023)           = -1 EAGAIN (Resource temporarily unavailable)
... (repeat the above a lot, varies from run to run) ...
read(3, "NOT FOUND\n"..., 1023)         = 10
read(3, ""..., 1023)                    = 0
write(1, "[whois.publicinterestregistry.net"..., 45[whois.publicinterestregistry.net]
NOT FOUND
) = 45
exit_group(0)                           = ?

And then it would exit.  So, for some reason jwhois switched from blocking sockets in v3 to non-blocking in v4.  Why?  I have no idea.  It is, after all, only going to be dealing with one IO operation at a time (non-blocking fds are usually used when you need to poll multiple "things" at the same time).

Let's find out where this read() is via gdb:
$ gdb jwhois
...
(gdb) break getsockopt
Breakpoint 1 at 0x401c30
(gdb) run $FOO
Starting program: /usr/bin/jwhois $FOO
[Querying whois.publicinterestregistry.net]

Breakpoint 1, 0x0000003e2f0df1a0 in getsockopt () from /lib64/libc.so.6
Current language:  auto; currently minimal
(gdb) clear getsockopt
Deleted breakpoint 1
(gdb) break read
Breakpoint 2 at 0x3e2f0d1470
(gdb) c
Continuing.

Breakpoint 2, 0x0000003e2f0d1470 in read () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003e2f0d1470 in read () from /lib64/libc.so.6
#1  0x000000000040c1b0 in read (__nbytes=, __buf=, __fd=)
    at /usr/include/bits/unistd.h:45
#2  whois_read (__nbytes=, __buf=, __fd=) at whois.c:127
#3  0x000000000040c316 in whois_query (wq=0x7fffffffe0d0, text=0x7fffffffe078) at whois.c:84
#4  0x000000000040bde1 in jwhois_query (wq=0x7fffffffe0d0, text=0x7fffffffe0f8) at jwhois.c:298
#5  0x000000000040c08e in main (argc=2, argv=0x7fffffffe218) at jwhois.c:181
(gdb)

Above, I decided to break on getsockopt() - it was the closest (unique) system call to the read() I wanted to examine.  And now we can see that whois.c, line 127 is our culprit.  I decided to stick a select() right above the offending read() in order to sleep until the socket was ready to be read, instead of busy-wait looping.  Here's the patch:

diff -ur jwhois-4.0.old/src/whois.c jwhois-4.0/src/whois.c
--- jwhois-4.0.old/src/whois.c  2007-06-26 03:00:29.000000000 -0400
+++ jwhois-4.0/src/whois.c      2009-04-29 11:42:56.000000000 -0400
@@ -115,6 +115,7 @@
   unsigned int count, start_count;
   int ret;
   char data[MAXBUFSIZE];
+  fd_set rfds;

   count = 0;

@@ -124,7 +125,15 @@

   do
     {
+      FD_ZERO(&rfds);
+      FD_SET(fd, &rfds);
+      ret = select(fd + 1, &rfds, NULL, NULL, NULL);
+
+      if (ret <= 0)
+        return -1;
+
       ret = read(fd, data, MAXBUFSIZE-1);
+
       if (ret >= 0)
        {
          count += ret;

No comments:

Post a Comment