
launchpad-reviewers team mailing list archive

Re: [Merge] lp:~jameinel/launchpad/lp-service into lp:launchpad

 

On Wed, 06 Oct 2010 20:39:01 -0000, John A Meinel <john@xxxxxxxxxxxxxxxxx> wrote:
> For starters, thanks for the thorough review. It is really nice to get feedback on it, and you have certainly thought a lot about the code.
> 
> On 10/6/2010 12:02 AM, Michael Hudson-Doyle wrote:
> > Review: Needs Information
> > Wow, what a branch.  I've been quite picky in this, that's not because
> > I hate you <wink> but because it's subtle code and has potential to
> > cause issues, so I wanted to make sure all the comments make sense and
> > there's no redundant code, etc.
> > 
> > I worry slightly about what would happen if the forking service fell
> > over -- but clearly, we rely on the sftp server itself staying up, so
> > this probably isn't such a big deal.  I guess we also need to make
> > sure the forking service is running in production.
> 
> 
> We could trap failures to connect to the forking server and fall back
> to regular spawnProcess. The code is already a simple if/else
> block. However, I wouldn't really want silent degradation of
> connection.

Yeah, I don't really like that idea either.

> >> +        "fork <command>\n": Request a new subprocess to be started.
> >> +            <command> is the bzr command to be run, such as "rocks" or
> >> +            "lp-serve --inet 12".
> >> +            The immediate response will be the path-on-disk to a directory full
> >> +            of named pipes (fifos) that will be the stdout/stderr/stdin of the
> >> +            new process.
> > 
> > This doesn't quite make it clear what the names of the files will be.
> > The obvious guess is correct, but I'd rather be certain.
> > 
> >> +            If a client holds the socket open, when the child process exits,
> >> +            the exit status (as given by 'wait()') will be written to the
> >> +            socket.
> > 
> > This appears to be out of date -- there's also a fork-env command now.
> > Is the fork command still useful?
> 
> It has been useful in testing, but it isn't used in the 'production' code.

If the fork command is still implemented, I think it should still be
documented...
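
To be concrete about the file names, here's roughly the minimal client I
have in mind when reading that docstring (just a sketch: the response
framing is taken from fork_one_request, and I'm assuming the fifos
really are called stdin/stdout/stderr, as the later code suggests):

    import os
    import socket

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect('/var/run/launchpad_forking_service.sock')
    sock.sendall('fork lp-serve --inet 12\n')
    # On success the response is 'ok\n<pid>\n<path>\n'.
    ok, pid, path = sock.recv(1024).split('\n')[:3]
    # The directory should contain three named pipes.  The open order has
    # to agree with the child's (stdin first here, matching the child
    # opening its read end before its write ends), or both sides block.
    child_stdin = open(os.path.join(path, 'stdin'), 'wb')
    child_stdout = open(os.path.join(path, 'stdout'), 'rb')
    child_stderr = open(os.path.join(path, 'stderr'), 'rb')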

> ...
> 
> 
> > 
> >> +    DEFAULT_PATH = '/var/run/launchpad_forking_service.sock'
> > 
> > I'm not sure of the value of providing a default here, as it seems
> > that it's always in practice going to be overridden by the config.  If
> > it makes testing easier, it's worth it I guess.
> 
> Well, we override it for testing as well. I wanted to support "bzr
> lp-forking-service" as something you could run. Also, that seemed to
> be the 'best' place to put it, which gives hints as to where you would
> want to put it.

OK, let's leave it.

> >> +        self._children_spawned = 0
> >> +
> >> +    def _create_master_socket(self):
> >> +        self._server_socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
> >> +        self._server_socket.bind(self.master_socket_path)
> >> +        if self._perms is not None:
> >> +            os.chmod(self.master_socket_path, self._perms)
> > 
> > The pedant in me thinks this is a bit racy.  But I don't think it's
> > important (in fact, I wonder if the permission stuff is necessary
> > really).
> 
> I'm sure it isn't necessary in the "what do we need in Launchpad"
> sense. It seemed reasonable from a "what do you want from a service
> that allows arbitrary commands to be run from data written to its
> socket".

In that case shouldn't we do it more safely?  Not in this branch, I
guess...
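
(What I'd think of as the non-racy version is just to arrange the
permissions before bind() rather than chmod()ing afterwards, e.g.
something like this umask dance -- purely a sketch:)

    import os
    import socket

    old_umask = os.umask(0077)  # socket file gets created mode 0700
    try:
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        server.bind('/var/run/launchpad_forking_service.sock')
    finally:
        os.umask(old_umask)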

> > 
> >> +        self._server_socket.setsockopt(socket.SOL_SOCKET,
> >> +            socket.SO_REUSEADDR, 1)
> > 
> > Does this work with unix domain sockets?  I didn't think it did, and
> > your comments in _cleanup_master_socket lend weight to my position :-)
> > 
> > Another option would be to use a socket in the 'abstract namespace',
> > see man 7 unix, although that might be a bit magical.
> 
> I think it is a carry-over from
>  a) When this was a port (AF_INET)
>  b) trying to figure out why I couldn't start a new service. It does 
>     seem that you have to delete the socket on disk, or you can't bind 
>     to that address.
> 
> So we can just get rid of it.

Awesome.  I like deleting code.
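
(For completeness, the abstract-namespace variant I mentioned is just a
leading NUL byte in the address -- Linux-only, but there's no file on
disk to unlink or collide with.  The name below is made up:)

    import socket

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # The leading '\0' puts this in the abstract namespace: nothing
    # appears on the filesystem, and the name vanishes with the process.
    server.bind('\0launchpad-forking-service')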

> >> +        self._sockname = self._server_socket.getsockname()
> >> +        # self.host = self._sockname[0]
> >> +        self.port = self._sockname[1]
> > 
> > Is there any point to these three lines?
> 
> It isn't used, and I got rid of them (as above). It certainly was used
> when it was an actual random port during testing.

Right, that's what I guessed.

> >> +        stdin_path = os.path.join(base_path, 'stdin')
> >> +        stdout_path = os.path.join(base_path, 'stdout')
> >> +        stderr_path = os.path.join(base_path, 'stderr')
> >> +        # Opening for writing blocks (or fails), so do those last
> > 
> > man 3 mkfifo says that opening for both reading and writing blocks.  I
> > guess this means it's very important that the client and child process
> > open the fifos in the same order!
> 
> Well, you can open with O_NONBLOCK for reading. If you do that for
> writing, it raises an exception if nothing is on the other end.
> However, it ends up leaving the file handle in non-blocking mode for
> the rest of the time, so future bzr+ssh reads fail with "EAGAIN"
> (would have blocked). I can clarify.

Thanks.
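
(To spell out my understanding of the fifo semantics for anyone reading
along -- this is just a sketch of the behaviour, not anything from the
branch:)

    import errno
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), 'fifo')
    os.mkfifo(path)
    # Opening for writing with O_NONBLOCK fails outright (ENXIO) while
    # there is no reader; without O_NONBLOCK it blocks until one appears.
    try:
        os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError, e:
        assert e.errno == errno.ENXIO
    # Opening for reading with O_NONBLOCK returns at once even with no
    # writer, but the descriptor then *stays* non-blocking, so later
    # reads can raise EAGAIN -- hence the plain blocking opens, and the
    # need for both sides to open the fifos in a compatible order.
    read_fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)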

> >> +        # We force os._exit() here, because we don't want to unwind the stack,
> >> +        # which has complex results. (We can get it to unwind back to the
> >> +        # cmd_launchpad_forking_service code, and even back to main() reporting
> >> +        # the return code, but after that, suddenly the return code changes from
> >> +        # a '0' to a '1', with no logging of info.)
> >> +        # TODO: Should we call sys.exitfunc() here? it allows atexit functions
> >> +        #       to fire, however, some of those may be still around from the
> >> +        #       parent process, which we don't really want.
> > 
> > Does bzr install atexit functions?
> 
> I know the test suite does. I have very little clue as to what the
> launchpad oops code does. And the python logging module uses atexit
> functionality as well.

I guess it's safest to leave it like that then.

> >> +    def fork_one_request(self, conn, client_addr, command_argv, env):
> >> +        """Fork myself and serve a request."""
> >> +        temp_name = tempfile.mkdtemp(prefix='lp-forking-service-child-')
> >> +        # Now that we've set everything up, send the response to the client we
> >> +        # create them first, so the client can start trying to connect to them,
> >> +        # while we fork and have the child do the same.
> >> +        self._children_spawned += 1
> >> +        pid = self._fork_function()
> >> +        if pid == 0:
> >> +            pid = os.getpid()
> >> +            trace.mutter('%d spawned' % (pid,))
> >> +            self._server_socket.close()
> >> +            for env_var, value in env.iteritems():
> >> +                osutils.set_or_unset_env(env_var, value)
> > 
> > It's basically by chance, but I wanted to mention that this provides a
> > place to do something I've wanted to do for a long time: have bzr
> > lp-serve processes have BZR_LOG set to something based on their pid.
> > Not in this branch though :-)
> 
> It is slightly more involved than that, since you have to
> "trace._push_log_filename()", etc and not just set BZR_LOG.

Ah right.  Still, it's a clear place to do this now.

> Also, in discussing with Martin, it is probably better to just have
> mutter() include the os.getpid() information. Consider Apache logs,
> where everything is put into a single file, but you can still easily
> filter it back out again into individual requests.

Well, I'm not sure Apache logs are that good an analogy (http requests
tend to be shorter than bzr+ssh connections!) but I see the point.
Certainly, the ~codehost/.bzr.log files we currently have aren't much
use.
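
(Even something as small as the hypothetical helper below would be
enough to grep one connection back out of a shared log, I'd think:)

    import os
    from bzrlib import trace

    def mutter_with_pid(fmt, *args):
        # Tag every line with the pid, Apache-style, so a single log
        # file can be filtered back into per-connection streams.
        trace.mutter('[%d] ' % (os.getpid(),) + (fmt % args))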

> >> +            # See [Decision #3]
> >> +            self._create_child_file_descriptors(temp_name)
> >> +            conn.sendall('ok\n%d\n%s\n' % (pid, temp_name))
> >> +            conn.close()
> >> +            self.become_child(command_argv, temp_name)
> >> +            trace.warning('become_child returned!!!')
> >> +            sys.exit(1)
> > 
> > Would os._exit() not be more appropriate here?
> 
> If we get here, something bad has happened (we didn't hit the earlier
> os._exit() state). I don't have any strong hints, but sys.exit() means
> we can get a traceback.

Fair enough.

> >> +            except self._socket_error, e:
> >> +                if e.args[0] == errno.EINTR:
> >> +                    pass # run shutdown and children checks
> >> +                elif e.args[0] == errno.EBADF:
> >> +                    # We can get EBADF here while we are shutting down
> >> +                    # So we just ignore it for now
> >> +                    pass
> >> +                else:
> >> +                    # Log any other failure mode
> >> +                    trace.warning("listening socket error: %s", e)
> >> +            else:
> >> +                self.log(client_addr, 'connected')
> >> +                # TODO: We should probably trap exceptions coming out of this
> >> +                #       and log them, so that we don't kill the service because
> >> +                #       of an unhandled error
> > 
> > Er.  Yes.
> 
> I'm not 100% sure that having a service just start logging "unhandled
> exception" is going to get attention. If you are sure that there will
> be a way for this information to become visible, then I'm more willing
> to suppress it at this point.
> 
> hmm... maybe having "hello" report back if there have been unhandled
> errors?

Hmm.  I guess it could generate an OOPS?

I'm just very leery of any code path that could kill the forking
service, because that's "get a LOSA out of bed" territory if it happens
on the weekend.
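
Something along these lines around the per-connection handling would at
least keep the accept loop alive (a sketch only -- I'm assuming the
connection gets handed to serve_one_connection here, and whatever
reporting we settle on, OOPS or otherwise, would go where the warning
is):

    else:
        self.log(client_addr, 'connected')
        try:
            self.serve_one_connection(conn, client_addr)
        except Exception, e:
            # Don't let one bad request take the whole service down;
            # report the failure and keep accepting connections.
            trace.warning('unhandled error serving %s: %s'
                          % (client_addr, e))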

> >> +    def _wait_for_children(self, secs):
> >> +        start = time.time()
> >> +        end = start + secs
> >> +        while self._child_processes:
> >> +            self._poll_children()
> >> +            if secs > 0 and time.time() > end:
> >> +                break
> >> +            time.sleep(self.SLEEP_FOR_CHILDREN_TIMEOUT)
> > 
> > I find the practice of calling this function with an argument of
> > self.SLEEP_FOR_CHILDREN_TIMEOUT a little strange.  Isn't that just the
> > same as calling self._poll_children()?
> > 
> 
> The important calls use self.WAIT_FOR_CHILDREN_TIMEOUT. I can see your
> point, but actually it isn't quite the same. If you read the logic, the
> "break" comes *after* _poll_children. So the code does:
> 
> _poll_children()
> sleep(SLEEP_FOR_CHILDREN_TIMEOUT)
> _poll_children()
> break

Oh right.  Is that an important distinction?

> >> +    def _shutdown_children(self):
> >> +        self._wait_for_children(self.WAIT_FOR_CHILDREN_TIMEOUT)
> >> +        if self._child_processes:
> >> +            trace.warning('Failed to stop children: %s'
> >> +                % ', '.join(map(str, self._child_processes)))
> > 
> > This comment seems a bit wrong -- there's not actually been an attempt
> > to close the children down yet has there?  All we did was wait, and
> > they haven't exited.
> > 
> > A nice feature of the general architecture is that it would be quite
> > easy to disconnect listening for more connections from exiting the
> > daemon, something that will help one day with reducing the downtime
> > associated with an update.  But that can be sorted out later.
> 
> Changed it to "Children still running: %s".
> 
> I agree on both points. It would also potentially be useful to change
> the other setup code to check more frequently for whether
> "use_forking_daemon = True" is set. Which would mean you could disable
> it without restarting the daemon.
> 
> Also, if you used SIGKILL on the daemon, it won't try to kill its
> children (I think, unless SIGKILL gets inherited?).

Um, I guess this gets a bit complicated, but generally sending a signal
to a process doesn't have any immediate impact on its children.

SIGKILL will also leave the socket on disk I think, which would be
worse.

We should probably think about what will happen if the socket exists
when the process starts, actually: if the machine crashes, say, it's
likely to be there, but stale.
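
(The usual dance for that, I think, is to probe the socket before
unlinking it -- something like this sketch, so we only clean up a
genuinely stale path and refuse to start over a live service:)

    import errno
    import os
    import socket

    def remove_stale_socket(path):
        probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            probe.connect(path)
        except socket.error, e:
            if e.args[0] == errno.ENOENT:
                return  # nothing there at all
            if e.args[0] == errno.ECONNREFUSED:
                os.unlink(path)  # leftover from a crash; safe to remove
                return
            raise
        probe.close()
        raise RuntimeError('forking service already running on %s' % path)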

> But yes a "stop without killing children" would be useful. Or a "stop
> accepting new connections", bring up the new daemon which will accept
> new connections, and the other slowly exits.

Yep.  Though to be useful, we'd need to add such a thing to the sftp
daemon too.

> Actually, I think it would do that now if you SIGINT/SIGTERM this
> process, and start another one right away. With the main caveat that
> this process needs to delete the socket before the children are gone,
> so the new one can create it.

I guess you could just set the WAIT_FOR_CHILDREN_TIMEOUT to like a day,
yeah.

> >> +    def _parse_fork_request(self, conn, client_addr, request):
> >> +        if request.startswith('fork-env '):
> >> +            while not request.endswith('end\n'):
> >> +                request += osutils.read_bytes_from_socket(conn)
> > 
> > Hang on.  You don't have any guarantee here that
> > read_bytes_from_socket won't return 'end\n' and then the start of the
> > next request do you?  Hmm, I guess you do really, because 'end\n' is
> > always the last thing the client sends.  But that does seem a little
> > delicate -- it should at least be documented.
> 
> Every connection gets a single request, and doesn't have much more to
> say. If we want to say more, then we can, but that will just be an
> evolution of the interface.
> 
> I changed the docstring from:
> 
> Clients connect to the socket and make a simple request, which then ...
> 
> to be
> 
> Clients connect to the socket and make a single request, which then ...

Yeah, this makes sense.  It's just something we'll need to be aware of
if we change this detail.
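
(For the record, my reading of the framing is that a complete request
looks like the sketch below: the command on the first line, then the
environment, then the 'end' terminator.  The KEY=value form of the env
lines is a guess on my part -- parse_env isn't in this hunk -- and
BZR_EMAIL is purely illustrative:)

    import socket

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect('/var/run/launchpad_forking_service.sock')
    sock.sendall('fork-env lp-serve --inet 12\n'
                 'BZR_EMAIL=someone@example.com\n'
                 'end\n')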

> >> +                request = request.replace('\r\n', '\n')
> >> +            command, env = request[9:].split('\n', 1)
> >> +        else:
> >> +            command = request[5:].strip()
> >> +            env = 'end\n' # No env set
> >> +        try:
> >> +            command_argv = self.command_to_argv(command)
> >> +            env = self.parse_env(env)
> >> +        except Exception, e:
> >> +            # TODO: Log the traceback?
> >> +            self.log(client_addr, 'command or env parsing failed: %r'
> >> +                                  % (str(e),))
> >> +            conn.sendall('FAILURE\ncommand or env parsing failed: %r'
> >> +                         % (str(e),))
> >> +        else:
> >> +            return command_argv, env
> >> +        return None, None
> > 
> > I think it would be clearer if the "return None, None" was in the
> > except clause (which *is* equivalent, I think?).
> 
> So it is equivalent, but because falling off the end of a function ==
> "return None" I prefer to be explicit that all return paths return a
> tuple. I can put a redundant "return None, None" if you think that
> would be more obvious.

OK.  I don't really care :-)

> >> +    def serve_one_connection(self, conn, client_addr):
> >> +        request = ''
> >> +        while '\n' not in request:
> >> +            request += osutils.read_bytes_from_socket(conn)
> >> +        # telnet likes to use '\r\n' rather than '\n', and it is nice to have
> >> +        # an easy way to debug.
> >> +        request = request.replace('\r\n', '\n')
> > 
> > This must be a relic of listening to an INET port?  Can it be deleted now?
> 
> It can, though being generous in what you accept does make it easier
> to play with the system. You seem to feel strongly, so I've removed it
> and the associated tests.

I think conditionality where it's not needed makes systems hard to
understand.  This particular thing wasn't really worth the fuss I made
over it, perhaps :-)

> >>
> >> +class _WaitForExit(process.ProcessReader):
> > 
> > I don't really understand why this inherits from ProcessReader.  But
> > I'm 1600 lines into this review now, so maybe the problem is at my end
> > :-)
> 
> Because ProcessReader already calls back to its accompanying "Process"
> object to let it know that data has been read, the handle has been
> closed, etc.
> 
> It also allows me to implement "dataReceived" instead of doRead(), IIRC.
> 
> 
> > 
> >> +    """Wait on a socket for the exit status."""
> >> +
> >> +    def __init__(self, reactor, proc, sock):
> >> +        super(_WaitForExit, self).__init__(reactor, proc, 'exit',
> >> +                                           sock.fileno())
> >> +        self._sock = sock
> >> +        self.connected = 1
> >> +
> >> +    def close(self):
> >> +        self._sock.close()
> >> +
> >> +    def dataReceived(self, data):
> >> +        # TODO: how do we handle getting only *some* of the content?, Maybe we
> >> +        #       need to read more bytes first...
> > 
> > Yeah.  You want lineReceiver's logic really.
> 
> Which is... where?

In twisted.protocols.basic.LineOnlyReceiver.dataReceived, I think.
Sadly, I don't really know an easy way to reuse that.  Maybe multiple
inheritance would work, or maybe have an instance of (a subclass of)
LineOnlyReceiver on self and feed bytes from your dataReceived to
it.  Both seem a bit over-engineered.
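
The hand-rolled buffering is only a few lines anyway -- something like
this sketch, where lineReceived is a hypothetical hook standing in for
whatever dataReceived does today with a complete status line:

    def dataReceived(self, data):
        # Accumulate until we have at least one whole line; a partial
        # read just waits for the next chunk.
        self._buffer = getattr(self, '_buffer', '') + data
        if '\n' not in self._buffer:
            return
        line, self._buffer = self._buffer.split('\n', 1)
        self.lineReceived(line)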

> >> +        # This is the only thing we do differently from the standard
> >> +        # ProcessReader. When we get data on this socket, we need to treat it
> >> +        # as a return code, or a failure.
> >> +        if not data.startswith('exited'):
> >> +            # Bad data, we want to signal that we are closing the connection
> >> +            # TODO: How?
> >> +            # self.proc.?
> > 
> > childConnectionLost ?
> 
> I'll call it, but I really don't know how to test it, or how this sort
> of thing would happen in reality, etc.

Yeah.  I don't know either.

> >> +    def _sendMessageToService(self, message):
> >> +        """Send a message to the Forking service and get the response"""
> >> +        path = config.codehosting.forking_daemon_socket
> >> +        client_sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
> >> +        log.msg('Connecting to Forking Service @ socket: %s for %r'
> >> +                % (path, message))
> >> +        try:
> >> +            client_sock.connect(path)
> >> +            client_sock.sendall(message)
> >> +            # We define the requests to be no bigger than 1kB. (For now)
> >> +            response = client_sock.recv(1024)
> >> +        except socket.error, e:
> >> +            # TODO: What exceptions should be raised?
> >> +            #       Raising the raw exception seems to kill the twisted reactor
> >> +            #       Note that if the connection is refused, we *could* just
> >> +            #       fall back on a regular 'spawnProcess' call.
> > 
> > What does the client see if the forking service is not running?
> 
> Unable to "open channel 0" or some other such ssh failure. ("exec request on channel 0" ?)

Hmm.  Is it possible to output some messages on the stderr of the ssh
channel from here?  I can see that this might be difficult.

> >> +    # Implemented because of childConnectionLost
> >> +    # Adapted from twisted.internet.Process
> >> +    # Note: Process.maybeCallProcessEnded() tries to reapProcess() at this
> >> +    #       point, but the daemon will be doing the reaping for us (we can't
> >> +    #       because the process isn't a direct child.)
> >> +    def maybeCallProcessEnded(self):
> >> +        if self.pipes:
> >> +            # Not done if we still have open pipes
> >> +            return
> >> +        if not self.lostProcess:
> >> +            return
> >> +        process.BaseProcess.maybeCallProcessEnded(self)
> >> +
> >> +    # pauseProducing, present in process.py, not a IProcessTransport interface
> > 
> > It's a shame there's so much copy & paste here.  Also, the use of
> > blocking code in a Twisted application always makes me uneasy, but I
> > don't know how serious a problem it is here.
> 
> I could do a little bit less by inheriting from Process and not
> calling Process.__init__ (since that auto-registers with reapProcess
> which doesn't work here.)
> 
> What blocking code concerns you? At some point, you *do* need to write
> to the socket. And these paths are generally called once socket.poll()
> tells us it is ok.
> 
> I followed pretty closely to what Process et al are doing. I'm
> certainly open to suggestions, as the Twisted code is still a bit
> beyond my understanding.

I think I need to think a little bit more about the Twisted code I'm
afraid (this says more about me than your code!).
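
(The blocking bit I had in mind is _sendMessageToService: connect,
sendall and recv all run in the reactor thread.  If it ever turns out
to matter, the cheap fix is probably just to push it onto the thread
pool, roughly like this -- _gotResponseFromService being a hypothetical
callback:)

    from twisted.internet import threads

    d = threads.deferToThread(self._sendMessageToService, message)
    d.addCallback(self._gotResponseFromService)
    d.addErrback(log.err)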

Aside from that, it all looks superb now.

Sorry for still not finishing the review, I've just run out of time :(

Cheers,
mwh
-- 
https://code.launchpad.net/~jameinel/launchpad/lp-service/+merge/37531
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~jameinel/launchpad/lp-service into lp:launchpad.


