For the past few years I have been an OpenStack Neutron contributor, core reviewer and now even the project’s PTL. One of my responsibilities in the community is taking care of our CI system for Neutron. As part of this job I have to constantly check how various CI jobs are working and if the reasons for their failure are related to the patch or the CI itself.
This is in general a very time consuming task as Neutron a wide variety of jobs in both check and gate queues in Zuul. As I write this I’m writing this post, there are 22 voting jobs in check queue and 19 jobs in gate queue.
The reason why we have so many jobs is that Neutron is very pluggable, so there
are many different L2 backends (like
linuxbridge and soon
will also have
ovn as in-tree driver which has to be tested as well),
different firewall drivers (e.g. for openvswitch backend there is
openvswitch driver). Next there are also different
L3 agent topologies (
DVR-HA). Because all of
that we ended up with many scenario jobs running tempest or
tests on different topologies of Neutron.
Part of my regular daily tasks is preparation of Neutron CI
meetings and I then always
check our Grafana
to check the overall stability of each job in both check and gate queues.
For a long time I have had a feeling that this does not give us a full picture
of the overall state of the Neutron CI. It was like that because even if
all of the voting jobs had failure rates around (or below) 10%, I saw many
-1 votes from Zuul and many
recheck comments in various patches. That is
pretty obvious because if we have more than 20 voting jobs, even if each of them
fails around 10% of the time for random reasons, some jobs will fail almost
every run of Zuul.
Idea of new metric
So I thought that having a metric which will show just how many times a patch
needed to be
rechecked before it was merged would be useful to get the
status of CI.
The number of
rechecks is in fact something that directly impacts
every developer who is doing patches to the project so it is also useful to
check how annoying it is for contributors to get their patches merged.
And that might also be an important reason why we don’t have many new
such people can be disappointed and discouraged after sending their first patch
will not continue contributing to the project later.
Script to get data from Gerrit
Unfortunately I didn’t find such a metric directly in Graphite. But good news for me was that Gerrit gives us query API which can be used to get e.g. all patches for a certain time period, and then the comments on these patches can be parsed to check how many times it got “Build failed” comment from “Zuul” user.
Based on that and on Assaf Muller’s script I wrote a simple script which is available in My github repo This is simple script which takes from Gerrit data about patches merged in last X days in a given project (e.g. “openstack/neutron”). For each patch it then checks the last revision number and checks how many times on last revision Zuul commented “Build failed”.
This script is checking only last patches because if build failed on this last revision it clearly means that this wasn’t a problem with the patch itself but with the CI job. I also didn’t want to check how many times someone wrote “recheck” comment as sometimes people are doing recheck even if the build was fine, for example to check if their change fixes some intermittent problem in CI.
Also this script calculates average number of failed builds per week.
As I have the script I ran it on patches from the last 500 days. First I of course did it for the Neutron project. To get this data I ran the script with command like:
python3 rechecks.py openstack/neutron --newer-than 500 > neutron_500d.csv
Script normally prints comma separated values on stdout so I saved it in file.
Results are on the image below:
Those results aren’t good. In fact in Neutron we really sucks with our CI as in the best weeks developers needs to recheck about 1 or 2 times before patch is merged.
I was curious how same metric will look like for smaller project with much less CI jobs in check and gate queues in Zuul. I checked same data for neutron-lib project. Results are below:
Here results are much better. In most weeks, patches were merged at first CI run.
My next step was to compare this data with other “big” OpenStack projects. I checked same data for Nova, Glance and than Cinder. Results are on charts below.
It seems that this isn’t only problem of Neutron. Other projects with many CI jobs in the queues are also failing quite often. But indeed Neutron seems to be one of the “bad leaders” in that metric. One of the reasons why it is like that are probably slow nodes on which failed job is running. But there are probably also different reasons of such bad CI state and I think that our next steps should be to think about what are the main reasons of it and to try to find some ways to improve it in near future.
At the end I want also to thanks Nate Johnston for reviewing draft of this post :)