Collaborator collects a variety of raw metrics automatically.

Metrics: Definitions

Lines of Code

The most obvious raw metric is "number of lines of source code". This is "lines" in a text-file context. Often this is abbreviated "LOC".

Collaborator does not distinguish between different kinds of lines. For example, it does not separately track source lines versus comment lines versus whitespace lines.

For code review metrics, you usually want to use general lines of code rather than break them down by type. Code comments are often just as much a part of the review as the code itself: reviewers check them for consistency and make sure that other developers will be able to understand what is happening and why.

The lines of code metrics (LOC metrics) are calculated only for source code and other text-based files. For other types of review materials (Word, Excel, PDF or Image files) the metrics are not calculated and return 0.

The LOC metrics displayed on the Review Screen include added lines, changed lines and removed lines. If the Overlay view is selected (default), the LOC metrics are calculated by comparing the latest uploaded revision of a file against the baseline revision of that file. Here, the baseline revision is the revision as it was when the review was created. If the Separate view is selected, the LOC metrics are calculated by comparing each individual file revision against its previous revision.

For reviews created by pull requests, file changes made by merge commits (if any) are displayed only in Separate view and are not taken into account when calculating overall LOC metrics. In addition, file changes made by merge commits do not affect the overall rework count of a file.

The Customizable Review Reports may provide you with more line-related metrics. In addition to added, changed and removed lines, they can display the total number of lines in all uploaded files (LOC Uploaded), the number of reworked lines, that is, the sum of added, changed and removed lines (LOC Reworked), and the difference between the number of added and removed lines (LOC Delta).
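The two derived line metrics can be sketched in Python as follows. This is only an illustration of the arithmetic described above: the `LineCounts` class, the function names, and the sample numbers are assumptions made for this example, not part of Collaborator.

```python
from dataclasses import dataclass


@dataclass
class LineCounts:
    """Hypothetical container for the line counts of one review."""
    added: int
    changed: int
    removed: int


def loc_reworked(counts: LineCounts) -> int:
    # LOC Reworked: sum of added, changed and removed lines.
    return counts.added + counts.changed + counts.removed


def loc_delta(counts: LineCounts) -> int:
    # LOC Delta: added lines minus removed lines (net growth of the code).
    return counts.added - counts.removed


counts = LineCounts(added=40, changed=15, removed=10)
print(loc_reworked(counts))  # 65
print(loc_delta(counts))     # 30
```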

Keep in mind that Ignore Whitespace, Ignore Sequence Number and other Diff Viewer settings do not affect how line metrics are calculated. They only affect how line differences are displayed.

Time in Review

How much time (person-hours) did each person spend doing the review? Collaborator computes this automatically. This raw metric is useful in several other contexts, usually when compared to the amount of file content reviewed.

Developers (rightly) hate using stopwatches to track their activity, but how can Collaborator -- a web server -- automatically compute this number properly?

Our technique for accurately computing person-hours came from an empirical study we did at a mid-sized customer site. The goal was to create a heuristic for predicting on-task person-hours from detailed web logs alone.

We gave all review authors and reviewers physical stopwatches and had them carefully time their use of the tool. They started the stopwatch when they began a review and paused it whenever they took a break for any reason: email, bathroom, instant messenger. The times were recorded with each review and brought together in a spreadsheet.

At the same time, we collected detailed logs of web server activity. Who accessed which pages, when, and so forth. Log data could easily be correlated with reviews and people so we could line up this amalgamation of server data with the empirical stopwatch times.

Then we sat down to see if we could make a heuristic. We determined two interesting things:

First, a formula did appear. It goes along these lines: If a person hits a web page, then 7 seconds later hits another page, it is clear that the person was on-task on the review for the whole 7 seconds. If a person hits a web page, then 4 hours later hits another page, it is clear that the person was not doing the review for the vast majority of that time. By playing with various threshold values for timings, we created a formula that worked very well -- error on the order of 15%.

Technically, the Collaborator server queries for review activity every 15 seconds and updates the time counter if any of these requests succeeded during the specified time interval (60 seconds by default). If the user performs no actions for 5 minutes, time tracking stops.

Second, it turns out that humans are awful at collecting timing metrics. The stopwatch numbers were all over the map. People constantly forgot to start them and to stop them. Then they would make up numbers that "felt right," but it was clear upon close inspection that their guesses were wrong. Some people intentionally submitted different numbers, thinking this would make them look good (that is, "Look how fast I am at reviewing!").

So the bottom line is: Our automated technique is not only accurate, it is more accurate than actually having reviewers use stopwatches. The intrinsic error of the prediction heuristic is less than the error humans introduce when asked to do this themselves.
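As a rough illustration of the threshold idea described above, the following Python sketch sums the gaps between consecutive page hits and discards any gap longer than a cutoff. The function name `on_task_seconds`, the sample timestamps, and the 5-minute cutoff are assumptions made for this example (the cutoff is borrowed from the inactivity limit mentioned above). The thresholds of the actual formula are not given here, and this is not Collaborator's implementation.

```python
from datetime import datetime, timedelta


def on_task_seconds(hits, idle_threshold=timedelta(minutes=5)):
    """Estimate on-task time from page-hit timestamps.

    Short gaps between consecutive hits count as time spent on the review;
    gaps longer than idle_threshold (an assumed value) are treated as time
    away from the review and ignored.
    """
    ordered = sorted(hits)
    total = timedelta()
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap <= idle_threshold:
            total += gap
    return total.total_seconds()


hits = [
    datetime(2024, 1, 1, 9, 0, 0),
    datetime(2024, 1, 1, 9, 0, 7),    # 7 seconds later: counted
    datetime(2024, 1, 1, 13, 0, 7),   # 4 hours later: ignored
    datetime(2024, 1, 1, 13, 0, 37),  # 30 seconds later: counted
]
print(on_task_seconds(hits))  # 37.0
```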

Total Person-Time

The total of all recorded time that all users spent looking at the review (includes time spent in the annotation, planning, inspection and rework phases). Total Person-Time is an aggregate value for all users taking part in a review, while Time in Review is counted for each separate user.

Reviewer Time and Author Time are subsets of Total Person-Time, limited to the time that was spent in the reviewer and author roles, respectively. Reviewer role means any participant with a role that can move the review to the next phase or complete it.
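The relationship between these values can be sketched in Python. The record layout (user, role, seconds), the user names, and the role labels below are hypothetical and chosen only to show the aggregation. The "observer" entry stands for any participant who is neither author nor reviewer, which is why the role subsets need not add up to the total.

```python
from collections import defaultdict

# Hypothetical time records for one review: (user, role, seconds).
entries = [
    ("alice", "author", 1200),
    ("bob", "reviewer", 1800),
    ("carol", "reviewer", 600),
    ("alice", "author", 300),
    ("dave", "observer", 120),
]


def summarize(entries):
    """Return (time per user, time per role, total person-time) in seconds."""
    per_user = defaultdict(int)
    per_role = defaultdict(int)
    for user, role, seconds in entries:
        per_user[user] += seconds   # Time in Review, per user
        per_role[role] += seconds   # Reviewer Time, Author Time, ...
    total = sum(per_user.values())  # Total Person-Time
    return dict(per_user), dict(per_role), total


per_user, per_role, total = summarize(entries)
print(per_user["alice"])     # 1500
print(per_role["reviewer"])  # 2400
print(per_role["author"])    # 1500
print(total)                 # 4020
```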

Defect Count

How many defects did we find during this review? Because reviewers explicitly create defects during reviews, it is easy for the server to maintain a count of how many defects were found.

Furthermore, the system administrator can establish any number of custom fields for each defect, usually in the form of a drop-down list. This can be used to subdivide defects by severity, type, phase-injected, and so on.

File Count

How many files did we review? Usually the LOC metric is a better measure of "how much did we review," but sometimes it helps to have both LOC and the number of files.

For example, a review of 100 files, each with a one-line change, is quite different from a review of one file with 100 lines changed. In the former case, this might be a relatively simple refactoring; with tool support, this might require only a brief scan by a human. In the latter case, several methods might have been added or rewritten; this would require much more attention from a reviewer.

Wall-Clock Time, Review Wall-Clock Duration

How much time passed between the creation of the review and its completion (or the present moment, if the review is still in progress)? This is a useful metric if you want to make sure all reviews are completed in a timely manner.
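A minimal Python sketch of this definition follows. The function name and its parameters are assumptions made for illustration; the `now` parameter exists only so that the in-progress case can be shown with a fixed time.

```python
from datetime import datetime, timezone


def wall_clock_duration(created, completed=None, now=None):
    """Elapsed time from review creation to completion.

    If the review is still in progress (completed is None), the duration
    is measured up to the current moment instead.
    """
    if completed is not None:
        end = completed
    elif now is not None:
        end = now
    else:
        end = datetime.now(timezone.utc)
    return end - created


created = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
completed = datetime(2024, 1, 3, 15, 30, tzinfo=timezone.utc)
print(wall_clock_duration(created, completed))  # 2 days, 6:30:00

# Review still in progress, measured at a fixed "current" time.
still_open_at = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
print(wall_clock_duration(created, now=still_open_at))  # 1 day, 0:00:00
```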

Metrics: Analysis

It is fine to collect metrics, but what do they tell us? It is tempting to apply them in many different contexts, but when are metrics telling us something and when are we reading too much into the numbers?

Defect Density

Defect Density is computed by: ( number of defects ) / ( lines of code / 1000 ).

This is the number of defects found, normalized to a unit amount of code. 1000 lines of code, or "kLOC" is often used as a standard base measure. The higher the defect density, the more defects you are uncovering.
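For example, the normalization to 1000 lines can be sketched in Python as below. The function name and the two sample files are made up for this illustration.

```python
def defect_density(defects: int, lines_of_code: int) -> float:
    """Defects per 1000 lines of code (defects/kLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects * 1000 / lines_of_code


# A small file with few defects can still be denser than a large one.
print(defect_density(3, 200))   # 15.0
print(defect_density(4, 2000))  # 2.0
```

In this made-up case the 200-line file has fewer defects in absolute terms but a much higher density, which is the kind of difference the raw defect count alone would hide.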

It is impossible to give an "expected" value for defect density. Mature, stable code might have defect densities as low as 5 defects/kLOC; new code written by junior developers may have 100-200.

What can defect density tell us?

Let's run an experiment. We take some reviewers and have them inspect many different source files. Source files vary in size from 50 lines to 2000 lines. Reviewers inspect about 200 lines at a time so as not to get tired. We will record the number of defects found for each file.

What would we expect to find? First, longer files ought to have more defects than shorter ones, simply because there is more code. More code means more that could go wrong. Second, some files should contain more defects than others because they are "risky" -- maybe because they are complex, or because their routines are difficult to unit-test, or because their routines are reused by most of the system and therefore must be very accurately specified and implemented.

If we measure defect density here, we handle the first effect by normalizing "number of defects" to the amount of code under review, so now we can sensibly compare small and large files. So the remaining variation in defect density might have a lot to do with the file's "risk" in the system. This is, in fact, the effect we find from experiments in the field.

So defect density can, among other things, determine which files are risky, which in turn might help you plan how much code review, design work, testing, and time to allocate when modifying one of those files.

Now let's run another experiment. We will take a chunk of code with 5 known algorithm bugs and give it to various reviewers. We will see how many of the defects each reviewer can find in 20 minutes. The more defects a reviewer finds, the more effective that reviewer was at finding the defects. This is a simple way to see how effective each reviewer is at reviewing that kind of code.

Of course in real life the nature of the code and the amount of code under review varies greatly, so you cannot just look at the number of defects found in each review -- you naturally expect more defects from a 200-line change than from a 2-line change. Defect density provides this normalization so you can compare reviewers across many reviews.

If you are comparing defect density across many reviews done by a single person, you are measuring the relative "risk" of various files and modules.

Inspection Rate

Inspection Rate is computed by: ( Lines of Code Reviewed ) / ( Total Person-Time ).

This is a measure of how fast we review code. A sensible rate for complex code might be 100 LOC/hour; generally good reviews will be in the range of 200-500 LOC/hour. Anything 800 LOC/hour or higher indicates the reviewer has not really looked at the code -- we have found by experiment that this is too fast to actually read and critique source code.

Some managers insist that their developers try to increase their inspection rate. After all, they reason, a higher rate means review efficiency is improving. This is a fallacy. In fact, the slower the review is, the better job the reviewers are doing. Careful work means taking your time.

Instead, use inspection rate to help you predict the amount of time needed to complete some code change. If you know this is roughly a 1000-line change and your typical inspection rate is 200 LOC/hour, you can budget 5 hours for the code review step in your development.

If anything, a manager might insist on a slower inspection rate, especially on a stable branch, core module, or close to product release when everyone wants to be more careful about what changes in the code.

The Inspection Rate (Changed) metric counts only lines of code that were changed (added, removed, or modified).

The Inspection Rate (Uploaded) metric counts only lines of code that were uploaded in the review.
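The inspection rate formula and the budgeting arithmetic from this section can be sketched in Python as follows. The function names are hypothetical, and the 800 LOC/hour limit is simply the figure quoted above used as a default. Pass either the changed or the uploaded line count, depending on which variant of the metric you want.

```python
def inspection_rate(loc_reviewed: int, person_hours: float) -> float:
    """Lines of code reviewed per person-hour."""
    if person_hours <= 0:
        raise ValueError("person_hours must be positive")
    return loc_reviewed / person_hours


def looks_too_fast(rate: float, limit: float = 800) -> bool:
    # At or above the limit, the code was probably not really read.
    return rate >= limit


def review_hours_budget(change_loc: int, typical_rate: float) -> float:
    """Hours to budget for reviewing a change at the team's typical rate."""
    return change_loc / typical_rate


print(inspection_rate(450, 1.5))                  # 300.0
print(looks_too_fast(inspection_rate(900, 1.0)))  # True
print(review_hours_budget(1000, 200))             # 5.0
```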

Defect Rate

Defect Rate is computed by: ( Number of defects ) / ( Total Person-Time ).

This is the speed at which reviewers uncover defects in code. Typical values range between 5 and 20 defects/hour, possibly less for mature code, but not usually much greater.
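A short Python sketch of this calculation, assuming that person-time is recorded in seconds and reported per hour (the function name and the sample numbers are illustrative only):

```python
def defect_rate(defects: int, person_seconds: float) -> float:
    """Defects found per person-hour, given person-time in seconds."""
    if person_seconds <= 0:
        raise ValueError("person_seconds must be positive")
    return defects * 3600 / person_seconds


# 12 defects found in 1.5 person-hours (5400 seconds) of review.
print(defect_rate(12, 5400))  # 8.0
```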

The same caveats about encouraging faster or slower inspection rates apply also to defect rates. Read the Inspection Rate section for details.

Metrics Applied

If we have learned one thing about metrics and code review it is: Every group is different, but most groups are self-consistent. This means that metrics and trends that apply to one group do not necessarily apply to another, but within a single group metrics are usually fairly consistent.

This between-group difference can be attributed to the myriad of variables that enter into software development: the background, experience, and domain knowledge of the authors and reviewers, programming languages and libraries, development patterns at different stages of a product's life-cycle, project management techniques, local culture, the number of developers on the team, whether the team members are physically together or separate, and so forth.
