Extreme Perl - Chapter 14: Refactoring

Extreme Perl: Chapter 14: Refactoring An Evolving Book
about Extreme Programming
with Perl

In refactoring, there is a premium on knowing when to quit.

-- Kent Beck

Refactoring is an iterative design strategy. Each refactoring improves an implementation in a small way without altering its observable behavior. You refactor to make the implementation easier to repair and to extend. The refactorings themselves add no business value.

Programming in XP is a four part process: listening, testing, coding, and designing. Design comes after you have a test for what the customer wants, and you have a simple implementation already coded. Of course, your simple implementation is based on your definition of simple. Your first cut may even be optimally coded in which the design is complete for the task you are working on. Your next task may require you to copy some similar code from another module, because there's no API to get at it. Copying code is usually the simplest way to see if that code actually works, that is, passes your tests. Copying code is not good design. After you get the copy working, you need to refactor the original module and your new code to use a common API. And, that's emergent design.

The designing emerges from the problem, completely naturally, like you organize a kitchen. The figure out where to put things depending on how frequently they are used. If you only use that fruitcake pan once a year, you probably tuck it away in the back of a hard to reach cabinet. Similarly, if a routine is used only in one place, you keep it private within a module. The first time it is used elsewhere, you may copy it. If you find another use for it, you refactor all three uses so that they call a single copy of the routine. In XP, we call this the Rule of Three, [1] and basically it says you only know some code is reusable if you copy it two times. Refactoring is the process by which you make the code reusable.

This chapter demonstrates several refactorings in the context of a single example, Mail::POP3Client. We discuss how and when to refactor and why refactoring is a good design strategy. We show how refactoring makes it easier to fix a couple of defects found by the unit test in Unit Testing.

Design on Demand

XP has no explicit design phase. We start coding right away. The design evolves through expansions (adding or exposing function) and contractions (eliminating redundant or unused code).

Refactoring occurs in both phases. During contraction, we clean up excesses of the past, usually caused by copy-and-paste. During expansion, we modularize function to make it accessible for the task at hand. Expansion solves today's problems. Contraction helps mitigate future problems.

`Mail::POP3Client`

The examples in this chapter use the foundation we laid in Unit Testing. Mail::POP3Client[2] is a CPAN class that implements the client side of the POP3 mail retrieval protocol. Since refactoring does not change observable behavior, you don't need to understand how POP3 works to follow the examples in this chapter.

I chose Mail::POP3Client, because it is well-written, mature, and solves a real problem. Some of the code is complex. The meat of the example is highlighted in the changed version, so skip ahead if code complexity interferes with comprehension. I have included entire subroutines for completeness, but the differences are just a few lines of code.

Remove Unused Code

We'll warm up with a very simple refactoring. Here's the Host accessor from Mail::POP3Client:

sub Host {
    my $me = shift;
    my $host = shift or return $me->{HOST};
#   $me->{INTERNET_ADDR} = inet_aton( $host ) or
#   $me->Message( "Could not inet_aton: $host, $!") and return;
    $me->{HOST} = $host;
}

This first refactoring removes the extraneous comment. Unused code and comments distract the reader from the salient bits. Should we need to resurrect the dead code, it remains in a file's history available from the source repository. For now, we don't need it, and the refactored version is shorter and more readable:

sub Host {
    my $me = shift;
    my $host = shift or return $me->{HOST};
    $me->{HOST} = $host;
}

After each refactoring, even as simple as this one, we run the unit test. In this case, the test will tell us if we deleted too much. Before we check in our changes, we run the entire unit test suite to make sure we didn't break anything.

How often to check in is a matter of preference. The more often you check in, the easier it is to back out an individual change. The cost of checking in is a function of the size of your test suite. Some people like to check in after every refactoring. Others will wait till the task is complete. I wouldn't check in now unless this change was unrelated to the work I'm doing. For example, I might have deleted the dead code while trying to figure out how to use Mail::POP3Client in another class.

Refactor Then Fix

As discussed in Distinguish Error Cases Uniquely, the Message method has a defect that doesn't let us clear its value. Its implementation is identical to Host and six other accessors:

sub Message {
    my $me = shift;
    my $msg = shift or return $me->{MESG};
    $me->{MESG} = $msg;
}

The method tests the wrong condition: $msg being false. It should test to see if $msg was passed at all: @_ greater than 1. For most of the other accessors, this isn't a real problem. However, the Debug method only allows you to turn on debugging output, since it only saves true values:

sub Debug
{
  my $me = shift;
  my $debug = shift or return $me->{DEBUG};
  $me->{DEBUG} = $debug;
}

We could fix all the accessors with the global replacement, but the repetition is noisy and distracting. It's better to contract the code to improve its design while fixing the defects.

We refactor first, so we can test the refactoring with the existing unit test. Once we determine the refactoring hasn't changed the observable behavior, we fix the fault. Usually, refactoring first cuts down on the work, because we only need to fix the fault in one place.

The first step is to split out the existing logic into a separate subroutine, and update all accessors to use the new routine:

sub Debug {
    return _access('DEBUG', @_);
}
sub Host {
    return _access('HOST', @_);
}
sub LocalAddr {
    return _access('LOCALADDR', @_);
}
sub Message {
    return _access('MESG', @_);
}
sub Pass {
    return _access('PASSWORD', @_);
}
sub Port {
    return _access('PORT', @_);
}
sub State {
    return _access('STATE', @_);
}
sub User {
    return _access('USER', @_);
}
sub _access {
    my $field = shift;
    my $me = shift;
    my $value = shift or return $me->{$field};
    $me->{$field} = $value;
}

Perl allows us to pass on the actual parameters of a routine simply. Since the $value parameter is optional, we put the new parameter $field before the rest of the parameters. This simple trick reduces code complexity. By the way, it's not something I'd do if _access were public. If _access needed to be overriddable by a subclass, it would have been called as a method ($me->access). However, the XP rule "you aren't going to need it" applies here. We'll expose _access to subclasses when there's an explicit need to.

Consistent Names Ease Refactoring

The refactored accessors are repetitive. In Perl it's common to generate repetitive code, and we'll do so here.[3] First, we need to rename MESG to MESSAGE to simplify generation. There are only two references in Mail::POP3Client (one use not shown), so this refactoring is easy:

sub Message {
    return _access('MESSAGE', @_);
}

Next, we refactor the Pass accessor. As noted in Consistent APIs Ease Testing, the inconsistent naming (Pass method and PASSWORD configuration parameter) made testing a bit more repetitive than it needed to be. We don't change the parameter and instance field from PASSWORD to PASS, because abbreviations are less desirable. Rather, we refactor Pass as follows:

sub Pass {
    return shift->Password(@_);
}
sub Password {
    return _access('PASSWORD', @_);
}

As a part of this refactoring, we need document that Pass is deprecated (will eventually go away). Mail::POP3Client is available on CPAN, so we can't fix all the code that uses it. Deprecation is an important refactoring technique for very public APIs like this one.

Generate Repetitive Code

With the refactored names, we can now generate all eight accessors with this simple loop:

foreach my $a (qw(
    Debug
    Host
    LocalAddr
    Message
    Password
    Port
    State
    User
)) {
    eval(qq{
        sub $a {
            return _access('@{[uc($a)]}', \@_);
        }
        1;
    }) or die($@);
}

We check eval's result to be sure the generator works correctly. The syntax to insert the field name is a bit funky, but I prefer it to using a temporary variable. The @{[any-expression]} idiom allows arbitrary Perl expressions to be interpolated in double-quoted strings.

Fix Once and Only Once

With the refactoring complete, we can now fix all eight accessors with one line of code. _access should not check the value of its third argument, but should instead use the argument count:

sub _access {
    my $field = shift;
    my $me = shift;
    @_ or return $me->{$field};
    $me->{$field} = shift;
}

This is not a refactoring. It changes the observable behavior. Our unit test confirms this: three more test cases pass.

Stylin'

Coding style is important to consider when refactoring. In this chapter, I chose to maintain the style of the source.[4] To contrast, here's what _access would look like in bOP style:

sub _access {
    my($field, $self, $value) = @_;
    return @_ >= 3 ? $self->{$field} = $value : $self->{$field};
}

The differences between the two styles are subtle, but the bOP version would stand out in the code and be an unnecessary distraction. No one cares if you use $me, $self, or $i[5] as long as you are consistent.

Style updates are important, too. If for some reason, Mail::POP3Client were to be incorporated into bOP, we would consider updating the implementation to match bOP style. Refactoring coding style is fine as long as you refactor and test independently from other changes.

Tactics Versus Strategy

When we begin a task, tactics are important. We want to add business value as quickly as possible. Doing the simplest thing that could possibly work is a tactic to reach our goal.

Copy-and-paste is the weapon of choice when task completion is our only objective. Generalizations, such as the refactorings shown thus far, are not easy to come up with when you're under pressure to get the job done. Besides, you don't know if the code you copy solves the problem until the implementation is finished. It's much better to copy-and-paste to test an idea than to invent, to implement, and to use the wrong abstraction.

As a design strategy, copy-and-paste poses problems. The code is more difficult to comprehend, because it's difficult to see subtle differences. Faults fixed in one place do not propagate themselves automatically. For example, there's an alternative fix to the problem already embedded in two other accessors, Count and Size:

sub Count {
    my $me = shift;
    my $c = shift;
    if (defined $c and length($c) > 0) {
        $me->{COUNT} = $c;
    } else {
        return $me->{COUNT};
    }
}

sub Size {
    my $me = shift;
    my $c = shift;
    if (defined $c and length($c) > 0) {
        $me->{SIZE} = $c;
    } else {
        return $me->{SIZE};
    }
}

These accessors behave differently from the other eight that we refactored and fixed above. Count and Size need to be resettable to zero (mailboxes can be empty), which is why the accessors have an alternate implementation.

The thought and debugging that went into fixing Count and Size could have also applied to the other accessors. Since the code wasn't refactored at the time of the fix, it was probably easier to leave the other accessors alone. And, under the principle of "if it ain't broke, don't fix it" any change like this is not justified.

XP legitimatizes fixing non-broke code. It's something programmers do anyway, so XP gives us some structure and guidelines to do it safely. We can refactor Size and Count to use _access without fear.[6] The unit test covers the empty mailbox case. If we didn't have a test for this case, we could add one. Again, XP saves us. Since the programmers are the testers, we're free to modify the test to suit our needs.

Refactor With a Partner

Pair programming supports refactoring, too. Two people are better than one at assessing the need for, the side-effects of, and the difficulty of a change. The tradeoffs between tactics versus strategy are hard, and discussing them with a partner is both effective and natural. Switching partners often brings new perspectives to old code, too.

Sometimes I look at code and don't know where to begin refactoring. The complexity overwhelms my ability to identify commonality. For example, here's some code which needs refactoring:

sub List {
    my $me = shift;
    my $num = shift || '';
    my $CMD = shift || 'LIST';
    $CMD=~ tr/a-z/A-Z/;
    $me->Alive() or return;
    my @retarray = ();
    my $ret = '';
    $me->_checkstate('TRANSACTION', $CMD) or return;
    $me->_sockprint($CMD, $num ? " $num" : '', $me->EOL());
    my $line = $me->_sockread();
    unless (defined $line) {
        $me->Message("Socket read failed for LIST");
        return;
    }
    $line =~ /^\+OK/ or $me->Message("$line") and return;
    if ($num) {
        $line =~ s/^\+OK\s*//;
        return $line;
    }
    while (defined($line = $me->_sockread())) {
        $line =~ /^\.\s*$/ and last;
        $ret .= $line;
        chomp $line;
        push(@retarray, $line);
    }
    if ($ret) {
        return wantarray ? @retarray : $ret;
    }
}

sub ListArray {
    my $me = shift;
    my $num = shift || '';
    my $CMD = shift || 'LIST';
    $CMD=~ tr/a-z/A-Z/;
    $me->Alive() or return;
    my @retarray = ();
    my $ret = '';
    $me->_checkstate('TRANSACTION', $CMD) or return;
    $me->_sockprint($CMD, $num ? " $num" : '', $me->EOL());
    my $line = $me->_sockread();
    unless (defined $line) {
        $me->Message("Socket read failed for LIST");
        return;
    }
    $line =~ /^\+OK/ or $me->Message("$line") and return;
    if ($num) {
        $line =~ s/^\+OK\s*//;
        return $line;
    }
    while (defined($line = $me->_sockread())) {
        $line =~ /^\.\s*$/ and last;
        $ret .= $line;
        chomp $line;
        my($num, $uidl) = split ' ', $line;
        $retarray[$num] = $uidl;
    }
    if ($ret) {
        return wantarray ? @retarray : $ret;
    }
}

sub Uidl {
    my $me = shift;
    my $num = shift || '';
    $me->Alive() or return;
    my @retarray = ();
    my $ret = '';
    $me->_checkstate('TRANSACTION', 'UIDL') or return;
    $me->_sockprint('UIDL', $num ? " $num" : '', $me->EOL());
    my $line = $me->_sockread();
    unless (defined $line) {
        $me->Message("Socket read failed for UIDL");
        return;
    }
    $line =~ /^\+OK/ or $me->Message($line) and return;
    if ($num) {
        $line =~ s/^\+OK\s*//;
        return $line;
    }
    while (defined($line = $me->_sockread())) {
        $line =~ /^\.\s*$/ and last;
        $ret .= $line;
        chomp $line;
        my($num, $uidl) = split ' ', $line;
        $retarray[$num] = $uidl;
    }
    if ($ret) {
        return wantarray ? @retarray : $ret;
    }
}

Where are the differences? What's the first step? With a fresh perspective, the following stood out:

sub Uidl {
    my $me = shift;
    my $num = shift;
    return $me->ListArray($num, 'UIDL');
}

A partner helps you overcome familiarity and fear of change which make it hard to see simplifications like this one.

Sharing with Code References

It's clear that List and ListArray are almost identical. The problem is that they differ in the middle of the loop. Perl code references are a great way to factor out differences especially within loops:

sub List {
    return _list(
        sub {
            my $line = shift;
            my $retarray = shift;
            push(@$retarray, $line);
            return;
        },
        @_,
    );
}

sub ListArray {
    return _list(
        sub {
            my($num, $value) = split ' ', shift;
            my $retarray = shift;
            $retarray->[$num] = $value;
            return;
        },
        @_,
    );
}

sub _list {
    my $parse_line = shift;
    my $me = shift;
    my $num = shift || '';
    my $CMD = shift || 'LIST';
    $CMD =~ tr/a-z/A-Z/;
    $me->Alive() or return;
    my @retarray = ();
    my $ret = '';
    $me->_checkstate('TRANSACTION', $CMD) or return;
    $me->_sockprint($CMD, $num ? " $num" : '', $me->EOL());
    my $line = $me->_sockread();
    unless (defined $line) {
        $me->Message("Socket read failed for $CMD");
        return;
    }
    $line =~ /^\+OK/ or $me->Message("$line") and return;
    if ($num) {
        $line =~ s/^\+OK\s*//;
        return $line;
    }
    while (defined($line = $me->_sockread())) {
        $line =~ /^\.\s*$/ and last;
        $ret .= $line;
        chomp $line;
        $parse_line->($line, \@retarray);
    }
    if ($ret) {
        return wantarray ? @retarray : $ret;
    }
}

We pass an anonymous subroutine as _list's first parameter, $parse_line, for the reason described in Refactor Then Fix.

Refactoring Illuminates Fixes

The List, ListArray, and Uidl methods are bimodal. When called without arguments, they return a list of values. When passed a message number, the value for that message number alone is returned. The message number cases failed in our unit test.

The code reference refactoring shows us where the fault lies: $parse_line is not called when _list is called with an argument. It also needs to chomp the $line to match the behavior:

    if ($num) {
        $line =~ s/^\+OK\s*//;
        chomp $line;
        return $parse_line->($line);
    }

The anonymous subroutines in List and ListArray need to be bimodal for this refactoring to work:

sub List {
    return _list(
        sub {
            my $line = shift;
            my $retarray = shift or return $line;
            push(@$retarray, $line);
            return;
        },
        @_,
    );
}

sub ListArray {
    return _list(
        sub {
            my($num, $value) = split ' ', shift;
            my $retarray = shift or return $value;
            $retarray->[$num] = $value;
            return;
        },
        @_,
    );
}

By compressing the business logic, its essence and errors become apparent. Less code is almost always better than more code.

While advanced constructs like code references may be difficult to understand for those unfamiliar with them, dumbing down the code is not a good option. Defaulting to the least common denominator produces dumb code and ignorant programmers. In XP, change occurs not only the project artifacts but also in ourselves. Learning and teaching are an integral part of the XP methodology.

Brush and Floss Regularly

This chapter presents a glimpse of design on demand. Each refactoring was implemented and tested separately. The trick to refactoring successfully is taking baby steps.

I like to compare refactoring to brushing your teeth. Your best shot at preventing tooth decay is to brush briefly after each meal and to floss daily. Alternatively, you could just wait for cavities to appear and have them filled. The short term cost in pain, money, and time is much greater if you do. In the long term, too many fillings create structural problems and the tooth has to be replaced.

Refactoring is preventative maintenance, like tooth brushing. Quick fixes are like fillings. Eventually, they create structural problems and require the implementation to be replaced. Like teeth, complete rewrites are unnecessarily painful and costly.

XP encourages you to do the simplest thing that could possibly work (tactics) to address the immediate problem . You refactor your code to express concepts once and only once (strategy) to prevent structural problems. And, don't forget that brushing and flossing support pair programming.

Footnotes

Don Roberts came up with this rule, and it is noted in Refactoring: Improving the Design of Existing Code, Martin Fowler, Addison Wesley, 1999, p. 58.
http://search.cpan.org/author/SDOWD/POP3Client-2.12
There are numerous CPAN classes for generating accessors. The purpose of this refactoring is to demonstrate the technique of generating code to reduced repetition. In the real world, you'd use one of the CPAN classes or, better yet, eschew accessors in favor a super-class which implements accessors via get and put methods.
There are some minor changes to whitespace, indentation, and brace placement for consistency and brevity within this book.
Except in FORTRAN, where the name of the variable can determine its type.
When making changes to CPAN modules, XP, nor any other methodology, helps to validate uses in the (unknown) importers.

Previous: Chapter 13: Unit Testing

Next: Chapter 15: It's a SMOP