Alpengeist's Coding Massacres

Thursday, October 17, 2013

The Higgs Bug

Quantum physics has an interesting relationship with programming. You may have heard of the Heisenbug. This kind of bug behaves according to the Heisenberg Uncertainty Principle. When the diligent programmer tries to observe it precisely, the damned thing seems to disappear and blur its true location.

I think there is yet another type of bug that shows some parallels to quantum physics. I call it the Higgs Bug. It manifests itself like this:
You have spotted a bug and you are trying to get to it, wading eagerly through the code that lies between you and the bug. However, despite your effort, numerous other problems emerge on your way. You feel like you are stuck in molasses, your momentum constantly drained, as if the bug were surrounded by a Higgs field, making you heavy and slow.

Saturday, February 23, 2013

Coupling Factors in Systems Integration with Real Life Analogies

Where it came from

During my years in enterprise application integration (EAI) I have identified a number of coupling factors that influence design decisions under the constraints of a specific situation. I realized that as an integration architect you have to answer the same questions every time. I'll illustrate the coupling factors from the angle of one partner (system) communicating with another. That makes the presentation biased towards systems integration; nevertheless, I find myself applying the factors in all kinds of design situations. I feel that they are generally useful, and I hope that they will help you, too.

Loose and strong coupling

A lot has been written about this, so I'll keep it short, like the rest of this post.

Coupling in this context means knowing things about others in order to communicate with them. Strong coupling means knowing more, loose coupling knowing less.
Strong coupling is the easy way. It requires less effort and gets stuff done more quickly at first.
Decoupling gives partners more space. What another doesn't know, I can easily change.
Let me be clear (as Obama would say): Loose coupling (decoupling) is usually more work and more expensive to build. It may require looking beyond the immediate gains of simpler solutions. The payoff may not be visible to everyone.
As with all architectural decisions, there is no good and bad unless you include the relevant constraints in your judgement. Forget so-called best practices. There is no best, just more appropriate.

Factor 1: Peer-To-Peer

Real life analogy: I know who you are.

In P2P coupling the partners know each other. They know who is communicating with them. An interface may even be individually tailored for one partner alone. This is very common in systems integration, unless publish-subscribe is used. Most interfaces start out as an individual solution for the first client of a system.

Factor 2: Location

Real life analogy: I know where you live.

Location coupling means to reveal an individual address to a partner. P2P coupling does not necessarily reveal an address yet. A system may send messages that are specific to a partner through a middleware without knowing the partner's location. Location coupling inevitably leads to P2P coupling.
Addresses can be defined on different levels of abstraction. For example, a MAC address is lower than an IP address, which is in turn lower than a DNS name. Each higher level introduces some decoupling.

Factor 3: Representation

Real life analogy: I can look through your window.

Coupling through representation means to reveal one's inner data structure through an interface. This is so common in Java WS-* implementations, for example, that it hurts. Just slam a bunch of annotations on your domain classes and presto: your web service! Representation coupling severely restricts the space for change in a service. This can be totally OK in tightly coupled layers inside a system. Between systems it is a very bad idea.
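One way to keep the window shut is to publish a separate, deliberately small contract class and map to it. The following is only a minimal sketch; Customer, CustomerDto, CustomerMapper and all field names are invented for illustration and do not come from any real service.

```java
// Internal domain class: free to change, never exposed across the system boundary.
class Customer {
    final String firstName;
    final String lastName;
    final int internalRiskScore; // internal detail that partners must not see

    Customer(String firstName, String lastName, int internalRiskScore) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.internalRiskScore = internalRiskScore;
    }
}

// Published contract: a small, stable view for partners.
class CustomerDto {
    final String displayName;

    CustomerDto(String displayName) {
        this.displayName = displayName;
    }
}

// The mapper is the only place that knows both representations.
class CustomerMapper {
    static CustomerDto toDto(Customer customer) {
        return new CustomerDto(customer.firstName + " " + customer.lastName);
    }
}
```

The domain class can now gain, rename, or drop fields, and only the mapper has to follow.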

Factor 4: Integrity

Real life analogy: I can enter through your window and rearrange your furniture.

While coupling through representation is the read-only case, coupling through integrity goes one step further. A system is able to directly manipulate internal data structures of another system. This is possible in database integration scenarios where a system writes to foreign or even shared tables. Terrible idea. To allow this coupling, there must be very solid reasons, for example enormous amounts of data to be delivered in a short time with no option to transform the representation. If the data provider screws up the integrity, the affected system may not work anymore. Consider at least a staging or versioning mechanism as a line of defense.
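The staging idea can be sketched in a few lines of Java. This is an in-memory illustration under simplifying assumptions, not a database design: StagingArea, its methods, and the "non-blank record" validity rule are all hypothetical stand-ins for whatever integrity checks the receiving system really needs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical staging buffer between a data provider and the live data of a consumer.
class StagingArea {
    private final List<String> staged = new ArrayList<>();
    private final List<String> live = new ArrayList<>();

    // The provider may only write here, never to the live data.
    void deliver(String record) {
        staged.add(record);
    }

    // The consumer checks the whole batch and promotes it only if every record is valid.
    // Returns true if the batch was promoted, false if it was rejected.
    boolean promote() {
        for (String record : staged) {
            if (record == null || record.trim().isEmpty()) {
                staged.clear(); // reject the whole batch, live data stays untouched
                return false;
            }
        }
        live.addAll(staged);
        staged.clear();
        return true;
    }

    List<String> liveRecords() {
        return Collections.unmodifiableList(live);
    }
}
```

The point of the design is that a broken delivery can only damage the staging side; the owner of the live data decides when and whether it is taken over.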

Factor 5: Synchronicity

Real life analogy: You are using my time.

When a client is required to wait for the completion of its request, it is synchronously coupled. Most systems are coupled synchronously, especially over HTTP, which is synchronous unless special pseudo-asynchronous techniques are used. A messaging middleware infrastructure can establish asynchronous communication. However, this requires totally different strategies in the partner systems. Asynchronous processing is much more complicated. It even affects how a user can interact with a system through the UI.
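To make the difference concrete, here is a small sketch that assumes Java 8 or later. PriceService and fetchPrice are made-up names, and the "remote call" is just arithmetic standing in for a slow partner system.

```java
import java.util.concurrent.CompletableFuture;

class PriceService {
    // Stand-in for a slow remote call; the returned value is made up for the example.
    static int fetchPrice(String item) {
        return item.length() * 10;
    }

    // Synchronous coupling: the caller is blocked until the result is there.
    static int priceSync(String item) {
        return fetchPrice(item);
    }

    // Asynchronous variant: the caller gets a handle immediately
    // and decides later when (or whether) to wait for the result.
    static CompletableFuture<Integer> priceAsync(String item) {
        return CompletableFuture.supplyAsync(() -> fetchPrice(item));
    }
}
```

A caller of priceAsync would typically chain follow-up work with thenApply or thenAccept instead of blocking, which is exactly the change in strategy that makes asynchronous designs harder.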

Factor 6: Distributed Lifecycle

Real life analogy: A kid travels alone and calls mom every hour.

When a logical data concept (business data object) is created in one system and then travels across to other systems where it gets manipulated (a copy of it, that is), its lifecycle is distributed. In many cases all participating systems want to be notified of changes in the lifecycle. This is a very fragile coupling. Distributed state is a delicate problem to solve and should be avoided. It may be an indication of bad vertical slicing of system responsibility. However, in real life, where systems with overlapping data requirements evolve one after another without replacing each other, it happens quickly.

Where to go from here

Was that all? Of course not. There are plenty more things to consider when designing systems interaction. You may find more fine-grained coupling factors. From my experience though, the six factors I've presented pretty much cover the most important everyday coupling problems.

We have also not addressed the numerous other nonfunctional aspects, of which coupling is just one. Thank you for reading this far :-)

Wednesday, March 28, 2012

GitHub with cygwin git behind corporate firewall and proxy

GitHub: me too! Port 22 doesn't work! Bummer! Corporate firewalls suck!

I find it complicated enough to establish a tunnel to GitHub through a corporate firewall that I want to share my results. The following steps describe what I did on my Windows 7 machine to finally succeed.

Install

Install cygwin with git, ssh and corkscrew.
All commands are now entered in the cygwin shell, not the Windows command shell.
We use git from the cygwin package, not from the git bash.

Configure the proxy for git

Make sure you have the HTTP_PROXY environment variable set:
echo $HTTP_PROXY
should print your proxy host and port.

Now enter
git config --global http.proxy $HTTP_PROXY

On to the fun part: tunnel SSH through the https port 443

Follow the ssh key generation process as described on the GitHub installation page.
The command
ssh -T git@github.com
fails because port 22 is blocked. That's why you are reading this page after all.

Create the file ~/.ssh/config and put the following content in it:

Host gitproxy
User git
HostName ssh.github.com
Port 443
ProxyCommand /usr/bin/corkscrew.exe proxyhost proxyport %h %p
IdentityFile ~/.ssh/id_rsa

If you don't like the host name gitproxy, feel free to choose your own.
The identity file is your private key file, which has been generated with ssh-keygen.

Make sure the permissions are set correctly, otherwise ssh will complain:
chmod 644 ~/.ssh/config

Now try
ssh gitproxy

ssh will ask you for the passphrase you have defined for your ssh key
GitHub says:
Hi (your GitHub name)! You've successfully authenticated, but GitHub does not provide shell access.

Use the host gitproxy instead of github.com for all further git commands. git push prompts for the passphrase, and a bunch of other commands likely do, too. I am new to git and GitHub, so forgive me for my lack of precision.

The programs ssh-agent and ssh-add can automate the passphrase so that you don't have to enter it every time.

Set up ssh-agent

Add
eval `ssh-agent`
to your .bashrc

Reopen the cygwin shell and run
ssh-add ~/.ssh/id_rsa

I am sure I have forgotten something, but hopefully this will take you 95% of the way ;-)

Did I already mention that I hate network plumbing with a passion?

Tuesday, March 27, 2012

Geographic Visualization of my Twitter Graph

(Source at https://github.com/alpengeist/Twitter-Crawler)

I am experimenting with the Neo4j graph database and as an exercise I loaded the friend graph from Twitter, starting with me (@alpengeist_de), in depth 4 and width 40 into the DB. I then enhanced the data with the geo location information from Yahoo Places. This process takes a while, because the Twitter REST API limits to 150 requests per hour for non-registered applications. Behind a corporate proxy it is even worse, because the tracking is on the IP address and my colleagues consume the quota as well.

My Java program caches the friend data in a simple CSV file, so I can start it anytime and get another bunch of users from Twitter. The geodata is cached in a properties file (the simplest K/V store there is :-)
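For the curious, this is roughly what a properties-backed cache can look like. It is a sketch, not the code from the repository: GeoCache, the key/value layout ("location text" mapped to "latitude,longitude"), and the sample coordinates are assumptions made for the example. It serializes to a string so that it runs without touching the disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical geodata cache: location text -> "latitude,longitude".
class GeoCache {
    private final Properties entries = new Properties();

    void put(String location, String latLon) {
        entries.setProperty(location, latLon);
    }

    // Returns null when the location has not been resolved yet.
    String get(String location) {
        return entries.getProperty(location);
    }

    // Serializes the cache in the standard .properties format.
    String save() {
        StringWriter out = new StringWriter();
        try {
            entries.store(out, "geodata cache");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Restores a cache from text previously produced by save().
    static GeoCache load(String text) {
        GeoCache cache = new GeoCache();
        try {
            cache.entries.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cache;
    }
}
```

In a real crawler the StringWriter and StringReader would be replaced by a FileWriter and FileReader on the cache file.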

I can quickly generate the Neo4j graph database with data from disk.

I have collected about 30000 users so far. About 18000 of them have the full data set. For visualization I started with the open source tool Gephi, whose Geo Layouter I have used to produce the images.

I did a node ranking by follower count to fatten the dots a little, and a color partitioning by country. Yahoo Places returns a quality measure, so I filtered on that as well. Et voilà, the continents emerge.

Not everyone on Twitter has entered a real location. Many say they are in "the internet", which is located in Brazil according to Yahoo. It seems like the Internet is enjoying itself in a pleasant climate and dancing Samba! Another good one is @artcika, who claims to be "in the middle of the map". Yahoo locates that in Papua New Guinea.
Those errors are not easy to filter out.

Twitter Graph layouted with Gephi Geo Layouter (Mercator projection), partitioned by country

The little spots outside the continents are mostly due to nonsense location data from Twitter where Yahoo Places was still confident to have delivered something useful. However, there are actually users in Alaska, Hawaii, and Iceland :-)

I have plans to get familiar with the D3 JavaScript library for visualization, but that will take some effort.

Finally, here is a version with the connections switched on. There is not much information one can draw from this. I just enjoy the emerging color patterns.

Monday, March 12, 2012

Even your Java VM is eventually consistent - or worse

When studying distributed systems, or in my case more specifically distributed databases, you encounter the term eventual consistency and the CAP theorem.

A multicore processor system is actually also a kind of distributed system. There is no unreliable network between the cores, but it has nodes (cores) with isolated caches and registers, which hold distributed state just like a distributed DB does.

The book Java Concurrency in Practice by Brian Goetz et al. deals with Java concurrency in almost overwhelming depth. In chapter 3.1 "Visibility" I found an interesting analogy to eventually consistent databases.

The story is that variables in a multithreaded program are not consistent among threads unless the access is synchronized. With unsynchronized variables the compiler may reorder statements and cache values in registers or processor-specific caches, so that the individual threads have inconsistent state.

With synchronization, we trade in a little of the Availability of the CAP triangle to achieve Consistency. I won't go too far with this analogy, since Partition Tolerance has no place in a single-chip multicore processor. Nevertheless, it is the same trade-off as in distributed systems.

I dare to publish their example here without permission, not without recommending the book as the standard literature for the subject, of course :-)

A JVM can even be worse than eventually consistent: never consistent!

public class NoVisibility {
    private static boolean ready;
    private static int number;

    private static class ReaderThread extends Thread {
        public void run() {
            while (!ready)
                Thread.yield();
            System.out.println(number);
        }
    }

    public static void main(String[] args) {
        new ReaderThread().start();
        number = 42;
        ready = true;
    }
}

In this example, the ReaderThread may never finish because it may never see the ready flag become true, even though it is certainly true in the main thread. Also, even if the ReaderThread sees ready as true, there is no guarantee that number is 42. It may just as well be 0 because of statement reordering effects.

As the authors point out, this code is as simple as it gets and boom: disaster and terribly hard-to-find errors are waiting to give you a hard time.

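A weaker alternative to synchronization is declaring a variable as volatile. This tells the compiler and the runtime not to keep the value in registers and not to reorder accesses around it. Every write becomes visible to all threads that subsequently read the variable, so it behaves like shared, consistent state. Unlike synchronization, volatile does not make compound actions such as check-then-update atomic.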
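Here is a sketch of how the NoVisibility example could be repaired with volatile. It is my own variation, not code from the book: the class name VolatileVisibility and the method publishAndRead are invented, and the reader hands its result back through an array so that the outcome can be checked.

```java
class VolatileVisibility {
    // volatile: a write to ready becomes visible to other threads,
    // together with everything written before it (here: number).
    private static volatile boolean ready;
    private static int number;

    // Starts a reader thread, publishes the value, and returns what the reader saw.
    static int publishAndRead() {
        ready = false;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {
                Thread.yield();
            }
            seen[0] = number;
        });
        reader.start();
        number = 42;
        ready = true; // volatile write publishes number = 42
        try {
            reader.join(); // join makes seen[0] visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Only ready needs to be volatile. The plain write to number happens before the volatile write, so a reader that sees ready as true is guaranteed to see 42 as well.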

Personally, I don't think that the majority of Java developers will be capable of mastering concurrency down to the finest details. It is even more difficult than generics, another dark corner for many. Serious concurrent programming in Java requires dropping the familiar standard data structures from java.util in favor of java.util.concurrent, which is presented in detail in the book. The language itself does not hide any complexity; it hits you in the face brutally. The future will show whether most use cases can be satisfied with java.util.concurrent (together with Java 8 lambdas maybe) or if Java concurrency becomes a bone breaker.
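As a small taste of what java.util.concurrent buys you, here is a sketch of a shared counter built on ConcurrentHashMap. It assumes Java 8 or later for merge() and lambdas; WordCounter and countInParallel are names made up for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class WordCounter {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    // merge() is atomic per key, so concurrent callers do not lose updates.
    void count(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    int get(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Counts the same word from several threads and returns the final total.
    static int countInParallel(String word, int threads, int perThread) {
        WordCounter counter = new WordCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.count(word);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get(word);
    }
}
```

With a plain java.util.HashMap and a get-then-put update, the same loop would be free to lose increments or corrupt the map.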

Tuesday, November 1, 2011

My Four Natural Laws Of Programming

(work in progress...)

I wrote my first program in BASIC on a Commodore 8032 in 1984. The computer was magical and I had no idea what was actually going on inside it.
Today the magic of the machine is gone. Yet I still feel the magic of the structure and aesthetics of programming.

Recently, I realized that the more I program, the fewer rules I use to decide what's good.

There are loads of rules out there describing what good code looks like. Too many of them. Put yourself into the mind of a beginner and try to figure out how to learn them. I try to condense my rules into a small universal set that includes many others as special cases. I like to draw an analogy to the fundamental forces of physics, which became less numerous with time, embracing more forces previously seen as individual ones.

Further, I believe in the emergence of software architecture when the laws are applied continuously and code gets refactored accordingly.

Do you also think "I know good code when I see it"? I definitely do. Programming is a very visual activity. Good code always looks good.

When I do code reviews, I find myself applying a handful of universal laws, which reliably uncover flaws without my having to know what the code is actually doing. The rest is covered by sufficient testing.

I'll write down my personal laws of programming to share them and to become aware of them myself. They lean towards object-oriented programming in some areas because that is what I do most. I also talk about how I check the laws during a code review.

#1 Redundancy

Redundancy means two things: repetition and waste.

Repeating code is an indicator for missing abstraction. It can appear everywhere, from data access patterns, which repeatedly follow the same chains, to algorithmic patterns, which differ in little details not properly abstracted.
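A tiny illustration of an algorithmic pattern that differed only in a detail, after the detail has been turned into a parameter. The class Amounts, its methods, and the thresholds are invented for the example, and it assumes Java 8 for the lambdas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class Amounts {
    // The one abstraction: the loop exists once, the varying detail is a parameter.
    static List<Integer> select(List<Integer> amounts, Predicate<Integer> condition) {
        List<Integer> result = new ArrayList<>();
        for (Integer amount : amounts) {
            if (condition.test(amount)) {
                result.add(amount);
            }
        }
        return result;
    }

    // Formerly two copies of the same loop with different if-conditions.
    static List<Integer> large(List<Integer> amounts) {
        return select(amounts, a -> a > 1000);
    }

    static List<Integer> refunds(List<Integer> amounts) {
        return select(amounts, a -> a < 0);
    }
}
```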

Waste code adds no value, solves no relevant problem. However, it introduces bugs and maintenance cost. Waste conjures up workarounds due to missing understanding by programmers. Useless layering and over-abstraction fall into this category.
Using code generators to produce waste makes them waste generators.

The art is to find the right abstraction balance to avoid repetition without introducing waste.

Automated code duplication checkers help to find repetition. Waste can only be identified in the context of the given architecture constraints. Waste in one case may be a necessary construct somewhere else. Mechanical exercise of too limited rules is harmful here. This is where experience makes all the difference.

#2 Consistency

Consistency is about recurring patterns, which do not contradict each other.

Consistent code becomes familiar quickly. It has no exceptions to rules.

Consistency has many facets. Names are chosen consistently. An API uses a consistent level of abstraction. Data access patterns are similar. Things are identified consistently and not sometimes by IDs, sometimes as object refs, for example. A class hierarchy is not designed only to be torn apart with instanceof later. Consistency shows in method signatures in the order of parameters, and imperative vs. functional style. Countless examples are possible here.
Error handling is another large field for consistency hazards.

Checking consistency is something no machine can do for you (yet). As a code reader, I critically watch out for surprises, which introduce a change in patterns.

#3 Fat

With fat I am not referring to code that has no use and is redundant. Fat is missing structure: it fails to provide a proper segmentation of responsibility. Too much is packed into one location.

Fat is easy to see. In methods for example, there are big code blocks or unrelated blocks. A method starts with one task and continues with some other and you cannot quickly assess where and why. If the programmer was diligent, he left some comments between the blocks. Classes contain too many methods and again you cannot see where the boundaries are between the tasks a class has.

You can tell at one glance that something is not OK, regardless of what the code actually does. It is purely a matter of form. Automated metrics like XS (excessive structural complexity) in Structure101 or the well-known cyclomatic complexity help to find the hot spots quickly.

#4 Coupling

Coupling addresses the problem of who knows what.

This is a huge field and I think the most important aspect of architecture on large and small scales.
The fundamental question is:

Why do I access this information here and now from that source?

Bad coupling restricts the source of the information in its ability to change. It is like pinning a tense wire to it. You find coupling problems in interfaces to external systems, which are too tight to be backwards compatible, in missing encapsulation, in badly organized separation of concerns causing wild dependencies (hair balls), in methods juggling with too many things at once, in packages with little coherence. Bad coupling produces dependency cycles between packages.

Coupling also comes in multiple dimensions,

Structure
Time
Location

being the most important ones.

There are many techniques to reduce coupling in specific situations. Knowing them, and as with abstractions, using them wisely, is a matter of experience.

Identifying cycles in non-trivial programs is impossible manually. For Java, Structure101 became an indispensable tool for me. I seem to be too dumb or lazy to understand a colored dependency matrix like IntelliJ or Sonar have to offer. You need the help of the machine here.

That's it. Really. The facets are numerous and need to be learned in each environment. Still, I believe it all comes down to a handful of principles. When I struggle with some code or design, I can always root the problem in the four laws. And every once in a while, it was my own fault :-)

Programming as technique is not magical, it is a craft. Sometimes we forget and make a big fuss about it ("Popanz und Gedöns" in German). The magic shines when stuff is done right, like in a painting of a master.

Friday, October 21, 2011

CDI Events solve fancy problems elegantly (Part 2)

Problem: Does your application have a heartbeat?

I used to spread the phrase "every decent application needs a timer bean". Since JEE6, I have rephrased it to "every decent application needs a heart beat".

Whaddaya?

The @Schedule annotation improves upon the old timer service because it has zero admin effort. However, it has issues:

The configuration of @Schedule is in the static code, you cannot modify it at runtime, for example with values from a configuration file outside the deployment package.

Further, I think most timed routine jobs in an application are of the simple sort: every minute, every quarter hour, every hour, at time x every day. I rarely find a reason to be more granular than this.

@Schedule has been inspired by cron, which is flexibly configured with a text file. That is very different from compiling the static config into a class file in an EAR and not being able to modify it in the deployed application. That's very "un-cronish".

How about this idea: a heartbeat event is fired every minute and subscribers can observe it. The event has inspection methods to find out about the event time properties. The observer can be configured at runtime using data from an external config file.

The HeartbeatEvent has a bunch of inquiry methods. isTime(), isMinuteInterval(), and isHourInterval() take parameters, which can be fetched from a configuration, for example. Having the Calendar at hand, it is possible to invent all kinds of inquiries. Those are the ones that I find most useful.

import java.util.Calendar;
import java.util.GregorianCalendar;

public class HeartbeatEvent {
    // Captures the moment the event was created.
    private Calendar calendar;
    public HeartbeatEvent() {
        this.calendar = new GregorianCalendar();
    }
    /**
     * "at x hour and y minutes"
     * Use minute == 0 to test against a specific full hour.
     * @param hour 0-23
     * @param minute 0-59
     * @return true if match
     */
    public boolean isTime(int hour, int minute) {
        return calendar.get(Calendar.HOUR_OF_DAY) == hour && calendar.get(Calendar.MINUTE) == minute;
    }
    /**
     * "every n minutes"
     * @param minutes minute interval 1-59 (0 would cause a division by zero)
     * @return true if match
     */
    public boolean isMinuteInterval(int minutes) {
        return calendar.get(Calendar.MINUTE) % minutes == 0;
    }
    /**
     * "every n hours"
     * @param hour hour interval 1-23 (0 would cause a division by zero)
     * @return true if match
     */
    public boolean isHourInterval(int hour) {
        return calendar.get(Calendar.HOUR_OF_DAY) % hour == 0 && isFullHour();
    }
    public boolean isFullHour() {
        return calendar.get(Calendar.MINUTE) == 0;
    }
    public boolean isHalfHour() {
        return isMinuteInterval(30);
    }
    public boolean isQuarterHour() {
        return isMinuteInterval(15);
    }
}

The HeartbeatEmitter:

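import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.enterprise.event.Event;
import javax.inject.Inject;

@Singleton
public class HeartbeatEmitter {
    @Inject
    private Event<HeartbeatEvent> heartbeat;

    // Fires once per minute; persistent = false means the timer does not survive a server restart.
    @Schedule(hour="*", minute="*", persistent = false)
    public void emit() {
        heartbeat.fire(new HeartbeatEvent());
    }
}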

Example heartbeat observer:

public void fileCleanup(@Observes HeartbeatEvent heartbeat) {
  if (heartbeat.isFullHour()) {
      // do file cleanup every full hour
  }
}
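To show the "configured at runtime" part, here is a container-free sketch of how an observer could take its interval from external configuration. JobSchedule, the property key cleanup.intervalMinutes, and the default of 60 are all assumptions made for this example; how the Properties object gets loaded from a file outside the deployment is left open.

```java
import java.util.Properties;

// Hypothetical helper: decides whether a job is due, based on externally supplied configuration.
class JobSchedule {
    private final int intervalMinutes;

    // "cleanup.intervalMinutes" is an invented key; 60 (every full hour) is an assumed default.
    JobSchedule(Properties config) {
        this.intervalMinutes = Integer.parseInt(config.getProperty("cleanup.intervalMinutes", "60"));
        if (intervalMinutes < 1 || intervalMinutes > 60) {
            throw new IllegalArgumentException("interval must be between 1 and 60 minutes");
        }
    }

    int intervalMinutes() {
        return intervalMinutes;
    }

    // Same arithmetic as HeartbeatEvent.isMinuteInterval(): due when the minute is a multiple of the interval.
    boolean isDue(int minuteOfHour) {
        return minuteOfHour % intervalMinutes == 0;
    }
}
```

Inside an observer method, the check would then read heartbeat.isMinuteInterval(schedule.intervalMinutes()) instead of a hard-coded isFullHour(), so changing the config file changes the schedule without a redeployment.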