Wednesday, March 28, 2012

GitHub with cygwin git behind corporate firewall and proxy

GitHub: me too! Port 22 doesn't work! Bummer! Corporate firewalls suck!

Establishing a tunnel to GitHub through a corporate firewall turned out to be complicated enough that I want to share my results. The following steps describe what I did on my Windows 7 machine to finally succeed.

Install

Install cygwin with git, ssh and corkscrew.
All commands are now entered in the cygwin shell, not the Windows command shell.
We use git from the cygwin package, not from the git bash.

Configure the proxy for git


Make sure you have the HTTP_PROXY environment variable set:
echo $HTTP_PROXY
This should print your proxy host and port.

Now enter
git config --global http.proxy $HTTP_PROXY

On to the fun part: tunnel SSH through the HTTPS port 443

Follow the ssh key generation process as described in the GitHub installation page.
The command
ssh -T git@github.com
fails because port 22 is blocked. That's why you are reading this page after all.

Create the file ~/.ssh/config and put the following content in it

Host gitproxy
User git
HostName ssh.github.com
Port 443
ProxyCommand /usr/bin/corkscrew.exe proxyhost proxyport %h %p
IdentityFile ~/.ssh/id_rsa


If you don't like the host name gitproxy, feel free to choose your own. Replace proxyhost and proxyport with the host and port of your own proxy.
The identity file is your private key file, which was generated with ssh-keygen.

Make sure the permissions are set correctly, otherwise ssh will complain
chmod 644 ~/.ssh/config

Now try
ssh gitproxy

ssh will ask you for the passphrase you have defined for your ssh key
GitHub says:
Hi (your GitHub name)! You've successfully authenticated, but GitHub does not provide shell access.

Use the host gitproxy instead of github.com for all further git commands. You will be prompted for the passphrase on git push and likely on a bunch of other commands. I am new to git and GitHub, so forgive me for my lack of precision.

The programs ssh-agent and ssh-add can cache the unlocked key so that you don't have to enter the passphrase every time.

Set up ssh-agent

Add
eval `ssh-agent`
to your .bashrc

Reopen the cygwin shell and run
ssh-add ~/.ssh/id_rsa

I am sure I have forgotten something, but hopefully this will take you 95% of the way ;-)

Did I already mention that I hate network plumbing with a passion?

Tuesday, March 27, 2012

Geographic Visualization of my Twitter Graph

(Source at https://github.com/alpengeist/Twitter-Crawler)

I am experimenting with the Neo4j graph database, and as an exercise I loaded the friend graph from Twitter, starting with me (@alpengeist_de), to depth 4 and width 40 into the DB. I then enriched the data with geolocation information from Yahoo Places. This process takes a while, because the Twitter REST API limits non-registered applications to 150 requests per hour. Behind a corporate proxy it is even worse, because the quota is tracked per IP address and my colleagues consume it as well.

My Java program caches the friend data in a simple CSV file, so I can start it anytime and fetch another batch of users from Twitter. The geodata is cached in a properties file (the simplest K/V store there is :-).
From the data on disk I can quickly regenerate the Neo4j graph database.
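
A properties-file cache like the one mentioned above could be sketched roughly as follows. This is not the code from the Twitter-Crawler repository: the class name GeoCache, the "lat,lon" value format and the coordinates in main are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch: maps a Twitter location string to a "lat,lon" string
// and keeps the mapping in a java.util.Properties file on disk.
class GeoCache {
    private final Path file;
    private final Properties entries = new Properties();

    // Loads existing entries if the cache file is already there.
    GeoCache(Path file) {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                entries.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Returns the cached coordinates, or null if the location is unknown.
    String lookup(String location) {
        return entries.getProperty(location);
    }

    // Remembers a lookup result and rewrites the whole file.
    void store(String location, String latLon) {
        entries.setProperty(location, latLon);
        try (OutputStream out = Files.newOutputStream(file)) {
            entries.store(out, "geodata cache");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stores one entry in a temporary file, reopens the file and reads it back.
    static String roundTrip(String location, String latLon) {
        try {
            Path tmp = Files.createTempFile("geocache", ".properties");
            try {
                new GeoCache(tmp).store(location, latLon);
                return new GeoCache(tmp).lookup(location);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder coordinates, not a real Yahoo Places result.
        System.out.println(roundTrip("the internet", "0.0,0.0"));
    }
}
```

Rewriting the whole file on every store is wasteful, but for a few thousand entries it keeps the sketch simple and the cache readable in a text editor.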

I have collected about 30000 users so far. About 18000 of them have the full data set. For visualization I started with the open source tool Gephi, whose Geo Layouter I have used to produce the images.

I did a node ranking by follower count to fatten the dots a little, and a color partitioning by country. Yahoo Places returns a quality measure, so I filtered on that as well. Et voilà, the continents emerge.

Not everyone on Twitter has entered a real location. Many say they are in "the internet", which is located in Brazil according to Yahoo. It seems like the Internet is enjoying itself in a pleasant climate and dancing Samba! Another good one is @artcika, who claims to be "in the middle of the map". Yahoo locates that in Papua New Guinea.
Those errors are not easy to filter out.


Twitter graph laid out with the Gephi Geo Layouter (Mercator projection), partitioned by country
The little spots outside the continents are mostly due to nonsense location data from Twitter for which Yahoo Places was still confident that it had delivered something useful. However, there are actually users in Alaska, Hawaii, and Iceland :-)

I have plans to get familiar with the D3 JavaScript library for visualization, but that will take some effort.

Finally, here is a version with the connections switched on. There is not much information one can draw from this. I just enjoy the emerging color patterns.




Monday, March 12, 2012

Even your Java VM is eventually consistent - or worse

When studying distributed systems, or in my case more specifically distributed databases, you encounter the term eventual consistency and the CAP theorem.

A multicore processor system is actually also a kind of distributed system. There is no unreliable network between the cores, but there are nodes (cores) with isolated caches and registers, which hold distributed state just like the nodes of a distributed DB do.

The book Java Concurrency in Practice by Brian Goetz et al. deals with Java concurrency in almost overwhelming depth. In section 3.1 "Visibility" I found an interesting analogy to eventually consistent databases.

The story is that variables in a multithreaded program are not consistent among threads unless the access is synchronized. With unsynchronized variables the compiler may reorder statements and cache values in registers or processor-specific caches, so that the individual threads have inconsistent state.

With synchronization, we trade a little of the Availability in the CAP triangle for Consistency. I won't go too far with this analogy, since Partition Tolerance has no place in a single-chip multicore processor. Nevertheless, the trade-off is the same as in distributed systems.

I dare to publish their example here without permission, not without recommending the book as the standard literature for the subject, of course :-)

A JVM can even be worse than eventually consistent: never consistent!

public class NoVisibility {
    private static boolean ready;
    private static int number;
 
    private static class ReaderThread extends Thread {
        public void run() {
            while (!ready)
                Thread.yield();
            System.out.println(number);
        }
    }
    public static void main(String[] args) {
        new ReaderThread().start();
        number = 42;
        ready = true;
    }
}

In this example, the ReaderThread may never finish because it may never see the ready flag as true, even though it is certainly true in the main thread. And even if the ReaderThread does see ready as true, there is no guarantee that number is 42. It may just as well print 0 because of statement reordering effects.

As the authors point out, this code is as simple as it gets, and still: boom! Disaster and terribly hard-to-find errors are waiting to give you a hard time.
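
One way to apply synchronization to the example is sketched below. This is my own variation, not the book's code: the class name SynchronizedVisibility, the lock object and the publishAndRead method are assumptions for illustration. Every access to ready and number happens while holding the same lock, so the reader is guaranteed to see both writes.

```java
// Hypothetical variation of NoVisibility: all shared state is guarded by one lock.
class SynchronizedVisibility {
    private static final Object lock = new Object();
    private static boolean ready;   // guarded by lock
    private static int number;      // guarded by lock

    // Starts a reader thread, publishes 42 and returns what the reader saw.
    static int publishAndRead() {
        synchronized (lock) {
            ready = false;
            number = 0;
        }
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (true) {
                synchronized (lock) {
                    if (ready) {
                        seen[0] = number;
                        return;
                    }
                }
                Thread.yield();
            }
        });
        reader.start();
        synchronized (lock) {
            // Both writes become visible to the next thread that takes the lock.
            number = 42;
            ready = true;
        }
        try {
            reader.join();   // join also makes seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead());   // prints 42
    }
}
```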

A weaker alternative to synchronization is declaring a variable as volatile. This tells the compiler and the runtime not to cache the variable in registers or in caches hidden from other cores. A write to a volatile variable is visible to every thread that reads it afterwards, so all threads see consistent state.
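
Applied to the NoVisibility example above, the volatile approach could look roughly like this. Again, this is a sketch of my own and not the book's code; the class name VolatileVisibility and the publishAndRead method are made up for illustration. As far as I understand the Java memory model (since Java 5), it is enough to make ready volatile: everything written before the volatile write, including number, is visible to a thread that reads ready as true.

```java
// Hypothetical variation of NoVisibility: only the flag is volatile.
class VolatileVisibility {
    private static volatile boolean ready;
    private static int number;

    // Starts a reader thread, publishes 42 and returns what the reader saw.
    static int publishAndRead() {
        ready = false;
        number = 0;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {          // volatile read, never cached away
                Thread.yield();
            }
            seen[0] = number;         // ordered after the volatile read of ready
        });
        reader.start();
        number = 42;                  // plain write, ordered before the volatile write
        ready = true;                 // volatile write publishes number as well
        try {
            reader.join();            // join makes seen[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead());   // prints 42
    }
}
```

Note that volatile only gives visibility and ordering. It does not make compound operations such as number++ atomic, which is why it counts as the weaker tool.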

Personally, I don't think that the majority of Java developers will be capable of mastering concurrency down to the finest details. It is even more difficult than generics, another dark corner for many. Serious concurrent programming in Java requires dropping the familiar standard data structures from java.util in favor of java.util.concurrent, which is presented in detail in the book. The language itself does not hide any complexity, it hits you in the face brutally. The future will show whether most use cases can be satisfied with java.util.concurrent (together with Java 8 lambdas maybe) or if Java concurrency becomes a bone breaker.
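
To give an impression of what java.util.concurrent takes off your shoulders, here is a minimal sketch of several threads incrementing a shared counter in a ConcurrentHashMap. The class name ConcurrentCounter, the key "hits" and the thread counts are assumptions for illustration, and the code needs Java 8 for the lambda and for merge. With a plain java.util.HashMap the same loop could lose updates or corrupt the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a shared counter without any explicit locking.
class ConcurrentCounter {

    // Lets `threads` workers each add `incrementsPerThread` to one map entry.
    static int countInParallel(int threads, int incrementsPerThread) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    // merge is an atomic read-modify-write on ConcurrentHashMap
                    counts.merge("hits", 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts.getOrDefault("hits", 0);
    }

    public static void main(String[] args) {
        System.out.println(countInParallel(4, 10000));   // prints 40000
    }
}
```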