Selecting Random Lines From a File

February 4th, 2010 by admin
  1. Append a random number to each line in the file
  2. Sort by said number
  3. Take the top N lines as required

awk 'BEGIN {srand()} {printf "%05.0f %s\n", rand()*99999, $0}' filein \
  | sort -n | head -100 | sed 's/^[0-9]* //'
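
The three steps can also be wrapped in a small shell function. This is a sketch rather than a drop-in tool: the name random_lines is made up for illustration, and it puts a tab between the random key and the line so that cut can strip the key again. On systems with GNU coreutils, shuf -n 100 filein gives a similar result in one step.

```shell
# random_lines N FILE: print N randomly chosen lines of FILE.
# (The function name is illustrative, not from the original post.)
random_lines() {
    # 1. prefix every line with a random key and a tab
    # 2. sort numerically on that key
    # 3. keep the first N lines and cut the key off again
    awk 'BEGIN { srand() } { printf "%.6f\t%s\n", rand(), $0 }' "$2" \
        | sort -n \
        | head -n "$1" \
        | cut -f2-
}
```

Usage: random_lines 100 filein. awk's srand() seeds from the clock, so two runs within the same second will generally pick the same lines.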

dnsjava - an in process DNS resolver for Java

October 3rd, 2009 by Joel

dnsjava is a multi-threaded in process DNS resolver for Java. It’s a must for anyone writing their own high throughput web crawler because:

  1. It runs in process and caches DNS lookups, eliminating subsequent network calls to a DNS resolver.
  2. Java’s inbuilt DNS resolver code, InetAddress, is single-threaded. This can become a serious bottleneck if crawling multiple sites concurrently. If your average DNS resolution time is 50ms then your crawl throughput will be limited to 20 page requests per second.

The latest version of dnsjava is compatible with Java 6 and is a drop-in replacement for the default Java implementation, which itself calls into the operating system’s DNS resolver. Using it is a simple matter of dropping in the dnsjava jar file and setting the following system property:

sun.net.spi.nameservice.provider.1=dns,dnsjava

Thanks to Brian Wellington and Paul Cowan for writing & supporting this software.

DNSMasq

June 29th, 2009 by admin

If you’re generating a lot of programmatic HTTP requests, say you’re running a web crawler, it’s worth caching DNS lookups locally to relieve the load on your usual DNS servers as well as speed things up a little. This is easily done by running an instance of Dnsmasq locally.

On Debian or Ubuntu:

Install DNSMasq:
sudo apt-get install dnsmasq

Edit /etc/dnsmasq.conf to uncomment the following line:
listen-address=127.0.0.1

Edit /etc/dhcp3/dhclient.conf to ensure you have the following section:
#supersede domain-name "fugue.com home.vix.com";
prepend domain-name-servers 127.0.0.1;
request subnet-mask, broadcast-address, time-offset, routers,
domain-name, domain-name-servers, host-name,
netbios-name-servers, netbios-scope;

You will likely just need to ensure you have the line enabled:
prepend domain-name-servers 127.0.0.1;

Whenever dhclient3 is allocated an IP address, it also updates /etc/resolv.conf with the DNS servers to use. The config line we added above ensures that a line is inserted at the top of /etc/resolv.conf so that 127.0.0.1 is used first.

That is all the configuration required. To start using our new local DNS server we can:

Make sure dnsmasq is running:
ps -ef | grep dnsmasq

Restart it (to pick up the config we added):
sudo /etc/init.d/dnsmasq restart

Re-allocate ourselves an IP address - to force /etc/resolv.conf to be updated:
# release ip address
sudo dhclient -r <your interface e.g. eth0>
# acquire ip address
sudo dhclient <your interface e.g. eth0>

Post Configuration Checks:

1. Check that our local DNS resolver is being used:

less /etc/resolv.conf
search lan
nameserver 127.0.0.1
nameserver 212.135.1.36
nameserver 192.168.60.20

2. Query the DNS server for some domain twice using dig. The second query should be faster as it is answered from the local cache:

dig nasa.org
;; Query time: 91 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mon Jun 29 09:16:57 2009
;; MSG SIZE  rcvd: 42

then:

dig nasa.org
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mon Jun 29 09:16:59 2009
;; MSG SIZE  rcvd: 42

and as expected, the second request is much faster.

If you were crawling a single domain of 10000 pages this could save around 17 minutes of network wait time given a 100ms DNS query time (10000 x 100ms = 1000s). Given that during a crawl you could be looking at many thousands of domains, the savings are not insignificant.
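
To plug in your own numbers, the same back-of-the-envelope sum can be scripted. This is only a sketch: the helper name dns_wait_minutes is made up, and it assumes every page fetch pays for one uncached lookup.

```shell
# dns_wait_minutes PAGES QUERY_MS: total DNS wait, in minutes, if every
# page fetch pays for one uncached lookup. (Helper name is illustrative.)
dns_wait_minutes() {
    awk -v pages="$1" -v ms="$2" \
        'BEGIN { printf "%.1f\n", pages * ms / 1000 / 60 }'
}

dns_wait_minutes 10000 100   # the figures used above; prints 16.7
```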


Bash & Shell Snippets

June 27th, 2009 by admin

This is simply a list of useful bash snippets I copy and paste as and when I need them. I add to it periodically.

Performing some command on every file in a directory:
for file in *.foo; do something "$file"; done

Iterating a number of times:
for i in {1..10}; do something; done

Sorting lines in a file by length:
awk '{ print length($0),$0 | "sort -n"}'  filename.txt
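
The sort-by-length snippet leaves the length in front of every line. A small variation, sketched here with a made-up function name, strips that column again so the output is just the original lines, shortest first:

```shell
# sort_by_length FILE: print the lines of FILE, shortest first,
# without the length column. (Function name is illustrative.)
sort_by_length() {
    awk '{ print length($0), $0 }' "$1" | sort -n | cut -d' ' -f2-
}
```

Usage: sort_by_length filename.txt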

Monitoring Tools

June 17th, 2009 by admin

This is simply a list of useful tools available on most Linux systems that can provide useful and in depth information to aid in application performance debugging and tuning. I add to it as and when I find and use new things.

Monitoring network traffic on a machine or LAN - use ntop.
It comes with an internal webserver out of the box that can be used to view the collected stats.
To start ntop with a webserver listening on port 9090:
ntop -w 9090

Monitoring Memory and Paging stats - use vmstat:
vmstat -a 1
(-a shows active & inactive memory)

Monitoring Disk IO - use iostat:
iostat -x -d 1 10

Determine file type & information - use file:
file filename
(attempts to show file information such as encoding & compression type)

How long does something take to execute? - use time:
time wget http://www.needsa303.com

JVM Shutdown Hooks

June 10th, 2009 by admin

Today I discovered a really handy tool  in the JVM: the  shutdown hook.

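Runtime.getRuntime().addShutdownHook(new Thread() {
    @Override
    public void run() {
        // do your cleanup logic here,
        // e.g. close db connections
        try {
            conn.close();
        } catch (Exception e) {
            // little can be done this late in shutdown, so just report it
            e.printStackTrace();
        }
    }
});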

These are particularly useful if you’re writing a server application which uses some persistent storage (in my case it was Berkeley DB) and you want to ensure that you close all file handles / commit any extant transactions before the JVM terminates.

This callback will be invoked on normal System exit as well as on user interrupts (e.g. Ctrl-C), which makes it really handy!

Converting file encoding using iconv

June 6th, 2009 by admin

To convert a file from one encoding to another use  a great shell tool called iconv.

So, to convert from ISO-8859-1 to UTF-8:

iconv --from-code=ISO-8859-1 --to-code=UTF-8 ./oldfile > ./newfile

This can be especially useful when handling text which has been downloaded from websites in varying encodings but which needs to be handled uniformly in UTF-8.
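
When there is a whole directory of downloaded files to convert, the same command can be run in a loop. A minimal sketch, assuming all the files share one source encoding. The function name to_utf8 and the .utf8 suffix are made up for illustration, and -f / -t are the short forms of --from-code / --to-code:

```shell
# to_utf8 FROM_ENCODING FILE...: write a UTF-8 copy of each FILE
# next to the original as FILE.utf8. (Function name and suffix are
# illustrative choices.)
to_utf8() {
    from=$1
    shift
    for f in "$@"; do
        # stop at the first file iconv cannot convert
        iconv -f "$from" -t UTF-8 "$f" > "$f.utf8" || return 1
    done
}
```

Usage: to_utf8 ISO-8859-1 *.html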

Configuring Java Beans using BeanShell

June 4th, 2009 by Joel

Inversion of control and dependency injection frameworks have been in use for some years as a way of configuring and managing  beans  in Java applications. Frameworks such as Spring, Pico, Hivemind and Guice generally work well enough but often rely on cumbersome configuration or opaque annotations. An alternate way of  managing beans, simply and transparently, is by  using a dynamic scripting language. A dynamic scripting language that can run within an interpreter in the JVM can be used  as the basis for a bean factory which can perform dependency injection and control runtime configuration. In this example we use BeanShell but the same principle can be extended to other scripting languages such as Jython, JRuby and Groovy.

Objectives

  • Clean bean configuration for both “Singleton” and “Prototype” beans
  • Bean dependency injection
  • Simple way of expressing differences in configuration between environments (e.g. dev, qa, prod)

Why Bean Shell?

BeanShell is a dynamic scripting language that runs within an interpreter in the JVM. It uses a syntax familiar to Java programmers. I selected it because of its simplicity and familiarity. By the looks of the BeanShell mailing list and source code repository it seems that activity on the project has dwindled. However, there is a branch of the code being developed here. Regardless of activity, BeanShell just works and is very easy to learn and integrate into a project. The only requirement is the bsh jar file. Once downloaded it’s worth starting up the BeanShell Console and running some example code. The docs are comprehensive enough. In writing our bean factory we will make use of the following BeanShell features:

  1. Calling the BeanShell interpreter from Java - docs
  2. Method Closures and scripted objects - docs

Defining Beans - beans.bsh

Beans and their configuration are defined within a BeanShell configuration file which we name beans.bsh. This file will be loaded from the classpath into an in-process BeanShell interpreter. It will contain all the necessary information for constructing each bean i.e.  singleton or prototype, configuration parameters, other bean dependencies. The first bean we define is a JDBC DataSource, as a singleton.

// File beans.bsh

// JDBC DataSource - Singleton Bean
dataSourceDef () {

    // JDBC Datasource,  singleton.
    ds = new com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource();
    ds.databaseName = "test";
    ds.port = 3306;
    ds.serverName = "localhost";
    ds.user = "test";
    ds.password = "test";

    create() { return ds; }

    return this;
}
dataSource = dataSourceDef();

The bean definition should be self-explanatory: we are defining our data source to be an instance of MysqlConnectionPoolDataSource and are providing the required configuration settings. This configuration is contained within a BeanShell closure. We have defined a method called dataSourceDef() that returns an instance of itself. This instance holds a reference to a datasource object. The create() method will be used to return the datasource instance. The reason for using the closure is not yet obvious, since we could just as well have defined the bean more simply as follows, removing the closure altogether:

dataSource = new com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource();
dataSource.databaseName = "test";
dataSource.port = 3306;
dataSource.serverName = "localhost";
dataSource.user = "test";
dataSource.password = "test";

The reason for using the level of indirection allowed by the closure and the create() method will be demonstrated when we create our first prototype bean. Our bean definition is now complete, so we can proceed to write the code that will allow us to access it.

Loading Beans - BeanFactory.java

To load beans  defined in beans.bsh we need to create a Bean Factory that can start a BeanShell interpreter, source beans.bsh  and use it to construct the appropriate Java objects.  BeanFactory.java does this for us:

package demo.bsh.beans;
import org.apache.log4j.Logger;

import bsh.EvalError;
import bsh.Interpreter;

/**
* A singleton class used to create beans defined in a
* bean shell configuration file.
*/
public class BeanFactory {

  private static BeanFactory instance = getInstance();
  private Interpreter i;

  public static synchronized BeanFactory getInstance() {
    if (instance == null) {
        instance = new BeanFactory();
        instance.init();
    }

    return instance;
  }

  private BeanFactory() {}

  private void init() {
    try {
    // global variables - substituted into config
    boolean dev = Boolean.parseBoolean(System.getProperty("dev", "true"));

    // Construct an interpreter
    i = new Interpreter();
    i.set("dev", dev);

    // Source the configuration script file
    i.source("config/beans.bsh");

    } catch (Exception e) {
       throw new RuntimeException(e);
    }
  }

  public synchronized Object getBean(String name) {
    try {
        return i.eval(name + ".create()");
    } catch (EvalError e) {
        throw new RuntimeException(e);
    }
  }
}

The job of the singleton class BeanFactory is to 1) construct a BeanShell Interpreter and 2)  use the interpreter to look up beans, keyed by name.

Creating the bean shell interpreter

The following four lines of code in the BeanFactory.init() method take care of creating the bean shell interpreter and seeding it with our config file, beans.bsh. It is important that they are executed from the init() method, after constructor initialisation, to avoid circular dependencies - beans that depend on the BeanFactory - and the creation of multiple BeanFactory instances. By initialising our beans within the init() method we sidestep this issue.

// a variable that we use to determine runtime mode
// e.g. Prod, QA, Dev. Defaults to Dev.
boolean dev = Boolean.parseBoolean(System.getProperty("dev", "true"));

// Construct an interpreter
i = new Interpreter();
// Set the runtime mode on the interpreter
// this variable can now be used in our bean shell configuration
i.set("dev", dev);

// Source the configuration script file
i.source("config/beans.bsh");

Instantiating the bean shell environment is a simple matter of creating an instance of bsh.Interpreter and  sourcing our bean configuration file. We take the additional step of seeding the interpreter with a boolean variable, dev, which we will later be able to use within beans.bsh to construct conditional configuration statements of the type:

if (dev) {
 // use bean  X
} else {
 // use bean Y
}

Loading Beans

public synchronized Object getBean(String name) {
    try {
        return i.eval(name + ".create()");
    } catch (EvalError e) {
        throw new RuntimeException(e);
    }
}

To create a bean we use a simple method that evaluates the statement <beanname>.create() on the interpreter. We will need to cast the bean object when we call this method. The method is synchronized to ensure that singleton beans are only initialised once (we could quite easily remove the synchronisation on this method, for a slight performance boost, by simply pre-loading all the singleton beans before we allow the user to use them).

Putting it all together

To instantiate our datasource bean using the BeanFactory class requires the following 2 lines of code:

BeanFactory config = BeanFactory.getInstance();
DataSource ds = (DataSource) config.getBean("dataSource");

Our DataSource bean was an example of a simple bean with no dependencies. We can now use our BeanFactory to construct a more complicated bean.

Bean Injection

We can now use our BeanFactory and beans.bsh configuration to construct a bean that has a dependency on another bean, using dependency injection. Our new bean is a PersonDao. Its dependency is the DataSource bean we created previously.

Note that this bean does not implement an interface. Beans are not required to do so, so for brevity we don’t bother.

Next the bean configuration in beans.bsh

// bean.bsh configuration for PersonDao
// DAO - Singleton Bean
personDaoDef () {

    // Singleton DAO Bean.
    pdao = new demo.bsh.beans.PersonDaoImpl();
    pdao.dataSource = dataSource.create();

    create() { return pdao;    }

    return this;
}
personDao = personDaoDef();

It should be clear that the configurations of PersonDao and DataSource are near identical, with the exception that the PersonDao bean satisfies its dependency on DataSource with the following simple line of configuration.

pdao.dataSource = dataSource.create();

That’s all that’s required to wire up and inject our personDao with a datasource bean.

We can acquire an instance of a PersonDao as follows:

BeanFactory config = BeanFactory.getInstance();
PersonDao pdao = (PersonDao) config.getBean("personDao");
pdao.add("Fred");

Instead, however, we will once again use dependency injection and our bean configuration to complete the example by wiring up a Person object that uses PersonDao.

Putting it all Together

Our final example is to create a Person object. Person beans have the following characteristics:

  1. They are prototype beans (i.e. not singletons)
  2. Each person has a name
  3. Once a person has been created it can be saved - and so has a dependency on PersonDao - and implicitly DataSource

The Person class:

package demo.bsh.beans;

import java.util.Date;

public class Person {

	private String name;
	private Date creationTime;

	private PersonDao dao;

	public void save() {
		dao.add(name);
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public void setDao(PersonDao dao) {
		this.dao = dao;
	}
	public PersonDao getDao() {
		return dao;
	}

	public Date getCreationTime() {
		return creationTime;
	}

	public void setCreationTime(Date creationTime) {
		this.creationTime = creationTime;
	}
}

The Person bean configuration:

// Person Model Object - Prototype Bean (1 per request)
personDef () {
    create() {
        p = new demo.bsh.beans.Person();
        p.dao = personDao.create();
        p.creationTime = new Date(); // set creation time to NOW
        return p;   
    }
    return this;
}
person = personDef();

Since this all happens within the create() method a new instance of Person will be created each time one is requested. That’s all the configuration it takes to wire up the Person bean. We have specified in our configuration for Person that we need to inject a PersonDao, and additionally we want to set the creation time to NOW. It is for this reason that we use method closures and the create() method to access our beans;  it allows for a level of  indirection necessary to make this choice between Singleton & Prototype in the beans.bsh  file.

Using the Person bean is simple:

BeanFactory config = BeanFactory.getInstance();

Person foo = (Person) config.getBean("person");
foo.setName("Foo");
foo.save();

Thread.sleep(2000);

Person bar = (Person) config.getBean("person");
bar.setName("Bar");
bar.save();

 // t
log.info(foo.getName() + " was created at " + foo.getCreationTime());
// t + 2s
log.info(bar.getName() + " was created at " + bar.getCreationTime());

Environmental Configuration

Our final requirement is to  configure beans dependent on the environment in which they are being used e.g. DEV, QA, PROD.

To do so we seed our bean shell interpreter with a variable that we can use, in beans.bsh, to distinguish between dev, and all other environments:

boolean dev = Boolean.parseBoolean(System.getProperty("dev", "true"));

// Construct an interpreter
i = new Interpreter();
// seed the environment
i.set("dev", dev);

We could easily extend this principle to have variables called QA, PROD or even host names.

Once the interpreter has been seeded we can re-write our bean shell configuration for DataSource:

// beans.bsh

// JDBC DataSource - Singleton Bean
dataSourceDef () {

  // JDBC Datasource,  singleton.
  ds = new  com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource();
  ds.port = 3306;
  ds.serverName = "localhost";
  if (dev) {
     ds.databaseName = "test";
     ds.user = "test";
     ds.password = "test";
  } else {
     ds.databaseName = "proddb";
     ds.user = "produser";
     ds.password = "prodpasswd";
  }

  create() { return ds; }
  return this;
}
dataSource = dataSourceDef();

Environmental configuration can now be consolidated within the bean configuration, simply and easily.  This makes it more transparent and easier to manage than some alternative solutions.

Controlling Heritrix programmatically using a JMX Client

May 29th, 2009 by admin

Heritrix is a highly configurable web crawler written in Java, developed by the  team that run the Internet Archive. It is a good alternative to Nutch, especially if you are not necessarily interested in indexing crawled data using Lucene - which Nutch is wired up to do by default.

Heritrix instances can be controlled in one of 2 ways: via a web interface or via JMX (I believe further releases will also be including a web services interface).

Connecting programmatically via JMX is useful, especially if you want to automate various aspects of the crawl cycle. It is slightly fiddly though. Below is a self-contained example of how to do this. It assumes that Heritrix has been started with a Web Console, something like:

xxx@yyy:~/dev/heretrix/heritrix-1.14.3$ ./bin/heritrix --admin=LOGIN:PASSWORD
Fri May 29 10:49:27 BST 2009 Starting heritrix.
Heritrix 1.14.3 is running.
Web console is at: http://127.0.0.1:8080
Web console login and password: LOGIN/PASSWORD

This setup will automatically start a JMX service, using the PASSWORD specified above as the login credential for both the JMX monitorRole and controlRole users. By default, Heritrix listens for JMX clients on port 8849.

// Imports needed by this snippet
import java.util.Hashtable;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connection details
JMXServiceURL u = new JMXServiceURL(
    "service:jmx:rmi://localhost/jndi/rmi://localhost:8849/jmxrmi");
Hashtable<String, Object> h = new Hashtable<String, Object>();
String[] credentials = new String[] {"controlRole", "PASSWORD"};
h.put("jmx.remote.credentials", credentials);

// Establish the JMX connection.
JMXConnector jmxc = JMXConnectorFactory.connect(u, h);
// can inspect object name information using JConsole
ObjectName objectName = new ObjectName(
    "org.archive.crawler:jmxport=8849,name=Heritrix,type=CrawlService,guiport=8080,host=YOURHOST");

// direct method invocation to get attribute Status
String status = (String) jmxc.getMBeanServerConnection()
    .getAttribute(objectName, "Status");
System.out.println("Status is: " + status);
// e.g. Status is: isRunning=false isCrawling=false alertCount=0 newAlertCount=0

// Release the connection once we are done
jmxc.close();

This should connect to a Heritrix server running on localhost and output its current status.

The MBean object name to use is:

org.archive.crawler:jmxport=8849,name=Heritrix,type=CrawlService,guiport=8080,host=YOURHOST

I simply looked it up using JConsole. IMPORTANT: you will need to replace the YOURHOST string with the name of your host. This will generally be your hostname, rather than simply localhost.

One final observation is that we were unable to use Proxy stubs to invoke methods on our MBeans - even though this is the preferred JMX way. This is because the Heritrix MBean is an instance of org.archive.crawler.Heritrix, which is itself a concrete class that does not implement any interface.

Bash Tools - Word Counting

May 22nd, 2009 by admin

Count the number of occurrences of each unique word in a file:

sed 's/ /\n/g' filename | sort | uniq -c | sort -nr

This crude method places each word on a new line and sorts them before counting the unique words.
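
The sed version relies on sed treating \n in the replacement as a newline, which GNU sed does but some other seds do not, and it only splits on single spaces. A slightly more forgiving sketch uses tr to split on any run of whitespace (the function name word_counts is made up for illustration):

```shell
# word_counts FILE: print "count word" pairs, most frequent first.
# (Function name is illustrative.)
word_counts() {
    # one word per line, drop empty lines, then count duplicates
    tr -s '[:space:]' '\n' < "$1" \
        | sed '/^$/d' \
        | sort | uniq -c | sort -nr
}
```

Usage: word_counts filename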

Count the number of words on each line in a file:

while IFS= read -r line; do
        echo "$line" | wc -w
done < "$1"

where $1 is the file to count. It’s important to quote "$line" so that the shell does not split or expand its contents.
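
For large files the loop is slow because it starts a wc process per line. awk already tracks the number of whitespace-separated fields on each line in NF, so a one-line alternative (wrapped in a function whose name is made up for illustration) is:

```shell
# words_per_line FILE: print the number of whitespace-separated
# words on each line of FILE. (Function name is illustrative.)
words_per_line() {
    awk '{ print NF }' "$1"
}
```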

Count the maximum number of words on any line in a file:

max=0
while IFS= read -r line; do
        tmp=$(echo "$line" | wc -w)
        if [ "$tmp" -gt "$max" ]; then
                max=$tmp
                #echo $max words on: $line
        fi
done < "$1"

echo "$max"
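
The same maximum can be found in a single awk pass by keeping the largest NF seen so far. Again a sketch, with a made-up function name:

```shell
# max_words_per_line FILE: print the largest word count found on
# any single line of FILE. (Function name is illustrative.)
max_words_per_line() {
    # "max + 0" makes an empty file print 0 instead of an empty string
    awk 'NF > max { max = NF } END { print max + 0 }' "$1"
}
```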