The Kevin Dolan » Programming http://thekevindolan.com Putting the Kev in Dolan since 2009! Sun, 15 Aug 2010 00:40:56 +0000 en hourly 1 http://wordpress.org/?v=3.0 Java, I Would Like To See… http://thekevindolan.com/2010/03/java-i-would-like-to-see/ http://thekevindolan.com/2010/03/java-i-would-like-to-see/#comments Sun, 28 Mar 2010 23:04:07 +0000 Kevin http://thekevindolan.com/?p=894 boxing

Everybody knows I really love Java, and I’m a real fan of the new(ish) generic type features in Java, but as I mentioned last time, I have some problems with the way they handle reflection.  Well I have some other problems with Java as well, problems major enough to make me a little angry, but not big enough to make me even consider switching.  This post will just outline some of the things I would like to see implemented in future Java implementations.

I understand the high Java council has probably encountered and considered these suggestions in the past.  They may have plans to implement them at some juncture, or they may have decided that they are bad ideas.  I’d like to preface this post by saying that I really do love Java, and this is NOT a post about any Java weaknesses.  Just some things I occasionally wish I had…

One of the beautiful things about Java is that the language is so elegantly extensible that these features are easy to implement in Java without changing anything whatsoever.  The issue is that these work-around solutions can sometimes be unnecessarily verbose.

What would be really nice would be for Java to allow programmers to modify the syntax of Java to provide shortcuts to workarounds like these.  I think this would significantly improve developmental flow, and wouldn’t need to be more complicated than running a regex-matching engine at compile-time to replace some things with some computed values.  Something like that doesn’t necessarily need to be a Java-native feature but could even manifest itself as an Eclipse plugin.

I certainly understand the potential pitfalls of such a method, with regards to standardization and collaboration in large groups, but it’s something to consider at least for its novelty value.

Parameterized Type Reflection

This is what I wrote about last time, and I provided a work-around using the factory pattern.  Of course, while the factory pattern is a generally accepted means of object-instantiation, it is not entirely natural to programmers who generally use constructors more often.

The problem occurs due to the fact that when an object is parameterized using Java’s generics, the information about how it was parameterized is erased at run-time.  While in general, this is usually not a problem, it has twice vexed me when developing interesting libraries (genetic programming and MVC).

Another issue you see in real practice, which is a kind-of annoying result of this run-time parameterized type erasure comes from casting.

When you want to cast something like:

List< String > stringList = (List< String >) other;

You get a warning of “Type safety: Unchecked cast from Object to List<String>.”  This is a really annoying obvious result of not having this information at run-time, since there’s no way to throw a class-cast exception if you don’t know what class is there!  And I don’t know about you, but I never let warnings creep up in my code for too long, so I end up with all kinds of @SuppressWarnings(“unchecked”) all over my code, NASTY!

I am told that C# does a better job with this, but come on Java, can you give me a really good reason why you aren’t doing this?

See my last post for my workaround to this issue.  As far as a proposed syntax goes for improving this feature, I would just like to see some type of methods in Java.lang.class for accessing the parameterized types.

Typesafe Tuples!

Tuples are a concept that creeps up in a number of programming languages.  I first learned about them in a class about oCaml, but they are in Python and I’m sure a huge number of other languages.  The basic idea is that you can refer to a pair (or triplet or n-tuple) of values, each with its own independent type.

Tuples allow you to return more than one object without having to wrap the result in some other class.  I can see why Java might have resisted doing so, since you lose a little bit of information about what each value means when you use Tuples.  Using a class with named fields, on the other hand, lets you know exactly what you’re getting.

But sometimes, it’s just inconvenient to have to create a new class every time you want the utility of returning more than one piece of information.  Type-safe tuples are really easy to implement on your own, but if you’re even too lazy to do there, there’s a library for it.

This solution can just be pretty verbose.  There is a paper with some proposed syntax here.  You’ll need ACM portal access to view the document.  The specific syntax does not really matter to me, but something convenient would be nice.

Customized Auto-(un)boxing

One nice feature that recently crept up in Java was the auto-(un)boxing feature for converting from primitive types to their boxed object types. For instance, the following code is now legal and behaves as expected in java:

int n = new Integer(5);
Integer x = 4;

This is a pretty convenient feature in object-oriented programming, but I would like to see it optionally applied to classes that you create.  Here’s what I mean, with regards specifically to my MVC library.

I have a class called Path.  One of the constructors of path takes a String as a parameter.  For convenience though, most methods in my library that can take a Path as a parameter, will have an alternate method that takes a String, which is simply parsed and converted into a Path.  Right now, I have to write two methods every time I want to refer to a Path, one which takes Path and one which takes a String.  However, this results in a lot of duplicate code, documentation, and is generally wasteful in terms of effort.

What I would like to see, is some kind-of default conversion for situations like this.  Tell Java that anytime I’m expecting a Path and I get a String instead, do some logic to convert the String into a Path.  It’s as simple as that.

]]>
http://thekevindolan.com/2010/03/java-i-would-like-to-see/feed/ 0
Reflecting Generic Types in Java http://thekevindolan.com/2010/03/reflecting-generic-types/ http://thekevindolan.com/2010/03/reflecting-generic-types/#comments Sun, 28 Mar 2010 01:45:03 +0000 Kevin http://thekevindolan.com/?p=868 erasure-heavenly-action-music-picture-idea-girl-consulting-word-press

My first experience with generics in Java was when I was working on my Darwin G.P. library.  I was excited to find out about and take advantage of generics, but there was one major problem I had with the way they were implemented–they were completely disregarded at run-time!  I never fully understood what the motivation behind that decision was, and I’ve been informed that C# does things differently (have not verified), but I recently discovered an interesting workaround using the Factory pattern.

For those of you who do not know, generics were introduced in Java 1.5 and offer an extra layer of compile-time type-safety to classes.  They are utilized heavily by the Java standard-library collections (like ArrayList, HashMap, etc).  You can define a class like:

public class ValueHolder< Type > {
 
	private Type value;
 
	public ValueHolder(Type value) {
		this.value = value;
	}
 
	public Type getValue() {
		return value;
	}
 
	public void setValue(Type value) {
		this.value = value;
	}
 
}

This all works very nicely, and gives you some fantastic compile-time checking, but the run-time reflection is incredibly weak.

If you are unfamiliar with reflection, reflection gives you some features found in dynamic programming languages. On any class, you can call .class to get a Class object, which you can then use to perform any logic you want on. It’s very nice, but if you try to call .class on a generic class, you’ll be told “Illegal class literal for the type parameter Type.”

It would be nice to have some way of getting the class object for a parameterized type.

Java tells us on this page that:

4.6 Type Erasure

Type erasure is a mapping from types (possibly including parameterized types and type variables) to types (that are never parameterized types or type variables). We write |T| for the erasure of type T. The erasure mapping is defined as follows.

The erasure of a parameterized type (§4.5) G is |G|.
The erasure of a nested type T.C is |T|.C.
The erasure of an array type T[] is |T|[].
The erasure of a type variable (§4.4) is the erasure of its leftmost bound.
The erasure of every other type is the type itself.
The erasure of a method signature s is a signature consisting of the same name as s, and the erasures of all the formal parameter types given in s.

So Java is well-aware that parameterized types are not reified at run-time. I suppose they have their reasons for doing it this way, but it would still sometimes be nice to have this information at run-time.

The solution I came to in past projects was to explicitly pass the class object to the constructor as follows:

public class ValueHolder< Type > {
 
	private Type value;
	private Class< Type > cls;
 
	public ValueHolder(Type value, Class< Type > cls) {
		this.value = value;
		this.cls = cls;
	}
 
	...
 
}

This does work. And in fact, it actually prevents programmers from accidentally breaking something by passing the wrong Class object, since Class itself is parameterized, but still, it makes programmers have to repeat themselves, as follows:

ValueHolder< String > v = new ValueHolder< String >("hello world", String.class);

I recently discovered a new technique, that accomplishes the reification of generic types without forcing programmers to repeat themselves by means of the Factory Pattern. Simply adding a static method of the following signature to your class accomplishes this:

public static <Type> ValueHolder<Type> create(Type value, Class<Type> cls) {
	return new ValueHolder<Type>(value, cls);
}

Now, programmers can instantiate a ValueHolder as follows:

ValueHolder< String > v = ValueHolder.create("hello world", String.class);

It’s really only a minor reduction of the strain put on programmers, but actually does make a pretty big difference when using it.

]]>
http://thekevindolan.com/2010/03/reflecting-generic-types/feed/ 1
Nerdy Lifestyle Changes: Git! http://thekevindolan.com/2010/03/nerdy-lifestyle-changes-git/ http://thekevindolan.com/2010/03/nerdy-lifestyle-changes-git/#comments Wed, 24 Mar 2010 23:06:25 +0000 Kevin http://thekevindolan.com/?p=849 git_two1

More on my series of Nerdy Lifestyle Changes, I’ve started using GIT for my version control, both on my personal projects and group projects.  Git is a distributed version control system, meaning that each computer maintains the set of all changes ever made.  At first, this idea seems pretty radical compared to the more traditional notions of CVS or SVN, but there are some real advantages to using it.

I’m not going to go into any specifics about Git, because there are so many more resources out there that discuss things about Git I haven’t even begun to understand, but I will discuss how it has improved my nerdiness.

I used to be an SVN guy, which meant setting up a repository somewhere every time I wanted to version control a personal project.  This was cumbersome, because really all I wanted to do was take advantage of the ability to make radical changes or deletions, without the need to worry about potentially losing some work I’ve already done.

With Git, however, setting up version control takes a single terminal command.  Branching and merging are incredibly simple, which makes me more likely to take advantage of branching for tasks like fixing bugs or implementing features.

I have only started to really begin utilizing branching, but you could conceivably create a branch for every single major task, with virtually no workflow overhead, which provides some obvious and powerful benefits.

I haven’t had much of a chance to use Git for collaborative projects, but that’s really what it’s designed for.  I’m planning on using it for the production of Fitlia, where I’ll hopefully be collaborating with several others.  For those who want some notion of centralization, there is GitHub, which provides remote git repositories for free for open source projects, and at a reasonable fee for private ones, along with a whole slew of collaborative tools.

For instance, you can find the GitHub page for my newest project DDMVC, which you’ll hear more about later, I’m sure.

I really enjoy using Git for my projects, and it’s given me a chance to play with using the linux terminal a little more.  This page helped me out a lot getting started: Git Basics.

]]>
http://thekevindolan.com/2010/03/nerdy-lifestyle-changes-git/feed/ 0
Nerdy Lifestyle Changes: Test-Driven Development http://thekevindolan.com/2010/03/test-driven-development/ http://thekevindolan.com/2010/03/test-driven-development/#comments Tue, 23 Mar 2010 22:25:54 +0000 Kevin http://thekevindolan.com/?p=839 testing

Another installment in my Nerdy Lifestyle Changes series, which essentially details my discovery of awesome nerdy things everybody else already knows about, but I missed because I have to this point by-and-large avoided speaking to other C.S. majors: test-driven-development.  Everyone’s doing it, I might as well to!

When I was visiting some of the companies I was considering interning at this summer, I noticed one thing a lot of them had in common: a focus on unit-testing.

Unit-testing was generally something teachers encouraged in my classes to this point, but to me, that always meant writing a piece of code, and then maybe playing with some of the features in some random class with a main method, to see if they would do what you expect.  My unit tests were generally thought up on-the-fly as I added features and then deleted immediately afterwards.  This meant that if I made any changes, I would have to go back and rethink my testing.  Most of the time, I wouldn’t even notice bugs until much later.

In general however, I didn’t do it at all.  I would write an entire code-base, and then do some top-down testing at the end.  The projects in class were usually small enough to allow this to happen, and my code architecture was generally separated enough that bugs were easy to locate.

But now that I’m playing with TDD, I can’t even begin to understand how silly I was being.

I finally discovered jUnit, which makes unit testing a real breeze.  We used jUnit on one project sophomore year, but the context was that the teacher provided several test cases to determine whether or not our code was correct, and no real explanation to how it worked was provided.  I was intrigued, but not enough to look into it further.

But now I’m using it heavily.  I can simply set up a whole slew of tests for each and every feature of a class, and then rerun the tests every time I make any changes.  If I’ve introduced any new bugs by the change, I’ll know right away.

As a result, I’m experiencing unbelievable changes in the way I write my code.  I’m now more confident to make huge changes to the implementation of some feature and I’m way more confident that my code actually does what I want it to.  Already, I’ve been able to catch an alarming number of mistakes I would have probably otherwise overlooked until it was too late!

I mentioned a while ago, with regards to testing, that I now have my eclipse set up with two code views.  One code view I reserve specifically for test code.  The amount of test code I write is now actually on a par with the amount of application code I write.  For instance, last week, I wrote 5855 total lines of code (a significant amount of this was documentation, to be fair).  Of that code, 2319 lines were strictly for testing.

*Aside: I discovered the following linux terminal code to count the number of lines of code in a directory… I know measuring progress on a codebase in lines of code is like measuring progress on a plane by tonnage, but it’s sometimes useful:

wc -l `find /dir/name -type f`

While at first glance, and the reason I’ve always been so resistant to TDD, this seems like twice the amount of work, it actually felt like a lot less work, because every time I made a change, I would get immediate feedback on what bugs were introduced.  not only does it improve code correctness, but it also seems to have an effect on workflow efficiency.

But more than anything, it’s all about confidence… and I really dig anything that boosts my confidence.

]]>
http://thekevindolan.com/2010/03/test-driven-development/feed/ 0
Nerdy Lifestyle Changes: Look How Cool My Eclipse Looks! http://thekevindolan.com/2010/03/look-how-cool-my-eclipse-looks/ http://thekevindolan.com/2010/03/look-how-cool-my-eclipse-looks/#comments Sun, 21 Mar 2010 21:11:53 +0000 Kevin http://thekevindolan.com/?p=830 Screenshot-1

We all know that I’m a pretty big nerd.  Really, I’m proud of it.  That being said, I thought it was time I finally customized my eclipse to look the way I wanted it to!So I went out and found an article discussing how to use preset color themes, and went about using the Vibrant Ink preset from that webpage.  I really liked it, but I noticed certain problems with reference highlighting, so I made some further changes.

You can get my presets here: Vibrant Ink Modified.

You may recognize Vibrant Ink from TextMate for mac.

I’ve also recently been learning about some of the trends in industry development, since I’m going to be interning at NextJump this summer.

I’ve taken up Test-Driven-Development, so I wanted my eclipse to reflect the focus on testing.  You’ll notice two code views in my eclipse displayed at all times.  I reserve the right view for test-code only and the left view for whatever other code I’d like.  I enforce a strict 80-character width limit, to ensure that I don’t ever have to side-scroll.

The 80-character width limit was something I always considered a vestige from days of old, but the ability to have two classes on the screen at once provides a real reason to abide.  Eclipse has a preference built-in to provide a width-limit line for reference.

I probably could have gone wider, but I’ve also noticed that it stops me from the bad habit of over-nesting.  Anytime I find myself wrapping too many lines, I know it’s time to either change what I’m doing, or abstract it out, which inevitably makes my code more readable.

I’m still a tabs, not spaces guy, but I’ve changed the display for tabs to the 2-space equivalent, because I think it makes my code look better.

]]>
http://thekevindolan.com/2010/03/look-how-cool-my-eclipse-looks/feed/ 2
Improved MultiMap Implementation http://thekevindolan.com/2010/03/improved-multimap/ http://thekevindolan.com/2010/03/improved-multimap/#comments Fri, 12 Mar 2010 05:14:39 +0000 Kevin http://thekevindolan.com/?p=809 old_map

The other day, I implemented a parameterised MultiMap in Java.  In my project, however, I recently discovered that it would be nice to choose what data structures underlie the MultiMap, since you might have a preference for ordering, runtime, duplicates, etc.  So I made some minor changes that make the MultiMap significantly more powerful.

So I won’t talk about it too long, but the basic idea was that any class can extend the MultiMap class, which is now abstract.  By overriding the two methods: getNewMapInstance() and getNewPostingsInstance() you can use any combination of Maps and Collections you have available, and the MultiMap will assume the properties of those data structures.

For convenience, I’ve implemented a few permutations by default:

MultiHashMap, which uses a HashMap for the map and a HashSet for the postings.  This one gives you constant-time lookups for both the map and the postings, but guarantees no particular ordering and does not allow duplicates.

MultiTreeMap uses a TreeMap and a TreeSet, so you get logarithmic lookups, duplicate prevention, and as an added bonus, traversals are ordered by value!

MultiHashListMap and MultiTreeListMap both use LinkedLists for the postings.  This means the postings allow duplicates, and have an order related to the order in which items were inserted.  This is useful if the order of the values put to a given key really matter, and of course, if you want to allow duplicates.

Reasonably, there are other permutations, but it’s easy enough to just implement them on your own.

Here is the MultiMap0.2.jar and here is the JavaDoc.

]]>
http://thekevindolan.com/2010/03/improved-multimap/feed/ 1
Java Parameterised MultiMap Implementation http://thekevindolan.com/2010/03/java-parameterised-multimap-implementation/ http://thekevindolan.com/2010/03/java-parameterised-multimap-implementation/#comments Wed, 10 Mar 2010 23:40:05 +0000 Kevin http://thekevindolan.com/?p=800 old-world-map

A MultiMap is essentially an augmented data structure for storing key-value pairs where one key can potentially be associated with any number of values.  I’ve accomplished this in the past by using a Map<Key, List<Value>> or some variation in the past, but methods of access are inconvenient and can be a source of many bugs.  Surprisingly, I’ve never gone out of my way to implement what should be a relatively simple data structure, but I finally did it for my most recent project, so I’m making it available to anyone who wants to use it.

NOTE: I have improved the implementation and released a new version which can be found here.

I’ve recently been working on a project using Google Web Toolkit, and started implementing a simple MVC framework when I came upon the need to implement a MultiMap.  At first, I started using my old method as recommended at the bottom of this page, but I wanted it to be a little more convenient and less error prone.

So I started looking for some implementation online, but couldn’t find anything that took advantage of parameterised types, or that offered all the features I wanted.  Also, I’ve recently been dabbling in TDD and wanted to use this as a little more practice before I moved on to bigger-scale projects.

As a result, the MultiMap implementation has been thoroughly tested, and I am quite certain that it works as expected.

You can download the multimap.jar here (there’s only one class and a JUnit test class) or if you prefer you can download the MultiMap.java source file.

The use of the class is incredibly simple and straightforward.  The JavaDoc should cover everything you might want to do.

If you’d rather learn by example (which is sometimes easier), look at this Java file.

]]>
http://thekevindolan.com/2010/03/java-parameterised-multimap-implementation/feed/ 1
Using the News! http://thekevindolan.com/2010/02/using-the-news/ http://thekevindolan.com/2010/02/using-the-news/#comments Fri, 26 Feb 2010 23:12:01 +0000 Kevin http://thekevindolan.com/?p=797 newspaperLast time, I tried out a couple different methods to assign sentiment to news articles, and found that the best performance seemed to come from using my Temporal Interference method initialized by zeroes.  Well there’s a little more information available to us, and that’s the news article content themselves!

So the basic idea here is to train a model using Temporal Interference, and then use that model to score each news article, and use the scores as the new Perturbation model.  This would potentially lead to an iterative process, but should eventually converge.  Of course, there wasn’t a particularly clear way to do this, so I tried several.

For this first set of results, I tried various orders of modeling on the data at its actual duration value.  For time’s sake I skipped data set 5 due to its size.  Only results for 1-4 and 6 are shown:

Z,TI: 0.944, 0.902, 0.887, 0.852, 0.889, 0.708
Z,TI,NS: 0.858, 0.880, 0.810, 0.850, 0.854, 0.749
Z,TI,NS,TI: 0.951, 0.921, 0.907, 0.858, 0.744
Z,TI,NS,NS: 0.821, 0.853, 0.524, 0.761, 0.513
Z,TI,NS,NS,TI: 0.947, 0.906, 0.890, 0.853, 0.716
Z,TI,NS,TI,NS: 0.859, 0.880, 0.813, 0.848, 0.763
Z,TI,(NS,TI)x2: 0.951, 0.921, 0.907, 0.858, 0.747

What we learn here is that incorporating news information into the model does have an advantage.  For the most part, we saw optimum performance alternating News and Temporal Interference, ending with Temporal Interference, however, in the case of data set 6, we saw the best performance when we ended with the News Analysis.  This was somewhat counter-intuitive.

Also, with most data sets, the progress stabilized after just one iteration, however, this was not the case with data set 6.  I tried adding two more steps.  Adding another News step gave us an accuracy of 0.765, and adding another Temporal Interference step brought us back to 0.747.  Considering the kind of information made available to the system, this is somewhat impressive, but really goes to show why my initial attempts at this definitely would not have ever worked.

I then considered what would happen if we did not know ahead of time what kind of duration to expect for perturbations.  I did these studies on data set 3 only.  If we reduce the duration to just 49, we see a significant drop in accuracy with Temporal Interference alone to 0.692.  If we keep alternately adding NS/TI steps, we see the accuracy rise to 0.739, 0.702, 0.745, 0.703, 0.747, 0.703.  Interestingly, this one also performs best when we end with the News analysis.

If we reduce the duration to 40, we see the accuracy decrease to 0.642.  Running 2 NS/TI loops and ending with NS, we find the accuracy actually decrease to 0.624.  Odd indeed!

For duration 51, we see a drop to 0.712, and  then an improvement to 0.776.

For duration 60, we see a drop to 0.658, and then an improvement to 0.726.

Clearly something odd is going on here where we do not know the duration, and unfortunately, this is the case in real life.  So we need to do some further thinking about Perturbation modeling if we want to move forward.

]]>
http://thekevindolan.com/2010/02/using-the-news/feed/ 0
Perturbation Modeling, Initial Approach http://thekevindolan.com/2010/02/perturbation-modeling/ http://thekevindolan.com/2010/02/perturbation-modeling/#comments Thu, 25 Feb 2010 17:45:53 +0000 Kevin http://thekevindolan.com/?p=790 1235935919

Last time I introduced some test data, and before that I formalized the Perturbation Model for Price Moves a bit further.  Well this required me to rewrite the code I had written before for Sentiment analysis.  I took advantage of interval trees to make my code fairly efficient, and also changed the way I initialize the price movements, yielding minor improvements over the naive methods.

The basic idea was that I created an object Perturbation, which has fields for influence, duration, and start time.  Start time was a controversial decision for inclusion, but I ultimately decided it was best.

The Sentiment interface has been replaced by Modeler and contains one method:

public Perturbation[] getSentiment(Ticker ticker, int index);

As far as the initialization algorithm, I had previously initialized all sentiments to the price move observed in the interval of the news article.  The biggest problem I had with this was that a corpus of just two news articles at exactly the same time would initialize to the price move observed in that time for both, when in reality, it should be half that.

So basically the new method works by splitting the data into elementary intervals by the distribution of news articles.  For each elementary interval, the price move observed is distributed amongst all relevant perturbations, and then the resulting assignment is a weighted average of all of these observed price moves.  I called this Elementary Average.

The code is below:

public Perturbation[] getSentiment(Ticker ticker, int index) {
	NewsCorpus corpus = ticker.getCorpus();
	DataHistory data = ticker.getDataHistory(index);
	List<NewsStory> newsList = corpus.getNews();
	Perturbation[] perturbations = new Perturbation[newsList.size()];
	IntervalTree<NewsStory> intervalTree = new IntervalTree<NewsStory>();
	SortedSet<Long> endpoints = new TreeSet<Long>();
 
	for(NewsStory story : newsList) {
		perturbations[story.getId()] = new Perturbation(0,story.getTime(),forecast);
		intervalTree.addInterval(story.getTime(), story.getTime() + forecast, story);
		endpoints.add(story.getTime());
		endpoints.add(story.getTime() + forecast);
	}
 
	Long last = null;
	for(Long next : endpoints) {
		if(last != null) {
			List<NewsStory> stories = intervalTree.get(last, next);
 
			double price0 = data.getData(last);
			double price1 = data.getData(next);
			double priceMove = price1 - price0;
 
			//System.out.println(priceMove);
 
			double distributed = priceMove / stories.size();
 
			double length = next - last;
			double proportion = length / forecast / forecast;
 
			//System.out.println(stories.size());
 
			for(NewsStory story : stories) {
				double current = perturbations[story.getId()].getInfluence();
				perturbations[story.getId()].setInfluence(current + proportion * distributed);
			}
		}
		last = next;
	}
 
	return perturbations;
}

Temporal interference works essentially the same as before, but now takes advantage of interval trees, making it significantly faster.  Also, I added a parameter for learning rate.  This allows large changes in the estimates initially, but slowly limits the amount that can change, which guarantees convergence, which was not necessarily guaranteed before.

The code is below:

public Perturbation[] getSentiment(Ticker ticker, int index) {
	NewsCorpus corpus = ticker.getCorpus();
	DataHistory data = ticker.getDataHistory(index);
	List<NewsStory> newsList = corpus.getNews();
	IntervalTree<NewsStory> intervalTree = new IntervalTree<NewsStory>();
	Perturbation[] perturbations = initializer.getSentiment(ticker, index);
 
	for(NewsStory story : newsList)
		intervalTree.addInterval(story.getTime(), story.getTime() + forecast, story);
 
	double maxError = Double.MAX_VALUE;
	int iteration = 0;
	while(maxError > epsilon) {
		double rate = Math.exp(-iteration*learningRate);
		maxError = 0;
		for(NewsStory story : newsList) {				
			double price0 = data.getData(story.getTime());
			double price1 = data.getData(story.getTime() + forecast);
			double priceMove = price1 - price0;
 
 
			List<NewsStory> interferons = intervalTree.get(story.getTime(), story.getTime() + forecast);
			double sumInterference = 0;
			for(NewsStory interferon : interferons) {
				long intersection = forecast - Math.abs(interferon.getTime() - story.getTime());
				double interference = perturbations[interferon.getId()].getInfluence() * intersection;
				sumInterference += interference;
			}
 
			double old = perturbations[story.getId()].getInfluence();
			double error = priceMove - sumInterference;
			perturbations[story.getId()].setInfluence((old*forecast+rate*error)/forecast);
 
			maxError = Math.max(Math.abs(rate*error), maxError);
		}
		iteration++;
	}
 
	return perturbations;
}

Of course, how well does this new method perform compared with the old method? For a baseline, I wrote a new NaivePriceMove Modeler, and analyzed the results for the 5 data sets.  I used the correct duration for all of these tests.

Accuracy from Naive Price Move alone was: 0.637, 0.614, 0.566, 0.553, 0.568 respectively.  Error was:  1.42, 10.83, 15.51, 6.32, 9.60!

Accuracy from Elementary Average alone was: 0.637, 0.613, 0.569, 0.553, 0.568 — almost identical.  Error was: 0.312, 0.335, 0.336, 0.322, 0.328.

Note that though accuracy was almost identical, error (average squared) was significantly reduced in the case of Elementary Average–also more consistent.

Let’s see how temporal interference performs as initialized by either.

Temporal Interference initialized by Naive Price Move was: 0.944, 0.902, 0.887, 0.852, 0.889.  Error was: 0.032, 0.065, 0.114, 0.097, 0.070.

Temporal Interference initialized by Elementary Average was: 0.943, 0.902, 0.887, 0.852, 0.889.  Error was: 0.033, 0.065, 0.114, 0.098, 0.071.

As we can see, the results are almost identical.  This was interesting to me, and made me wonder whether or not the initial values even matter.  So I tried two more experiments, one in which I initialized the influences to 0.  The other where I initialized to random values between -1 and 1.

In the case of zeroes, the accuracy was obviously 0, and the error floated around 0.33 pretty closely.  When we initialize T.I. with zeroes, there is no change in accuracy or error at all.

In the case of random initialization, the accuracy was around 0.5, and the error floated around 0.66 pretty closely– as expected.  When we initialize T.I. with random values, we did see some change.  Accuracies exhibited were: 0.889, 0.864, 0.829, 0.799, 0.837.  Errors were: 0.064, 0.108, 0.179, 0.156, 0.117.

So it doesn’t converge to some value inherent to the system regardless of what we initialize to, but it does depend on what we seed, however, only to some degree.  I think the the zeroes case reduces to initializing with Naive Price move, but I may be wrong.  Also, Naive Price Move and Elementary Average both tend to result in the same direction of price move, so maybe that’s important.

There is one more test to run, and that’s a throwback to whether concurrent news articles introduce any major problems.  I generated another data set that creates two news articles at each time point.  Because there is no way to tease apart which one is good and which one is bad here, we expect low performance.  But we want to find out if one method performs better than the other on this particular test.

I used T.I. for all tests, and varied the initializer.  Initialized with zeroes, we saw an accuracy of 0.709, error of 0.280.  Initialized with random, we saw an accuracy of 0.609, error of 0.548.  Initialized with N.P.M, accuracy 0.517, error 12.619.  Initialized with E.A., accuracy 0.708, error 0.280.

Interesting!  Initialing with N.P.M. appears, as expected, to have disastrous results when there are news articles at the same time.  It actually performs worse than random initialization!  Also, note that zero-initialization seems to reduce to E.A., rather than N.P.M. as I had initially anticipated.

Now there is one final question to this whole analysis, efficiency.  Perhaps one method converges more quickly than another.  I counted the number of iterations taken to solve data set 4, initialized by different methods.  Zeroes: 388.  Random: 379.  N.P.M.: 385.  E.A.: 388.

It appears that there were no major differences in runtime.  Perhaps it is more strongly related to the learning rate than anything else.  Again, we see zeroes equaling E.A., indicating that perhaps the best idea is to simply initialize with zeroes strictly for code simplicity.

]]>
http://thekevindolan.com/2010/02/perturbation-modeling/feed/ 2
More Test Data http://thekevindolan.com/2010/02/more-test-data/ http://thekevindolan.com/2010/02/more-test-data/#comments Thu, 25 Feb 2010 01:30:40 +0000 Kevin http://thekevindolan.com/?p=786 tumblr_kvk45k0Nam1qznckp

I got back to working on the automated news analysis algorithm again, and thought that it would be wise to generate some new test data that will have some more context to it.  I wrote a simple algorithm that I discuss here, and I generated some data sets.

The basic idea was simple.  I want to in the future use some measure of similarity between documents that is smarter than the traditional tf.idf approach but at the moment, I don’t know which methods to use (as this is a large part of the project as a whole).  That being said, I still need to build a foundation for assigning sentiment to news stories for which we know what happens afterwards!

So a reasonable solution, I think, would be to generate some test data sets that will perform well using tf.idf, and then use that them to isolate the other aspects of the problem.

To generate these news corpuses, I used a method similar to the first method, except I now set the content of the news stories to be a nonsense string of several words.  I built 3 sets of 50 random words, and these sets are good, bad, and neutral.  Neutral words have some constant probability of appearing in any document, and the good/bad words are selected in proportion to influence of the generated news article.

The code I used specifically is reproduced below:

public NewsCorpus generate() {
	NewsCorpus corpus = new NewsCorpus();
	long time = 0;
	for(int i = 0; i &lt; numStories; i++) {
		double positive = Math.random();
		double influence = positive * 2 - 1;
 
		StringBuffer sb = new StringBuffer();
		int numWords = (int) (Math.random() * maxWords);
		for(int j = 0; j &lt; numWords; j++) {
			String[] wordList;
			if(Math.random() &lt; neutralProportion)
				wordList = neutralWords;
			else if(Math.random() &lt; positive)
				wordList = goodWords;
			else
				wordList = badWords;
			sb.append(wordList[(int) (Math.random() * 50)] + " ");
		}
 
		NewsStory story = new NewsStory(time, "Context Corpus Generator",
				"News Story #"+i, "LINEAR;"+influence+";"+timeFrame, sb.toString());
 
		corpus.addNews(story);
		time += timeStep;
	}
	return corpus;
}

I also generated several data sets using this method.  For all data sets, I used a word limit of 50 and a timestep of 1.  The data set summaries are below:

Data Set 1: 1000 articles, timeframe of 10, 0.33 neutral proportion

Data Set 2: 1000 articles, timeframe of 30, 0.33 neutral proportion

Data Set 3: 1000 articles, timeframe of 50, 0.33 neutral proportion

Data Set 4: 1000 articles, timeframe of 50, 0.50 neutral proportion

Data Set 5: 3000 articles, timeframe of 50, 0.50 neutral proportion

]]>
http://thekevindolan.com/2010/02/more-test-data/feed/ 2