The Kevin Dolan » test results http://thekevindolan.com Putting the Kev in Dolan since 2009! Sun, 15 Aug 2010 00:40:56 +0000 en hourly 1 http://wordpress.org/?v=3.0 Using the News! http://thekevindolan.com/2010/02/using-the-news/ http://thekevindolan.com/2010/02/using-the-news/#comments Fri, 26 Feb 2010 23:12:01 +0000 Kevin http://thekevindolan.com/?p=797 newspaperLast time, I tried out a couple different methods to assign sentiment to news articles, and found that the best performance seemed to come from using my Temporal Interference method initialized by zeroes.  Well there’s a little more information available to us, and that’s the news article content themselves!

So the basic idea here is to train a model using Temporal Interference, and then use that model to score each news article, and use the scores as the new Perturbation model.  This would potentially lead to an iterative process, but should eventually converge.  Of course, there wasn’t a particularly clear way to do this, so I tried several.

For this first set of results, I tried various orders of modeling on the data at its actual duration value.  For time’s sake I skipped data set 5 due to its size.  Only results for 1-4 and 6 are shown:

Z,TI: 0.944, 0.902, 0.887, 0.852, 0.889, 0.708
Z,TI,NS: 0.858, 0.880, 0.810, 0.850, 0.854, 0.749
Z,TI,NS,TI: 0.951, 0.921, 0.907, 0.858, 0.744
Z,TI,NS,NS: 0.821, 0.853, 0.524, 0.761, 0.513
Z,TI,NS,NS,TI: 0.947, 0.906, 0.890, 0.853, 0.716
Z,TI,NS,TI,NS: 0.859, 0.880, 0.813, 0.848, 0.763
Z,TI,(NS,TI)x2: 0.951, 0.921, 0.907, 0.858, 0.747

What we learn here is that incorporating news information into the model does have an advantage.  For the most part, we saw optimum performance alternating News and Temporal Interference, ending with Temporal Interference, however, in the case of data set 6, we saw the best performance when we ended with the News Analysis.  This was somewhat counter-intuitive.

Also, with most data sets, the progress stabilized after just one iteration, however, this was not the case with data set 6.  I tried adding two more steps.  Adding another News step gave us an accuracy of 0.765, and adding another Temporal Interference step brought us back to 0.747.  Considering the kind of information made available to the system, this is somewhat impressive, but really goes to show why my initial attempts at this definitely would not have ever worked.

I then considered what would happen if we did not know ahead of time what kind of duration to expect for perturbations.  I did these studies on data set 3 only.  If we reduce the duration to just 49, we see a significant drop in accuracy with Temporal Interference alone to 0.692.  If we keep alternately adding NS/TI steps, we see the accuracy rise to 0.739, 0.702, 0.745, 0.703, 0.747, 0.703.  Interestingly, this one also performs best when we end with the News analysis.

If we reduce the duration to 40, we see the accuracy decrease to 0.642.  Running 2 NS/TI loops and ending with NS, we find the accuracy actually decrease to 0.624.  Odd indeed!

For duration 51, we see a drop to 0.712, and  then an improvement to 0.776.

For duration 60, we see a drop to 0.658, and then an improvement to 0.726.

Clearly something odd is going on here where we do not know the duration, and unfortunately, this is the case in real life.  So we need to do some further thinking about Perturbation modeling if we want to move forward.

]]>
http://thekevindolan.com/2010/02/using-the-news/feed/ 0
Perturbation Modeling, Initial Approach http://thekevindolan.com/2010/02/perturbation-modeling/ http://thekevindolan.com/2010/02/perturbation-modeling/#comments Thu, 25 Feb 2010 17:45:53 +0000 Kevin http://thekevindolan.com/?p=790 1235935919

Last time I introduced some test data, and before that I formalized the Perturbation Model for Price Moves a bit further.  Well this required me to rewrite the code I had written before for Sentiment analysis.  I took advantage of interval trees to make my code fairly efficient, and also changed the way I initialize the price movements, yielding minor improvements over the naive methods.

The basic idea was that I created an object Perturbation, which has fields for influence, duration, and start time.  Start time was a controversial decision for inclusion, but I ultimately decided it was best.

The Sentiment interface has been replaced by Modeler and contains one method:

public Perturbation[] getSentiment(Ticker ticker, int index);

As far as the initialization algorithm, I had previously initialized all sentiments to the price move observed in the interval of the news article.  The biggest problem I had with this was that a corpus of just two news articles at exactly the same time would initialize to the price move observed in that time for both, when in reality, it should be half that.

So basically the new method works by splitting the data into elementary intervals by the distribution of news articles.  For each elementary interval, the price move observed is distributed amongst all relevant perturbations, and then the resulting assignment is a weighted average of all of these observed price moves.  I called this Elementary Average.

The code is below:

public Perturbation[] getSentiment(Ticker ticker, int index) {
	NewsCorpus corpus = ticker.getCorpus();
	DataHistory data = ticker.getDataHistory(index);
	List<NewsStory> newsList = corpus.getNews();
	Perturbation[] perturbations = new Perturbation[newsList.size()];
	IntervalTree<NewsStory> intervalTree = new IntervalTree<NewsStory>();
	SortedSet<Long> endpoints = new TreeSet<Long>();
 
	for(NewsStory story : newsList) {
		perturbations[story.getId()] = new Perturbation(0,story.getTime(),forecast);
		intervalTree.addInterval(story.getTime(), story.getTime() + forecast, story);
		endpoints.add(story.getTime());
		endpoints.add(story.getTime() + forecast);
	}
 
	Long last = null;
	for(Long next : endpoints) {
		if(last != null) {
			List<NewsStory> stories = intervalTree.get(last, next);
 
			double price0 = data.getData(last);
			double price1 = data.getData(next);
			double priceMove = price1 - price0;
 
			//System.out.println(priceMove);
 
			double distributed = priceMove / stories.size();
 
			double length = next - last;
			double proportion = length / forecast / forecast;
 
			//System.out.println(stories.size());
 
			for(NewsStory story : stories) {
				double current = perturbations[story.getId()].getInfluence();
				perturbations[story.getId()].setInfluence(current + proportion * distributed);
			}
		}
		last = next;
	}
 
	return perturbations;
}

Temporal interference works essentially the same as before, but now takes advantage of interval trees, making it significantly faster.  Also, I added a parameter for learning rate.  This allows large changes in the estimates initially, but slowly limits the amount that can change, which guarantees convergence, which was not necessarily guaranteed before.

The code is below:

public Perturbation[] getSentiment(Ticker ticker, int index) {
	NewsCorpus corpus = ticker.getCorpus();
	DataHistory data = ticker.getDataHistory(index);
	List<NewsStory> newsList = corpus.getNews();
	IntervalTree<NewsStory> intervalTree = new IntervalTree<NewsStory>();
	Perturbation[] perturbations = initializer.getSentiment(ticker, index);
 
	for(NewsStory story : newsList)
		intervalTree.addInterval(story.getTime(), story.getTime() + forecast, story);
 
	double maxError = Double.MAX_VALUE;
	int iteration = 0;
	while(maxError > epsilon) {
		double rate = Math.exp(-iteration*learningRate);
		maxError = 0;
		for(NewsStory story : newsList) {				
			double price0 = data.getData(story.getTime());
			double price1 = data.getData(story.getTime() + forecast);
			double priceMove = price1 - price0;
 
 
			List<NewsStory> interferons = intervalTree.get(story.getTime(), story.getTime() + forecast);
			double sumInterference = 0;
			for(NewsStory interferon : interferons) {
				long intersection = forecast - Math.abs(interferon.getTime() - story.getTime());
				double interference = perturbations[interferon.getId()].getInfluence() * intersection;
				sumInterference += interference;
			}
 
			double old = perturbations[story.getId()].getInfluence();
			double error = priceMove - sumInterference;
			perturbations[story.getId()].setInfluence((old*forecast+rate*error)/forecast);
 
			maxError = Math.max(Math.abs(rate*error), maxError);
		}
		iteration++;
	}
 
	return perturbations;
}

Of course, how well does this new method perform compared with the old method? For a baseline, I wrote a new NaivePriceMove Modeler, and analyzed the results for the 5 data sets.  I used the correct duration for all of these tests.

Accuracy from Naive Price Move alone was: 0.637, 0.614, 0.566, 0.553, 0.568 respectively.  Error was:  1.42, 10.83, 15.51, 6.32, 9.60!

Accuracy from Elementary Average alone was: 0.637, 0.613, 0.569, 0.553, 0.568 — almost identical.  Error was: 0.312, 0.335, 0.336, 0.322, 0.328.

Note that though accuracy was almost identical, error (average squared) was significantly reduced in the case of Elementary Average–also more consistent.

Let’s see how temporal interference performs as initialized by either.

Temporal Interference initialized by Naive Price Move was: 0.944, 0.902, 0.887, 0.852, 0.889.  Error was: 0.032, 0.065, 0.114, 0.097, 0.070.

Temporal Interference initialized by Elementary Average was: 0.943, 0.902, 0.887, 0.852, 0.889.  Error was: 0.033, 0.065, 0.114, 0.098, 0.071.

As we can see, the results are almost identical.  This was interesting to me, and made me wonder whether or not the initial values even matter.  So I tried two more experiments, one in which I initialized the influences to 0.  The other where I initialized to random values between -1 and 1.

In the case of zeroes, the accuracy was obviously 0, and the error floated around 0.33 pretty closely.  When we initialize T.I. with zeroes, there is no change in accuracy or error at all.

In the case of random initialization, the accuracy was around 0.5, and the error floated around 0.66 pretty closely– as expected.  When we initialize T.I. with random values, we did see some change.  Accuracies exhibited were: 0.889, 0.864, 0.829, 0.799, 0.837.  Errors were: 0.064, 0.108, 0.179, 0.156, 0.117.

So it doesn’t converge to some value inherent to the system regardless of what we initialize to, but it does depend on what we seed, however, only to some degree.  I think the the zeroes case reduces to initializing with Naive Price move, but I may be wrong.  Also, Naive Price Move and Elementary Average both tend to result in the same direction of price move, so maybe that’s important.

There is one more test to run, and that’s a throwback to whether concurrent news articles introduce any major problems.  I generated another data set that creates two news articles at each time point.  Because there is no way to tease apart which one is good and which one is bad here, we expect low performance.  But we want to find out if one method performs better than the other on this particular test.

I used T.I. for all tests, and varied the initializer.  Initialized with zeroes, we saw an accuracy of 0.709, error of 0.280.  Initialized with random, we saw an accuracy of 0.609, error of 0.548.  Initialized with N.P.M, accuracy 0.517, error 12.619.  Initialized with E.A., accuracy 0.708, error 0.280.

Interesting!  Initialing with N.P.M. appears, as expected, to have disastrous results when there are news articles at the same time.  It actually performs worse than random initialization!  Also, note that zero-initialization seems to reduce to E.A., rather than N.P.M. as I had initially anticipated.

Now there is one final question to this whole analysis, efficiency.  Perhaps one method converges more quickly than another.  I counted the number of iterations taken to solve data set 4, initialized by different methods.  Zeroes: 388.  Random: 379.  N.P.M.: 385.  E.A.: 388.

It appears that there were no major differences in runtime.  Perhaps it is more strongly related to the learning rate than anything else.  Again, we see zeroes equaling E.A., indicating that perhaps the best idea is to simply initialize with zeroes strictly for code simplicity.

]]>
http://thekevindolan.com/2010/02/perturbation-modeling/feed/ 2
Two Approaches to News Rating http://thekevindolan.com/2010/02/two-approaches-to-news-rating/ http://thekevindolan.com/2010/02/two-approaches-to-news-rating/#comments Wed, 03 Feb 2010 23:03:39 +0000 Kevin http://thekevindolan.com/?p=697 flan

Last time, I generated a few data sets for testing different methods of rating the training news articles.  This time, I actually implemented two of them, the naive approach I had used before, and the new-and-improved version taking into account temporal interference.

Let us first examine the architecture I am using.  I created an interface Sentiment, which is to be implemented by classes that take a ticker and can find some sentiment estimations from it.  The signature for the method is as follows:

/**
* Get the sentiment for the news articles in a ticker
* @param ticker   the ticker to inspect
* @param forecast the forecast parameter
* @param index	   the data index to inspect
* @return		 an array where
* 					array[0][i] = influence of i
* 					array[1][i] = time-frame of i, in terms of multiples of forecast
*/
public double[][] getSentiment(Ticker ticker, long forecast, int index);

The idea behind doing it this way, returning a two dimensional array, is that I may eventually want to implement a more complex learning method that can also tease apart what scale it is that a news article has an effect, since I imagine some news articles represent more long-term impact, while others represent more short-term impact.  But that’s for the future.

The implementation for the naive approach is reproduced below:

public double[][] getSentiment(Ticker ticker, long forecast, int index) {
	DataHistory data = ticker.getDataHistory(index);
	NewsCorpus corpus = ticker.getCorpus();
 
	double[][] result = new double[2][corpus.getCount()];
	for(int i = 0; i &lt; corpus.getCount(); i++) {
		NewsStory story = corpus.getNews(i);
		long time = story.getTime();
		double price1 = data.getData(time);
		time += forecast;
		double price2 = data.getData(time);
 
		result[0][i] = (price2 - price1) / forecast;
		result[1][i] = 1;
	}
	return result;
}

The essential idea is that you just look ahead and see how the price moved some time after the news article. This was the approach I had used on my original project. Having now had the chance to test this method on my theoretically generated data, I can see just how ineffective this is.

For data set 1, the performance (in terms of proportion of sentiments in which the correct sign was estimated), was 0.84. That value is not entirely bad, and would be acceptable, except we find that the data set is actually quite simple. For the other sets, the performance was 0.72, 0.62, 0.56, and 0.70 respectively. On the hard data set (#4), the value was only slightly better than chance by random guessing.

I also tried to see what would happen on a larger data set, a newly generated one, which can be found here.  This corpus had a time-frame of 50 and a time-step of 1, making it similar to data set 4.  Performance on data set 6 was poor, again at about 0.56, indicating the performance is very heavily correlated to the ratio of time-frames to time-steps.

The summary of results for the naive approach can be found here.

It’s no wonder my initial attempts failed so horribly, at the very base of the methodology was a very poorly thought-out concept.

My next attempt was to try the method that teased apart temporal influences.  The general approach was to observe a price change, and then subtract off the effect from all news articles before it, within the forecast parameters.  The value we used to subtract off was an assumed value, initialized to the naive price move.  This approach is then repeated several times, until we come to some steady state.

My initial findings were that this method performed with almost identical accuracy as the previous test.  There had to be something wrong.

It was then that I realized that I also needed to be subtracting off the influence from news articles ahead of the target news article.  This increased the necessary number of iterations, but the results amazed me.

Keep in mind, there is a parameter here, the epsilon value, the acceptable error.  High values of epsilon meant longer run-times, but often more accurate results.

With an epsilon value of 1E-4, the number of iterations required for the 5 main data sets were: 18, 40, 60, 125, 33.

At that epsilon, the accuracy values were: 0.96, 0.93, 0.92, 0.87, 0.99.

Detailed results for this test can be found here.

At that value of epsilon, I was not able to run the test on the data set number 6.  Early anecdotal notes on performance, are that the accuracy seems to also be correlated to the ratio of time-frame to time-step, but does not seem to drop off as quickly.  Number of iterations seems to be almost linear to the same ratio.

Wanting to do some more experiments on performance, I tried setting the value of epsilon to 1E-8.

The first data set required 1387 iterations, but had accuracy 0.97.  The second data set took too long to finish.  The fifth data set took only 189 iterations, and only got a single sentiment wrong.

Reducing the epsilon value to 1E-2, we needed iterations  4, 11, 23, 27, 13 and got accuracies 0.94, 0.90, 0.88, 0.82, 0.94.  I gave the algorithm 3 minutes to do data set 6, and it was still unable to finish.

At 1E-1, or 0.1, we found runtime to be:  3, 8,  20, 21, 11 and accuracy to be: 0.93, 0.90,  0.88, 0.80,  0.94.  Here we start to notice less significant dropoff.  Keep in mind that each iteration probably causes the values to move less and less, and the values at each iteration are deterministic and not dependent on epsilon.  At this epsilon, data set 6 could be processed in 120 iterations (which took about 6 minutes), and got an accuracy of 0.89, which is actually greater than the epsilon for data set 4 at epsilon of 1E-4, but not by much.

Seeing how this one performed, however, led me to discover a wildly inefficient chunk of code that would make this algorithm painfully slow as we increase the number of news articles, the nested for loops that check every single article against every other at each iteration.  I did it this was, because the news articles might not necessarily be sorted, however, this really only needs to be done once, and we can cache the results.

Making this change did not change the results obtained at all, but did make the runtime seriously faster.

At 1E-1, data set 6 was now complete in under 30 seconds.  I then ran data set 6 with different values of epsilon.  At 1E-2, we required 142 iterations, with an accuracy of 0.89.  At 1E-4, we required 221 iterations, with accuracy 0.91.  Feeling good, I tried 1E-8, but lost patience.  At 1E-5, 313 iterations with accuracy 0.92.  At 1E-6: 1173 iterations (a sharp jump!), but the accuracy too increased to 0.94.

You can draw your own conclusions about what’s really going on here from these several data points, I’m not making any real claims yet, but I can imagine there is still a lot to learn.

Here is the code, as it currently exists:

private double epsilon = 1E-8;
 
@SuppressWarnings("unchecked")
@Override
public double[][] getSentiment(Ticker ticker, long forecast, int index) {
	//Get the observed price moves
	double[] priceMove = (new NaivePriceMove()).getSentiment(ticker, forecast, index)[0];
	int count = priceMove.length;
 
	//Create the empty result table
	double[][] result = new double[2][count];
	for(int i = 0; i < count; i++) {
		result[0][i] = priceMove[i];
		result[1][i] = 1;
	}
 
	NewsCorpus corpus = ticker.getCorpus();
 
	//Cache the proximities
	List<Integer>[] proximities = (List<Integer>[]) new List<?>[count];
	for(int current = 0; current < count; current++) {
		proximities[current] = new ArrayList<Integer>();
		long time = corpus.getNews(current).getTime();
		List<NewsStory> range = corpus.getNews(time - forecast + 1, time + forecast - 1);
		//Go through the other news events
		for(NewsStory story : range) {
			int other = story.getId();
			if(other != current) 
				proximities[current].add(other);
		}
	}
 
	//Keep going until we reach an acceptable error
	double error = Double.MAX_VALUE;
	int iter = 0;
	while(error > epsilon) {
		iter++;
		error = 0;
		//Go through each news event
		for(int current = 0; current < count; current++) {
			double observed = forecast * priceMove[current];
			long time = corpus.getNews(current).getTime();
			//Go through the other news events
			for(int other : proximities[current]) {
				long otherTime = corpus.getNews(other).getTime();
				long difference = Math.abs(time - otherTime);
				//Depending on how much overlap, discount the influence of the other
				long effect = forecast - difference;
				observed -= result[0][other] * effect;
			}
			observed /= forecast;
			//Error is max error
			double difference = observed - result[0][current];
			difference *= difference;
			error = Math.max(difference, error);
			result[0][current] = observed;
		}
	}
	System.out.println(iter + " iterations.");
	return result;
}
]]>
http://thekevindolan.com/2010/02/two-approaches-to-news-rating/feed/ 0