The Kevin Dolan » automated news analysis http://thekevindolan.com Putting the Kev in Dolan since 2009! Sun, 15 Aug 2010 00:40:56 +0000 en hourly 1 http://wordpress.org/?v=3.0 Using the News! http://thekevindolan.com/2010/02/using-the-news/ http://thekevindolan.com/2010/02/using-the-news/#comments Fri, 26 Feb 2010 23:12:01 +0000 Kevin http://thekevindolan.com/?p=797 newspaperLast time, I tried out a couple different methods to assign sentiment to news articles, and found that the best performance seemed to come from using my Temporal Interference method initialized by zeroes.  Well there’s a little more information available to us, and that’s the news article content themselves!

So the basic idea here is to train a model using Temporal Interference, and then use that model to score each news article, and use the scores as the new Perturbation model.  This would potentially lead to an iterative process, but should eventually converge.  Of course, there wasn’t a particularly clear way to do this, so I tried several.

For this first set of results, I tried various orders of modeling on the data at its actual duration value.  For time’s sake I skipped data set 5 due to its size.  Only results for 1-4 and 6 are shown:

Z,TI: 0.944, 0.902, 0.887, 0.852, 0.889, 0.708
Z,TI,NS: 0.858, 0.880, 0.810, 0.850, 0.854, 0.749
Z,TI,NS,TI: 0.951, 0.921, 0.907, 0.858, 0.744
Z,TI,NS,NS: 0.821, 0.853, 0.524, 0.761, 0.513
Z,TI,NS,NS,TI: 0.947, 0.906, 0.890, 0.853, 0.716
Z,TI,NS,TI,NS: 0.859, 0.880, 0.813, 0.848, 0.763
Z,TI,(NS,TI)x2: 0.951, 0.921, 0.907, 0.858, 0.747

What we learn here is that incorporating news information into the model does have an advantage.  For the most part, we saw optimum performance alternating News and Temporal Interference, ending with Temporal Interference, however, in the case of data set 6, we saw the best performance when we ended with the News Analysis.  This was somewhat counter-intuitive.

Also, with most data sets, the progress stabilized after just one iteration, however, this was not the case with data set 6.  I tried adding two more steps.  Adding another News step gave us an accuracy of 0.765, and adding another Temporal Interference step brought us back to 0.747.  Considering the kind of information made available to the system, this is somewhat impressive, but really goes to show why my initial attempts at this definitely would not have ever worked.

I then considered what would happen if we did not know ahead of time what kind of duration to expect for perturbations.  I did these studies on data set 3 only.  If we reduce the duration to just 49, we see a significant drop in accuracy with Temporal Interference alone to 0.692.  If we keep alternately adding NS/TI steps, we see the accuracy rise to 0.739, 0.702, 0.745, 0.703, 0.747, 0.703.  Interestingly, this one also performs best when we end with the News analysis.

If we reduce the duration to 40, we see the accuracy decrease to 0.642.  Running 2 NS/TI loops and ending with NS, we find the accuracy actually decrease to 0.624.  Odd indeed!

For duration 51, we see a drop to 0.712, and  then an improvement to 0.776.

For duration 60, we see a drop to 0.658, and then an improvement to 0.726.

Clearly something odd is going on here where we do not know the duration, and unfortunately, this is the case in real life.  So we need to do some further thinking about Perturbation modeling if we want to move forward.

]]>
http://thekevindolan.com/2010/02/using-the-news/feed/ 0
Perturbation Modeling, Initial Approach http://thekevindolan.com/2010/02/perturbation-modeling/ http://thekevindolan.com/2010/02/perturbation-modeling/#comments Thu, 25 Feb 2010 17:45:53 +0000 Kevin http://thekevindolan.com/?p=790 1235935919

Last time I introduced some test data, and before that I formalized the Perturbation Model for Price Moves a bit further.  Well this required me to rewrite the code I had written before for Sentiment analysis.  I took advantage of interval trees to make my code fairly efficient, and also changed the way I initialize the price movements, yielding minor improvements over the naive methods.

The basic idea was that I created an object Perturbation, which has fields for influence, duration, and start time.  Start time was a controversial decision for inclusion, but I ultimately decided it was best.

The Sentiment interface has been replaced by Modeler and contains one method:

public Perturbation[] getSentiment(Ticker ticker, int index);

As far as the initialization algorithm, I had previously initialized all sentiments to the price move observed in the interval of the news article.  The biggest problem I had with this was that a corpus of just two news articles at exactly the same time would initialize to the price move observed in that time for both, when in reality, it should be half that.

So basically the new method works by splitting the data into elementary intervals by the distribution of news articles.  For each elementary interval, the price move observed is distributed amongst all relevant perturbations, and then the resulting assignment is a weighted average of all of these observed price moves.  I called this Elementary Average.

The code is below:

public Perturbation[] getSentiment(Ticker ticker, int index) {
	NewsCorpus corpus = ticker.getCorpus();
	DataHistory data = ticker.getDataHistory(index);
	List<NewsStory> newsList = corpus.getNews();
	Perturbation[] perturbations = new Perturbation[newsList.size()];
	IntervalTree<NewsStory> intervalTree = new IntervalTree<NewsStory>();
	SortedSet<Long> endpoints = new TreeSet<Long>();
 
	for(NewsStory story : newsList) {
		perturbations[story.getId()] = new Perturbation(0,story.getTime(),forecast);
		intervalTree.addInterval(story.getTime(), story.getTime() + forecast, story);
		endpoints.add(story.getTime());
		endpoints.add(story.getTime() + forecast);
	}
 
	Long last = null;
	for(Long next : endpoints) {
		if(last != null) {
			List<NewsStory> stories = intervalTree.get(last, next);
 
			double price0 = data.getData(last);
			double price1 = data.getData(next);
			double priceMove = price1 - price0;
 
			//System.out.println(priceMove);
 
			double distributed = priceMove / stories.size();
 
			double length = next - last;
			double proportion = length / forecast / forecast;
 
			//System.out.println(stories.size());
 
			for(NewsStory story : stories) {
				double current = perturbations[story.getId()].getInfluence();
				perturbations[story.getId()].setInfluence(current + proportion * distributed);
			}
		}
		last = next;
	}
 
	return perturbations;
}

Temporal interference works essentially the same as before, but now takes advantage of interval trees, making it significantly faster.  Also, I added a parameter for learning rate.  This allows large changes in the estimates initially, but slowly limits the amount that can change, which guarantees convergence, which was not necessarily guaranteed before.

The code is below:

public Perturbation[] getSentiment(Ticker ticker, int index) {
	NewsCorpus corpus = ticker.getCorpus();
	DataHistory data = ticker.getDataHistory(index);
	List<NewsStory> newsList = corpus.getNews();
	IntervalTree<NewsStory> intervalTree = new IntervalTree<NewsStory>();
	Perturbation[] perturbations = initializer.getSentiment(ticker, index);
 
	for(NewsStory story : newsList)
		intervalTree.addInterval(story.getTime(), story.getTime() + forecast, story);
 
	double maxError = Double.MAX_VALUE;
	int iteration = 0;
	while(maxError > epsilon) {
		double rate = Math.exp(-iteration*learningRate);
		maxError = 0;
		for(NewsStory story : newsList) {				
			double price0 = data.getData(story.getTime());
			double price1 = data.getData(story.getTime() + forecast);
			double priceMove = price1 - price0;
 
 
			List<NewsStory> interferons = intervalTree.get(story.getTime(), story.getTime() + forecast);
			double sumInterference = 0;
			for(NewsStory interferon : interferons) {
				long intersection = forecast - Math.abs(interferon.getTime() - story.getTime());
				double interference = perturbations[interferon.getId()].getInfluence() * intersection;
				sumInterference += interference;
			}
 
			double old = perturbations[story.getId()].getInfluence();
			double error = priceMove - sumInterference;
			perturbations[story.getId()].setInfluence((old*forecast+rate*error)/forecast);
 
			maxError = Math.max(Math.abs(rate*error), maxError);
		}
		iteration++;
	}
 
	return perturbations;
}

Of course, how well does this new method perform compared with the old method? For a baseline, I wrote a new NaivePriceMove Modeler, and analyzed the results for the 5 data sets.  I used the correct duration for all of these tests.

Accuracy from Naive Price Move alone was: 0.637, 0.614, 0.566, 0.553, 0.568 respectively.  Error was:  1.42, 10.83, 15.51, 6.32, 9.60!

Accuracy from Elementary Average alone was: 0.637, 0.613, 0.569, 0.553, 0.568 — almost identical.  Error was: 0.312, 0.335, 0.336, 0.322, 0.328.

Note that though accuracy was almost identical, error (average squared) was significantly reduced in the case of Elementary Average–also more consistent.

Let’s see how temporal interference performs as initialized by either.

Temporal Interference initialized by Naive Price Move was: 0.944, 0.902, 0.887, 0.852, 0.889.  Error was: 0.032, 0.065, 0.114, 0.097, 0.070.

Temporal Interference initialized by Elementary Average was: 0.943, 0.902, 0.887, 0.852, 0.889.  Error was: 0.033, 0.065, 0.114, 0.098, 0.071.

As we can see, the results are almost identical.  This was interesting to me, and made me wonder whether or not the initial values even matter.  So I tried two more experiments, one in which I initialized the influences to 0.  The other where I initialized to random values between -1 and 1.

In the case of zeroes, the accuracy was obviously 0, and the error floated around 0.33 pretty closely.  When we initialize T.I. with zeroes, there is no change in accuracy or error at all.

In the case of random initialization, the accuracy was around 0.5, and the error floated around 0.66 pretty closely– as expected.  When we initialize T.I. with random values, we did see some change.  Accuracies exhibited were: 0.889, 0.864, 0.829, 0.799, 0.837.  Errors were: 0.064, 0.108, 0.179, 0.156, 0.117.

So it doesn’t converge to some value inherent to the system regardless of what we initialize to, but it does depend on what we seed, however, only to some degree.  I think the the zeroes case reduces to initializing with Naive Price move, but I may be wrong.  Also, Naive Price Move and Elementary Average both tend to result in the same direction of price move, so maybe that’s important.

There is one more test to run, and that’s a throwback to whether concurrent news articles introduce any major problems.  I generated another data set that creates two news articles at each time point.  Because there is no way to tease apart which one is good and which one is bad here, we expect low performance.  But we want to find out if one method performs better than the other on this particular test.

I used T.I. for all tests, and varied the initializer.  Initialized with zeroes, we saw an accuracy of 0.709, error of 0.280.  Initialized with random, we saw an accuracy of 0.609, error of 0.548.  Initialized with N.P.M, accuracy 0.517, error 12.619.  Initialized with E.A., accuracy 0.708, error 0.280.

Interesting!  Initialing with N.P.M. appears, as expected, to have disastrous results when there are news articles at the same time.  It actually performs worse than random initialization!  Also, note that zero-initialization seems to reduce to E.A., rather than N.P.M. as I had initially anticipated.

Now there is one final question to this whole analysis, efficiency.  Perhaps one method converges more quickly than another.  I counted the number of iterations taken to solve data set 4, initialized by different methods.  Zeroes: 388.  Random: 379.  N.P.M.: 385.  E.A.: 388.

It appears that there were no major differences in runtime.  Perhaps it is more strongly related to the learning rate than anything else.  Again, we see zeroes equaling E.A., indicating that perhaps the best idea is to simply initialize with zeroes strictly for code simplicity.

]]>
http://thekevindolan.com/2010/02/perturbation-modeling/feed/ 2
More Test Data http://thekevindolan.com/2010/02/more-test-data/ http://thekevindolan.com/2010/02/more-test-data/#comments Thu, 25 Feb 2010 01:30:40 +0000 Kevin http://thekevindolan.com/?p=786 tumblr_kvk45k0Nam1qznckp

I got back to working on the automated news analysis algorithm again, and thought that it would be wise to generate some new test data that will have some more context to it.  I wrote a simple algorithm that I discuss here, and I generated some data sets.

The basic idea was simple.  I want to in the future use some measure of similarity between documents that is smarter than the traditional tf.idf approach but at the moment, I don’t know which methods to use (as this is a large part of the project as a whole).  That being said, I still need to build a foundation for assigning sentiment to news stories for which we know what happens afterwards!

So a reasonable solution, I think, would be to generate some test data sets that will perform well using tf.idf, and then use that them to isolate the other aspects of the problem.

To generate these news corpuses, I used a method similar to the first method, except I now set the content of the news stories to be a nonsense string of several words.  I built 3 sets of 50 random words, and these sets are good, bad, and neutral.  Neutral words have some constant probability of appearing in any document, and the good/bad words are selected in proportion to influence of the generated news article.

The code I used specifically is reproduced below:

public NewsCorpus generate() {
	NewsCorpus corpus = new NewsCorpus();
	long time = 0;
	for(int i = 0; i &lt; numStories; i++) {
		double positive = Math.random();
		double influence = positive * 2 - 1;
 
		StringBuffer sb = new StringBuffer();
		int numWords = (int) (Math.random() * maxWords);
		for(int j = 0; j &lt; numWords; j++) {
			String[] wordList;
			if(Math.random() &lt; neutralProportion)
				wordList = neutralWords;
			else if(Math.random() &lt; positive)
				wordList = goodWords;
			else
				wordList = badWords;
			sb.append(wordList[(int) (Math.random() * 50)] + " ");
		}
 
		NewsStory story = new NewsStory(time, "Context Corpus Generator",
				"News Story #"+i, "LINEAR;"+influence+";"+timeFrame, sb.toString());
 
		corpus.addNews(story);
		time += timeStep;
	}
	return corpus;
}

I also generated several data sets using this method.  For all data sets, I used a word limit of 50 and a timestep of 1.  The data set summaries are below:

Data Set 1: 1000 articles, timeframe of 10, 0.33 neutral proportion

Data Set 2: 1000 articles, timeframe of 30, 0.33 neutral proportion

Data Set 3: 1000 articles, timeframe of 50, 0.33 neutral proportion

Data Set 4: 1000 articles, timeframe of 50, 0.50 neutral proportion

Data Set 5: 3000 articles, timeframe of 50, 0.50 neutral proportion

]]>
http://thekevindolan.com/2010/02/more-test-data/feed/ 2
Perturbation Model of Price Movement http://thekevindolan.com/2010/02/perturbation-model/ http://thekevindolan.com/2010/02/perturbation-model/#comments Thu, 04 Feb 2010 05:20:01 +0000 Kevin http://thekevindolan.com/?p=706 shwayze

I was sitting in my networks class today, thinking of how it would be possible to implement an algorithm for taking into consideration the similarity of documents for teasing apart temporal interference, when I started coming to a more coherent model of what I’ve been trying to do in general.  This article will set up some early ideas for a model of what’s going on, what we’re attempting to accomplish, and possible general procedures for doing so.  It also sets up some terminology.

Essentially, at this stage we have a data history.  This data history is made of two parts, a set of price point information and a set of several relevant news articles.

We will call our price point information, the Time-Sensitive Response Variable, or TSRV.  Let us explore what we are assuming about the TSRV.

I began to think about the idea of making analogies to physics, because that’s something I understand a little better than economics.  I think the way a lot of people approach the stock market for investing is to think about price as position.  This gives way to the idea that the market may often find itself trending one way or the other.  The idea behind a trend is that the price has a certain velocity, which is resistant to change (intertia),  until some outside influence (force/acceleration) causes a reversal or something of that nature.

Having looked at a lot of stock graphs, I am not so sure this is the case.  I understand many successful traders would disagree with me, but for the sake of this project, we are going to think of the TSRV for price as velocity, that is it is resistant to movement without outside influence–it experiences inertia.  In this concept the price over time is a derivative of some unknown value behind the scenes, which I intuitively feel might exist, that behaves more like the traditional concept of price.

You might say that the price over time is constantly gyrating madly about, so thinking that the TSRV is resistant to change is ridiculous, but keep in mind I said it was resistant to change…undisturbed.  There is a constant barrage of outside influence coming in to affect the price.  I consider these outside forces, the analogy of a force in physics, which is proportional to acceleration.

We will be calling these outside influence perturbations.  Perturbations could inevitably take many forms, but for simplicity we will be thinking of individual perturbations as being discrete chunks of constant force with finite lengths.  In physics, we know that the acceleration observed is due to the sum of forces acting on an object.  So too is the effect of the perturbations additive.

We understand that there are a great number of perturbations affecting any TSRV, some of which we know about, many of which we do not.  Furthermore, we generally only know the existence of some perturbations, not any details of their strength, direction, or duration.  The perturbations we do not know about, will generally seem to manifest themselves as noise, but it should be known that under this model, there is no random TSRV movement, only movement due to unconsidered perturbations.

For our purposes with regards to automated news analysis, we have a set of several relevant news articles that we assume have some effect on the movement of the price, in this fashion.  Our end-goal is to approximate the effect that news articles have.  According to our definition of the effect of perturbations, there are two dimensions of the effect of a perturbation, the strength/direction of the acceleration caused, the influence, and the length of time it effects the price, the  duration.

Given a set of perturbations, we want to find some way to determine their characteristics, so that we can hopefully use some similarity metric to indicate possibly future price moves.

It is well understood that the duration will vary for most news articles, but the computational difficulty of determining both seems at this juncture beyond the scope of what I can hope to accomplish.  Perhaps sometime in the future we can devise more complex algorithms, but for now we will focus on determining the influence of perturbations given some predetermined duration.

Now that we have a more formal understanding of our basic assumptions, we can consider possible means of accomplishing the approximation of the characteristics of perturbations, but I’ll save that for next time.

And yes, that’s a picture of Shwayze.

]]>
http://thekevindolan.com/2010/02/perturbation-model/feed/ 2
Two Approaches to News Rating http://thekevindolan.com/2010/02/two-approaches-to-news-rating/ http://thekevindolan.com/2010/02/two-approaches-to-news-rating/#comments Wed, 03 Feb 2010 23:03:39 +0000 Kevin http://thekevindolan.com/?p=697 flan

Last time, I generated a few data sets for testing different methods of rating the training news articles.  This time, I actually implemented two of them, the naive approach I had used before, and the new-and-improved version taking into account temporal interference.

Let us first examine the architecture I am using.  I created an interface Sentiment, which is to be implemented by classes that take a ticker and can find some sentiment estimations from it.  The signature for the method is as follows:

/**
* Get the sentiment for the news articles in a ticker
* @param ticker   the ticker to inspect
* @param forecast the forecast parameter
* @param index	   the data index to inspect
* @return		 an array where
* 					array[0][i] = influence of i
* 					array[1][i] = time-frame of i, in terms of multiples of forecast
*/
public double[][] getSentiment(Ticker ticker, long forecast, int index);

The idea behind doing it this way, returning a two dimensional array, is that I may eventually want to implement a more complex learning method that can also tease apart what scale it is that a news article has an effect, since I imagine some news articles represent more long-term impact, while others represent more short-term impact.  But that’s for the future.

The implementation for the naive approach is reproduced below:

public double[][] getSentiment(Ticker ticker, long forecast, int index) {
	DataHistory data = ticker.getDataHistory(index);
	NewsCorpus corpus = ticker.getCorpus();
 
	double[][] result = new double[2][corpus.getCount()];
	for(int i = 0; i &lt; corpus.getCount(); i++) {
		NewsStory story = corpus.getNews(i);
		long time = story.getTime();
		double price1 = data.getData(time);
		time += forecast;
		double price2 = data.getData(time);
 
		result[0][i] = (price2 - price1) / forecast;
		result[1][i] = 1;
	}
	return result;
}

The essential idea is that you just look ahead and see how the price moved some time after the news article. This was the approach I had used on my original project. Having now had the chance to test this method on my theoretically generated data, I can see just how ineffective this is.

For data set 1, the performance (in terms of proportion of sentiments in which the correct sign was estimated), was 0.84. That value is not entirely bad, and would be acceptable, except we find that the data set is actually quite simple. For the other sets, the performance was 0.72, 0.62, 0.56, and 0.70 respectively. On the hard data set (#4), the value was only slightly better than chance by random guessing.

I also tried to see what would happen on a larger data set, a newly generated one, which can be found here.  This corpus had a time-frame of 50 and a time-step of 1, making it similar to data set 4.  Performance on data set 6 was poor, again at about 0.56, indicating the performance is very heavily correlated to the ratio of time-frames to time-steps.

The summary of results for the naive approach can be found here.

It’s no wonder my initial attempts failed so horribly, at the very base of the methodology was a very poorly thought-out concept.

My next attempt was to try the method that teased apart temporal influences.  The general approach was to observe a price change, and then subtract off the effect from all news articles before it, within the forecast parameters.  The value we used to subtract off was an assumed value, initialized to the naive price move.  This approach is then repeated several times, until we come to some steady state.

My initial findings were that this method performed with almost identical accuracy as the previous test.  There had to be something wrong.

It was then that I realized that I also needed to be subtracting off the influence from news articles ahead of the target news article.  This increased the necessary number of iterations, but the results amazed me.

Keep in mind, there is a parameter here, the epsilon value, the acceptable error.  High values of epsilon meant longer run-times, but often more accurate results.

With an epsilon value of 1E-4, the number of iterations required for the 5 main data sets were: 18, 40, 60, 125, 33.

At that epsilon, the accuracy values were: 0.96, 0.93, 0.92, 0.87, 0.99.

Detailed results for this test can be found here.

At that value of epsilon, I was not able to run the test on the data set number 6.  Early anecdotal notes on performance, are that the accuracy seems to also be correlated to the ratio of time-frame to time-step, but does not seem to drop off as quickly.  Number of iterations seems to be almost linear to the same ratio.

Wanting to do some more experiments on performance, I tried setting the value of epsilon to 1E-8.

The first data set required 1387 iterations, but had accuracy 0.97.  The second data set took too long to finish.  The fifth data set took only 189 iterations, and only got a single sentiment wrong.

Reducing the epsilon value to 1E-2, we needed iterations  4, 11, 23, 27, 13 and got accuracies 0.94, 0.90, 0.88, 0.82, 0.94.  I gave the algorithm 3 minutes to do data set 6, and it was still unable to finish.

At 1E-1, or 0.1, we found runtime to be:  3, 8,  20, 21, 11 and accuracy to be: 0.93, 0.90,  0.88, 0.80,  0.94.  Here we start to notice less significant dropoff.  Keep in mind that each iteration probably causes the values to move less and less, and the values at each iteration are deterministic and not dependent on epsilon.  At this epsilon, data set 6 could be processed in 120 iterations (which took about 6 minutes), and got an accuracy of 0.89, which is actually greater than the epsilon for data set 4 at epsilon of 1E-4, but not by much.

Seeing how this one performed, however, led me to discover a wildly inefficient chunk of code that would make this algorithm painfully slow as we increase the number of news articles, the nested for loops that check every single article against every other at each iteration.  I did it this was, because the news articles might not necessarily be sorted, however, this really only needs to be done once, and we can cache the results.

Making this change did not change the results obtained at all, but did make the runtime seriously faster.

At 1E-1, data set 6 was now complete in under 30 seconds.  I then ran data set 6 with different values of epsilon.  At 1E-2, we required 142 iterations, with an accuracy of 0.89.  At 1E-4, we required 221 iterations, with accuracy 0.91.  Feeling good, I tried 1E-8, but lost patience.  At 1E-5, 313 iterations with accuracy 0.92.  At 1E-6: 1173 iterations (a sharp jump!), but the accuracy too increased to 0.94.

You can draw your own conclusions about what’s really going on here from these several data points, I’m not making any real claims yet, but I can imagine there is still a lot to learn.

Here is the code, as it currently exists:

private double epsilon = 1E-8;
 
@SuppressWarnings("unchecked")
@Override
public double[][] getSentiment(Ticker ticker, long forecast, int index) {
	//Get the observed price moves
	double[] priceMove = (new NaivePriceMove()).getSentiment(ticker, forecast, index)[0];
	int count = priceMove.length;
 
	//Create the empty result table
	double[][] result = new double[2][count];
	for(int i = 0; i < count; i++) {
		result[0][i] = priceMove[i];
		result[1][i] = 1;
	}
 
	NewsCorpus corpus = ticker.getCorpus();
 
	//Cache the proximities
	List<Integer>[] proximities = (List<Integer>[]) new List<?>[count];
	for(int current = 0; current < count; current++) {
		proximities[current] = new ArrayList<Integer>();
		long time = corpus.getNews(current).getTime();
		List<NewsStory> range = corpus.getNews(time - forecast + 1, time + forecast - 1);
		//Go through the other news events
		for(NewsStory story : range) {
			int other = story.getId();
			if(other != current) 
				proximities[current].add(other);
		}
	}
 
	//Keep going until we reach an acceptable error
	double error = Double.MAX_VALUE;
	int iter = 0;
	while(error > epsilon) {
		iter++;
		error = 0;
		//Go through each news event
		for(int current = 0; current < count; current++) {
			double observed = forecast * priceMove[current];
			long time = corpus.getNews(current).getTime();
			//Go through the other news events
			for(int other : proximities[current]) {
				long otherTime = corpus.getNews(other).getTime();
				long difference = Math.abs(time - otherTime);
				//Depending on how much overlap, discount the influence of the other
				long effect = forecast - difference;
				observed -= result[0][other] * effect;
			}
			observed /= forecast;
			//Error is max error
			double difference = observed - result[0][current];
			difference *= difference;
			error = Math.max(difference, error);
			result[0][current] = observed;
		}
	}
	System.out.println(iter + " iterations.");
	return result;
}
]]>
http://thekevindolan.com/2010/02/two-approaches-to-news-rating/feed/ 0
Test Data Sets http://thekevindolan.com/2010/02/test-data-sets/ http://thekevindolan.com/2010/02/test-data-sets/#comments Tue, 02 Feb 2010 23:57:04 +0000 Kevin http://thekevindolan.com/?p=690 test-data

From my last post, I introduced the idea of creating test data sets for the purpose of finding an algorithm to tease apart the influence of individual news articles.  I have done just that and am posting the data sets for further analysis.

My method for generating these test files was as the following pseudocode describes:

-Take 3 parameters, TIME-STEP,  TIME-FRAME, and COUNT.

-Create COUNT news articles, each with the following encoded in their summary field:

-Time-frame equal to TIME-FRAME

-Influence randomly set between [-1,1]

-For each timestep 0 through (TIME-STEP * COUNT)

-Find all news articles before current time, within their Time-frame value of now

-Add the sum of those news articles’ Influence values to the current price

-Record the current price

Because we defined a constant TIME-FRAME ahead of time, a simpler algorithm could have been used, but I am planning on attempting experiments with variable time-frames at a later date, so this was a sensible solution to save myself some work in the future.

I created 6 data sets, each with 500 data points, as follows:

Data set 0

TIME-STEP: 1

TIME FRAME: 1

Data set 1

TIME-STEP: 1

TIME FRAME: 2

Data set 2

TIME-STEP: 1

TIME FRAME: 5

Data set 3

TIME-STEP: 1

TIME FRAME: 10

Data set 4

TIME-STEP: 1

TIME FRAME: 50

Data set 5

TIME-STEP: 3

TIME FRAME: 17

The motivation for choosing the values for data-sets 1-4 are simple, to see the effects of using longer and longer time-frames relative to time-steps.  Data set 5 exists for the sole purpose of seeing if any problems are present with weird offsets.  If we see anything unexpected there, future research may be necessary.

I have attached a zip file of the corpus, if you are interested: here.

]]>
http://thekevindolan.com/2010/02/test-data-sets/feed/ 1
Temporal Interference http://thekevindolan.com/2010/02/temporal-interference/ http://thekevindolan.com/2010/02/temporal-interference/#comments Mon, 01 Feb 2010 21:14:33 +0000 Kevin http://thekevindolan.com/?p=685 coyabaanniversary400

This article is the second of the Automated News Analysis series, regarding a particular problem I overlooked during my first try at news analysis.  The subject of this article is taking a dataset of news and price history, and attempting to assign sentiment to the news articles for which we know the price development.  The problem that we will explore in particular, is removing influences from other news articles near our target article in time.

So the general methodology behind analyzing a new corpus is that on our training data set, we know how the price performed after some news event.  The hope is that we can use this information, and take into account similarity to future news stories, to somehow retrieve some information as to what the price will do in the future.

So from this, we are making an assumption, that the price move is in some way related to the content of the news article.  Intuitively, we would agree that this is the case.  If a company receives a piece of good news, it is likely that the price will go up.  If it receives bad news, it is likely to go down.  Obviously, the system is messier than this, but that is the basic assumption between this entire project.

We would like to know how a particular news article actually affects the price data on historical data.  A human analyst might look at a news article in history, and reason using external domain knowledge about the company, and form an opinion of his own.  At this juncture, we are not ready to consider external knowledge for computation, we merely want to find some purely mathematical model that can approximate the effect a news event has on the price data.

In the past, I had generally just worked under the assumption that a good enough approximation would be to ask how the price changed a certain period of time after the news article.  In hindsight, I think this is a very naive approach.  The problem, primarily, stems from two major sources of confound– temporal interference and spatial interference.  Temporal interference refers to interference from news articles proximal in time to the target news article.

Imagine a corpus having several bad news articles followed immediately by one good news article.  In general, we would expect the influence of these bad news articles to outweigh the influence of the good one, and thus we would see a downward price move.  Would there be any reasonable way to tease apart these two influences, at least on some approximate level, to better determine the sentiment of a particular news article.

To attempt to find a solution to this problem, we will approach it entirely theoretically.  We will first define some assumptions for generating test data sets, and attempt to find an algorithm that can effectively approximate the initial seeds to the problem.  In order to do this, we will first use strong assumptions, and then loosen them to more accurately represent what we think might be the case in the real world.

Our assumptions will be as follows:

A) The influence of events is solely responsible for the movement of price.

B) An event affects the price in a linear fashion, according to its sentiment.

C) An event affects the price over some predetermined constant time frame.

This gives us a nice way of generating test data.  The goal of using this test data would be to find an algorithm that looks at a corpus and teases out the influence of each event into some coherent model that matches the random seeds we used to generate the test data.

I have a few ideas in mind for how I will solve this problem, but for now I will get to work on generating some test data, and then report back on progress.

]]>
http://thekevindolan.com/2010/02/temporal-interference/feed/ 0
Automated News Analysis Series: Introduction http://thekevindolan.com/2010/01/automated-news-analysis/ http://thekevindolan.com/2010/01/automated-news-analysis/#comments Fri, 29 Jan 2010 04:30:46 +0000 Kevin http://thekevindolan.com/?p=679 newspaper

As many of you are probably aware, I’ve been involved in varying degrees with a project on computational finance.  Originally called the DarwInvest project, I had intended to develop a Genetic Programming algorithm for making intelligent trades on price history data.  The projected ended up being encompassed by the development of my extensive GP library, Darwin, my proudest achievement in elegant code design.  Well, I’m back at it, and this time I’m trying to tackle the news!

I intend to take TKD in a slightly different direction from here-on-out (not that it had any direction to begin with), and focus mainly on my computer programming projects.  For that reason, I am changing the category WEB STUFF to COMPUTERS to make it more general.

This post serves primarily as an introduction for my current endeavor in algorithmic trading: automated news analysis.  But first, a primer on my opinions of automated algorithmic trading.

It has long been the holy grail of A.I. developers everywhere to develop some magic computer system that predicted the stock market and could turn infinite profits with minimal effort.  Obviously, that type of endeavor seems incredibly frivolous.  That being said, my interest in A.I. has naturally led me in the same direction.

According to my research, there is a lot of industry interest in algorithmic trading.  I’ve recently been interviewing with several financial firms for a summer internship, and they all seem to have some type of foot in the algorithmic trading game.  My prediction for the future is that more and more trading decisions are going to be made by computers, as more sophisticated and successful algorithms are developed.

The advantages to using computers to make trades are obvious, in that computers can be utilized to consistently analyze vastly more data than any single human, or teams of humans, could ever hope to analyze.  With direct data pipelines for data sources, they can also be made to act immediately to incoming information, in situations where humans would have to hesitate.

The drawbacks, of course, are that humans have a large corpus of implicit knowledge on the interactions between systems, and so are generally able to draw larger-scale conclusions about the implications of a certain piece of information.

Still, I feel the growing sophistication of A.I. and the resources being poured into its development will eventually lead to algorithms that can compete, and even beat humans at trading.

We are entering an era of algorithmic arms races between huge financial firms.  Trading decisions by individuals will become less and less successful as the computers of large firms find clever ways of exploiting them.  Eventually it will come down to technical specifications, who has more computing power, better bandwidth, etc, and especially algorithmic ingenuity.

Some have speculated that the widespread advent of algorithmic trading would put the market in essential stasis, where there would be little profit potential for firms, encouraging a withdrawal of resources from investment firms.  I wager that this will not be the case, in much the same way a restaurant will never become unpopular by being too crowded.  If firms pull out, by that very action, the profitability returns encouraging new firms to enter.

With that being said, there will always be the people refusing to trust computers to make trades.  The idea of trusting huge amounts of money to computers is certainly a scary notion, and not one that should be taken lightly.  Also, advances by individual firms to improve their technical infrastructure and algorithmic sophistication will inevitably become the primary source of competition, and I feel it will allow the markets to stay as dynamic as they are now.

My current project is focused on analyzing relevant news articles for their impact on stock market historical price data.  It is more generally a project on qualitative classification of full-text documents, on a particularly noisy source of data which very well may actually have no underlying predictability.

I will be updating on my progress as I make it.  Right now, I am running a script to crawl the web for relevant news articles from the past 10 years.  I will be running this for several days in an attempt to build a sufficiently large corpus of information, and I will report on the results in a few days.

Stay tuned, because I’m excited!

]]>
http://thekevindolan.com/2010/01/automated-news-analysis/feed/ 0