More Test Data

Categories: Programming

I got back to working on the automated news analysis algorithm, and thought it would be wise to generate some new test data with more context to it.  I wrote a simple generator, which I discuss here, and used it to produce several data sets.

The basic idea is simple.  In the future I want to use a measure of document similarity that is smarter than the traditional tf.idf approach, but at the moment I don't know which methods to use (choosing one is a large part of the project as a whole).  That said, I still need to build a foundation for assigning sentiment to news stories whose aftermath we already know!

So a reasonable solution, I think, is to generate test data sets on which tf.idf performs well, and then use them to isolate the other aspects of the problem.
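For reference, here is a minimal sketch of the kind of tf.idf similarity I have in mind as the baseline.  It assumes raw term counts, idf = ln(N / df), whitespace tokenization, and cosine similarity between the resulting vectors; the class and method names (TfIdfSketch, vectors, cosine) are made up for illustration and are not part of the project code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {
    // Builds one tf.idf vector per document: tf = raw count, idf = ln(N / df).
    // This weighting is an assumption; other tf.idf variants exist.
    static List<Map<String, Double>> vectors(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String w : doc.trim().split("\\s+")) {
                if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
            }
            // Each distinct word counts once towards document frequency.
            for (String w : tf.keySet()) docFreq.merge(w, 1, Integer::sum);
            counts.add(tf);
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            result.add(vec);
        }
        return result;
    }

    // Cosine similarity of two sparse vectors; 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }
}
```

With nonsense-word documents, two stories that draw from the same word set should score close to each other under this measure, while stories with no words in common score exactly 0.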

To generate these news corpora, I used a method similar to the first one, except that the content of each news story is now a nonsense string of several words.  I built 3 sets of 50 random words each: good, bad, and neutral.  Neutral words have a constant probability of appearing in any document, and the remaining words are drawn from the good or bad set in proportion to the influence of the generated news article: the more positive the influence, the more likely a good word.
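The word sets themselves can be built in many ways.  As a rough sketch, assuming lowercase nonsense words of 4 to 8 letters and assuming the three sets should not share any word (neither detail is fixed by anything above), something like the following would do.  WordSetSketch and build are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

final class WordSetSketch {
    // Returns `sets` disjoint arrays, each holding `size` unique nonsense words.
    static String[][] build(int sets, int size, Random rng) {
        Set<String> seen = new LinkedHashSet<>();
        while (seen.size() < sets * size) {
            int len = 4 + rng.nextInt(5); // assumed word length: 4 to 8 letters
            StringBuilder w = new StringBuilder();
            for (int k = 0; k < len; k++) {
                w.append((char) ('a' + rng.nextInt(26)));
            }
            seen.add(w.toString()); // the set silently drops duplicates
        }
        // Slice the unique words into consecutive, non-overlapping groups.
        String[] all = seen.toArray(new String[0]);
        String[][] out = new String[sets][size];
        for (int s = 0; s < sets; s++) {
            System.arraycopy(all, s * size, out[s], 0, size);
        }
        return out;
    }
}
```

Calling `WordSetSketch.build(3, 50, new Random(42))` returns three arrays that could stand in for the good, bad, and neutral sets; the fixed seed only makes the output reproducible.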

The specific code I used is reproduced below:

public NewsCorpus generate() {
	NewsCorpus corpus = new NewsCorpus();
	long time = 0;
	for (int i = 0; i < numStories; i++) {
		// positive is in [0, 1); influence rescales it to [-1, 1).
		double positive = Math.random();
		double influence = positive * 2 - 1;

		// Build the nonsense body; numWords falls in [0, maxWords).
		StringBuilder sb = new StringBuilder();
		int numWords = (int) (Math.random() * maxWords);
		for (int j = 0; j < numWords; j++) {
			String[] wordList;
			if (Math.random() < neutralProportion)
				wordList = neutralWords;
			else if (Math.random() < positive)
				wordList = goodWords;
			else
				wordList = badWords;
			// Index by the list's own length rather than a hard-coded 50.
			sb.append(wordList[(int) (Math.random() * wordList.length)]).append(' ');
		}

		NewsStory story = new NewsStory(time, "Context Corpus Generator",
				"News Story #" + i, "LINEAR;" + influence + ";" + timeFrame, sb.toString());

		corpus.addNews(story);
		time += timeStep;
	}
	return corpus;
}
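To make the word selection concrete: each word is neutral with probability neutralProportion, and otherwise good with probability positive.  The small helper below (WordMixSketch and mix are hypothetical names, not part of the generator) just spells out the resulting per-word probabilities.

```java
final class WordMixSketch {
    // Returns {P(neutral), P(good), P(bad)} for a single word draw, given
    // the generator's neutralProportion and a story's positive value.
    static double[] mix(double neutralProportion, double positive) {
        double nonNeutral = 1.0 - neutralProportion;
        return new double[] {
            neutralProportion,
            nonNeutral * positive,
            nonNeutral * (1.0 - positive)
        };
    }
}
```

Since influence = 2 * positive - 1, the gap P(good) - P(bad) works out to (1 - neutralProportion) * influence, so the expected balance of good over bad words tracks the influence linearly, just scaled down by the share of neutral words.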

I generated several data sets using this method.  For all of them, I used a word limit of 50 and a time step of 1.  The data set summaries are below:

Data Set 1: 1000 articles, timeframe of 10, 0.33 neutral proportion

Data Set 2: 1000 articles, timeframe of 30, 0.33 neutral proportion

Data Set 3: 1000 articles, timeframe of 50, 0.33 neutral proportion

Data Set 4: 1000 articles, timeframe of 50, 0.50 neutral proportion

Data Set 5: 3000 articles, timeframe of 50, 0.50 neutral proportion

