Test Data Sets
Kevin, Tue, 02 Feb 2010 23:57:04 +0000
http://thekevindolan.com/2010/02/test-data-sets/

In my last post, I introduced the idea of creating test data sets for the purpose of finding an algorithm to tease apart the influence of individual news articles.  I have done just that and am posting the data sets for further analysis.

I generated the test files using the method described by the following pseudocode:

- Take 3 parameters: TIME-STEP, TIME-FRAME, and COUNT.

- Create COUNT news articles, each with the following encoded in its summary field:

    - Time-frame equal to TIME-FRAME

    - Influence set randomly in the range [-1,1]

- For each timestep from 0 through (TIME-STEP * COUNT):

    - Find all news articles published before the current time, within their Time-frame value of now

    - Add the sum of those news articles’ Influence values to the current price

    - Record the current price
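The pseudocode above could be sketched in Python roughly as follows. This is only an illustration, not the script that produced the posted files. Several details are assumptions on my part because the pseudocode leaves them open: that article i is published at time i * TIME-STEP, that an article published at time t counts as active for t <= now < t + Time-frame, that Influence is drawn uniformly, that the price starts at 0, and that the loop stops just short of TIME-STEP * COUNT. The function name `generate_prices` and the `seed` and `start_price` parameters are hypothetical.

```python
import random


def generate_prices(time_step, time_frame, count, seed=None, start_price=0.0):
    """Return (articles, prices) for one synthetic data set."""
    rng = random.Random(seed)

    # Assumption: article i is published at time i * time_step.
    # Each article carries its own time_frame so that variable
    # time-frames can be tried later without changing the loop.
    articles = [
        {
            "time": i * time_step,
            "time_frame": time_frame,
            "influence": rng.uniform(-1.0, 1.0),  # assumed uniform
        }
        for i in range(count)
    ]

    price = start_price
    prices = []
    # Assumption: timesteps run over 0 .. time_step * count - 1.
    for now in range(time_step * count):
        # Articles already published and still within their time-frame.
        active = [
            a for a in articles
            if a["time"] <= now < a["time"] + a["time_frame"]
        ]
        price += sum(a["influence"] for a in active)
        prices.append(price)
    return articles, prices


if __name__ == "__main__":
    articles, prices = generate_prices(time_step=1, time_frame=2, count=5, seed=0)
    for now, price in enumerate(prices):
        print(now, round(price, 3))
```

Scanning every article at every timestep is quadratic, but it keeps the per-article Time-frame lookup explicit, which is what the pseudocode asks for.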

Because TIME-FRAME is a constant defined ahead of time, a simpler algorithm could have been used. However, I am planning to experiment with variable time-frames at a later date, so the more general approach was a sensible way to save myself some work in the future.
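For comparison, here is one guess at what a simpler constant-time-frame version might look like; the post does not say which simplification was in mind. With a fixed TIME-FRAME, the set of active articles is a sliding window, so a running sum can be updated as articles enter and leave it. The sketch assumes, hypothetically, that article i is published at time i * TIME-STEP and is active for TIME-FRAME timesteps starting at its publication time. The function name `generate_prices_constant_frame` is made up for illustration.

```python
def generate_prices_constant_frame(influences, time_step, time_frame):
    """Price series for a constant time-frame, using a running window sum."""
    count = len(influences)
    active_sum = 0.0  # sum of Influence values currently inside the window
    price = 0.0       # assumed starting price
    prices = []
    for now in range(time_step * count):
        # An article is published at this timestep: it enters the window.
        if now % time_step == 0:
            active_sum += influences[now // time_step]
        # The article published time_frame steps ago leaves the window.
        expired = now - time_frame
        if expired >= 0 and expired % time_step == 0:
            active_sum -= influences[expired // time_step]
        price += active_sum
        prices.append(price)
    return prices


if __name__ == "__main__":
    print(generate_prices_constant_frame([0.5, -0.25, 1.0], time_step=1, time_frame=2))
```

This only works while every article shares the same time-frame; once time-frames vary, articles no longer leave the window in publication order, which is where the general per-article search earns its keep.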

I created 6 data sets, each with 500 data points, as follows:

Data set 0: TIME-STEP 1, TIME-FRAME 1

Data set 1: TIME-STEP 1, TIME-FRAME 2

Data set 2: TIME-STEP 1, TIME-FRAME 5

Data set 3: TIME-STEP 1, TIME-FRAME 10

Data set 4: TIME-STEP 1, TIME-FRAME 50

Data set 5: TIME-STEP 3, TIME-FRAME 17

The motivation for choosing the values for data sets 1-4 is simple: to see the effects of using longer and longer time-frames relative to the time-step.  Data set 5 exists solely to see whether weird offsets cause any problems.  If we see anything unexpected there, further research may be necessary.

I have attached a zip file of the corpus, if you are interested: here.
