The Very Large Dataset Project
I once took a statistics course taught by an Iranian. Whether it was due to the teaching style, the subject matter, or both, I turned out to be very successful in the course and to love it. I developed a truly passionate attitude toward the subject. But statistics? That revelation should pretty much seal my fate in your eyes as a complete nerd. Nevertheless, here’s what I did:
I’ve always been fascinated with datasets. The prospect of taking a large body of statistics and using it to measure various things, in an effort to make some sense out of the glop of data, has always held great appeal. Being interested in the stock market, I set out to locate datasets about the stock market so that I could “play around” with them. As any of you who may have looked already know, there is very little out there. You can get some, like historical prices for instance, but that’s about it. What other datasets you can get are very narrow in scope. You might find, for instance, historical aggregate earnings for the Dow companies, but that’s just earnings, and just the Dow. I wanted so much more.
Also, using historical prices alone seemed to violate a major principle I’ve always believed in: past price behavior cannot cause future price behavior. It rests on a fundamental premise of statistics: correlation does not imply causation. In other words, it’s not the price of a stock that impacts the price of a stock. It’s other stuff. Any instance where past price movement seems to correspond to a future price movement is coincidence. There are some notable pseudo-exceptions. Stocks do generally move in cycles. When stock prices move in cycles (which are themselves measured by price), trying to determine which causes which can become rather muddy. The price is the measure of the cycle. That’s almost like trying to distinguish between annual rainfall and the ruler used to measure it.
Contrary to what some of you might be thinking at the moment, I am not at all opposed to technical analysis. Quite the contrary. What gives technical analysis even more importance is that so many people use it in one fashion or another: any pattern the technical analyst notices is reinforced by the fact that other people are noticing, and acting on, the same thing. Resistance price points are a perfect example. People really do use them to determine buying and selling points.
Another thing I always wanted was a good (read: easy) way to backtest different kinds of screens and other events. I have found a couple of backtesting websites, but they are rather limiting: they cover certain stocks only, allow only beginning-of-year entry points, and so on. Fortunately, I have a bit of the programming bug, so I set out to create a web query that could poll a large dataset of stocks and then use that data to measure as many different things as possible. For my dataset, I chose all AMEX, NASDAQ, and NYSE tickers. I’ve been steadily accumulating financial statistics every week and housing them in a database. The data goes back to the end of May, and the database is now large enough that I can start getting more meaningful results.
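For the programmers among you, here is a rough sketch of what backtesting a screen against weekly snapshots might look like. To be clear, this is not my actual setup: the table layout, the column names, the helper functions build_db and backtest_screen, the "low P/E" screen, and every ticker, date, and number in it are made up purely for illustration. It uses Python with its built-in sqlite3 module.

```python
import sqlite3

# Hypothetical layout: one row per ticker per weekly snapshot.
def build_db(rows):
    """Create an in-memory database and load (ticker, week_end, price, pe) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots ("
        " ticker TEXT, week_end TEXT, price REAL, pe REAL,"
        " PRIMARY KEY (ticker, week_end))"
    )
    conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)
    return conn

def backtest_screen(conn, entry_week, exit_week, max_pe):
    """Return {ticker: fractional return} for tickers whose P/E passed
    the screen at entry_week, measured from entry_week to exit_week."""
    query = """
        SELECT e.ticker, e.price, x.price
        FROM snapshots AS e
        JOIN snapshots AS x ON x.ticker = e.ticker
        WHERE e.week_end = ? AND x.week_end = ?
          AND e.pe IS NOT NULL AND e.pe > 0 AND e.pe <= ?
    """
    results = {}
    for ticker, entry_price, exit_price in conn.execute(
        query, (entry_week, exit_week, max_pe)
    ):
        results[ticker] = exit_price / entry_price - 1.0
    return results

# Made-up toy data: fictional tickers, dates, prices, and P/E ratios.
TOY_ROWS = [
    ("AAA", "2024-05-31", 10.0, 8.0),
    ("AAA", "2024-08-30", 12.0, 9.0),
    ("BBB", "2024-05-31", 50.0, 30.0),
    ("BBB", "2024-08-30", 45.0, 28.0),
    ("CCC", "2024-05-31", 20.0, 12.0),
    ("CCC", "2024-08-30", 19.0, 11.0),
]

conn = build_db(TOY_ROWS)
returns = backtest_screen(conn, "2024-05-31", "2024-08-30", max_pe=15.0)
# Average return of the screened group (None if nothing passed the screen).
average = sum(returns.values()) / len(returns) if returns else None
```

The point of keeping one row per ticker per week is that any stored week can serve as an entry or exit point, which is exactly the flexibility the backtesting websites lacked. Note that the screen only looks at values known at the entry week, so it never peeks at later data.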
This post, then, serves as an introduction to the different things I plan on learning from studying this dataset. Also, I am always looking for ways to expand valuable offerings to my readers. I think it’s safe to say that at least some of the things I say are taken indifferently at best, or even negatively. Perhaps they’re too niche. Hopefully, starting this little project will give some others good reason to enjoy this site too.












