Data Analysis in Early Stage Startups

This is intended to be a short post. The topic is data as it relates to early-stage startups. There are two types of data sets that people want. One, they want to know about financings, valuations, and acquisition prices. Two, they want to know about metrics: daily active users, monthly active users, and a range of other emerging engagement metrics. People want this data so that we can all make more sense of what is happening in the early stages. Rather than get caught up in the hype, we can trust the data.

As nice as it would be to have these, the cold reality is that these data sets are nearly impossible to get. “If” every startup honestly contributed their financing-related data to CrunchBase, from start to end, we’d have some rich data, but that ain’t happening. There’s little incentive for founders and investors to disclose this data, and for currently early-stage startups, we won’t know the financing particulars for a long while, if ever. And, “if” every startup properly collected their own usage and engagement data, we’d be able to better decipher which metrics are for vanity and which are for value. As it stands, only a handful of people know the metrics at growing early-stage startups, and they have little to no incentive to share them.

Therefore, we don’t have good data, and whatever is there is far from clean. And, double therefore, making inferences from the data is a dangerous exercise in extrapolation. This is why I *never* try to cite data to back up any arguments I’m making. Data can always be manipulated or misused, reshaped to advance any argument. Perceptive readers will cut through that b.s., and it reduces credibility all around, not to mention trust in the source and respect for the reader’s time.

Brendan Baker has a great way of explaining this, saying that any communication or analysis around early-stage data should include the following language:

Here’s what we found, here’s what I think it means, and here are the limitations.


Let me go one step further on Brendan’s suggestion and say that this disclaimer should be appended to any data collection and analysis of early-stage companies. This clearly presents the realities to the audience, protects the author from some inevitable doubts, respects the reader’s time, and hopefully creates a good enough atmosphere for discussion around what should be an important topic. The next decade is going to slap us in the face with all sorts of data, so we must start establishing these ground rules now. Please comment with more, and thanks in advance.

About Semil Shah

Official contributor to @TechCrunch (since Jan 2011); from July 1, will begin EIR with @JavelinVP

2 responses to “Data Analysis in Early Stage Startups”

  1. rohit sharma (@rohit_x_) says :


    I disagree. “People want this data so that we can all make more sense of what is happening in the early stages. Rather than get caught up in the hype, we can trust the data.” is a very big assumption.

    To make sense of what is happening, from an investor or an entrepreneur’s point of view, heuristics formed by experience, pattern learning, and an ability to fast-forward a hypothesis while freezing reality are key soft-skills that buttress data-driven learnings.

    Any ‘analysis’ of early stage companies must be accompanied by insights derived as if you were an investor (or the entrepreneur); neither is a data-driven point of view.

    So, subtle but important difference imho – Data not available and even if it was, it’s not enough. In the face of such uncertainty, ‘decision theory’ exercised with data will likely match or mimic average VC returns; i.e. you cannot use a Kahneman-like algorithm fed w/ early stage data to make a better than average decision on startups vs. a good VC. I know this is not what you set out to do with data (just analysis) but I am citing this example to help illustrate the point that data is not enough in addition to your assertion that data slivers available are not sufficient.

  2. damondpace says :

    I agree. In a very early stage startup it’s hard to see what’s going on and extrapolate any real knowledge out of data. You’re still in the “guessing game” stage at this point and all you’re really saying with the data is that you’ve answered 5 out of 100 questions and these are the answers you’ve come up with. You still need to answer 95 other questions to get enough data to validate your evolving hypothesis and make truly informed decisions. It’s all part of the building process.
