Alpha Needles in Unstructured Data Haystacks
By Barry L. Star
CEO, Wall Street Horizon
The challenges of working with unstructured data came into sharp focus last year when a self-driving car was involved in a fatal crash on a Florida highway. The NTSB ultimately found that the driver-assistance system was not at fault. However, the federal regulators did warn people that they can only rely on these systems to handle some of the situations that occur on the roads.
To get the Wall Street equivalent, just change a few words in the regulators’ warning. Traders can only rely on these systems to handle some of the situations that occur in the markets.
While not as catastrophic as a collision with an 18-wheeler, a bad trade based on a poor assessment of unstructured data could cost a firm millions.
That said, there is untold value and potential alpha tied up in the vast amounts of unstructured data now available. The key question, therefore, is how best to identify it, extract it, and leverage it to improve trading and investment results.
The Nature and Scope of the Unstructured Data Problem
While definitions vary, unstructured data is content that doesn’t follow a pre-defined data model. Unlike the tidy and consistent rows, columns, and formats of structured data, unstructured data is all that messy and free-form content that people generate, mostly intended for use by other people.
It’s the free-form material in e-mail and text messages, blogs, videos, podcasts, chat sessions, and all kinds of marketing material, including web sites, marketing and sales collateral, white papers, slide presentations, and more. Essentially, it’s all stuff that doesn’t fit nicely into databases and spreadsheets.
Gartner and other market research firms agree that unstructured data comprises the lion’s share of most organizations’ informational assets – somewhere in the 80% range. The problem is that we’re still doing a poor job of mining the value out of this huge category of data.
Options for Moving Forward
Many Wall Street firms want to leverage various forms of unstructured data to generate profits and avoid losses. They want to wring the value out of that data to gain a clearer picture of their markets, spot patterns and anticipate developments more effectively, and take faster action to seize opportunities and sidestep risks.
These companies have three choices. They can build the software themselves, try a shortcut that skips all the really hard translation and analytics by using keywords to attempt to gauge sentiment, or buy or subscribe to a third-party system. Let’s look at the pros and cons of each option.
The “build” option can work if the company’s unstructured data initiative is narrowly focused and of limited scope. A good in-house development team can create and piece together all the taxonomy, text parsing functionality, analytics and metadata management functions required for a focused application.
For this to work, the focus must remain narrow. Imagine how hard it would be, for example, to write a parser that could recognize and normalize the five most popular date formats and the 22 or more different ways of representing numbers. And that’s the simple stuff. How about creating a parser that could pull the valuable nugget out of a CFO’s quote in an earnings release that reads something like: “We expect our next quarter earnings to be in the middle- to upper-range of the projected estimates”?
That’s the problem with this option. You need to have a specialized focus or you leave too much value on the table.
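To make the "simple stuff" concrete, here is a minimal Python sketch of date and number normalization. The names `DATE_FORMATS`, `normalize_date`, and `normalize_number` are hypothetical, and the five layouts and the handful of number styles are illustrative assumptions, not necessarily the formats referred to above. Even this toy version needs a format list, a scale table, and special handling for accounting-style negatives.

```python
import re
from datetime import date, datetime
from typing import Optional

# Illustrative layouts only (an assumption); real feeds contain many more.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y", "%d-%b-%Y"]


def normalize_date(text: str) -> Optional[date]:
    """Try each known layout in turn; return None when nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


# Hypothetical scale words and abbreviations seen in financial text.
_SCALE = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "mm": 1e6, "million": 1e6,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
}

# Optional "(", "-", "$", then digits with optional commas and decimals,
# then an optional scale word, then an optional ")".
_NUMBER_RE = re.compile(r"^\(?-?\$?\s*([\d,]*\.?\d+)\s*([a-z]*)\)?$")


def normalize_number(text: str) -> Optional[float]:
    """Convert strings such as '$1.2 billion' or '(1,250)' to a float."""
    s = text.strip().lower()
    # A leading minus or accounting-style parentheses mark a negative value.
    negative = s.startswith("-") or (s.startswith("(") and s.endswith(")"))
    match = _NUMBER_RE.match(s)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    suffix = match.group(2)
    if suffix:
        if suffix not in _SCALE:
            return None  # unknown unit: refuse rather than guess
        value *= _SCALE[suffix]
    return -value if negative else value
```

With this sketch, `normalize_date("March 5, 2017")` and `normalize_date("05-Mar-2017")` resolve to the same date, and `normalize_number("(1,250)")` yields a negative value. It also shows where the trouble starts: "03/04/2017" is silently read as month/day, and the CFO quote above contains nothing either function can extract.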
The other build option dips a toe into unstructured data but doesn’t dive all the way in. It’s where you build lighter-weight sniffers: programs that, for example, detect and score the presence of certain keywords in unstructured data flows in order to discern positive or negative sentiment. Like the narrowly focused systems discussed above, these types of systems generally lack the precision and thoroughness required to move the investment performance needle.
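A keyword sniffer of this kind can be sketched in a few lines of Python. The word lists and the `keyword_sentiment` function below are hypothetical, chosen only to illustrate the idea; a production system would use a much larger, curated vocabulary.

```python
import re

# Hypothetical keyword lists (assumptions for illustration only).
POSITIVE = {"beat", "raise", "raised", "growth", "record", "strong", "upgrade"}
NEGATIVE = {"miss", "missed", "cut", "loss", "weak", "downgrade", "recall"}


def keyword_sentiment(text: str) -> float:
    """Score text from -1.0 (all negative hits) to +1.0 (all positive hits).

    Returns 0.0 when no keyword appears at all.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    hits = pos + neg
    return 0.0 if hits == 0 else (pos - neg) / hits
```

The sketch is fast, which is its appeal, but its blind spots are easy to see. "Results did not miss estimates" scores as fully negative because the negation is ignored, and a sentence like "We expect our next quarter earnings to be in the middle- to upper-range of the projected estimates" scores 0.0 because it contains no keyword at all.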
The “buy” option has its pros and cons as well. Offerings of this type invariably have been built by specialists and experts, and reflect the years of focus and experience they’ve poured into their products. At Wall Street Horizon, for example, we’ve been exclusively focused on corporate event data for more than 14 years. We know this particular type of unstructured data inside and out, and that knowledge is built right into our data. Another benefit is that these offerings are ready to be implemented, and thus offer very short time-to-value.
While buy options come with costs, working with specialists is usually a smart play. As the old adage goes, “You get what you pay for.”
Choosing Your Best Path
At Wall Street Horizon, we have been dealing with unstructured data, and with clients who work with it, for years. While there are no hard and fast rules, we’ve noticed certain patterns in the approaches they tend to take. Large trading firms, for example, often have strategies that call for pulling specific, proprietary signals out of unstructured data streams. With plenty of resources at their disposal, they tend to build their own solutions. It’s an expensive approach, but a highly customizable one. The sentiment option often works well for latency-sensitive momentum traders who use speed to get out of the way if and when necessary. The buy option is usually the preferred choice for market makers and traders who need extremely accurate, machine-readable data as fast as possible, on a reasonable budget.
The best approach for your firm, therefore, will be dictated by three factors: your specific goals or business requirements, the timeframe or schedule under which you’re operating, and – as always – your available budget.
It’s Time to Make Your Move
Despite the great strides we’ve made in capturing and leveraging business intelligence, far too much of it remains locked away in files and datasets of all kinds. The only thing more mind-boggling than how much of this material already exists is how fast it’s growing.
Institutional investment firms can no longer afford to make only marginal use of, or largely ignore, this vast source of potential value, competitive advantage, and alpha.
Whether it’s building an in-house system, or teaming up with outside experts and leveraging their commercially available systems, many companies are moving ahead with unstructured data initiatives.
Don’t let your unstructured data project wind up crashing on the highway. Assess your options carefully, pick your path, and then move ahead decisively. That’s what will position your company to profit from all the value now tucked away in unstructured data.
Barry L. Star is the founder and CEO of Wall Street Horizon, a leading provider of corporate event data services for the capital markets.