06.03.2016

Data Lakes Vs Data Streams: Which is Better? (by Guy Warren, ITRS Group)

Data lakes and data streams: two of the hottest data buzzwords du jour and as likely as any pair to spark an argument between data scientists backing one or the other. But which really is better?

Firstly, what are these lakes and streams?

A data lake is still a fairly new concept that refers to the storage of a large amount of unstructured and semi-structured data. It addresses the need to store data in a more agile way than traditional databases and data warehouses allow, since those require a rigid data structure and data definition. The data is usually indexed so that it is searchable, either as text or by a tag which forms part of the schema. The flexibility comes from the fact that each new stream of data can arrive with no schema, or with its own schema, and either way can still be added to the data lake for future processing.
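The idea of adding data with no schema, or with its own schema, and finding it again by tag can be illustrated with a toy sketch. This is not how any particular data lake product works; the `TinyLake` class, its method names and the sample records are all hypothetical, and everything is held in memory purely for illustration.

```python
from collections import defaultdict
from typing import Any, Optional


class TinyLake:
    """Toy in-memory stand-in for a data lake (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.records = []                   # payloads are stored exactly as they arrive
        self.tag_index = defaultdict(list)  # tag -> positions of matching records

    def add(self, source: str, payload: Any, tags: Optional[list] = None) -> int:
        """Store a payload of any shape and index it under the given tags."""
        record_id = len(self.records)
        self.records.append({"source": source, "payload": payload})
        for tag in tags or []:
            self.tag_index[tag].append(record_id)
        return record_id

    def by_tag(self, tag: str) -> list:
        """Return every record indexed under the tag, in arrival order."""
        return [self.records[i] for i in self.tag_index.get(tag, [])]


lake = TinyLake()
# A structured record, a raw text line and a derived analysis all go into the same lake.
lake.add("trade_feed", {"counterparty": "ACME", "qty": 100}, tags=["trades"])
lake.add("server_log", "cpu=91% host=app01", tags=["it", "servers"])
lake.add("analysis", {"anomalous_record_ids": [0]}, tags=["trades", "derived"])

trade_related = lake.by_tag("trades")
```

Note that no structure is imposed when the data is written; whoever reads the records back decides how to interpret each payload.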

Why is this useful? Because businesses are producing increasing amounts of useful data, in various formats and sizes and at various speeds. To realise the full value of this data, it must be stored in such a way that people can dive into the data lake and pull out what they need there and then, without having to define the data dictionary and relational structure of the data in advance. This increases the speed at which data can be captured and analysed, and makes it much easier to add new sources to the lake. As a result, lakes are much more flexible than traditional storage for data scientists or business analysts, who are constantly looking for ways to capture and analyse their data, and even to pour it back into the lake to create new data sources from their results. Perhaps someone has run an analysis to find anomalies within a subset of the data and has then contributed this analysis back to the data lake as a new source. However, to get the best out of a complex data lake, a data curator is still recommended to create consistency and allow joins across data from different sources.

A data stream, on the other hand, is an even newer concept in the general data science world (except for people who use Complex Event Processing engines, which work on streaming data). In contrast to deep storage, it is a result of the growing requirement to process and analyse streaming data in real time. Highly scalable real-time analysis is a challenge that very few technologies can truly deliver on yet. The value of the data stream (versus the lake) is the speed and continuous nature of the analysis, without having to store the data first. Data is analysed ‘in motion’.
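As a rough illustration of analysing data ‘in motion’, the sketch below computes a rolling average over values as they arrive, keeping only a small fixed window in memory rather than storing the whole stream. The function name `rolling_mean`, the window size and the sample values are assumptions made for this example; a production streaming engine would add distribution, fault tolerance and much more.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_mean(stream: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` values after each arrival."""
    recent = deque(maxlen=window)  # bounded buffer: older values are discarded
    total = 0.0
    for value in stream:
        if len(recent) == window:
            total -= recent[0]     # this value is evicted by the append below
        recent.append(value)
        total += value
        yield total / len(recent)


# The iterator stands in for a live feed; each value is consumed exactly once.
prices = iter([10.0, 12.0, 11.0, 15.0])
means = list(rolling_mean(prices, window=2))
# means == [10.0, 11.0, 11.5, 13.0]
```

Because the function is a generator, each result is available as soon as its input arrives, which is the essential difference from loading a stored data set and analysing it afterwards.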

The data stream can then also be stored. This makes it possible to add further context or to compare the real-time data against historical data, providing a view of what has changed and perhaps even why (although, depending on your solution, this may impact responsiveness). For example, comparing real-time data on trades per counterparty against historical data could show that a counterparty who usually submits a given number of trades a day has not submitted as many trades as expected. A business can then investigate why this is the case and act in real time, rather than retroactively or at the end of the day. Is it a connection problem with the counterparty? Is the problem on the business’s side or the client’s? Is it a problem with the relationship? Perhaps they’ve got a better price elsewhere? All of this is useful insight when it comes to shaping trading strategy and managing counterparty relationships.
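The counterparty example can be sketched in a few lines. The sketch assumes the historical data has already been reduced to an expected trade count per counterparty for this time of day; the names `lagging_counterparties`, `expected_by_now` and the 50% threshold are illustrative assumptions, not figures from any real system.

```python
from collections import Counter


def lagging_counterparties(trades_so_far, expected_by_now, threshold=0.5):
    """Return counterparties whose live trade count is below
    `threshold` times the count expected from history at this point in the day."""
    live = Counter(trades_so_far)  # counterparties with no trades count as zero
    return sorted(
        counterparty
        for counterparty, expected in expected_by_now.items()
        if live[counterparty] < threshold * expected
    )


# Hypothetical baseline derived from stored history.
expected_by_now = {"ACME": 40, "GLOBEX": 10, "INITECH": 20}
# Hypothetical live feed: one entry per trade received so far today.
trades_so_far = ["ACME"] * 12 + ["GLOBEX"] * 9 + ["INITECH"] * 20

flagged = lagging_counterparties(trades_so_far, expected_by_now)
# flagged == ["ACME"]
```

Here only ACME is flagged: it has sent 12 trades against an expected 40, well under the threshold, so someone can investigate during the trading day rather than after it.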

The availability of these new ways of storing and managing data has created a need for smarter, faster data storage and analytics tools to keep up with the scale and speed of the data. There is also a much broader set of users who want to ask questions of their data themselves, perhaps to aid their decision making and drive their trading strategy in real time rather than weekly or quarterly. They don’t want to rely on, or wait for, a dedicated business analyst or other limited resource to do the analysis for them. This increased capability and accessibility is creating whole new sets of users and completely new use cases, as well as transforming old ones.

Look at IT capacity management, for example, which has hitherto been limited to looking at sample historical data in a tool like a spreadsheet and trying to identify issues and opportunities in the IT estate. Now it is possible to compare real-time and historical server data with trading data, i.e. what volume of trades generated what load on the applications processing the trades. It is also possible to spot unusual IT loads before they cause an issue. Imagine an upgrade to a key application: modern capacity management tools can detect that the servers are showing unusually high load given the volume of trades going through the application, catching a degradation in application performance before a high trading load causes an outage. In the future, by feeding in more varied and richer sources of data (particularly combining IT and business data) and implementing machine learning algorithms, it will be possible to accurately predict server outages or market moves that could trigger significant losses if not caught quickly.
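One simple way to express ‘unusually high load given the volume of trades’ is to track load per trade and compare the live figure with its historical spread. The sketch below uses a basic z-score for this; the technique, the function names, the limit of 3 standard deviations and the sample figures are all assumptions for illustration, not a description of any specific capacity management tool.

```python
from statistics import mean, stdev


def load_per_trade(cpu_load: float, trade_count: int) -> float:
    """CPU load (in percent) attributable to each trade processed."""
    return cpu_load / trade_count


def is_unusual(history, cpu_load, trade_count, z_limit=3.0) -> bool:
    """Flag a reading whose load per trade is far above the historical norm.

    `history` is a list of (cpu_load, trade_count) pairs with at least two entries.
    """
    ratios = [load_per_trade(load, trades) for load, trades in history]
    mu, sigma = mean(ratios), stdev(ratios)
    current = load_per_trade(cpu_load, trade_count)
    if sigma == 0:
        return current > mu  # no historical spread: any increase stands out
    return (current - mu) / sigma > z_limit


# Hypothetical stored history: load scales roughly linearly with trade volume.
history = [(20.0, 1000), (41.0, 2000), (59.0, 3000), (82.0, 4000)]

normal_reading = is_unusual(history, cpu_load=30.0, trade_count=1500)
after_upgrade = is_unusual(history, cpu_load=45.0, trade_count=1000)
```

In this made-up data the second reading is flagged even though 45% CPU is not alarming on its own, because it is more than double the load the same trade volume used to generate. That is the signal that lets the problem be caught before a busy trading day turns it into an outage.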

So: which is better, a data lake or a data stream? The answer is both. Businesses need to process and analyse data at ever greater volume and speed, from a growing number of sources, as it arrives in a stream. They also need to be able to access and analyse that data easily and quickly from a data lake. Historically, the problem has been that standard tooling doesn’t easily allow for mixing these two paradigms, but the world is changing!
