This is a contributed article by Casber Wang.
The promise of analysing massive datasets to unlock greater customer insights, answer billion-dollar questions, and fuel deeper analytics and AI initiatives has many businesses drooling.
But to realise these promising advancements, enterprises must first wrangle disparate data sources, both structured and unstructured, and in multiple formats, to fuel these insights. And that’s no simple task.
Over the past 20 years, a series of technologies have promised to solve this problem and failed. Chief among them was Hadoop in the mid-2000s.
Before Hadoop, the only option was resource-heavy on-premise databases that required companies to carefully model their data, manage storage, assess the data's value and figure out how it all connected.
Hadoop, by contrast, advocated an open data ecosystem made up of data lakes, open data standards, modular best-of-breed software stacks and competing data management vendors driving value for customers.
While the Hadoop movement, and Apache-type projects, pushed the idea of an open data ecosystem forward, it ultimately stumbled for three reasons:
- The cost of purchasing, scaling and managing hardware was too expensive
- A lack of common data formats between applications and data lakes made managing and using data difficult
- Insufficient tools and skills available to manage data
Despite Hadoop’s underachievement, open data is making a comeback. And this time around, a new breed of open data ecosystem technologies is overcoming Hadoop’s shortcomings to capture the full scope of data within a company.
But why now? Four key technology trends are driving the open data ecosystem resurgence, and this time it’s here to stay.
1. The rise of cloud storage
The rapid increase of cloud data storage – Amazon S3, Azure Data Lake Storage (ADLS) and Google Cloud Storage (GCS) – means companies can house structured and unstructured data lakes at scale.
First-generation systems required large capital to build on-prem compute and storage systems, which were costly to maintain and even more expensive to scale.
But cloud storage removed expensive on-premise hardware from the data storage equation, instead introducing resource-based pricing so companies only pay for the storage they use. And as the price drops, cloud storage services are becoming the default landing pads for data, often becoming the systems of record.
For today’s enterprise, a shift towards the cloud’s predictable performance and elasticity is the key to unlocking data capabilities like accelerated querying, avoiding copies, and improving oversight and management of data lakes.
2. Prevailing open-source data formats
More companies are adopting open data formats to make data compatible across programming languages and implementations.
Open-source data formats like Apache Parquet (columnar data storage), Apache Arrow (in-memory format for analytics, artificial intelligence and machine learning) and Apache Iceberg (table format/transaction layer) mean companies can use their data across all their current and future tools, rather than being locked into vendors with proprietary or incompatible formats.
With open and immediately usable formats, companies can store massive amounts of data and run associated business analytics and AI workloads directly – without lengthy and expensive software implementations that require data transformation.
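To make the columnar idea behind formats like Apache Parquet concrete, here is a toy sketch in plain Python (not Parquet itself, and deliberately free of any real library): an analytic aggregate over one column only has to touch that column's values, whereas a row-oriented layout forces a scan through every field of every record. The data values are invented for illustration.

```python
# Toy illustration of row-oriented vs column-oriented storage.
# Columnar formats such as Apache Parquet store each field contiguously,
# so an aggregate over one column reads only that column.

# Row-oriented: each record stored together (like a classic OLTP database).
rows = [
    {"user_id": 1, "country": "UK", "spend": 120.0},
    {"user_id": 2, "country": "US", "spend": 80.0},
    {"user_id": 3, "country": "UK", "spend": 45.5},
]
total_row_oriented = sum(r["spend"] for r in rows)  # touches every record

# Column-oriented: each field stored as its own contiguous array.
columns = {
    "user_id": [1, 2, 3],
    "country": ["UK", "US", "UK"],
    "spend": [120.0, 80.0, 45.5],
}
total_columnar = sum(columns["spend"])  # scans only the one column needed

assert total_row_oriented == total_columnar == 245.5
```

Real columnar formats add compression, encoding and per-column statistics on top of this layout, which is what makes large analytic scans cheap.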
This is an especially tantalising proposition for today’s enterprises, as API-driven ‘plug and play’ data analysis and AI tools like H2O.ai and DataRobot* become fast and easy to implement, letting teams see results quickly.
3. The growth of cloud-native vendor support
In the mid-2000s, Hadoop let companies indiscriminately dump data into lakes without worrying about schema, consumption and management.
Companies raced to collect more data without considering architecture design, access, analytics or sustainability. These companies didn’t know what was in their data lake, let alone how to manage it or extract value from it. With tools that could address this problem yet to emerge, these data lakes turned into data swamps.
But today a plethora of vendors and tools have popped up to help enterprises handle specific data management challenges. There is a fast-growing data management landscape, with more solutions appearing across data streaming, transformation, observability and quality, governance, and consumption for end users.
Engines like Dremio* and Trino run SQL queries directly against cloud data lakes. Technologies from companies such as Segment* and Matillion* ingest data and write it into open formats. And platforms like Airflow, Prefect and Dagster handle data orchestration.
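The core job an orchestrator like Airflow, Prefect or Dagster performs is running pipeline tasks in dependency order. A minimal sketch of that idea, using only Python's standard library (the task names are hypothetical, and this is not a real Airflow DAG):

```python
from graphlib import TopologicalSorter

# A toy pipeline expressed as a DAG: each key runs only after all of its
# listed dependencies have completed. Task names are invented examples.
dag = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_check": {"transform"},
    "publish": {"transform", "quality_check"},
}

# Resolve a valid execution order for the whole pipeline.
order = list(TopologicalSorter(dag).static_order())

# Ingestion always comes first; publishing always comes last.
assert order[0] == "ingest" and order[-1] == "publish"
```

Production orchestrators layer scheduling, retries, backfills and observability on top of exactly this kind of dependency graph.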
As these vendors emerge and compete to bring value to customers, operating in the open data ecosystem only becomes easier.
For enterprises deciding which technologies will drive their data infrastructure, established vendors and best-of-breed startups each have advantages and disadvantages. In choosing the right path for your business, consider these two differences between solution providers:
- Established vendors generally offer better on-premise compatibility, though they lack the functionalities of best-of-breed startup tools
- Best-of-breed startup solutions generally have deep functionalities in one area but are less mature in features that help meet enterprise requirements like security and governance
4. Applications are meeting users at the right altitude
Data analysts, scientists and business users have little interest in the under-the-hood data workings of manual schema changes, resource provisioning and database management, which were requirements for first-gen open data ecosystems.
Today, vertically integrated tools are designed with abstraction built in to help end users operate at the insights level they crave.
As applications continue to evolve, and businesses diversify their data capabilities, more sophisticated users will seek out greater flexibility and depth that lets them go one layer deeper.
Open data ecosystems are the long game
Just as these four trends have powered the open data ecosystem’s revival, they have also fuelled the rise of proprietary cloud data warehouses such as Snowflake*.
Some argue that Snowflake’s approach – a single data warehouse encompassing every workload – is the only path forward. However, over time, just as application development is shifting from monolithic architectures to API-driven microservices architectures, we’ll likely see data analytics workloads gradually shift from proprietary data warehouses to open data architectures, too.
It’s an exciting time for open data as it has become more accessible to enterprises than ever before. With a full range of technologies – from cloud data lakes to data management to open data formats – that address Hadoop’s initial shortcomings, companies are now finally equipped to capture and use the full scope of data within their organisations, bringing the big data promise to life.
DataRobot, Dremio, Matillion and Segment are current or exited Sapphire Investments. Snowflake is a Sapphire investment at IPO.
Casber Wang is a VP at Sapphire Ventures, focusing on security, enterprise infrastructure and data & analytics. He is a board observer at Tetrate, Uptycs and Verbit. Prior to Okta’s acquisition, he was also a board observer at Auth0. In 2020, Business Insider listed Wang as an Enterprise VC Rising Star Investor. Prior to Sapphire, he was part of the technology investment banking group at Bank of America Merrill Lynch.