This is a contributed article by Casber Wang.

The promise of analysing massive data to unlock greater customer insights, solve billion-dollar questions, and fuel deeper analytics and AI initiatives has many businesses drooling.

But to realise these promising advancements, enterprises must first wrangle disparate data sources, both structured and unstructured, and in multiple formats, to fuel these insights. And that’s no simple task.

Over the past 20 years, a series of technologies have promised to solve this problem and failed. Chief among them was Hadoop in the mid-2000s.

Before Hadoop, the only option was resource-heavy on-premise databases that required companies to carefully model their data, manage storage, evaluate its value and figure out how it all connected.

Instead, Hadoop advocated an open data ecosystem made up of data lakes, open data standards, modular best-of-breed software stacks and competitive data management vendors that drive value for customers.

While the Hadoop movement and related Apache projects pushed the idea of an open data ecosystem forward, the movement ultimately stumbled for three reasons:

  • Purchasing, scaling and managing hardware was prohibitively expensive
  • A lack of common data formats between applications and data lakes made data difficult to manage and use
  • Too few tools, and too few people with the right skills, were available to manage the data

Despite Hadoop’s underachievement, open data is making a comeback. And this time around, a new breed of open data ecosystem technologies is overcoming Hadoop’s shortcomings to capture the full scope of data within a company.

But why now? Four key technology trends are driving the open data ecosystem resurgence, and this time it’s here to stay.

1. The rise of cloud storage

The rapid increase of cloud data storage – Amazon S3, Azure Data Lake Storage (ADLS) and Google Cloud Storage (GCS) – means companies can house structured and unstructured data lakes at scale.

First-generation systems required large capital to build on-prem compute and storage systems, which were costly to maintain and even more expensive to scale.

But cloud storage removed expensive on-premise hardware from the data storage equation, instead introducing resource-based pricing so companies only pay for the storage they use. And as the price drops, cloud storage services are becoming the default landing pads for data, often becoming the systems of record.

For today’s enterprise, a shift towards the cloud’s predictable performance and elasticity is the key to unlocking capabilities such as faster querying, fewer redundant data copies, and better oversight and management of data lakes.

2. Prevailing open-source data formats

More companies are adopting open data formats to make data compatible across programming languages and implementations.

Open-source data formats like Apache Parquet (columnar data storage), Apache Arrow (in-memory format for analytics, artificial intelligence and machine learning) and Apache Iceberg (table format/transaction layer) mean companies can use their data across all their current and future tools, rather than being locked into vendors with proprietary or incompatible formats.
