The Data Lake Debate: Pro Cross-Examines Con

Tamara Dull
Last updated: April 6, 2015 5:18 am

As is to be expected, Anne, your arguments against building a data lake are both persuasive and passionate. You’ve made some great points, my friend, but you’re making this way too easy for me. Before I jump into my rebuttal [my next post], I’d like to clarify a few things that you brought up. I’ve boiled it down to three questions. What say you?

1. In your arguments, you focus on data volumes and the ancillary costs of the open source software (OSS) needed to support these large volumes. Yet more recent studies show that organizations are less concerned about their data volumes (not everyone is a Google or Facebook) than about the variety of their data and the ability to integrate it all. How do you address these concerns?

I cannot stress enough that data brought into a data lake is co-located, not integrated. Even with schema on read, the integration happens outside of the storage environment – on the banks of this beautiful data lake. Every query that requires a new structure or schema for the data will need to be written from scratch. For most organizations, the value returned on the time and talent this extensive coding requires (for a still-novel technology) is limited, if not nonexistent. The skills needed to access and integrate data from Hadoop are specialized, which makes available talent scarce. You are right, not everyone is a Google or a Facebook. Organizations do not have these skills on staff, nor do they have the budget to bring them on.

Hadoop does provide a fantastic data storage opportunity, but it does not require us to abandon all of our existing structured data environments. Copying existing structured data to a data lake (especially transactional data) would be a duplication of effort and storage and would create additional risk for the organization. Moving operational data would be an enormous event, as it would require applications throughout the organization to undergo a significant coding and design overhaul, which is not going to be a popular idea in any business unit.

The ideal scenario is to leave existing data where it lives today and use Hadoop as the storage repository for the data that previously could not be stored because of constraints presented by volume, variety or velocity. Organizations can take advantage of data virtualization tools, which not only eliminate the integration coding challenge but also bring other advantages such as centralized security and governance. The data is queried, transformed and structured as needed and provisioned to business users through virtual views. No dumping of data – just purposeful access, integration and use.
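
To make "schema on read" and "virtual views" a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not any vendor's data virtualization product: a list of JSON strings stands in for raw files in Hadoop, an in-memory SQLite table stands in for an existing structured system, and every name (`RAW_CLICKSTREAM`, `read_with_schema`, `time_on_site_view`, the sample customers) is hypothetical.

```python
import json
import sqlite3

# Hypothetical raw, schema-less records as they might sit in a Hadoop-style store.
RAW_CLICKSTREAM = [
    '{"customer_id": 1, "page": "/pricing", "ms_on_page": 5400}',
    '{"customer_id": 2, "page": "/docs"}',
    '{"customer_id": 1, "page": "/signup", "ms_on_page": 1200}',
]


def read_with_schema(raw_lines):
    """Schema on read: structure is imposed only when the data is queried."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            "customer_id": int(record["customer_id"]),
            "page": str(record["page"]),
            # A missing field gets a default chosen by this particular query.
            "ms_on_page": int(record.get("ms_on_page", 0)),
        }


def build_customer_db():
    """Stand-in for an existing structured system that stays where it is."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
    )
    return conn


def time_on_site_view(conn, raw_lines):
    """A 'virtual view': joins both sources at query time, copying neither."""
    names = dict(conn.execute("SELECT id, name FROM customers"))
    totals = {}
    for event in read_with_schema(raw_lines):
        name = names[event["customer_id"]]
        totals[name] = totals.get(name, 0) + event["ms_on_page"]
    return totals


if __name__ == "__main__":
    print(time_on_site_view(build_customer_db(), RAW_CLICKSTREAM))
```

Note that the field names, types and defaults live in `read_with_schema`, not in the stored data. A query needing a different shape would need its own reader, which is the hand-coding cost described above; a virtualization layer aims to centralize that logic rather than remove it.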

2. Related to the first question, you state: “Before organizations start down the path of discovering capabilities within a data lake, they should first turn to taking full advantage of their current data.” What if most of their current data is semi-structured or unstructured (often cited as being as much as 80-90% of all data)? How do they take full advantage of that data?

Who’s the one making this easy? Careful throwing those stones, Ms. Dull. Your glass house is exquisite.

Historically, in business, unstructured data sources were managed within the scope of knowledge management or content management. The vast storage capabilities that Hadoop presents allow documents, emails and other unstructured sources to be centrally stored, and their content is now considered accessible data. While it is true that these sources can now be accessed through Hadoop to glean their content as ingestible data, it is not the storage and access that bring the advantage. The advantage is in the insights derived from the analysis of the data. Regardless of the type of data (structured, semi-structured or unstructured) or how and where the data is stored, organizations can take full advantage of any and all data by generating value when processing or analyzing it within a specific business context.

3. You seem to suggest a top-down data management approach to big data; for example, “…the real success factor is found in strong data management capabilities under the umbrella of a mature data governance program.” Are you implying a top-down approach to big data? When does a bottom-up approach make sense?

There is a time and a place for both data science and data governance. They do not need to be mutually exclusive. The rigor of data governance is meant not to create obstacles but to foster data management autonomy at the lowest level, within the framework of the enterprise data governance program. When it comes to data discovery, governance still has value: it protects the organization from compliance and security risks that arise not from the data itself but from how the data is used. I emphatically support innovation labs and data science programs – they are ideal examples of bottom-up approaches. However, just because they play in the sandbox doesn’t mean they don’t follow playground rules.

Thanks, Anne! I’ll get started on my first rebuttal to what you’ve presented. Stay tuned!

Previously in the Data Lake Debate:

  • The Introduction – by Jill Dyche
  • Pro’s Up First – by Tamara Dull
  • Questioning the Pro – by Anne Buff and Tamara Dull
  • Negative Puts a Stake in the Ground – by Anne Buff