It’s data, Jim, but not as we know it – Part 1: What the echo of the Big Bang tells us about the nature of information

TeradataEMEA
Last updated: July 27, 2009 9:02 pm

Possibly I am just turning into a grumpy old man in my middle-age, but there are two words that when used together annoy me beyond almost all reason – yes, even more than the “p-word” that has featured in two of my previous posts: “unstructured” and “data.”

Despite what some vendors – and some commentators, who really should know better – would have you believe, there is nothing remotely formless or “unstructured” about “new” types of data, like image files, audio files, text-based documents, XML documents and so on. Of course for the most part these data hardly qualify as “new,” either, but don’t indulge my pedantry by getting me started down that road.

Data is merely information that has been encoded in some way and the only truly “unstructured data” is “noise”; random signals, representative of nothing much more than a system in equilibrium with its environment. A picture, a song, the complete works of Shakespeare – these are all forms of information and they are emphatically not “unstructured.”
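One rough way to see the difference between information and noise is to ask a compressor to look for regularity in each. The sketch below is only an illustration, not a measure of meaning: it uses Python's standard `zlib` module, and the sample sentence, its repetition count and the helper name `compression_ratio` are assumptions made for the example.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size.

    A lower ratio means the compressor found more regularity to exploit.
    """
    return len(zlib.compress(data, 9)) / len(data)


# Hypothetical sample inputs, chosen only for illustration.
text = ("To be, or not to be, that is the question. " * 200).encode("utf-8")
noise = os.urandom(len(text))  # random bytes of the same length

print(f"text : {compression_ratio(text):.3f}")
print(f"noise: {compression_ratio(noise):.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the random bytes barely shrink at all. Compressibility is a crude proxy, though: a signal that looks random to a general-purpose compressor may still carry information for someone who knows how to read it.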

To see the truth of this, take, for example, a GIF file (make sure that it is one that you don’t much care about, or a copy of one that you do) and open it with a text editor. Now mess with and/or delete some of the bytes at random, save the adulterated file and then try to open it with your normal picture editing or viewing software.

In fact a GIF file is highly structured. It begins with meta-data that includes, for example, the width and height of the image in pixels; a colour table; and, for animated images, blocks that control the timing of each frame. All this meta-data is followed by the compressed image data itself and an end-of-file marker. Monkey with this file structure and you risk reducing the value of the data that it contains to peanuts; monkey with the actual data payload and you likewise either corrupt the file so that it can’t be read or change it so that it represents a different or a degraded image. Repeat this experiment with just about any multimedia file type and you will get the same result – either a corrupt file that cannot be read correctly or one that is no longer an accurate representation of the original object. These data are not only structured; the nature of that structure is critical to their correct interpretation.
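For readers who would rather not wreck a picture by hand, the structure can also be read programmatically. The following is a minimal sketch, assuming the fixed 13-byte layout at the start of a GIF (a 6-byte signature and version followed by the 7-byte logical screen descriptor). The function name `read_gif_header` and the hand-written 1×1 sample image are assumptions made for illustration; the sketch does not decode the image data.

```python
import struct

# A hand-written 1x1 GIF used purely as sample input (an assumption for this sketch).
SAMPLE_GIF = bytes.fromhex(
    "474946383961"          # signature and version: "GIF89a"
    "01000100800000"        # width=1, height=1, packed flags=0x80, background, aspect
    "ffffff000000"          # global colour table with two RGB entries
    "2c000000000100010000"  # image descriptor: position, size, flags
    "0202440100"            # compressed image data plus block terminator
    "3b"                    # trailer byte marking the end of the file
)


def read_gif_header(data: bytes) -> dict:
    """Parse the signature and logical screen descriptor of a GIF byte string."""
    if len(data) < 13:
        raise ValueError("too short to be a GIF")
    signature, version = data[:3], data[3:6]
    if signature != b"GIF" or version not in (b"87a", b"89a"):
        raise ValueError("not a GIF file")
    # Little-endian: width, height (16-bit each), then three single bytes.
    width, height, packed, _background, _aspect = struct.unpack("<HHBBB", data[6:13])
    has_table = bool(packed & 0x80)  # top bit: a global colour table is present
    entries = 2 ** ((packed & 0x07) + 1) if has_table else 0
    return {
        "version": version.decode("ascii"),
        "width": width,
        "height": height,
        "colour_table_entries": entries,
        "ends_with_trailer": data[-1:] == b"\x3b",
    }


print(read_gif_header(SAMPLE_GIF))

# Damage a single byte of the signature and the structure is no longer recognised.
corrupted = b"X" + SAMPLE_GIF[1:]
try:
    read_gif_header(corrupted)
except ValueError as err:
    print("corrupted file rejected:", err)
```

Changing one byte of the signature is enough for the parser to reject the file, which is the same effect as the text-editor experiment described above.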

And of course it’s not just the “wrapper” that has structure; the structure of the data itself is critical. Most people would interpret the statement “Dave didn’t marry Sue because she was rich” as meaning that Dave and Sue were married, but that Dave’s motivation for their union was not financial. Conversely, the statement that “Dave didn’t marry Sue, because she was rich” would probably be interpreted as meaning that Dave and Sue did not marry and that it was the difference in their circumstances that got in the way. A single structural element – one comma – makes a big difference to our interpretation of the “same” data. Suppose that during their courtship Dave tells Sue “I love you”; the structure of this sentence is identical to the structure of the sentence “I want you” (subject-verb-object, I think, but if I am mistaken and there are any linguists out there reading this, please feel free to correct me), but the two statements may or may not be synonymous (although I hear that Dave is a good guy, so perhaps we should give him the benefit of the doubt).

In fact, even apparently random noise can convey meaning. Tune a radio telescope to the microwave range of the electromagnetic spectrum and you will hear a faint hum, directionally uniform to 1 part in 500. This is quite literally a distant reverberation of the “Big Bang” in which the Universe was created and which confirms that the Universe was indeed once hot-and-dense, as the Big Bang theory demands that it must have been. That’s important information, as historically there have been other theories of the origin of the Universe that don’t assume an explosive beginning.

From measurements of the cosmic microwave background radiation, as it is called, physicists and astronomers are able either to infer or to calculate directly many other essential truths about the Universe, including the speed at which our galaxy is moving (600 kilometres per second towards the constellation of Leo, in case this answer is one day all that stands between you and the “Who Wants to Be a Millionaire?” prize money). It turns out that there is an awful lot of important information encoded in that apparently random noise.

Back on Earth, less exotic, “new” types of data are increasingly interesting to the commercial and government organizations that most of us serve. We should probably call these “multimedia data”, “non-record based data” or “non-relational data”. Actually, I’m not crazy about “non-relational” either; whilst this data is typically not relational in the accepted sense – the ordering of the bytes that define the bitmap in a GIF file is important, for example – this data can, after all, be accommodated in tables in a relational database using BLOB and CLOB objects. So long as we regard these objects themselves as atomic, it seems to me these data are as relational as any other attribute of an entity. Things clearly get more complex if we want to examine or “query” the objects themselves (“select all of the pictures in which the sky is red”), but let’s not go there for now.
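To make the BLOB point concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `product_image`, its columns and the helper `round_trip_blob` are hypothetical, chosen only for illustration; the database treats the image bytes as one opaque, atomic value alongside the ordinary attributes of the row.

```python
import sqlite3


def round_trip_blob(payload: bytes) -> tuple:
    """Store bytes in a BLOB column and read them back with their stored length."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: the image is just another attribute of the entity.
        conn.execute(
            "CREATE TABLE product_image ("
            "product_id INTEGER PRIMARY KEY, caption TEXT, image BLOB)"
        )
        conn.execute(
            "INSERT INTO product_image (product_id, caption, image) VALUES (?, ?, ?)",
            (1, "sunset over the bay", payload),
        )
        # The database can return the object and report its size, but it knows
        # nothing about what the bytes depict.
        return conn.execute(
            "SELECT caption, length(image), image FROM product_image "
            "WHERE product_id = ?",
            (1,),
        ).fetchone()
    finally:
        conn.close()


sample = b"GIF89a" + bytes(29)  # placeholder bytes standing in for a real image
caption, size, image = round_trip_blob(sample)
print(caption, size, image == sample)
```

Selecting rows by caption or size is straightforward here; selecting the pictures in which the sky is red would need logic that looks inside the object, which is exactly the complication noted above.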

My recent travelling companion and the main attraction on the “CTO Road Show” that we took on tour across the EMEA region in June – Teradata CTO Stephen Brobst – refers to “non-traditional data types” versus “record-based” or “square” data. These are definitions that I can live with. And I’m sure that engineering PhD Stephen will sleep easier for knowing that the flunky from marketing considers his use of technical vocabulary to be correct and not in the least aggravating!

 

Martin Willcox
