Data Discovery


Data discovery is the process of determining and accurately inventorying the data under an organization’s control.

  • can refer to:
    • initial inventory of data
    • response to e-discovery
    • etc.

Discovery Methods

Label-based Discovery

  • data labels help discover data efficiently
    • labels must be accurate and sufficient
  • help find, collect and disclose all required data and only the required data
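
As a minimal sketch of label-based discovery, assume each item in an inventory carries a set of labels. The Record type, the label names, and the find_by_labels helper are all hypothetical and exist only for illustration. The last entry shows why labels must be accurate and sufficient: an unlabeled item is never returned.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical inventory entry: a name plus the labels attached to it."""
    name: str
    labels: set = field(default_factory=set)


def find_by_labels(records, required):
    """Return the records that carry every label in `required`."""
    required = set(required)
    return [r for r in records if required <= r.labels]


# Illustrative inventory; names and labels are made up for this sketch.
inventory = [
    Record("payroll_2023.csv", {"finance", "pii"}),
    Record("press_release.txt", {"public"}),
    Record("customer_list.csv", {"sales", "pii"}),
    Record("scratch_notes.txt"),  # unlabeled, so label-based discovery misses it
]

pii_records = find_by_labels(inventory, {"pii"})
print([r.name for r in pii_records])
```

Requiring every requested label (a subset test) rather than any of them keeps the result set narrow, which matches the goal of disclosing only the required data.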

Metadata-based Discovery

  • metadata is a listing of traits and characteristics about specific data elements or sets
  • often automatically created at the same time as the data
  • data discovery can use metadata to identify required data
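
One possible sketch of metadata-based discovery uses the metadata a file system records automatically (file extension, size, last-modified time) and never opens the files themselves. The discover_by_metadata function, the file names, and the one-day cutoff are assumptions chosen for the demo; only the standard-library calls (os.walk, os.stat, tempfile) are real.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone


def discover_by_metadata(root, extensions, modified_after):
    """Yield (path, size, mtime) for files matching extension and age criteria."""
    cutoff = modified_after.timestamp()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            info = os.stat(path)  # metadata only; file contents are not read
            ext = os.path.splitext(filename)[1].lower()
            if ext in extensions and info.st_mtime >= cutoff:
                mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                yield path, info.st_size, mtime


# Demo against a throwaway directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as root:
    for name in ("report.csv", "notes.txt", "export.CSV"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("placeholder")
    since = datetime.now(timezone.utc) - timedelta(days=1)
    found = sorted(
        os.path.basename(path)
        for path, _size, _mtime in discover_by_metadata(root, {".csv"}, since)
    )

print(found)
```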

Content-based Discovery

  • discovery tools can be used to identify data by their contents
  • can range from basic searches to sophisticated pattern matching
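
A small sketch of content-based discovery using pattern matching is shown here. The two regular expressions are deliberately simplified assumptions (a loose email pattern and a US Social Security number layout), and the PATTERNS table and scan_text helper are hypothetical. Real discovery tools typically use far more careful patterns and validation.

```python
import re

# Simplified, illustrative patterns; not production-grade detectors.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_text(text):
    """Return {pattern_name: [matches]} for every pattern found in `text`."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


sample = "Contact jane.doe@example.com about claim 123-45-6789."
print(scan_text(sample))
```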

Structured, Semistructured, and Unstructured Data

Structured data is data that is sorted according to meaningful, discrete types and attributes.

  • data in relational databases
  • easier to perform discovery on structured data

Unstructured data is data that is not sorted into discrete types and attributes.

  • e.g., content of emails

Semistructured data is data that uses tags or other elements to create fields and records within data without requiring the rigid structure that structured data relies on.

  • e.g.,
    • XML and JSON
  • MongoDB is a common semistructured database
  • easier to perform data discovery on than unstructured data
  • more challenging than structured data because of the flexibility of semistructured data
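
The sketch below illustrates why semistructured data sits between the two extremes. The tags (here, JSON keys) give discovery something to search for, but the same field can appear at different depths or be missing entirely, so the search has to walk the whole document. The find_fields helper and the sample documents are hypothetical.

```python
import json


def find_fields(node, wanted, path=""):
    """Recursively yield (path, value) for every key whose name is in `wanted`."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in wanted:
                yield child, value
            yield from find_fields(value, wanted, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_fields(item, wanted, f"{path}[{index}]")


# Three records with three different shapes, as a flexible schema allows.
documents = json.loads("""
[
  {"name": "Ada", "email": "ada@example.com"},
  {"name": "Grace", "contact": {"email": "grace@example.com", "phone": "555-0100"}},
  {"name": "Linus"}
]
""")

email_fields = list(find_fields(documents, {"email"}))
print(email_fields)
```

In a relational table the same question would be a query against one known column, which is the sense in which structured data is easier to discover.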

Data Location

  • location of data may cause issues for data discovery
  • laws and regulations may limit
    • types or methods of data discovery
    • what you can do with data
    • where and how you can store it
  • can create technical hurdles to discovery
    • data stored in unstructured form, or in a service that handles data in a way that makes discovery challenging to conduct
      • may need to design around these constraints

Data Analytics

  • data analytic systems can provide ways to perform data discovery
    • often they create new data feeds from sets of data that already exist within an environment
      • means you need to consider how to handle data labeling, classification, etc.
  • data analytics methods:
    • data mining
      • an outgrowth of the possibilities offered by regular use of the cloud, also known as big data
      • by collecting various data streams and running queries across them, can detect and analyze unknown trends and patterns
    • real-time analysis
      • tools can provide data mining functionality concurrently with data creation and use
    • business intelligence
      • state-of-the-art data mining involving recursive, iterative tools and processes that detect trends in trends and identify more oblique patterns in historical and recent data
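
As a toy illustration of running a query across several data streams, the sketch below intersects two event feeds to surface a pattern that neither feed shows on its own. The stream names, the events, and the users_active_in_both helper are invented for this example. Real data mining works on far larger data sets with dedicated tooling.

```python
from collections import Counter

# Two hypothetical event streams of (user, day) pairs.
web_logins = [
    ("alice", "2024-03-01"),
    ("bob", "2024-03-01"),
    ("alice", "2024-03-02"),
]
file_downloads = [
    ("alice", "2024-03-01"),
    ("alice", "2024-03-02"),
    ("carol", "2024-03-02"),
]


def users_active_in_both(stream_a, stream_b):
    """Count, per user, the days on which the user appears in both streams."""
    overlap = set(stream_a) & set(stream_b)
    return Counter(user for user, _day in overlap)


activity = users_active_in_both(web_logins, file_downloads)
print(activity)
```

Because the result is itself a new data set derived from existing ones, it would need its own labeling and classification, as noted in the list above.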