Data Discovery
Data discovery is the process of determining and accurately inventorying the data under an organization’s control.
- can refer to:
- initial inventory of data
- response to e-discovery
- etc.
Discovery Methods
Label-based Discovery
- data labels help discover data efficiently
- labels must be accurate and sufficient
- help find, collect and disclose all required data and only the required data
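As a rough sketch of label-based discovery, the snippet below filters a small in-memory inventory by label. The `DataAsset` record, the label names, and the `discover_by_label` helper are all hypothetical and chosen only for illustration. The unlabeled entry shows why labels must be accurate and sufficient.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """A hypothetical inventory entry; the fields are illustrative only."""
    name: str
    labels: set = field(default_factory=set)


def discover_by_label(assets, required_labels):
    """Return the assets that carry every label in required_labels."""
    required = set(required_labels)
    return [asset for asset in assets if required <= asset.labels]


inventory = [
    DataAsset("payroll_2023.csv", {"pii", "finance"}),
    DataAsset("press_release.docx", {"public"}),
    DataAsset("customer_export.json", {"pii"}),
    DataAsset("untitled_scan.pdf"),  # unlabeled, so label-based discovery misses it
]

matches = discover_by_label(inventory, ["pii"])
match_names = [asset.name for asset in matches]
print(match_names)
```

If `untitled_scan.pdf` actually contains personal data, this search would not disclose it, which is the failure mode that accurate labeling is meant to prevent.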
Metadata-based Discovery
- metadata is a listing of traits and characteristics about specific data elements or sets
- often automatically created at the same time as the data
- data discovery can use metadata to identify required data
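One possible illustration of metadata-based discovery is a scan that looks only at file-system metadata (extension and modification time) and never opens the files. The `discover_by_metadata` helper, the file names, and the 30-day cutoff are assumptions made for this sketch, not part of any particular tool.

```python
import os
import tempfile
import time


def discover_by_metadata(root, extensions, modified_after):
    """Walk root and return paths whose file-system metadata matches.

    Only metadata (extension and modification time) is inspected;
    file contents are never read.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            info = os.stat(path)
            ext = os.path.splitext(filename)[1].lower()
            if ext in extensions and info.st_mtime >= modified_after:
                hits.append(path)
    return sorted(hits)


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as root:
    for name in ("report.csv", "notes.txt", "old_export.csv"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("placeholder\n")

    # Backdate one file by roughly a year to exercise the time filter.
    year_ago = time.time() - 365 * 24 * 3600
    os.utime(os.path.join(root, "old_export.csv"), (year_ago, year_ago))

    cutoff = time.time() - 30 * 24 * 3600  # modified within the last 30 days
    found = discover_by_metadata(root, {".csv"}, cutoff)
    found_names = [os.path.basename(path) for path in found]
    print(found_names)
```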
Content-based Discovery
- discovery tools can be used to identify data by their contents
- can range from basic searches to sophisticated pattern matching
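A minimal sketch of content-based discovery using pattern matching follows. The two regular expressions are deliberately loose, illustrative patterns (an email address and a US Social Security number-like string), and the document names and contents are invented; a real discovery tool would likely apply stricter validation.

```python
import re

# Illustrative patterns only; they are assumptions, not a vetted rule set.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_text(text):
    """Return {pattern_name: [matches]} for every pattern that hits."""
    results = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            results[name] = found
    return results


documents = {
    "memo.txt": "Contact jane.doe@example.com about the audit.",
    "form.txt": "Applicant ID 123-45-6789 was received.",
    "lunch.txt": "Menu for Friday: soup and salad.",
}

findings = {name: scan_text(body) for name, body in documents.items()}
for name, hits in findings.items():
    print(name, hits)
```

A basic search would be a plain substring test on the same text; the regular expressions stand in for the more sophisticated end of the range.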
Structured, Semistructured, and Unstructured Data
Structured data is data that is sorted according to meaningful, discrete types and attributes.
- data in relational databases
- easier to perform discovery on structured data
Unstructured data is data that is not sorted into meaningful, discrete types and attributes.
- e.g., content of emails
Semistructured data is data that uses tags or other elements to create fields and records within data without requiring the rigid structure that structured data relies on.
- e.g., XML and JSON
- MongoDB is a common semistructured database
- easier to perform data discovery on than unstructured data
- more challenging than structured data due to the flexibility of semistructured data
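The flexibility of semistructured data can be illustrated with a short sketch: the same field may appear at different depths, or not at all, from one record to the next, so discovery has to walk each record rather than query a fixed column. The `find_keys` helper and the sample records are hypothetical.

```python
import json


def find_keys(node, wanted, path=""):
    """Recursively yield (path, value) for every key named in wanted."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            if key in wanted:
                yield child_path, value
            yield from find_keys(value, wanted, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_keys(item, wanted, f"{path}[{index}]")


# Made-up records: the tags give fields, but the layout varies per record.
raw = """
[
  {"id": 1, "email": "a@example.com"},
  {"id": 2, "contact": {"email": "b@example.com", "phone": "555-0100"}},
  {"id": 3, "notes": "no contact details"}
]
"""

records = json.loads(raw)
hits = list(find_keys(records, {"email", "phone"}))
for location, value in hits:
    print(location, "=", value)
```

In a relational table the equivalent search would be a single query against a known column, which is why structured data is the easiest case.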
Data Location
- location of data may cause issues for data discovery
- laws and regulations may limit:
- types or methods of data discovery
- what you can do with data
- where and how you can store it
- can create technical hurdles to discovery
- data stored in unstructured form or in a service that handles data in a way that makes discovery challenging to conduct
- may need to design around these constraints
Data Analytics
- data analytic systems can provide ways to perform data discovery
- often they create new data feeds from sets of data that already exist within an environment
- means you need to consider how to handle data labeling, classification, etc.
- data analytics methods:
- data mining
- is an outgrowth of the possibilities offered by regular use of the cloud, aka big data
- by collecting various data streams and running queries across them, can detect and analyze unknown trends and patterns
- real-time analysis
- tools can provide data mining functionality concurrently with data creation and use
- business intelligence
- state-of-the-art data mining involving recursive, iterative tools and processes that detect trends in trends and identify more oblique patterns in historical and recent data
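As a toy illustration of the data mining idea above (collecting several data streams and querying across them to surface patterns), the sketch below counts which event types co-occur for the same user. The streams, user names, event names, and the `mine_patterns` helper are all invented for this example; real analytics platforms work at a very different scale.

```python
from collections import Counter

# Two hypothetical data streams, already collected as (user, event) tuples.
login_stream = [
    ("alice", "login_failed"), ("bob", "login_ok"),
    ("alice", "login_ok"), ("carol", "login_failed"),
]
download_stream = [
    ("alice", "bulk_download"), ("bob", "single_download"),
    ("carol", "bulk_download"),
]


def mine_patterns(*streams):
    """Count how often each pair of event types co-occurs for one user."""
    events_by_user = {}
    for stream in streams:
        for user, event in stream:
            events_by_user.setdefault(user, set()).add(event)

    pair_counts = Counter()
    for events in events_by_user.values():
        ordered = sorted(events)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pair_counts[(first, second)] += 1
    return pair_counts


pair_counts = mine_patterns(login_stream, download_stream)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Note that the result is itself a new data set derived from existing ones, so it would need its own labeling and classification.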