Use Context-Aware Data Classification for a Robust Data Security Posture

Nov 21, 2023
Enterprises often interpret a data security mandate as identifying configuration issues or vulnerabilities in their data infrastructure. To improve security posture, though, data security activities must focus on protecting sensitive data assets such as customer information, trade secrets, financial information or patents. DSPM-based data classification offers a granular view that helps define adequate policies for the type, context and sensitivity of the data.

Typical labeling practices (public, internal, confidential, secret) fail to capture the nuances between types of data, such as the difference between R&D documents and customer payment information. In this blog post, we’ll present a set of data classification categories that can help you extract context from your data for richer and more accurate labeling.

Understanding Data Classification

Classification is the process of labeling and categorizing data based on the type of information it holds. Data classification helps organizations understand the value and sensitivity of their data, as well as the impact on the business if that data were exposed. This allows them to set more effective security policies.

Why Classification Is Key to Cloud Security

Data classification plays a major part in improving an organization’s security posture. It’s also explicitly required by some compliance frameworks and can help streamline other governance, risk and compliance (GRC) efforts, such as HIPAA, SOC 2 and ISO 27001. These benefits show up in several ways:

  • Granular security policies: Data classification helps organizations define security policies (such as access controls) specific to the data they need to secure.
  • Incident management: Classification helps businesses prioritize incidents that involve sensitive or valuable data over issues that involve non-sensitive data.
  • Compliance and regulation: Classification allows organizations to identify, categorize and apply appropriate controls around regulated data like PII, PHI and credit card details (PCI) to meet compliance requirements. During audits and regulatory reviews, classification provides the ability to demonstrate compliance by showing how regulated data is handled.
  • Data detection and response (DDR) accuracy: Once data is classified, organizations can implement more effective real-time monitoring for data incidents, highlighting cases where sensitive data is put at risk and requires immediate response from security teams.
  • Reduced attack surface: Organizations can reduce their attack surface by consolidating duplicated data and ensuring data is accessible in accordance with least-privilege principles.
  • Prioritization: Not all data is created equal. Classification enables overworked security teams to focus their efforts on the data assets that would have the largest impact in case of a breach or compliance violation.

Typical Data Classification Challenges in the Modern Enterprise

Data classification is only effective if carried out consistently at a company level. Today’s complex data infrastructure means that data is often left unclassified or inadequately classified, rendering downstream policies ineffective.

Common Data Classification Challenges

Data Fragmentation

It’s challenging to discover and monitor every repository where data needs to be classified when data is spread across services in hybrid environments (cloud-based or on-premises databases, big data platforms, data lakes, collaboration systems).

Use of Unstructured Data

While structured data is queryable, its unstructured counterpart (documents, media files, PDFs and emails) requires more resources and frequent manual intervention to classify.

Shadow Data

The cloud’s elasticity, which lets developers spin services up and down with minimal friction, is a key reason data ends up unknown, undiscovered and, implicitly, unclassified.

Mergers and Acquisitions

Differences in security policies, classification practices and IT architectures between two distinct business entities result in inconsistent classification and inadequate policy enforcement.

How to Classify? 5 Categories for Data Classification

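To define rich and comprehensive security policies, data must be classified based on its type, context, subject and sensitivity, and those classifications must be reconciled with any labels already applied, such as Microsoft Information Protection (MIP) labels.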

1. Data Types

Data types are the most granular building block of classification to enable policy definition and enforcement. Some examples of data types include email addresses, social security numbers, country codes, payment card information, and the like. DSPM solutions will usually have pre-built classifiers or data types, as well as custom data types based on specific business needs.

It’s worth noting that data types can correctly classify data that would otherwise be difficult to identify with simple techniques like regular expressions. For example, not all nine-digit strings are Social Security numbers (SSNs), so a regular expression that flags every nine-digit string as an SSN will produce false positives. More advanced classification engines use context analysis, validation functions and ML/AI models to improve accuracy. This should be achieved with low resource consumption and high performance, without compromising on accuracy.
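As a rough illustration of the gap between a bare pattern match and a validated data type, the sketch below pairs a regular expression with a validation function and a simple context check. The function names, the keyword list and the 40-character window are assumptions made for this example; they do not describe how any particular DSPM product works. The structural checks rely on number ranges that are never issued as SSNs.

```python
import re

# Nine digits, optionally written as AAA-GG-SSSS. The same separator
# (hyphen, space or nothing) must be used in both positions.
SSN_PATTERN = re.compile(r"(?<!\d)(\d{3})([- ]?)(\d{2})\2(\d{4})(?!\d)")

# Hypothetical keywords that raise confidence when found near a match.
CONTEXT_KEYWORDS = ("ssn", "social security")


def is_plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Validation function: reject ranges that are never issued as SSNs."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    if group == "00" or serial == "0000":
        return False
    return True


def find_ssn_candidates(text: str, window: int = 40) -> list:
    """Return validated matches with a coarse confidence from nearby context."""
    results = []
    for match in SSN_PATTERN.finditer(text):
        area, _, group, serial = match.groups()
        if not is_plausible_ssn(area, group, serial):
            continue  # matched the pattern but failed validation
        start = max(0, match.start() - window)
        nearby = text[start:match.end() + window].lower()
        has_context = any(keyword in nearby for keyword in CONTEXT_KEYWORDS)
        results.append({
            "value": match.group(0),
            "confidence": "high" if has_context else "low",
        })
    return results
```

A pattern-only scan would report every nine-digit order or tracking number. Here, candidates that fail validation are dropped, and those without supporting context are kept but marked as low confidence for further review.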

2. Context

Simply labeling data by its type isn’t enough to derive appropriate policies, because some data types require different policies depending on the business context. An email address, for example, requires different policies depending on who it belongs to and how it’s used. It can be associated with an employee or a customer, belong to someone from the US or the EU, or use either a generic domain name or a sensitive one.

Organizations can determine the context surrounding a data point by identifying metadata (e.g., timestamps, format, location) and by enriching the data, for example by comparing it against other sources such as a CRM or ERP system.
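A minimal sketch of this kind of enrichment, assuming two in-memory lookups that stand in for an HR system and a CRM, might look like the following. The addresses, field names and the short country list are hypothetical placeholders.

```python
# Hypothetical stand-ins for lookups against an HR system and a CRM.
EMPLOYEE_EMAILS = {"j.doe@example.com"}
CRM_CUSTOMERS = {"a.smith@example.org": {"country": "DE"}}

# Deliberately incomplete; a real list would cover every EU member state.
EU_COUNTRIES = {"DE", "FR", "IT"}


def enrich_email(email: str) -> dict:
    """Attach owner and region context to an email address data point."""
    key = email.strip().lower()
    if key in EMPLOYEE_EMAILS:
        return {"data_type": "email", "owner": "employee", "region": None}
    customer = CRM_CUSTOMERS.get(key)
    if customer is not None:
        region = "EU" if customer["country"] in EU_COUNTRIES else "non-EU"
        return {"data_type": "email", "owner": "customer", "region": region}
    return {"data_type": "email", "owner": "unknown", "region": None}
```

The same data type now carries enough context for a policy to tell an EU customer’s address apart from an employee’s.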

Enrichment can also provide context by associating two disparate data points to extract the true value and level of sensitivity. For example, a name and address together qualify as personally identifiable information and are subject to regulations such as GDPR. Add a credit card number, however, and the combination is also subject to the Payment Card Industry Data Security Standard (PCI DSS).
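The way combinations of data points change the applicable requirements can be sketched as a small rule table. The rules below are a simplified illustration of the example in the text, not legal guidance, and the field and rule names are assumptions.

```python
# Hypothetical rules: a framework applies when all of its required
# data types appear together in the same record or dataset.
FRAMEWORK_RULES = {
    "GDPR": {"name", "address"},
    "PCI DSS": {"credit_card_number"},
}


def applicable_frameworks(data_types: set) -> list:
    """Return the frameworks whose required data types are all present."""
    return sorted(
        framework
        for framework, required in FRAMEWORK_RULES.items()
        if required <= data_types  # subset test
    )
```

With these rules, a record holding only a name and address maps to GDPR, while adding a credit card number brings PCI DSS into scope as well.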

DSPM tools can automate the data classification process to identify and enrich data points with business, privacy and security attributes such as location, how the data was generated, modifications, residency, retention period and applicable laws.

3. Subject

Some kinds of sensitive data can’t be accurately identified by predefined data types. For example, a contract might not match a specific PII pattern but still be considered sensitive because it contains trade secrets or intellectual property.

Sensitive data may be created and stored in a variety of file formats. The file’s subject offers a great deal of information about the type of data it holds. For example, these can be contracts, resumes, hospital discharge forms, patents, IT architecture documents, and even database tables.

Defining policies according to file subjects is both intuitive and rich. For example, IT architecture documents are entirely reserved for senior IT staff, such as architects. These are also highly sensitive documents, and any leaks would pose major cybersecurity concerns.

One challenge in using file subjects to define security policies is the inconsistency of naming conventions. For example, job applications may have associated files that can take multiple forms, such as ‘FirstName-LastName-Resume’ or ‘FirstName-LastName-CV,’ or even just ‘FirstName-LastName.’ Mature DSPM solutions can accurately classify these types of data across inconsistent naming conventions.
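One way to sidestep inconsistent file names is to infer the subject from the document’s content instead. The sketch below uses a naive keyword count purely as a stand-in; production classifiers typically rely on trained models, and the keyword lists and threshold here are assumptions for illustration.

```python
from typing import Optional

# Hypothetical keyword lists per file subject.
SUBJECT_KEYWORDS = {
    "resume": ("work experience", "education", "skills", "references"),
    "contract": ("hereinafter", "governing law", "termination", "the parties"),
}


def classify_subject(text: str, min_hits: int = 2) -> Optional[str]:
    """Guess a file subject from its content, ignoring the file name."""
    lowered = text.lower()
    scores = {
        subject: sum(keyword in lowered for keyword in keywords)
        for subject, keywords in SUBJECT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Require a minimum number of keyword hits before assigning a subject.
    return best if scores[best] >= min_hits else None
```

Because the file name is never consulted, a file called ‘FirstName-LastName’ is treated the same as ‘FirstName-LastName-Resume’ when the contents match.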

4. Sensitivity

Standards organizations, such as the International Organization for Standardization (ISO) and the National Institute of Standards and Technology (NIST), advise against practices that treat all data equally: Organizations are mandated by regulation to classify data and label data sensitivity, based on the contents of the data. The risk related to a specific dataset or record is determined based on its sensitivity and level of exposure.

Classifying data can help organizations determine the sensitivity levels associated with their data assets. Sensitivity is often determined by the consequences of the data being exposed:

  • Regulatory fines: A leak of customer data may result in a GDPR breach fine
  • Disruption to business operations: Failing to adhere to regulations such as the PCI standard can mean losing the ability to accept credit and debit card payments
  • Reputational damage: Customers and partners losing trust in the organization following a breach
  • Commercial interests: Losing trade secrets or other classified documents

Additionally, sensitivity is determined by the breadth and depth of the affected data. For example, a shallow and narrow data point can include just a list of first and family names. While this is considered PII, the impact of having this data compromised is low, and as such, the sensitivity is also low. As the information gets richer, such as adding a billing address, card number, transactions and the location of the transaction, the impact and associated sensitivity become much higher.
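The idea that sensitivity grows as a record gets richer can be sketched as a weighted score. The field names, weights and thresholds below are arbitrary assumptions chosen to mirror the example above, not recommended values.

```python
# Hypothetical per-field weights: fields that cause more harm when exposed weigh more.
FIELD_WEIGHTS = {
    "first_name": 1,
    "last_name": 1,
    "billing_address": 2,
    "card_number": 5,
    "transactions": 3,
    "transaction_location": 2,
}


def sensitivity_level(fields: set) -> str:
    """Map the set of fields in a record to a coarse sensitivity level."""
    score = sum(FIELD_WEIGHTS.get(field, 0) for field in fields)
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"
```

A list of first and family names scores low under these weights, while the same record with a billing address, card number and transaction details scores high.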

5. Microsoft Information Protection (MIP) Labels

Microsoft Information Protection is a system applicable to the whole Microsoft estate (as well as non-Microsoft resources) that assigns sensitivity labels to documents such as emails, Word documents, and spreadsheets. These labels are customizable by each customer, but default to the following:

  • Non-business: User personal data
  • Public: Business data freely available and approved for public consumption
  • General: Business data for internal use and not meant for a public audience
  • Confidential: Business data that can cause harm if overshared
  • Highly confidential: Sensitive business data reserved for certain persons

Each label carries additional security measures, such as encryption and read access controls, as well as restrictions on sharing files via email or uploading them to file servers or storage services. Of the labels above, ‘general’ is the default assigned whenever a document is created.

Besides the default label assignment when a document is created, the MIP labels are static, meaning that any changes to the labels are often made manually or via limited automations, without adequate consideration of the content of the document. This is an issue when a collaborative document labeled as ‘general’ has confidential information added to it without a label change.

A mature DSPM solution can read and interpret the contents of an MIP-labeled document, alert security teams to a mislabeled file and suggest an adequate sensitivity level.
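The mislabeling check described above can be sketched as a comparison between a document’s current label and the minimum label implied by the data types found in it. This example does not call any Microsoft API; the label ordering, the mapping from data types to minimum labels and the function name are all assumptions made for illustration.

```python
from typing import Optional

# Labels ordered from least to most restrictive (the personal
# 'non-business' label is left out of this sketch).
LABEL_RANK = {"public": 0, "general": 1, "confidential": 2, "highly confidential": 3}

# Hypothetical policy: the minimum label each detected data type requires.
MIN_LABEL_FOR_TYPE = {
    "email_address": "general",
    "customer_name": "confidential",
    "ssn": "highly confidential",
    "credit_card_number": "highly confidential",
}


def suggest_label(current_label: str, detected_types: set) -> Optional[str]:
    """Return a stricter label if the content calls for one, else None."""
    required = max(
        (MIN_LABEL_FOR_TYPE.get(data_type, "public") for data_type in detected_types),
        key=LABEL_RANK.__getitem__,
        default="public",
    )
    if LABEL_RANK[required] > LABEL_RANK[current_label.lower()]:
        return required
    return None
```

For a collaborative document still labeled ‘general’ after an SSN has been added, this check would suggest ‘highly confidential’; a document whose label already meets the requirement returns no suggestion.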

