
Functions

The core functionality of carduus is provided by the following functions, which operate over pyspark DataFrames.

carduus.token.tokenize

tokenize(df, pii_transforms, tokens, key_provider=None)

Replaces all PII attributes with encrypted tokens.

All PII columns found in the DataFrame are normalized using the provided pii_transforms. All PII attributes provided by the enhancements of the pii_transforms are added if they are not already present in the DataFrame. The fields of each TokenSpec from tokens are hashed and encrypted together according to the OPPRL specification. Finally, the PII columns are dropped.

Parameters:

Name Type Description Default
df DataFrame

The pyspark DataFrame containing all PII attributes.

required
pii_transforms dict[str, PiiTransform | OpprlPii]

A dictionary that maps column names of df to PiiTransform objects to specify how each raw PII column is normalized and enhanced into derived PII attributes. Values can also be a member of the OpprlPii enum if using the standard OPPRL tokens.

required
tokens Iterable[TokenSpec | OpprlToken]

A collection of TokenSpec objects that denotes which PII attributes are encrypted into each token. Elements can also be a member of the OpprlToken enum if using the standard OPPRL tokens.

required
key_provider EncryptionKeyProvider | None

An optional EncryptionKeyProvider instance that serves your private keys and the public keys of the parties you exchange data with. Default is an instance of SparkConfKeyProvider which looks for encryption keys loaded as spark configuration properties.

None

Returns:

Type Description
DataFrame

The DataFrame with PII columns replaced by encrypted tokens.
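A minimal sketch of a call that uses only the standard OPPRL transforms and tokens. The input column names (first_name, last_name, gender, birth_date) are assumptions about the DataFrame, and the helper name tokenize_with_opprl is hypothetical. The carduus import sits inside the function so the snippet can be loaded without a Spark environment.

```python
def tokenize_with_opprl(df):
    """Replace the PII columns of `df` with the two standard OPPRL tokens.

    Assumes `df` is a pyspark DataFrame whose PII columns are named
    first_name, last_name, gender, and birth_date (hypothetical names).
    """
    # Imported here so this module loads even where carduus is not installed.
    from carduus.token import OpprlPii, OpprlToken, tokenize

    return tokenize(
        df,
        # Map each raw PII column to the OPPRL transform that normalizes it.
        pii_transforms={
            "first_name": OpprlPii.first_name,
            "last_name": OpprlPii.last_name,
            "gender": OpprlPii.gender,
            "birth_date": OpprlPii.birth_date,
        },
        # Each entry produces one token column in the result.
        tokens=[OpprlToken.opprl_token_1, OpprlToken.opprl_token_2],
        # key_provider is omitted, so the documented default
        # (SparkConfKeyProvider) is used.
    )
```

Because key_provider is left at None here, the encryption keys would need to be available as spark configuration properties before the call.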

carduus.token.transcrypt_out

transcrypt_out(
    df, token_columns, recipient, key_provider=None
)

Prepares a DataFrame containing encrypted tokens to be sent to a specific trusted party by re-encrypting the tokens using the recipient's public key without exposing the original PII.

Output tokens cannot be matched against any other dataset, or against each other within the same dataset, until the intended recipient processes the data with transcrypt_in.

Parameters:

Name Type Description Default
df DataFrame

Spark DataFrame with token columns to transcrypt.

required
token_columns Iterable[str]

The collection of column names that correspond to tokens.

required
recipient str

The name of the recipient that will be receiving transcrypted data. Used to look up the appropriate public keys for asymmetric encryption.

required
key_provider EncryptionKeyProvider | None

An optional EncryptionKeyProvider instance that serves your private keys and the public keys of the parties you exchange data with. Default is an instance of SparkConfKeyProvider which looks for encryption keys loaded as spark configuration properties.

None

Returns:

Type Description
DataFrame

The DataFrame with the original encrypted tokens re-encrypted for sending to the recipient.
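As a rough sketch, the sender's side of an exchange could look like the following. The helper name prepare_for_recipient is hypothetical, and the token column names assume that the tokenized DataFrame holds columns called opprl_token_1 and opprl_token_2; adjust them to match the actual token columns.

```python
def prepare_for_recipient(tokens_df, recipient):
    """Re-encrypt the token columns of `tokens_df` for one trusted party.

    `recipient` is the name under which the key provider knows the
    recipient's public key.
    """
    # Imported here so this module loads even where carduus is not installed.
    from carduus.token import transcrypt_out

    return transcrypt_out(
        tokens_df,
        # Assumed column names; list every token column that should be sent.
        token_columns=["opprl_token_1", "opprl_token_2"],
        recipient=recipient,
    )
```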

carduus.token.transcrypt_in

transcrypt_in(df, token_columns, key_provider=None)

Used by the recipient of a DataFrame whose tokens are in the intermediate representation produced by transcrypt_out. Re-encrypts the tokens so that they will match tokens in the recipient's other datasets.

Parameters:

Name Type Description Default
df DataFrame

Spark DataFrame with token columns to transcrypt.

required
token_columns Iterable[str]

The collection of column names that correspond to tokens.

required
key_provider EncryptionKeyProvider | None

An optional EncryptionKeyProvider instance that serves your private keys and the public keys of the parties you exchange data with. Default is an instance of SparkConfKeyProvider which looks for encryption keys loaded as spark configuration properties.

None

Returns:

Type Description
DataFrame

The DataFrame with the received tokens re-encrypted so that they can be matched against other datasets.
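On the receiving side, a sketch of the corresponding call might be the following. The helper name receive_tokens and the token column names are assumptions; the column names have to match those in the delivered data.

```python
def receive_tokens(received_df):
    """Convert tokens delivered by a sender into locally matchable tokens."""
    # Imported here so this module loads even where carduus is not installed.
    from carduus.token import transcrypt_in

    return transcrypt_in(
        received_df,
        # Assumed column names for the tokens in the delivered data.
        token_columns=["opprl_token_1", "opprl_token_2"],
        # key_provider is omitted, so the documented default
        # (SparkConfKeyProvider) is used.
    )
```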

The TokenSpec class allows for custom tokens, beyond the builtin OPPRL tokens, to be generated during tokenization. See the custom tokens guide for more information.

carduus.token.TokenSpec dataclass

A collection of PII fields that will be encrypted together to create a token.

For an enum of standard TokenSpec instances that comply with the Open Privacy Preserving Record Linkage protocol see OpprlToken.

Attributes:

Name Type Description
name str

The name of the column that holds these tokens.

fields Iterable[str]

The PII fields to encrypt together to create token values.
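A small sketch of declaring a custom token next to a standard one. The token name last_name_dob_token and the choice of fields are purely illustrative, and the helper name build_token_specs is hypothetical.

```python
def build_token_specs():
    """Return a token collection that mixes a standard and a custom token."""
    # Imported here so this module loads even where carduus is not installed.
    from carduus.token import OpprlToken, TokenSpec

    # Hypothetical custom token: last name and date of birth encrypted together.
    last_name_dob = TokenSpec(
        name="last_name_dob_token",          # column that will hold the token
        fields=["last_name", "birth_date"],  # PII attributes to combine
    )
    # The result can be passed as the `tokens` argument of tokenize.
    return [OpprlToken.opprl_token_1, last_name_dob]
```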

OPPRL Implementation

Although carduus is designed to be extensible, most users will want to use the tokenization procedure proposed by the Open Privacy Preserving Record Linkage (OPPRL) protocol. This open specification proposes standard ways of normalizing, enhancing, and encrypting data such that all users across all OPPRL implementations, including carduus, can share data between trusted parties.

The following two enum objects provide PiiTransform instances and TokenSpec instances that comply with OPPRL. These can be passed to the pii_transforms and tokens arguments of tokenize, respectively.

carduus.token.OpprlPii

Bases: Enum

Enum of PiiTransform objects for the PII fields supported by the Open Privacy Preserving Record Linkage specification.

Attributes:

Name Type Description
first_name NameTransform

PiiTransform implementation for a subject's first name according to the OPPRL standard.

middle_name NameTransform

PiiTransform implementation for a subject's middle name according to the OPPRL standard.

last_name NameTransform

PiiTransform implementation for a subject's last (aka family) name according to the OPPRL standard.

gender GenderTransform

PiiTransform implementation for a subject's gender according to the OPPRL standard.

birth_date DateTransform

PiiTransform implementation for a subject's date of birth according to the OPPRL standard.

carduus.token.OpprlToken

Bases: Enum

Enum of TokenSpec objects that meet the Open Privacy Preserving Record Linkage tokenization specification.

Attributes:

Name Type Description
opprl_token_1

Standard OPPRL token #1. Creates tokens based on first_initial, last_name, gender, and birth_date.

opprl_token_2

Standard OPPRL token #2. Creates tokens based on first_soundex, last_soundex, gender, and birth_date.

Encryption Key Management

See the encryption key section of the "Getting started" guide for details about how Carduus accesses encryption keys.

carduus.keys.SparkConfKeyProvider

carduus.keys.SimpleKeyProvider

carduus.keys.generate_pem_keys

generate_pem_keys(key_size=2048)

Generates a fresh RSA key pair.

Parameters:

Name Type Description Default
key_size int

The size (in bits) of the key.

2048

Returns:

Type Description
tuple[bytes, bytes]

A tuple containing the private key bytes and the public key bytes, both PEM-encoded.
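A brief sketch of generating a key pair. The helper name create_key_pair is hypothetical; the unpacking order follows the return description above (private key first, then public key).

```python
def create_key_pair():
    """Generate a fresh RSA key pair and return both keys as PEM bytes."""
    # Imported here so this module loads even where carduus is not installed.
    from carduus.keys import generate_pem_keys

    # key_size=2048 is the documented default, spelled out for clarity.
    private_pem, public_pem = generate_pem_keys(key_size=2048)
    return private_pem, public_pem
```

The private key should stay within your organization; only the public key is meant to be handed to the parties you exchange data with.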

Interfaces

Carduus offers interfaces that can be extended by the user to add additional behaviors to the tokenization and transcryption processes.

The PiiTransform abstract base class can be extended to add support for custom PII attributes, normalizations, and enhancements. See the custom PII guide for more details.

carduus.token.PiiTransform

Bases: ABC

Abstract base class for normalization and enhancement of a specific PII attribute.

Intended to be extended by users to add support for building tokens from a custom PII attribute.

normalize(column, dtype)

A normalized representation of the PII column.

A normalized value has eliminated all representation or encoding differences so that all instances of the same logical value have identical physical values. For example, text attributes will often be normalized by filtering to alphanumeric characters and whitespace, standardizing whitespace to the space character, and converting all alphabetic characters to uppercase to ensure that all ways of representing the same phrase normalize to the exact same string.

Parameters:

Name Type Description Default
column Column

The spark Column expression for the PII attribute being normalized.

required
dtype DataType

The spark DataType object of the column object found on the DataFrame being normalized. Can be used to delegate to different normalization logic based on different schemas of input data. For example, a subject's birth date may be a DateType, StringType, or LongType on input data and thus requires corresponding normalization into a DateType.

required

Returns:

Type Description
Column

The normalized version of the PII attribute.
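The text normalization steps described above can be sketched in plain Python. This only illustrates the steps on a single string; an actual normalize implementation works on spark Column expressions. The function name normalize_text and the ASCII-only character class are assumptions made for the example.

```python
import re


def normalize_text(value: str) -> str:
    """Normalize one text value so equivalent spellings compare equal."""
    # 1. Keep only alphanumeric characters and whitespace (ASCII letters here).
    kept = re.sub(r"[^A-Za-z0-9\s]", "", value)
    # 2. Standardize every whitespace character (tab, newline, ...) to a space.
    spaced = re.sub(r"\s", " ", kept)
    # 3. Convert all alphabetic characters to uppercase.
    return spaced.upper()


print(normalize_text("Mary-Ann\tO'Neil"))  # MARYANN ONEIL
```

Two inputs that differ only in case, punctuation, or the kind of whitespace used end up as the same string, which is what allows them to produce the same token.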

enhancements(column)

A collection of PII attributes that can be automatically derived from a given normalized PII attribute.

If an implementation of PiiTransform does not override this method, it is assumed that no enhancements can be derived.

Parameters:

Name Type Description Default
column Column

The normalized PII column to produce enhancements from.

required
Returns:

A dict with keys that correspond to the PII attributes of the ___ and values that correspond to the Column expression that produced the new PII from a normalized input attribute.
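A sketch of a custom transform that implements both methods, here for an email address. The class EmailTransform, the derived attribute name email_domain, and the factory make_email_transform are all hypothetical; only the normalize(column, dtype) and enhancements(column) method signatures come from this reference. The imports are deferred into the factory so the snippet can be loaded without carduus or pyspark installed.

```python
def make_email_transform():
    """Build a hypothetical PiiTransform for an email address column."""
    # Deferred imports: both packages are only needed when the factory runs.
    from carduus.token import PiiTransform
    from pyspark.sql import functions as F

    class EmailTransform(PiiTransform):
        def normalize(self, column, dtype):
            # `dtype` is not needed here: cast to string so every input
            # schema shares one code path, then trim and lowercase.
            return F.lower(F.trim(column.cast("string")))

        def enhancements(self, column):
            # Derive one extra attribute from the normalized value:
            # everything after the last "@" in the address.
            return {"email_domain": F.substring_index(column, "@", -1)}

    return EmailTransform()
```

An instance created this way could then be used as a value in the pii_transforms dictionary passed to tokenize, keyed by the name of the raw email column.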

The EncryptionKeyProvider abstract base class can be extended to delegate the retrieval of encryption keys to the preferred secret management service for your organization.

carduus.keys.EncryptionKeyProvider

Bases: ABC

Abstract base class for serving encryption keys to carduus. Can be implemented to call out to whichever service you use to manage encryption keys.

private_key() abstractmethod

Provides your private key.

public_key_of(recipient) abstractmethod

Provides the public key of a specific recipient that you share data with.
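A sketch of a provider that delegates to a secret management client. The class name SecretStoreKeyProvider, the fetch_secret callable, and the secret paths are hypothetical, and returning PEM bytes from both methods is an assumption based on the output of generate_pem_keys. The fallback base class is only a stand-in that mirrors the two abstract methods listed above so the example can run where carduus is not installed.

```python
from abc import ABC, abstractmethod

try:
    from carduus.keys import EncryptionKeyProvider
except ImportError:
    # Stand-in for illustration only; it mirrors the documented interface.
    class EncryptionKeyProvider(ABC):
        @abstractmethod
        def private_key(self): ...

        @abstractmethod
        def public_key_of(self, recipient): ...


class SecretStoreKeyProvider(EncryptionKeyProvider):
    """Serves keys by calling out to a secret store (hypothetical client)."""

    def __init__(self, fetch_secret):
        # `fetch_secret` is any callable that maps a secret path to key bytes,
        # for example a thin wrapper around your secret manager's client.
        self._fetch_secret = fetch_secret

    def private_key(self):
        # Hypothetical path under which your own private key is stored.
        return self._fetch_secret("carduus/private-key")

    def public_key_of(self, recipient):
        # Hypothetical layout: one public key per recipient name.
        return self._fetch_secret(f"carduus/public-keys/{recipient}")


# Usage with an in-memory dict standing in for the secret store.
secrets = {
    "carduus/private-key": b"<private PEM>",
    "carduus/public-keys/acme": b"<acme public PEM>",
}
provider = SecretStoreKeyProvider(secrets.__getitem__)
```

An instance like provider could be passed as the key_provider argument of tokenize, transcrypt_out, or transcrypt_in.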