Functions
The core functionality of carduus is provided by the following functions, which operate over pyspark DataFrames.
carduus.token.tokenize

tokenize(df, pii_transforms, tokens, private_key)

Adds encrypted token columns based on PII.

All PII columns found in the DataFrame are normalized using the provided pii_transforms. All PII attributes provided by the enhancements of the pii_transforms are added if they are not already present in the DataFrame. The fields of each TokenSpec from tokens are hashed and encrypted together according to the OPPRL specification.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The pyspark DataFrame containing PII columns. | required |
pii_transforms | dict[str, PiiTransform \| OpprlPii] | A dictionary that maps column names of PII attributes to the transform used to normalize and enhance them. | required |
tokens | Sequence[TokenSpec \| OpprlToken] | A collection of token specifications describing which PII fields are combined into each token column. | required |
private_key | bytes | Your private RSA key. | required |

Returns:

Type | Description |
---|---|
DataFrame | The DataFrame with the encrypted token columns added. |
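A minimal usage sketch (assumptions, not part of the reference itself: an active SparkSession named `spark`, a PEM private key already loaded into `private_key`, and illustrative sample data):

```python
from carduus.token import tokenize, OpprlPii, OpprlToken

# Assumption: `spark` is an active SparkSession and `private_key` holds
# your PEM-encoded private RSA key (e.g. from carduus.keys.generate_pem_keys).
people = spark.createDataFrame(
    [("Jane", "Doe", "F", "1990-01-01")],
    ["first_name", "last_name", "gender", "birth_date"],
)

# Normalize each PII column with its OPPRL transform, then produce
# the requested OPPRL token columns.
tokenized = tokenize(
    people,
    pii_transforms={
        "first_name": OpprlPii.first_name,
        "last_name": OpprlPii.last_name,
        "gender": OpprlPii.gender,
        "birth_date": OpprlPii.birth_date,
    },
    tokens=[OpprlToken.opprl_token_1, OpprlToken.opprl_token_2],
    private_key=private_key,
)
```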
carduus.token.transcrypt_out

transcrypt_out(df, token_columns, recipient_public_key, private_key)

Prepares a DataFrame containing encrypted tokens to be sent to a specific trusted party by re-encrypting the tokens using the recipient's public key, without exposing the original PII. The output tokens will be unmatchable, both to any other dataset and within the given dataset, until the intended recipient processes the data with transcrypt_in.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Spark DataFrame containing token columns. | required |
token_columns | Iterable[str] | The collection of column names that correspond to tokens. | required |
recipient_public_key | bytes | The public RSA key of the recipient who will be receiving the dataset with ephemeral tokens. | required |
private_key | bytes | Your private RSA key. | required |

Returns:

Type | Description |
---|---|
DataFrame | The DataFrame with the token columns re-encrypted into an ephemeral representation for the recipient. |
carduus.token.transcrypt_in

transcrypt_in(df, token_columns, private_key)

Used by the recipient of a DataFrame containing tokens in the intermediate representation produced by transcrypt_out to re-encrypt the tokens such that they will match with other datasets.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Spark DataFrame containing ephemeral token columns. | required |
token_columns | Iterable[str] | The collection of column names that correspond to tokens. | required |
private_key | bytes | Your private RSA key. The ephemeral tokens must have been created with the corresponding public key by the sender. | required |
Returns:

The DataFrame with the ephemeral tokens re-encrypted into the recipient's token space so that they can be matched against other datasets.
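The two functions together form a sender/recipient exchange. A sketch of that flow (assumptions: `tokenized` is a DataFrame produced by tokenize, and the key variables hold each party's PEM-encoded RSA keys):

```python
from carduus.token import transcrypt_out, transcrypt_in

# Sender side: re-encrypt the token columns for the intended recipient.
# The column names here are illustrative.
outbound = transcrypt_out(
    tokenized,
    token_columns=["opprl_token_1", "opprl_token_2"],
    recipient_public_key=recipient_public_key,
    private_key=sender_private_key,
)

# Recipient side: convert the ephemeral tokens into their own token
# space so they match other datasets tokenized with their keys.
inbound = transcrypt_in(
    outbound,
    token_columns=["opprl_token_1", "opprl_token_2"],
    private_key=recipient_private_key,
)
```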
The TokenSpec class allows custom tokens, beyond the built-in OPPRL tokens, to be generated during tokenization. See the custom tokens guide for more information.
carduus.token.TokenSpec

dataclass

A collection of PII fields that will be encrypted together to create a token.

For an enum of standard TokenSpec instances that comply with the Open Privacy Preserving Record Linkage protocol, see OpprlToken.
Attributes:

Name | Type | Description |
---|---|---|
name | str | The name of the column that holds these tokens. |
fields | Iterable[str] | The PII fields to encrypt together to create token values. |
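Conceptually, TokenSpec is a small dataclass pairing an output column name with the PII fields it combines. The stand-alone sketch below mirrors that shape for illustration only; it is not the library's implementation, and in real code you would import TokenSpec from carduus.token:

```python
from dataclasses import dataclass
from typing import Sequence

# Illustrative stand-in for carduus.token.TokenSpec. The attribute names
# match the documentation above; everything else is an assumption.
@dataclass(frozen=True)
class TokenSpec:
    name: str              # column that will hold the generated tokens
    fields: Sequence[str]  # PII fields hashed and encrypted together

# A hypothetical custom token built from last name and date of birth.
custom = TokenSpec(name="tok_custom", fields=("last_name", "birth_date"))
print(custom.name, list(custom.fields))  # tok_custom ['last_name', 'birth_date']
```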
OPPRL Implementation
Although carduus is designed to be extensible, most users will want to use the tokenization procedure proposed by the Open Privacy Preserving Record Linkage (OPPRL) protocol. This open specification proposes standard ways of normalizing, enhancing, and encrypting data such that all users across all OPPRL implementations, including carduus, can share data between trusted parties.
The following two enum objects provide PiiTransform instances and TokenSpec instances that comply with OPPRL. These can be passed to the pii_transforms and tokens arguments of tokenize, respectively.
carduus.token.OpprlPii
Bases: Enum
Enum of PiiTransform objects for the PII fields supported by the Open Privacy Preserving Record Linkage specification.
Attributes:

Name | Type | Description |
---|---|---|
first_name | NameTransform | PiiTransform implementation for a subject's first name according to the OPPRL standard. |
middle_name | NameTransform | PiiTransform implementation for a subject's middle name according to the OPPRL standard. |
last_name | NameTransform | PiiTransform implementation for a subject's last (aka family) name according to the OPPRL standard. |
gender | GenderTransform | PiiTransform implementation for a subject's gender according to the OPPRL standard. |
birth_date | DateTransform | PiiTransform implementation for a subject's date of birth according to the OPPRL standard. |
carduus.token.OpprlToken

Bases: Enum

Enum of TokenSpec objects that meet the Open Privacy Preserving Record Linkage tokenization specification.

Attributes:

Name | Type | Description |
---|---|---|
opprl_token_1 | TokenSpec | Token #1 from the OPPRL specification. |
opprl_token_2 | TokenSpec | Token #2 from the OPPRL specification. |
opprl_token_3 | TokenSpec | Token #3 from the OPPRL specification. |
Encryption Key Management
See the encryption key section of the "Getting started" guide for details about how Carduus accesses encryption keys.
carduus.keys.generate_pem_keys

generate_pem_keys(key_size=2048)

Generates a fresh RSA key pair.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
key_size | int | The size (in bits) of the key. | 2048 |

Returns:

Type | Description |
---|---|
tuple[bytes, bytes] | A tuple containing the private key and public key bytes, both in the PEM encoding. |
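A short usage sketch, based only on the signature above; key handling beyond this (storage, rotation) is covered in the "Getting started" guide:

```python
from carduus.keys import generate_pem_keys

# Generate a fresh RSA key pair as PEM-encoded bytes. Keep the private
# key secret; share only the public key with trusted parties so they
# can transcrypt datasets to you.
private_key, public_key = generate_pem_keys(key_size=2048)
```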
Interfaces
Carduus offers interfaces that can be extended by the user to add additional behaviors to the tokenization and transcryption processes.
The PiiTransform
abstract base class can be extended to add support for custom PII attributes, normalizations, and enhancements. See the custom PII guide for more details.
carduus.token.PiiTransform
Bases: ABC
Abstract base class for normalization and enhancement of a specific PII attribute.
Intended to be extended by users to add support for building tokens from a custom PII attribute.
normalize(column, dtype)

A normalized representation of the PII column.

A normalized value has eliminated all representation or encoding differences, so that all instances of the same logical value have identical physical values. For example, text attributes are often normalized by filtering to alpha-numeric characters and whitespace, standardizing all whitespace to the space character, and converting all alpha characters to uppercase, ensuring that every way of representing the same phrase normalizes to the exact same string.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
column | Column | The spark Column containing the PII values to normalize. | required |
dtype | DataType | The spark DataType of the input column. | required |

Returns:

Type | Description |
---|---|
Column | The normalized version of the PII attribute. |
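The text-normalization example described above can be sketched in plain Python (an illustration of the concept only; carduus performs the equivalent logic as Spark Column expressions):

```python
import re

def normalize_text(value: str) -> str:
    # Keep only alpha-numeric characters and whitespace.
    value = re.sub(r"[^0-9A-Za-z\s]", "", value)
    # Standardize all whitespace runs to a single space.
    value = re.sub(r"\s+", " ", value).strip()
    # Uppercase so every representation of the same phrase collides.
    return value.upper()

print(normalize_text("  O'Brien,\tJr. "))  # OBRIEN JR
```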
enhancements(column)

A collection of PII attributes that can be automatically derived from a given normalized PII attribute.

If an implementation of PiiTransform does not override this method, it is assumed that no enhancements can be derived.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
column | Column | The normalized PII column to produce enhancements from. | required |

Returns:

A dict with keys that correspond to the names of the derived PII attributes and values that correspond to the Column expressions that produce the new PII from the normalized input attribute.
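A hypothetical subclass sketch (the attribute, regexes, and class name are assumptions for illustration; see the custom PII guide for the full contract):

```python
from pyspark.sql import Column
from pyspark.sql.functions import regexp_replace, trim, upper
from pyspark.sql.types import DataType

from carduus.token import PiiTransform

class CityTransform(PiiTransform):
    """Illustrative PiiTransform for a hypothetical "city" attribute."""

    def normalize(self, column: Column, dtype: DataType) -> Column:
        # Keep alpha-numeric characters and whitespace, collapse whitespace
        # to single spaces, and uppercase, mirroring the text normalization
        # described for normalize() above.
        cleaned = regexp_replace(column, r"[^0-9A-Za-z\s]", "")
        collapsed = regexp_replace(cleaned, r"\s+", " ")
        return upper(trim(collapsed))

    # enhancements() is not overridden: no PII attributes are derived
    # from a city value.
```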