Getting Started

This guide provides a quick tour of the main features of Carduus, including instructions for how most users will be interacting with the library.

Carduus has not reached a v1.0 release yet and therefore the API and behaviors are subject to change. Use at your own risk! See the contributing guide if you would like to help the project.

Installation

Carduus is a cross-platform python library built on top of PySpark. It supports the following range of versions:

Python: 3.10+
PySpark: 3.5+

The latest stable release of carduus can be installed from PyPI into your active Python environment using pip.

pip install carduus

You can also build Carduus from source using Poetry. The source code is hosted on Github. Checkout the commit you wish to build and run poetry build in the project's root.

Encryption Keys

Carduus's tokenization capabilities require the use of private and public encryption keys. Carduus users are expected to manage their own encryption keys.

There are 3 kinds of encryption keys that play different roles:

Your private RSA key - Used to transcrypt incoming data and derive a symmetric encryption key used to tokenize PII. This key must never be shared or accessed by untrusted parties.
Your public RSA key - The public key counterpart to your private key. Will be shared with trusted parties that will be sending you tokenized data.
Trusted partner public keys - A collection of public keys from the various trusted parties that you will be sending tokenized to.

Generating New Keys

Carduus expects encryption keys to be represented the PEM encoding. Private keys should use the PKCS #8 format and public keys should be formatted as SubjectPublicKeyInfo. Carduus recommends a key size of 2048 bits.

You can generate these keys using tools like openssl or by calling the generate_pem_keys function provided by Carduus. This function will return a tuple containing 2 instances of bytes. The first is the PEM data for your private key that you must keep secret. The second is the PEM data for your public key that can may share with the parties you intend to receive data from.

You can decode these keys into strings of text (using UTF-8) or write them into a .pem file for later use.

from carduus.keys import generate_pem_keys

private, public = generate_pem_keys()  # Or provide key_size= 

print(private.decode())
# -----BEGIN PRIVATE KEY-----
# ... [ Base64-encoded private key ] ...
# -----END PRIVATE KEY-----

print(public.decode())
# -----BEGIN PUBLIC KEY-----
# ... [ Base64-encoded public key ] ...
# -----END PUBLIC KEY-----

Tokenization

Tokenization refers to the process of replacing PII with encrypted tokens using a one-way cryptographic function. Carduus implements the OPPRL tokenization protocol, which performs a standard set of PII normalizations and enhancements such that records pertaining to the same subject are more likely to receive the same token despite minor differences in PII representation across records. The OPPRL protocol hashes the normalized PII and then encrypts the hashes with a symmetric encryption based on a secret key derived from your private RSA key. In the event that the private encryption key of one party is compromised, there risk to all other parties is mitigated by the fact that everyone's tokens are encrypted with a different key.

To demonstrate the tokenization process, we will use a dataset of 5 records shown below. Each record is assigned a label that corresponds to the true identity of the subject.

pii = spark.createDataFrame(
    [
        (1, "Jonas", "Salk", "male", "1914-10-28"),
        (1, "jonas", "salk", "M", "1914-10-28"),
        (2, "Elizabeth", "Blackwell", "F", "1821-02-03"),
        (3, "Edward", "Jenner", "m", "1749-05-17"),
        (4, "Alexander", "Fleming", "M", "1881-08-06"),
    ],
    ("label", "first_name", "last_name", "gender", "birth_date")
)
pii.show()
# +-----+----------+---------+------+----------+
# |label|first_name|last_name|gender|birth_date|
# +-----+----------+---------+------+----------+
# |    1|     Jonas|     Salk|  male|1914-10-28|
# |    1|     jonas|     salk|     M|1914-10-28|
# |    2| Elizabeth|Blackwell|     F|1821-02-03|
# |    3|    Edward|   Jenner|     m|1749-05-17|
# |    4| Alexander|  Fleming|     M|1881-08-06|
# +-----+----------+---------+------+----------+

To perform tokenization, call the tokenize function and pass it the PII DataFrame, a column mapping, and a collection of specifications for token you want to generate. You may optionally pass an instance of EncryptionKeyProvider, but all code snippets shown in this guide assume the default key service is used (unless otherwise specified) and therefore encryption keys will be read from the spark session properties.

from carduus.token import tokenize, OpprlPii, OpprlToken
tokens = tokenize(
    pii,
    pii_transforms=dict(
        first_name=OpprlPii.first_name,
        last_name=OpprlPii.last_name,
        gender=OpprlPii.gender,
        birth_date=OpprlPii.birth_date,
    ),
    tokens=[OpprlToken.token1, OpprlToken.token2, OpprlToken.token3],
    private_key=b"""-----BEGIN PRIVATE KEY----- ...""",
).drop("first_name", "last_name", "gender", "birth_date")
# +-----+--------------------+--------------------+--------------------+
# |label|       opprl_token_1|       opprl_token_2|       opprl_token_3|
# +-----+--------------------+--------------------+--------------------+
# |    1|4YO6eFn0u75yrF+Td...|V6uRRgDgXylFsNM2c...|6N+/voOASNM0ivgA7...|
# |    1|4YO6eFn0u75yrF+Td...|V6uRRgDgXylFsNM2c...|6N+/voOASNM0ivgA7...|
# |    2|OLU6TLB4XdWmfhvIS...|3QtCKgeIqO20Jgp9E...|JrrjTxjy97Xe93afx...|
# |    3|mks197t0d8Uhc3l3s...|YccsE4wMErZvz9oBd...|7wMcWiDjImAhuP307...|
# |    4|kRZfUY8KScpiHKmxC...|1gKSJDcq6YtwdUc7C...|3h8jUydHTIcOU8jhW...|
# +-----+--------------------+--------------------+--------------------+

Notice that both records with label = 1 received the same pair of tokens despite slight representation differences in the original PII.

The pii_transforms argument is a dictionary that maps column names from the pii DataFrame to the corresponding OPPRL PII field. This tells Carduus how to normalize and enhance the values found in that column. For example, the OpprlPii.first_name object will apply name cleaning rules to the values found in the first_name column and automatically derive additional PII columns called first_initial and first_soundex which are used to create both OPPRL tokens.

The tokens argument is collection of OPPRL token specifications that tell Carduus which PII fields to jointly hash and encrypt to create each token. The current OPPRL protocol supports three token specifications, described below:

Token	Fields to jointly encrypt
`OpprlToken.token1`	`first_initial`, `last_name`, `gender`, `birth_date`
`OpprlToken.token2`	`first_soundex`, `last_soundex`, `gender`, `birth_date`
`OpprlToken.token3`	`first_metaphone`, `last_metaphone`, `gender`, `birth_date`

Why multiple tokens?

Each use case has a different tolerance for false positive and false negative matches. By producing multiple tokens for each record using PII attributes, each user can customize their match logic to trade-off between different kinds of match errors. Linking records that match on any token will result in fewer false negatives, and linking records that match all tokens will result in fewer false positives. User can design their own match strategies by using subsets of tokens.

Transcryption

noun: trans- (latin: across) -crypt (greek: hidden, secret)

The process of transforming a ciphertext produced by one encryption into a ciphertext of a different encryption without emitting the original plaintext.

Transcryption is performed when a user wants to share tokenized data with another party. The sender and recipient each have a corresponding transcryption function that must be invoked to ensure safe transfer of data between trusted parties. The sender performs transcryption to replace their tokens with "ephemeral tokens" that are specific to the transaction. In other words, the ephemeral tokens do not match the sender's tokens, the recipients data, or the ephemeral tokens from prior transactions between any sender and recipient.

Furthermore, records from the same dataset that have identical PII will be assigned unique ephemeral tokens. This destroys the utility of the tokens until the recipient performs the reverse transcryption process using their private key. This is beneficial in the event that a third party gains access to the dataset during transfer (eg. an if transcrypted datasets are delivered over an insecure connection) because records pertaining to the same subject cannot be associated with each other.

Sender

Carduus provides the transcrypt_out function in for the sender to call on their tokenized datasets. In the following code snippet, notice the 2 records pertaining to the same subject (label = 1) no longer have identical tokens. The tokens_to_send DataFrame can safely be written files or a database and delivered to the recipient.

tokens_to_send = transcrypt_out(
    tokens, 
    token_columns=("opprl_token_1", "opprl_token_2", "opprl_token_3"), 
    recipient_public_key=b"""-----BEGIN PUBLIC KEY----- ...""",
    # This is the private key of the sender
    # It is NOT the private key associated with the recipient_public_key.
    private_key=b"""-----BEGIN PRIVATE KEY----- ...""",
)
tokens.to_send.show()
# +-----+--------------------+--------------------+--------------------+
# |label|       opprl_token_1|       opprl_token_2|       opprl_token_3|
# +-----+--------------------+--------------------+--------------------+
# |    1|EUqTdA0Yjb5F82oqj...|iiI/sd+sn+t5qhPDg...|JwpZVzJe+3CMocLGK...|
# |    1|AXVDGgQ0KMa1ek1/v...|YJeYE9pz507CUzlks...|EKY2JdLhduCq+zj+p...|
# |    2|XmYqnnMDqD9ZUR6yS...|A4e3L03NWeKbbL8vR...|CAZtbKDYjbbHsqVdc...|
# |    3|mVhnJ1kz+ZRZvlfKo...|uO13H3LHjVnZ1flUp...|C16RKoV4SvWD5wuVp...|
# |    4|oI/KS52K4H3M+WBdn...|vczsK9qlbU6TN1vBi...|fpTSrlGgtz3vASFXc...|
# +-----+--------------------+--------------------+--------------------+

The token_columns argument is a iterable collection containing the column names of the tokens DataFrame that correspond to tokens that need to be transcrypted.

The recipient argument denotes the name given to the public key associated with the intended recipient. If using the default encryption key provider (via Spark session properties) the above code snippet will grab the public key from carduus.token.publicKey.AcmeCorp spark session property.

Recipient

The transcrypt_in function provides the transcryption process for the recipient. It is called on a dataset produced by the sender using transcrypt_out to convert ephemeral tokens into normal tokens that will match other tokenized datasets maintained by the recipient, including prior datasets delivered from the same sender.

Notice that the first 2 records pertaining to the same subject (label = 1) have identical tokens again, but do these tokens are not the same as the original tokens because they are encrypted with the scheme for the recipient.

from carduus.token import transcrypt_in

tokens_received = transcrypt_in(
    tokens_to_send, 
    token_columns=("opprl_token_1", "opprl_token_2", "opprl_token_3"),
    # This is the private key corresponding to the public key used to the prepare the data.
    # It is NOT the private key used to tokenize the PII.
    private_key=b"""-----BEGIN PRIVATE KEY----- ...""",
)
tokens_received.show()
# +-----+--------------------+--------------------+--------------------+
# |label|       opprl_token_1|       opprl_token_2|       opprl_token_3|
# +-----+--------------------+--------------------+--------------------+
# |    1|q1a5o2gg1hX+CFjdP...|gS2WMqSs0SReCN7Xp...|2XAgeBwomiKq2s4xO...|
# |    1|q1a5o2gg1hX+CFjdP...|gS2WMqSs0SReCN7Xp...|2XAgeBwomiKq2s4xO...|
# |    2|FEyzz52oI5hym0G/+...|UQDqblSWpqoH5znps...|ud2cG5YD/2yBhjZel...|
# |    3|aEgXJZGUpkZxt7Z7T...|QZ5LaSiE8pMDmMKEu...|0yQxfv4EFAUHatIva...|
# |    4|CgmXSDhY8mxZCRV+O...|wu+eNOqsOvh14Bjei...|7aGCrwy8tTDL3RkO6...|
# +-----+--------------------+--------------------+--------------------+

As with transcrypt_out, the token_columns argument is a iterable collection containing the column names of the tokens DataFrame that correspond to tokens that need to be transcrypted.

Deployment

Carduus is a Python library that uses PySpark to parallelize and distribute the tokenization and transcryption workloads. Your application that uses Carduus can be submitted to any compatible Spark cluster, or use a connection to a remote spark cluster. Otherwise, Carduus will start a local Spark process that uses the resources of the host machine.

For more information about different modes of deployment, see the official Spark documentation.

Next Steps

Reference the full API
Learn more about using Carduus and extending it's functionality from the advanced user guides: