Carduus Databricks Demo
This notebook demonstrates how to use the carduus Python package within a Databricks environment to parallelize tokenization across a Spark cluster.
We use a small 5-row sample dataset of PII, but carduus can distribute the work of tokenizing and transcrypting large datasets across the cluster.
pii = spark.createDataFrame(
    [
        (1, "Jonas", "Salk", "male", "1914-10-28"),
        (1, "jonas", "salk", "M", "1914-10-28"),
        (2, "Elizabeth", "Blackwell", "F", "1821-02-03"),
        (3, "Edward", "Jenner", "m", "1749-05-17"),
        (4, "Alexander", "Fleming", "M", "1881-08-06"),
    ],
    ("label", "first_name", "last_name", "gender", "birth_date"),
)
display(pii)
label | first_name | last_name | gender | birth_date |
---|---|---|---|---|
1 | Jonas | Salk | male | 1914-10-28 |
1 | jonas | salk | M | 1914-10-28 |
2 | Elizabeth | Blackwell | F | 1821-02-03 |
3 | Edward | Jenner | m | 1749-05-17 |
4 | Alexander | Fleming | M | 1881-08-06 |
One-time Workspace Setup
See the Databricks documentation for managing secrets to get started. You will need to install the Databricks CLI and authenticate as a user who can manage Databricks secrets.
Create a secret scope for carduus encryption keys by running the following command with the Databricks CLI.
databricks secrets create-scope carduus
Add your private key as a secret in the carduus scope.
(cat << EOF
-----BEGIN PRIVATE KEY-----
...
-----END PRIVATE KEY-----
EOF
) | databricks secrets put-secret carduus PrivateKey
Add a secret for the public key of each third-party organization you will send tokenized data to or receive it from.
(cat << EOF
-----BEGIN PUBLIC KEY-----
...
-----END PUBLIC KEY-----
EOF
) | databricks secrets put-secret carduus AcmeCorpPublicKey
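Before storing a key as a secret, it can help to confirm that the text is at least PEM-shaped, since a truncated or mislabeled paste is easy to miss. The following is a minimal sketch of such a check; `pem_label` is a hypothetical helper written for this illustration, not part of carduus or the Databricks CLI, and it checks only the BEGIN/END framing, not whether the key itself is valid.

```python
import re

# Matches a single PEM block whose BEGIN and END labels agree.
PEM_PATTERN = re.compile(
    r"\A-----BEGIN (?P<label>[A-Z ]+)-----\n"
    r"(?P<body>[A-Za-z0-9+/=\n]+)\n"
    r"-----END (?P=label)-----\n?\Z"
)


def pem_label(text):
    """Return the PEM label (e.g. 'PUBLIC KEY'), or None if the text is not PEM-shaped.

    Hypothetical helper: it inspects the framing only and does not parse the key.
    """
    match = PEM_PATTERN.match(text.strip() + "\n")
    return match.group("label") if match else None


# Placeholder body for illustration only; this is not a real key.
sample = "-----BEGIN PUBLIC KEY-----\nQUJD\n-----END PUBLIC KEY-----\n"
print(pem_label(sample))
```

A result of `None` suggests the text should be inspected before it is piped to `databricks secrets put-secret`.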
Cluster Setup
Add Spark configuration properties to your cluster that read from the secrets created above. Your organization's private key must be stored under the carduus.token.privateKey key, and each third-party public key must be stored under a key of the form carduus.token.publicKey.<recipient>, where <recipient> is the name you want to associate with the public key of a specific recipient that you will send data to.
carduus.token.privateKey {{secrets/carduus/PrivateKey}}
carduus.token.publicKey.AcmeCorp {{secrets/carduus/AcmeCorpPublicKey}}
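When there are many recipients, the configuration lines above can be generated rather than typed by hand. The sketch below builds them from a mapping of recipient names to secret names; `secret_ref` and `spark_conf_lines` are hypothetical helpers written for this illustration, and the scope and secret names are the ones used earlier in this notebook.

```python
def secret_ref(scope, name):
    """Format a Databricks secret reference for use in a Spark config value."""
    return "{{secrets/" + scope + "/" + name + "}}"


def spark_conf_lines(scope, private_key_secret, public_key_secrets):
    """Build the carduus Spark config lines.

    `public_key_secrets` maps a recipient name to the secret that holds
    that recipient's public key. Hypothetical helper, not part of carduus.
    """
    lines = ["carduus.token.privateKey " + secret_ref(scope, private_key_secret)]
    for recipient, secret in sorted(public_key_secrets.items()):
        lines.append(
            "carduus.token.publicKey." + recipient + " " + secret_ref(scope, secret)
        )
    return lines


conf_lines = spark_conf_lines("carduus", "PrivateKey", {"AcmeCorp": "AcmeCorpPublicKey"})
print("\n".join(conf_lines))
```

The printed lines can be pasted into the cluster's Spark configuration.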
Notebook Setup
Ensure the carduus package is installed in your notebook's Python environment. The package can also be installed across the entire cluster.
%pip install carduus
Tokenization
For more information on tokenization see the Carduus getting started guide.
from carduus.token import tokenize, OpprlPii, OpprlToken

tokens = tokenize(
    pii,
    pii_transforms=dict(
        first_name=OpprlPii.first_name,
        last_name=OpprlPii.last_name,
        gender=OpprlPii.gender,
        birth_date=OpprlPii.birth_date,
    ),
    tokens=[OpprlToken.token1, OpprlToken.token2],
)
display(tokens)
label | opprl_token_1 | opprl_token_2 |
---|---|---|
1 | d2tUj3yRFPIBSwR/ntUi8v1B/A9H+Q0iNwlz0+OVpO54MER9bRnTgHxOO8Q4IM+gdoxKGJGV9STb6DH2hxKIb50v6IFa+StqSnKRy7GJG4U= | DQhKG+AMgrFh16dLiogYy/U9YDLsATVvhMtQ4cHkWmyr8+JmuABYYQV3OJCNLYupaevu9qapeWo80zq+VVumrKwuUXqUCjQzs7ui5qceUVc= |
1 | d2tUj3yRFPIBSwR/ntUi8v1B/A9H+Q0iNwlz0+OVpO54MER9bRnTgHxOO8Q4IM+gdoxKGJGV9STb6DH2hxKIb50v6IFa+StqSnKRy7GJG4U= | DQhKG+AMgrFh16dLiogYy/U9YDLsATVvhMtQ4cHkWmyr8+JmuABYYQV3OJCNLYupaevu9qapeWo80zq+VVumrKwuUXqUCjQzs7ui5qceUVc= |
2 | 2wnRWhN9Y4DBMeuvwbriLCmz5xVyCNJO1liVdO/bu/nwtTkudQCfmHazpSYftTGVDVfIz8z0gy5eAAap9qKY7D6LmCM4Df6wRttrlh5XBQw= | A7fUKAZi/Ra2T6p8y4nsW0DRuI0P4KGatyuj7XyI4LIU+6lRCYukLDxPepOB52xPLDU2J2B9n8YGO9x0dmK+t4ZEAjJNsBnZrKy3hPvN5PY= |
3 | I17X+CT3kjqB9l0rAOyiGVYKTzFrSLp4KSHfdUR89rYTyC/d24QcuYZ2VHWVzPawTFNqaIr7oMjgoMCX8gruUSgU42YLoq64KHdRiqTBuQ8= | qGCUVyI7MJLkQ9SSrz0uNIFQXDJBcCNdMVneZ/fLdAfn39ZNZgD+6Vm4l20pM/0zKeLRYAHCpJh/AH7UB21ssqzfbSdOkjrGDL51lYZ9fK8= |
4 | 6Ee3NabHaa/lyKsrpou/DqufDgEOVyq3Hb9RKg7GpkoAY3ZlPn9k1iMj6NhJuo4t35i5aixmOr8hXYnBvql0DVQqJwujrk1rdOnNYwhulvs= | dbwNqy0rN6hFoHJYnWo9G9REKGLLbJKvvnWYS9uR1gj3x8z6oQENN6K0DqU9INgldMikxac7EG5ZR9wicpVSMDBAy2HbqFcBL4ZHoNWWDoM= |
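In the output above, the two rows labeled 1 receive identical tokens even though their names differ in case and their gender is written as "male" and "M". This is consistent with the PII being normalized before it is tokenized. The sketch below illustrates that idea only; `normalize_record` and `GENDER_ALIASES` are hypothetical and are not the normalization rules Carduus or OPPRL actually apply.

```python
# Hypothetical alias table: maps common spellings to a single code.
GENDER_ALIASES = {"M": "M", "MALE": "M", "F": "F", "FEMALE": "F"}


def normalize_record(first_name, last_name, gender, birth_date):
    """Reduce a PII record to a canonical form (illustrative rules only)."""
    return (
        first_name.strip().upper(),
        last_name.strip().upper(),
        GENDER_ALIASES.get(gender.strip().upper(), ""),
        birth_date.strip(),
    )


# The two rows labeled 1 in the sample dataset.
a = normalize_record("Jonas", "Salk", "male", "1914-10-28")
b = normalize_record("jonas", "salk", "M", "1914-10-28")
print(a == b)  # True: both records reduce to the same canonical form
```

Because both records reduce to the same canonical form, any deterministic function of that form yields the same value for both.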
Sender Transcryption
For more information on transcryption see the Carduus getting started guide.
from carduus.token import transcrypt_out

tokens_to_send = transcrypt_out(
    tokens,
    token_columns=("opprl_token_1", "opprl_token_2"),
    destination="AcmeCorp",
)
display(tokens_to_send)
label | opprl_token_1 | opprl_token_2 |
---|---|---|
1 | d8JaP02dS0M+COGihZCp+Kr2piXyNU5yvPhFR1/ajN7m+gr63NsQee8v7WMKnEJkY6/BDOjREQ8sf1iD0xWDsvCM8rCqlG8uFlVKQ4FY2atSGuqHUzHcjGf8eNEsbpM2AZnLliFDssvljVgBy9CpYUti20c8L/TgrVMKd1rSgJaDXjXRHS5abWD6EaD77IV3KpELoii7XhE4xqLV3PIH+5TaF6Yaa7FbLaYtG0+wIQ+huC9H4QLHgGam8JE71k/aznve90koz8/hO0vM5/RpLd+dTQ4MMRCPxt/SBVtJp3fuvPtbVdR+rf5KPuCkS3IvUtOGgUlYKVMovNYA2uUDNg== | hyh5cNCkhptFNbbndY/l43GBuTBcDbAdNQWQdznelQVsugwQ9Fip7rC1N0Adj2oBxe+0qL86Mv079BKoDRpHCRIw6aqvHbTJHBV+CuDRmmLUcbFUVTKI3AwGEom2i076wulOX0Wjvs8fXe4vCTX5AZXWwpPCMJUBaMu8zLqgIPtiJhD0UMUsoBwTeFEzxR6sIVBU7E3pFHToYCkT7ISZ55rY+CcsIAI8I5TS0qtYAgDuCWWhp8ZJ2NNRWcWd9qHPMqXWkQ//k75vS0ng018Z4hH0/EVGVVf9iOKGb6ibgTrfmxM0lWDpZqBfjZZ/uExeZsI3EVcQUmzFccWpyB1nYw== |
1 | Pj3ZIaIj9/i5N8ZnBpe23u6WfeEXR3nka8J+Y+5pawvmgnAnRFzhTekmcx350Viyp4FHsKh4ogO2J6714oSLm4zaELZhcaPAI0slee/7LnaYZNAArXtg4bxqNdz/vifIvcWj8S+TyrEXsxmzIpgS5nKYG9SRbQ0IO5/Ungxx+GouhGeGC54nlL7dt1dHlgUmMkQx+UH89t+D/C55F/lDmRkhddjiSO1LFIZYyo3xT0F/j5o7b7cI0OoZpvN6OYdrE9IfsfqYFh/RoHhMjfpajYDSUcy6Cq4evcjhEy5XvGqNJpQTToC/A1rmTB5WtFO1oCyVF0HGb+FgHwO9ccVIjw== | N3M6t4ZW5/DnzfhYIcUJdCHAorl7lWvrBnZGtInrno3z62V6n1odgQmY6fgXHy5sTTHdujeIn/dRi6PD9rBANjTpi42puSGHf7VLIx9uqr+kSiXGTCBnAHy8C3DNfpxtx4I391/8aM2mbnS/5icuyIKgzZP+fWsFxehfMSLDgW3YMQsgZy0JBSpetk+8Jl1L06pQYHt967zcAvi0JRiUmZHshOGAdkVO1hQGgUDOsHuBrElqkooFfuIw0QC4v1aeqQDtVH3Un/9cJMonmsRtnuRYUZE7aW1Zl/RGYKq1LhegIzL/+JlYuqOHZoSPyB/0O6t7u/USRkkZCUKM+V7ryg== |
2 | bLkqSZYRsRZ8/7EQsxE/DU2Rf5n2qxjNiJgNUuJHfTcMLVC3C0tPlS+AztVVwWBj2G+GaA8QoeM7jqfxzou/xK13ggOOIiXWAapqOMT96Z5C3y+GG8och8g9w52Mr/3zKINIe5VLaxoMetH67J5d85gxFtixKTJiQq6GPUiJdm4ice/XACuG1bYtUKB3hfoRE3ZUsgOxQDlocSOcvrLvYK2SRgoRTqP3Upk+Cuemc0n8a0fLEcm3nXBrte2LKL2othx0Zost+ev8SL1LG22LJwjAKhRXDf6s7/DooEsoWaQM4mGImpZAp7CNxuLzL5BAdE2njkq8VTeLM6zkFSxw8Q== | N9soErrrifQW4t0ZzwlErWKlbXaSggSve+MT+L0PzZpS+2m/qg+sKSJHDTlM4WWyly/dy0fHgLjANWl4vLAgWl6BwZY51pDVPldBfOD2NfjRmWKhk5vvh0j2BPOEkRn7k0UisUoCqSWglMZMxo6gA5vhUS1s9Gg2e8hg410eCengabReUSEpZpcbyB4yPdPXniwScXLBQX6AjIHEK2lRxtjLI5qWaSwOKeOP+b12Hrxgu3HB9eZ2Yr2Rt1O9+R8z/A5Xvz4NQ2oTxxLAwJ2LBbWbY6CvS/8n5460oSBe28xAG2rl2I96RbzRzoTXni7GNDEjkh27298t5v/HaOD/WA== |
3 | lyaQetBbfzUuDoK+u6MLs18H6pH4iafWZNrwc8CKjmilA0mTuUHt9+joXXexkGhkKV37WGnpz4+0lzNOAVQ0c6/LMqRi2FImWh4a1xlk7BM4MDKUY9SNfa562yvxvm3Sast3NHfrYbs/9Nwrd5eXsHwI60DmfRkxbc+P3mUMoK7trvnOG9vHPedKqDh3uXGVf0oAK+d6WLOxN5r5PCldEaqFt/ErDsHUgcvy3F0b6W9+Xx/a9s2x2/ZvcLkfY5oF7+td7a8ghQ9xWzeC8hNt35FcbDe0v1vPkV32UiiO3atKIz1rr2Gag29xaAEonXucwtaIaCGGywFzbMV/CvLC/w== | iFz4K/xh2sKLl8dobmZeB1xwtgXySSGmtn8qMPtPk3bJ5rxFKwgM/HxI77td3FFVsPS5sAN1WSDOuVBMwSWkgzX5qGB2qt6FgOdFj4Eh3ihSb9YqKOzTDGD4AzUwTbhYpdF406kji1ERurefGompR+qxgtPWDx9hqBbqwtC+qyp35Wiqp0tMgCn1ZPUC8Bgqb6SC9BakYS/XF8MAnA4mI7ARFvw74VRXdmHmaPox/lFSkUai5mfO2cOZaU6cqlFsLy6AzYTGhJsSQDIooFmyXjnmKq+sNxsliR/p4sCZUCF5nIXQo8uHWFcl+PI0eDidQOop/MVrB/P38E+3dttu4g== |
4 | CQdw+tk0Z2/bUzRVsTHv4i0DRPNE7ji/WqaiX3BDNlYKGRim96C6IxuBwElbz0gnvzrc19pc68SBUHF6+aNobR2uLEpQpJiOPGtrfFTvr101kQL/WqKsczgNkU9UVNEcNOiB9yBBNbH+3oevbbQ5uOkblUgf6gNbcN1n3mfRr6b0npaET/EiBwjJMAfHCGZ8KBacqfUn9M8Hv4Ie/tCs5B80fDcmm+GLVKZMCVeUa4olTtl8gOSqGlNNhUbX0K0xlFNzrxQ2G6NiQOy8M0Z4my+adW9o37EbcT03ypzHIKbYvyuiEE73lkAZKR0Rf1uhh51Y0pvTi2ELMXgjDcBn9A== | It9sQjB4X15z/1xgmZLg1UrZfmeM7mT++Me5AYQM9weuH8WXs9lQ+AdBeuH2yRly3ETlCz08vEqm/eHl0/z24eLQeHJz0t/S/8Qz2CnO/KX9AKlLr/rj5XVXCFWKkoh/qDPAd28d0OfL5WcbIkVB7uF9UbjZp2SdWEDaPRzFMiBE9OJ2lu/5bZSMjAqRylwwLKDc5jSJNtHzquMvoG+X7b08fzzPe2Nc7ViP9UvqADFGu8vZFhkQ9UiBdglHS05VV0PIcUmpn40pr+evnphX6fOlWgMh/F3T85EVtAtU1AWnH3vqnyqhhMIArHM5W2dflKp657d9x/DytVfa4iiy+Q== |
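In the output above, the two rows labeled 1 no longer share token values, whereas they did after tokenize. This suggests the sender-side output is meant for transport rather than for matching records. The sketch below shows one way to check this by counting distinct token values per label; `distinct_tokens_per_label` is a hypothetical helper, and the token values are the first 8 characters of the opprl_token_1 values shown in this notebook, truncated for brevity.

```python
from collections import defaultdict


def distinct_tokens_per_label(rows):
    """Count how many distinct token values appear for each label."""
    grouped = defaultdict(set)
    for label, token in rows:
        grouped[label].add(token)
    return {label: len(values) for label, values in grouped.items()}


# (label, truncated opprl_token_1) pairs taken from the tables in this notebook.
after_tokenize = [(1, "d2tUj3yR"), (1, "d2tUj3yR"), (2, "2wnRWhN9")]
after_transcrypt_out = [(1, "d8JaP02d"), (1, "Pj3ZIaIj"), (2, "bLkqSZYR")]

print(distinct_tokens_per_label(after_tokenize))        # {1: 1, 2: 1}
print(distinct_tokens_per_label(after_transcrypt_out))  # {1: 2, 2: 1}
```

A count above 1 for a label means rows for the same person cannot be joined on that column in its current form.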
Recipient Transcryption
You can also load secrets via dbutils and provide a custom key provider to Carduus.
from carduus.keys import SimpleKeyProvider

keys = SimpleKeyProvider(
    private_key=dbutils.secrets.getBytes("carduus", "PrivateKey"),
    public_keys={},
)

from carduus.token import transcrypt_in

tokens2 = transcrypt_in(
    tokens_to_send,
    token_columns=("opprl_token_1", "opprl_token_2"),
    key_service=keys,
)
display(tokens2)
label | opprl_token_1 | opprl_token_2 |
---|---|---|
1 | O47siK/9rItAv6lwaKgbH9MVuvZe0IwaBzEkQBRgHx9MWi02k8Gn/VvxoTsbA+4NB5DFMcAZeCS44azIKPV6M3tNySHQwSSDCQCPo+vGYPc= | uoPojmvjl3Mk734UlQb/bDHms7vHiptbYKbWCV1VWiE1svY/nAfp+Lm14b6Y/xJViuvqOrNIzpfItlOgU0UN//VCxg54FDLaZIZGfRhW1co= |
1 | O47siK/9rItAv6lwaKgbH9MVuvZe0IwaBzEkQBRgHx9MWi02k8Gn/VvxoTsbA+4NB5DFMcAZeCS44azIKPV6M3tNySHQwSSDCQCPo+vGYPc= | uoPojmvjl3Mk734UlQb/bDHms7vHiptbYKbWCV1VWiE1svY/nAfp+Lm14b6Y/xJViuvqOrNIzpfItlOgU0UN//VCxg54FDLaZIZGfRhW1co= |
2 | LfRQxBPEW0tEskwxtnooILPEi6/AtbfKCsWsPKMAS455xUeGfzxXdoTIdVnvKrpmT81aQA85dk7QqF+2K7Vjcg291ov2Skz+eMjF7n8DdZM= | P9PLhpSz+kWSSVmYQ8yCwAbQpohy/Ua8R1sSPytPpYb+CLN1KigfKedlIHZUWhf1tTq5gOjPeZp80rHtjJpsPbXAG0rWe9pjhP/Vjhdp140= |
3 | QWvpT0wMEezlurVFMa+yJIKJdAJQXuie8/VuqqKsSQDaGMDkV1s0PX9RTumY75oQdAwPGTFZvCoOGWn0jIIrdY93T7s0qK3O0raXgZCKtHg= | W9BxmxZf1/yqIDx3WCwqq0aPLznw7GATRx7ETrxLYfNBiuozxUBTs5Z3Ig9sErG7owJm16suzKHJ4ICYRtteoni8Ystu2ou7ODgnny/BxOs= |
4 | aS/BfI5qw8UnhOWUXyUNQMAaZRKGCC3/l/htFpHYhVecUv1EtRE0ht893KPkZbVpNsw7nORH9TlRbl8vU1WaNz2Kf04sw2iqcMVHx+VjD3s= | nWfWyxQcrSfeiMEHr98Z+mq7u3DMT1xdSJ0nrw8nTeKR0s/Gh78f+pDJNooeJUv4vfuuJFFqWEI2dkC5Q99ZBkdwiYnp/cj5Htif/Gk9lHY= |