The Person Data Standardization tool takes a table with personally identifiable information (PII; e.g., names, addresses, and dates of birth), and standardizes the PII so that you have higher quality PII data for tasks such as CASS/NCOA, person matching, and identity resolution.
The Person Data Standardization tool can be found on the top navigation bar under the Tools > Enhancements > More Enhancements... tab.
Expected Input and Output
Your input PII table is either a Redshift table on Civis Platform or a Civis file.
If your input table is a Redshift table, then the output will be a new Redshift table. Your input table is not altered in any way.
If your input table is a Civis file, it has to be parsable as a CSV file that has a header with column names and uses the comma as the delimiter. The output table will be a new Civis file (in the CSV format, with the comma as the delimiter and all cells quoted by the double quote). Your input table is not altered in any way.
The output table contains only the fields specified in your field mapping (more on this below), plus the additional, derived fields from those you have mapped.
The column names of the output table are the recognized field names listed below.
If your input table has extra columns irrelevant for person data standardization, these columns are simply ignored, and are not found in the output table.
Field Mapping
The Person Data Standardization tool recognizes a set of PII field names. You will have to input the mapping from this set of recognized field names to the column names of your input table.
Recognized Fields
Here are the full set of the recognized PII field names:
Field | Notes |
primary_key
|
A unique identifier. This field is required |
full_name |
Composed of either first name + last name, or first name + middle name + last name |
first_name |
|
middle_name |
|
last_name |
|
full_address |
For a column that appears to combine house number, street, and apartment/suite number |
address_1 |
address line 1 |
address_2 |
address line 2 |
house_number |
|
street |
|
unit |
apartment/suite/unit number |
city |
|
zip |
Either the 5-digit or 9-digit form |
zip4 |
The zip+4 code. You may have it from the output of CASS/NCOA, for example |
state |
Full state name, e.g., Illinois |
state_code |
2-letter state code, e.g., IL |
birth_date |
Any common US date formats such as "19601230", "1960-12-30", "12-30-1960", "12/30/1960", "December 30, 1960", etc. |
birth_year |
|
birth_month |
Numerical forms such as "9", "09" or full/abbreviated forms such as "September", "Sept", "Sep" are all acceptable |
birth_day |
|
phone |
|
gender |
|
email |
|
lat |
|
lon |
Conflicting Fields
Certain fields may potentially provide conflicting information, and therefore these combinations in the field mapping are disallowed and would fail the job immediately.
If you use this field... | You cannot use.... |
full_name |
first_name middle_name, and last_name |
full_address |
address_1 or address_2 |
full_address |
house_number, street, and unit |
address_1 or address_2 |
house_number, street, and unit |
state |
state_code |
birth_date |
birth_year, birth_month, and birth_day |
Inputting The Field Mapping
In the "FIELD MAPPING" box, input your field mapping in YAML format, where the keys are the recognized PII field names from the list above, and the values are the column names of your table. For instance:
primary_key: your_primary_key_column
first_name: your_first_name_column
last_name: your_last_name_column
You do not have to map all of the recognized fields. Simply map those you would like to use based on the columns of your input table.
What standardization steps are taken?
Basic Standardization
For all fields (except primary_key
-- it is untouched):
- Strip leading and trailing whitespace.
- Force everything into uppercase.
- Convert all non-ASCII characters (e.g., characters with accents) into their closest ASCII counterpart (essentially characters available in the US keyboard layout).
Derived Fields
Certain fields are parsed to create derived fields. For instance, if you use birth_date
, your output table has birth_date
as well as its derived fields birth_year
, birth_month
, and birth_day
. Here is the complete list of derived fields:
Field(s) | Derived Field(s) |
full_name |
first_name , middle_name , and last_name |
first_name and last_name |
full_name and middle_name |
state |
state_code |
birth_date |
birth_year , birth_month , and birth_day |
birth_year , birth_month , and birth_day |
birth_date |
zip |
zip4 |
full_address , house_number , street , unit , address_1 , and address_2. For these address-related fields, any of them in your input field mapping will derive all of the other address-related |
Field-specific Standardization
zip
: All non-digit characters are removed for a 5-digit form; zeros are left-padded if there are fewer than 5 digits. If a 9-digit form is detected, thenzip
only takes the first 5 digits, andzip4
will take the last 4 digits (ifzip4
is not already available).phone
: All non-digit characters are removed for a 10-digit form. If a 11-digit form with an initial "1" (= country code for the US and Canada) is detected, then the initial "1" is removed, resulting in a 10-digit form.gender
: For how gender is encoded in the sources of the Civis data assets, "F" or "Female" (case-insensitive) is converted to "F", and "M" or "Male" (case-insensitive) to "M", and all other non-empty strings are converted to "U".email
: If a Gmail address is detected, all periods "." in the local-part (in "local-part@domain") are removed. A plus sign followed by an alphabetic string in the local-part is also removed (e.g., "foo+bar@gmail.com" becomes "foo@gmail.com").- All fields except
primary_key
,email
,unit
,address_2
,lat
, andlon
: Leading and trailing punctuation characters (i.e.,!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
) are stripped.
Data Quality Statistics
When your job finishes, data quality statistics (together with your job's metadata) are shown in the run log pane. These statistics summarize the standardization operations and flags by field.
FAQs
-
Q: How should the output table of the Person Data Standardization job be used as the input to a matching job (e.g., Civis Data Match)?
A: The Person Data Standardization job derives extra fields based on what fields are available in your input table (see the section "Derived Fields" above). Some of these derives fields (e.g.,
full_name
,address_1
, andaddress_2
) are needed for other person data enhancement jobs (e.g., CASS/NCOA). For matching, the general principle is to use the most granular fields available (e.g.,birth_year
,birth_month
, andbirth_day
) from the output table of the Person Data Standardization job, instead of a compound field (e.g.,birth_date)
. To be concrete, here are the fields that should be used in matching:Use This Not This first_name, middle_name, last_name
full_name
house_number, street, unit
full_address, address_1, address_2
state_code
state
zip
zip4
birth_year, birth_month, birth_day
birth_date
-
Q: Does the Person Data Standardization job also perform CASS/NCOA and geocoding?
A: No, as far as address standardization goes, the Person Data Standardization job attempts to parse the addresses for the various components (especially
house_number
,street
, andunit
) to the extent possible, but does not run CASS/NCOA or geocoding.
Comments
0 comments
Please sign in to leave a comment.