Only use alphanumeric or these 3 special characters: period (.), dash (-), and underscore (_)
Ampersand (&) should be replaced by “and” if needed
Each header must be unique
e.g. can’t have two headers called "duration"
Units of measure should be omitted
Units can and must be provided with the data dictionary
Keep short (less than 30 characters)
A full description can and must be provided with the data dictionary
Unique identifiers should be in the left-most column if applicable
Date and time variables should be in the first column for time series data
Fixed or classified variables should be ordered with the highest-level variable on the left and most granular variable on the right, for example
Observed variables should always be on the rightmost columns, these are measured variables often numeric, the following are some example field names that could be observed variables:
Duration
Number of Units
Number of Stories
Year Built
People Served
UTF-8 encoding should be used
This ensures that special characters can be decoded by users
No line breaks within cells
This can break parsing in software like Excel, introducing data integrity issues
There are many ways to remove and detect line breaks, but this can vary based on how you're extracting data
Text should be presented in the easiest to interpret/read format where appropriate.
Title case
Address String
Categories when either the source system presents them this way or it is easy to interpret from the source consistently
Upper case
Acronyms - e.g - PSA (Park Service Area)
States - e.g. CA
Lower case
Categories when the source system presents them in caps and there's no way to interpret them to title case
Research suggests lower case as opposed to uppercase is easier to read for humans and just as useful to machines, note exceptions above
No commas
e.g. "1000" instead of "1,000"
No units of measurement
Units should be in metadata instead
Express as full number where possible
e.g. "1200000" instead of "1.2" (million)
If rounded, indicate in metadata
No rounding if possible
Give raw numbers as far as possible
If rounding is needed, try to provide at least 2 decimal places of precision and indicate rounding in metadata
Percentages can be expressed as either a proportion out of 1 or 100.
e.g. 20% can be expressed as 20 or 0.2
The representation of percentages must be consistent throughout your dataset (e.g. among different percentage fields)
You must indicate how percentages are expressed in the data dictionary
Based on ISO8601, an international standard for representing date and time. We chose the "extended format" with the hyphens because it is more human readable.
Compare 2016-01-01 to 20160101
All date and time variables must be local time (UTC -8hrs Pacific Standard Time UTC -7hrs Pacific Daylight Savings Time) unless specified.
Use the data dictionary to specify any important information about time encoding
For fiscal periods, prefix "fiscal_" to column name
Fiscal year start date must be indicated in the data dictionary
e.g. The fiscal year starts on July 1 and ends on June 30 for the State of California
ISO 8601 uses 24 hour clock system in hh:mm:ss format sometimes referred to as military time (do not use AM or PM)
e.g. 13:00 is equivalent to 1:00 PM
Specify the timezone if it is not local time (UTC -8hrs Pacific Standard Time UTC -7hrs Pacific Daylight Savings Time):
In certain cases you may want to provide a single variable representing the number or name of an individual date component, a day, a month, etc. There's no requirement to provide these, but follow this guidance:
Durations can be automatically calculated if you provide a separate start and end period in your dataset. If you also want to provide a duration, please:
Provide the milliseconds between the start and end period (include the duration unit in the data dictionary)
Milliseconds can be rolled up to other time intervals
Use duration in your column name but prepend with a useful descriptor, e.g:
flight_duration
response_duration
dwell_time_duration
travel_duration
Do not duplicate any of the duration column names per the guidance on columns
Note: ISO 8601 does have separate guidance on duration formatting, but we find this more cumbersome than just calculating milliseconds between a period for which there are many standard programming libraries.
Referenced from:
This section covers format and structure standards for datasets being shared with others. These standards are designed to make sure that field level information is shared as consistently as possible to minimize:
Errors
Rework
Repetitive questions
This reference covers:
Interval | Column name | Format | Range of values | Example |
---|---|---|---|---|
Interval | Column name | Format | Example |
---|---|---|---|
Type | Column name | Format | Example |
---|---|---|---|
Type | Column name | Format | Example |
---|---|---|---|
Extract | Column name | Type | Range of values |
---|---|---|---|
This section is adopted from with many thanks to . The bulk of which made its way into this section with additions and modifications.
Want to provide feedback on future data prep and data quality guides?
.
Annual
year
YYYY
YYYY: any valid year
2022
Monthly
month
YYYY-MM
MM: 01 to 12
2022-01
Daily
date
YYYY-MM-DD
DD: 01 to 31
2022-01-01
Weekly
week
YYYY-[W]WW
[W]WW: W01 to W52
2022-W01
Quarterly
quarter
YYYY-[Q]Q
[Q]Q: Q1 to Q4
2022-Q1
Half-yearly
half_year
YYYY-[H]H
[H]H: H1 or H2
2022-H1
Fiscal, annual
fiscal_year
YYYY
2015
Fiscal, monthly
fiscal_month
YYYY-MM
2015-01
Fiscal, quarterly
fiscal_quarter
YYYY-[Q]Q
2015-Q1
Fiscal, half-yearly
fiscal_half_year
YYYY-[H]H
2015-H1
Date + time
date_time
YYYY-MM-DD[T]hh:mm
2015-01-01T13:00
or YYYY-MM-DD[T]hh:mm:ss
2015-01-01T13:00:00
Time only
time
hh:mm
13:00
or hh:mm:ss
13:00:00
Date + time
date_time
YYYY-MM-DD[T]hh:mm+hh:mm
2015-01-01T12:00+00:00
or YYYY-MM-DD[T]hh:mm:ss+hh:mm:ss
2015-01-01T12:00:00+00:00:00
Year
year_num
integer
any valid year
Month
month_num
integer
1 to 12
Month Name
month_name
string
January, February, March, April, May, June, July, August, September, October, November, December
Week of Year
woy_num
integer
1 to 52
Day
day_num
integer
1 to 31 (varies by month)
Day of Week
dow_num
integer
1 to 7
Day of Week Name
dow_name
string
Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
Hour
hour_num
integer
1 to 24
Minute
minute_num
integer
1 to 60
Second
second_num
integer
1 to 60
Consistent formatting of valid addresses is important for accurately mapping and referencing geographic information
A poorly formed address could end up mapping to the wrong geographic reference or not at all, reducing the usefulness of the data
Poorly formed addresses can make cleanup of data labor intensive and result in reporting errors where geography (neighborhoods, census, etc.) is concerned
Poorly formed addresses could also result in additional costs because of things like:
Undeliverable/returned mail
Failure to apply benefits to recipients appropriately based on geography
Poor routing of vehicles or people in the field
Addresses should be output with the level of detail relevant to the data
e.g. permits applied down to the sub-address level
If providing addresses in a complete string, make sure the addresses are well formed and consistent for easy parsing, for example:
741 Ellis Street, Unit 5, San Francisco, CA 94109
901 Bayshore Boulevard, Unit 209, San Francisco, CA 94124
When providing multiple addresses within a dataset, prepend your column names with the type of address
e.g. address vs. mailing_address
Below are some common elements of an address (but not all)
Not all addresses will have all elements
Address granularity will be driven by the business need, so not all systems will collect every element
Note: systems can be designed to validate or lookup addresses on entry, minimizing error
Make sure the individual elements of an address line up with the guidance below
You can publish addresses as either single strings or break into separate fields
Note: this guidance is provided to promote consistency across the bulk of shared tabular datasets and not as a comprehensive guide to address standards. For a comprehensive standard on addressing, see the
Element | Data Type | Definition |
---|
From Address Number | Numeric | First part of a range: 1000-1100 Main Street, San Francisco, CA 94102 |
To Address Number | Numeric | Second part of a range: 1000-1500 Main Street, San Francisco, CA 94102 |
Address Number Prefix | Numeric | The portion of the Complete Address Number that precedes the Address Number itself: B315 Main Street, San Francisco, CA 94102 |
Address Number | Numeric | The numeric identifier for a land parcel, house, building, or other location along a thoroughfare or within a community: 315A Main Street, San Francisco, CA 94102 |
Address Number Suffix | Text | The portion of the Complete Address Number that follows the Address Number itself: 315 A Main Street, San Francisco, CA 94102 |
Street Name Pre Modifier | Text | A word or phrase in a Complete Street Name that 1. Precedes and modifies the Street Name, but is separated from it by a Street Name Pre Type or a Street Name Pre Directional or both, or 2. Is placed outside the Street Name so that the Street Name can be used in creating a sorted (alphabetical or alphanumeric) list of street names.: 315A Old Main Street, San Francisco, CA 94102 |
Street Name Predirectional | Text | A word preceding the street name that indicates the directional taken by the thoroughfare from an arbitrary starting point, or the sector where it is located: 315A East Main Street, San Francisco, CA 94102 |
Street Name Pretype | Text | A word or phrase that precedes the Street Name and identifies a type of thoroughfare in a Complete Street Name: US Route 101, San Francisco, CA |
Street Name | Text | The portion of the Complete Street Name that identifies the particular thoroughfare (as opposed to the Street Name Pre Modifier, Street Name Post Modifier, Street Name Pre Directional, Street Name Post Directional, Street Name Pre Type, Street Name Post Type, and Separator Element (if any) in the Complete Street Name.): 315A Main Street, San Francisco, CA 94102 |
Street Name Posttype | Text | A word or phrase that follows the Street Name and identifies a type of thoroughfare in a Complete Street Name: 315A Main Street, San Francisco, CA 94102 |
Street Name Postdirectional | Text | A word following the street name that indicates the directional taken by the thoroughfare from an arbitrary starting point, or the sector where it is located: 315A Main Street East, San Francisco, CA 94102 |
Street Name Post Modifier | Text | A word or phrase in a Complete Street Name that follows and modifies the Street Name, but is separated from it by a Street Name Post Type or a Street Name Post Directional or both: 315A Main Street Extended, San Francisco, CA 94102 |
Occupancy Type | Text | The type of occupancy to which the associated Occupancy Identifier applies. (Building, Wing, Floor, Apartment, etc. are types to which the Identifier refers.): 315A Main Street, Apt 2, San Francisco, CA 94102 |
Occupancy Identifier | Text | The letters, numbers, words, or combination thereof used to distinguish different subaddresses of the same type when several occur within the same feature: 315A Main Street, Apt 2, San Francisco, CA 94102 |
City | Text | The city the address sits within: 315A Main Street, San Francisco, CA 94102 |
State Name | Text | The names of the US states and state equivalents: the fifty US states, the District of Columbia, and all U.S. territories and outlying possessions. A state (or equivalent) is "a primary governmental division of the United States." The names may be spelled out in full or represented by their two-letter USPS or ANSI abbreviation: 315A Main Street, San Francisco, CA 94102 |
ZIP code | Numeric | A system of 5-digit codes that identifies the individual Post Office or metropolitan area delivery station associated with an address: 315A Main Street, San Francisco, CA 94102 |
ZIP+4 | Numeric | A 4-digit extension of the 5-digit Zip Code (preceded by a hyphen) that, in conjunction with the Zip Code, identifies a specific range of USPS delivery addresses: 315A Main Street, San Francisco, CA 94102-1212 |