It takes a team to deliver high-quality open data. There are three roles that will help you ensure publishing moves forward. They are pictured below and described in the checklist that follows. In some cases a single person may fill multiple roles, but you shouldn't be on your own entirely.
Others may be needed as you move through the process
These three roles are core, but you may need to bring in others as you move through publishing and approval. Legal and communications staff, for example, may be needed in your process. You can rely on your data coordinator to advise on who else to involve.
Start identifying the fields you want to publish. This will help you and others clarify what you are planning to publish. Keep your documentation template (linked below) somewhere safe as you'll return to it again as you move toward publishing data.
Hold on to your data documentation template, you'll need it later
By the time you publish, you should have created meaningful definitions; this is covered in Step 3, Create Metadata and Data Dictionary. You'll use this same template in that section, and you will eventually copy certain elements over when you upload your dataset on the portal in Section 4, Upload the Dataset.
If your dataset needs to be published on a regular schedule, it's good to start thinking of what to do about that now. Even if you don't need automation, thinking about how to help others publish the data early will save you headaches later.
When data is published more frequently than quarterly, we highly recommend automation. Each organization will vary in its approach, but you can work with your IT department to investigate whether automation is possible.
If you're updating your dataset quarterly or more frequently:
Pro tip
Consider alternate publishing strategies like initially publishing manually and then following up with automation when resources are ready.
If automation is not possible, or this is a dataset that gets updated infrequently (like once a year), you may update data manually. You'll still want to make sure the approach to publishing is well documented so you can easily cross-train others on updates. Start with the following:
This section provides instructions on how to upload a dataset. You can refer to the page on detailed steps for uploading data to the portal for deeper guidance.
The following only applies to publishing directly on .
If you publish to another portal, please follow the directions provided by your State organization.
All reviewers must have accounts on data.ca.gov to review the private dataset
If you haven't checked, make sure all reviewers have access to publishing within your organization on the portal. If you need to request access, .
Check out these complementary videos on uploading data to the portal
Reminder: select Other (Public Domain) for your data license
The portal gives you an option for license. At this time, we cannot set a default license in the system, so enter this as Other (Public Domain).
Don't make your dataset public until you've received approval
Share your private dataset link with appropriate reviewers identified with your for feedback and approval
Publishing requires approval. Work with your to make sure you're following your organization's process.
This section provides high-level things to check when preparing your data. You will likely need to do additional quality checks that are specific to your data. Treat this section as the minimum set of checks.
See below to help us build out more detailed guides and references for data preparation and data quality.
If merging tables from multiple sources (like counties or regions)
Check all expected fields are accounted for across data sources
Check the same number of rows exist in your merged dataset as there are in your individual tables
Check that data types are consistent within fields in your dataset. For example, if a field is supposed to contain integers, confirm that it only contains integers
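If you script your merge, these checks can be automated. Below is a minimal sketch using only the Python standard library; the county extracts and field names are hypothetical.

```python
import csv
import io

def merge_and_check(csv_texts, expected_fields, int_fields=()):
    """Merge tabular extracts and run the minimum pre-publishing checks."""
    sources = [list(csv.DictReader(io.StringIO(t))) for t in csv_texts]
    # Check all expected fields are accounted for across data sources.
    for rows in sources:
        assert set(rows[0].keys()) == set(expected_fields), "field mismatch"
    merged = [row for rows in sources for row in rows]
    # Check the merged row count equals the sum of the individual tables.
    assert len(merged) == sum(len(rows) for rows in sources)
    # Check data types are consistent within fields (integer fields here).
    for field in int_fields:
        assert all(r[field].lstrip("-").isdigit() for r in merged), field
    return merged

# Hypothetical county extracts; in practice these would be files on disk.
county_a = "permit_id,duration\n1,30\n2,45\n"
county_b = "permit_id,duration\n3,15\n"
merged = merge_and_check([county_a, county_b],
                         ["permit_id", "duration"], ["duration"])
```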
Save your tabular data file as a delimited file such as a comma-separated values (CSV) file
Resources on using and exporting delimited files like CSV
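For example, Python's built-in csv module writes a well-formed comma-separated file. This is an in-memory sketch with made-up rows; in practice you would write to a file with UTF-8 encoding.

```python
import csv
import io

# Write rows as CSV; csv.writer handles quoting of commas inside cells.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows([
    ["permit_id", "project_name"],      # header row
    ["1", "Main Street, Phase 2"],      # embedded comma gets quoted
])
csv_text = buf.getvalue()
# To write a real file: open("permits.csv", "w", encoding="utf-8", newline="")
```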
The Data Steward is the person most knowledgeable about the data including the sources, collection methods, and limitations.
The Data Coordinator acts as a liaison between internal Information Technology staff, organizational programs and leadership, and portal managers.
The Data Custodian is the person most knowledgeable about how the data is stored and protected and has technical knowledge of how to query and extract data.
They prepare data for publishing on the portal, work with Data Custodians for any system access needs, and work with the Data Coordinator for publishing approval.
They are best positioned to convey to the appropriate parties any specific needs of the open data portal and program. They are trusted partners in open data within their organization.
They advise and help with data access and navigate technical options for automation.
Metadata is data about data. Metadata describes the dataset’s structure, data elements, its creation, access, format, and content. A data dictionary is a type of metadata that focuses on the data elements.
Metadata is necessary to improve the discoverability of data within the open data portal and on external search engines. The more relevant information the search engine has about your data resources, the easier it will be for users to find.
Without good metadata, datasets are prone to getting lost. Below we define minimum standards and best practices for:
Resource reminder!
Use the metadata template started in Step 1 to document according to this guide.
Fill in the metadata fields relevant to your dataset - see metadata field definition reference
Make sure your dataset title is accessible and user friendly - see best practices below
Ensure your dataset description is accessible and user friendly - see best practices below
A data dictionary is the information you provide that defines the fields in your data and how the data can be used.
For each field, document the field name, field label, data type, definition and valid values if applicable - see detailed reference on these elements
Write field definitions in user friendly language - see best practices below
Are there changes flagged by your reviewers after sharing in the previous step on uploading the data?
Yes - implement the changes, and re-send for review and approval
No - receive final approval per your organization’s process and then set the visibility of the dataset to public - see this guide on changing the visibility of a dataset
Celebrate your publishing, here are some things you can do:
Work with your communications team to advertise your dataset on social media
Write a blog post incorporating an interesting analysis of the open data
Version 1.0.1 | Last Updated May 31, 2022
This handbook is for California State employees who want to publish open data on the State's Open Data Portal (https://data.ca.gov).
The guidance on uploading and publishing data (sections 4 and 5) only applies to direct publishing on https://data.ca.gov. However, the rest of the guidance establishes minimum expectations for preparing data for publishing.
You can see a list of State organizations that maintain their own open data portals. Reach out to your portal administrator or data coordinator if you have questions about publishing on those portals.
This guidance will evolve and grow with feedback. Throughout, we've called out opportunities for feedback on additional guidance indicated with a megaphone emoji (). Any feedback can be submitted to opendata@state.ca.gov.
Before diving in, it's important to understand a common definition of open data. This handbook guides you through publishing in a way that is consistent with this definition.
The Open Knowledge Foundation has developed a standard open data and content definition, summarized below.
Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.
Building on this, open data must be openly licensed, accessible, machine readable, and published in an open format.
You can read more detail on what is open data, what is not open data, and value propositions for open data in the reference section of this handbook.
Publishing a new open dataset takes some planning and coordination, but it doesn’t have to be difficult. This handbook is designed to provide a reference guide you can return to as you go.
Get started by:
Skimming the handbook to familiarize yourself with its content
Starting immediately with the pre-publishing checklist
Bookmarking and coming back as you move through getting your data ready for publishing
The handbook is divided into sections that line up with a general publishing process. The diagram below shows those steps, and they are listed below with links to each section.
Publishing process steps
Review the pre-publishing checklist. Summarizes things to get started on and be aware of early in the process to minimize surprises later on.
Prepare data for publishing. Guidance on preparing your dataset for publishing including identifying and implementing any necessary cleaning and merging of data as needed.
Create metadata and data dictionary. Guidance on minimum metadata and documentation needed to make the dataset useful to others.
Upload the dataset. Guidance on uploading the dataset to the open data portal for those publishing directly to data.ca.gov.
Get final publishing approval. Guidance on getting final approval to make the dataset publicly available.
Update and maintain the dataset. Guidance on ways to make sure the dataset is updated and maintained appropriately.
It is important to maintain data updates according to the target frequency indicated in the metadata.
While automated updates are ideal, especially for more frequently updated data, we know that sometimes that's not an option. Below are considerations for making sure you can maintain data updates in line with your target frequency.
Automation will most likely be set up by a technical team in your organization. The following are practices to make sure your automation is resilient.
Dataset update steps are captured and annotated through an automated process (e.g. scripts or ETL/ELT platform)
Standardized and user-friendly alerts for automation success and failure are sent to appropriate points of contact including technical and program contacts
Sufficient logs are kept to ease troubleshooting issues and identifying root causes of ongoing problems
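As one illustration of the logging practices above, an update script can wrap each step so failures are logged with full tracebacks before alerting contacts. This is a sketch; the step names are placeholders, and real alerts would come from whatever your ETL platform provides.

```python
import logging

logger = logging.getLogger("dataset_update")

def run_step(step_name, step_fn):
    """Run one update step, keeping logs that ease troubleshooting."""
    try:
        result = step_fn()
        logger.info("step %s succeeded", step_name)
        return result
    except Exception:
        # Full traceback goes to the log; hook failure alerts here.
        logger.exception("step %s failed; notify technical and program "
                         "contacts", step_name)
        raise

row_count = run_step("extract", lambda: 1250)  # placeholder extract step
```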
Place update procedure documents () in a common document repository where other staff can access as needed
Identify one or more staff who could manage the dataset and help address issues in your absence or upon leaving your role
Cross-train staff on updates to ensure continuity
Data Steward
Data Coordinator
Data Custodian
Additional Resources. Refer to Data Dictionary: What to Include for further guidance on what to include in the data dictionary.
If you are in one of the following State organizations, you will publish open data directly through their open data portal, otherwise you can publish to the Statewide Open Data Portal.
Executive Agency Portals
Individual State Entity Portals
Is there a CA open data portal missing from this list?
Please reach out to your data coordinator or portal administrator for any specific guidance. You should still prepare data in line with this handbook, even though you'll publish to a different portal.
When publishing to your organization’s portal, in many cases, those datasets will be "harvested" automatically so they are discoverable through the California Open Data Portal (you don’t need to know exactly how harvesting works to publish open data).
Only use alphanumeric or these 3 special characters: period (.), dash (-), and underscore (_)
Ampersand (&) should be replaced by “and” if needed
Each header must be unique
e.g. can’t have two headers called "duration"
Units of measure should be omitted
Units can and must be provided with the data dictionary
Keep short (less than 30 characters)
A full description can and must be provided with the data dictionary
Unique identifiers should be in the left-most column if applicable
Date and time variables should be in the first column for time series data
Fixed or classified variables should be ordered with the highest-level variable on the left and most granular variable on the right, for example
Observed variables should always be in the rightmost columns. These are measured variables, often numeric; the following are some example field names that could be observed variables:
Duration
Number of Units
Number of Stories
Year Built
People Served
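These header rules are easy to enforce with a small validator. A sketch in Python follows; the case-insensitive duplicate check is an assumption, so adjust it to your own conventions.

```python
import re

HEADER_RE = re.compile(r"^[A-Za-z0-9._-]+$")  # alphanumeric plus . - _

def check_headers(headers):
    """Return a list of problems found in a set of column headers."""
    problems = []
    seen = set()
    for h in headers:
        if not HEADER_RE.match(h):
            problems.append(f"{h!r}: contains disallowed characters")
        if len(h) >= 30:
            problems.append(f"{h!r}: not less than 30 characters")
        if h.lower() in seen:  # assumes duplicates are checked case-insensitively
            problems.append(f"{h!r}: duplicate header")
        seen.add(h.lower())
    return problems
```

For example, `check_headers(["duration", "Duration"])` flags the repeat, and a header like `cost_&_fees` would be rejected until the ampersand is replaced by "and".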
Referenced from:
This section covers format and structure standards for datasets being shared with others. These standards are designed to make sure that field level information is shared as consistently as possible to minimize:
Errors
Rework
Repetitive questions
This reference covers:
This section is adapted from guidance published by DataSF, with many thanks to Singapore's Open Data Program for providing a Data Quality Guide for Tabular Data, the bulk of which made its way into this section with additions and modifications.
Want to provide feedback on future data prep and data quality guides?
The Open Knowledge Foundation (OKF) has a standard definition of open in both short and detailed form. Below is the short definition offered by OKF, which we further define specific to California standards.
Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.
To accomplish this, open data:
Is released in the Public Domain. Data must be in the Public Domain and provided at no cost to users.
Is accessible and discoverable. Data must be published to an official State open data portal without restrictions. Any additional information necessary for attribution or citation must also accompany the data.
Is published with timely updates. Data must be published in a manner to minimize time between the creation and dissemination of the data.
Is machine readable. Data must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.
Is in an open format. Data must be provided in an open format. An open format is one which places no restrictions, monetary or otherwise, upon its use and can be fully processed with at least one free/libre/open-source software tool. For example, the most common and usable open formats for tabular data are: CSV and JSON.
There are many things the State does to share data or reports about data. You can consider these "data products," but they are not open data by the definition above.
The table below describes several data products that are sometimes confused with open data, the reasons why they aren't open data, and ways to "upgrade" them to open data.
Data product and description | Why this is not open data | How to upgrade to open data |
---|---|---|
Open data is not just something we do for the sake of open data. There are real benefits including:
Stimulating new ideas and services. By releasing open data, State organizations may help to stimulate new and innovative ideas from Californians. There is great potential for open data to act as the fuel for new solutions and even new businesses that can address common problems or challenges facing those that live in, work in, or travel to the State of California. For example, see projects developed as part of the California Water Data Challenge.
Increasing cross-organizational data sharing. If data can be shared in the open, you can leverage the open data portal as an interface to data between departments and agencies and other external organizations. This can also save from additional costly investments in data infrastructure. Combining information from different State departments and agencies can also provide valuable insights into important areas that many organizations touch including health equity, climate change, and drought response to name just a few.
Simplifying Public Records Act (PRA) Requests. Open data releases can be an effective way of responding to requests for data made under the Public Records Act. One open data release may address multiple requests for information that can be repetitive and costly to respond to if addressed on an individual basis.
Improving data quality. Having more eyes on data helps improve the quality over time. Open data publishing allows and encourages users to provide feedback on accuracy, consistency, and other quality measures, important feedback that can help departments get better results from their own internal data uses.
Reducing unwanted web traffic. Publishing open data can also help reduce unwanted web traffic on department and Agency websites, which is often the result of “data scraping” by individuals seeking to obtain data in bulk from the State through public applications. This puts unnecessary stress on the State's technology infrastructure and unneeded burden on IT staff.
Changing how we use data. Open data can serve as a platform to change how we use, share, and consume data externally and internally, transform data into services, and foster continuous improvement in decision making and the business of government. Ultimately, open data is about enabling use of data to help support a range of positive outcomes.
UTF-8 encoding should be used
This ensures that special characters can be decoded by users
No line breaks within cells
This can break parsing in software like Excel, introducing data integrity issues
There are many ways to remove and detect line breaks, but this can vary based on how you're extracting data
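One simple way to strip embedded line breaks before export is shown below. Detection and removal vary with how you extract data, so treat this as a sketch.

```python
def clean_cell(value):
    """Replace line breaks inside a cell, which can break parsing in Excel."""
    return " ".join(str(value).splitlines())

assert clean_cell("315 Main Street\r\nUnit 5") == "315 Main Street Unit 5"

# Writing with explicit UTF-8 encoding ensures special characters decode:
# open("out.csv", "w", encoding="utf-8", newline="")
```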
Text should be presented in the easiest to interpret/read format where appropriate.
Title case
Address String
Categories when either the source system presents them this way or it is easy to interpret from the source consistently
Upper case
Acronyms - e.g - PSA (Park Service Area)
States - e.g. CA
Lower case
Categories when the source system presents them in caps and there's no way to interpret them to title case
Easier for humans to read and just as useful to machines; note the exceptions above.
Keep titles concise and informative.
Avoid using CA or California in the title if it does not meaningfully clarify the scope.
Avoid using jargon and spell out acronyms. Avoid placing dates or years in your dataset title (e.g. 2016-2021). Instead make sure your data includes relevant date information as fields. Describe any useful limitations on observed dates in your dataset description instead of title.
Create a summary paragraph that details the contents of your data table. The first few sentences are the most important.
Include purpose of dataset including the programs or polices the data supports.
Include related legislation if applicable (especially if it defines the method and/or attributes of collection).
Include data collection method and source (not the name of the database, but from what process, people, or organizations does the data come).
Include relevant acronyms, but make sure to clearly define them at least once. Highlight common questions or important notes about the dataset like limitations, missing periods of time, etc. If your description is long, consider linking to a more detailed document and summarizing the key points in your description.
Avoid using acronyms in your first few sentences without definition. Avoid naming just the database the data comes from. Instead highlight the process and methods for collecting the data.
Be precise, unambiguous, and concise.
Include relevant acronyms, but make sure to clearly define them at least once.
If the value is a date, document the time zone of the recording, e.g. PDT (Pacific Daylight Time).
If the values are calculated, the source of raw data and calculation method should be included.
Include units of measurement if applicable.
Include any known limitations of the data collected, e.g. groundwater levels were not measured in the month of January.
If the field is a category, include the list of allowable values.
Avoid writing these definitions from the perspective of an expert; write with the novice user in mind.
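Put together, a single data dictionary entry might look like the following. The field and its values are hypothetical, shown as a Python dict purely for illustration.

```python
# A hypothetical data dictionary entry following the practices above.
field_entry = {
    "field_name": "groundwater_level_ft",
    "field_label": "Groundwater Level (feet)",
    "data_type": "number",
    "definition": (
        "Depth to groundwater in feet below land surface. "
        "Groundwater levels were not measured in the month of January."
    ),
    "units": "feet",
    "valid_values": None,  # only applicable to category fields
}
```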
No commas
e.g. "1000" instead of "1,000"
No units of measurement
Units should be in metadata instead
Express as full number where possible
e.g. "1200000" instead of "1.2" (million)
If rounded, indicate in metadata
No rounding if possible
Give raw numbers as far as possible
If rounding is needed, try to provide at least 2 decimal places of precision and indicate rounding in metadata
Percentages can be expressed as either a proportion out of 1 or 100.
e.g. 20% can be expressed as 20 or 0.2
The representation of percentages must be consistent throughout your dataset (e.g. among different percentage fields)
You must indicate how percentages are expressed in the data dictionary
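A quick way to apply the numeric rules when preparing a file (a sketch; real cleanup depends on your source system):

```python
def normalize_number(raw):
    """Strip commas and whitespace; reject cells with units or other text."""
    cleaned = str(raw).replace(",", "").strip()   # "1,000" -> "1000"
    float(cleaned)  # raises ValueError if non-numeric text remains
    return cleaned

assert normalize_number("1,000") == "1000"
assert normalize_number("1200000") == "1200000"  # full number, not "1.2 million"
```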
Web application. A public-facing application that allows users to search for specific data and possibly generate reports
Develop an automated process from your backend system to extract the raw data in the application and load it to the open data portal. Once there, users can access the data as a single download or through an Application Programming Interface (API). Provide a link from your application to enable discovery of raw and bulk data, which will take burden off your application. You can also link to your application from the published open data.
Dashboard. An interactive application that allows users to visualize data in pre-created reports
Provide the underlying data in raw and bulk forms through the open data portal. Provide a link to your dashboard from the open data portal and to the published open data from your dashboard. This enables discovery of your resources.
Report. A document providing both data and context often published as a PDF and to satisfy an administrative or legislated requirement
Publish the data behind the report on the open data portal. If the report is based on administrative data that is collected more regularly than the reporting period, publish the underlying data on a more frequent and automated basis. Provide a link in your report to the published data, and link to reports from your published data to enable discovery of your resources.
Consistent formatting of valid addresses is important for accurately mapping and referencing geographic information
A poorly formed address could end up mapping to the wrong geographic reference or not at all, reducing the usefulness of the data
Poorly formed addresses can make cleanup of data labor intensive and result in reporting errors where geography (neighborhoods, census, etc.) is concerned
Poorly formed addresses could also result in additional costs because of things like:
Undeliverable/returned mail
Failure to apply benefits to recipients appropriately based on geography
Poor routing of vehicles or people in the field
Addresses should be output with the level of detail relevant to the data
e.g. permits applied down to the sub-address level
If providing addresses in a complete string, make sure the addresses are well formed and consistent for easy parsing, for example:
741 Ellis Street, Unit 5, San Francisco, CA 94109
901 Bayshore Boulevard, Unit 209, San Francisco, CA 94124
When providing multiple addresses within a dataset, prepend your column names with the type of address
e.g. address vs. mailing_address
Below are some common elements of an address (but not all)
Not all addresses will have all elements
Address granularity will be driven by the business need, so not all systems will collect every element
Note: systems can be designed to validate or lookup addresses on entry, minimizing error
Make sure the individual elements of an address line up with the guidance below
You can publish addresses as either single strings or break into separate fields
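If you store address elements separately but publish a single string, compose it consistently so it parses easily. Below is a sketch with hypothetical element names, producing strings in the well-formed shape shown above.

```python
def format_address(number, street, unit, city, state, zip_code):
    """Compose a well-formed, consistent single-string address (a sketch)."""
    parts = [f"{number} {street}"]
    if unit:                      # not all addresses have every element
        parts.append(f"Unit {unit}")
    parts += [city, f"{state} {zip_code}"]
    return ", ".join(parts)

assert format_address("741", "Ellis Street", "5",
                      "San Francisco", "CA", "94109") \
    == "741 Ellis Street, Unit 5, San Francisco, CA 94109"
```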
Based on ISO8601, an international standard for representing date and time. We chose the "extended format" with the hyphens because it is more human readable.
Compare 2016-01-01 to 20160101
All date and time variables must be local time (UTC -8hrs Pacific Standard Time, UTC -7hrs Pacific Daylight Time) unless specified.
Use the data dictionary to specify any important information about time encoding
Interval | Column name | Format | Range of values | Example |
---|---|---|---|---|
For fiscal periods, prefix "fiscal_" to column name
Interval | Column name | Format | Example |
---|---|---|---|
Fiscal year start date must be indicated in the data dictionary
e.g. The fiscal year starts on July 1 and ends on June 30 for the State of California
ISO 8601 uses the 24-hour clock in hh:mm:ss format, sometimes referred to as military time (do not use AM or PM)
e.g. 13:00 is equivalent to 1:00 PM
Specify the timezone if it is not local time (UTC -8hrs Pacific Standard Time, UTC -7hrs Pacific Daylight Time):
In certain cases you may want to provide a single variable representing the number or name of an individual date component, a day, a month, etc. There's no requirement to provide these, but follow this guidance:
Durations can be automatically calculated if you provide a separate start and end period in your dataset. If you also want to provide a duration, please:
Provide the milliseconds between the start and end period (include the duration unit in the data dictionary)
Milliseconds can be rolled up to other time intervals
Use duration in your column name but prepend with a useful descriptor, e.g:
flight_duration
response_duration
dwell_time_duration
travel_duration
Do not duplicate any of the duration column names per the guidance on columns
Note: ISO 8601 does have separate guidance on duration formatting, but we find it more cumbersome than calculating the milliseconds between a start and end period, for which many standard programming libraries exist.
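In Python, for example, the standard library produces both the extended-format timestamps and the millisecond durations described above:

```python
from datetime import datetime

start = datetime(2015, 1, 1, 13, 0)
end = datetime(2015, 1, 1, 14, 30)

# ISO 8601 extended format (with hyphens) is the preferred representation.
assert start.isoformat(timespec="minutes") == "2015-01-01T13:00"
assert start.strftime("%Y-%m-%d") == "2015-01-01"

# Durations: provide the milliseconds between the start and end period.
flight_duration = int((end - start).total_seconds() * 1000)
assert flight_duration == 5_400_000  # 90 minutes
```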
Referenced from:
Title. A clear name for the dataset that does not include dates and limits the use of California or CA. See .
Description. A plain description that will display below the title on the open data portal. See . The description field can accept markdown formatting for creating things like bullets and headers using text ()
Tags. Descriptive keywords or phrases that users will search for to find the data. These can be used for providing common synonyms, legal references, and other shorthand users may use to find your data. You don't need to repeat terms that are in your title or description, and you should avoid generic terms that could apply to almost any dataset (opendata, open, transparency, etc.). You must provide at least one tag and separate each tag with a comma.
Publisher. This is the organization you’re publishing on behalf of (your agency, department, board, or commission).
Topic. One of the following:
COVID-19
Economy and Demographics
Government
Health and Human Services
Natural Resources
Transportation
Water
Note: if you are publishing on an agency or department portal, these will be different. In this case, the topics are automatically mapped from agency and department portals to the statewide portal.
Frequency. How often you intend to update the data resources. One of the following:
Irregular
Continuously updated
Hourly
Daily
Twice a week
Semiweekly
Weekly
Biweekly
Semimonthly
Monthly
Every two months
Quarterly
Semiannual
Annual
Biennial
Decennial
Program Contact Name. The specific group inside the agency, department, board, or commission that produces the data and can best answer questions about it.
Program Contact Email. The generic email address for the program referenced above. (e.g. )
Public Access Level. For data to be shared with the public, always Public. Other options on the portal not currently applicable.
Rights. Always enter “No restrictions on public use.”
License. Default to Public Domain unless there is a valid business reason to select a different open license.
Author. The agency, group, department, board, or commission that authors the data resource and has ultimate responsibility for the creation of the data. If this is the same as the publisher, no need to enter. Use this field if your organization is publishing on behalf of a different author (research institution, other local or federal organization, etc.) or if you’d like to indicate a division or program as the author. If the author is actually another State entity, they should publish the data. Rare exceptions will be considered.
Spatial/Geographic Coverage. The geographical area the data table covers (e.g. statewide versus a sub-state region like the Bay Area). Specification should include a named area that also names California (San Francisco Bay Area 9 County Region, California) and may include geographic coordinates. In general, give enough description so people can determine its location if commingled with other non-California datasets.
Temporal Coverage. Start date and End date for the data in your data resource. Entered as a range using ISO8601 formatted date strings (e.g. 2017-01-01 to 2020-12-31)
Homepage URL. URL for the page on your website that has useful information about the data resource or the group that updates it. It's a webpage that gives context about the data and cross-links to the open data.
Language. The language of the published data and metadata.
Granularity. Specify the smallest unit of analysis represented within the dataset. This can apply to both geography (address, parcel, census block, etc.) or time (year, month, day, hour, etc.)
Additional Information. Enter any additional notes or information you’d like to highlight. Note, if you find yourself putting lots of information here, consider putting it in the dataset description.
Related Content. Enter secondary source(s) info: If your data resource is partially made from other data sources, please provide descriptive name(s) and/or URLs of resource(s) from which the data table is derived.
Each data product above falls short on one or more of the open data criteria: released in the Public Domain, accessible and discoverable, published with timely updates, machine readable, and in an open format. For example, a report published only as a PDF is not in an open format.
Note: this guidance is provided to promote consistency across the bulk of shared tabular datasets and not as a comprehensive guide to address standards. For a comprehensive standard on addressing, see the
Data Standard. This is used to identify a standardized specification the dataset conforms to, if any. Provide a URI directly to the website that describes the standard. You can find a reference list online at
| Interval | Column name | Format | Range of values | Example |
|---|---|---|---|---|
| Annual | year | YYYY | YYYY: any valid year | 2022 |
| Monthly | month | YYYY-MM | MM: 01 to 12 | 2022-01 |
| Daily | date | YYYY-MM-DD | DD: 01 to 31 | 2022-01-01 |
| Weekly | week | YYYY-[W]WW | [W]WW: W01 to W52 | 2022-W01 |
| Quarterly | quarter | YYYY-[Q]Q | [Q]Q: Q1 to Q4 | 2022-Q1 |
| Half-yearly | half_year | YYYY-[H]H | [H]H: H1 or H2 | 2022-H1 |
| Interval | Column name | Format | Example |
|---|---|---|---|
| Fiscal, annual | fiscal_year | YYYY | 2015 |
| Fiscal, monthly | fiscal_month | YYYY-MM | 2015-01 |
| Fiscal, quarterly | fiscal_quarter | YYYY-[Q]Q | 2015-Q1 |
| Fiscal, half-yearly | fiscal_half_year | YYYY-[H]H | 2015-H1 |
| Type | Column name | Format | Example |
|---|---|---|---|
| Date + time | date_time | YYYY-MM-DD[T]hh:mm or YYYY-MM-DD[T]hh:mm:ss | 2015-01-01T13:00 or 2015-01-01T13:00:00 |
| Time only | time | hh:mm or hh:mm:ss | 13:00 or 13:00:00 |
| Type | Column name | Format | Example |
|---|---|---|---|
| Date + time | date_time | YYYY-MM-DD[T]hh:mm+hh:mm or YYYY-MM-DD[T]hh:mm:ss+hh:mm | 2015-01-01T12:00+00:00 or 2015-01-01T12:00:00+00:00 |
| Extract | Column name | Type | Range of values |
|---|---|---|---|
| Year | year_num | integer | any valid year |
| Month | month_num | integer | 1 to 12 |
| Month Name | month_name | string | January, February, March, April, May, June, July, August, September, October, November, December |
| Week of Year | woy_num | integer | 1 to 52 |
| Day | day_num | integer | 1 to 31 (varies by month) |
| Day of Week | dow_num | integer | 1 to 7 |
| Day of Week Name | dow_name | string | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday |
| Hour | hour_num | integer | 1 to 24 |
| Minute | minute_num | integer | 1 to 60 |
| Second | second_num | integer | 1 to 60 |
From Address Number | Numeric | First part of a range: 1000-1100 Main Street, San Francisco, CA 94102 |
To Address Number | Numeric | Second part of a range: 1000-1500 Main Street, San Francisco, CA 94102 |
Address Number Prefix | Numeric | The portion of the Complete Address Number that precedes the Address Number itself: B315 Main Street, San Francisco, CA 94102 |
Address Number | Numeric | The numeric identifier for a land parcel, house, building, or other location along a thoroughfare or within a community: 315A Main Street, San Francisco, CA 94102 |
Address Number Suffix | Text | The portion of the Complete Address Number that follows the Address Number itself: 315 A Main Street, San Francisco, CA 94102 |
Street Name Pre Modifier | Text | A word or phrase in a Complete Street Name that 1. Precedes and modifies the Street Name, but is separated from it by a Street Name Pre Type or a Street Name Pre Directional or both, or 2. Is placed outside the Street Name so that the Street Name can be used in creating a sorted (alphabetical or alphanumeric) list of street names.: 315A Old Main Street, San Francisco, CA 94102 |
Street Name Predirectional | Text | A word preceding the street name that indicates the directional taken by the thoroughfare from an arbitrary starting point, or the sector where it is located: 315A East Main Street, San Francisco, CA 94102 |
Street Name Pretype | Text | A word or phrase that precedes the Street Name and identifies a type of thoroughfare in a Complete Street Name: US Route 101, San Francisco, CA |
Street Name | Text | The portion of the Complete Street Name that identifies the particular thoroughfare (as opposed to the Street Name Pre Modifier, Street Name Post Modifier, Street Name Pre Directional, Street Name Post Directional, Street Name Pre Type, Street Name Post Type, and Separator Element (if any) in the Complete Street Name.): 315A Main Street, San Francisco, CA 94102 |
Street Name Posttype | Text | A word or phrase that follows the Street Name and identifies a type of thoroughfare in a Complete Street Name: 315A Main Street, San Francisco, CA 94102 |
Street Name Postdirectional | Text | A word following the street name that indicates the directional taken by the thoroughfare from an arbitrary starting point, or the sector where it is located: 315A Main Street East, San Francisco, CA 94102 |
Street Name Post Modifier | Text | A word or phrase in a Complete Street Name that follows and modifies the Street Name, but is separated from it by a Street Name Post Type or a Street Name Post Directional or both: 315A Main Street Extended, San Francisco, CA 94102 |
Occupancy Type | Text | The type of occupancy to which the associated Occupancy Identifier applies. (Building, Wing, Floor, Apartment, etc. are types to which the Identifier refers.): 315A Main Street, Apt 2, San Francisco, CA 94102 |
Occupancy Identifier | Text | The letters, numbers, words, or combination thereof used to distinguish different subaddresses of the same type when several occur within the same feature: 315A Main Street, Apt 2, San Francisco, CA 94102 |
City | Text | The city the address sits within: 315A Main Street, San Francisco, CA 94102 |
State Name | Text | The names of the US states and state equivalents: the fifty US states, the District of Columbia, and all U.S. territories and outlying possessions. A state (or equivalent) is "a primary governmental division of the United States." The names may be spelled out in full or represented by their two-letter USPS or ANSI abbreviation: 315A Main Street, San Francisco, CA 94102 |
ZIP code | Numeric | A system of 5-digit codes that identifies the individual Post Office or metropolitan area delivery station associated with an address: 315A Main Street, San Francisco, CA 94102 |
ZIP+4 | Numeric | A 4-digit extension of the 5-digit Zip Code (preceded by a hyphen) that, in conjunction with the Zip Code, identifies a specific range of USPS delivery addresses: 315A Main Street, San Francisco, CA 94102-1212 |
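The recommended reporting-period formats in the tables above can be produced with Python's standard library. A minimal sketch (the variable names mirror the suggested column names; the quarter and half-year strings are simple arithmetic, not a library feature):

```python
from datetime import datetime

d = datetime(2022, 1, 5, 13, 0)  # a Wednesday in ISO week 1 of 2022

year = d.strftime("%Y")                  # annual: 2022
month = d.strftime("%Y-%m")              # monthly: 2022-01
date_str = d.strftime("%Y-%m-%d")        # daily: 2022-01-05
iso_year, iso_week, _ = d.isocalendar()  # ISO week year can differ from calendar year
week = f"{iso_year}-W{iso_week:02d}"     # weekly: 2022-W01
quarter = f"{d.year}-Q{(d.month - 1) // 3 + 1}"      # quarterly: 2022-Q1
half_year = f"{d.year}-H{1 if d.month <= 6 else 2}"  # half-yearly: 2022-H1
timestamp = d.strftime("%Y-%m-%dT%H:%M:%S")          # date + time: 2022-01-05T13:00:00
```

Note the ISO week-year subtlety: January 1, 2022 falls in 2021-W52, which is why the sketch derives the year for the weekly column from `isocalendar()` rather than from the calendar year.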
We have had a number of people help with the creation and feedback on this handbook. This wouldn't be possible without their help.
First, many thanks to the student team that kickstarted user research through Stanford's CS184 (Bridging Policy and Tech Through Design) class: Emily Bunnapradist, Jenn Hu, and Sejal Jhawer. Thank you for bringing fresh eyes and design thinking to the open data publisher's journey and this resulting handbook.
And thank you to all those that provided their subject matter expertise and feedback as we developed the handbook (with apologies if we missed anyone): Benjamin Brezing, Colin Stevens, David Altare, David Harris, Jarma Bennett, Karen Henderson, Kate Spiess, Mahesh Gautam, Michael Tagupa, Ping Zhong, Rafael Maestu, Rodney Garcia, Sam Hayashi, Scott Fujimoto, MD, MPH, Tuba Demir Dagdas, Will Wheeler, and Yanyi Djamba.
For each variable, a Data Dictionary lists:
Field Name. The name of the field as it's written in the source data table. It's okay for these to be short, and you often won't have complete control over them. The field label is where you can write something more descriptive that will be a reference for users.
Field Label. The common English title for the data contained in this column. Avoid using abbreviations here.
Data Type. Can be one of the following:
Note: these are the data types supported by data.ca.gov, which is a CKAN portal. You choose a type for each field when initially uploading your dataset, and choosing the right type makes the dataset easier for data users to work with.
text. An arbitrary series of alphanumeric characters
json. Nested json data e.g. {"foo": 42, "bar": [1, 2, 3]}.
date. Date without time stored in an ISO8601:extended format e.g. 2015-05-25
time. Time without a date in 24 hour format e.g. 15:00:05
timestamp. Date and time stored in an ISO8601:extended format e.g. 2015-05-25T15:00:05
int. An integer number (no decimals)
Only use it if this field is meant to be used in a calculation. Otherwise use “text”.
float. A floating point number (with decimals)
Only use if this field is meant to be used in a calculation. Otherwise use “text”.
bool. A true/false (boolean) value; valid formats: true/false, 1/0, on/off
Field Definition. Full description of what information is included for the field. See best practices for writing definitions.
Valid Values. (if applicable) Indicate what the expected set of valid values is for the field. This could be a list of controlled values, a range (for numbers and dates), or a minimum or maximum value (for numbers and dates).
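A data dictionary also enables lightweight quality checks before you upload. A sketch of that idea (the field names, types, and valid ranges below are invented for illustration, not part of the handbook's template):

```python
import csv
import io

# Hypothetical data dictionary: field name -> (data type, valid values).
# "valid values" can be None (anything goes) or a range/list, as described above.
DICTIONARY = {
    "county": ("text", None),
    "year": ("int", range(1850, 2101)),
    "rate": ("float", None),
}

def check_row(row):
    """Return a list of problems found in one CSV row."""
    problems = []
    for field, (ftype, valid) in DICTIONARY.items():
        value = row.get(field, "")
        try:
            if ftype == "int":
                value = int(value)
            elif ftype == "float":
                value = float(value)
        except ValueError:
            problems.append(f"{field}: expected {ftype}, got {value!r}")
            continue
        if valid is not None and value not in valid:
            problems.append(f"{field}: out of range: {value!r}")
    return problems

sample = io.StringIO("county,year,rate\nAlameda,2022,1.5\nKern,20x2,2.0\n")
rows = list(csv.DictReader(sample))
print(check_row(rows[0]))  # []
print(check_row(rows[1]))  # ["year: expected int, got '20x2'"]
```

The same dictionary that documents the data for users can drive checks like this, which is one more reason to fill it out carefully.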
Below are the detailed steps broken up into the following sections:
Experience errors during the upload process?
Reach out to the open data team at the Department of Technology.
All reviewers must have accounts on data.ca.gov to review the private dataset.
If you haven't already, confirm that all reviewers have publishing access within your organization on the portal. If you need to request access, contact the open data team.
Copy the link to your private dataset, send it to the reviewers, and work with your Data Coordinator on final publishing approval.
Reviewers must log in with their accounts to see the private dataset.
Template:
Hi -insert recipient's name-,
I'm sending this email to give you a heads up that I will be working on publishing open data on the through -insert timeframe here-.
Based on your experience, I would like to invite you to join the project as a -insert data publishing role identified here-. As a -insert data publishing role identified here-, you will be responsible for -insert role description-.
You can read more about the role and the open data publishing process in the California Open Data Publisher's Handbook, linked .
If you currently lack the bandwidth to join the project, it would be extremely helpful if you could refer me to somebody else who you think will be a good fit for the role.
Thanks!
Best,
-insert your name-
Example:
Hi John,
I'm sending this email to give you a heads up that I will be working on publishing open data on the through June 2022.
Based on your experience, I would like to invite you to join the project as a Data Coordinator. As a Data Coordinator, you will be responsible for conveying to the appropriate parties any specific needs of the open data portal and program.
If you currently lack the bandwidth to join the project, it would be extremely helpful if you could refer me to somebody else who you think will be a good fit for the role.
Feel free to let me know if you have any further questions or concerns.
Thanks!
Best,
Jenn
Template:
Hi -insert recipient's name-,
Do you know if automated publishing is possible, and if so, what are the options up for consideration?
If these questions lie outside your knowledge, I would appreciate it if you could refer me to someone who you think would be able to assist me on this.
Feel free to let me know if you have any questions or concerns.
Thanks!
Best,
-insert your name-
Example:
Hi John,
Do you know if automated publishing is possible, and if so, what are the options up for consideration?
If these questions lie outside your knowledge, I would appreciate it if you could refer me to someone who you think would be able to assist me on this.
Feel free to let me know if you have any questions or concerns.
Thanks!
Best,
Jenn
Below we document significant changes to the handbook. We won't log minor fixes like typos or grammar. If you're interested, you can see the full change history (individual changes are called commits in git).
Released May 31, 2022
Changes based on feedback including:
Fixed missing links
Clarification of some terms
Fixing of typos
Released April 26, 2022
Initial release of the handbook covering:
6 overarching steps for data publishing
More detailed guidance linked from those steps as references
The Data Coordinator acts as a liaison between internal Information Technology staff, organizational programs and leadership, and portal managers.
They are best positioned to convey to the appropriate parties any specific needs of the open data portal and program. They are trusted partners in open data within their organization.
The Data Custodian is the person most knowledgeable about how the data is stored and protected, and has the technical knowledge to query and extract the data.
They advise and help with data access and navigate technical options for automation.
The Data Steward is the person most knowledgeable about the data including the sources, collection methods, and limitations.
They prepare data for publishing on the portal, work with Data Custodians on any system access needs, and work with the Data Coordinator on publishing approval.
Within data.ca.gov, a dataset (or data set) is a collection of data and resources.
ELT is the process of extracting data from one or multiple sources and loading it into a data warehouse. Instead of transforming the data before it is written, ELT takes advantage of the system where the data is to be stored to perform the data transformation. This is another approach to automating data updates to the open data portal. In this case the final transformed dataset in the warehouse is synced to the open data portal.
ETL is a type of data integration that consists of three steps (extract, transform, load) used to blend data from multiple sources. During this process, data is taken (extracted) from a source, converted (transformed) into a format that can be analyzed, and stored (loaded) into a data warehouse or other system. This is one common approach to automating data updates to the open data portal.
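As a concrete illustration, a toy ETL run can be sketched in a few lines of Python. The source data, column names, and transformation below are invented for the example; real pipelines would read from an actual source system and hand the result to the portal:

```python
import csv
import io

# Extract: read raw rows (a string here stands in for a source system export)
raw = io.StringIO("County,Visits\nalameda,10\nkern,\n")
rows = list(csv.DictReader(raw))

# Transform: normalize county names, drop rows with missing counts
clean = [
    {"county": r["County"].title(), "visits": int(r["Visits"])}
    for r in rows
    if r["Visits"]
]

# Load: write the publishable flat file (CSV) destined for the portal
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["county", "visits"])
writer.writeheader()
writer.writerows(clean)
```

In an ELT pipeline, by contrast, the raw rows would be loaded into the warehouse first and the transform step would run there.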
A flat file is an informal term for a single table of data from which all word-processing or other structural markup has been removed. A flat file stores data in plain text format. Because of their simple structure, flat files can only be read, stored, and sent. Comma-separated values (CSV) files are among the most common flat files: text files where fields are separated by commas and each row is a new line.
Harvesting is a process where the data portal automatically imports (“harvests”) datasets from multiple CKAN websites and other non-CKAN sources into a single CKAN website. This automated process is what enables the statewide portal to contain data from other agency and department portals.
The harvests are set up and monitored by system administrators of the portals. It is not something a publisher needs to worry about when publishing.
Personally identifiable information is any data that can be used to identify a specific individual. Examples include a full name with Social Security number, mailing or email address, or phone number.
Within data.ca.gov, resources are the actual files, APIs or links that are being shared through the portal. Resource types include csv, html, xls, json, xlsx, doc, docx, rdf, txt, jpg, png, gif, tiff, pdf, odf, ods, odt, tsv, geojson and xml files. If the resource is an API, it can be used as a live source of information for building a site or application.
You can read more about the role and the open data publishing process in the California Open Data Publisher's Handbook, linked .
I am currently working on publishing open data on the . After publishing the data, I hope to update it -insert frequency here-.
I am currently working on publishing open data on the . After publishing the data, I hope to update it on a monthly basis.
ELT is an alternative to the ETL process.
ETL is an alternative to the ELT process.
To harvest a source catalog, there must be a public interface to a data file that represents the catalog in a .
Information or data that is in a format that can be easily processed by a computer without human intervention. To be machine readable, data must be structured in an organized way. CSV, JSON, and XML, among others, are formats that contain structured data that a computer can automatically read and process.
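For example, the same record can be serialized in two machine-readable formats with Python's standard library; either output can be parsed back by a program with no human cleanup:

```python
import csv
import io
import json

record = {"county": "Alameda", "year": 2022, "rate": 1.5}

# CSV: a header row plus one data row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# JSON: the same record as a structured object
encoded = json.dumps(record)
```

A PDF of the same table, by contrast, would require a human (or fragile scraping) to recover the values, which is why it is not considered machine readable.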
Personal health information, also referred to as protected health information, is any information about health status, provision of health care, or insurance, together with other data that a healthcare professional collects to identify an individual and determine appropriate care. Under the Health Insurance Portability and Accountability Act (HIPAA), data is considered PHI if it includes one or more of 18 specified identifiers. If these identifiers are removed, the information is considered de-identified protected health information, which is not subject to HIPAA's restrictions.
Click on the My Datasets tab
Click the button labeled Add Dataset
Enter metadata by copying from the Metadata Template to the relevant fields. Step 3 in this handbook covers the creation of metadata. Fields are ordered in the template the same as they are in the interface.
Ensure the field License is entered as Other (Public Domain)
Ensure the Visibility is set to Private. This is the default.
Click the button labeled Next: Add Data
After clicking Next: Add Data in the previous step, you will see an interface to add files by uploading or linking
Click the button labeled Upload
Select your data file and click Open. Data files must be in an open format like CSV.
Add a Title and Description. See guidance on writing titles and descriptions.
Do not enter anything in Format. This will be detected by the system.
If you have another data file to upload, click the button labeled Save & add another. Repeat the steps starting at the top of this section.
If you want to add more non-data resources like documentation, click the button labeled Save & add another and skip to the next section where you'll continue adding non-data files.
If you are done adding data files and have no other files to add in the next section, click the button labeled Finish
If you do not have additional non-data resources to add, you can skip this section
Click the button labeled Upload
Select your non-data file and click Open. If you are providing additional reference documentation, PDF is the best format to provide this in.
Add a Title and Description
Do not enter anything in Format. This will be detected by the system.
If you have another non-data file to upload, click the button labeled Save & add another. Repeat the steps starting at the top of this section.
If you are done adding files, click the button labeled Finish.
After completing your data and non-data resource uploads, you will be taken to a private view of your dataset. You will see the dataset denoted as Private.
Click the button labeled Manage in the upper right
Click the Resources tab
Click on a data resource (e.g. CSV) to which you want to add a data dictionary
Click on the Data Dictionary tab
For each data field, copy information over from the Metadata Template workbook in the Data Dictionary Template sheet:
Copy Field Label over to Label
Copy Field Definition over to Description
Click the button labeled Save at the bottom
If you have multiple data files, repeat for each data file by clicking on the button labeled All resources at the top. Then select the next file to which you'd like to add data definitions.
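If you have many fields, entering the data dictionary by hand can be tedious. CKAN (the software behind data.ca.gov) also exposes the data dictionary through its DataStore API via `datastore_create` with per-field `info`; the sketch below only builds the request payload, and the resource ID shown is hypothetical. Confirm with the open data team that API updates are appropriate for your dataset before using this approach:

```python
import json

def data_dictionary_payload(resource_id, fields):
    """Build a datastore_create payload that sets each field's label and
    description. force=True is generally required for resources that are
    otherwise managed through the portal UI."""
    return {
        "resource_id": resource_id,
        "force": True,
        "fields": [
            {
                "id": f["name"],
                "type": f.get("type", "text"),
                "info": {"label": f["label"], "notes": f["definition"]},
            }
            for f in fields
        ],
    }

payload = data_dictionary_payload(
    "11111111-2222-3333-4444-555555555555",  # hypothetical resource ID
    [{"name": "county", "label": "County", "definition": "County of residence"}],
)
print(json.dumps(payload, indent=2))

# Applying it would require an API token, e.g. (not run here):
# requests.post("https://data.ca.gov/api/3/action/datastore_create",
#               json=payload, headers={"Authorization": API_TOKEN})
```

The `label` and `notes` values correspond to the Label and Description boxes in the Data Dictionary tab described above.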
From the page listing all of your resources, click View Dataset in the upper right
Review your dataset description for human readability and grammar
Check that your license is specified as Other (Public Domain) at the bottom of the left-most content
Check the accuracy of the other metadata in the Additional Info table at the bottom
If you catch any errors or omissions, click Manage in the upper right, which will take you back to the form entry for metadata
Make changes in the metadata form and click Update Dataset at the bottom
From your private dataset page, scroll down to the section labeled Data and Resources
Click on each resource, which will take you to a preview
If you find any errors or omissions or need to re-upload your resource, click Manage in the upper right
Go back to the private dataset page and continue to check each resource until done
After receiving publishing approval, log in to the open data portal
Click on the My Datasets tab
Click on the dataset in your list you want to make public
Click Manage in the upper right
Set Visibility to Public
Click Update Dataset at the bottom of the page
Your dataset is now public
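The visibility change above can also be scripted against CKAN's Action API using `package_patch`, which is useful if publishing is part of an automated pipeline. A sketch under stated assumptions: the dataset name and API token are placeholders, only the payload is built and checked here, and you should confirm with the open data team before automating publication:

```python
import json

def publish_payload(dataset_name):
    """Build a package_patch body that makes a private dataset public."""
    return {"id": dataset_name, "private": False}

payload = publish_payload("my-example-dataset")  # hypothetical dataset name
print(json.dumps(payload))

# Applying it would require an API token with publishing rights, e.g. (not run here):
# requests.post("https://data.ca.gov/api/3/action/package_patch",
#               json=payload, headers={"Authorization": API_TOKEN})
```

Setting `private` back to `True` with the same call reverses the change, mirroring the Visibility dropdown in the UI.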