It takes a team to deliver high-quality open data. There are three roles that will help you ensure publishing moves forward. They are pictured below and described in the checklist that follows. In some cases a single person may fill multiple roles, but you shouldn't be on your own entirely.
Others may be needed as you move through the process
These three roles are core, but you may need to bring in others as you move through publishing and approval. Legal and communications staff, for example, may be needed in your process. You can rely on your data coordinator to advise on who else to involve.
Start identifying the fields you want to publish. This will help you and others clarify what you are planning to publish. Keep your documentation template (linked below) somewhere safe as you'll return to it again as you move toward publishing data.
Hold on to your data documentation template, you'll need it later
By the time you publish, you should have created meaningful definitions; this is covered in Step 3, Create Metadata and Data Dictionary. You'll use this same template in that section, and you will eventually copy certain elements over when you upload your dataset on the portal in Section 4, Upload the Dataset.
If your dataset needs to be published on a regular schedule, it's good to start thinking of what to do about that now. Even if you don't need automation, thinking about how to help others publish the data early will save you headaches later.
When data is published more frequently than quarterly, we highly recommend automation. Each organization will vary in its approach, but you can work with your IT department to investigate whether automation is possible.
If you're updating your dataset quarterly or more frequently:
Pro tip
Consider alternate publishing strategies like initially publishing manually and then following up with automation when resources are ready.
If automation is not possible, or this is a dataset that gets updated infrequently (like once a year), you may update data manually. You'll still want to make sure the approach to publishing is well documented so you can easily cross-train others on updates. Start with the following:
This section provides instructions on how to upload a dataset. You can refer to the page on detailed steps for uploading data to the portal for deeper guidance.
The following only applies to publishing directly on .
If you publish to another portal, please follow the directions provided by your State organization.
All reviewers must have accounts on data.ca.gov to review the private dataset
If you haven't checked, make sure all reviewers have access to publishing within your organization on the portal. If you need to request access, .
Check out these complementary videos on uploading data to the portal
Reminder: select Other (Public Domain) for your data license
The portal gives you an option for license. At this time, we cannot set a default license in the system, so enter this as Other (Public Domain).
Don't make your dataset public until you've received approval
Share your private dataset link with appropriate reviewers identified with your for feedback and approval
Publishing requires approval. Work with your to make sure you're following your organization's process.
This section provides high-level things to check when preparing your data. You will likely need to do additional quality checks that are specific to your data. Treat this section as the minimum set of checks.
See below to help us build out more detailed guides and references for data preparation and data quality.
If merging tables from multiple sources (like counties or regions)
Check all expected fields are accounted for across data sources
Check the same number of rows exist in your merged dataset as there are in your individual tables
Check that data types are consistent within fields in your dataset. For example, if a field is supposed to contain integers, confirm that it only contains integers
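If you script your merge, these checks can be automated. Below is a minimal sketch using only the Python standard library; the county extracts and field names are hypothetical.

```python
import csv
import io

def merge_and_check(csv_texts, expected_fields, int_fields=()):
    """Merge tabular extracts and run the minimum pre-publishing checks."""
    sources = [list(csv.DictReader(io.StringIO(t))) for t in csv_texts]
    # Check all expected fields are accounted for across data sources.
    for rows in sources:
        assert set(rows[0].keys()) == set(expected_fields), "field mismatch"
    merged = [row for rows in sources for row in rows]
    # Check the merged row count equals the sum of the individual tables.
    assert len(merged) == sum(len(rows) for rows in sources)
    # Check data types are consistent within fields (integer fields here).
    for field in int_fields:
        assert all(r[field].lstrip("-").isdigit() for r in merged), field
    return merged

# Hypothetical county extracts; in practice these would be files on disk.
county_a = "permit_id,duration\n1,30\n2,45\n"
county_b = "permit_id,duration\n3,15\n"
merged = merge_and_check([county_a, county_b],
                         ["permit_id", "duration"], ["duration"])
```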
Save your tabular data file as a delimited file such as a comma-separated values (CSV) file
Resources on using and exporting delimited files like CSV
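For example, Python's built-in csv module writes a well-formed comma-separated file. This is an in-memory sketch with made-up rows; in practice you would write to a file with UTF-8 encoding.

```python
import csv
import io

# Write rows as CSV; csv.writer handles quoting of commas inside cells.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows([
    ["permit_id", "project_name"],      # header row
    ["1", "Main Street, Phase 2"],      # embedded comma gets quoted
])
csv_text = buf.getvalue()
# To write a real file: open("permits.csv", "w", encoding="utf-8", newline="")
```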
The Data Steward is the person most knowledgeable about the data including the sources, collection methods, and limitations.
The Data Coordinator acts as a liaison between internal Information Technology staff, organizational programs and leadership, and portal managers.
The Data Custodian is the person most knowledgeable about how the data is stored and protected and has technical knowledge of how to query and extract data.
They prepare data for publishing on the portal, work with Data Custodians for any system access needs, and work with the Data Coordinator for publishing approval.
They are best positioned to convey to the appropriate parties any specific needs of the open data portal and program. They are trusted partners in open data within their organization.
They advise and help with data access and navigate technical options for automation.
Metadata is data about data. Metadata describes the dataset’s structure, data elements, its creation, access, format, and content. A data dictionary is a type of metadata that focuses on the data elements.
Metadata is necessary to improve the discoverability of data within the open data portal and on external search engines. The more relevant information the search engine has about your data resources, the easier it will be for users to find.
Without good metadata, datasets are prone to getting lost. Below we define minimum standards and best practices for:
Resource reminder!
Use the metadata template started in Step 1 to document according to this guide.
Fill in the metadata fields relevant to your dataset - see metadata field definition reference
Make sure your dataset title is accessible and user friendly - see best practices below
Ensure your dataset description is accessible and user friendly - see best practices below
A data dictionary is the information you provide that defines the fields in your data and how the data can be used.
For each field, document the field name, field label, data type, definition and valid values if applicable - see detailed reference on these elements
Write field definitions in user friendly language - see best practices below
Are there changes flagged by your reviewers after sharing in the previous step on uploading the data?
Yes - implement the changes, and re-send for review and approval
No - receive final approval per your organization’s process and then set the visibility of the dataset to public - see this guide on changing the visibility of a dataset
Celebrate your publishing, here are some things you can do:
Work with your communications team to advertise your dataset on social media
Write a blog post incorporating an interesting analysis of the open data
Version 1.0.1 | Last Updated May 31, 2022
This handbook is for California State employees who want to publish open data on the State's Open Data Portal (https://data.ca.gov).
The guidance on uploading and publishing data (sections 4 and 5) only applies to direct publishing on https://data.ca.gov. However, the rest of the guidance establishes minimum expectations for preparing data for publishing.
You can see a list of State organizations that maintain their own open data portals. Reach out to your portal administrator or data coordinator if you have questions about publishing on those portals.
This guidance will evolve and grow with feedback. Throughout, we've called out opportunities for feedback on additional guidance indicated with a megaphone emoji (). Any feedback can be submitted to opendata@state.ca.gov.
Before diving in, it's important to understand a common definition of open data. This handbook guides you through publishing in a way that is consistent with this definition.
The Open Knowledge Foundation has developed a standard open data and content definition, summarized below.
Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.
Building on this, open data must be openly licensed, accessible, machine readable, and published in an open format.
You can read more detail on what is open data, what is not open data, and value propositions for open data in the reference section of this handbook.
Publishing a new open dataset takes some planning and coordination, but it doesn’t have to be difficult. This handbook is designed to provide a reference guide you can return to as you go.
Get started by:
Skimming the handbook to familiarize yourself with its content
Starting immediately with the pre-publishing checklist
Bookmarking and coming back as you move through getting your data ready for publishing
The handbook is divided into sections that line up with a general publishing process. The diagram below shows those steps, and they are listed below with links to each section.
Publishing process steps
Review the pre-publishing checklist. Summarizes things to get started on and be aware of early in the process to minimize surprises later on.
Prepare data for publishing. Guidance on preparing your dataset for publishing including identifying and implementing any necessary cleaning and merging of data as needed.
Create metadata and data dictionary. Guidance on minimum metadata and documentation needed to make the dataset useful to others.
Upload the dataset. Guidance on uploading the dataset to the open data portal for those publishing directly to data.ca.gov.
Get final publishing approval. Guidance on getting final approval to make the dataset publicly available.
Update and maintain the dataset. Guidance on ways to make sure the dataset is updated and maintained appropriately.
It is important to maintain data updates according to the target frequency indicated in the metadata.
While automated updates are ideal, especially for more frequently updated data, we know that sometimes that's not an option. Below are considerations for making sure you can maintain data updates in line with your target frequency.
Automation will most likely be set up by a technical team in your organization. The following are practices to make sure your automation is resilient.
Dataset update steps are captured and annotated through an automated process (e.g. scripts or ETL/ELT platform)
Standardized and user-friendly alerts for automation success and failure are sent to appropriate points of contact including technical and program contacts
Sufficient logs are kept to ease troubleshooting issues and identifying root causes of ongoing problems
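As one illustration of the logging practices above, an update script can wrap each step so failures are logged with full tracebacks before alerting contacts. This is a sketch; the step names are placeholders, and real alerts would come from whatever your ETL platform provides.

```python
import logging

logger = logging.getLogger("dataset_update")

def run_step(step_name, step_fn):
    """Run one update step, keeping logs that ease troubleshooting."""
    try:
        result = step_fn()
        logger.info("step %s succeeded", step_name)
        return result
    except Exception:
        # Full traceback goes to the log; hook failure alerts here.
        logger.exception("step %s failed; notify technical and program "
                         "contacts", step_name)
        raise

row_count = run_step("extract", lambda: 1250)  # placeholder extract step
```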
Place update procedure documents () in a common document repository where other staff can access as needed
Identify one or more staff who could manage the dataset and help address issues in your absence or upon leaving your role
Cross-train staff on updates to ensure continuity
Data Steward
Data Coordinator
Data Custodian
Additional Resources. Refer to Data Dictionary: What to Include for further guidance on what to include in the data dictionary.
If you are in one of the following State organizations, you will publish open data directly through their open data portal, otherwise you can publish to the Statewide Open Data Portal.
Executive Agency Portals
Individual State Entity Portals
Is there a CA open data portal missing from this list?
Please reach out to your data coordinator or portal administrator for any specific guidance. You should still prepare data in line with this handbook, even though you'll publish to a different portal.
When publishing to your organization’s portal, in many cases, those datasets will be "harvested" automatically so they are discoverable through the California Open Data Portal (you don’t need to know exactly how harvesting works to publish open data).
Only use alphanumeric or these 3 special characters: period (.), dash (-), and underscore (_)
Ampersand (&) should be replaced by “and” if needed
Each header must be unique
e.g. can’t have two headers called "duration"
Units of measure should be omitted
Units can and must be provided with the data dictionary
Keep short (less than 30 characters)
A full description can and must be provided with the data dictionary
Unique identifiers should be in the left-most column if applicable
Date and time variables should be in the first column for time series data
Fixed or classified variables should be ordered with the highest-level variable on the left and most granular variable on the right, for example
Observed variables should always be in the rightmost columns. These are measured variables, often numeric; the following are some example field names that could be observed variables:
Duration
Number of Units
Number of Stories
Year Built
People Served
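These header rules are easy to enforce with a small validator. A sketch in Python follows; the case-insensitive duplicate check is an assumption, so adjust it to your own conventions.

```python
import re

HEADER_RE = re.compile(r"^[A-Za-z0-9._-]+$")  # alphanumeric plus . - _

def check_headers(headers):
    """Return a list of problems found in a set of column headers."""
    problems = []
    seen = set()
    for h in headers:
        if not HEADER_RE.match(h):
            problems.append(f"{h!r}: contains disallowed characters")
        if len(h) >= 30:
            problems.append(f"{h!r}: not less than 30 characters")
        if h.lower() in seen:  # assumes duplicates are checked case-insensitively
            problems.append(f"{h!r}: duplicate header")
        seen.add(h.lower())
    return problems
```

For example, `check_headers(["duration", "Duration"])` flags the repeat, and a header like `cost_&_fees` would be rejected until the ampersand is replaced by "and".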
Referenced from:
This section covers format and structure standards for datasets being shared with others. These standards are designed to make sure that field level information is shared as consistently as possible to minimize:
Errors
Rework
Repetitive questions
This reference covers:
This section is adapted from guidance published by DataSF, with many thanks to Singapore's Open Data Program for providing a Data Quality Guide for Tabular Data, the bulk of which made its way into this section with additions and modifications.
Want to provide feedback on future data prep and data quality guides?
The Open Knowledge Foundation (OKF) has a standard definition of open in both short and detailed form. Below is the short definition offered by OKF, which we further define specific to California standards.
Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose.
To accomplish this, open data:
Is released in the Public Domain. Data must be in the Public Domain and provided at no cost to users.
Is accessible and discoverable. Data must be published to an official State open data portal without restrictions. Any additional information necessary for attribution or citation must also accompany the data.
Is published with timely updates. Data must be published in a manner to minimize time between the creation and dissemination of the data.
Is machine readable. Data must be provided in a form readily processable by a computer and where the individual elements of the work can be easily accessed and modified.
Is in an open format. Data must be provided in an open format. An open format is one which places no restrictions, monetary or otherwise, upon its use and can be fully processed with at least one free/libre/open-source software tool. For example, the most common and usable open formats for tabular data are: CSV and JSON.
There are many things the State does to share data or reports about data. You can consider these "data products," but they are not open data by the definition above.
The table below describes several data products that are sometimes confused with open data, the reasons why they aren't open data, and ways to "upgrade" them to open data.
Data product and description | Why this is not open data | How to upgrade to open data |
---|---|---|
Open data is not just something we do for the sake of open data. There are real benefits including:
Stimulating new ideas and services. By releasing open data, State organizations may help to stimulate new and innovative ideas from Californians. There is great potential for open data to act as the fuel for new solutions and even new businesses that can address common problems or challenges facing those that live in, work in, or travel to the State of California. For example, see projects developed as part of the California Water Data Challenge.
Increasing cross-organizational data sharing. If data can be shared in the open, you can leverage the open data portal as an interface to data between departments and agencies and other external organizations. This can also save from additional costly investments in data infrastructure. Combining information from different State departments and agencies can also provide valuable insights into important areas that many organizations touch including health equity, climate change, and drought response to name just a few.
Simplifying Public Records Act (PRA) Requests. Open data releases can be an effective way of responding to requests for data made under the Public Records Act. One open data release may address multiple requests for information that can be repetitive and costly to respond to if addressed on an individual basis.
Improving data quality. Having more eyes on data helps improve the quality over time. Open data publishing allows and encourages users to provide feedback on accuracy, consistency, and other quality measures, important feedback that can help departments get better results from their own internal data uses.
Reducing unwanted web traffic. Publishing open data can also help reduce unwanted web traffic on department and Agency websites, which is often the result of “data scraping” by individuals seeking to obtain data in bulk from the State through public applications. This puts unnecessary stress on the State's technology infrastructure and unneeded burden on IT staff.
Changing how we use data. Open data can serve as a platform to change how we use, share, and consume data externally and internally, transform data into services, and foster continuous improvement in decision making and the business of government. Ultimately, open data is about enabling use of data to help support a range of positive outcomes.
UTF-8 encoding should be used
This ensures that special characters can be decoded by users
No line breaks within cells
This can break parsing in software like Excel, introducing data integrity issues
There are many ways to remove and detect line breaks, but this can vary based on how you're extracting data
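One simple way to strip embedded line breaks before export is shown below. Detection and removal vary with how you extract data, so treat this as a sketch.

```python
def clean_cell(value):
    """Replace line breaks inside a cell, which can break parsing in Excel."""
    return " ".join(str(value).splitlines())

assert clean_cell("315 Main Street\r\nUnit 5") == "315 Main Street Unit 5"

# Writing with explicit UTF-8 encoding ensures special characters decode:
# open("out.csv", "w", encoding="utf-8", newline="")
```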
Text should be presented in the easiest to interpret/read format where appropriate.
Title case
Address String
Categories when either the source system presents them this way or it is easy to interpret from the source consistently
Upper case
Acronyms - e.g - PSA (Park Service Area)
States - e.g. CA
Lower case
Categories when the source system presents them in caps and there's no way to interpret them to title case
Easier for humans to read and just as useful to machines; note the exceptions above.
Keep titles concise and informative.
Avoid using CA or California in the title if it does not meaningfully clarify the scope.
Avoid using jargon and spell out acronyms. Avoid placing dates or years in your dataset title (e.g. 2016-2021). Instead make sure your data includes relevant date information as fields. Describe any useful limitations on observed dates in your dataset description instead of title.
Create a summary paragraph that details the contents of your data table. The first few sentences are the most important.
Include purpose of dataset including the programs or polices the data supports.
Include related legislation if applicable (especially if it defines the method and/or attributes of collection).
Include data collection method and source (not the name of the database, but from what process, people, or organizations does the data come).
Include relevant acronyms, but make sure to clearly define them at least once. Highlight common questions or important notes about the dataset like limitations, missing periods of time, etc. If your description is long, consider linking to a more detailed document and summarizing the key points in your description.
Avoid using acronyms in your first few sentences without definition. Avoid naming just the database the data comes from. Instead highlight the process and methods for collecting the data.
Be precise, unambiguous, and concise.
Include relevant acronyms, but make sure to clearly define them at least once.
If the value is a date, document the time zone of the recording, e.g. PDT (Pacific Daylight Time).
If the values are calculated, the source of raw data and calculation method should be included.
Include units of measurement if applicable.
Include any known limitations of the data collected, e.g. groundwater levels were not measured in the month of January.
If the field is a category, include the list of allowable values.
Avoid writing these definitions from the perspective of an expert; write with the novice user in mind.
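Put together, a single data dictionary entry might look like the following. The field and its values are hypothetical, shown as a Python dict purely for illustration.

```python
# A hypothetical data dictionary entry following the practices above.
field_entry = {
    "field_name": "groundwater_level_ft",
    "field_label": "Groundwater Level (feet)",
    "data_type": "number",
    "definition": (
        "Depth to groundwater in feet below land surface. "
        "Groundwater levels were not measured in the month of January."
    ),
    "units": "feet",
    "valid_values": None,  # only applicable to category fields
}
```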
No commas
e.g. "1000" instead of "1,000"
No units of measurement
Units should be in metadata instead
Express as full number where possible
e.g. "1200000" instead of "1.2" (million)
If rounded, indicate in metadata
No rounding if possible
Give raw numbers as far as possible
If rounding is needed, try to provide at least 2 decimal places of precision and indicate rounding in metadata
Percentages can be expressed as either a proportion out of 1 or 100.
e.g. 20% can be expressed as 20 or 0.2
The representation of percentages must be consistent throughout your dataset (e.g. among different percentage fields)
You must indicate how percentages are expressed in the data dictionary
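A quick way to apply the numeric rules when preparing a file (a sketch; real cleanup depends on your source system):

```python
def normalize_number(raw):
    """Strip commas and whitespace; reject cells with units or other text."""
    cleaned = str(raw).replace(",", "").strip()   # "1,000" -> "1000"
    float(cleaned)  # raises ValueError if non-numeric text remains
    return cleaned

assert normalize_number("1,000") == "1000"
assert normalize_number("1200000") == "1200000"  # full number, not "1.2 million"
```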
Web application. A public-facing application that allows users to search for specific data and possibly generate reports
Develop an automated process from your backend system to extract the raw data in the application and load it to the open data portal. Once there, users can access the data as a single download or through an Application Programming Interface (API). Provide a link from your application to enable discovery of raw and bulk data, which will take burden off your application. You can also link to your application from the published open data.
Dashboard. An interactive application that allows users to visualize data in pre-created reports
Provide the underlying data in raw and bulk forms through the open data portal. Provide a link to your dashboard from the open data portal and to the published open data from your dashboard. This enables discovery of your resources.
Report. A document providing both data and context often published as a PDF and to satisfy an administrative or legislated requirement
Publish the data behind the report on the open data portal. If the report is based on administrative data that is collected more regularly than the reporting period, publish the underlying data on a more frequent and automated basis. Provide a link in your report to the published data, and link to reports from your published data to enable discovery of your resources.
Consistent formatting of valid addresses is important for accurately mapping and referencing geographic information
A poorly formed address could end up mapping to the wrong geographic reference or not at all, reducing the usefulness of the data
Poorly formed addresses can make cleanup of data labor intensive and result in reporting errors where geography (neighborhoods, census, etc.) is concerned
Poorly formed addresses could also result in additional costs because of things like:
Undeliverable/returned mail
Failure to apply benefits to recipients appropriately based on geography
Poor routing of vehicles or people in the field
Addresses should be output with the level of detail relevant to the data
e.g. permits applied down to the sub-address level
If providing addresses in a complete string, make sure the addresses are well formed and consistent for easy parsing, for example:
741 Ellis Street, Unit 5, San Francisco, CA 94109
901 Bayshore Boulevard, Unit 209, San Francisco, CA 94124
When providing multiple addresses within a dataset, prepend your column names with the type of address
e.g. address vs. mailing_address
Below are some common elements of an address (but not all)
Not all addresses will have all elements
Address granularity will be driven by the business need, so not all systems will collect every element
Note: systems can be designed to validate or lookup addresses on entry, minimizing error
Make sure the individual elements of an address line up with the guidance below
You can publish addresses as either single strings or break into separate fields
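If you store address elements separately but publish a single string, compose it consistently so it parses easily. Below is a sketch with hypothetical element names, producing strings in the well-formed shape shown above.

```python
def format_address(number, street, unit, city, state, zip_code):
    """Compose a well-formed, consistent single-string address (a sketch)."""
    parts = [f"{number} {street}"]
    if unit:                      # not all addresses have every element
        parts.append(f"Unit {unit}")
    parts += [city, f"{state} {zip_code}"]
    return ", ".join(parts)

assert format_address("741", "Ellis Street", "5",
                      "San Francisco", "CA", "94109") \
    == "741 Ellis Street, Unit 5, San Francisco, CA 94109"
```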
Based on ISO8601, an international standard for representing date and time. We chose the "extended format" with the hyphens because it is more human readable.
Compare 2016-01-01 to 20160101
All date and time variables must be local time (UTC -8hrs Pacific Standard Time, UTC -7hrs Pacific Daylight Time) unless specified.
Use the data dictionary to specify any important information about time encoding
Interval | Column name | Format | Range of values | Example |
---|---|---|---|---|
For fiscal periods, prefix "fiscal_" to column name
Interval | Column name | Format | Example |
---|---|---|---|
Fiscal year start date must be indicated in the data dictionary
e.g. The fiscal year starts on July 1 and ends on June 30 for the State of California
ISO 8601 uses the 24-hour clock in hh:mm:ss format, sometimes referred to as military time (do not use AM or PM)
e.g. 13:00 is equivalent to 1:00 PM
Specify the timezone if it is not local time (UTC -8hrs Pacific Standard Time, UTC -7hrs Pacific Daylight Time):
In certain cases you may want to provide a single variable representing the number or name of an individual date component, a day, a month, etc. There's no requirement to provide these, but follow this guidance:
Durations can be automatically calculated if you provide a separate start and end period in your dataset. If you also want to provide a duration, please:
Provide the milliseconds between the start and end period (include the duration unit in the data dictionary)
Milliseconds can be rolled up to other time intervals
Use duration in your column name but prepend with a useful descriptor, e.g:
flight_duration
response_duration
dwell_time_duration
travel_duration
Do not duplicate any of the duration column names per the guidance on columns
Note: ISO 8601 does have separate guidance on duration formatting, but we find it more cumbersome than calculating the milliseconds between a start and end period, for which many standard programming libraries exist.
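In Python, for example, the standard library produces both the extended-format timestamps and the millisecond durations described above:

```python
from datetime import datetime

start = datetime(2015, 1, 1, 13, 0)
end = datetime(2015, 1, 1, 14, 30)

# ISO 8601 extended format (with hyphens) is the preferred representation.
assert start.isoformat(timespec="minutes") == "2015-01-01T13:00"
assert start.strftime("%Y-%m-%d") == "2015-01-01"

# Durations: provide the milliseconds between the start and end period.
flight_duration = int((end - start).total_seconds() * 1000)
assert flight_duration == 5_400_000  # 90 minutes
```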
Referenced from:
Title. A clear name for the dataset that does not include dates and limits the use of California or CA. See .
Description. A plain description that will display below the title on the open data portal. See . The description field can accept markdown formatting for creating things like bullets and headers using text ()
Tags. Descriptive keywords or phrases that users will search for to find the data. These can be used for providing common synonyms, legal references, and other shorthand users may use to find your data. You don't need to repeat terms that are in your title or description, and you should avoid generic terms that could apply to almost any dataset (opendata, open, transparency, etc.). You must provide at least one tag and separate each tag with a comma.
Publisher. This is the organization you’re publishing on behalf of (your agency, department, board, or commission).
Topic. One of the following:
COVID-19
Economy and Demographics
Government
Health and Human Services
Natural Resources
Transportation
Water
Note: if you are publishing on an agency or department portal, these will be different. In this case, the topics are automatically mapped from agency and department portals to the statewide portal.
Frequency. How often you intend to update the data resources. One of the following:
Irregular
Continuously updated
Hourly
Daily
Twice a week
Semiweekly
Weekly
Biweekly
Semimonthly
Monthly
Every two months
Quarterly
Semiannual
Annual
Biennial
Decennial
Program Contact Name. The specific group inside the agency, department, board, or commission that produces the data and can best answer questions about it.
Program Contact Email. The generic email address for the program referenced above. (e.g. )
Public Access Level. For data to be shared with the public, always Public. Other options on the portal not currently applicable.
Rights. Always enter “No restrictions on public use.”
License. Default to Public Domain unless there is a valid business reason to select a different open license.
Author. The agency, group, department, board, or commission that authors the data resource and has ultimate responsibility for the creation of the data. If this is the same as the publisher, no need to enter. Use this field if your organization is publishing on behalf of a different author (research institution, other local or federal organization, etc.) or if you’d like to indicate a division or program as the author. If the author is actually another State entity, they should publish the data. Rare exceptions will be considered.
Spatial/Geographic Coverage. The geographical area the data table covers (e.g. statewide versus a sub-state region like the Bay Area). Specification should include a named area that also names California (San Francisco Bay Area 9 County Region, California) and may include geographic coordinates. In general, give enough description so people can determine its location if commingled with other non-California datasets.
Temporal Coverage. Start date and End date for the data in your data resource. Entered as a range using ISO8601 formatted date strings (e.g. 2017-01-01 to 2020-12-31)
Homepage URL. URL for the page on your website that has useful information about the data resource or the group that updates it. It's a webpage that gives context about the data and cross-links to the open data.
Language. The language of the published data and metadata.
Granularity. Specify the smallest unit of analysis represented within the dataset. This can apply to both geography (address, parcel, census block, etc.) or time (year, month, day, hour, etc.)
Additional Information. Enter any additional notes or information you’d like to highlight. Note, if you find yourself putting lots of information here, consider putting it in the dataset description.
Related Content. Enter secondary source(s) info: If your data resource is partially made from other data sources, please provide descriptive name(s) and/or URLs of resource(s) from which the data table is derived.
Each data product above falls short on one or more of the open data criteria: released in the Public Domain, accessible and discoverable, published with timely updates, machine readable, and in an open format. For example, a report published only as a PDF is not in an open format.
Note: this guidance is provided to promote consistency across the bulk of shared tabular datasets and not as a comprehensive guide to address standards. For a comprehensive standard on addressing, see the
Data Standard. This is used to identify a standardized specification the dataset conforms to, if any. Provide a URI directly to the website that describes the standard. You can find a reference list online at
| Interval | Column name | Format | Range of values | Example |
|---|---|---|---|---|
| Annual | year | YYYY | YYYY: any valid year | 2022 |
| Monthly | month | YYYY-MM | MM: 01 to 12 | 2022-01 |
| Daily | date | YYYY-MM-DD | DD: 01 to 31 | 2022-01-01 |
| Weekly | week | YYYY-[W]WW | [W]WW: W01 to W52 | 2022-W01 |
| Quarterly | quarter | YYYY-[Q]Q | [Q]Q: Q1 to Q4 | 2022-Q1 |
| Half-yearly | half_year | YYYY-[H]H | [H]H: H1 or H2 | 2022-H1 |
| Interval | Column name | Format | Example |
|---|---|---|---|
| Fiscal, annual | fiscal_year | YYYY | 2015 |
| Fiscal, monthly | fiscal_month | YYYY-MM | 2015-01 |
| Fiscal, quarterly | fiscal_quarter | YYYY-[Q]Q | 2015-Q1 |
| Fiscal, half-yearly | fiscal_half_year | YYYY-[H]H | 2015-H1 |
| Type | Column name | Format | Example |
|---|---|---|---|
| Date + time | date_time | YYYY-MM-DD[T]hh:mm or YYYY-MM-DD[T]hh:mm:ss | 2015-01-01T13:00 or 2015-01-01T13:00:00 |
| Time only | time | hh:mm or hh:mm:ss | 13:00 or 13:00:00 |
| Type | Column name | Format | Example |
|---|---|---|---|
| Date + time | date_time | YYYY-MM-DD[T]hh:mm+hh:mm or YYYY-MM-DD[T]hh:mm:ss+hh:mm | 2015-01-01T12:00+00:00 or 2015-01-01T12:00:00+00:00 |
| Extract | Column name | Type | Range of values |
|---|---|---|---|
| Year | year_num | integer | any valid year |
| Month | month_num | integer | 1 to 12 |
| Month Name | month_name | string | January, February, March, April, May, June, July, August, September, October, November, December |
| Week of Year | woy_num | integer | 1 to 52 |
| Day | day_num | integer | 1 to 31 (varies by month) |
| Day of Week | dow_num | integer | 1 to 7 |
| Day of Week Name | dow_name | string | Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday |
| Hour | hour_num | integer | 1 to 24 |
| Minute | minute_num | integer | 1 to 60 |
| Second | second_num | integer | 1 to 60 |
From Address Number | Numeric | First part of a range: 1000-1100 Main Street, San Francisco, CA 94102 |
To Address Number | Numeric | Second part of a range: 1000-1500 Main Street, San Francisco, CA 94102 |
Address Number Prefix | Numeric | The portion of the Complete Address Number that precedes the Address Number itself: B315 Main Street, San Francisco, CA 94102 |
Address Number | Numeric | The numeric identifier for a land parcel, house, building, or other location along a thoroughfare or within a community: 315A Main Street, San Francisco, CA 94102 |
Address Number Suffix | Text | The portion of the Complete Address Number that follows the Address Number itself: 315 A Main Street, San Francisco, CA 94102 |
Street Name Pre Modifier | Text | A word or phrase in a Complete Street Name that 1. Precedes and modifies the Street Name, but is separated from it by a Street Name Pre Type or a Street Name Pre Directional or both, or 2. Is placed outside the Street Name so that the Street Name can be used in creating a sorted (alphabetical or alphanumeric) list of street names.: 315A Old Main Street, San Francisco, CA 94102 |
Street Name Predirectional | Text | A word preceding the street name that indicates the directional taken by the thoroughfare from an arbitrary starting point, or the sector where it is located: 315A East Main Street, San Francisco, CA 94102 |
Street Name Pretype | Text | A word or phrase that precedes the Street Name and identifies a type of thoroughfare in a Complete Street Name: US Route 101, San Francisco, CA |
Street Name | Text | The portion of the Complete Street Name that identifies the particular thoroughfare (as opposed to the Street Name Pre Modifier, Street Name Post Modifier, Street Name Pre Directional, Street Name Post Directional, Street Name Pre Type, Street Name Post Type, and Separator Element (if any) in the Complete Street Name.): 315A Main Street, San Francisco, CA 94102 |
Street Name Posttype | Text | A word or phrase that follows the Street Name and identifies a type of thoroughfare in a Complete Street Name: 315A Main Street, San Francisco, CA 94102 |
Street Name Postdirectional | Text | A word following the street name that indicates the directional taken by the thoroughfare from an arbitrary starting point, or the sector where it is located: 315A Main Street East, San Francisco, CA 94102 |
Street Name Post Modifier | Text | A word or phrase in a Complete Street Name that follows and modifies the Street Name, but is separated from it by a Street Name Post Type or a Street Name Post Directional or both: 315A Main Street Extended, San Francisco, CA 94102 |
Occupancy Type | Text | The type of occupancy to which the associated Occupancy Identifier applies. (Building, Wing, Floor, Apartment, etc. are types to which the Identifier refers.): 315A Main Street, Apt 2, San Francisco, CA 94102 |
Occupancy Identifier | Text | The letters, numbers, words, or combination thereof used to distinguish different subaddresses of the same type when several occur within the same feature: 315A Main Street, Apt 2, San Francisco, CA 94102 |
City | Text | The city the address sits within: 315A Main Street, San Francisco, CA 94102 |
State Name | Text | The names of the US states and state equivalents: the fifty US states, the District of Columbia, and all U.S. territories and outlying possessions. A state (or equivalent) is "a primary governmental division of the United States." The names may be spelled out in full or represented by their two-letter USPS or ANSI abbreviation: 315A Main Street, San Francisco, CA 94102 |
ZIP code | Numeric | A system of 5-digit codes that identifies the individual Post Office or metropolitan area delivery station associated with an address: 315A Main Street, San Francisco, CA 94102 |
ZIP+4 | Numeric | A 4-digit extension of the 5-digit Zip Code (preceded by a hyphen) that, in conjunction with the Zip Code, identifies a specific range of USPS delivery addresses: 315A Main Street, San Francisco, CA 94102-1212 |
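The recommended reporting-period formats in the tables above can be produced with Python's standard library. A minimal sketch (the variable names mirror the suggested column names; the quarter and half-year strings are simple arithmetic, not a library feature):

```python
from datetime import datetime

d = datetime(2022, 1, 5, 13, 0)  # a Wednesday in ISO week 1 of 2022

year = d.strftime("%Y")                  # annual: 2022
month = d.strftime("%Y-%m")              # monthly: 2022-01
date_str = d.strftime("%Y-%m-%d")        # daily: 2022-01-05
iso_year, iso_week, _ = d.isocalendar()  # ISO week year can differ from calendar year
week = f"{iso_year}-W{iso_week:02d}"     # weekly: 2022-W01
quarter = f"{d.year}-Q{(d.month - 1) // 3 + 1}"      # quarterly: 2022-Q1
half_year = f"{d.year}-H{1 if d.month <= 6 else 2}"  # half-yearly: 2022-H1
timestamp = d.strftime("%Y-%m-%dT%H:%M:%S")          # date + time: 2022-01-05T13:00:00
```

Note the ISO week-year subtlety: January 1, 2022 falls in 2021-W52, which is why the sketch derives the year for the weekly column from `isocalendar()` rather than from the calendar year.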
We have had a number of people help with the creation and feedback on this handbook. This wouldn't be possible without their help.
First, many thanks to the student team that kickstarted user research through Stanford's CS184 (Bridging Policy and Tech Through Design) class: Emily Bunnapradist, Jenn Hu, and Sejal Jhawer. Thank you for bringing fresh eyes and design thinking to the open data publisher's journey and this resulting handbook.
And thank you to all those that provided their subject matter expertise and feedback as we developed the handbook (with apologies if we missed anyone): Benjamin Brezing, Colin Stevens, David Altare, David Harris, Jarma Bennett, Karen Henderson, Kate Spiess, Mahesh Gautam, Michael Tagupa, Ping Zhong, Rafael Maestu, Rodney Garcia, Sam Hayashi, Scott Fujimoto, MD, MPH, Tuba Demir Dagdas, Will Wheeler, and Yanyi Djamba.
For each variable, a Data Dictionary lists:
Field Name. The name of the field as it's written in the source data table. It's okay for these to be short, and you often won't have complete control over them. The field label is where you can write something more descriptive that will be a reference for users.
Field Label. The common English title for the data contained in this column. Avoid using abbreviations here.
Data Type. Can be one of the following:
Note: these are the data types supported by data.ca.gov, which is a CKAN portal. You choose a type for each field when initially uploading your dataset, and choosing the right type makes the dataset easier for data users to work with.
text. An arbitrary series of alphanumeric characters
json. Nested json data e.g. {"foo": 42, "bar": [1, 2, 3]}.
date. Date without time stored in an ISO8601:extended format e.g. 2015-05-25
time. Time without a date in 24 hour format e.g. 15:00:05
timestamp. Date and time stored in an ISO8601:extended format e.g. 2015-05-25T15:00:05
int. An integer number (no decimals)
Only use it if this field is meant to be used in a calculation. Otherwise use “text”.
float. A floating point number (with decimals)
Only use if this field is meant to be used in a calculation. Otherwise use “text”.
bool. A true/false (boolean) value; valid formats: true/false, 1/0, on/off
Field Definition. Full description of what information is included for the field. See best practices for writing definitions.
Valid Values. (if applicable) Indicate what the expected set of valid values is for the field. This could be a list of controlled values, a range (for numbers and dates), or a minimum or maximum value (for numbers and dates).
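A data dictionary also enables lightweight quality checks before you upload. A sketch of that idea (the field names, types, and valid ranges below are invented for illustration, not part of the handbook's template):

```python
import csv
import io

# Hypothetical data dictionary: field name -> (data type, valid values).
# "valid values" can be None (anything goes) or a range/list, as described above.
DICTIONARY = {
    "county": ("text", None),
    "year": ("int", range(1850, 2101)),
    "rate": ("float", None),
}

def check_row(row):
    """Return a list of problems found in one CSV row."""
    problems = []
    for field, (ftype, valid) in DICTIONARY.items():
        value = row.get(field, "")
        try:
            if ftype == "int":
                value = int(value)
            elif ftype == "float":
                value = float(value)
        except ValueError:
            problems.append(f"{field}: expected {ftype}, got {value!r}")
            continue
        if valid is not None and value not in valid:
            problems.append(f"{field}: out of range: {value!r}")
    return problems

sample = io.StringIO("county,year,rate\nAlameda,2022,1.5\nKern,20x2,2.0\n")
rows = list(csv.DictReader(sample))
print(check_row(rows[0]))  # []
print(check_row(rows[1]))  # ["year: expected int, got '20x2'"]
```

The same dictionary that documents the data for users can drive checks like this, which is one more reason to fill it out carefully.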
Below are the detailed steps broken up into the following sections:
Experience errors during the upload process?
Reach out to the open data team at the Department of Technology.
All reviewers must have accounts on data.ca.gov to review the private dataset.
If you haven't already, confirm that all reviewers have publishing access within your organization on the portal. If you need to request access, contact the open data team.
Copy the link to your private dataset, send it to the reviewers, and work with your Data Coordinator on final publishing approval.
Reviewers must log in with their accounts to see the private dataset.
Template:
Hi -insert recipient's name-,
I'm sending this email to give you a heads up that I will be working on publishing open data on the through -insert timeframe here-.
Based on your experience, I would like to invite you to join the project as a -insert data publishing role identified here-. As a -insert data publishing role identified here-, you will be responsible for -insert role description-.
You can read more about the role and the open data publishing process in the California Open Data Publisher's Handbook, linked .
If you currently lack the bandwidth to join the project, it would be extremely helpful if you could refer me to somebody else who you think will be a good fit for the role.
Thanks!
Best,
-insert your name-
Example:
Hi John,
I'm sending this email to give you a heads up that I will be working on publishing open data on the through June 2022.
Based on your experience, I would like to invite you to join the project as a Data Coordinator. As a Data Coordinator, you will be responsible for conveying to the appropriate parties any specific needs of the open data portal and program.
If you currently lack the bandwidth to join the project, it would be extremely helpful if you could refer me to somebody else who you think will be a good fit for the role.
Feel free to let me know if you have any further questions or concerns.
Thanks!
Best,
Jenn
Template:
Hi -insert recipient's name-,
Do you know if automated publishing is possible, and if so, what are the options up for consideration?
If these questions lie outside your knowledge, I would appreciate it if you could refer me to someone who you think would be able to assist me on this.
Feel free to let me know if you have any questions or concerns.
Thanks!
Best,
-insert your name-
Example:
Hi John,
Do you know if automated publishing is possible, and if so, what are the options up for consideration?
If these questions lie outside your knowledge, I would appreciate it if you could refer me to someone who you think would be able to assist me on this.
Feel free to let me know if you have any questions or concerns.
Thanks!
Best,
Jenn
Below we document significant changes to the handbook. We won't log minor fixes like typos or grammar. If you're interested, you can see the full change history (individual changes are called commits in git).
Released May 31, 2022
Changes based on feedback including:
Fixed missing links
Clarification of some terms
Fixing of typos
Released April 26, 2022
Initial release of the handbook covering:
6 overarching steps for data publishing
More detailed guidance linked from those steps as references
The Data Coordinator acts as a liaison between internal Information Technology staff, organizational programs and leadership, and portal managers.
They are best positioned to convey to the appropriate parties any specific needs of the open data portal and program. They are trusted partners in open data within their organization.
The Data Custodian is the person most knowledgeable about how the data is stored and protected, and has the technical knowledge to query and extract the data.
They advise and help with data access and navigate technical options for automation.
The Data Steward is the person most knowledgeable about the data including the sources, collection methods, and limitations.
They prepare data for publishing on the portal, work with Data Custodians on any system access needs, and work with the Data Coordinator on publishing approval.
Within data.ca.gov, a dataset (or data set) is a collection of data and resources.
ELT is the process of extracting data from one or multiple sources and loading it into a data warehouse. Instead of transforming the data before it is written, ELT takes advantage of the system where the data is to be stored to perform the data transformation. This is another approach to automating data updates to the open data portal. In this case the final transformed dataset in the warehouse is synced to the open data portal.
ETL is a type of data integration that consists of three steps (extract, transform, load) used to blend data from multiple sources. During this process, data is taken (extracted) from a source, converted (transformed) into a format that can be analyzed, and stored (loaded) into a data warehouse or other system. This is one common approach to automating data updates to the open data portal.
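As a concrete illustration, a toy ETL run can be sketched in a few lines of Python. The source data, column names, and transformation below are invented for the example; real pipelines would read from an actual source system and hand the result to the portal:

```python
import csv
import io

# Extract: read raw rows (a string here stands in for a source system export)
raw = io.StringIO("County,Visits\nalameda,10\nkern,\n")
rows = list(csv.DictReader(raw))

# Transform: normalize county names, drop rows with missing counts
clean = [
    {"county": r["County"].title(), "visits": int(r["Visits"])}
    for r in rows
    if r["Visits"]
]

# Load: write the publishable flat file (CSV) destined for the portal
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["county", "visits"])
writer.writeheader()
writer.writerows(clean)
```

In an ELT pipeline, by contrast, the raw rows would be loaded into the warehouse first and the transform step would run there.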
A flat file is an informal term for a single table of data from which all word-processing or other structural markup has been removed. A flat file stores data in plain text format. Because of their simple structure, flat files can only be read, stored, and sent. Comma-separated values (CSV) files are among the most common flat files: text files where fields are separated by commas and each row is a new line.
Harvesting is a process where the data portal automatically imports (“harvests”) datasets from multiple CKAN websites and other non-CKAN sources into a single CKAN website. This automated process is what enables the statewide portal to contain data from other agency and department portals.
The harvests are set up and monitored by system administrators of the portals. It is not something a publisher needs to worry about when publishing.
Personally identifiable information is any data that can be used to identify a specific individual. Examples include a full name with Social Security number, mailing or email address, or phone number.
Within data.ca.gov, resources are the actual files, APIs or links that are being shared through the portal. Resource types include csv, html, xls, json, xlsx, doc, docx, rdf, txt, jpg, png, gif, tiff, pdf, odf, ods, odt, tsv, geojson and xml files. If the resource is an API, it can be used as a live source of information for building a site or application.
You can read more about the role and the open data publishing process in the California Open Data Publisher's Handbook, linked .
I am currently working on publishing open data on the . After publishing the data, I hope to update it -insert frequency here-.
I am currently working on publishing open data on the . After publishing the data, I hope to update it on a monthly basis.
ELT is an alternative to the ETL process.
ETL is an alternative to the ELT process.
To harvest a source catalog, there must be a public interface to a data file that represents the catalog in a .
Information or data that is in a format that can be easily processed by a computer without human intervention. To be machine readable, data must be structured in an organized way. CSV, JSON, and XML, among others, are formats that contain structured data that a computer can automatically read and process.
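For example, the same record can be serialized in two machine-readable formats with Python's standard library; either output can be parsed back by a program with no human cleanup:

```python
import csv
import io
import json

record = {"county": "Alameda", "year": 2022, "rate": 1.5}

# CSV: a header row plus one data row
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# JSON: the same record as a structured object
encoded = json.dumps(record)
```

A PDF of the same table, by contrast, would require a human (or fragile scraping) to recover the values, which is why it is not considered machine readable.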
Personal health information, also referred to as protected health information, is any information about health status, provision of health care, or insurance, together with other data that a healthcare professional collects to identify an individual and determine appropriate care. Under the Health Insurance Portability and Accountability Act (HIPAA), data is considered PHI if it includes one or more of 18 specified identifiers. If these identifiers are removed, the information is considered de-identified protected health information, which is not subject to HIPAA's restrictions.
Click on the My Datasets tab
Click the button labeled Add Dataset
Enter metadata by copying from the Metadata Template to the relevant fields. Step 3 in this handbook covers the creation of metadata. Fields are ordered in the template the same as they are in the interface.
Ensure the field License is entered as Other (Public Domain)
Ensure the Visibility is set to Private. This is the default.
Click the button labeled Next: Add Data
After clicking Next: Add Data in the previous step, you will see an interface to add files by uploading or linking
Click the button labeled Upload
Select your data file and click Open. Data files must be in an open format like CSV.
Add a Title and Description. See guidance on writing titles and descriptions.
Do not enter anything in Format. This will be detected by the system.
If you have another data file to upload, click the button labeled Save & add another. Repeat the steps starting at the top of this section.
If you want to add more non-data resources like documentation, click the button labeled Save & add another and skip to the next section where you'll continue adding non-data files.
If you are done adding data files and have no other files to add in the next section, click the button labeled Finish
If you do not have additional non-data resources to add, you can skip this section
Click the button labeled Upload
Select your non-data file and click Open. If you are providing additional reference documentation, PDF is the best format to provide this in.
Add a Title and Description
Do not enter anything in Format. This will be detected by the system.
If you have another non-data file to upload, click the button labeled Save & add another. Repeat the steps starting at the top of this section.
If you are done adding files, click the button labeled Finish.
After completing your data and non-data resource uploads, you will be taken to a private view of your dataset. You will see the dataset denoted as Private.
Click the button labeled Manage in the upper right
Click the Resources tab
Click on a data resource (e.g. CSV) to which you want to add a data dictionary
Click on the Data Dictionary tab
For each data field, copy information over from the Metadata Template workbook in the Data Dictionary Template sheet:
Copy Field Label over to Label
Copy Field Definition over to Description
Click the button labeled Save at the bottom
If you have multiple data files, repeat for each data file by clicking on the button labeled All resources at the top. Then select the next file to which you'd like to add data definitions.
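If you have many fields, entering the data dictionary by hand can be tedious. CKAN (the software behind data.ca.gov) also exposes the data dictionary through its DataStore API via `datastore_create` with per-field `info`; the sketch below only builds the request payload, and the resource ID shown is hypothetical. Confirm with the open data team that API updates are appropriate for your dataset before using this approach:

```python
import json

def data_dictionary_payload(resource_id, fields):
    """Build a datastore_create payload that sets each field's label and
    description. force=True is generally required for resources that are
    otherwise managed through the portal UI."""
    return {
        "resource_id": resource_id,
        "force": True,
        "fields": [
            {
                "id": f["name"],
                "type": f.get("type", "text"),
                "info": {"label": f["label"], "notes": f["definition"]},
            }
            for f in fields
        ],
    }

payload = data_dictionary_payload(
    "11111111-2222-3333-4444-555555555555",  # hypothetical resource ID
    [{"name": "county", "label": "County", "definition": "County of residence"}],
)
print(json.dumps(payload, indent=2))

# Applying it would require an API token, e.g. (not run here):
# requests.post("https://data.ca.gov/api/3/action/datastore_create",
#               json=payload, headers={"Authorization": API_TOKEN})
```

The `label` and `notes` values correspond to the Label and Description boxes in the Data Dictionary tab described above.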
From the page listing all of your resources, click View Dataset in the upper right
Review your dataset description for human readability and grammar
Check that your license is specified as Other (Public Domain) at the bottom of the left-most content
Check the accuracy of the other metadata in the Additional Info table at the bottom
If you catch any errors or omissions, click Manage in the upper right, which will take you back to the form entry for metadata
Make changes in the metadata form and click Update Dataset at the bottom
From your private dataset page, scroll down to the section labeled Data and Resources
Click on each resource, which will take you to a preview
If you find any errors or omissions or need to re-upload your resource, click Manage in the upper right
Go back to the private dataset page and continue to check each resource until done
After receiving publishing approval, log in to the open data portal
Click on the My Datasets tab
Click on the dataset in your list you want to make public
Click Manage in the upper right
Set Visibility to Public
Click Update Dataset at the bottom of the page
Your dataset is now public
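The visibility change above can also be scripted against CKAN's Action API using `package_patch`, which is useful if publishing is part of an automated pipeline. A sketch under stated assumptions: the dataset name and API token are placeholders, only the payload is built and checked here, and you should confirm with the open data team before automating publication:

```python
import json

def publish_payload(dataset_name):
    """Build a package_patch body that makes a private dataset public."""
    return {"id": dataset_name, "private": False}

payload = publish_payload("my-example-dataset")  # hypothetical dataset name
print(json.dumps(payload))

# Applying it would require an API token with publishing rights, e.g. (not run here):
# requests.post("https://data.ca.gov/api/3/action/package_patch",
#               json=payload, headers={"Authorization": API_TOKEN})
```

Setting `private` back to `True` with the same call reverses the change, mirroring the Visibility dropdown in the UI.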