Challenges With Finding Federal Data
Across the Federal government, identifying and locating desired data on programs can be very challenging. Important dimensions of leveraging data, such as modes of access, the condition and quality of documentation, and services to support the findability of data, vary considerably by agency or data source. Federal and private inventory programs such as data.gov improve the situation by offering a single, common interface and a rapid way to search across many agencies and data sources. However, even with such services, users must still guess the right keywords or domains to run a meaningful search, so success in identifying data depends on experience, knowledge, skill, or luck.
To better understand this challenge at DOL, ODG staff performed comparisons between (1) the keywords and descriptors that DOL agencies offer with their public data resources, (2) search terms that the public frequently enters into agency websites to try to find data, and (3) search terms entered into popular internet search engines. This comparison found almost no overlap between the search terms used by the public to find DOL data, and the language used by DOL programs to describe their public data offerings.
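A comparison of this kind can be pictured as measuring the overlap between sets of terms. The sketch below is illustrative only and is not ODG's actual method: the term lists are invented for the example, and Jaccard similarity is simply one assumed way to quantify overlap.

```python
def normalize(terms):
    """Lowercase and strip whitespace so trivially different spellings match."""
    return {t.strip().lower() for t in terms if t.strip()}


def jaccard_overlap(a, b):
    """Return |A & B| / |A | B| for two collections of terms (0.0 if both are empty)."""
    set_a, set_b = normalize(a), normalize(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical example terms; these are NOT taken from DOL data.
agency_keywords = ["FLSA", "Back Wages", "compliance action", "civil money penalty"]
public_searches = ["unpaid overtime", "minimum wage", "back wages", "wage theft"]

shared = normalize(agency_keywords) & normalize(public_searches)
score = jaccard_overlap(agency_keywords, public_searches)
print(sorted(shared), round(score, 3))
```

A score near zero, as in this invented example, would indicate that the two groups describe the same data in largely different words.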
DOL agencies often operate with a vocabulary of technical terms with specific, legal meanings that describe events, actions, actors, and outcomes in their data. These terms derive from laws and policies related to the mission and work of each DOL program. While DOL programs produce many data sets that are readily available and potentially useful for a broad swath of public, private, and other Federal data users, it can be difficult for people without intimate knowledge of these terms and vocabularies to efficiently discover the specific data they want or need. What is needed is a resource that helps users understand these special terms and supports more direct methods for locating data.
DOL Leads Efforts to Make Data More Findable
ODG and DOL’s Data Board lead efforts to develop a standard, plain-language approach to describing data sets. These efforts began with comprehensive reviews of DOL’s data, and various data categorization, keyword, and tagging systems. This work ultimately led to the creation of a simple and unified data description system that makes DOL data more findable and includes:
- Taxonomy: clear, plain-language categories and sub-categories to describe data, with a focus on comprehensibility for outside users and on linking parochial categories and terms to a broader or more generic set of concepts (i.e., actions, actors, events, outcomes); and
- Controlled Vocabularies: a prespecified set of descriptors across all programs and systems to describe data that ensures predictable and consistent tagging and identification.
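One way to picture how the two pieces fit together is a mapping from each taxonomy category to its controlled vocabulary, with proposed tags checked against that mapping. The sketch below is a minimal illustration under stated assumptions: the category names, terms, and the `validate_tags` helper are hypothetical and do not represent DOL's actual system.

```python
# Hypothetical taxonomy: category -> controlled vocabulary of allowed terms.
TAXONOMY = {
    "law": {"fair labor standards act", "occupational safety and health act"},
    "theme": {"benefits", "child labor", "enforcement"},
    "update_frequency": {"annual", "quarterly", "weekly", "daily"},
}


def validate_tags(tags):
    """Split (category, term) pairs into accepted and rejected lists.

    A tag is accepted only if its category exists in the taxonomy and its
    term, after lowercasing and trimming, is in that category's vocabulary.
    """
    accepted, rejected = [], []
    for category, term in tags:
        normalized = term.strip().lower()
        if normalized in TAXONOMY.get(category, set()):
            accepted.append((category, normalized))
        else:
            rejected.append((category, term))
    return accepted, rejected


proposed = [
    ("theme", "Enforcement"),
    ("theme", "wage theft"),            # term not in the vocabulary
    ("law", "Fair Labor Standards Act"),
    ("agency", "WHD"),                  # category not in the taxonomy
]
accepted, rejected = validate_tags(proposed)
print(accepted)
print(rejected)
```

Rejecting terms that fall outside the vocabulary is what keeps tagging predictable: every data set ends up described with the same finite set of words.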
How do Taxonomy and Controlled Vocabulary Help?
Taxonomy simplifies search by providing a consistent and predictable set of categories for keywords. These categories include reporting forms, laws associated with the collection, descriptors of the actors (who) described in the data, descriptors of events and outcomes (what happened), and descriptors of the specificity and coverage of data elements, such as geographic variables within the data.
Controlled vocabularies provide a complete set of searchable terms within a category, so users can identify appropriate search terms and use them to find DOL data sets with the characteristics, format, or content that they want. Examples of taxonomy categories, and of the vocabulary terms within them that can support more natural search for data of interest, include:
- Data sources associated with specific laws such as the Civil Rights Act, the Fair Labor Standards Act, and the Occupational Safety and Health Act;
- General descriptors of the content and themes represented within the data such as benefits, child labor, enforcement, health insurance, labor unions, legal, financial, budget, grants, emergency response, etc.;
- Practical and logistical aspects of the data such as frequency of update (e.g. annual, quarterly, weekly, daily), data file formats and the organizational structure of the data itself (e.g. event history data, relational data and the fields to link it, text/narrative data, administrative data in rectangular layouts, etc.); and
- Fields that are useful for linking or categorizing data such as geographical indicators and classification schemes (NAICS, DUNS, EIN, SOC, etc.).
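The categories and terms above lend themselves to a simple category-by-category search. The following sketch assumes a small in-memory catalog; the `DataSet` class, the `search` function, the category names, and the data set names are all hypothetical and serve only to show how consistent tags make filtering straightforward.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    # Maps a taxonomy category to the set of vocabulary terms applied to this data set.
    tags: dict = field(default_factory=dict)


def search(catalog, **facets):
    """Return names of data sets tagged with every requested category=term pair."""
    matches = []
    for ds in catalog:
        if all(term in ds.tags.get(category, set()) for category, term in facets.items()):
            matches.append(ds.name)
    return matches


# Hypothetical catalog entries; names and tags are invented for illustration.
catalog = [
    DataSet(
        "example_enforcement_cases",
        {"law": {"fair labor standards act"},
         "theme": {"enforcement", "child labor"},
         "linking_field": {"naics"}},
    ),
    DataSet(
        "example_injury_reports",
        {"law": {"occupational safety and health act"},
         "theme": {"enforcement"},
         "linking_field": {"naics", "ein"}},
    ),
]

by_theme = search(catalog, theme="enforcement")
by_theme_and_field = search(catalog, theme="enforcement", linking_field="ein")
print(by_theme)
print(by_theme_and_field)
```

Because every term comes from a fixed vocabulary, a user can combine categories (for example, a theme plus a linking field) and narrow the results without guessing at program-specific wording.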
Below are links to download files with DOL's tags and keywords along with the controlled vocabulary:
DOL Keyword Controlled Vocabulary; Excel Format
DOL Keyword Controlled Vocabulary; CSV Format
The Value in Enterprise Governance of Tags and Keywords
Like other important data governance functions, making data descriptions more predictable and consistent reduces the burden of data management and administration within government, and makes data more findable and accessible, as well as more comprehensible for those outside of DOL. Some important benefits for a wide range of stakeholders include:
- Simplifying complex and parochial semantics across programs and systems;
- Providing an aid to understanding the terms of the special language of DOL work;
- Providing more effective and efficient methods for locating data; and
- Clarifying what categories and terms to use in search to generate optimal results.
In addition, by integrating the taxonomy and controlled vocabulary with our enterprise data inventory, we offer the public a clear framework for searching data sets and variables across the enterprise as simply and efficiently as possible, and for more readily locating the data and data stewardship that can help advance their goals. Enhancing the comprehensibility of data, and the ease of finding data related to specific laws, events, programs, or missions, also increases the ability to integrate DOL data with other data sources, both inside and outside the Department, and can potentially return the greatest value to the Department and its stakeholders.
Below are links to download a list of available DOL data sets with tags and keywords from our controlled vocabulary that will help users identify the data they are trying to find.