Datasets are the data assets in Spotlight. They let you preview your data, describe it, cache it inside Spotlight, discuss it with colleagues, and open it in external tools for further analysis. See Supported data for a list of supported data types, names, and sizes.
Datasets can be added to Spotlight or created inside a Workspace's Workbench.
Added datasets are tables on external data systems and uploaded files that are in one of the supported data file formats (CSV, JSON, XLS/XLSX/XLSB, or Parquet). Spotlight creates virtual copies of all this added data so that the original is never edited.
Created datasets are built in Spotlight Workspaces using the Workbench tool to combine, filter, and otherwise transform the data from added datasets (or existing created ones) into the new forms you need for analysis. Created datasets can only have their contents modified inside the original Workbench where they were created.
If you cannot find the data you need in Spotlight, you can either add it from outside or create a new dataset by combining, filtering, and transforming data already in Spotlight.
Add datasets to Spotlight
Add to Spotlight and Connect describe the various ways to add data to Spotlight. See Workspaces: References for details on how to reference these datasets in a Workspace so you can combine, filter, and otherwise transform that data by creating new datasets.
Create datasets in Spotlight
Creating datasets gives you the ability to pull data from Spotlight datasets (whether added or created) and apply operations that transform that data to fit your project's needs.
The first step in creating a new dataset is gathering references to all the data and other analytic assets for your project inside a Workspace. You can use data from existing Spotlight datasets, connect new external data systems, or upload files directly to Spotlight (see Workspaces: References).
Once everything you need is in a Workspace, you can use the attached Workbench tool to transform the data you have gathered into the form that your project needs. In your Workspace, click the button then follow the step-by-step instructions in "Workbench: Create new dataset".
Add to Workspace
From the dataset's detail page, you can add a reference to the dataset to one of your Workspaces.
Open in external tool
Buttons on the top right of the details page help you open the dataset in an external tool or download it as a CSV. These buttons only appear for datasets created in Spotlight. Datasets representing file uploads or tables on external databases can also be opened in external tools by adding them to a Workspace and using the Open in buttons in the Workspace's Workbench.
Comments appear on the right side of the dataset details page with the most recent comments at the top. Any user can comment on a dataset or reply to existing comments. Users following the dataset will receive a notification in their Activities panel for new comments and replies.
Add a new comment by clicking on the button on the top right of the Comment section.
Delete a comment, including any replies, by clicking the dark conversations menu in the top right of the comment box.
Reply to an existing comment by clicking on it and typing in the new comment entry box that will appear underneath the comment. You can reply to comments from the dataset detail page and also from the notification of a comment in your Activities panel.
Notifications of new comments go to anyone following the dataset. Owners follow datasets by default.
Note: notifications are tied to the individual asset, so comments on a dataset will not be sent to users following Workspaces where that dataset was created or referenced.
The dataset's owner and Spotlight administrators have the ability to configure various settings for the dataset and enrich it with metadata. For datasets created in a Spotlight Workspace, all collaborators have the same permissions as the dataset's owner.
Each column in the dataset can have a description of up to 400 characters. This enables you to produce a basic data dictionary directly integrated with your dataset and discoverable through the find tool.
From the dataset detail page, click the button at the top of the Table Preview area.
This will open the Column view of your dataset. Click on the Description field next to any column to enter information about that column. Changes are saved automatically as you type.
To switch back to the Table Preview view of your dataset, click the button at the top of the Column view.
Open in Workbench
Quickly review or modify how the dataset is constructed by using the button on the detail page to open the dataset in the Workbench where it was created. (Only available for datasets created in a Spotlight Workbench.)
Make a copy of a dataset's contents inside Spotlight. This can be useful to reduce access time, to spread system load on connected data systems (databases, S3, etc.), and to create a snapshot of volatile data. If a dataset in Spotlight requires authenticating with connected data systems, the cached version will also require authenticating against those systems before you can access it.
The Cache section of your dataset's detail page shows the current status of caching for the dataset. You can control whether the dataset is cached and optionally schedule when to refresh the cached copy. Uploaded files cannot be cached.
The available cache statuses are:
- "Disabled" - (default) - no cache is active for this dataset
- "Queued" - Spotlight is in the process of caching this dataset
- "Available" - a cached version of the dataset is available and in use
- "Outdated" - there has been a structural modification to the dataset since the last time it was cached, rendering the cache unusable even though it is available. Follow the "Refresh" steps below to make your cache available again.
Click on the status to open the cache configuration dialog. Inside the dialog:
- Cache a dataset by clicking the button. This will automatically toggle the "Use Cached Data" switch in that same dialog.
- Refresh an existing cached copy of a dataset by clicking the button. Toggle the "Enable Scheduler" switch in that same dialog to have Spotlight refresh the cached copy on a schedule you set.
- Disable caching for a dataset by switching off the "Use Cached Data" switch in the dialog. Make sure to also switch off the scheduler in that dialog if you no longer need to have the cached copy updated.
For datasets created in a Spotlight Workbench, you can access this same dialog by selecting the dataset in your Flow area and clicking the or buttons in the asset toolbox.
You can configure how this data file should be processed into a dataset. This includes specifying the field delimiter, new line delimiter, quoting character, and whether the data contains a header row. (Only available for datasets created from a file upload.)
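As a rough illustration of what these parsing settings control, the sketch below reads a small delimited text with Python's standard `csv` module. It is not Spotlight's implementation or API: the function name `parse_upload` and the generated `column_N` names are hypothetical and exist only for this example. Python's `csv` reader detects line endings on its own, so the new line delimiter is not a parameter here.

```python
import csv
import io


def parse_upload(text, delimiter=",", quotechar='"', has_header=True):
    """Split delimited text into (header, rows) using explicit settings.

    Hypothetical helper for illustration; not part of Spotlight.
    """
    # newline="" leaves line endings untouched so the csv module can
    # recognise "\n" and "\r\n" on its own.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    rows = list(reader)
    if not rows:
        return [], []
    if has_header:
        return rows[0], rows[1:]
    # Without a header row, invent placeholder column names (an assumption).
    header = [f"column_{i + 1}" for i in range(len(rows[0]))]
    return header, rows


header, rows = parse_upload("id;name\n1;'Smith; Jane'\n", delimiter=";", quotechar="'")
print(header)  # ['id', 'name']
print(rows)    # [['1', 'Smith; Jane']]
```

Choosing the wrong delimiter or quoting character in this example would split `Smith; Jane` into two fields, which is the kind of problem these settings are meant to prevent.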
A dataset's default visibility setting depends on how it is created. File uploads are visible to everyone. Datasets created in a Workbench are only visible to collaborators. Use the button on the details page to change these settings.
Datasets from a connected data system are governed by the connection's visibility setting.
The current setting for a dataset is visible in its "Visible to" metadata field. For datasets using the Specific people setting, this field will include the list of allowed users.
The visibility settings are:
- "Everyone at your company" - Anyone with a Spotlight account will have metadata visibility for the dataset.
- "Specific people" - A dialog will open when you select this option enabling you to search for and select other Spotlight users. Only the owner and the users selected in this dialog will have metadata visibility for the dataset.
- "Nobody except you" - Only the owner has metadata visibility for the dataset. Datasets created in a Spotlight Workbench do not have this option because they are always visible to all collaborators in that Workspace.
Only the dataset's owner or a Spotlight administrator can modify metadata visibility settings. Note that administrators have metadata visibility for all assets in Spotlight.
Delete a dataset
Datasets that are referenced in other Workspaces, or that represent external database tables, cannot be deleted in Spotlight.
To completely remove a dataset from Spotlight:
- Open the dataset's detail page
- Click the button
- Choose the "Delete" option
- Confirm that you want to delete the dataset in the warning message that opens
The dataset is now deleted from Spotlight.
An error will inform you if the dataset you are trying to delete is still referenced in other Workspaces.
Spotlight has a default maximum dataset size of 3 million records. The limit applies both when adding new tables and when a dataset grows as a result of an operation (see Supported data: Data size for details).
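To see how an operation can push a dataset past the limit, consider the worst case when two datasets are combined (for example, by a join): every row on one side matches every row on the other, so the result has as many records as the product of the two inputs. The sketch below does that arithmetic. The 3 million figure is the default stated above. The helper names are hypothetical, and treating a dataset of exactly 3 million records as within the limit is an assumption.

```python
DEFAULT_MAX_RECORDS = 3_000_000  # default limit stated in the text


def exceeds_limit(record_count, max_records=DEFAULT_MAX_RECORDS):
    """Return True if record_count is above the limit (assumed inclusive)."""
    return record_count > max_records


def worst_case_join_size(left_records, right_records):
    """Upper bound on join output: every left row matches every right row."""
    return left_records * right_records


left, right = 2_000, 2_000
bound = worst_case_join_size(left, right)
print(bound)                 # 4000000
print(exceeds_limit(bound))  # True
print(exceeds_limit(left))   # False
```

Two inputs of only 2,000 records each are far below the limit, yet their worst-case combination is not.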
The dataset detail page displays user and machine-created metadata to help you better understand where a dataset comes from, what it contains, and how it is being used in Spotlight.
Names can be up to 100 characters, cannot start with an underscore (_) or a space, and cannot contain a backtick (`).
Use up to 400 characters to provide as much description as possible about the dataset. The more a dataset is documented, the more useful it becomes for collaboration.
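The naming rules above can be expressed as a small check, which may be handy when generating dataset names in a script before entering them in Spotlight. This is a minimal sketch, not a Spotlight API: the function name `name_problems` is hypothetical, and rules not stated above (for example, whether an empty name is allowed) are deliberately not checked.

```python
MAX_NAME_LENGTH = 100  # maximum name length stated in the text


def name_problems(name):
    """Return a list of rule violations for a proposed dataset name."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 100 characters")
    if name.startswith("_"):
        problems.append("starts with an underscore")
    if name.startswith(" "):
        problems.append("starts with a space")
    if "`" in name:
        problems.append("contains a backtick")
    return problems


print(name_problems("Quarterly sales"))  # []
print(name_problems("_tmp`copy"))        # ['starts with an underscore', 'contains a backtick']
```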
Properties help you capture the state and features of an asset as structured data. This includes "Status" (listed below) and other values that are customized for each asset type by your organization. Common properties include the person responsible for an asset's maintenance, how often the asset is supposed to be updated, and whether it has been trusted by a data steward. Using properties makes it easier to search for and categorize an asset (see "Find Assets: Find by other asset features").
Select one of the options from this drop-down menu to indicate the current status of the dataset. The status will be prominently displayed on the detail page and in search results.
Lists who has metadata visibility to this dataset in Spotlight (see Concepts of Spotlight: Visibility and Configure Visibility above). For datasets using the Specific people setting, this field will include the list of allowed users.
The ten-row table can be scrolled horizontally or expanded with the button to view the dataset's full 1000-row data preview.
Shows all Workspaces that reference the dataset. If the dataset is referenced from multiple Workspaces, a clickable number will appear in this section. Click it to open the full list of Workspaces referencing this dataset.
A dataset's owner is either the owner of the connection containing the dataset or the person who uploaded the dataset file to Spotlight. The owner has the ability to change basic information about the dataset like its name and description, to delete it (datasets from databases or data warehouses cannot be deleted), and to transfer ownership (user-uploaded files from local computers or S3 cannot be transferred).
Indicates how this data got into Spotlight. For data that lives on a connected external system, you will see the type of system. For uploaded files, you will see which user originally uploaded the file to Spotlight. For datasets created in a Spotlight Workspace, you will see a list of all the upstream datasets that feed information into this dataset.
Tags help quickly identify, search, and categorize assets in Spotlight. Add some here to make this dataset easier to find and easier to tell apart from the rest of your related work. A tag can be up to 80 characters (spaces are not allowed). Clicking on a tag launches a search for all assets with that tag in Spotlight.
Displays the dates on which the dataset was created and last modified. A dataset is considered to have been modified whenever the title or description is edited. The last modified date is used to sort some search results.
Shows the file type for this dataset, based on the file's media type (previously MIME type). Only shown for file uploads.
Shows the number of users following the dataset. It also works as a toggle button that lets you follow or unfollow the dataset.
Displays the initials of the five users who most frequently open or edit the dataset. Hover over these initials to see each user's full name.
This section uses graphics to indicate the number of times a dataset is opened or edited directly via Spotlight or an external tool (JDBC or ODBC queries). Queries by all Spotlight users are included in this count. Click the Direct square to toggle a histogram display of dataset queries.
Click the Update square to toggle a histogram display of the number of changes to the dataset operations stack and metadata. You can hover over a histogram bar to see the per-day count. The timeline pulldown lets you set histogram display for the last 7, 30, or 90 days.