Operations enable you to combine, filter, and transform your various sources of data into the new datasets you need for analysis.
Operations are applied in the Workbench of a Workspace and can only be applied to datasets created in that Workbench, not to references (What is a reference?). All operations already applied to a dataset are displayed in a stack in the asset toolbox, with the most recent on top.
Add a new operation to your active dataset (the one colored green in the Flow) by clicking the green button in the right-hand asset toolbox area and selecting the operation you want from the dialog that opens. You can also add an operation to any dataset in your Flow by right-clicking it and selecting the "Add Operation" option from the context menu.
Delete an operation from the stack by moving your cursor over the operation you wish to delete and clicking the button that will appear to the right. Because each operation depends on the output of the operations below it, only the top operation can be deleted. If you need to delete an operation further down the stack, you must first delete the operations on top of it.
Examine an existing operation by clicking on it in the stack of operations underneath the button in the right-hand asset toolbox area. The most recently applied operations are on the top, the oldest on the bottom.
Modify an operation by clicking on it in the stack and changing the existing values. Note that changes will fail if they would cause errors in operations higher up the stack.
Each Spotlight operation is documented in the relevant section below.
Combine with the contents of a second dataset based on shared content in one or more columns.
Use the Sources section to select another dataset and configure how the data from the two will be combined. By default Spotlight uses an "Outer Left" blend mode which will include all information from your starting dataset and any content from the second dataset that has an entry in the column(s) you select in the Columns section. More info on these options is available at W3Schools.
Once you configure the second dataset Spotlight will suggest column options underneath the Flow area and provide previews of the resulting combinations.
The Columns section lets you choose which columns between the two datasets overlap and should be used in combining the two datasets. You can select one of the column pairs Spotlight suggests or configure columns manually by selecting names from the drop down menus and clicking the small "+" button to the right of these drop down menus.
Add multiple column pairings to refine your results. Only rows with content that overlaps in all your specified column pairs will be used in combining the two datasets.
Note that when system resources are being heavily used, Spotlight may occasionally be unable to generate a preview of an operation.
Use a formula to create a new column from existing data.
The formula can contain any standard SQL or SparkSQL functions, including basic mathematical operations like *, -, +, etc, as well as logical, date-related, and other more complex functions. You must provide a name for the new column you are creating.
Column names are case sensitive but SQL function names are not, so "
DateDiff", and "
DATEDIFF" will all work equally well. Column names containing spaces, dashes, or other special characters should be quoted with ` (backtick) marks.
Using this data:
A formula of
DaysOpen as the new column name produces:
If your formula uses a function where results are based on more than one row of data (ex.
count()), make sure to check the Aggregate box. By default, aggregate functions are applied to the entire dataset, and will create a column with the same global aggregate result for each row.
Use the Group By menu to calculate and display results based on the values in one or more specific columns. Click the "+" next to the drop down once you have selected a column. You can add multiple Group By columns.
Using this data:
A formula of
avg(Price) with new column name
Average with the Aggregate box checked produces:
If you instead select
Item column in the Group By menu and press the "+" button next to it, then the
avg(Price) formula produces:
If you add both the
Region columns using the Group By menu, then the
avg(Price) formula produces:
|Who||Item||Region||Price||Product Regional Average|
Create a new column from a fixed slice of characters in one of your existing columns.
The Start Index you enter is counted from the left and specifies how many characters to skip before starting the content of your new column. To start with the first character in your column, set the Start Index to 0.
The Length you enter is the number of characters after the start index to copy into your new column.
You can also set the name for the newly created column. By default this name will be the name of your source column with "_extract" added to the end.
Start Index of 4, a
Length of 2, and a new column name of "Month" would transform:
Extract JSON Object
Extract JSON Object#
Create a new column from the values of a single key/value pair in a JSON object.
Select a column from the list and the dialog will presents you a menu of each key/value pair in the JSON object inside that column. The column must contain only a JSON object. If there is additional information in your column, the "Split" and "Extract" operations may help you isolate the JSON object content. If you wish to create columns for every key/value pair in your object, use "Expand JSON Object" instead.
When you hit a new column will be created named after the key you selected and filled with the key's key/value pair.
And selecting "Home" from the list of objects in the dialog would produce:
Expand JSON Object
Expand JSON Object#
Create a new column for each key in your JSON object's key/value pair(s) and fills the column with the pair value.
The column must contain only a JSON object. If there is additional information in your column, the "Split" and "Extract" operations may help you isolate the JSON object content. If you wish to pull out only a single key/value pair from your JSON object, use "Extract JSON Object" instead.
Explode JSON Array
Explode JSON Array#
Create a new column and enough duplicate rows so that each item from the JSON Array is filled into the new column once. This new column will be named "col".
"Explode JSON Array" transforms the following data:
Filter the rows in your dataset based on their content. Filter operations can include a single filter condition or multiple conditions.
Each filter condition runs against a single column and requires you to select one of the available operators (Equals, Not Equals, Contains, Does Not Contain, Starts With, Does Not Start With, Ends With, Does Not End With, Is Empty, Is Not Empty, In, Not In) and provide a value to filter against.
Values are case sensitive and do not accept wildcards or regular expressions. The filter operation can also be added by right-clicking on a column in the data preview area and selecting "filter" from the context menu. In that case, the column will be preselected in the filter operation dialog.
Once you set a filter condition and click Apply, you are returned to the first screen in the filter dialogue where you can choose to apply the filter operation as configured or to add another filter condition.
To add another condition, select a column from the list. This can be any column, even one you have previously used in this operation. If you configure multiple filter conditions, the resulting dataset will contain rows that satisfy all those criteria (e.g. "criteria_1 AND criteria_2").
Using the following data:
A filter on the
Type column for
Not Equals and
cat would result in a dataset containing only:
A filter on the
Name column for
i would result in a dataset containing only:
Combining both filters would result in a dataset containing only:
Control which columns are included in the dataset.
Checked names are included, un-checked one are excluded. All columns are checked by default. The Find columns search box at the top of the dialog filters the list of visible columns so you can quickly locate columns in large datasets. Un-check the Select all box to uncheck all columns. Checking Select all will re-select all columns.
Use a basic search and replace function to create a modified copy of an existing column.
The search and replace is case sensitive but does not accept wild cards or regular expressions. You can also set the name for the newly created column. By default this name will be the name of your source column with "_replace" added to the end.
If you need to search for a string containing spaces, escape them in the Current Value field with a
\. You do not need to escape spaces in your replacement text.
Searching for "
m" and replacing it with "
M" would transform:
Copy one column into two or more columns broken up by a pattern you specify. The split operation can also be added by right-clicking on a column in the data preview area and selecting "Split" from the context menu. In that case, the column will be preselected in the split operation dialog.
The pattern for your deliminator is case sensitive but does not accept wildcards. Regular expressions are supported if you check the Regular Expression box underneath the deliminator text entry box.
If your data contains multiple instances of the deliminator, you may want to increase the column Limit, which controls how many columns will be created. The default Limit is 2, which is also the minimum. Any elements from your original column beyond the column Limit you specify will be dropped. If you set a higher Limit than there are occurrences of your deliminator in the data, empty columns will be created.
You can enter a custom name for newly created columns or accept the default, which is the the source column name appended with "_split". Each new column name, whether custom or default, is sequentially appended with a number, from 1 to the column limit.
Using this data and a new column name of "IP_Address_Octet":
Splitting on the
. deliminator with the default Limit of
Splitting on the
. deliminator with a Limit of
Whatever you put together here will apply to your entire dataset. If you select individual columns, those will be the only ones in the resulting dataset. If you use aggregating functions like
COUNT(), your resulting data may contain only a single value. If you wish to use these aggregate functions, or any SQL functions, to create a new column without otherwise transforming your existing table, look at the "Derive Column" operation instead.
Column names are case sensitive but SQL function names are not so
select will all work equally well. Each SQL operation must have at least one reference to
_current_view as the current table name. Column names containing spaces, dashes, or other special characters should be quoted with ` (backtick) marks.
With this data:
|Joel S.||[email protected]|
|Melody M.||[email protected]|
|Dr. Fitch||[email protected]|
|Ms. Florence||[email protected]|
SELECT * FROM _current_view WHERE Name NOT LIKE "% %"
SELECT DISTINCT Mail FROM _current_view
SELECT COUNT(DISTINCT Mail) FROM _current_view
Supported column names#
A column name in Spotlight may contain any character except a
` (backtick). Note that column names inside the same dataset must be unique and are not case sensitive (e.g. column names "State" and "STATE" cannot both be used in the same dataset).
Dataset size limits#
Spotlight has a default maximum dataset size of 3 million records for datasets that applies both when adding new tables and when a dataset grows as a result of an operation. Datsets above this limit are flagged with an alert on search cards and on their detail pages.
Dataset that exceed the record count limit will still appear in Spotlight search results but cannot be referenced in Workspaces or opened in external tools through Spotlight.
If a dataset grows beyond the record count limit, it will stop feeding data to any downstream datasets that have been built from it in Spotlight.