
Operations

Operations enable you to combine, filter, and transform your various sources of data into the new datasets you need for analysis.

Operations are applied in the Workbench of a Workspace and can only be applied to datasets created in that Workbench, not to references (What is a reference?). All operations already applied to a dataset are displayed in a stack in the asset toolbox, with the most recent on top.

Manage Operations#

Add a new operation to your active dataset (the one colored green in the Flow) by clicking the green Add operation button in the right-hand asset toolbox area and selecting the operation you want from the dialog that opens. You can also add an operation to any dataset in your Flow by right-clicking it and selecting the "Add Operation" option from the context menu.

Delete an operation from the stack by moving your cursor over the operation you wish to delete and clicking the trashcan button that appears to the right. Because each operation depends on the output of the operations below it, only the top operation can be deleted. If you need to delete an operation further down the stack, you must first delete the operations on top of it.

Examine an existing operation by clicking on it in the stack of operations underneath the Add operation button in the right-hand asset toolbox area. The most recently applied operations are at the top, the oldest at the bottom.

Modify an operation by clicking on it in the stack and changing the existing values. Note that changes will fail if they would cause errors in operations higher up the stack.

Operations#

Each Spotlight operation is documented in the relevant section below.


Blend#

Combine with the contents of a second dataset based on shared content in one or more columns.

Use the Sources section to select another dataset and configure how the data from the two will be combined. By default, Spotlight uses an "Outer Left" blend mode, which includes all information from your starting dataset plus any content from the second dataset that has a matching entry in the column(s) you select in the Columns section. More info on these options is available at W3Schools.

Once you configure the second dataset, Spotlight will suggest column options underneath the Flow area and provide previews of the resulting combinations.

The Columns section lets you choose which columns in the two datasets overlap and should be used to combine them. You can select one of the column pairs Spotlight suggests or configure columns manually by selecting names from the drop-down menus and clicking the small "+" button to their right.

Add multiple column pairings to refine your results. Only rows with content that overlaps in all your specified column pairs will be used in combining the two datasets.
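To make the blend behavior concrete, here is a small Python sketch of an "Outer Left" blend on a single shared column. The `blend_outer_left` helper and the sample rows are hypothetical illustrations, not Spotlight's actual implementation:

```python
# Hypothetical sketch of an "Outer Left" blend: every row of the starting
# (left) dataset is kept; rows from the second (right) dataset contribute
# their columns wherever the key column matches.
def blend_outer_left(left_rows, right_rows, key):
    index = {row[key]: row for row in right_rows}
    blended = []
    for row in left_rows:
        match = index.get(row[key], {})
        extra = {k: v for k, v in match.items() if k != key}
        blended.append({**row, **extra})
    return blended

orders = [{"id": 1, "item": "Pear"}, {"id": 2, "item": "Apple"}]
buyers = [{"id": 1, "who": "Kim"}]
result = blend_outer_left(orders, buyers, "id")
# The row with id 2 has no match in buyers, so it keeps only its
# original columns, as the Outer Left mode prescribes.
```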

Note that when system resources are being heavily used, Spotlight may occasionally be unable to generate a preview of an operation.


Derive Columns#

Use a formula to create a new column from existing data.

The formula can contain any standard SQL or SparkSQL functions, including basic mathematical operations like *, -, and +, as well as logical, date-related, and other more complex functions. You must provide a name for the new column you are creating.

Column names are case sensitive but SQL function names are not, so "datediff", "DateDiff", and "DATEDIFF" will all work equally well. Column names containing spaces, dashes, or other special characters should be quoted with ` (backtick) marks.

Example 1

Using this data:

| Opened | Closed |
| --- | --- |
| 2019-01-01 | 2019-02-23 |
| 2019-03-04 | 2019-03-10 |

A formula of datediff(Closed,Opened) with DaysOpen as the new column name produces:

| Opened | Closed | DaysOpen |
| --- | --- | --- |
| 2019-01-01 | 2019-02-23 | 53 |
| 2019-03-04 | 2019-03-10 | 6 |

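The Example 1 result can be reproduced outside Spotlight with a few lines of standard-library Python (a sketch, not Spotlight itself); SparkSQL's datediff(end, start) is equivalent to subtracting the two dates:

```python
from datetime import date

def datediff(end: str, start: str) -> int:
    # Mirrors SparkSQL's datediff(end, start): number of days from start to end.
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

rows = [
    {"Opened": "2019-01-01", "Closed": "2019-02-23"},
    {"Opened": "2019-03-04", "Closed": "2019-03-10"},
]
for row in rows:
    row["DaysOpen"] = datediff(row["Closed"], row["Opened"])
# → DaysOpen values 53 and 6, matching the table above
```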
Aggregate Functions

If your formula uses a function where results are based on more than one row of data (ex. min() or count()), make sure to check the Aggregate box. By default, aggregate functions are applied to the entire dataset, and will create a column with the same global aggregate result for each row.

Use the Group By menu to calculate and display results based on the values in one or more specific columns. Click the "+" next to the drop down once you have selected a column. You can add multiple Group By columns.

Example 2

Using this data:

| Who | Item | Region | Price |
| --- | --- | --- | --- |
| Kim | Pear | North | 5 |
| Kim | Apple | North | 9 |
| Eve | Pear | North | 11 |
| Eve | Apple | North | 2 |
| Kim | Pear | South | 3 |
| Kim | Apple | South | 7 |
| Eve | Pear | South | 9 |
| Eve | Apple | South | 12 |

A formula of avg(Price) with new column name Average with the Aggregate box checked produces:

| Who | Item | Region | Price | Average |
| --- | --- | --- | --- | --- |
| Kim | Pear | North | 5 | 7.25 |
| Kim | Apple | North | 9 | 7.25 |
| Eve | Pear | North | 11 | 7.25 |
| Eve | Apple | North | 2 | 7.25 |
| Kim | Pear | South | 3 | 7.25 |
| Kim | Apple | South | 7 | 7.25 |
| Eve | Pear | South | 9 | 7.25 |
| Eve | Apple | South | 12 | 7.25 |

If you instead select the Item column in the Group By menu and press the "+" button next to it, the avg(Price) formula (with the new column named Product Average) produces:

| Who | Item | Region | Price | Product Average |
| --- | --- | --- | --- | --- |
| Kim | Pear | North | 5 | 7.0 |
| Kim | Apple | North | 9 | 7.5 |
| Eve | Pear | North | 11 | 7.0 |
| Eve | Apple | North | 2 | 7.5 |
| Kim | Pear | South | 3 | 7.0 |
| Kim | Apple | South | 7 | 7.5 |
| Eve | Pear | South | 9 | 7.0 |
| Eve | Apple | South | 12 | 7.5 |

If you add both the Item and Region columns using the Group By menu, the avg(Price) formula (with the new column named Product Regional Average) produces:

| Who | Item | Region | Price | Product Regional Average |
| --- | --- | --- | --- | --- |
| Kim | Pear | North | 5 | 8.0 |
| Kim | Apple | North | 9 | 5.5 |
| Eve | Pear | North | 11 | 8.0 |
| Eve | Apple | North | 2 | 5.5 |
| Kim | Pear | South | 3 | 6.0 |
| Kim | Apple | South | 7 | 9.5 |
| Eve | Pear | South | 9 | 6.0 |
| Eve | Apple | South | 12 | 9.5 |

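To see the Group By mechanics outside Spotlight, this Python sketch (a hypothetical `grouped_avg` helper, not Spotlight's engine) computes avg(Price) grouped by Item and Region for the data above:

```python
from collections import defaultdict

rows = [
    {"Who": "Kim", "Item": "Pear",  "Region": "North", "Price": 5},
    {"Who": "Kim", "Item": "Apple", "Region": "North", "Price": 9},
    {"Who": "Eve", "Item": "Pear",  "Region": "North", "Price": 11},
    {"Who": "Eve", "Item": "Apple", "Region": "North", "Price": 2},
    {"Who": "Kim", "Item": "Pear",  "Region": "South", "Price": 3},
    {"Who": "Kim", "Item": "Apple", "Region": "South", "Price": 7},
    {"Who": "Eve", "Item": "Pear",  "Region": "South", "Price": 9},
    {"Who": "Eve", "Item": "Apple", "Region": "South", "Price": 12},
]

def grouped_avg(rows, value_col, group_cols):
    # Sum and count per group key, then divide to get each group's average.
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        totals[key][0] += row[value_col]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

averages = grouped_avg(rows, "Price", ["Item", "Region"])
for row in rows:
    # Every row in a group receives the same aggregate result.
    row["Product Regional Average"] = averages[(row["Item"], row["Region"])]
```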

Extract#

Create a new column from a fixed slice of characters in one of your existing columns.

The Start Index you enter is counted from the left and specifies how many characters to skip before starting the content of your new column. To start with the first character in your column, set the Start Index to 0.

The Length you enter is the number of characters after the start index to copy into your new column.

You can also set the name for the newly created column. By default this name will be the name of your source column with "_extract" added to the end.

Specifying a Start Index of 4, a Length of 2, and a new column name of "Month" would transform:

| Date |
| --- |
| 20191208 |
| 20111130 |

into

| Date | Month |
| --- | --- |
| 20191208 | 12 |
| 20111130 | 11 |

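In string terms, Extract is an ordinary fixed slice. A minimal Python equivalent (illustrative only):

```python
def extract(value: str, start_index: int, length: int) -> str:
    # Skip start_index characters from the left, then copy length characters.
    return value[start_index:start_index + length]

extract("20191208", 4, 2)  # → "12"
extract("20111130", 4, 2)  # → "11"
```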

Extract JSON Object#

Create a new column from the values of a single key/value pair in a JSON object.

Select a column from the list and the dialog will present you with a menu of each key/value pair in the JSON object inside that column. The column must contain only a JSON object. If there is additional information in your column, the "Split" and "Extract" operations may help you isolate the JSON object content. If you wish to create columns for every key/value pair in your object, use "Expand JSON Object" instead.

When you click the Apply button, a new column is created, named after the key you selected and filled with that key's values.

Starting with:

| Phone |
| --- |
| {"Home":15558675309,"Work":1555999123} |
| {"Home":15553141592,"Work":15556535897} |

And selecting "Home" from the list of keys in the dialog would produce:

| Phone | Home |
| --- | --- |
| {"Home":15558675309,"Work":1555999123} | 15558675309 |
| {"Home":15553141592,"Work":15556535897} | 15553141592 |

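The extraction itself amounts to parsing the cell as JSON and picking one key. A Python sketch (the `extract_json_key` helper is hypothetical):

```python
import json

def extract_json_key(cell: str, key: str):
    # The cell must contain only a JSON object; anything else fails to parse.
    return json.loads(cell)[key]

extract_json_key('{"Home":15558675309,"Work":1555999123}', "Home")
# → 15558675309
```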

Expand JSON Object#

Create a new column for each key in your JSON object's key/value pair(s) and fill each column with the corresponding value.

The column must contain only a JSON object. If there is additional information in your column, the "Split" and "Extract" operations may help you isolate the JSON object content. If you wish to pull out only a single key/value pair from your JSON object, use "Extract JSON Object" instead.

Transforms:

| col |
| --- |
| {"User":1,"Name":"Anna","Customer":true} |
| {"User":2,"Name":"John","Customer":true} |

into:

| col | User | Name | Customer |
| --- | --- | --- | --- |
| {"User":2,"Name":"John","Customer":true} | 2 | John | true |
| {"User":1,"Name":"Anna","Customer":true} | 1 | Anna | true |

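In Python terms, expanding a JSON object is parsing each cell and merging its keys onto the row (a sketch, not Spotlight's implementation):

```python
import json

rows = [
    {"col": '{"User":1,"Name":"Anna","Customer":true}'},
    {"col": '{"User":2,"Name":"John","Customer":true}'},
]
for row in rows:
    # Each key in the object becomes its own column on the row.
    row.update(json.loads(row["col"]))
```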

Explode JSON Array#

Create a new column, duplicating rows as needed so that each item from the JSON array appears in the new column exactly once. This new column will be named "col".

The column must contain only a JSON array. If there is additional information in your column, the "Split" and "Extract" operations may help you isolate the JSON array content.

"Explode JSON Array" transforms the following data:

| array |
| --- |
| [1, 2] |

into:

| array | col |
| --- | --- |
| [1, 2] | 1 |
| [1, 2] | 2 |

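The row duplication can be sketched in Python (a hypothetical `explode` helper): one output row per array element, with the element placed in the new "col" column.

```python
import json

def explode(rows, column):
    # Duplicate each row once per element of its JSON array.
    exploded = []
    for row in rows:
        for item in json.loads(row[column]):
            exploded.append({**row, "col": item})
    return exploded

explode([{"array": "[1, 2]"}], "array")
# → [{"array": "[1, 2]", "col": 1}, {"array": "[1, 2]", "col": 2}]
```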

Filter#

Filter the rows in your dataset based on their content. Filter operations can include a single filter condition or multiple conditions.

Each filter condition runs against a single column and requires you to select one of the available operators (Equals, Not Equals, Contains, Does Not Contain, Starts With, Does Not Start With, Ends With, Does Not End With, Is Empty, Is Not Empty, In, Not In) and provide a value to filter against.

Values are case sensitive and do not accept wildcards or regular expressions. The filter operation can also be added by right-clicking on a column in the data preview area and selecting "filter" from the context menu. In that case, the column will be preselected in the filter operation dialog.

Once you set a filter condition and click Apply, you are returned to the first screen in the filter dialog, where you can choose to apply the filter operation as configured or add another filter condition.

To add another condition, select a column from the list. This can be any column, even one you have previously used in this operation. If you configure multiple filter conditions, the resulting dataset will contain rows that satisfy all those criteria (e.g. "criteria_1 AND criteria_2").

Example 1

Using the following data:

| Name | Type |
| --- | --- |
| Tom | horse |
| Ginger | dog |
| Max | cat |
| Kiki | cat |

A filter on the Type column for Not Equals and cat would result in a dataset containing only:

| Name | Type |
| --- | --- |
| Tom | horse |
| Ginger | dog |

A filter on the Name column for Contains and i would result in a dataset containing only:

| Name | Type |
| --- | --- |
| Ginger | dog |
| Kiki | cat |

Combining both filters would result in a dataset containing only:

| Name | Type |
| --- | --- |
| Ginger | dog |

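Combined filter conditions behave like an AND over the rows, as in this Python sketch of Example 1 (illustrative only):

```python
rows = [
    {"Name": "Tom",    "Type": "horse"},
    {"Name": "Ginger", "Type": "dog"},
    {"Name": "Max",    "Type": "cat"},
    {"Name": "Kiki",   "Type": "cat"},
]
# Conditions combine with AND; matching is case sensitive, no wildcards.
# Type Not Equals "cat" AND Name Contains "i":
kept = [r for r in rows if r["Type"] != "cat" and "i" in r["Name"]]
# → only the Ginger row satisfies both conditions
```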

Manage Columns#

Control which columns are included in the dataset.

Checked names are included, un-checked ones are excluded. All columns are checked by default. The Find columns search box at the top of the dialog filters the list of visible columns so you can quickly locate columns in large datasets. Un-check the Select all box to uncheck all columns. Checking Select all will re-select all columns.


Replace#

Use a basic search and replace function to create a modified copy of an existing column.

The search and replace is case sensitive but does not accept wild cards or regular expressions. You can also set the name for the newly created column. By default this name will be the name of your source column with "_replace" added to the end.

If you need to search for a string containing spaces, escape them in the Current Value field with a \. You do not need to escape spaces in your replacement text.

Searching for "m" and replacing it with "M" would transform:

| Title |
| --- |
| Ms. |
| mr. |
| Dr. |

into:

| Title | Title_replace |
| --- | --- |
| Ms. | Ms. |
| mr. | Mr. |
| Dr. | Dr. |

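The operation maps directly onto a case-sensitive literal replacement, as this Python sketch shows (illustrative, not Spotlight's implementation):

```python
rows = [{"Title": "Ms."}, {"Title": "mr."}, {"Title": "Dr."}]
for row in rows:
    # Case-sensitive literal replacement written to a new column; by default
    # the new column is the source name with "_replace" appended.
    row["Title_replace"] = row["Title"].replace("m", "M")
# "Ms." and "Dr." contain no lowercase "m", so only "mr." changes.
```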

Split#

Copy one column into two or more columns broken up by a pattern you specify. The split operation can also be added by right-clicking on a column in the data preview area and selecting "Split" from the context menu. In that case, the column will be preselected in the split operation dialog.

The pattern for your delimiter is case sensitive but does not accept wildcards. Regular expressions are supported if you check the Regular Expression box underneath the delimiter text entry box.

If your data contains multiple instances of the delimiter, you may want to increase the column Limit, which controls how many columns will be created. The default Limit is 2, which is also the minimum. Any elements from your original column beyond the column Limit you specify will be dropped. If you set a Limit higher than the number of elements your delimiter produces, the extra columns will be created empty.

You can enter a custom name for newly created columns or accept the default, which is the source column name appended with "_split". Each new column name, whether custom or default, is sequentially appended with a number, from 1 to the column Limit.

Using this data and a new column name of "IP_Address_Octet":

| Address |
| --- |
| 172.54.1.234 |
| 172.224.56.2 |

Splitting on the . delimiter with the default Limit of 2 produces:

| Address | IP_Address_Octet1 | IP_Address_Octet2 |
| --- | --- | --- |
| 172.54.1.234 | 172 | 54 |
| 172.224.56.2 | 172 | 224 |

Splitting on the . delimiter with a Limit of 5 produces:

| Address | IP_Address_Octet1 | IP_Address_Octet2 | IP_Address_Octet3 | IP_Address_Octet4 | IP_Address_Octet5 |
| --- | --- | --- | --- | --- | --- |
| 172.54.1.234 | 172 | 54 | 1 | 234 | |
| 172.224.56.2 | 172 | 224 | 56 | 2 | |

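The Limit semantics (drop extra elements, pad missing columns with empty values) can be sketched in Python with a hypothetical `split_column` helper:

```python
def split_column(value: str, delimiter: str, limit: int = 2):
    # Elements beyond the Limit are dropped; if the Limit exceeds the number
    # of elements, the remaining columns are left empty.
    parts = value.split(delimiter)[:limit]
    return parts + [""] * (limit - len(parts))

split_column("172.54.1.234", ".", 2)  # → ["172", "54"]
split_column("172.54.1.234", ".", 5)  # → ["172", "54", "1", "234", ""]
```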

SQL#

Transform your data using standard SQL or SparkSQL statements.

Whatever you put together here will apply to your entire dataset. If you select individual columns, those will be the only ones in the resulting dataset. If you use aggregating functions like MIN() or COUNT(), your resulting data may contain only a single value. If you wish to use these aggregate functions, or any SQL functions, to create a new column without otherwise transforming your existing table, look at the "Derive Columns" operation instead.

Column names are case sensitive but SQL function names are not, so SELECT, Select, and select will all work equally well. Each SQL operation must have at least one reference to _current_view as the current table name. Column names containing spaces, dashes, or other special characters should be quoted with ` (backtick) marks.

With this data:

| Name | Mail |
| --- | --- |
| Joel | [email protected] |
| Melody | [email protected] |
| Fitch | [email protected] |
| Florence | [email protected] |
| Joel S. | [email protected] |
| Melody M. | [email protected] |
| Dr. Fitch | [email protected] |
| Ms. Florence | [email protected] |

Using:

    SELECT * FROM _current_view
    WHERE Name NOT LIKE "% %"

Returns:

| Name | Mail |
| --- | --- |
| Joel | [email protected] |
| Melody | [email protected] |
| Fitch | [email protected] |
| Florence | [email protected] |

Using just:

    SELECT DISTINCT Mail FROM _current_view

Returns:

| Mail |
| --- |
| [email protected] |
| [email protected] |
| [email protected] |
| [email protected] |

Using just:

    SELECT COUNT(DISTINCT Mail) FROM _current_view

Returns:

| count(DISTINCT Mail) |
| --- |
| 4 |
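You can experiment with queries of this shape locally. The sketch below uses SQLite through Python's sqlite3 module with made-up sample rows; note that Spotlight itself runs SparkSQL against _current_view, and SQLite's dialect differs slightly (for example, string literals use single quotes):

```python
import sqlite3

# Hypothetical sample data standing in for the dataset; in Spotlight the
# table _current_view is provided for you.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE _current_view (Name TEXT, Mail TEXT)")
conn.executemany(
    "INSERT INTO _current_view VALUES (?, ?)",
    [
        ("Joel", "joel@example.com"), ("Joel S.", "joel@example.com"),
        ("Melody", "melody@example.com"), ("Melody M.", "melody@example.com"),
    ],
)
# Keep only names without spaces, as in the NOT LIKE "% %" example.
no_spaces = conn.execute(
    "SELECT Name FROM _current_view WHERE Name NOT LIKE '% %'"
).fetchall()
# Count the distinct addresses, as in the COUNT(DISTINCT Mail) example.
n_distinct = conn.execute(
    "SELECT COUNT(DISTINCT Mail) FROM _current_view"
).fetchone()[0]
```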

Supported column names#

A column name in Spotlight may contain any character except a ` (backtick). Note that column names inside the same dataset must be unique and are not case sensitive (e.g. column names "State" and "STATE" cannot both be used in the same dataset).

Dataset size limits#

Spotlight has a default maximum dataset size of 3 million records, which applies both when adding new tables and when a dataset grows as a result of an operation. Datasets above this limit are flagged with a yellow warning triangle on search cards and on their detail pages.

Datasets that exceed the record count limit will still appear in Spotlight search results but cannot be referenced in Workspaces or opened in external tools through Spotlight.

If a dataset grows beyond the record count limit, it will stop feeding data to any downstream datasets that have been built from it in Spotlight.