Files in CSV, TSV (column delimited), XLS, JSON, and Parquet formats become datasets in Neebo. Technically, because all data in Neebo is virtualized, every data asset in Neebo is a "dataset" - a reference to another data asset. A dataset is Workspace-specific - it can only exist inside one Workspace.
However, a central principle in Neebo is that when data is first added, to either the tool or to a Workspace, it is a non-editable data source. A data source is a specific type of dataset that represents the external data source (like a table in a database) and can be referenced from multiple Workspaces. Data source icons differ according to their connection type (S3, PostgreSQL, etc.).
Only when you create a reference to the data source does it become editable, as a dataset. Dataset icons are the same regardless of their data type. Any Neebo dataset can in turn be used as a data source in another Workspace, in which case a new copy is created that does not reference or reflect the content of any other datasets.
To create a dataset, right-click on the desired data source or dataset in the Flow area of the Workbench and choose "Create Reference" from the context menu. The new dataset will be highlighted, and you will see arrows that indicate the upstream and downstream reference connections. You can create an unlimited number of datasets from any given data source or dataset.
In the example shown below, "Chromosome X1" and "Blue Eyes" are both data sources that have been added to the Workspace. "Chromosome X1" happens to be a new data connection added to Neebo, whereas "Blue Eyes" is a dataset that already existed in Neebo. When the "Blue Eyes" dataset was added to the Workspace, it became a data source reference there. Regardless of the origin, as data sources they are functionally the same.
A dataset can be renamed at any time, from its Details page. It is useful to establish a naming convention and use descriptions as well as tags to identify and distinguish datasets.
By default, Neebo names downstream datasets by appending hierarchical, sequential numbering, where the original is implicitly number 1. This numbering is helpful for tracking data lineage, as shown below. Dataset "Blue Eyes 2" is created directly from the data source "Blue Eyes". The datasets created from "Blue Eyes 2" are named "Blue Eyes 2 2," "Blue Eyes 2 3," and "Blue Eyes 2 4." The dataset created from "Blue Eyes 2 4" is in turn named "Blue Eyes 2 4 2".
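The default naming scheme can be sketched in a few lines of Python. This is only an illustration of the pattern described above, not Neebo's actual implementation: the function names `next_reference_name` and `lineage` are hypothetical, and the choice to reuse the lowest unused number is an assumption (the guide does not say what happens to numbering after a dataset is deleted).

```python
from __future__ import annotations


def next_reference_name(parent: str, existing: set[str]) -> str:
    """Return the default name for a new dataset created from `parent`.

    The parent is the implicit number 1, so the first reference gets 2.
    Assumption: the lowest number not already in use is chosen.
    """
    n = 2
    while f"{parent} {n}" in existing:
        n += 1
    return f"{parent} {n}"


def lineage(name: str, root: str) -> list[str]:
    """Rebuild the chain of upstream names from a default-style name.

    `root` is the data source name; it must be passed in because a
    data source name may itself contain numbers.
    """
    if not name.startswith(root):
        raise ValueError(f"{name!r} does not descend from {root!r}")
    chain = [root]
    for part in name[len(root):].split():
        chain.append(f"{chain[-1]} {part}")
    return chain


print(next_reference_name("Blue Eyes", {"Blue Eyes"}))  # Blue Eyes 2
print(lineage("Blue Eyes 2 4 2", "Blue Eyes"))
```

Note that lineage can only be read from the name while the default names are kept; once a dataset is renamed, use the reference arrows in the Flow area instead.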
Deleting and removing datasets
When a dataset is removed from a Workspace, it still exists in Neebo and can be added to Workspaces in the future. Only the owner or a collaborator can remove or delete a dataset. To remove a dataset from a Workspace, select the dataset in the Workbench Flow area and choose "Remove" from its context menu.
When a dataset is deleted, it no longer exists in Neebo. To delete a dataset, go to its Details page, click the rightmost button, and choose "Delete." If the dataset has no downstream dependencies, you will be prompted to confirm the action. If the dataset is referenced in one or more other Workspaces, it must be removed from those Workspaces before it can be deleted. Note that removing a dataset from a Workspace removes only that specific dataset reference. A data source cannot be deleted.
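The deletion rules above can be summarized as a small check. The sketch below is a hypothetical model, not a Neebo API: the `Dataset` class, its field names, and `delete_blockers` are invented for illustration. The guide does not say what happens when a dataset does have downstream dependencies, so that case is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical model of the attributes the deletion rules depend on.
    name: str
    owner: str
    is_data_source: bool = False
    collaborators: set = field(default_factory=set)
    # Other Workspaces that still reference this dataset.
    referenced_in: set = field(default_factory=set)


def delete_blockers(ds: Dataset, user: str) -> list:
    """Return the reasons `user` cannot delete `ds` (empty list if none)."""
    blockers = []
    if user != ds.owner and user not in ds.collaborators:
        blockers.append("only the owner or a collaborator can delete a dataset")
    if ds.is_data_source:
        blockers.append("a data source cannot be deleted")
    if ds.referenced_in:
        workspaces = ", ".join(sorted(ds.referenced_in))
        blockers.append(f"remove the dataset from these Workspaces first: {workspaces}")
    return blockers


ds = Dataset("Blue Eyes 2", owner="ana", referenced_in={"Genetics QA"})
for reason in delete_blockers(ds, "ana"):
    print(reason)
```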
Every data source or dataset has a Details page that provides important metadata such as the Owner (the person who first added the data to Neebo) and the date and time the dataset was Created and Last Modified.
There is also a five-row, horizontally scrollable preview table that can be expanded to scroll vertically through all rows with the "View in Fullscreen" button. You can add Comments to a dataset at any time by clicking on the "plus" button on that panel.
The breadcrumb path in the header shows the Workspace name, then (separated by a > character) the dataset name, so you can identify the specific dataset you are viewing.
The Details page also provides a set of buttons that allow you to add a dataset to one or more other Workspaces, open the dataset in the Workbench for the current Workspace, or delete the dataset.
The name field is editable. A dataset name can be 1 to 100 characters long and must be unique.
(Please provide some information)
Use this 400-character description field to provide as much description as possible about the dataset. Help other users understand the data asset and how they may be able to use it!
Tags are searchable attributes, intended for categorization and discovery, that can be applied to datasets and Workspaces. A tag can be up to 80 characters long and cannot contain spaces. Clicking on a Workspace tag starts a search for that tag.
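If you prepare dataset names, descriptions, and tags outside Neebo (for example, in a spreadsheet of planned naming conventions), you can check them against the limits described above before entering them. The following is a minimal sketch; `validate_metadata` is a hypothetical helper, not part of Neebo, and the scope of name uniqueness is an assumption, so the set of existing names is supplied by the caller.

```python
from __future__ import annotations

NAME_MAX = 100         # dataset name: 1 to 100 characters
DESCRIPTION_MAX = 400  # description field: up to 400 characters
TAG_MAX = 80           # tag: up to 80 characters, no spaces


def validate_metadata(name: str, description: str, tags: list[str],
                      existing_names: set[str]) -> list[str]:
    """Return a list of problems; an empty list means all limits are met."""
    problems = []
    if not 1 <= len(name) <= NAME_MAX:
        problems.append(f"name must be 1 to {NAME_MAX} characters")
    if name in existing_names:
        problems.append(f"name {name!r} is already in use")
    if len(description) > DESCRIPTION_MAX:
        problems.append(f"description exceeds {DESCRIPTION_MAX} characters")
    for tag in tags:
        if len(tag) > TAG_MAX:
            problems.append(f"tag {tag!r} exceeds {TAG_MAX} characters")
        if " " in tag:
            problems.append(f"tag {tag!r} contains a space")
    return problems


print(validate_metadata("Blue Eyes 2", "Eye colour samples",
                        ["eye colour"], {"Blue Eyes"}))
```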
Shows the Workspace to which the selected dataset belongs.
Indicates the type of the data source connection.
Shows the number of users following the dataset and is also a toggle button.
Displays the initials of the five users who most frequently open or edit the dataset.
This section uses graphics to indicate a dataset's popularity, as gauged by how many times the dataset was referenced or queried by all Neebo users. Click the Direct square to toggle the histogram display of dataset queries. (The query count includes each time a dataset is opened or edited directly via Neebo or an external tool (JDBC, ODBC), but not queries of dependent datasets upstream.) Click the Update square to toggle the histogram display of the number of changes to the dataset operations stack and metadata. You can hover over a histogram bar to see the per-day count. The timeline pulldown lets you set the histogram display to the last 7, 30, or 90 days.
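To make the per-day counts concrete, the sketch below shows one way such a histogram could be computed from a list of query or update dates. It is an illustration only: `daily_histogram` is a hypothetical function, and treating the window as ending on (and including) the current day is an assumption, since the guide does not define the window boundaries.

```python
from collections import Counter
from datetime import date, timedelta

ALLOWED_WINDOWS = (7, 30, 90)  # the options in the timeline pulldown


def daily_histogram(event_dates, today, window_days=7):
    """Return [(day, count), ...] for each day in the window, oldest first.

    Assumption: the window covers `window_days` days ending on `today`.
    """
    if window_days not in ALLOWED_WINDOWS:
        raise ValueError(f"window must be one of {ALLOWED_WINDOWS}")
    start = today - timedelta(days=window_days - 1)
    counts = Counter(d for d in event_dates if start <= d <= today)
    days = (start + timedelta(days=i) for i in range(window_days))
    return [(day, counts.get(day, 0)) for day in days]


events = [date(2024, 1, 10), date(2024, 1, 10), date(2024, 1, 4)]
for day, count in daily_histogram(events, today=date(2024, 1, 10)):
    print(day.isoformat(), "#" * count)
```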
Click the plus sign to enter comments pertaining to the dataset.
An important feature in Neebo is the ability to cache a dataset, which creates a temporary copy that allows Neebo to access data without revisiting the source system. Caching can be scheduled in order to optimize system resources based on priority, compute availability, efficiency, etc. You can use cached data for as long as you like, and you can update the data from the source either on demand or by setting a schedule.
When a dataset is cached, the status link under "Cache" will read AVAILABLE. When you click on the link, the cache is refreshed. When the dataset is not cached (the default), the status link will read DISABLED; when you click on the link, caching is enabled. Remember that for data sources that require credentials, those must be reentered when a cache is updated.
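The behavior of the Cache status link can be pictured as a two-state toggle. The sketch below is a simplified model, not Neebo code: `CacheState` and `click_status_link` are hypothetical names, the status is assumed to switch to AVAILABLE as soon as caching is enabled, and enabling caching is assumed to count as a cache update for the purposes of the credentials requirement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheState:
    status: str = "DISABLED"  # DISABLED is the default (not cached)
    refresh_count: int = 0


def click_status_link(state: CacheState,
                      requires_credentials: bool = False,
                      credentials: Optional[str] = None) -> CacheState:
    """Model one click on the status link under "Cache"."""
    # Data sources that require credentials need them reentered
    # whenever the cache is updated.
    if requires_credentials and credentials is None:
        raise PermissionError("credentials must be reentered to update the cache")
    if state.status == "DISABLED":
        # Clicking DISABLED enables caching (assumed to be immediate here).
        state.status = "AVAILABLE"
    else:
        # Clicking AVAILABLE refreshes the cache from the source.
        state.refresh_count += 1
    return state


state = CacheState()
click_status_link(state)  # enables caching
click_status_link(state)  # refreshes the cache
print(state)
```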