Skip to content

Sample Data#

Data sampling is a statistical analysis technique used to select, process, and analyze a representative subset of data from your dataset. The purpose is to identify patterns and trends in the larger dataset. Data sampling allows you to work with a small, manageable amount of data. This subset of the data is used to build analytical models faster and run faster than would be possible with the total amount of data.

Overview#

The Workbench provides information whether the full dats set or sample data is used. Find the information under the Flow Area. A visual indicator, that sample data is used is the '!' icon next to the dataset node within the Flow Area.

Example for full data:

The dataset contains 25 records in total. All records are displayed and used for creating views.

Example for sample data:

The dataset contains 20,000,000 records in total. 100,000 of the records are used to create sample data on which transformations are calculated and displayed.

Configuring Sample Data Via Custom Property#

For optimized performance and an accurate representation of your data, Datameer provides a default initial sample of 100,000 records when you start your analysis.

Datameer Admins can configure the settings for sample data usage individually. Find more information here.

Note: Increasing the values and therefore having larger samples may lower the precessing performance.

Troubleshooting#

Sometimes it can happen that you see no records in the data preview after applying an operation because of a sampled source dataset, e.g. you want to perform a Join from two huge datasets and the sample dataset has no matches but the actual full dataset has.

We recommend to perform the following temporary workaround in this case:

  1. Perform the needed operation(s) to your dataset.

  2. As soon as you notice that the data preview is empty, go on performing your operations anyway.

  3. (Optional) To validate your upstream dataset, you can publish the node as a table to your Snowflake.

  4. Perform further operations on the upstream dataset and not on the published table to sustain the pipeline.

  5. After all operations have been performed, publish your view/ table as usual to Snowflake.

Note: Since these cases require a temporary workaround we would appreciate if you contact our support in such a case.