Datasets

A dataset is a collection of images, PDFs, text documents, audio files, and videos.

  1. Click the Datasets icon in the left navigation bar to view all existing datasets.

  2. Click the Create New Dataset button to start creating a new dataset.

Dataset Setup

  1. Specify a name for the dataset and provide a relevant dataset Description.

  2. Choose an appropriate dataset Type from the list of currently supported types provided below:

Dataset Type      Project Type
Image             Object Detection, Bulk Image Classification, Classification
Native PDF        Named Entity Recognition (NER), Classification
Plain Text        Named Entity Recognition (NER), Classification
Scanned (OCR)     OCR, Comprehend NER
Video             Media Transcription, Classification
Audio             Media Transcription

AWS Security

Tensoract Studio lets you enforce authentication with an AWS account. You can enable AWS credentials and specify an S3 location for item processing when working on annotation tasks. The following options are available:

  1. Use Global IAM Credentials: Tensoract Studio uses the global IAM credentials to authenticate to the S3 bucket.

  2. AWS Credentials: You configure your AWS Access Key, Secret Key, and Region within Tensoract Studio to enable authentication with AWS.

  3. Role ARN: You specify the AWS Role ARN and External ID for Tensoract Studio to authenticate with AWS. This is the recommended approach and is considered the most secure.

    With any of the above options, you also need to specify the Intermediate S3 Path that Tensoract Studio uses for task processing.
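Before entering these values in the setup form, it can help to check that they are complete and well formed. The following is a minimal sketch of such a check, not part of Tensoract Studio: the mode names, the AwsSecuritySettings class, and the validate function are hypothetical, and the patterns only reflect the general shape of IAM role ARNs and S3 paths.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mode names mirroring the three options described above.
GLOBAL_IAM = "global_iam"
AWS_CREDENTIALS = "aws_credentials"
ROLE_ARN = "role_arn"

# General shape of an IAM role ARN: arn:<partition>:iam::<12-digit account id>:role/<name>
_ROLE_ARN_RE = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")
# General shape of an S3 path: s3://<bucket>[/<prefix>]
_S3_PATH_RE = re.compile(r"^s3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9](/.*)?$")


@dataclass
class AwsSecuritySettings:
    mode: str
    intermediate_s3_path: str
    access_key: Optional[str] = None
    secret_key: Optional[str] = None
    region: Optional[str] = None
    role_arn: Optional[str] = None
    external_id: Optional[str] = None


def validate(settings: AwsSecuritySettings) -> List[str]:
    """Return a list of problems; an empty list means the settings look complete."""
    problems = []
    # The Intermediate S3 Path is needed with every option.
    if not _S3_PATH_RE.match(settings.intermediate_s3_path):
        problems.append("Intermediate S3 Path must look like s3://bucket/prefix")
    if settings.mode == AWS_CREDENTIALS:
        for name in ("access_key", "secret_key", "region"):
            if not getattr(settings, name):
                problems.append(f"{name} is required for AWS Credentials")
    elif settings.mode == ROLE_ARN:
        if not settings.role_arn or not _ROLE_ARN_RE.match(settings.role_arn):
            problems.append("role_arn must look like arn:aws:iam::<account-id>:role/<name>")
        if not settings.external_id:
            problems.append("external_id is required for Role ARN")
    elif settings.mode != GLOBAL_IAM:
        problems.append(f"unknown mode: {settings.mode}")
    return problems
```

An empty result means every field required by the chosen option is present. It does not confirm that AWS will accept the credentials.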

Dataset Metadata

You can provide dataset metadata as label-value pairs. It helps in filtering datasets based on metadata attributes.
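To illustrate how label-value pairs support filtering, here is a minimal sketch. The record layout (a dict with "name" and "metadata" keys) and the filter_by_metadata function are assumptions made for this example and do not describe how Tensoract Studio stores metadata.

```python
from typing import Dict, List


def filter_by_metadata(datasets: List[dict], wanted: Dict[str, str]) -> List[dict]:
    """Keep the datasets whose metadata contains every wanted label-value pair."""
    return [
        d for d in datasets
        if all(d.get("metadata", {}).get(label) == value for label, value in wanted.items())
    ]


# Hypothetical datasets with label-value metadata.
datasets = [
    {"name": "invoices-2023", "metadata": {"department": "finance", "region": "EU"}},
    {"name": "contracts", "metadata": {"department": "legal", "region": "EU"}},
    {"name": "receipts", "metadata": {"department": "finance", "region": "US"}},
]

eu_finance = filter_by_metadata(datasets, {"department": "finance", "region": "EU"})
```

Here eu_finance contains only the invoices-2023 dataset, because it is the only one that matches both pairs.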

Upload files to Dataset

You can upload documents to the dataset for processing. The compatible file formats for each dataset type are:

  1. Scanned Documents (OCR) - The compatible file formats include .pdf, .jpg, .png, .jpeg, .tif, and .tiff.

  2. Native PDF - The compatible file formats include .pdf.

  3. Plain Text - The compatible file formats include .txt.

  4. Image - The compatible file formats include .jpg, .png, .jpeg, .tif, and .tiff.

  5. Video - The compatible file formats include .mp4.

  6. Audio - The compatible file formats include .wav and .mp3.
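The list above can be expressed as a lookup table if you want to pre-check files before uploading. This is a minimal sketch: the compatible_files function is hypothetical, and matching extensions case-insensitively is an assumption, since the list does not say how the Studio treats upper-case extensions.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Extensions taken from the list above, keyed by dataset type.
COMPATIBLE_EXTENSIONS = {
    "Scanned (OCR)": {".pdf", ".jpg", ".png", ".jpeg", ".tif", ".tiff"},
    "Native PDF": {".pdf"},
    "Plain Text": {".txt"},
    "Image": {".jpg", ".png", ".jpeg", ".tif", ".tiff"},
    "Video": {".mp4"},
    "Audio": {".wav", ".mp3"},
}


def compatible_files(dataset_type: str, paths: Iterable[str]) -> List[str]:
    """Return the paths whose extension is compatible with the dataset type."""
    allowed = COMPATIBLE_EXTENSIONS[dataset_type]
    # Assumption: extensions are compared case-insensitively.
    return [p for p in paths if PurePosixPath(p).suffix.lower() in allowed]
```

For example, compatible_files("Audio", ["call.mp3", "clip.mp4"]) keeps only call.mp3.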

You can upload documents to the dataset in the following ways:

  • Select Files - You can upload compatible files from your local workstation directly to the dataset.

  • Select Folder - You can choose a folder to upload. The system then lists the files in it that are compatible with the dataset.

  • Select From Cloud - You can upload files from an S3 bucket. If AWS credentials are not set for the dataset, the system prompts for them.

  • Select Manifest/CSV/Txt - You can upload a manifest, CSV, or TXT file. The system parses the file and extracts the valid URLs, which are then added to the list of dataset items.
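To preview which entries of a CSV or TXT file are likely to be picked up, you can run a check like the sketch below. It illustrates the idea and is not the Studio's parser: the accepted URL schemes (s3, http, https), the removal of duplicates, and the function names are all assumptions, and manifest files are not covered.

```python
import csv
import io
from typing import List
from urllib.parse import urlparse

# Assumption: these are the URL schemes treated as valid.
ACCEPTED_SCHEMES = {"s3", "http", "https"}


def is_valid_url(value: str) -> bool:
    """True when the value has an accepted scheme and a bucket or host part."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ACCEPTED_SCHEMES and bool(parsed.netloc)


def extract_urls(text: str) -> List[str]:
    """Collect valid URLs from CSV or line-based text, keeping first-seen order."""
    urls = []
    # csv.reader also handles plain text files: each line becomes a one-cell row.
    for row in csv.reader(io.StringIO(text)):
        for cell in row:
            cell = cell.strip()
            if is_valid_url(cell) and cell not in urls:
                urls.append(cell)
    return urls
```

Header cells, labels, and blank lines are skipped because they do not parse as URLs with an accepted scheme.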

Note

  • You can select a subfolder in the S3 bucket. The service will fetch all compatible files in the path and add them to the list of dataset items.

  • You can add both local files and S3 files to the same list for uploading. You can also provide the full file path of a specific file, such as s3://*/abc.pdf.

  • You can choose to Upload All Files, Upload Selected Files and Delete Selected Files.

Add Teams to Dataset

You can specify the collaborators who will be working on the dataset.

  • Add Team member - Select the collaborator's name or email and the desired role from the drop-down menu, then click the Add Collaborator button.

  • Delete Team member - Click the x symbol next to the team member to remove them from the dataset.

To get a quick overview of the steps involved in creating a dataset, watch the following video:

_images/OCR-Dataset-Updated.gif

Dataset Items

Tensoract Studio provides the following options to browse through the dataset items:

  • Filter Dataset Items - You can filter dataset items by labels and values to narrow down the results to specific filter attributes. The following filters are available:

    1. All Fields

    2. Dataset Type - Type of Dataset

    3. ItemId - Unique id for each dataset item.

    4. Labels - Annotations from the Completed tasks in the project.

    5. Name - Name of the dataset item/file.

    6. State - State of the dataset items such as “Processing”, “Processing Error”, or “Deleted”.

    7. Source - Source of dataset upload when the dataset contains both Local and S3 items.

    8. Tags - Tags applied by Admins or Dataset Supervisors to the dataset item.

    9. Text Preview - Text contained within the dataset item.

    10. ProjectID - After annotators and reviewers have completed the task flow, the ProjectID filter is automatically added to the dataset.

    11. Version - As new versions are created, the Version filter is automatically updated.

    12. Model Run - When a model is invoked for items, the corresponding filter is automatically added to the dataset. The filter identifies the specific items used during that model execution, making them easy to filter and retrieve.

Additionally, if there is metadata associated with the dataset items, it will also be available in the filters dropdown.

In addition, the following options are available to refine the dataset item search and narrow down the results based on specific attributes:

  • Sort by - Sort the filtered items in ascending or descending order.

  • GroundTruth - Filter items using a project as the ground truth, once the annotator and reviewer have completed the task flow.

  • Image Size - The image scale can be adjusted to increase or decrease the size of the image.

  • Page Size - The page size scale can be adjusted to increase or decrease the size of the page.

  • Details (Visible/Hidden) - This control lets you show or hide the available details within the item.

  • Controls (Visible/Hidden) - This control applies to Image datasets. It lets you show or hide the RGB controls within the items.

Dataset Items View

  1. Hover over a dataset item for 5 seconds to see a zoomed image of the item in an iframe.

  2. Single-click to select dataset items. Multiple items can be selected this way.

  3. Double-click to open the detailed view of the dataset item.

  4. Press and hold the Ctrl key and single-click to copy the ItemId of the dataset item.

Dataset Actions

The dataset items page provides an “Action” button with the following options:

  1. Export items - Export selected items, filtered items, or the entire dataset. Users can also export the dataset associated with the Groundtruth project. To see the details of export fo

  2. Clone items to new dataset - Clone the items into a new dataset.

  3. Add items to project - An Admin can send items directly from the dataset to the desired project.

  4. Create bookmark - Create a bookmark for the dataset to access it in read-only mode.

  5. Find similar items - Retrieve similar matches to a dataset item.

  6. Delete items - Admins or Dataset Supervisors can delete items.

  7. Add tag - Add tags to items so that specific items in the dataset can be selected.

  8. Remove tag - Tags can be removed from the dataset items.

  9. Run Model - The system will initiate predictions using the selected model.

For a quick overview of Dataset Actions, watch the following video:

_images/Dataset-Action-Updated.gif

Model Run

  1. Model Run refers to the execution of a model to predict labels or annotations.

  2. To use Model Run in a dataset, the model must first be registered in the Studio. Refer to Model Management for more details.

  3. You can choose and execute models to generate predictions for selected items. An Admin can see the Model View and compare model runs.

  4. In addition, you can compare the predictions generated by a model run with the projects available in the Groundtruth filters.

  5. Within the detailed view of a dataset item, you can see the predicted labels from the model run as well as the Groundtruth labels from the project. Model-run labels that match the Groundtruth labels are displayed in green, indicating consensus.
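The comparison in the last step can be pictured as a set comparison between predicted and Groundtruth labels. The sketch below assumes that two labels match only when they are exactly equal. The Studio's actual matching rule is not described here, and the compare_labels function and its result keys are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_labels(predicted: Iterable[str], ground_truth: Iterable[str]) -> Dict[str, List[str]]:
    """Split labels into those in both sets and those in only one of them."""
    predicted_set, truth_set = set(predicted), set(ground_truth)
    return {
        # Labels in both sets correspond to the ones displayed in green.
        "matched": sorted(predicted_set & truth_set),
        "predicted_only": sorted(predicted_set - truth_set),
        "ground_truth_only": sorted(truth_set - predicted_set),
    }


result = compare_labels(
    predicted=["invoice_number", "total", "vendor"],
    ground_truth=["invoice_number", "total", "due_date"],
)
```

In this example, invoice_number and total are matched, vendor appears only in the model run, and due_date appears only in the Groundtruth project.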

For a quick overview of model comparisons for an OCR dataset, watch the following video:

_images/OCR-MODEL-COMPARISON-Updated.gif

Find similar items

This functionality lets you retrieve items in a dataset that are conceptually similar or related to a given search query. It requires an embeddings model configured as described in the Model Management section. On the dataset setup page, you can enable and select the embeddings model. Tensoract Studio uses the selected embeddings model to search for similar items.

Once the embeddings model is enabled, “Find similar items” appears among the available actions when you select a dataset item on the Items tab. You can enter the maximum number of similar items you want to retrieve.

The results are displayed as shown below.

Alternative text
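For intuition, embedding-based similarity search can be sketched as ranking items by how close their embedding vectors are to the selected item's vector. The example below uses cosine similarity, a common choice. Which metric Tensoract Studio uses is not stated here, and the function names and sample vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query_id: str, embeddings: Dict[str, Sequence[float]], max_items: int) -> List[str]:
    """Return up to max_items item ids, most similar to the query item first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), item_id)
        for item_id, vector in embeddings.items()
        if item_id != query_id  # never return the query item itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:max_items]]


# Hypothetical two-dimensional embeddings keyed by ItemId.
embeddings = {
    "item-a": [1.0, 0.0],
    "item-b": [0.9, 0.1],
    "item-c": [0.0, 1.0],
    "item-d": [-1.0, 0.0],
}
similar = find_similar("item-a", embeddings, max_items=2)
```

The max_items argument plays the role of the maximum number of similar items entered in the dialog.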

Export

This section shows a list of exports, which users can also download.