Datasets

Datasets are the basis for machine learning and data analysis. In Smart Vision, datasets are the basis for training your own proprietary large models.

You can add public datasets from the Dataset Store (for details, see Dataset Store), or build private datasets in My Studio - Datasets. Both kinds are displayed in My Studio - Datasets, where you can use them to train your enterprise's customized large model.

In My Studio - Datasets, you can create a private dataset in two ways: ① Data Generation; ② Upload Dataset File.

Data Generation

Introduction

Data Generation produces sample data with a large model. Because the selected model has learned from a large amount of training data, it can capture complex patterns and relationships in data, which allows it to generate more realistic and accurate sample data.

Note: Large models need large amounts of training data to learn complex patterns and knowledge, so the quantity and quality of a model's training data strongly affect the datasets it generates. To guarantee the quality of the generated datasets, the generated data must also be evaluated and verified to confirm that it meets the expected standards and requirements.

Applicable Scenarios

  1. Insufficient labeled data: When a specific field or task lacks enough labeled data, data generation can make up the shortfall.
  2. Pre-trained models are available: If a powerful language model pre-trained on large-scale text data is available, its language understanding and generation capabilities can be used to generate synthetic data for specific tasks.
  3. Low sample diversity requirements: In some tasks, the requirements for sample diversity are relatively low, and the model's pre-training provides sufficient language knowledge. In this case, data generation may be effective.
  4. Exploratory research or prototype development: During the exploratory research or prototype development stage, data generation helps you quickly test and verify the basic performance of a model without spending a lot of time and resources collecting real labeled data.
  5. Data requirements for specific scenarios: For specific scenarios such as security testing, adversarial attacks, or model behavior in edge cases, data generation can produce targeted data for testing the robustness of the model.

Operation Guide

  1. Start: In "My Studio - Datasets", click the "Data generation" button in the upper right corner to open the "Model dataset generation" page.

  2. Generate sample dataset: Select a large model, configure the parameters to generate the dataset, and click the "Generate Sample Data" button to generate a sample dataset.

    • Model Name: Different models have different architectures, parameters, and training methods, which determine the types of data the model can handle and the characteristics of the generated datasets. For example, some models may be better suited for generating text data, while others may be better suited for processing image or audio data.

    • Generation options: Datasets can be rigorous or flexible, depending on the desired application scenarios and target audience. The choice of generation options will affect the style, format, and constraints of the dataset, as well as the generation capabilities and flexibility of the model.

    • Generation type: Four types are supported: Content topic, Dataset, JSON file, and Tool Calling.

      • Content topic: Sample datasets are generated around a clear topic or domain, which helps ensure that the resulting datasets are consistent and relevant. The choice of content topic directly affects the content and purpose of the dataset. For example, if the topic is labor law, the dataset will contain law-related text and data.
      • Dataset: Dataset samples are generated by expanding the dataset you upload, which makes the generated samples more accurate. The model analyzes the uploaded dataset to understand its structure, characteristics, and data types. The more detailed the format and content of the uploaded dataset, the more closely the generated dataset matches it.
      • JSON file: Dataset samples are copied, modified, and synthesized from the JSON file you upload. The JSON file must hold its content as key-value pairs, with three elements by default: instruction, input, and output.
      • Tool Calling: Dataset samples are generated based on the internal tools/workflows you select and the target dataset type. Three target dataset types are supported: simple (the tool list contains a single tool or workflow, and the model calls it once in response to the user's question), multiple (the tool list contains multiple tools/workflows, and the model calls several of them in response to the user's question), and irrelevance (the user's question is unrelated to any tool in the tool list, and the model calls no tools). Selecting all three types at once (the default option) is usually recommended to ensure the diversity of the dataset.
    • Generalization index: Generalization is the ability of a model to perform well on unseen data. The generalization index measures the quality and diversity of the dataset and the model's ability to adapt to new data. A high generalization index usually means that the dataset is more versatile and scalable and can support a wider range of application scenarios.

  3. Confirm sample data: Check whether the generated sample data meets your requirements. If it does not, go back to "Generate sample dataset", adjust the model or parameters, and generate the sample data again.

  4. Create a dataset generation task: After confirming that the dataset sample meets your needs, fill in the dataset name, number of generated items, etc., and click the "Generate" button in the upper right corner to create a dataset generation task.

  5. Complete: After you click the "Generate" button, the dataset is displayed in "Generating". When the dataset generation task is completed, the dataset is displayed in "My Datasets".
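
For the "JSON file" generation type, the seed file can be prepared and checked locally before it is uploaded. The following is a minimal sketch, not Smart Vision's own tooling: it assumes the file is a top-level JSON array of objects whose three default keys (instruction, input, output) hold strings. The guide only states that the content is key-value pairs with those three elements, so the array layout, the string requirement, the function names `validate_records` and `write_seed_file`, and the sample records are all illustrative assumptions.

```python
import json

# Default keys named in the "JSON file" generation type description.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of problems; an empty list means every record has the three keys."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index}: expected an object of key-value pairs")
            continue
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"record {index}: missing key '{key}'")
            elif not isinstance(record[key], str):
                # Assumption: values are plain strings; the guide does not say.
                problems.append(f"record {index}: '{key}' should be a string")
    return problems


def write_seed_file(records, path):
    """Validate the records, then write them as a UTF-8 JSON array (assumed layout)."""
    problems = validate_records(records)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


# Hypothetical placeholder records on the labor-law topic mentioned above.
seed_records = [
    {
        "instruction": "Answer the question about labor law.",
        "input": "What items does a written employment contract usually list?",
        "output": "Placeholder answer text for the first sample.",
    },
    {
        "instruction": "Summarize the following clause in one sentence.",
        "input": "Placeholder clause text for the second sample.",
        "output": "Placeholder summary text for the second sample.",
    },
]
```

Calling `write_seed_file(seed_records, "seed.json")` would then produce a file to upload; a record that lacks one of the three keys raises `ValueError` before anything is written.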

Upload Dataset File

Introduction

Uploading a dataset file involves transmitting, receiving, parsing, and storing the data, and then providing an access interface:

  1. Data preparation: Before uploading, check the following:

    • Format and size: Smart Vision supports uploading files in txt and JSON formats, and the file size cannot exceed 30 MB. Make sure the dataset has been organized and pre-processed in the required format.
    • Rights: Make sure you have the rights to upload and use the dataset. If you are using someone else's dataset, make sure you have obtained the appropriate license or authorization.
    • Network: Make sure your network connection is stable to avoid interruptions or errors during the upload.
    • Backup: Back up your original dataset before uploading in case something goes wrong during the upload.

  2. Data transmission, reception, parsing, and storage: After you upload the dataset to My Studio - Datasets through the upload function, Smart Vision receives the uploaded files and parses each one according to its file type. After processing, the dataset file is stored in Smart Vision for use in large model training.

  3. Access interface: For successfully uploaded datasets, Smart Vision provides an access interface so that users can access them through applications or over the network. To view the access path, click the uploaded dataset, open the dataset details - Dataset Files, and click the "Clone" button on the right; the path is available via HTTPS or SSH.
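
The format and size requirements from the data preparation step can be checked locally before starting an upload. The following is a minimal sketch under stated assumptions: the function name `check_upload_candidate` is hypothetical, the 30M limit is interpreted as 30 × 1024 × 1024 bytes (the guide does not say whether the unit is decimal or binary), and the UTF-8 JSON parse is an extra sanity check that the platform may or may not perform itself.

```python
import json
from pathlib import Path

# Formats named in the guide: txt and JSON.
ALLOWED_SUFFIXES = {".txt", ".json"}
# Assumption: "30M" is read as 30 MiB; adjust if the platform counts differently.
MAX_SIZE_BYTES = 30 * 1024 * 1024


def check_upload_candidate(path):
    """Return a list of problems; an empty list means the file passes these local checks."""
    file_path = Path(path)
    if not file_path.is_file():
        return [f"{file_path} does not exist or is not a regular file"]

    problems = []
    suffix = file_path.suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format '{suffix}': expected .txt or .json")

    size = file_path.stat().st_size
    if size > MAX_SIZE_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_SIZE_BYTES}-byte limit")

    if suffix == ".json":
        # Extra local check: confirm the file parses as UTF-8 encoded JSON.
        try:
            json.loads(file_path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as error:
            problems.append(f"not valid UTF-8 JSON: {error}")
    return problems
```

Running `check_upload_candidate("my_dataset.json")` and uploading only when the result is an empty list catches the most common rejections early; the checks on usage rights, network stability, and backups still have to be done by hand.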

Applicable Scenarios

  1. Collaborative research: In collaborative research, multiple research teams or institutions may need to share and use the same dataset. By uploading a dataset file, different teams or institutions can easily share and use the same dataset, which improves the efficiency of collaborative research and data analysis.
  2. Data privacy protection: In some cases, datasets cannot be shared directly with others because of data privacy or sensitive-information restrictions. By uploading a dataset file to Smart Vision, users can share the access method only with those who need to use the dataset, which helps keep the data secure.

Operation Guide

  1. Start: In "My Studio - Datasets", click the "Upload dataset file" button in the upper right corner to open the "Upload" page.

  2. Fill in the dataset information: Fill in the dataset title, label, description and other information.

  3. Upload dataset file: Upload the prepared dataset file. After the upload is successful, click the "OK" button in the lower right corner.

  4. Complete: The uploaded dataset will be displayed in "My Datasets".
