Convert raw dataset to D3M dataset

The raw dataset converter requires Python 3.6 or later and an installed core d3m package. It currently supports the following data types:

  • Text

  • Video

  • Image

  • Time-Series

  • Tabular

  • Audio

  • Graph

Note: Some task and data types may not be fully automated (e.g., object detection, graph problems). The TRAIN and TEST hierarchies specific to D3M datasets are generated, but datasetDoc.json might need to be customized to link resources/tables for the specific task. Example datasets are provided for reference.

Interface

Command Line

create_d3m_dataset /path/to/train_data /path/to/test_data label metric -t task [-t ...] [-o /path/to/output_folder]
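
When the converter is driven from a script, the command can be assembled programmatically. The sketch below builds an argument list that follows the synopsis above, repeating -t once per task tag and adding -o only when an output folder is given. The helper name build_converter_command is hypothetical and not part of AutonML.

```python
import shlex


def build_converter_command(train, test, label, metric, tasks, output=None):
    """Assemble an argument list matching the synopsis above (hypothetical helper)."""
    if not tasks:
        raise ValueError("at least one task tag is required")
    cmd = ["create_d3m_dataset", str(train), str(test), label, metric]
    for task in tasks:
        # Each task tag gets its own -t flag.
        cmd += ["-t", task]
    if output is not None:
        # The output folder is optional in the synopsis.
        cmd += ["-o", str(output)]
    return cmd


if __name__ == "__main__":
    cmd = build_converter_command(
        "train_data.csv", "test_data.csv", "Label", "accuracy",
        ["classification"], output="output_dir",
    )
    # Print a shell-quoted version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the create_d3m_dataset command is available on the PATH.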

Python API

from autonml import createD3mDataset

training_data_csv = 'path/to/train/data'
testing_data_csv = 'path/to/test/data'
output_dir = 'path/to/formatted/data/output/directory'
label = 'Target'
metric = 'f1Macro'
tasks = ['classification']

createD3mDataset(training_data_csv,
                 testing_data_csv,
                 output_dir,
                 label,
                 metric,
                 tasks)

Usage and Explanation

Both the CLI command and the createD3mDataset API take the following six arguments:

  • Input Training Data : Path to input training data. Can be a file or a directory, depending on input data type. See Input data types below.

  • Input Testing Data : Path to input testing data. Can be a file or a directory, depending on input data type. See Input data types below.

  • Output directory : Path to output directory for storing the converted D3M dataset. Creates the directory if it does not exist, but the parent directory must exist.

  • Label : Name of the column that contains the targets. This must be the same in the training and testing data.

  • Metric : Metric for evaluation. See Metrics below.

  • Tasks : A list of tags that define the task type for AutonML. See Task types below.
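
Two of the constraints above (the label column must appear in both files, and the parent of the output directory must exist) can be checked before the converter runs. The following is a minimal sketch for tabular CSV inputs only; check_converter_inputs is a hypothetical helper, not part of AutonML, and it assumes the first row of each CSV is a header.

```python
import csv
from pathlib import Path


def check_converter_inputs(train_csv, test_csv, output_dir, label):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical pre-flight helper for tabular (CSV) inputs only.
    """
    problems = []
    for name, path in (("training", train_csv), ("testing", test_csv)):
        path = Path(path)
        if not path.is_file():
            problems.append(f"{name} data not found: {path}")
            continue
        # Read only the header row (assumed to be the first line).
        with path.open(newline="") as handle:
            header = next(csv.reader(handle), [])
        if label not in header:
            problems.append(f"label column {label!r} missing from {name} data")
    # The converter creates the output directory, but not its parent.
    parent = Path(output_dir).resolve().parent
    if not parent.is_dir():
        problems.append(f"parent of output directory does not exist: {parent}")
    return problems
```

An empty result means it is reasonable to go on and call createD3mDataset with the same arguments; for directory-based inputs (images, audio, and so on) these checks do not apply.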

Input data types

Text D3M datasets are created in much the same way as image datasets. Refer to the documentation on image datasets above for a pattern that also applies to text datasets.

Sample D3M datasets for different data types:

Metrics

Valid metrics for different task types:
  • classification / linkPrediction / graphMatching / vertexNomination / vertexClassification: accuracy, f1Macro, f1Micro, rocAuc, rocAucMacro, rocAucMicro

  • regression / forecasting / collaborativeFiltering: rSquared, meanSquaredError, meanAbsoluteError

  • communityDetection / clustering: normalizedMutualInformation
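
The task-to-metric pairing above can be captured in a small lookup table so that an invalid combination is caught before conversion. This is only a sketch built from the names listed above; VALID_METRICS and metric_is_valid are hypothetical names, and task types with no metrics listed above are treated as unknown.

```python
# Metric names copied from the list above, grouped by task family.
CLASSIFICATION_METRICS = {
    "accuracy", "f1Macro", "f1Micro", "rocAuc", "rocAucMacro", "rocAucMicro",
}
REGRESSION_METRICS = {"rSquared", "meanSquaredError", "meanAbsoluteError"}
CLUSTERING_METRICS = {"normalizedMutualInformation"}

VALID_METRICS = {}
for task in ("classification", "linkPrediction", "graphMatching",
             "vertexNomination", "vertexClassification"):
    VALID_METRICS[task] = CLASSIFICATION_METRICS
for task in ("regression", "forecasting", "collaborativeFiltering"):
    VALID_METRICS[task] = REGRESSION_METRICS
for task in ("communityDetection", "clustering"):
    VALID_METRICS[task] = CLUSTERING_METRICS


def metric_is_valid(task, metric):
    """Return True if metric is listed for task; unlisted tasks return False."""
    return metric in VALID_METRICS.get(task, set())
```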

Task types

Valid task types:
  • classification

  • regression

  • forecasting

  • collaborativeFiltering

  • communityDetection

  • graphMatching

  • linkPrediction

  • vertexClassification

  • vertexNomination

  • clustering

  • objectDetection

  • semiSupervised

  • remoteSensing
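
Because the task tags are plain strings, a typo is easy to miss. The sketch below checks a list of tags against the names listed above; unknown_tasks is a hypothetical helper, and treating the tags as case-sensitive is an assumption.

```python
# Task tags copied from the list above.
VALID_TASKS = {
    "classification", "regression", "forecasting", "collaborativeFiltering",
    "communityDetection", "graphMatching", "linkPrediction",
    "vertexClassification", "vertexNomination", "clustering",
    "objectDetection", "semiSupervised", "remoteSensing",
}


def unknown_tasks(tasks):
    """Return the tags in tasks that are not in the list above, in input order."""
    return [task for task in tasks if task not in VALID_TASKS]
```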

Sample D3M datasets for different task types:

Examples

Some examples of valid commands:

create_d3m_dataset train_data.csv test_data.csv output_dir Label accuracy -t classification
create_d3m_dataset train_data.csv test_data.csv output_dir Value meanSquaredError -t regression

Output

The script creates a directory named “raw” that holds your dataset in D3M format. Use this directory as the input to ./scripts/start_container.sh.

This is the structure created for a generated D3M dataset:

raw$ tree
.
├── TEST
│   ├── dataset_TEST
│   │   ├── datasetDoc.json
│   │   ├── metadata.json
│   │   └── tables
│   │       └── learningData.csv
│   └── problem_TEST
│       └── problemDoc.json
└── TRAIN
    ├── dataset_TRAIN
    │   ├── datasetDoc.json
    │   ├── metadata.json
    │   └── tables
    │       └── learningData.csv
    └── problem_TRAIN
        └── problemDoc.json

8 directories, 8 files
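
After a conversion, the layout above can be verified with a short script. This sketch checks only the eight files shown in the listing, which corresponds to a tabular dataset; other data types may include additional resources. EXPECTED_FILES and missing_files are hypothetical names.

```python
from pathlib import Path

# Relative paths taken from the tree listing above.
EXPECTED_FILES = []
for split in ("TRAIN", "TEST"):
    EXPECTED_FILES += [
        f"{split}/dataset_{split}/datasetDoc.json",
        f"{split}/dataset_{split}/metadata.json",
        f"{split}/dataset_{split}/tables/learningData.csv",
        f"{split}/problem_{split}/problemDoc.json",
    ]


def missing_files(raw_dir):
    """Return the expected relative paths that are absent under raw_dir."""
    root = Path(raw_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling missing_files with the path to the generated “raw” directory returns an empty list when every file in the listing is present.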