Formatting Data for AI Training and Modeling
Data is an extremely important part of the AI training and modeling process. The quality and format of the data are directly linked to the quality and function of the model created.
This article will cover:
- The syntax of data tables
- Which data types are supported
- Missing or non-existent data
Rows vs Columns
Most datasets take the form of a data table made up of rows and columns. The table is passed to the AI as a CSV file. In CSV, each row is a line of text, and the columns within a row are separated by commas.
For more information on the CSV format, please refer to RFC 4180.
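As a rough illustration of the row and column layout, the sketch below parses a small CSV string with Python's standard csv module. The column names and values are made up in the spirit of the dress example; they are not taken from the actual Walkthrough dataset.

```python
import csv
import io

# Hypothetical sample data; names and values are illustrative only.
csv_text = (
    "dress_id,season,price,rating\n"
    "1001,summer,high,4.2\n"
    "1002,winter,low,5.0\n"
)

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # the first line holds the column names
rows = list(reader)    # every following line is one row

print(header)
for row in rows:
    print(row)
```

Each line of text becomes one list, and each comma-separated value becomes one item in that list.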
In the dress scenario from our Walkthrough, the rows were the individual dress IDs and the columns were each dress’s attributes (season, price, etc.). While this may all seem simple, there are two key things to remember when forming your rows and columns:
- Each column needs a “header”: the top row of the table contains the names of all the columns.
- Each row is treated individually.
Making sure each column has a header is usually not an issue, but this is important to double-check to make sure both you and the AI know which column is which.
Both the training dataset and any dataset you use after the model is built must have exactly the same column names.
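One way to check the column names before uploading is to compare the two header rows directly. The sketch below is only an illustration: the helper names read_header and header_mismatches are hypothetical, the sample data is made up, and the comparison is assumed to be exact, including letter case.

```python
import csv
import io

def read_header(csv_text):
    """Return the list of column names from the first line of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))

def header_mismatches(training_header, new_header):
    """Return (missing, unexpected) column names, compared exactly."""
    missing = [c for c in training_header if c not in new_header]
    unexpected = [c for c in new_header if c not in training_header]
    return missing, unexpected

# Hypothetical sample data: "Season" differs from "season" only by case.
training_csv = "dress_id,season,price,rating\n1001,summer,high,4.2\n"
new_csv = "dress_id,Season,price,rating\n2001,winter,low,3.9\n"

missing, unexpected = header_mismatches(
    read_header(training_csv), read_header(new_csv)
)
print(missing, unexpected)  # ['season'] ['Season']
```

Both lists being empty means the two datasets use the same column names.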
Because each row is treated individually, a change made to one row has no effect on any other row. When predictions are made, editing one row will not change the prediction for any other row.
When the table is saved to a file, it must be in ASCII or UTF-8 encoding.
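If you generate the file with a script, the encoding can be set explicitly when writing. The sketch below writes a small table as UTF-8 and then checks that a file decodes as UTF-8. The file name, the sample values, and the helper is_valid_utf8 are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical sample table containing a non-ASCII character.
table = [
    ["dress_id", "season", "note"],
    ["1001", "summer", "café style"],
]

# Write to a temporary folder so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "dresses.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(table)

def is_valid_utf8(file_path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(file_path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(path))
```

Plain ASCII files also pass this check, because every ASCII file is valid UTF-8.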
Columns and Data Types
A key rule for columns is that every value in a column must be of the same data type. The OneClick platform can read many different data types, even within a single table, but each individual column must contain only one type throughout.
Different columns can have different data types.
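A quick way to spot a column that mixes types is to test whether some of its values parse as numbers while others do not. The sketch below is a simple heuristic and not how the OneClick platform detects types; the helper names and the sample data are hypothetical. Blank cells are skipped, and a column that intentionally mixes numeric and text IDs would also be flagged.

```python
def is_number(text):
    """Return True if the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def mixed_numeric_columns(header, rows):
    """Return names of columns where only some non-blank values are numbers."""
    mixed = []
    for i, name in enumerate(header):
        values = [row[i] for row in rows if row[i] != ""]
        numeric_count = sum(is_number(v) for v in values)
        if 0 < numeric_count < len(values):
            mixed.append(name)
    return mixed

# Hypothetical sample: the rating column contains the word "five".
header = ["dress_id", "rating", "season"]
rows = [
    ["1001", "4.2", "summer"],
    ["1002", "five", "winter"],
    ["1003", "", "summer"],
]

print(mixed_numeric_columns(header, rows))  # ['rating']
```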
There are several data types supported by OneClick.ai, and they can be categorized as numeric, date/time, categorical, and text:
- Numeric: Floating-point numbers and integers whose values can be compared by quantity.
  - The numeric data in the dress example was the dress ratings (4.2, 5.0, etc.).
  - Make sure the Target Column is formatted in number format.
    - For example, a value displayed as 1,234.78 is formatted incorrectly. Change the cell format to number, which removes the comma and displays 1234.78.
- Date/Time: The OneClick Platform tries to detect the date/time format automatically. Most common date/time formats (text or numeric) are supported.
- Categorical: Integers or any text used as IDs that represent discrete information about each row. Categorical values cannot be compared by quantity.
  - The categorical data in the dress example was size, season, and price (XL, summer, high).
- Text (free form text): We support English and Chinese, in ASCII or UTF-8 encoding
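For the number-format advice on the Target Column above, values that already contain thousands separators can also be cleaned up in a script. The sketch below assumes the comma is a thousands separator and the period is the decimal mark, which is not true in every locale; the helper name to_plain_number is hypothetical.

```python
def to_plain_number(text):
    """Remove thousands separators so '1,234.78' becomes '1234.78'.

    Assumes ',' is a thousands separator and '.' is the decimal mark.
    Raises ValueError if the result is still not a valid number.
    """
    cleaned = text.replace(",", "").strip()
    float(cleaned)  # validation only; raises ValueError on bad input
    return cleaned

print(to_plain_number("1,234.78"))  # 1234.78
```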
Note that when a column is textual and the text in a cell includes a “break” character (comma, bar, or semicolon), all of the text in that cell must be enclosed in double quotation marks. (Most programs, such as Excel and Numbers, do this automatically when saving to CSV.) If the text itself contains a double quote, that quote must be preceded by another double quote. For example, a cell whose text is "great", quotation marks included, is written in the CSV file as """great""".
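If you build the CSV file in a script rather than in a spreadsheet program, a CSV library can apply these quoting rules for you. The sketch below uses Python's standard csv module with its default settings; the review text is a made-up example.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)  # default: quote only the cells that need it

writer.writerow(["dress_id", "review"])
# This cell contains commas and double quotes, so it must be quoted.
writer.writerow(["1001", 'Light, airy, and "great" for summer'])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the original, unescaped cell contents.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1][1])
```

In the written text, the review cell is wrapped in double quotes and each double quote inside it is doubled.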
Missing or Non-existent Data
Values that could not be recorded, or that are not applicable for a column, are called “Missing Values”.
In general, both cases can be viewed as reporting a “blank”, “nothing”, or an “empty cell”. However, in modeling, and especially when dealing with different data types, things are more nuanced:
- Missing values for numeric data can only be represented by a blank (a literal empty cell).
- Missing values for categorical data can be represented by a blank cell or a consistent text class.
  - A common text class is the word “null”, but it can be any word, as long as it is used consistently to refer to a missing value in your table.
- Missing values for text data can be represented by a blank.
Sometimes it is necessary to distinguish an empty string (text with zero length) from a missing value. In this case, use "" (a quoted empty string) for the empty string and a blank cell for the missing value.
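The difference between a blank cell and a quoted empty string is easiest to see in the raw CSV text. The sketch below formats rows by hand, using None to stand for a missing value; the helper names format_cell and format_row are hypothetical. The cells are formatted manually because Python's csv module, with its default settings, writes None and an empty string the same way.

```python
def format_cell(value):
    """Format one cell: None is a blank cell, '' is a quoted empty string."""
    if value is None:
        return ""    # missing value: literal blank cell
    text = str(value)
    if text == "":
        return '""'  # empty string: quoted empty string
    if any(ch in text for ch in ',"\n\r'):
        # Quote the cell and double any embedded double quotes.
        return '"' + text.replace('"', '""') + '"'
    return text

def format_row(values):
    """Join formatted cells with commas to make one CSV line."""
    return ",".join(format_cell(v) for v in values)

# Second cell is missing; third cell is an empty string.
print(format_row(["1001", None, ""]))  # 1001,,""
```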
The label (the Target Column) for training data can never contain blanks or missing values, because the AI would have nothing to learn from for those rows.
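Before training, it can help to scan the label column for blank cells. The sketch below is one possible check; the helper name rows_missing_label, the choice of rating as the label column, and the sample data are all assumptions made for illustration.

```python
import csv
import io

def rows_missing_label(csv_text, label_column):
    """Return 1-based data-row numbers whose label cell is blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        n
        for n, row in enumerate(reader, start=1)
        if (row.get(label_column) or "").strip() == ""
    ]

# Hypothetical training data: the second row has no rating.
training_csv = (
    "dress_id,season,rating\n"
    "1001,summer,4.2\n"
    "1002,winter,\n"
    "1003,spring,5.0\n"
)

print(rows_missing_label(training_csv, "rating"))  # [2]
```

Any row numbers reported here would need a label filled in, or the rows removed, before the data is used for training.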