Use Parallel Computing to Optimize Big Data Set for Analysis

This example shows how to optimize data preprocessing for analysis using parallel computing.

By optimizing the organization and storage of time series data, you can simplify and accelerate downstream applications such as predictive maintenance, digital twins, signal-based AI, and fleet analytics.

In this example, you use parallel workers to transform large raw data into a state ready for future analysis and save it to Parquet files. Parquet files store column-oriented heterogeneous data efficiently, so you can conditionally filter the files by row and load only the data you need. You can then use the out-of-memory data to train a simple AI model.
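As a minimal sketch of the row-level filtering that Parquet files allow, the following code writes a tiny synthetic table to a temporary Parquet file and reads back only the rows that satisfy a condition. The table contents, variable names, and file location are assumptions made for illustration and are not part of the flight data set. The sketch also assumes a release that provides the rowfilter function.

```matlab
% Synthetic table standing in for sensor metadata (illustrative values only).
sensors = table(["S1";"S2";"S3";"S4"],[0.25;1;4;1], ...
    VariableNames=["SensorName","SampleRate"]);

% Write the table to a temporary Parquet file.
parquetFile = tempname + ".parquet";
parquetwrite(parquetFile,sensors);

% Build a row filter on SampleRate and load only the matching rows.
rf = rowfilter("SampleRate");
fastSensors = parquetread(parquetFile,RowFilter=rf.SampleRate >= 1);
disp(fastSensors)
```

Because the filter is applied while reading, rows that fail the condition do not need to be loaded into memory first.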

If you have data stored on a cluster, you can use this code for similar data optimization without downloading the data. To see an example that performs analysis on data stored in the cloud, see Process Big Data in the Cloud.

Start a parallel pool of process workers.

pool = parpool("Processes");

Starting parallel pool (parpool) using the 'Processes' profile ...
15-Jan-2024 12:03:49: Job Running. Waiting for parallel pool workers to connect ...
Connected to parallel pool with 6 workers.
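If you rerun the example in the same session, a pool might already be open. The following sketch, which is an optional variation and not part of the original example, reuses an existing pool and starts a new one only when none is running.

```matlab
% Reuse the current pool if one exists; otherwise start a process-based pool.
pool = gcp("nocreate");
if isempty(pool)
    pool = parpool("Processes");
end
fprintf("Pool has %d workers.\n",pool.NumWorkers);
```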

Download Flight Data

This example uses sample aircraft sensor data provided by NASA [1].

If you want to try these data preprocessing techniques yourself, you must download the aircraft sensor data. NASA provides data for approximately 180,000 flights, with one MAT file representing each flight. For more information, see Sample Flight Data.

This code creates a folder in your current folder and downloads data for the first year of aircraft tail number 652, which occupies approximately 1.6 GB of disk space. Downloading the data can take several minutes. To confirm that you want to download the data, select "true" from the drop-down list before you run the example.

downloadIfTrue = false;
if downloadIfTrue
    downloadNASAFlightData(pwd,"small");
    dataRoot = fullfile(pwd,"data");
else
    disp("Confirm and download flight data to proceed.");
    return
end

Organizing MAT files into folders.
MAT files organized into folders.

The downloadNASAFlightData function downloads and organizes the files for tail 652 into subfolders for each month.

Convert Data to Nested Tables

Examine a sample of the flight data. Each MAT file comprises 186 structure arrays, with each structure array representing a sensor. Each structure array stores the metadata associated with the sensor, along with the sensor readings in a nested array. Additionally, the file name contains important metadata such as the flight ID, tail number, and start time.

sampleData = matfile(fullfile(dataRoot,"mat","Tail_652","200101","652200101092009.mat"));
sampleData.ABRK

ans = struct with fields:
           data: [1972×1 double]
           Rate: 1
          Units: 'DEG'
    Description: 'AIRBRAKE POSITION'
          Alpha: 'ABRK'
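To illustrate how the metadata embedded in a file name can be recovered, here is a minimal sketch that parses the sample file name. The layout assumed here (a three-digit tail number followed by a start time in yyyyMMddHHmm form) is inferred from the sample file 652200101092009.mat for tail 652 in the month folder 200101, not from a published specification.

```matlab
% Sample file name from this example.
fileName = "652200101092009.mat";

% The flight ID is the file name without its extension.
[~,flightId] = fileparts(fileName);

% Assumed layout: characters 1-3 are the tail number, the rest is the start time.
tailNumber = extractBefore(flightId,4);
startTime = datetime(extractAfter(flightId,3),InputFormat="yyyyMMddHHmm");

% Month folder name in the same yyyyMM style as the downloaded subfolders.
monthFolder = string(startTime,"yyyyMM");
```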

It is not efficient to store the data for each sensor as a separate structure variable or to embed the metadata within the filename. Instead, you can organize the data into a nested schema. This approach enables you to easily search the metadata and reduce the number of rows in the table by nesting the sensor values. Use the struct2table function to organize the sample structure array.

struct2table(sampleData.ABRK,AsArray=true)

ans=1×5 table
         data          Rate     Units          Description          Alpha
    _______________    ____    _______    _____________________    ________
    {1972×1 double}     1      {'DEG'}    {'AIRBRAKE POSITION'}    {'ABRK'}

The returnNestedTable helper function applies the struct2table function to the data for each sensor in the sample MAT file and vertically concatenates the results.

head(returnNestedTable(fullfile(dataRoot,"mat","Tail_652","200101","652200101092009.mat")))

(Output: the first eight rows of a table with the variables StartTime, TailNumber, FlightId, Rate, Alpha, Description, Units, and data. Every row has the start time 2001-01-09 20:09:00.000, tail number 652, and flight ID 652200101092009. The Rate values are 0.25, 0.25, 0.25, 0.25, 1, 1, 0.25, and 1, and the data variable nests the readings of each sensor, for example as a 493×1 double vector.)
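The source of returnNestedTable is not shown here. As a rough sketch of the nesting step it is described as performing, the following code converts each sensor structure to a one-row table and stacks the rows. The flight structure and its values are synthetic stand-ins for the contents of one MAT file, and the real helper may differ, for example by also adding the StartTime, TailNumber, and FlightId variables.

```matlab
% Synthetic stand-in for one flight file: one field per sensor (made-up values).
flight.ABRK = struct('data',zeros(8,1),'Rate',1,'Units','DEG', ...
    'Description','AIRBRAKE POSITION','Alpha','ABRK');
flight.ACID = struct('data',zeros(2,1),'Rate',0.25,'Units','UNITS', ...
    'Description','AIRCRAFT NUMBER','Alpha','ACID');

% Convert each sensor structure to a one-row table with nested readings.
names = fieldnames(flight);
rows = cell(numel(names),1);
for k = 1:numel(names)
    rows{k} = struct2table(flight.(names{k}),AsArray=true);
end

% Stack the one-row tables so that each sensor occupies a single row.
nested = vertcat(rows{:});
disp(nested)
```

Setting AsArray=true keeps the readings of each sensor in a single cell, which is what lets sensors with different numbers of samples share one table.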

Create File Datastore

A datastore is a repository for collections of data that are too large to fit in memory. You can read and process data stored in multiple files as a single entity. To learn more, see Getting Started with Datastore.

Create a FileDatastore object with the data files from the first year of tail 652. You must use the returnNestedTable custom read function to read the data in the MAT files.

dsFlight = fileDatastore(fullfile(dataRoot,"mat","Tail_652"), ...
    ReadFcn=@returnNestedTable,IncludeSubfolders=true, ...
    FileExtensions=".mat",UniformRead=true);

Preview the datastore. The table output is the same as the table output when you call the returnNestedTable read function without the datastore.

preview(dsFlight)

(Output: a 186×8 table with the variables StartTime, TailNumber, FlightId, Rate, Alpha, Description, Units, and data. Every displayed row has the start time 2001-01-09 20:09:00.000, tail number 652, and flight ID 652200101092009. The displayed Rate values are 0.25, 1, or 4, and the data variable nests the readings of each sensor, for example as a 493×1 double vector.)
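The following sketch shows the same pattern on a much smaller scale: a file datastore with a custom read function that returns a table for each file. The temporary folder, the file names, and the readAsTable function are hypothetical and exist only to make the sketch self-contained.

```matlab
% Create three small MAT files in a temporary folder (synthetic data).
folder = tempname;
mkdir(folder);
for k = 1:3
    x = k*ones(4,1);
    save(fullfile(folder,"part" + k + ".mat"),"x");
end

% Hypothetical read function: load one file and return its contents as a table.
readAsTable = @(file) struct2table(load(file));

fds = fileDatastore(folder,ReadFcn=readAsTable, ...
    FileExtensions=".mat",UniformRead=true);

% With UniformRead=true, readall vertically concatenates the per-file tables.
combined = readall(fds);
disp(height(combined))
```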

Clean Data

Next, prepare the data for future analysis by cleaning it. Use the transform function to manipulate the tables and change the data types of table variables. The datastore defers the transformation until you read from it or write it to files.

Rename the Rate, Alpha, and data variables to SampleRate, SensorName, and Data.

tdsFlight1 = transform(dsFlight,@(t) renamevars(t,["Rate","Alpha","data"], ...
    ["SampleRate","SensorName","Data"]));
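To see the deferred behavior in isolation, consider this minimal sketch. It uses an arrayDatastore over a small table as a stand-in for the file datastore, which is an assumption made to keep the sketch self-contained. The renaming runs only when you read from the transformed datastore, and the underlying datastore is left unchanged.

```matlab
% Small table standing in for one chunk of sensor metadata.
T = table([0.25;1;0.25],["1107";"ABRK";"ACID"],VariableNames=["Rate","Alpha"]);

% Stand-in datastore that returns two table rows per read.
ds = arrayDatastore(T,OutputType="same",ReadSize=2);

% Defining the transform does not read or modify any data yet.
tds = transform(ds,@(t) renamevars(t,["Rate","Alpha"], ...
    ["SampleRate","SensorName"]));

% The renaming is applied here, when data is actually read.
firstChunk = read(tds);
disp(firstChunk)
```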

Convert all variables that are cell arrays of character vectors to string arrays. To categorize the data later, convert the Units variable to a categorical array and the SampleRate variable to single precision. Preview a sample of the results from the transformed datastore.

tdsFlight2 = transform(tdsFlight1,@(t) convertvars(t,vartype("cellstr"),"string"));
tdsFlight3 = transform(tdsFlight2,@(t) convertvars(t,"Units","categorical"));
tdsFlight4 = transform(tdsFlight3,@(t) convertvars(t,"SampleRate","single"));
preview(tdsFlight4)

ans=8×8 table
           StartTime           TailNumber       FlightId        SampleRate    SensorName           Description               Units           Data
    _______________________    __________    _______________    __________    __________    __________________________    ___________    _______________
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "1107"      "SYNC WORD FOR SUBFRAME 1"                   { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "2670"      "SYNC WORD FOR SUBFRAME 2"                   { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "5107"      "SYNC WORD FOR SUBFRAME 3"                   { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "6670"      "SYNC WORD FOR SUBFRAME 4"                   { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009        1           "A/T"       "THRUST AUTOMATIC ON"
    2001-01-09 20:09:00.000       652        652200101092009        1           "ABRK"      "AIRBRAKE POSITION"               DEG
    2001-01-09 20:09:00.000       652        652200101092009       0.25         "ACID"      "AIRCRAFT NUMBER"                            { 493×1 double}
    2001-01-09 20:09:00.000       652        652200101092009        1           "ACMT"      "ACMS TIMING USED T1HZ"
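The effect of the three conversions is easier to check on a small in-memory table. The following sketch applies the same convertvars calls to a synthetic table. The sensor names come from this example, but the unit and rate values are made up for illustration.

```matlab
% Synthetic table with cellstr and double variables (made-up values).
t = table({'ABRK';'ACID'},{'DEG';'UNITS'},[1;0.25], ...
    VariableNames=["SensorName","Units","SampleRate"]);

% Convert every cellstr variable to a string array.
t = convertvars(t,vartype("cellstr"),"string");

% Convert Units to categorical and SampleRate to single precision.
t = convertvars(t,"Units","categorical");
t = convertvars(t,"SampleRate","single");

% Show the resulting class of each variable.
disp(varfun(@class,t,OutputFormat="cell"))
```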

Standardize the missing values in the Units variable. MATLAB uses the <undefined> token to mark missing data in categorical arrays, but some rows of the Units variable also contain placeholder values, such as UNITS, which you can treat as missing in this data set. Use a transformation function so that every missing value in the Units variable uses the same missing token. Preview a sample of the results from the transformed datastore.

tdsFlight5 = transform(tdsFlight4,@(t) standardizeMissing(t,["","UNITS"], ...
    DataVariables="Units"));
preview(tdsFlight5)
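As a small illustration of what standardizeMissing does to a categorical variable, the following sketch treats the placeholder UNITS as missing in a synthetic table. The pairing of sensor names with units is made up for the sketch.

```matlab
% Synthetic table with a categorical Units variable (made-up values).
t = table(["ABRK";"ACMT"],categorical(["DEG";"UNITS"]), ...
    VariableNames=["SensorName","Units"]);

% Treat the placeholder "UNITS" as missing, but only in the Units variable.
t = standardizeMissing(t,"UNITS",DataVariables="Units");

% The placeholder row now reports as missing.
disp(ismissing(t.Units))
```

After this step, functions such as ismissing and rmmissing can find the placeholder rows without needing to know the original placeholder text.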

Write to Parquet Files in Parallel

Now that the data is cleaned and ready for future analysis, save the data in the final transformed datastore as Parquet files using the writeall function. The Parquet file format supports efficient compression, encoding, and extraction of column-oriented heterogeneous data. When you set UseParallel to true, the writeall function automatically uses the workers of the open parallel pool to apply the transformation functions and write the contents of the transformed datastore to files.

This code creates one Parquet file for each MAT file in the datastore and saves the Parquet files in the parquet_sample folder, preserving the folder structure of the original MAT files. This process writes 2.6 GB of data to disk. To confirm that you want to save the data, select "true" from the drop-down list before you run the example.

saveIfTrue = false;
if saveIfTrue
    outdir = fullfile(dataRoot,"parquet_sample");
    if isfolder(outdir)
        rmdir(outdir,"s");
    end
    writeall(tdsFlight5,outdir,FolderLayout="duplicate", ...
        OutputFormat="parquet",UseParallel=true)
    disp("Parquet files saved to the parquet_sample folder.")
else
    disp("Confirm and save modified flight data to proceed.")
end

Parquet files saved to the parquet_sample folder.

The writeall function saves the Parquet files into subfolders for each month.
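To check the write step on a small scale before committing several gigabytes of disk space, you can run a round trip on synthetic data. The following sketch is such a check under stated assumptions: the temporary folders, the file names, and the read function are hypothetical, and UseParallel is set to false so that the sketch does not depend on an open pool.

```matlab
% Build a small file datastore from synthetic MAT files.
inDir = tempname;
mkdir(inDir);
for k = 1:2
    value = k*ones(3,1);
    save(fullfile(inDir,"flight" + k + ".mat"),"value");
end
fds = fileDatastore(inDir,ReadFcn=@(f) struct2table(load(f)), ...
    FileExtensions=".mat",UniformRead=true);

% Write one Parquet file per input file. With an open parallel pool,
% you could set UseParallel=true instead.
outDir = tempname;
writeall(fds,outDir,OutputFormat="parquet",UseParallel=false);

% Read everything back through a Parquet datastore.
pds = parquetDatastore(outDir,IncludeSubfolders=true);
roundTrip = readall(pds);
disp(height(roundTrip))
```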