Processing Big Data Files
I have a 120 MB text file with about 3,600,000 rows. I need to read this data using the script generated from the Import Data menu.
But when I run the script, it fails with an out-of-memory error. Is there another way to read a file this large?
My machine is an MSI laptop with an i7-7700HQ CPU @ 2.80 GHz and 8 GB of RAM.
%% Initialize variables.
filename = 'sicaklik.txt';
delimiter = '|';
startRow = 2;
formatSpec = '%s%s%s%s%s%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r','n','UTF-8');
%% Skip the BOM (Byte Order Mark).
fseek(fileID, 3, 'bof');
%% Read columns of data according to the format.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
%% Close the text file.
fclose(fileID);
% Convert the contents of columns containing numeric text to numbers.
%% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
%%
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
%%
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,3,4,5,6,7]
    % Converts text in the input cell array to numbers. Replaced non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            % Detected commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains(',')
                thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = textscan(char(strrep(numbers, ',', '')), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end
%% Split data into numeric and string columns.
rawNumericColumns = raw(:, [1,3,4,5,6,7]);
rawStringColumns = string(raw(:, 2));
%% Make sure any text containing <undefined> is properly converted to an <undefined> categorical
idx = (rawStringColumns(:, 1) == "<undefined>");
rawStringColumns(idx, 1) = "";
%% Create output variable
all_cities = table;
all_cities.Istasyon_No = cell2mat(rawNumericColumns(:, 1));
all_cities.Istasyon_Adi = categorical(rawStringColumns(:, 1));
all_cities.YIL = cell2mat(rawNumericColumns(:, 2));
all_cities.AY = cell2mat(rawNumericColumns(:, 3));
all_cities.GUN = cell2mat(rawNumericColumns(:, 4));
all_cities.SAAT = cell2mat(rawNumericColumns(:, 5));
all_cities.SICAKLIK_C = cell2mat(rawNumericColumns(:, 6));
%% Clear temporary variables
clearvars filename delimiter startRow formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp rawNumericColumns rawStringColumns idx;
Answers (1)
Fangjun Jiang
on 24 Oct 2019
Split the large file into smaller files and use tall arrays.
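As a rough illustration of the tall-array route, here is a minimal sketch that uses tabularTextDatastore to read the file in chunks, which may make splitting the file by hand unnecessary. The file name, delimiter, and column names come from the script in the question. The column formats, the ReadSize value, and the idea of computing a mean temperature are assumptions made for illustration only. It also assumes each row has exactly seven fields and that the numeric columns contain plain numeric text, with no thousands separators or stray characters.

```matlab
% Sketch only: chunked reading with a datastore, then a tall table on top.
filename = 'sicaklik.txt';            % file name from the question

% Column names follow the table built in the question. The formats are an
% assumption: station name as text (%q), the other six columns numeric (%f).
varNames = {'Istasyon_No','Istasyon_Adi','YIL','AY','GUN','SAAT','SICAKLIK_C'};
formats  = {'%f','%q','%f','%f','%f','%f','%f'};

ds = tabularTextDatastore(filename, ...
    'Delimiter', '|', ...
    'NumHeaderLines', 1, ...          % skip the header row (startRow = 2)
    'ReadVariableNames', false, ...
    'VariableNames', varNames, ...
    'TextscanFormats', formats);
ds.ReadSize = 100000;                 % rows per chunk; illustrative value

% Option 1: tall table. Operations are deferred until gather is called,
% so the whole file is never held in memory at once.
t = tall(ds);
meanTemp = gather(mean(t.SICAKLIK_C, 'omitnan'));

% Option 2: process the chunks manually, keeping only running totals.
reset(ds);
total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                 % ordinary table with up to ReadSize rows
    v = chunk.SICAKLIK_C;
    total = total + sum(v, 'omitnan');
    count = count + nnz(~isnan(v));
end
meanTempLoop = total / count;
```

Both options compute the same quantity; the manual loop is handy when the per-chunk processing does not map onto functions that support tall arrays. If the full table is still needed in memory, reading numeric columns as doubles rather than as strings in cell arrays, as the generated script does, should already reduce the footprint considerably.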