Hi guys, I want to read a text file line by line and remove the lines which have NA and the duplicated columns


chocho on 15 Feb 2017


https://fr.mathworks.com/matlabcentral/answers/325176-hi-guys-i-want-to-read-a-text-file-line-by-line-and-remove-the-lines-which-have-na-and-the-duplica

Edited: Walter Roberson on 20 Feb 2017

Accepted Answer: dpb

COADREAD_methylation.txt


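d = fopen('COADREAD_methylation.txt','r');
all_lines = {};                          % renamed so the built-in all() is not shadowed
this_line = fgetl(d);
while ischar(this_line)                  % fgetl returns the number -1 at end of file
    all_lines = [all_lines; {this_line}];   % append the line as a new cell
    this_line = fgetl(d);
end
fclose(d);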


Stephen23 on 17 Feb 2017

Edited: Stephen23 on 17 Feb 2017

Accepted Answer

dpb on 15 Feb 2017


https://fr.mathworks.com/matlabcentral/answers/325176-hi-guys-i-want-to-read-a-text-file-line-by-line-and-remove-the-lines-which-have-na-and-the-duplica#answer_254913

Edited: dpb on 16 Feb 2017


Well, 'NA' is easy, not sure what defines the repeated columns; not enough time at present to try to parse that input file to figure out what is/isn't unique without a description being supplied...

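fid = fopen('COADREAD_methylation.txt','r');
data = {};
while ~feof(fid)
  l = fgetl(fid);                  % read one line, without the newline
  if isempty(strfind(l,'NA'))      % keep only lines with no 'NA' anywhere in them
    data = [data; {l}];
  end
end
fid = fclose(fid);                 % fclose returns 0 on success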

If the presence of 'NA' is all that's needed to get all the offending records, then you're done; otherwise need more details on how to tell so folks here don't have to try to work it out on their own.
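One caveat with the substring test: strfind(l,'NA') also matches 'NA' inside a longer field, for example a gene symbol such as DNAH5. If that matters for your file, a possible variant is to split each record on tabs and drop it only when a whole field equals 'NA'. This is only a sketch on in-memory sample lines, assuming the fields are tab-separated; the third record (cg99999999) is made up for illustration and is not from the posted file.

```matlab
% Sample records built with sprintf so the fields are really tab-separated
% (assumption: the real file uses tabs between fields).
recs = { ...
    sprintf('cg00000292\t0.511852232819811\tATP2A1'); ...
    sprintf('cg00006414\tNA\t"ZNF425;ZNF398"'); ...
    sprintf('cg99999999\t0.25\tDNAH5')};   % hypothetical record: 'NA' inside a gene name

keep = true(size(recs));
for k = 1:numel(recs)
    fields = regexp(recs{k}, '\t', 'split');   % split the record on tabs
    keep(k) = ~any(strcmp(fields, 'NA'));      % drop only if a whole field is 'NA'
end
data = recs(keep);                             % records without an NA field
```

With these three records, the second one is dropped and the DNAH5 record is kept, whereas the plain strfind test would have dropped both.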


dpb on 16 Feb 2017

Edited: dpb on 17 Feb 2017


Are such rows the only ones remaining that have a semi-colon in them? If so, finding them works the same way as finding the 'NA' string, except that you now want to process the lines that contain the target instead of skipping them.

Breaking up the lines is again some pretty simple string processing; simply locate the semi-colon and the last tab-delimiter between the fields to retain the same input format.

>> l=sprintf('cg00008493\t0.987979722052904\t"COX8C;KIAA1409"');  % tab-delimited record
>> l=strrep(l,'"','');  % remove the superfluous "
>> ix=strfind(l,';');  % locate the semi-colon separator
>> itab=find(l==char(9),1,'last');  % and the last \t before
>> l=[{l(1:ix-1)}; {[l(1:itab) l(ix+1:end)]}];  % build two lines from one
>> l
l = 
  'cg00008493  0.987979722052904  COX8C'
  'cg00008493  0.987979722052904  KIAA1409'
>>

NB: the result is a cellstr array because the two substrings aren't the same length; this is the same "trick" used earlier in concatenating the lines while removing the unwanted ones.
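If a record can carry more than two symbols in the last field, the same idea can be written as a loop over however many pieces the semi-colon split returns. A minimal sketch, assuming the symbols sit in the last tab-delimited field as in the example above; the names genes and rows are just for illustration.

```matlab
tab = sprintf('\t');
l = sprintf('cg00008493\t0.987979722052904\t"COX8C;KIAA1409"');
l = strrep(l, '"', '');                        % remove the quotes
itab = find(l == tab, 1, 'last');              % last tab: the gene field starts after it
genes = regexp(l(itab+1:end), ';', 'split');   % one cell per gene symbol
rows = cell(numel(genes), 1);
for k = 1:numel(genes)
    rows{k} = [l(1:itab) genes{k}];            % same leading fields, one symbol per row
end
```

A line with no semi-colon simply comes back as a single row, so the same code can be applied to every retained line.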

chocho on 20 Feb 2017

Edited: Walter Roberson on 20 Feb 2017


Hi friend, I want to make this code follow this format.

Note: I want to get every line and check whether it has an NA; if it does, remove it and go on to the next line. If not, check the columns of this line, see which column has ';', split this column and make 2 rows.

fid = fopen('COADREAD_methylation.txt','r');
data = {};
while ~feof(fid)
  l = fgetl(fid);                        % get the next line
  if ~isempty(strfind(l,'NA'))           % the line has NA: drop it
    continue                             % and read the next line
  end
  cols = regexp(l,'\t','split');         % split this line (no NA) into its columns
  for i = 1:numel(cols)
    if any(cols{i} == ';')               % look for columns which have ';'
      % split this column into 2 columns and put the second column
      % into a new row
      % D = regexp(cols{i},';','split')
      % l = [{l(1:ix-1)}; {[l(1:itab) l(ix+1:end)]}];  % split the line into 2
    end
  end
  data = [data; {l}];   % save this line (the ';' split above is still to be written)
end
fid = fclose(fid);

chocho on 20 Feb 2017

Edited: Walter Roberson on 20 Feb 2017


inputs:

Hybridization REF  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05
Composite Element REF  Beta_value  Gene_Symbol  Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol
cg00000292  0.511852232819811  ATP2A1  16  28890100  0.787687855895422  ATP2A1
cg00002426  0.519102187746053  SLMAP  3  57743543  0.932889308560864  SLMAP
cg00006414  NA  "ZNF425;ZNF398"  7  148822837  NA  "ZNF425;ZNF398"
cg00008493  0.987979722052904  "COX8C;KIAA1409"  14  93813777  0.986128428295584  "COX8C;KIAA1409"
cg00011459  0.922491239231445  "TMEM186;PMM2"  16  8890425  0.961124285303233  "TMEM186;PMM2"

outputs:

Hybridization REF  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05  TCGA-A6-2672-11A-01D-1551-05
cg00000292  0.511852232819811  ATP2A1  0.787687855895422
cg00002426  0.519102187746053  SLMAP  0.932889308560864
cg00008493  0.987979722052904  COX8C  0.986128428295584
cg00008493  0.987979722052904  KIAA1409  0.986128428295584
cg00011459  0.922491239231445  TMEM186  0.961124285303233
cg00011459  0.922491239231445  PMM2  0.961124285303233

Appreciate your help!
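Putting the pieces from this thread together, one possible way to get from the posted inputs to the posted outputs is sketched below. It rests on several assumptions read off the example rather than stated anywhere: fields are tab-separated, the first header line is copied through unchanged and the second is dropped, and the columns to keep are the probe ID, the first Beta_value, the first Gene_Symbol and the second Beta_value (columns 1, 2, 3 and 6). It works on a few of the posted lines held in a cell array; the variable names are only for illustration.

```matlab
tab = sprintf('\t');
% A few of the posted input lines, rebuilt with real tabs between fields.
in = { ...
  sprintf('Hybridization REF\tTCGA-A6-2672-11A-01D-1551-05\tTCGA-A6-2672-11A-01D-1551-05\tTCGA-A6-2672-11A-01D-1551-05'); ...
  sprintf('Composite Element REF\tBeta_value\tGene_Symbol\tChromosome\tGenomic_Coordinate\tBeta_value\tGene_Symbol'); ...
  sprintf('cg00000292\t0.511852232819811\tATP2A1\t16\t28890100\t0.787687855895422\tATP2A1'); ...
  sprintf('cg00006414\tNA\t"ZNF425;ZNF398"\t7\t148822837\tNA\t"ZNF425;ZNF398"'); ...
  sprintf('cg00008493\t0.987979722052904\t"COX8C;KIAA1409"\t14\t93813777\t0.986128428295584\t"COX8C;KIAA1409"')};

out = in(1);                         % assumption: first header line is kept as is
for k = 3:numel(in)                  % assumption: two header lines, data starts at line 3
    f = regexp(strrep(in{k}, '"', ''), '\t', 'split');   % fields, quotes removed
    if any(strcmp(f, 'NA'))          % drop records that have a whole field equal to NA
        continue
    end
    genes = regexp(f{3}, ';', 'split');                  % one or more gene symbols
    for g = 1:numel(genes)
        % keep probe ID, first beta value, one gene symbol, second beta value
        out{end+1,1} = strjoin([f(1:2), genes(g), f(6)], tab);
    end
end
```

For the real file, the cell array in would come from the fgetl loop in the accepted answer (without its NA test), and out could be written back with fprintf(fid,'%s\n',out{:}) on a file opened for writing.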
