I'm dealing with a CSV file that contains over 2 million lines. Quite a big one.
I need to use grep (or any better way) to retrieve the entire row if the value of the second column matches 'jpn' or 'por'.
I tried using grep '<\jpn\>' || grep '<\por\>' file1.csv > file2.csv but had no success so far. The major issue being that 'por' is a common occurrence on the third column, which generates over a million of undesired rows.
Would any more experienced user be se kind and lend me some help?
Thanks in advance! :D
5 Answers
What Linux OS are you on? Try using egrep if you want to use regular expressions in systems whose grep version is outdated (e.g. Solaris).
Anyway, here is an awk solution:
awk -F, '$2 ~ /jpn|por/ {print}' file1.csv > file2.csvExplanations:
awkfor column-based operations-F,to define the column separator/tokenizer, in this case I use a comma$2 ~ /jpn|por/tests column #2 with expression/jpn|por/$2is column #2/jpn|por/is a regular expression to match jpn or por
{print}specifies what the awk should output if it found a matching lineprintto print the whole input line (alternatively,print $3will just print column #3)
... file1.csvspecifies to read from an input file instead of stdin
grep '; jpn;\|; por;' /path/to/file1.csv > file2.csv
Assuming that semicolon (;) is the separator.
I think grep is a limited solution for this problem, because it wouldn't consider different separators and escaped separators.
I'd suggest you check out TextQ (disclaimer - I'm its developer). It can import a big CSV file and allows you to manage its schema/structure.
You can perform queries via a UI Query Builder or via SQL (select, join, group by, etc).
Here is a video demo and more detailed tutorial.
Finally, you can export any query to a CSV file, which can be imported in MS Excel or others. You can get it from the Mac App Store or Microsoft Store (soon).
I'm not sure, but maybe this could work:
grep '^[^,]*,\(por\|jpn\)\(,\|$\)' file1.csv > file2.csvAssuming that "," is the separator.
This seems to work for me:
egrep '^.?,(jpn|por),.*$' file1.csv > file2.csv