I have a csv file with 6 columns. I want to write a script using Python or Bash so that I can generate a new file, maybe a txt file, where for each row the first three columns remain the same (that is column 1, 2 and 3 remain in same rows), but the next three columns (i.e. columns 4, 5 ,6) get inserted into a new row, just below their previous row.
My sample file looks like this:
Column1 Column2 Column3 Column4 Column5 Column6
chr16 2697996 2697997 chr16 2086192 2086193
chr5 112801382 112801383 chr5 138445682 138445683
chr15 26869333 26869334 chr8 119416385 119416386My final file should look like this:
Column1 Column2 Column3
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386 1 2 Answers
Here's an awk approach:
$ awk 'NR>1{for(i=1;i<=NF;i++){ if((i%4)==0) printf "\n"; printf "%s%s",$i,FS;}printf "\n"} END{printf "\n"}' input.csv
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386Assuming the "Columns" is present in the file, it'd be easier to process the file without them, hence NR>1 part. After that is a simple for-loop, which inserts a newline whenever there's no remainder in integer division by 4 in the column number ( the (i%4) == 0 part ). This is a very common trick to deal with short counters or data that needs to occur at specific intervals. Newline is necessary because we're printing data via printf directly as plain string, not a full line.
Since you've requested python, here's a python script as well:
#!/usr/bin/env python3
import sys
with open(sys.argv[1]) as fd: for index,line in enumerate(fd): items=line.strip().split() if index == 0: print(" ".join(items[:len(items)//2])) continue left = items[:len(items)//2] right = items[len(items)//2:] print( " ".join(left) ) print( " ".join(right) )Works as follows:
$ ./break_columns.py ./input.csv
Column1 Column2 Column3
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386 5 You can use Miller () and a bash script. Starting from input
Column1 Column2 Column3 Column4 Column5 Column6
chr16 2697996 2697997 chr16 2086192 2086193
chr5 112801382 112801383 chr5 138445682 138445683
chr15 26869333 26869334 chr8 119416385 119416386and running
<input tail -n +2 | mlr --nidx --ifs ' ' --repifs cut -f 1,2,3 >out_01
<input tail -n +2 | mlr --nidx --ifs ' ' --repifs cut -f 4,5,6 >out_02
mlr --nidx --ifs ' ' --repifs cat out_01 out_02 >out
mlr -I --n2c --fs ' ' cat then label Column1,Column2,Column3 outyou will have
Column1 Column2 Column3
chr16 2697996 2697997
chr5 112801382 112801383
chr15 26869333 26869334
chr16 2086192 2086193
chr5 138445682 138445683
chr8 119416385 119416386If you want a pretty printed output, change the final command in mlr -I --n2p --fs ' ' cat then label Column1,Column2,Column3 out
Column1 Column2 Column3
chr16 2697996 2697997
chr5 112801382 112801383
chr15 26869333 26869334
chr16 2086192 2086193
chr5 138445682 138445683
chr8 119416385 119416386