Manipulate csv file using bash script or python, converting certain columns into rows

I have a csv file with 6 columns. I want to write a script using Python or Bash so that I can generate a new file, maybe a txt file, where for each row the first three columns remain the same (that is column 1, 2 and 3 remain in same rows), but the next three columns (i.e. columns 4, 5 ,6) get inserted into a new row, just below their previous row.

My sample file looks like this:

Column1 Column2 Column3 Column4 Column5 Column6
chr16 2697996 2697997 chr16 2086192 2086193
chr5 112801382 112801383 chr5 138445682 138445683
chr15 26869333 26869334 chr8 119416385 119416386

My final file should look like this:

Column1 Column2 Column3
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386
1

2 Answers

Here's an awk approach:

$ awk 'NR>1{for(i=1;i<=NF;i++){ if((i%4)==0) printf "\n"; printf "%s%s",$i,FS;}printf "\n"} END{printf "\n"}' input.csv
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386

Assuming the "Columns" is present in the file, it'd be easier to process the file without them, hence NR>1 part. After that is a simple for-loop, which inserts a newline whenever there's no remainder in integer division by 4 in the column number ( the (i%4) == 0 part ). This is a very common trick to deal with short counters or data that needs to occur at specific intervals. Newline is necessary because we're printing data via printf directly as plain string, not a full line.


Since you've requested python, here's a python script as well:

#!/usr/bin/env python3
import sys
with open(sys.argv[1]) as fd: for index,line in enumerate(fd): items=line.strip().split() if index == 0: print(" ".join(items[:len(items)//2])) continue left = items[:len(items)//2] right = items[len(items)//2:] print( " ".join(left) ) print( " ".join(right) )

Works as follows:

$ ./break_columns.py ./input.csv
Column1 Column2 Column3
chr16 2697996 2697997
chr16 2086192 2086193
chr5 112801382 112801383
chr5 138445682 138445683
chr15 26869333 26869334
chr8 119416385 119416386
5

You can use Miller () and a bash script. Starting from input

Column1 Column2 Column3 Column4 Column5 Column6
chr16 2697996 2697997 chr16 2086192 2086193
chr5 112801382 112801383 chr5 138445682 138445683
chr15 26869333 26869334 chr8 119416385 119416386

and running

<input tail -n +2 | mlr --nidx --ifs ' ' --repifs cut -f 1,2,3 >out_01
<input tail -n +2 | mlr --nidx --ifs ' ' --repifs cut -f 4,5,6 >out_02
mlr --nidx --ifs ' ' --repifs cat out_01 out_02 >out
mlr -I --n2c --fs ' ' cat then label Column1,Column2,Column3 out

you will have

Column1 Column2 Column3
chr16 2697996 2697997
chr5 112801382 112801383
chr15 26869333 26869334
chr16 2086192 2086193
chr5 138445682 138445683
chr8 119416385 119416386

If you want a pretty printed output, change the final command in mlr -I --n2p --fs ' ' cat then label Column1,Column2,Column3 out

Column1 Column2 Column3
chr16 2697996 2697997
chr5 112801382 112801383
chr15 26869333 26869334
chr16 2086192 2086193
chr5 138445682 138445683
chr8 119416385 119416386

Your Answer

Sign up or log in

Sign up using Google Sign up using Facebook Sign up using Email and Password

Post as a guest

By clicking “Post Your Answer”, you agree to our terms of service, privacy policy and cookie policy

You Might Also Like