Perl is used for manipulating text files and in this article I will teach how to read a CSV file in Perl? First of all let’s start with
What is a CSV file?
A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form. As a result, such a file is easily human-readable (e.g., in a text editor).
CSV is a simple file format that is widely supported by consumer, business, and scientific applications. Among its most common uses is to move tabular data between programs that naturally operate on a more efficient or complete proprietary format. For example: a CSV file might be used to transfer information from a database program to a spreadsheet. -Wikipedia
Using Perl to read a CSV file
Now suppose we have a CSV file which has 4 rows and each row has fields which are separated by comma as given below: Linux,Unix,1,Ubuntu
RHEL,Kubuntu,2,Fedora
Mint,Puppy,3,ArchLinux
Mandriva,OpenSUSE,4,CentOS
If you want to extract out the data from 3rd column and calculate the sum of all the numbers then the approach to do this will be as follows:
Read the first line of the CSV file and extract out the 3rd column. Assign the extracted value to a scalar variable and then read the next line, extract out the value of 3rd column, add it to the scalar variable which contains the first value and store result in the same variable. Then read the 3rd line and so on…
Below is the Perl code for this purpose using split function.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = $ARGV[0];
my $total = 0;
open(my $FH, '<', $filename) or die "Couldn't open $filename $!";
while (my $fetchline = <$FH>){
chomp $fetchline;
my @elements = split "," , $fetchline;
$total += $elements[2];
}
print "$total\n";
Save the file as text.plx and if the CSV file name is filename.csv then you may run the code as
$ perl text.plx filename.csv
In this case we get 10 (1+2+3+4) as output.
In this article we used split function to extract out the columns from CSV file using comma (,) as separator.
But this code is error prone because of 2 reasons.
Reason 1: If the filename.csv contains something like
Linux,Unix,1,Ubuntu
RHEL,Kubuntu,2,Fedora
Mint,"Puppy,Linux",3,ArchLinux
Mandriva,OpenSUSE,4,CentOS
Notice the 3rd row. The 2nd field of it has a comma “Puppy,Linux”. If we use the same approach of splitting fields by comma separator then it will throw an error because split function will split out Puppy and Linux since they are separated by a comma. Then it will do something like 1+2+Linux+4 which is wrong.
Reason 2: If the filename.csv contains something like
Linux,Unix,1,Ubuntu
RHEL,Kubuntu,2,Fedora
Mint,"Puppy,
Linux",3,ArchLinux
Mandriva,OpenSUSE,4,CentOS
Here the 3rd row contains multi-line fields. When the above code will be used it will treat the CSV file as it contains 5 lines instead of 4.
So we can conclude that if the CSV file is in good format AND if it does not contain field separators within quote AND if there are no multi-line fields in CSV file then our code (which we have written above) will work perfectly and will give correct result. Take use of Text::CSV_XS module which is used for reading and writing CSV files, it will deal with those error prone situations very efficiently.
No comments:
Post a Comment