Data science, I'm sorry to say, often involves cleaning up input data into a usable and uniform format. Command line tools like
sed provide an arcane power to manipulate text in files of arbitrary size. Mastering these tools can separate data science novices from data scientists with flaming robes (to continue on the arcane theme).
For the purposes of this tutorial we have a directory of files that have some lines of the form:
Tags: Tag1, Tag2, ... (zero or more Tag labels). Our goal is to convert the tag labels to lowercase (Tag1 -> tag1), but leave the rest of the file unchanged. You can get the git repo with example files (here), or with
git clone https://github.com/frankcleary/tag-examples.git.
$ git grep -lz "^Tags:" | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/g"
The sed command allows for substitution of strings in text. We'll use a
sed command to do the text manipulation (lowercasing of tag names) on these files. The basic syntax for a
sed substitution command is this:
$ # s/ : do substitution
$ # old/new : replace any occurrences of "old" with "new"
$ # /g : replace all found matches on the line, instead of only the first
$ # filename : the name of the file to search and replace text in
$ sed "s/old/new/g" filename
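To make the effect of the /g flag concrete, here is a minimal sketch that feeds one made-up line (not from the example repo) through sed with and without the flag. The variable names are only for illustration.

```shell
# Sample input: one line containing "old" twice (hypothetical text).
line="old hat, old coat"

# Without /g only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed "s/old/new/")

# With /g every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed "s/old/new/g")

echo "$first_only"    # new hat, old coat
echo "$all_matches"   # new hat, new coat
```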
This command does not modify the file; it writes the result to stdout (prints it to the screen). Our goal is to construct a
sed command to lowercase everything after "Tags:", modifying the file in place and not changing any files that aren't under version control. We'll go about constructing this command in steps.
Developing the command, Step 1: Match the line.
This sed command finds the lines that start with "Tags:" followed by one or more other characters, and replaces the entire line with the string "changed".
$ # the original file:
$ cat tag-example1.txt
Title: Tag example 1
Tags: Tag1, Tag2
Content
$ # -r : use extended regular expressions
$ # ^Tags:.+ : Search for "Tags:" at the beginning of a line (^)
$ #            followed by one or more other characters (.+).
$ sed -r "s/^Tags:.+/changed/g" tag-example1.txt
Title: Tag example 1
changed
Content
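One detail worth seeing in action: .+ needs at least one character after "Tags:", so a bare "Tags:" line with zero labels is not matched at all. The sketch below uses a made-up two-line input and assumes GNU sed (for -r). For this tutorial the difference is harmless, since a line with no tags has nothing to lowercase.

```shell
# Hypothetical input: one line with a tag, one bare "Tags:" line with none.
sample=$(printf 'Tags: Tag1\nTags:\n')

# .+ requires one or more characters after "Tags:", so the bare line is skipped.
plus=$(printf '%s\n' "$sample" | sed -r "s/^Tags:.+/changed/")

# .* allows zero characters, so both lines are replaced.
star=$(printf '%s\n' "$sample" | sed -r "s/^Tags:.*/changed/")

echo "$plus"   # prints "changed" then "Tags:"
echo "$star"   # prints "changed" twice
```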
Developing the command, Step 2: lowercase the line.
We don't want to replace the line with new text; we want to replace it with the old text in lowercase (except for the initial "Tags:" part). In a sed replacement,
\0 means "what was matched" and
\L means "make lowercase." Combining these we can lowercase the entire line.
$ sed -r "s/^Tags:.+/\L\0/g" tag-example1.txt
Title: Tag example 1
tags: tag1, tag2
Content
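As a side note, GNU sed also accepts & as a spelling of "what was matched", and \U is the uppercase counterpart of \L. Both \L and \U are GNU extensions, so this small sketch assumes GNU sed; the input line is copied from the example file.

```shell
line="Tags: Tag1, Tag2"

# \0 and & both refer to the whole match in GNU sed.
with_zero=$(printf '%s\n' "$line" | sed -r "s/^Tags:.+/\L\0/")
with_amp=$(printf '%s\n' "$line" | sed -r "s/^Tags:.+/\L&/")

# \U converts to uppercase instead of lowercase.
upper=$(printf '%s\n' "$line" | sed -r "s/^Tags:.+/\U&/")

echo "$with_zero"   # tags: tag1, tag2
echo "$with_amp"    # tags: tag1, tag2
echo "$upper"       # TAGS: TAG1, TAG2
```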
Developing the command, Step 3: lowercase part of the line.
The problem with the above command is that it lowercases the entire line, including the initial "Tags:" part. To solve this problem we can enclose parts of the pattern in parentheses and access the first enclosed part as
\1, the second as
\2 and so on. To lowercase just the part after "Tags:":
$ sed -r "s/(^Tags:)(.+)/\1\L\2/g" tag-example1.txt
Title: Tag example 1
Tags: tag1, tag2
Content
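When the files are long, it can help to preview only the lines the substitution would change. One way to do that, sketched here on an inline copy of the example file, is to combine -n (suppress normal output) with the p flag on the substitution (print lines where a replacement happened). GNU sed is assumed for -r and \L.

```shell
# Inline copy of the example file's contents, so the sketch runs anywhere.
changed_lines=$(printf 'Title: Tag example 1\nTags: Tag1, Tag2\nContent\n' |
  sed -n -r "s/(^Tags:)(.+)/\1\L\2/p")

# Only the line that was actually rewritten is printed.
echo "$changed_lines"   # Tags: tag1, tag2
```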
Developing the command, Step 4: Finding the files to change
Now it's time to replace the text of the actual files with the
-i flag (
-i '' on Mac OSX). This operation could be dangerous if the files are not under version control, so we'll use git to find and change only files in the git repo.
$ # outputs the file name and the matching line
$ git grep "^Tags:"
tag-example1.txt:Tags: Tag1, Tag2
tag-example2.txt:Tags: Tag1
tag-example3.txt:Tags: Tag1, Tag2, Tag3
$ # outputs just the file names
$ git grep -l "^Tags:"
tag-example1.txt
tag-example2.txt
tag-example3.txt
$ # outputs the file names separated by a null character
$ git grep -lz "^Tags:"
tag-example1.txt^@tag-example2.txt^@tag-example3.txt^@
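Version control is the safety net used in this tutorial, but sed's -i flag can also keep a copy of each original file if you give it a suffix. The sketch below is not part of the original workflow; it assumes GNU sed and uses a throwaway temp directory with a copy of the example file.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'Title: Tag example 1\nTags: Tag1, Tag2\nContent\n' > "$workdir/tag-example1.txt"

# -i.bak edits the file in place and saves the original as tag-example1.txt.bak.
sed -i.bak -r "s/(^Tags:)(.+)/\1\L\2/" "$workdir/tag-example1.txt"

# Line 2 of each file holds the tags.
edited=$(sed -n 2p "$workdir/tag-example1.txt")
backup=$(sed -n 2p "$workdir/tag-example1.txt.bak")

echo "$edited"   # Tags: tag1, tag2
echo "$backup"   # Tags: Tag1, Tag2

rm -rf "$workdir"
```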
Developing the command, Step 5: The complete command
We can use the
xargs tool to tell
sed to act on the list of files we found in step 4.
$ # outputs the files to be changed
$ git grep -lz "^Tags:" | xargs -0 echo
tag-example1.txt tag-example2.txt tag-example3.txt
$ # The final answer:
$ git grep -lz "^Tags:" | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/g"
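The reason for the null separators (-z and -0) is that file names can contain spaces, which would otherwise be split into separate arguments. Here is a rough sketch of that, with two assumptions so it can run outside a git repo: GNU grep's -Z flag stands in for git grep -z, and the file names are made up.

```shell
# Scratch directory with one file name containing a space (hypothetical names).
workdir=$(mktemp -d)
printf 'Tags: Tag1\n' > "$workdir/my notes.txt"
printf 'Tags: Tag2\n' > "$workdir/plain.txt"
printf 'No tags here\n' > "$workdir/other.txt"

# grep -lZ prints matching file names, each followed by a null byte;
# xargs -0 splits on those null bytes, so "my notes.txt" stays one argument.
grep -lZ "^Tags:" "$workdir"/*.txt | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/"

with_space=$(cat "$workdir/my notes.txt")
plain=$(cat "$workdir/plain.txt")
other=$(cat "$workdir/other.txt")

echo "$with_space"   # Tags: tag1
echo "$plain"        # Tags: tag2
echo "$other"        # No tags here

rm -rf "$workdir"
```

Note that this stand-in drops the "only files under version control" guarantee that git grep provides, so it is meant to illustrate the quoting behaviour only.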
Developing the command, Step 6: Inspect the results with git diff
We can confirm that we got the correct outcome with git diff:
$ git diff
diff --git a/tag-example1.txt b/tag-example1.txt
index 589bbdf..7d57a7d 100644
--- a/tag-example1.txt
+++ b/tag-example1.txt
@@ -1,4 +1,4 @@
 Title: Tag example 1
-Tags: Tag1, Tag2
+Tags: tag1, tag2
 Content
diff --git a/tag-example2.txt b/tag-example2.txt
index addcd3b..d271212 100644
--- a/tag-example2.txt
+++ b/tag-example2.txt
@@ -1,4 +1,4 @@
 Title: Tag example 2
-Tags: Tag1
+Tags: tag1
 Content
diff --git a/tag-example3.txt b/tag-example3.txt
index c8b10e1..42e0a75 100644
--- a/tag-example3.txt
+++ b/tag-example3.txt
@@ -1,4 +1,4 @@
 Title: Tag example 3
-Tags: Tag1, Tag2, Tag3
+Tags: tag1, tag2, tag3
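Reading a diff works well for three files. For a larger directory, one possible complement is to search for any "Tags:" line that still contains an uppercase letter after the colon. The sketch below sets up two made-up files in a temp directory, one already converted and one deliberately missed, and uses plain grep with the POSIX [[:upper:]] class.

```shell
# Scratch directory: one converted file and one that was missed (hypothetical names).
workdir=$(mktemp -d)
printf 'Title: Tag example 1\nTags: tag1, tag2\nContent\n' > "$workdir/done.txt"
printf 'Title: Tag example 2\nTags: Tag1\nContent\n' > "$workdir/missed.txt"

# List files whose Tags: line still has an uppercase letter after the colon.
leftover=$(grep -l -E '^Tags:.*[[:upper:]]' "$workdir"/*.txt)

# Strip the directory part for display.
leftover_name=$(basename "$leftover")
echo "$leftover_name"   # missed.txt

rm -rf "$workdir"
```

An empty result would suggest every tag line was lowercased.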