SWC @ University of Twente (General information)
Nov 17 2025, 09:00-14:00 CEST
Néstor DelaPaz-Ruíz
Example: Computational usability and reproducibility
Please consider providing feedback (UT credentials required)
Joel H. Nitta
Download shell-lesson-data.zip and move the file to your Desktop.
Unzip/extract the file. You should end up with a new folder called shell-lesson-data on your Desktop.
For instructions by operating system, see the Shell Lesson
Humans interact with computers using GUI (graphical user interface) or CLI (command-line interface).
GUI: Intuitive, menu-driven, but not efficient for repetitive tasks.
CLI (Unix shell): Efficient for repetitive tasks, automates tasks quickly.
The shell interprets and runs the commands typed by the user.
Popular Unix shell: Bash (Bourne Again SHell).
Benefits of using the shell:
When you open the shell, you should see something like this:
$
The $ is the prompt, where you type your commands
Depending on your setup, it may look a little different, for example:
nelle@localhost $
ls
The first command we will learn is ls, which lists the contents of your current directory (we will come back to this later):
Desktop Downloads Movies Pictures
Documents Library Music Public
Nelle uses goostats.sh to measure protein abundance.
Using a GUI, Nelle would need to process 1520 files by hand, taking over 12 hours. Can Nelle do this more efficiently with the shell?
How can I move around on my computer?
How can I see what files and directories I have?
How can I specify the location of a file or directory on my computer?
Use pwd to show your current working directory (where you “are” in your computer)
Directory names in a path are separated by the / character.
For example, Nelle’s files are stored in /Users/nelle.
ls
Use the -F option to adjust the output:
/ indicates that this is a directory
@ indicates a link
* indicates an executable
You can clear a cluttered terminal with clear
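The -F markers can be tried out safely in a scratch directory. This is a minimal sketch: the directory comes from mktemp, and the names thesis, notes.txt, run.sh and latest are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched
demo=$(mktemp -d)

mkdir "$demo/thesis"                     # a directory
touch "$demo/notes.txt"                  # a regular file
printf '#!/bin/sh\n' > "$demo/run.sh"
chmod +x "$demo/run.sh"                  # an executable file
ln -s notes.txt "$demo/latest"           # a symbolic link

# -F appends / to directories, * to executables and @ to links
listing=$(ls -F "$demo")
echo "$listing"
```

Regular files such as notes.txt are listed without any marker.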
Get a help menu by adding --help:
Or, add man in front of the command:
ls
If pwd displays /Users/backup, and -r tells ls to display things in reverse order, what command(s) will result in the following output:
pnas_sub/ pnas_final/ original/
ls pwd
ls -r -F
ls -r -F /Users/backup
cd
..
.. takes you one directory higher
..
Note that if you use ls -a to show everything, you will see ..
~
You can use ~ to move to your home directory
-
You can use - to move back to the directory you just came from
If you type a path that does not start with /, it means you are talking about a folder or file relative to your current location
If you type a path that starts with /, it means you are talking about a path from the root of the file system
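The navigation shortcuts and the two kinds of path can be sketched in a scratch directory. The folder names backup and thing are borrowed from the exercises; the scratch location itself is an assumption.

```shell
# Scratch area; pwd -P resolves symbolic links so paths compare cleanly
base=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$base/backup" "$base/thing"

cd "$base/thing"     # absolute path: starts with /
cd ..                # one directory higher
up=$(pwd -P)

cd backup            # relative path: resolved from the current location
rel=$(pwd -P)

cd "$base/thing"
cd - > /dev/null     # back to the directory we just came from (cd - prints its name)
back=$(pwd -P)

echo "$up"
echo "$rel"
echo "$back"
```

Here up is the scratch directory itself, while rel and back both end in /backup.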
If pwd displays /Users/thing, what will ls -F ../backup display?
../backup: No such file or directory
2012-12-01 2013-01-08 2013-01-27
2012-12-01/ 2013-01-08/ 2013-01-27/
original/ pnas_final/ pnas_sub/
Long options start with -- (ls --all)
Short options start with - (ls -a)
Spaces matter (ls-F means the command “ls-F”)
Capitalization matters (ls -s vs ls -S)
The shell will finish typing the names of files and folders for you when you press the tab key
Try it from ~/Desktop/shell-lesson-data/
(press the tab key twice to see what files start with goo)
How can I create, copy, and delete files and directories?
How can I edit files?
mkdir
Make sure we are in shell-lesson-data, then enter exercise-data/writing
Have a look around, then create a new directory called thesis:
mkdir
You can create a nested directory using -p
Check what you did (-R is the option to ls that will list all nested subdirectories within a directory):
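A minimal sketch of both steps, run in a scratch directory. The nested path project/data/raw is a made-up example, not part of the lesson data.

```shell
demo=$(mktemp -d)
cd "$demo"

mkdir thesis                   # a single new directory
mkdir -p project/data/raw      # -p also creates the missing parent directories

# -R lists every nested subdirectory
tree=$(ls -R project)
echo "$tree"
```

Without -p, mkdir project/data/raw would fail because project/data does not exist yet.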
Do not begin a name with - (dash).
Stick to letters, numbers, . (period or ‘full stop’), - (dash) and _ (underscore).
nano
nano is a text editor program. It will create a file and open it for editing.
Press Ctrl+o to save (as indicated by ^O), then Ctrl+x to exit
touch
rm
rm is forever! (no recycle bin). Be very careful when you use it.
mv
Enter shell-lesson-data/exercise-data/writing:
Let’s rename draft.txt:
(check the results with ls)
mv
As with rm, there is no “undo” button for mv: it will overwrite any file with the same name, so use it carefully!
Let’s move quotes.txt into our current directory:
Check ls thesis
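The rename and the move can be rehearsed on placeholder files. In this sketch the scratch directory and the file contents are assumptions; only the names draft.txt, quotes.txt and thesis come from the steps above.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir thesis
echo "placeholder text" > thesis/draft.txt

# Rename: same directory, new name
mv thesis/draft.txt thesis/quotes.txt

# Move: same name, new location (. is the current directory)
mv thesis/quotes.txt .

ls            # quotes.txt now sits next to thesis
ls thesis     # prints nothing: the directory is empty
```

Because mv overwrites silently, it is worth running ls before and after each step while learning.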
cp
cp is similar to mv, but copies instead of moves
cp
Try to cp a folder:
cp: -r not specified; omitting directory 'thesis'
cp
To copy a folder, use the -r (recursive) option
rm -r
As with -r for cp, you need -r to delete a folder:
From shell-lesson-data/exercise-data:
Moving multiple files at once is handy, but that was a lot of typing
We can use * and ? to match multiple file names. These are called “wildcards”
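A quick way to see what a wildcard matches is to let echo print the expansion. This sketch uses empty stand-in files in a scratch directory, named after the files in the alkanes folder.

```shell
demo=$(mktemp -d)
cd "$demo"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# The shell replaces each pattern with the matching names before echo runs
star=$(echo p*.pdb)         # * stands for zero or more characters
one=$(echo ?ethane.pdb)     # ? stands for exactly one character
three=$(echo ???ane.pdb)    # three characters, then "ane.pdb"

echo "$star"
echo "$one"
echo "$three"
```

Using echo first is a cheap safety check before handing the same pattern to rm or mv.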
Consider the files in shell-lesson-data/exercise-data/alkanes:
*: Represents zero or more characters.
*.pdb matches ethane.pdb, propane.pdb, etc.
p*.pdb matches pentane.pdb, propane.pdb.
?: Represents exactly one character.
?ethane.pdb matches methane.pdb.
*ethane.pdb matches ethane.pdb, methane.pdb.
???ane.pdb matches cubane.pdb, ethane.pdb, octane.pdb.
ls *.pdf in a directory with only .pdb files results in an error.
When run in the alkanes directory, which ls command(s) will produce this output?
ethane.pdb methane.pdb
ls *t*ane.pdb
ls *t?ne.*
ls *t??ne.pdb
ls ethane.*
Jamie is working on a project, and she sees that her files aren’t very well organized:
The fructose.dat and sucrose.dat files contain output from her data analysis. What command(s) covered in this lesson does she need to run so that the commands below will produce the output shown?
You’re starting a new experiment and would like to duplicate the directory structure from your previous experiment so you can add new data.
Assume that the previous experiment is in a folder called 2016-05-18, which contains a data folder that in turn contains folders named raw and processed that contain data files. The goal is to copy the folder structure of the 2016-05-18 folder into a folder called 2016-05-20 so that your final directory structure looks like this:
2016-05-20/
└── data
├── processed
└── raw
Which of the following sets of commands would achieve this objective? What would the other commands do?
How can I combine existing commands to produce a desired output?
How can I show only part of the output?
pdb files?
Nelle needs to determine which pdb file in the shell-lesson-data/exercise-data/alkanes directory has the fewest lines of text.
She can do this with wc, which counts the lines, words, and characters in a file:
Apply wc to all files
Check the number of lines of text in all .pdb files:
This works for a few, but what if we had thousands?
Let’s send the results to a file with > and read out the contents with cat:
(note that >> will add to an existing file)
Next, use sort to sort the output, then save it to a file, then finally get the first entry of that file with head:
Whew! We did it!
tail is similar to head, but prints lines from the end of a file instead.
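The redirect, sort, head and tail steps can be sketched end to end on small stand-in files. The names short.pdb, medium.pdb and long.pdb and their contents are made up; lengths.txt and sorted-lengths.txt are the intermediate files mentioned in this lesson.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'a\nb\nc\n' > long.pdb      # 3 lines
printf 'a\n'       > short.pdb     # 1 line
printf 'a\nb\n'    > medium.pdb    # 2 lines

wc -l *.pdb > lengths.txt                  # > writes the counts to a file
sort -n lengths.txt > sorted-lengths.txt   # -n sorts numerically
shortest=$(head -n 1 sorted-lengths.txt)   # first line: the fewest lines
last=$(tail -n 1 sorted-lengths.txt)       # last line: the "total" row from wc

cat sorted-lengths.txt
```

The exact padding in front of each number differs between systems, so only the file names are worth comparing.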
Consider the file shell-lesson-data/exercise-data/animal-counts/animals.csv. After these commands, select the answer that corresponds to the file animals-subset.csv:
animals.csv
animals.csv
animals.csv
animals.csv
So that worked, but it relied on two intermediate text files (lengths.txt and sorted-lengths.txt). That is confusing.
We can streamline the analysis by sending the results of one command directly into the input of another with the pipe: |
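As a sketch, here is the same shortest-file search written as one pipeline. The three stand-in files are made up for illustration.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'a\nb\nc\n' > long.pdb
printf 'a\n'       > short.pdb
printf 'a\nb\n'    > medium.pdb

# Each | feeds the output of the command on its left into the command on its right
shortest=$(wc -l *.pdb | sort -n | head -n 1)
echo "$shortest"
```

No intermediate files are written, so there is nothing to clean up afterwards.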
north-pacific-gyre
300 NENE02040B.txt
300 NENE02040Z.txt
300 NENE02043A.txt
300 NENE02043B.txt
5040 total
Z?
All of her samples should be marked A or B; by convention, her lab uses Z to indicate samples with missing information. To find others like it, she does this:
ls *Z.txt
To leave those files out, she can select only NENE*A.txt or NENE*B.txt
Suppose you want to delete your processed data files, and only keep your raw files and processing script to save storage.
The raw files end in .dat and the processed files end in .txt.
Which of the following would remove all the processed data files, and only the processed data files?
rm ?.txt
rm *.txt
rm * .txt
rm *.*
A file called animals.csv (in the shell-lesson-data/exercise-data/animal-counts folder) contains the following data:
What are the contents of final.txt? (the sort -r command sorts in reverse)
Scenario: Extract classification from genome files.
Files: basilisk.dat, minotaur.dat, unicorn.dat in exercise-data/creatures
Structure:
Our goal: Print the classification (2nd line) for each file.
General form of a loop:
For our situation:
$filename is a variable that gets filled in by the shell
The shell prompt changes from $ to > and back again as we were typing in our loop.
A semicolon, ;, can be used to separate two commands written on a single line.
If the shell prints > or $ then it expects you to type something, and the symbol is a prompt.
If you type > or $ yourself, it is an instruction from you that the shell should redirect output or get the value of a variable.
You can put the variable name in curly braces: ${filename}. This makes it easier to distinguish the variable from surrounding text (like ${file}name)
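A minimal sketch of such a loop. The three .dat files are recreated with placeholder content, on the assumption stated above that the classification sits on the second line; the real files hold different text.

```shell
demo=$(mktemp -d)
cd "$demo"

# Placeholder files: line 2 stands in for the classification
for name in basilisk minotaur unicorn
do
    printf 'COMMON NAME: %s\nCLASSIFICATION: class-of-%s\nUPDATED: unknown\n' \
        "$name" "$name" > "$name.dat"
done

# For each file: keep the first two lines, then keep the last of those
result=$(
    for filename in basilisk.dat minotaur.dat unicorn.dat
    do
        head -n 2 "${filename}" | tail -n 1
    done
)
echo "$result"
```

Typed on one line, the same loop reads: for filename in basilisk.dat minotaur.dat unicorn.dat; do head -n 2 "${filename}" | tail -n 1; done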
We used for filename and $filename, but we could just as easily have said for x and $x. Is this a good idea?
No, because it is not clear what the variable refers to. It is better to use variable names that convey their meaning.
This exercise refers to the shell-lesson-data/exercise-data/alkanes directory. ls *.pdb gives the following output:
What is the output of the following code?
Now, what is the output of the following code?
Why do these two loops give different outputs?
What would be the output of running the following loop in the shell-lesson-data/exercise-data/alkanes directory?
cubane.pdb, octane.pdb and pentane.pdb are listed.
cubane.pdb is listed.
How would the output differ from using this command instead?
cubane.pdb and octane.pdb will be listed.
octane.pdb will be listed.
In the shell-lesson-data/exercise-data/alkanes directory, what is the effect of this loop?
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
Also in the shell-lesson-data/exercise-data/alkanes directory, what would be the output of the following loop?
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
ethane.pdb will be saved to a file called all.pdb.
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.
Context: Nelle needs to process protein sample files using goostats.sh.
Script: goostats.sh calculates statistics from a protein sample file.
Nelle decides to build commands step-by-step.
Step 1: Select the right input files.
A or B, not Z.
Next step: decide what to call the files that the goostats.sh analysis program will create.
Prefixing each input file’s name with stats seems clear:
The ; has the same effect as a line-break
goostats.sh just writes out the results file without printing anything to the screen. Let’s kill the script with Ctrl+c, then add an echo to display the name of the file:
We can inspect the output by opening another shell window
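Before running the real analysis, the file selection and the stats naming can be rehearsed with echo alone. This is a dry-run sketch: the files are empty stand-ins, NENE02040A.txt is a made-up name, and the goostats.sh call is left out because its arguments are not shown here.

```shell
demo=$(mktemp -d)
cd "$demo"
touch NENE02040A.txt NENE02040B.txt NENE02040Z.txt NENE02043A.txt NENE02043B.txt

# Only files ending in A or B are selected; the Z file is skipped
plan=$(
    for datafile in NENE*A.txt NENE*B.txt
    do
        echo "$datafile -> stats-$datafile"
    done
)
echo "$plan"
```

Once the printed plan looks right, the echo line is the place where the real command would go.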
Move into alkanes/.
Create a file called middle.sh and open it with nano:
cd alkanes
nano middle.sh
Type this in the script (it should look familiar):
Now we can run the script:
This script would be much more useful if we could run it on any file, not just octane.pdb.
Open it again in nano and modify it like so:
The "$1" means ‘the first filename (or other argument) on the command line’. Try it out!
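A minimal sketch of such a script. The body of middle.sh is an assumption (a head and tail pipeline that selects lines 11 to 15), and octane.pdb is replaced by a 20-line stand-in file so the result is easy to check.

```shell
demo=$(mktemp -d)
cd "$demo"
seq 1 20 > octane.pdb    # stand-in: 20 numbered lines instead of real data

# Write the script; the quoted 'EOF' keeps "$1" from being expanded now
cat > middle.sh <<'EOF'
# Select lines 11-15 of the file given as the first argument
head -n 15 "$1" | tail -n 5
EOF

middle=$(bash middle.sh octane.pdb)
echo "$middle"
```

The double quotes around $1 keep the script working when a file name contains spaces.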
In the alkanes directory, imagine you have a shell script called script.sh containing the following commands:
While you are in the alkanes directory, you type the following command:
Which of the following outputs would you expect to see?
.pdb in the alkanes directory
.pdb in the alkanes directory
alkanes directory
*.pdb
So far, we have been able to input one file into a script with something like "$1"
But what if we have many files we want to input?
Solution: "$@"
"$@" means ‘All of the command-line arguments to the shell script’
sorted.sh
Try it!
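One possible sorted.sh, as a sketch. Its contents are an assumption (counting lines with wc and sorting the counts), and the three .pdb files are made-up stand-ins.

```shell
demo=$(mktemp -d)
cd "$demo"
printf 'a\nb\nc\n' > long.pdb
printf 'a\n'       > short.pdb
printf 'a\nb\n'    > medium.pdb

cat > sorted.sh <<'EOF'
# Sort files by their number of lines.
# "$@" expands to every argument, each one kept as a separate word.
wc -l "$@" | sort -n
EOF

# The shell expands *.pdb first, so the script receives three file names
report=$(bash sorted.sh *.pdb)
echo "$report"
```

Running bash sorted.sh short.pdb long.pdb works the same way with just two arguments.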
For this question, consider the shell-lesson-data/exercise-data/alkanes directory once again. This contains a number of .pdb files in addition to any other files you may have created. Explain what each of the following three scripts would do when run as bash script1.sh *.pdb, bash script2.sh *.pdb, and bash script3.sh *.pdb respectively.
grep?
“To grep something” has become a verb kind of like “To google something”
grep is a computer program that searches for text
Our examples will use haiku about programming that were featured in Salon magazine
grep example
Search for not:
Search for The
This also matches Thesis. How can we match only the whole word The?
Use -w (for “word”)
-n
-n: Show the line number
-w: Only match whole words
-i: Make search case-insensitive
(you can also combine these into -nwi)
Invert the match with -v:
Search directories recursively with -r:
The re in grep stands for “Regular Expressions”
Regular expressions are kind of like wildcards: they can match certain patterns in text
This is one of the most powerful features of grep
For example, this finds any text with an “o” as the second character (-E turns on matching via regular expressions, the ^ matches the start of a line, and the . matches any single character):
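The options and the regular expression can be tried on a small made-up file. sample.txt and its four lines are placeholders, not the haiku used in the lesson.

```shell
demo=$(mktemp -d)
cd "$demo"
cat > sample.txt <<'EOF'
The draft is short
Thesis notes go here
Today it is not working
software is like that
EOF

plain=$(grep -c "The" sample.txt)         # -c counts matching lines: The and Thesis
whole_word=$(grep -n -w "The" sample.txt) # whole word only, with its line number
second_o=$(grep -E "^.o" sample.txt)      # any character, then o, at the start of a line
inverted=$(grep -v -i "the" sample.txt)   # lines that do not contain "the" in any case

echo "$whole_word"
echo "$second_o"
```

Here -w drops the Thesis line, and the ^.o pattern keeps only the lines starting with To and so.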
grep
Which command would result in the following output:
grep "of" haiku.txt
grep -E "of" haiku.txt
grep -w "of" haiku.txt
grep -i "of" haiku.txt
find
While grep finds lines in files, the find command finds files themselves.
Try it out from shell-lesson-data/exercise-data:
Find directories with -type d
Find files with -type f
Find by name with -name
Wait a sec - I thought there were more text files?
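The shell expands an unquoted wildcard before find runs, so the command becomes find . -name numbers.txt
Quote the pattern given to -name
wc and find
Count the lines in all .txt files:
grep and find
Search all .txt files in the current directory:
By combining relatively simple, small programs using techniques like $() and the pipe (|), we can achieve very powerful results with a small amount of code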
This is the beauty of the shell!
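A sketch of both techniques on a made-up directory tree. The layout loosely imitates exercise-data, but every file here is a placeholder.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir -p creatures writing
printf 'a\nb\n' > creatures/basilisk.dat
echo "searching for text" > writing/haiku.txt
echo "1" > numbers.txt

# Quote the pattern so find, not the shell, interprets the wildcard
txt_files=$(find . -name "*.txt" | sort)

# $() runs find first and pastes its output in as arguments to grep;
# -l prints only the names of files that contain a match
hits=$(grep -l "searching" $(find . -name "*.txt"))

echo "$txt_files"
echo "$hits"
```

The unquoted $() splits on spaces, so this form assumes file names without spaces.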
Remember, the -v option to grep inverts pattern matching, so that only lines which do not match the pattern are printed. Given that, which of the following commands will find all .dat files in creatures except unicorn.dat? Once you have thought about your answer, you can test the commands in the shell-lesson-data/exercise-data directory.
find creatures -name "*.dat" | grep -v unicorn
find creatures -name *.dat | grep -v unicorn
grep -v "unicorn" $(find creatures -name "*.dat")