Managing Files Using the Command Line
The following sections describe how to use command-line tools to manage files on CARC systems.
Project files should be organized within a directory structure of some kind in order to keep files organized, documented, and findable. This may include, for example, having separate directories for raw data, processed data, and code.
To list files and directories, use the ls command. For example, to list files in long format for the current directory, enter:
ls -l
For other directories, add the directory path to the command. Enter man ls or ls --help for more information and to view all available options.
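For example, a hypothetical command listing the contents of a project directory in long format:
ls -l /project/ttrojan_123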
To create a directory, use the mkdir command:
mkdir <dirname>
Enter man mkdir or mkdir --help for more information and to view all available options.
To copy files or directories, use the cp command:
cp /source/path /destination/path
For example, to copy a directory on /scratch to /project, use:
cp -r /scratch/ttrojan/dir /project/ttrojan_123/
The -r option, recursive mode, is needed when copying directories. To print a log of the copying, add the -v option, which enables verbose mode. To copy multiple files or directories to the same destination, simply include additional source paths in the command. Enter man cp or cp --help for more information and to view all available options.
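For example, a hypothetical command copying two directories to the same project directory (dir1 and dir2 are placeholder names; the last path is always the destination):
cp -rv /scratch/ttrojan/dir1 /scratch/ttrojan/dir2 /project/ttrojan_123/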
Note: Do not use the -a or -p options if you are copying locally into a project directory, because this could result in incorrect file permissions.
To move files or directories (i.e., copy and also remove the files from the source), use the mv command instead:
mv /source/path /destination/path
To rename files, you can also use the mv command:
mv /source/filename.txt /source/newfilename.txt
If you are backing up and syncing a directory, use an rsync command. For example:
rsync /source/dir/ /destination/dir/
Rsync will copy only files that are new or have changed in the source directory. Enter man rsync or rsync --help for more information and to view all available options.
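To preview what would be transferred without copying anything, you can add the -n option (dry-run mode) to a recursive, verbose rsync; the paths here are placeholders:
rsync -rvn /source/dir/ /destination/dir/
Rerunning the command without -n performs the actual transfer.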
Note: Do not use the -a or -p options with rsync if you are copying locally into a project directory, because this could result in incorrect file permissions.
To delete files or directories, use the rm command. For example, to delete a directory, use:
rm -r /scratch/ttrojan/dir
The -r option, recursive mode, is needed to remove directories. To remove multiple files or directories, simply add additional paths to the command. Enter man rm or rm --help for more information and to view all available options.
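For example, a sketch removing two hypothetical directories in one command; adding the -i option prompts for confirmation before each removal:
rm -ri /scratch/ttrojan/old_dir1 /scratch/ttrojan/old_dir2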
Checking file disk usage
To check the disk usage of files and directories, use the du -h command:
du -h /path/to/file
Please note that all file systems run ZFS, which compresses files, so the file size on disk may be smaller than the actual file size (on your local computer, for example). Using the du --apparent-size -h command will give the uncompressed file size, and the ls -lh command should give the same result.
To list the ten largest files or subdirectories in the current directory, enter:
du -s * | sort -nr | head -n 10
- du -s *: Summarizes the disk usage of each file and directory
- sort -nr: Sorts numerically, in reverse order
- head -n 10: Shows only the first ten lines of the sorted output
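As a variant, assuming GNU coreutils (whose sort supports the -h option for human-readable numbers), you can sort the output of du -sh directly:
du -sh * | sort -hr | head -n 10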
Enter man du or du --help for more information and to view all available options.
Sharing files
The /project directories are the best place to share files. By default, the members of a project group will have full read, write, and execute permissions for all files in a project directory (i.e., permissions set to 770 = drwxrwx---).
You can check the current permissions for a file or directory with the command ls -l </path/to/file>.
When sharing your files, please keep the following in mind:
- Never set the permissions of your directories to 777 (drwxrwxrwx), which means that anybody can access and delete your files.
- Do not share or change the permissions of your /home1 directory and its subdirectories. If something goes wrong, you may be blocked from logging in because SSH requires strict permissions on these directories.
- Granting other users read permission for your files (r--) and read and execute permissions (r-x) for your directories is typically sufficient for sharing. Granting write permission can result in modified or deleted files, so only provide write permission when actually needed.
You can change file and directory permissions using a chmod command. For example, to provide read and execute permissions (r-x) but not write permission to a project subdirectory for your project group, use:
chmod 750 /project/ttrojan_123/dir
If the subdirectory is nested within other subdirectories, note that the group also needs read and execute permissions for the full hierarchy of parent directories. Granting write permission to a directory allows users to create, modify, or delete files in that directory, subject to the individual file permissions. Enter man chmod or chmod --help for more information and to view all available options.
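For example, a sketch granting the group read and execute permissions at each level of a hypothetical nested path (dir and subdir are placeholder names):
chmod 750 /project/ttrojan_123/dir
chmod 750 /project/ttrojan_123/dir/subdir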
Backing up files
Although the /home1 and /project file systems have some file recovery capabilities, we encourage you to also back up your files elsewhere. There are a few different backup locations to consider:
- Local storage (e.g., external drive)
- Cloud storage
- Research data repositories
To transfer files to local or cloud storage, see our guide for Transferring Files Using the Command Line. Rsync is especially useful for syncing to a backup directory on local storage, and Rclone works similarly for cloud storage. For large transfers to local or cloud storage, Globus can sync two directories in a similar manner.
Research data repositories, such as OSF, Zenodo, Harvard Dataverse, and Dryad, are a special type of cloud storage intended for sharing research data with the wider research community. These services typically have an API that can be used at the command line to upload files directly from CARC systems.
For long-term archival storage, consult the USC Digital Repository or consider using a research data repository.
As part of the process of backing up files, you can also create a single archive file containing multiple files and directories using tar (see section below). This may be useful for versioning and organizing backups.
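For example, one possible convention is to include a date in the archive name so that successive backups are easy to identify (the date and paths here are illustrative):
tar -czvf dir_backup_2024-05-01.tar.gz /project/ttrojan_123/dir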
Archiving and compressing files
Archiving and compressing files can help simplify file organization and save storage space, such as after a project is completed and the associated files are not needed in the immediate future. It is also useful for packaging project files in order to distribute them to other researchers, for example. You can use a combination of the programs tar for archiving files and gzip or xz for compressing files.
Archiving with tar
To create an archive file from a directory of files, use the tar command. For example:
tar -cvf <filename>.tar <dir>
To add multiple directories and files, simply add the paths to these directories and files in the command. To check the integrity of the files, add the -W option, which verifies the archive after it is written.
To extract the archive, use the -x option instead of the -c option. For example:
tar -xvf <filename>.tar
Note that the .tar file will be larger in size than the sum of all the files being archived, primarily because of the added file headers in the archive file. Enter man tar or tar --help for more information and to view all available options.
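For instance, a sketch that archives two hypothetical directories and a file in one command and verifies the result with the -W option:
tar -cvWf project.tar data/ scripts/ notes.txt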
Compressing with gzip
To compress files using gzip, use a gzip command. For example:
gzip -v <filename>
This will create a .gz file. Including the -v option, verbose mode, will print the compression ratio. There are 9 levels of compression, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time, add the -9 option.
To uncompress a .gz file, add the -d option:
gzip -dv <filename>.gz
Enter man gzip or gzip --help for more information and to view all available options. In addition, the pigz module is a parallel implementation of gzip that provides faster compression and decompression times; load it with module load pigz. It can be used as a drop-in replacement for gzip.
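For example, once the module is loaded, the same style of commands applies; the -p option sets the number of cores to use, and the filenames are placeholders:
pigz -p 4 -v <filename>
pigz -p 4 -dv <filename>.gz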
Compressing with xz
For better compression ratios or for maximum compression, use xz instead of gzip. With xz, you can also use multiple cores to speed up the compression time. For example, to compress using 4 cores, add the -T4 option:
xz -v -T4 <filename>
This will create a .xz file. Including the -v option, verbose mode, will print compression progress and related information. There are 9 levels of compression, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time and memory required, add the -9e option.
To uncompress an .xz file, add the -d option:
xz -dv -T4 <filename>.xz
Enter man xz or xz -H for more information and to view all available options.
Archiving and compressing with tar
You can also archive and compress with one command using tar with the -z option, which uses gzip compression by default. For example:
tar -czvf <filename>.tar.gz <dir>
Alternatively, to use xz to compress, use the -J option instead. In contrast to using gzip or xz directly, tar does not delete the source files by default. Add the --remove-files option to do so.
To uncompress and unarchive in one command, use the -x option. For example:
tar -xvf <filename>.tar.gz
This will extract the contents of the archive into the current directory. tar will automatically detect which decompression program to use; note that it will not automatically delete the compressed archive file after extracting the files.
Software for Linux is typically distributed as a .tar.gz file, so a command like the above will extract the source code or binary files into the current directory.
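For example, a sketch extracting a hypothetical downloaded package into a specific directory using the -C option:
mkdir -p $HOME/software
tar -xzvf software-1.0.tar.gz -C $HOME/software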
Archiving and compressing before transferring files
Creating and compressing a single archive file can be useful before transferring files to or from CARC systems, especially for directories with a large number of files (e.g., > 1,000, regardless of the total size of those files). Each file has associated metadata, and processing the metadata for many individual files can slow down a transfer. Compressing files also reduces the amount of data that needs to be transferred. However, it takes time to compress and uncompress files, so the total transfer time may not necessarily decrease, depending on factors like network speeds. With fast network speeds relative to the total transfer size, it is typically not worth compressing files.
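As a sketch of this workflow, with placeholder paths and a hypothetical remote host, you could create a compressed archive and then transfer the single resulting file (for example, with scp):
tar -czvf dir.tar.gz /project/ttrojan_123/dir
scp dir.tar.gz ttrojan@remote.example.com:/backup/path/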
Managing file I/O
File input/output (I/O) refers to reading and writing data to disk. The following offers advice on managing I/O for your compute jobs.
First, try to avoid I/O altogether. Process data and commands in memory where possible, instead of writing to and reading from disk. This provides the best performance, though the size of the data and the resulting memory requirements may limit this strategy.
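For example, a shell pipeline passes intermediate results between commands in memory rather than through intermediate files on disk (the filenames here are placeholders):
grep "pattern" input.txt | sort | uniq -c > counts.txt
Running the same steps as separate commands, each writing its output to an intermediate file, would generate avoidable I/O.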
Second, use the /project, /scratch, or /scratch2 directories for I/O when needed, all of which are located on high-performance, parallel file systems. This includes data, software installations and packages, and programs and scripts.
Third, try to avoid using the local /tmp directories on compute nodes in most cases because these are limited to 1 GB and can be shared with other jobs. To automatically redirect temporary files to another location, set the TMPDIR environment variable. For example, create a tmp subdirectory in one of your scratch directories and enter the following:
export TMPDIR=/scratch/ttrojan/tmp
Including this line in your ~/.bashrc will automatically set the variable every time you log in. To change TMPDIR on a job-by-job basis, add a similar line to your job scripts.
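For example, a minimal sketch of the top of such a job script, assuming Slurm-style job scripts and the example username ttrojan:
#!/bin/bash
#SBATCH --time=1:00:00
# Redirect temporary files to a scratch directory for this job
mkdir -p /scratch/ttrojan/tmp
export TMPDIR=/scratch/ttrojan/tmp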
Currently, CARC systems do not support the use or storage of sensitive data. If your research work includes sensitive data, including but not limited to HIPAA-, FERPA-, or CUI-regulated data, see our Secure Computing user guides or contact us at email@example.com before using our systems.