NAME
agedu - correlate disk usage with last-access times to identify large
and disused data
SYNOPSIS
agedu [ options ] action [action...]
DESCRIPTION
agedu scans a directory tree and produces reports about how much disk
space is used in each directory and subdirectory, and also how much of
that space is taken up by files whose last-access times are a long time
in the past.
In other words, agedu is a tool you might use to help you free up disk
space. It lets you see which directories are taking up the most space,
as du does; but unlike du, it also distinguishes between large
collections of data which are still in use and ones which have not been
accessed in months or years - for instance, large archives downloaded,
unpacked, used once, and never cleaned up. Where du helps you find
what's using your disk space, agedu helps you find what's wasting your
disk space.
agedu has several operating modes. In one mode, it scans your disk and
builds an index file containing a data structure which allows it to
efficiently retrieve any information it might need. Typically, you
would use it in this mode first, and then run it in one of a number of
‘query’ modes to display a report of the disk space usage of a
particular directory and its subdirectories. Those reports can be
produced as plain text (much like du) or as HTML. agedu can even run as
a miniature web server, presenting each directory's HTML report with
hyperlinks to let you navigate around the file system to similar
reports for other directories.
So you would typically start using agedu by telling it to do a scan of
a directory tree and build an index. This is done with a command such
as
$ agedu -s /home/fred
which will build a large data file called agedu.dat in your current
directory. (If that current directory is inside /home/fred, don't worry
- agedu is smart enough to discount its own index file.)
Having built the index, you would now query it for reports of disk
space usage. If you have a graphical web browser, the simplest and
nicest way to query the index is by running agedu in web server mode:
$ agedu -w
which will print (among other messages) a URL on its standard output
along the lines of
URL: http://127.0.0.1:48638/
(That URL will always begin with ‘127.’, meaning that it's in the
localhost address space. So only processes running on the same computer
can even try to connect to that web server, and also there is access
control to prevent other users from seeing it - see below for more
detail.)
Now paste that URL into your web browser, and you will be shown a
graphical representation of the disk usage in /home/fred and its
immediate subdirectories, with varying colours used to show the
difference between disused and recently-accessed data. Click on any
subdirectory to descend into it and see a report for its subdirectories
in turn; click on parts of the pathname at the top of any page to
return to higher-level directories. When you've finished browsing, you
can just press Ctrl-D to send an end-of-file indication to agedu, and
it will shut down.
After that, you probably want to delete the data file agedu.dat, since
it's pretty large. In fact, the command agedu -R will do this for you;
and you can chain agedu commands on the same command line, so that
instead of the above you could have done
$ agedu -s /home/fred -w -R
for a single self-contained run of agedu which builds its index, serves
web pages from it, and cleans it up when finished.
If you don’t have a graphical web browser, you can do text-based
queries as well. Having scanned /home/fred as above, you might run
$ agedu -t /home/fred
which again gives a summary of the disk usage in /home/fred and its
immediate subdirectories; but this time agedu will print it on standard
output, in much the same format as du. If you then want to find out how
much old data is there, you can add the -a option to show only files
last accessed a certain length of time ago. For example, to show only
files which haven't been looked at in six months or more:
$ agedu -t /home/fred -a 6m
That’s the essence of what agedu does. It has other modes of operation
for more complex situations, and the usual array of configurable
options. The following sections contain a complete reference for all
its functionality.
OPERATING MODES
This section describes the operating modes supported by agedu. Each of
these is in the form of a command-line option, sometimes with an
argument. Multiple operating-mode options may appear on the command
line, in which case agedu will perform the specified actions one after
another. For instance, as shown in the previous section, you might want
to perform a disk scan and immediately launch a web server giving
reports from that scan.
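The chaining described above can be wrapped in a small shell function.
The following is only a sketch: the function name old_data_report and
the temporary index location are illustrative assumptions, not part of
agedu. The options it uses (-f, -s, -t, -a and -R) are the ones
documented in this manual.

```shell
# Hypothetical helper (not part of agedu): scan DIR into a temporary
# index, print a text report of data not accessed for at least AGE,
# then let agedu delete the index again.
old_data_report() {
    dir=$1
    age=$2
    # Keep the index in a fresh private directory so that an existing
    # agedu.dat in the current directory is left alone.
    tmpdir=$(mktemp -d) || return 1
    # Actions run left to right: scan (-s), report (-t), remove (-R).
    agedu -f "$tmpdir/index.dat" -s "$dir" -t "$dir" -a "$age" -R
    status=$?
    rmdir "$tmpdir" 2>/dev/null
    return $status
}
```

With that function defined, a report on data untouched for six months
would be requested as
$ old_data_report /home/fred 6m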
-s directory or --scan directory
In this mode, agedu scans the file system starting at the
specified directory, and indexes the results of the scan into a
large data file which other operating modes can query.
By default, the scan is restricted to a single file system
(since the expected use of agedu is that you would probably use
it because a particular disk partition was running low on
space). You can remove that restriction using the --cross-fs
option; other configuration options allow you to include or
exclude files or entire subdirectories from the scan. See the
next section for full details of the configurable options.
The index file is created with restrictive permissions, in case
the file system you are scanning contains confidential
information in its structure.
Index files are dependent on the characteristics of the CPU
architecture you created them on. You should not expect to be
able to move an index file between different types of computer
and have it continue to work. If you need to transfer the
results of a disk scan to a different kind of computer, see the
-D and -L options below.
-w or --web
In this mode, agedu expects to find an index file already
written. It allocates a network port, and starts up a web server
on that port which serves reports generated from the index file.
By default it invents its own URL and prints it out.
The web server runs until agedu receives an end-of-file event on
its standard input. (The expected usage is that you run it from
the command line, immediately browse web pages until you're
satisfied, and then press Ctrl-D.)
In case the index file contains any confidential information
about your file system, the web server protects the pages it
serves from access by other people. On Linux, this is done
transparently by using /proc/net/tcp to check the owner
of each incoming connection; failing that, the web server will
require a password to view the reports, and agedu will print the
password it invented on standard output along with the URL.
Configurable options for this mode let you specify your own
address and port number to listen on, and also specify your own
choice of authentication method (including turning
authentication off completely) and a username and password of
your choice.
-t directory or --text directory
In this mode, agedu generates a textual report on standard
output, listing the disk usage in the specified directory and
all its subdirectories down to a fixed depth. By default that
depth is 1, so that you see a report for the directory itself and
all of its immediate subdirectories. You can configure a
different depth using -d, described in the next section.
Used on its own, -t merely lists the total disk usage in each
subdirectory; agedu's additional ability to distinguish unused
from recently-used data is not activated. To activate it, use
the -a option to specify a minimum age.
The directory structure stored in agedu's index file is treated
as a set of literal strings. This means that you cannot refer to
directories by synonyms. So if you ran agedu -s ., then all the
path names you later pass to the -t option must be either ‘.’ or
begin with ‘./’. Similarly, symbolic links within the directory
you scanned will not be followed; you must refer to each
directory by its canonical, symlink-free pathname.
-R or --remove
In this mode, agedu deletes its index file. Running just agedu
-R on its own is therefore equivalent to typing rm agedu.dat.
However, you can also put -R on the end of a command line to
indicate that agedu should delete its index file after it
finishes performing other operations.
-D or --dump
In this mode, agedu reads an existing index file and produces a
dump of its contents on standard output. This dump can later be
loaded into a new index file, perhaps on another computer.
-L or --load
In this mode, agedu expects to read a dump produced by the -D
option from its standard input. It constructs an index file from
that dump, exactly as it would have if it had read the same data
from a disk scan in -s mode.
-S directory or --scan-dump directory
In this mode, agedu will scan a directory tree and convert the
results straight into a dump on standard output, without
generating an index file at all. So running agedu -S /path
should produce equivalent output to that of agedu -s /path -D,
except that the latter will produce an index file as a side
effect whereas -S will not.
(The output will not be exactly identical, due to a difference
in treatment of last-access times on directories. However, it
should be effectively equivalent for most purposes. See the
documentation of the --dir-atime option in the next section for
further detail.)
-H directory or --html directory
In this mode, agedu will generate an HTML report of the disk
usage in the specified directory and its immediate
subdirectories, in the same form that it serves from its web
server in -w mode. However, this time, a single HTML report will
be generated and simply written to standard output, with no
hyperlinks pointing to other similar pages.
OPTIONS
This section describes the various configuration options that affect
agedu's operation in one mode or another.
The following option affects nearly all modes (except -S):
-f filename or --file filename
Specifies the location of the index file which agedu creates,
reads or removes depending on its operating mode. By default,
this is simply ‘agedu.dat’, in whatever is the current working
directory when you run agedu.
The following options affect the disk-scanning modes, -s and -S:
--cross-fs and --no-cross-fs
These configure whether or not the disk scan is permitted to
cross between different file systems. The default is not to:
agedu will normally skip over subdirectories on which a
different file system is mounted. This makes it convenient when
you want to free up space on a particular file system which is
running low. However, in other circumstances you might wish to
see general information about the use of space no matter which
file system it's on (for instance, if your real concern is your
backup media running out of space, and if your backups do not
treat different file systems specially); in that situation, use
--cross-fs.
(Note that this default is the opposite way round from the
corresponding option in du.)
--prune wildcard and --prune-path wildcard
These cause particular files or directories to be omitted
entirely from the scan. If agedu's scan encounters a file or
directory whose name matches the wildcard provided to the
--prune option, it will not include that file in its index, and
also if it's a directory it will skip over it and not scan its
contents.
Note that in most Unix shells, wildcards will probably need to
be escaped on the command line, to prevent the shell from
expanding the wildcard before agedu sees it.
--prune-path is similar to --prune, except that the wildcard is
matched against the entire pathname instead of just the filename
at the end of it. So whereas --prune *a*b* will match any file
whose actual name contains an a somewhere before a b,
--prune-path *a*b* will also match a file whose name contains b and
which is inside a directory containing an a, or any file inside
a directory of that form, and so on.
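The effect of shell quoting on a wildcard can be seen without running
agedu at all. The sketch below uses a throwaway directory and two
made-up file names (a1b and a2b, chosen purely for illustration) to
compare what a program receives when the wildcard is left bare and when
it is quoted.

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/a1b" "$demo_dir/a2b"

# Unquoted: the shell replaces the wildcard with the matching names.
unquoted=$(cd "$demo_dir" && echo *a*b*)
# Quoted: the wildcard is passed through to the program literally.
quoted=$(cd "$demo_dir" && echo '*a*b*')

echo "unquoted: $unquoted"   # prints "unquoted: a1b a2b"
echo "quoted:   $quoted"     # prints "quoted:   *a*b*"
rm -r "$demo_dir"
```

So a prune option would normally be written with the wildcard quoted,
for example
$ agedu -s /home/fred --prune '*a*b*'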
--exclude wildcard and --exclude-path wildcard
These cause particular files or directories to be omitted from
the index, but not from the scan. If agedu's scan encounters a
file or directory whose name matches the wildcard provided to
the --exclude option, it will not include that file in its index
- but unlike --prune, if the file in question is a directory it
will still scan its contents and index them if they are not
ruled out themselves by --exclude options.
As above, --exclude-path is similar to --exclude, except that
the wildcard is matched against the entire pathname.
--include wildcard and --include-path wildcard
These cause particular files or directories to be re-included in
the index and the scan, if they had previously been ruled out by
one of the above exclude or prune options. You can interleave
include, exclude and prune options as you wish on the command
line, and if more than one of them applies to a file then the
last one takes priority.
For example, if you wanted to see only the disk space taken up
by MP3 files, you might run
$ agedu -s . --exclude '*' --include '*.mp3'
which will cause everything to be omitted from the scan, but
then the MP3 files to be put back in. If you then wanted only a
subset of those MP3s, you could then exclude some of them again
by adding, say, ‘--exclude-path './queen/*'’ (or, more
efficiently, ‘--prune ./queen’) on the end of that command.
As with the previous two options, --include-path is similar to
--include except that the wildcard is matched against the entire
pathname.
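The rule that the last applicable option takes priority can be modelled
in a few lines of shell. This is only a sketch of the rule as described
here, not agedu's own code: the function name decide and the
include:PATTERN / exclude:PATTERN rule syntax are invented for the
example, prune options are left out, a name matched by no rule is
assumed to be included, and the shell's own pattern matching stands in
for agedu's and may differ from it in detail.

```shell
# Hypothetical model: decide NAME RULE...
# Each RULE is "include:PATTERN" or "exclude:PATTERN"; rules are
# examined in order and the last one whose pattern matches NAME wins.
decide() {
    name=$1
    shift
    verdict=include              # assumed default when nothing matches
    for rule in "$@"; do
        kind=${rule%%:*}         # text before the first colon
        pattern=${rule#*:}       # text after the first colon
        case $name in
            $pattern) verdict=$kind ;;
        esac
    done
    echo "$verdict"
}

decide song.mp3  'exclude:*' 'include:*.mp3'   # prints "include"
decide notes.txt 'exclude:*' 'include:*.mp3'   # prints "exclude"
```

Reversing the two rules would exclude everything, since the
catch-all exclude would then be the last one to match.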
--progress, --no-progress and --tty-progress
When agedu is scanning a directory tree, it will typically print
a one-line progress report every second showing where it has
reached in the scan, so you can have some idea of how much
longer it will take. (Of course, it can't predict exactly how
long it will take, since it doesn't know which of the
directories it hasn't scanned yet will turn out to be huge.)
By default, those progress reports are displayed on agedu's
standard error channel, if that channel points to a terminal
device. If you need to manually enable or disable them, you can
use the above three options to do so: --progress unconditionally
enables the progress reports, --no-progress unconditionally
disables them, and --tty-progress reverts to the default
behaviour which is conditional on standard error being a
terminal.
--dir-atime and --no-dir-atime
In normal operation, agedu ignores the atimes (last access
times) on the directories it scans: it only pays attention to
the atimes of the files inside those directories. This is
because directory atimes tend to be reset by a lot of system
administrative tasks, such as cron jobs which scan the file
system for one reason or another - or even other invocations of
agedu itself, though it tries to avoid modifying any atimes if
possible. So the literal atimes on directories are typically not
representative of how long ago the data in question was last
accessed with real intent to use that data in particular.
Instead, agedu makes up a fake atime for every directory it
scans, which is equal to the newest atime of any file in or
below that directory (or the directory's last modification time,
whichever is newest). This is based on the assumption that all
important accesses to directories are actually accesses to the
files inside those directories, so that when any file is
accessed all the directories on the path leading to it should be
considered to have been accessed as well.
In unusual cases it is possible that a directory itself might
embody important data which is accessed by reading the
directory. In that situation, agedu's atime-faking policy will
misreport the directory as disused. In the unlikely event that
such directories form a significant part of your disk space
usage, you might want to turn off the faking. The --dir-atime
option does this: it causes the disk scan to read the original
atimes of the directories it scans.
The faking of atimes on directories also requires a processing
pass over the index file after the main disk scan is complete.
--dir-atime also turns this pass off. Hence, this option affects
the -L option as well as -s and -S.
(The previous section mentioned that there might be subtle
differences between the output of agedu -s /path -D and agedu -S
/path. This is why. Doing a scan with -s and then dumping it
with -D will dump the fully faked atimes on the directories,
whereas doing a scan-to-dump with -S will dump only partially
faked atimes - specifically, each directory's last modification
time - since the subsequent processing pass will not have had a
chance to take place. However, loading either of the resulting
dump files with -L will perform the atime-faking processing
pass, leading to the same data in the index file in each case.
In normal usage it should be safe to ignore all of this
complexity.)
--mtime
This option causes agedu to index files by their last
modification time instead of their last access time. You might
want to use this if your last access times were completely
useless for some reason: for example, if you had recently
searched every file on your system, the system would have lost
all the information about what files you hadn't recently
accessed before then. Using this option is liable to be less
effective at finding genuinely wasted space than the normal mode
(that is, it will be more likely to flag things as disused when
they're not, so you will have more candidates to go through by
hand looking for data you don't need), but may be better than
nothing if your last-access times are unhelpful.
The following option affects all the modes that generate reports: the
web server mode -w, the stand-alone HTML generation mode -H and the
text report mode -t.
--files
This option causes agedu's reports to list the individual files
in each directory, instead of just giving a combined report for
everything that's not in a subdirectory.
The following options affect the web server mode -w, and in one case
also the stand-alone HTML generation mode -H:
-r age range or --age-range age range
The HTML reports produced by agedu use a range of colours to
indicate how long ago data was last accessed, running from red
(representing the most disused data) to green (representing the
newest). By default, the lengths of time represented by the two
ends of that spectrum are chosen by examining the data file to
see what range of ages appears in it. However, you might want to
set your own limits, and you can do this using -r.
The argument to -r consists of a single age, or two ages
separated by a minus sign. An age is a number, followed by one
of ‘y’ (years), ‘m’ (months), ‘w’ (weeks) or ‘d’ (days). The
first age in the range represents the oldest data, and will be
coloured red in the HTML; the second age represents the newest,
coloured green. If the second age is not specified, it will
default to zero (so that green means data which has been
accessed just now).
For example, -r 2y will mark data in red if it has been unused
for two years or more, and green if it has been accessed just
now. -r 2y-3m will similarly mark data red if it has been unused
for two years or more, but will mark it green if it has been
accessed within the last three months.
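As a rough illustration of how an argument to -r breaks down, the
following sketch splits a range into its two ages, applying the default
of zero when the second age is missing. The helper name split_age_range
is hypothetical and is not something agedu provides; it only mirrors
the format described above.

```shell
# Hypothetical helper: print the oldest and newest ages named by an
# -r argument such as "2y" or "2y-3m".
split_age_range() {
    case $1 in
        *-*) oldest=${1%%-*}; newest=${1#*-} ;;
        *)   oldest=$1;       newest=0 ;;    # second age defaults to zero
    esac
    printf 'oldest=%s newest=%s\n' "$oldest" "$newest"
}

split_age_range 2y      # prints "oldest=2y newest=0"
split_age_range 2y-3m   # prints "oldest=2y newest=3m"
```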
--address addr[:port]
Specifies the network address and port number on which agedu
should listen when running its web server. If you want agedu to
listen for connections coming in from any source, you should
probably specify the special IP address 0.0.0.0. If the port
number is omitted, an arbitrary unused port will be chosen for
you and displayed.
If you specify this option, agedu will not print its URL on
standard output (since you are expected to know what address you
told it to listen on).
--auth auth-type
Specifies how agedu should control access to the web pages it
serves. The options are as follows:
magic This option only works on Linux, and only when the
incoming connection is from the same machine that agedu
is running on. On Linux, the special file /proc/net/tcp
contains a list of network connections currently known to
the operating system kernel, including which user id
created them. So agedu will look up each incoming
connection in that file, and allow access if it comes
from the same user id under which agedu itself is
running. Therefore, in agedu's normal web server mode,
you can safely run it on a multi-user machine and no
other user will be able to read data out of your index
file.
basic In this mode, agedu will use HTTP Basic authentication:
the user will have to provide a username and password via
their browser. agedu will normally make up a username and
password for the purpose, but you can specify your own;
see below.
none In this mode, the web server is unauthenticated: anyone
connecting to it has full access to the reports generated
by agedu. Do not do this unless there is nothing
confidential at all in your index file, or unless you are
certain that nobody but you can run processes on your
computer.
default
This is the default mode if you do not specify one of the
above. In this mode, agedu will attempt to use Linux
magic authentication, but if it detects at startup time
that /proc/net/tcp is absent or non-functional then it
will fall back to using HTTP Basic authentication and
invent a user name and password.
--auth-file filename or --auth-fd fd
When agedu is using HTTP Basic authentication, these options
allow you to specify your own user name and password. If you
specify --auth-file, these will be read from the specified file;
if you specify --auth-fd they will instead be read from a given
file descriptor which you should have arranged to pass to agedu.
In either case, the authentication details should consist of the
username, followed by a colon, followed by the password,
followed immediately by end of file (no trailing newline, or
else it will be considered part of the password).
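Because a trailing newline would become part of the password, it is
worth creating the authentication file with a command that does not add
one. The sketch below uses printf for that purpose; the user name fred
and the password example-password are placeholders for illustration
only.

```shell
# Create a private temporary file (mktemp makes it readable only by
# the current user).
auth_file=$(mktemp)

# printf writes exactly "username:password" with no newline at the
# end, unlike echo, which would append one.
printf '%s:%s' 'fred' 'example-password' > "$auth_file"

# Show the size in bytes; it should equal the length of the string
# above (21 for these placeholder values).
wc -c < "$auth_file"
```

The file could then be passed to the web server mode with
$ agedu -w --auth basic --auth-file "$auth_file"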
LIMITATIONS
The data file is pretty large. The core of agedu is the tree-based data
structure it uses in its index in order to efficiently perform the
queries it needs; this data structure requires O(N log N) storage. This
is larger than you might expect; a scan of my own home directory,
containing half a million files and directories and about 20 GB of data,
produced an index file over 60 MB in size. Furthermore, since the data
file must be memory-mapped during most processing, it can never grow
larger than available address space, so a really big filesystem may
need to be indexed on a 64-bit computer. (This is one reason for the
existence of the -D and -L options: you can do the scanning on the
machine with access to the filesystem, and the indexing on a machine
big enough to handle it.)
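The division of labour described above might look like the following
sketch. The function names scan_to_dump and load_dump are hypothetical,
and copying the dump file from one machine to the other is left to
whatever file transfer method you normally use.

```shell
# On the machine that can see the file system: scan DIR straight into
# a dump file, without building an index (-S).
scan_to_dump() {
    agedu -S "$1" > "$2"
}

# On the machine with enough address space: build the index file INDEX
# from a dump read on standard input (-L).
load_dump() {
    agedu -f "$2" -L < "$1"
}
```

A session using these helpers might run
$ scan_to_dump /home/fred fred.dump
on the first machine and, once fred.dump has been copied across,
$ load_dump fred.dump agedu.dat
on the second.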
The data structure also does not usefully permit access control within
the data file, so it would be difficult - even given the willingness to
do additional coding - to run a system-wide agedu scan on a cron job
and serve the right subset of reports to each user.
LICENCE
agedu is free software, distributed under the MIT licence. Type agedu
--licence to see the full licence text.