NAME

textual-slideshow.pl -- slowly scroll random paragraphs from various files

SYNOPSIS

# Do a slideshow based on .txt files in the $HOME directory.
# (Or the current directory if there is no $HOME environment variable.)
textual-slideshow.pl

# Run on the $HOME/homepage directory and print paragraphs only from html files.
textual-slideshow.pl --extensions html $HOME/homepage

# Get txt and html files from both $HOME/ebooks and $HOME/homepage
textual-slideshow.pl --extensions txt,html $HOME/ebooks $HOME/homepage

# Get all text files regardless of extension
textual-slideshow.pl --type 

# Scroll text faster (sleep a shorter amount of time between lines):
textual-slideshow.pl --sleep 0.02 $HOME/ebooks

# Scroll text slower:
textual-slideshow.pl --sleep 0.1 $HOME/ebooks

# Wrap text to fit the terminal window, if a given paragraph doesn't look
# like poetry, code, etc.
textual-slideshow.pl --wrap $HOME/ebooks

# Wrap text to a 60-character margin regardless of the terminal window size.
textual-slideshow.pl --wrap --margin 60 $HOME/ebooks

# Print filename and line number of the source file before printing a paragraph.
textual-slideshow.pl --print-filenames $HOME/ebooks

DESCRIPTION

This program takes a list of text files, or directories containing text files, and randomly displays paragraphs from those files in the terminal, one line at a time with a short pause between lines. It's interactive: various keystrokes speed up or slow down the scrolling, display help, exit, or do other things.

You can run this on your home directory and see what output you get, or you can customize the output by giving it particular subdirectories, supplying a weights file to make files whose paths match certain patterns more or less likely to be chosen, setting the file extensions it looks for, etc.

This will probably produce more interesting output if you have a lot of HTML or plain text ebooks on your hard drive. A good source of them is Project Gutenberg. In a future version I plan to have this script grab paragraphs from .epub files as well.

While the slideshow is running, press h for help or q to quit.

COMMAND LINE OPTIONS

Usage: textual-slideshow.pl [options] [filenames and/or directory names]

-s --sleep=[number]

Sleep time: the number of seconds to sleep for each character of output (default 0.045).

-f --print-filenames

Print filename and line number of each paragraph

-w --wrap

Re-wrap paragraphs to fit the terminal window

-W --force-wrap

Re-wrap everything, even if it looks like source code, poetry etc.

-m --margin=[number]

Margin (integer); overrides terminal window width for rewrap

-t --type

Identify files with Perl -T filetest instead of file extension

-e --extensions=[string]

Look for specified list of extensions (comma-separated) instead of just .txt

-l --min-length=[number]

Don't print paragraphs shorter than this (in characters).

-L --max-length=[number]

Don't print paragraphs longer than this (in characters).

-p --max-paragraphs=[number]

Maximum number of random paragraphs to store in memory at once

--preload

Start loading and printing paragraphs before we finish collecting the list of filenames. (Use this if you're giving it a huge directory hierarchy and the long startup time is annoying.)

--weights=[filename]

Specify a file of regular expression/weight pairs to determine how likely filenames matching various regexes are to be used. See "WEIGHTS" section.

-h --help

Get brief help.

-M --manual

Get detailed help (this manual).

Command line options used for testing

These options are probably only useful if you're hacking the script and adding features, etc.

-d --debug

Turn on debug messages.

-S --startup-test

Exit after doing startup, before starting the main print-and-check-keystroke loop. Probably only useful along with --debug.

-U --utf8-test

Execute test_utf8_handling(), which tests utf8 code by reading, parsing, decoding and outputting some test files.

INTERACTIVE COMMANDS

While the slideshow is running, you can press certain keys to change its behavior or exit.

h, ?

Display help.

+

Speed up scrolling.

-

Slow down scrolling.

f

Turn on/off printing of filenames and line numbers.

w

Turn on/off wrapping of text to fit terminal window.

d

Turn on/off debug mode.

[spacebar]

Pause (any key resumes).

q, x, ESC

Quit.

WEIGHTS

If the --weights argument is used, the specified file will be read and interpreted as a weights file, where each line consists of a regular expression, a tab, and a weight (a nonnegative number). If a filename (including the full path) matches any of the regular expressions in the weights file, the corresponding weight is applied: 0 means exclude files matching that regular expression, 0.5 means give those files half the default probability of being chosen to print paragraphs from, and 2 means give them double the default probability. If multiple regular expressions match a single filename, the applied weights are multiplied. (The default probability is normally 1/n, where n is the number of text files found in the target directories that match the criteria given on the command line, but once you start applying weights it gets a bit tricky.)

I mostly use this to block out certain directories that contain log files, etc., and also to make sure that the Project Gutenberg directory (which contains the vast majority of the plain text files on my hard drive) doesn't overly dominate the output.

Some example weights:

/packagelist  0
\.log$        0
etext 1.5
Documents/fic 1.5
etext/gutenberg_2001  0.67

These weights exclude any text files found in the packagelist directory, along with any .log files wherever they're found. (I have a cron job that saves a list of the installed packages every day, to make it easier to get back to a good state after an OS reinstall or upgrade.) They give extra weight to files in the etext and Documents/fic directories, and reduced weight to the Project Gutenberg directory (as mentioned above, to keep it from overwhelming the output).
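
For illustration, here is a minimal sketch of how matching weights combine; this is hypothetical code, not the actual WeightRandomList.pm implementation:

    # Hypothetical: combine all matching weights for one path.
    my @weights = (
        [ qr{/packagelist},         0    ],
        [ qr{\.log$},               0    ],
        [ qr{etext},                1.5  ],
        [ qr{etext/gutenberg_2001}, 0.67 ],
    );

    sub weight_for {
        my ($path) = @_;
        my $w = 1;                                # default weight
        for my $pair (@weights) {
            my ( $re, $mult ) = @$pair;
            $w *= $mult if $path =~ $re;          # multiple matches multiply
        }
        return $w;                                # 0 excludes the file
    }

Note that a file under etext/gutenberg_2001 matches both the etext and gutenberg_2001 patterns, so its net weight is 1.5 * 0.67, or very nearly the default.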

See WeightRandomList.html for more information.

DEPENDENCIES

This script uses the following Perl modules:

Encode, Encode::Guess, Getopt::Long, Pod::Usage, Text::Wrap, and Time::HiRes are core modules that ship with Perl.

HTML::Parser and Term::ReadKey are available from CPAN.

WeightRandomList.pm is included with this distribution.

AUTHOR

Jim Henry III, http://jimhenry.conlang.org/software/

ACKNOWLEDGMENTS

Thanks to people at perlmonks.org for help. http://perlmonks.org/index.pl?node_id=869446

LICENSE

This script is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

BUGS and LIMITATIONS

The script currently assumes the input files are in one of ASCII, Latin-1, CP437 (the old IBM PC charset), UTF-8, or UTF-16. It will probably garble the output if any of the input files are in some other encoding.

Paragraphs taken from HTML files are not associated with a line number, only a filename.

There is no way to get higher-verbosity debug messages except by editing the my $debug = 0; statement to set a higher value.

Symbolic links found in the target directories are ignored, as are all other things that aren't regular files (devices, etc.) or subdirectories.

TO DO LIST

Make the algorithm that determines how likely we are to save a particular paragraph from a particular file depend on configurable variables instead of a constant.

Use Archive::Zip and HTML::Parser to get paragraphs from epubs as well.

Add a command line option to suppress ANSI colors in paragraphs from HTML files. Maybe also an option to customize which ANSI codes are associated with which tags?

Fancier display: use Term::ANSIColor to let the user specify colors and text decoration for various HTML tags via names instead of hard-coding one for the <em> family and one for <strong>.

Or take full control of the screen, as a graphical window, and print each paragraph in a different font? Maybe figure out how to make it a screensaver plugin for GNOME and other desktops?

FUNCTIONS

main function

Initialize variables based on command line options, initialize our list of files, then start the main event loop.

test_utf8_handling()

Used only in testing; reads a couple of simple files and tests whether their high-bit characters look right when printed. (Gets called when the -U option is set.)

interactive_help() and display_usage()

Give help on the command line or while running.

validate_options()

Sanity-check the variables we were given on the command line.

main_event_loop()

Repeatedly call slow_print() to print randomly chosen paragraphs, intermittently calling add_paras() and delete_oldest_paras() as needed to maintain the list of random paragraphs. slow_print() will take care of checking for user keystrokes and passing them to handle_keystroke().
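
As a hypothetical sketch (the variable names here are assumptions), the overall shape is:

    # Hypothetical shape of the main loop.
    while (1) {
        add_paras() if @paras < $max_paragraphs;             # top up the pool
        slow_print( int rand @paras );                       # random paragraph
        delete_oldest_paras() if @paras >= $max_paragraphs;  # trim the pool
    }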

build_file_list()

Iterate over the filenames and/or directory names given on the command line and build a list of filenames matching the criteria given via command line options (--extensions and --type).

apply_weights()

If the --weights command line option was given, read the weights file and apply the weights to the list of filenames.

want_file( filename )

Check if we want this file based on the --type and --extensions command line options, and if neither option was given, check if it has a .txt extension.
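
A minimal sketch of that selection logic; $use_type and @extensions are stand-ins for the parsed command line options, not the script's actual variables:

    # Hypothetical sketch of the file-selection test.
    my $use_type   = 0;
    my @extensions = ('txt');          # default when --extensions isn't given

    sub want_file_sketch {
        my ($filename) = @_;
        return -T $filename if $use_type;         # Perl's text-file heuristic
        for my $ext (@extensions) {
            return 1 if $filename =~ /\.\Q$ext\E$/i;
        }
        return 0;
    }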

recurse_dir( directory name )

Iterate over the directory we're given, call ourselves recursively if it contains subdirectories, and add regular files to the @filenames list if they're wanted. Periodically print a random paragraph from files collected so far if the --preload command line option was given.
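
A hypothetical sketch of the traversal, which also skips symbolic links and other non-regular files as noted under BUGS and LIMITATIONS:

    # Hypothetical sketch; @filenames collects the wanted files.
    my @filenames;

    sub recurse_dir_sketch {
        my ($dir) = @_;
        opendir( my $dh, $dir ) or return;
        for my $entry ( readdir $dh ) {
            next if $entry eq '.' || $entry eq '..';
            my $path = "$dir/$entry";
            next if -l $path;                     # symbolic links are ignored
            if ( -d $path ) {
                recurse_dir_sketch($path);        # descend into subdirectories
            }
            elsif ( -f $path && want_file_sketch($path) ) {
                push @filenames, $path;           # keep wanted regular files
            }                                     # devices etc. fall through
        }
        closedir $dh;
    }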

handle_keystroke( key )

Take appropriate action on keys pressed by the user.

If I were writing this now, or if the actions per keystroke were more than two or three lines each, I'd probably use a hash of keystrokes mapped to function references; but it's not that long, and this is a low priority for refactoring.
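
A sketch of what that dispatch-table refactoring might look like; the keys, actions, and state variables are assumptions:

    # Hypothetical dispatch table of keystroke actions.
    my $sleep_time      = 0.045;      # stand-ins for the script's state
    my $print_filenames = 0;

    my %actions = (
        '+' => sub { $sleep_time /= 1.5 },                    # speed up
        '-' => sub { $sleep_time *= 1.5 },                    # slow down
        'f' => sub { $print_filenames = !$print_filenames },  # toggle filenames
        'q' => sub { exit 0 },                                # quit
    );

    sub handle_keystroke_sketch {
        my ($key) = @_;
        $actions{$key}->() if defined $key && exists $actions{$key};
    }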

slow_print( index )

Takes a subscript to the @paras array, gets the paragraph, wraps it if needed, and prints it slowly, one line at a time, checking for keystrokes between them.

If trying to print something gets an error (e.g. "Wide character in print"), a fallback converts the high-bit characters to hex numbers and prints the line/paragraph as ASCII. That shouldn't happen anymore after the recent fixes (2023/9/7).
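
A minimal, hypothetical sketch of the print-and-poll loop, reusing names from the dispatch sketch above; the per-character sleep follows the --sleep option's description:

    # Hypothetical sketch using the Term::ReadKey and Time::HiRes dependencies.
    use Time::HiRes qw(sleep);        # fractional-second sleep
    use Term::ReadKey;

    sub slow_print_sketch {
        my ($para) = @_;
        ReadMode('cbreak');                           # unbuffered keystrokes
        for my $line ( split /\n/, $para ) {
            print "$line\n";
            sleep( $sleep_time * length $line );      # --sleep is per character
            my $key = ReadKey(-1);                    # non-blocking read
            handle_keystroke_sketch($key) if defined $key;
        }
        ReadMode('restore');
    }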

HTML parsing functions

These functions are grouped in a scope block because they share some state variables. Basically there is one function to initialize our HTML::Parser object for a given file, parse, and return the list of paragraphs found, plus three callback functions that handle various tags and text blocks. We build up a paragraph with each text block, and when we hit an opening or closing tag of certain types, we decode the working paragraph as needed and add it to an array.
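
A rough sketch of that callback structure; the tag set and handler details are assumptions, and the decoding step is omitted:

    # Hypothetical sketch of the HTML::Parser setup and callbacks.
    use HTML::Parser;

    my @html_paras;
    my $working = '';

    sub flush_para {
        push @html_paras, $working if $working =~ /\S/;
        $working = '';
    }

    sub tag_handler {
        my ($tag) = @_;
        flush_para() if $tag =~ /^(?:p|div|h[1-6]|li|blockquote)$/;
    }

    sub text_handler { $working .= $_[0] }

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&tag_handler,  'tagname' ],
        end_h       => [ \&tag_handler,  'tagname' ],
        text_h      => [ \&text_handler, 'dtext'   ],
    );
    $parser->parse_file( $ARGV[0] );
    flush_para();                      # catch any trailing text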

add_paras()

Pick a random file, snag a random subset of paragraphs from it.

If/when we add epub support, this if/else would get another branch and we'd write an add_random_paras_from_epub_file() function.

add_random_paras_from_html_file( filename )

Get a list of the parsed paragraphs from an HTML file, then randomly pick a subset of them to add to the @paras array.

Note that we can't save the line numbers to the @indices array because HTML::Parser doesn't give our callback functions access to line numbers in the source file, as far as I can tell. I might be able to work around that by parsing the file in chunks rather than in one call to parse_file, but I'm not sure it's worthwhile compared to other work I want to get done (like adding .epub support).

is_utf8( filename )

Check whether a file is encoded as utf8. To be used by paragraphs_from_html_file() so it can tell the parser object whether the file it's working on is utf8.

If I support epubs at some point, I'll want to revise this so it can take its argument as a string representing the contents of an HTML file as well as a filename. Maybe use a hash with different keys representing different types of argument?

good_length( paragraph )

Check a paragraph's length against the --min-length and --max-length options. Return true if within range, false if not.

Check whether the current line looks like the beginning or end of a Project Gutenberg ebook's license agreement section. Return true if it does, false if it doesn't.

add_random_paras_from_text_file( filename )

Read all the lines from a file into an array, figure out the encoding, then iterate over the lines and build paragraphs from the non-blank lines between blanks. Randomly pick a subset of the paragraphs to add to @paras. Filter out Project Gutenberg license.

It's overly long even after a lot of refactoring, and I may refactor further by splitting it into one function to parse the file into paragraphs and another to pick a random subset of those to save, as I did with the HTML parsing functions.
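
A hypothetical sketch of just the paragraph-splitting step, omitting the encoding and filtering logic:

    # Build paragraphs from runs of non-blank lines.
    open my $fh, '<', $ARGV[0] or die "$ARGV[0]: $!";
    my @lines = <$fh>;
    close $fh;

    my @paragraphs;
    my $para = '';
    for my $line (@lines) {
        if ( $line =~ /\S/ ) {
            $para .= $line;                      # accumulate non-blank lines
        }
        elsif ( length $para ) {
            push @paragraphs, $para;             # a blank line ends a paragraph
            $para = '';
        }
    }
    push @paragraphs, $para if length $para;     # file may not end with a blank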

delete_oldest_paras()

Delete the oldest ten percent of saved paragraphs.
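
One possible implementation, assuming @paras is ordered oldest-first and @indices is the parallel array of filenames and line numbers mentioned elsewhere in this manual:

    # Trim the oldest tenth of the pool.
    my $n = int( 0.1 * @paras );
    splice @paras,   0, $n;
    splice @indices, 0, $n;      # keep the parallel array in sync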

log_paras_got_count()

Add the number of paragraphs passed to us to a hash keyed on filename, which is used by avg_paras().

files_paras_taken_from()

Returns the number of files we've taken paragraphs from.

avg_paras()

Return average number of paragraphs taken from each file.

rewrap( paragraph )

Wrapper (heh) for Text::Wrap::wrap(). Do some tests to see if we need to wrap the current paragraph and what the margin should be, then wrap it.
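
A minimal sketch using the documented dependencies; $margin stands for the --margin option value:

    # Hypothetical sketch of the rewrap step.
    use Text::Wrap qw(wrap);
    use Term::ReadKey;

    sub rewrap_sketch {
        my ( $para, $margin ) = @_;
        my ($width) = GetTerminalSize();              # terminal columns
        $Text::Wrap::columns = $margin || $width;     # --margin overrides
        return wrap( '', '', $para );
    }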

get_decoder( array ref )

Takes a reference to an array of lines from a file whose encoding we don't know yet. Gets a decoder object for it and then rebuilds the array with different newlines if the encoding requires it. Returns the decoder object.

decode_line( decoder, line, filename, line number )

Decode the line if possible, print error messages if not.

figure_out_encoding( reference to array of strings )

Use Encode::Guess plus some heuristics to figure out the probable encoding of a file, based on an array of lines from the file (passed by reference because it could be very big and we don't need to modify it). Return a decoder object.
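
The core Encode::Guess call might look roughly like this; the real function layers additional heuristics on top:

    # Hypothetical sketch of the guessing step.
    use Encode::Guess qw(latin1);     # add Latin-1 to the default suspects

    sub get_decoder_sketch {
        my ($lines_ref) = @_;
        my $decoder = Encode::Guess->guess( join '', @$lines_ref );
        return ref($decoder) ? $decoder : undef;   # a string means "no guess"
    }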

"In the script above, there's a line commented out giving cp437 as one of the defaults to initialize Encode::Guess; if I have that line in, every file with high-bit characters gets a bad decoder. This is not surprising, given the man page's warning that Encode::Guess is bad at distinguishing different 8-bit encodings from each other. I have a lot of cp437 etexts lying around, but I'm pretty sure I can write an ad-hoc routine to distinguish them from the Latin-1 text files -- in theory both code pages use all the characters from 0x80 to 0xFF, but in practice, only accented Latin letters characters in the 80 to A5 range are common in cp437 text files and only characters in the C0 to FF range are common in Latin-1 files."

https://www.perlmonks.org/?node_id=870638
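
A possible shape for the ad-hoc cp437-versus-Latin-1 test that quote envisions; this is entirely hypothetical:

    # Count high-bit bytes in the ranges the quote identifies as typical
    # for each encoding, and compare.
    sub looks_like_cp437 {
        my ($data) = @_;                              # raw, undecoded bytes
        my $cp437  = () = $data =~ /[\x80-\xA5]/g;    # common in cp437 text
        my $latin1 = () = $data =~ /[\xC0-\xFF]/g;    # common in Latin-1 text
        return $cp437 > $latin1;
    }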