A Fast Method for Identifying Plain Text Files
==============================================


Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text. Although
this may appear to be a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis on the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
used a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text, otherwise it is labeled as binary. A prominent
limitation of this scheme is the restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often misidentified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low. Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing large
amounts of textual characters are misidentified as plain text.

In this article we propose a new, simple detection scheme that features
a much higher precision and a near-100% recall. This scheme is
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
and variable-sized encodings (ISO-2022, UTF-8, etc.). Wider encodings
(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:
- The allow list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The block list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 31 (excluding 26 and 27, which are on the
  gray list).

If a file contains at least one byte that belongs to the allow list and
no byte that belongs to the block list, then the file is categorized as
plain text; otherwise, it is categorized as binary. (The boundary case,
when the file is empty, automatically falls into the latter category.)


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice. The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF) and 13 (CR). There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text. Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL). Even though the older text
detection schemes observe the presence of non-ASCII codes from the range
[128..255], the precision rarely has to suffer if this upper range is
labeled as textual, because the files that are genuinely binary tend to
contain both control characters and codes from the upper range. On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions. In particular, this range is
used for encoding non-Latin scripts.

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results, regardless of which alphabet encoding is being used.
(If counting were involved, it could be possible to obtain different
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)

There is an extra category of plain text files that are "polluted" with
one or more block-listed codes, either by mistake or by peculiar design
considerations. In such cases, a scheme that tolerates a small fraction
of block-listed codes would provide an increased recall (i.e. more true
positives). This, however, incurs a reduced precision overall, since
false positives are more likely to appear in binary files that contain
large chunks of textual data. Furthermore, "polluted" plain text should
be regarded as binary by general-purpose text detection schemes, because
general-purpose text processing algorithms might not be applicable.
Under this premise, it is safe to say that our detection method provides
a near-100% recall.

Experiments have been run on many files coming from various platforms
and applications. We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc. The results
confirm the optimistic assumptions about the capabilities of this
algorithm.


--
Cosmin Truta
Last updated: 2006-May-28
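

The classification rule described under "The Algorithm" can be sketched
in a few lines of Python. This is an illustrative sketch only, not code
taken from any existing tool: the names ALLOW_LIST, GRAY_LIST, BLOCK_LIST
and is_plain_text are hypothetical, and the block list is assumed to be
every byte value that is on neither the allow list nor the gray list
(so 26 and 27 are treated as gray, not blocked).

```python
# Byte categories, following the three lists given in the text.
ALLOW_LIST = frozenset([9, 10, 13, *range(32, 256)])   # TAB, LF, CR, 32..255
GRAY_LIST = frozenset([7, 8, 11, 12, 26, 27])          # BEL, BS, VT, FF, SUB, ESC
# Assumption: everything else is block-listed (0..6, 14..25, 28..31).
BLOCK_LIST = frozenset(range(256)) - ALLOW_LIST - GRAY_LIST


def is_plain_text(data: bytes) -> bool:
    """Return True if `data` would be categorized as plain text.

    Plain text means: at least one allow-listed byte is present and no
    block-listed byte is present. Gray-listed bytes are tolerated but
    do not count as evidence of text on their own.
    """
    seen = frozenset(data)          # distinct byte values; no counting needed
    if seen & BLOCK_LIST:
        return False                # any block-listed byte means binary
    return bool(seen & ALLOW_LIST)  # empty or gray-only input is binary


if __name__ == "__main__":
    print(is_plain_text(b"hello, world\n"))   # ordinary ASCII text
    print(is_plain_text(b"abc\x00def"))       # contains NUL
    print(is_plain_text(b""))                 # empty input
```

Because only the presence or absence of byte values matters, the sketch
reduces the input to a set of distinct values first. How much of a file
to pass in (the whole file or only a leading buffer) is left to the
caller; the text above does not prescribe a buffer size.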