Isolate large text and binary data


Binary file formats are usually designed to be both compact and efficient to parse--that's their main advantage over text-based formats. To meet both those criteria, they're usually composed of on-disk structures that are easily mapped to data structures that a program might use to represent the same data in memory. The library will give you an easy way to define the mapping between the on-disk structures defined by a binary file format and in-memory Lisp objects.

Using the library, it should be easy to write a program that can read a binary file, translating it into Lisp objects that you can manipulate, and then write them back out to another properly formatted binary file. The starting point for reading and writing binary files is to open the file for reading or writing individual bytes. When you're dealing with binary files, you'll pass OPEN or WITH-OPEN-FILE an :element-type of '(unsigned-byte 8). An input stream opened with such an :element-type will return an integer between 0 and 255 each time it's read with READ-BYTE. Above the level of individual bytes, most binary formats use a smallish number of primitive data types--numbers encoded in various ways, textual strings, bit fields, and so on--which are then composed into more complex structures.
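For instance, a minimal sketch (the filename is a placeholder):

    (with-open-file (in "tag.mp3" :element-type '(unsigned-byte 8))
      (read-byte in)) ; => the file's first byte, an integer from 0 to 255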

So your first task is to define a framework for writing code to read and write the primitive data types used by a given binary format. To take a simple example, suppose you're dealing with a binary format that uses an unsigned 16-bit integer as a primitive data type. To read such an integer, you need to read the two bytes and then combine them into a single number by multiplying one byte by 256 and adding the other. For instance, assuming the binary format specifies that such 16-bit quantities are stored in big-endian form, with the most significant byte first, you can read such a number with a function along these lines.
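A minimal sketch (the name read-u2 is just a convention for a two-byte unsigned value; IN is assumed to be a stream opened with the :element-type shown earlier):

    (defun read-u2 (in)
      "Read an unsigned 16-bit big-endian integer from a binary input stream."
      (+ (* (read-byte in) 256)   ; most significant byte first
         (read-byte in)))         ; then the least significant byte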

However, Common Lisp provides a more convenient way to perform this kind of bit twiddling. The function LDB, whose name stands for load byte, can be used to extract, and with SETF set, any number of contiguous bits from an integer. You specify which bits with a byte specifier constructed by the function BYTE, which takes two arguments: the number of bits to extract or set and the position of the rightmost bit, where the least significant bit is at position zero. LDB takes a byte specifier and the integer from which to extract the bits and returns the positive integer represented by the extracted bits.

Thus, you can extract the least significant octet of an integer like this: (ldb (byte 8 0) #xabcd) evaluates to 205, the value of the low byte #xcd. To write a number out as a 16-bit integer, you need to extract the individual 8-bit bytes and write them one at a time.
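Here, as a sketch mirroring read-u2 above, is the writing side:

    (defun write-u2 (out value)
      "Write VALUE as an unsigned 16-bit big-endian integer."
      (write-byte (ldb (byte 8 8) value) out)  ; most significant octet first
      (write-byte (ldb (byte 8 0) value) out)) ; then the least significant octet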

As the sketch shows, to extract the individual bytes you just need to use LDB with the byte specifiers (byte 8 0) and (byte 8 8). Of course, you can also encode integers in many other ways--with different numbers of bytes, with different endianness, and in signed and unsigned format. Textual strings are another kind of primitive data type you'll find in many binary formats.

When you read files one byte at a time, you can't read and write strings directly--you need to decode and encode them one byte at a time, just as you do with binary-encoded numbers. And just as you can encode an integer in several ways, you can encode a string in many ways. To start with, the binary format must specify how individual characters are encoded. To translate bytes to characters, you need to know both what character code and what character encoding you're using.

A character code defines a mapping from positive integers to characters. Each number in the mapping is called a code point. For instance, ASCII is a character code that maps the numbers from 0 to 127 to particular characters used in the Latin alphabet. A character encoding, on the other hand, defines how the code points are represented as a sequence of bytes in a byte-oriented medium such as a file. For codes whose code points all fit in eight bits, such as ASCII, the simplest encoding stores each code point as a single octet.

Nearly as straightforward are pure double-byte encodings, such as UCS-2, which map between 16-bit values and characters. The only reason double-byte encodings can be more complex than single-byte encodings is that you may also need to know whether the 16-bit values are supposed to be encoded in big-endian or little-endian format. Variable-width encodings use different numbers of octets for different numeric values, making them more complex but allowing them to be more compact in many cases. For instance, UTF-8, an encoding designed for use with the Unicode character code, uses a single octet to encode the values from 0 to 127 while using up to four octets to encode values up to 1,114,111. On the other hand, texts consisting mostly of characters requiring four bytes in UTF-8 could be more compactly encoded in a straight double-byte encoding.

Common Lisp provides two functions for translating between numeric character codes and character objects: CODE-CHAR, which takes a code and returns the corresponding character object, and CHAR-CODE, which takes a character and returns its code. The language standard doesn't specify what character encoding an implementation must use, so there's no guarantee you can represent every character that can possibly be encoded in a given file format as a Lisp character. In addition to specifying a character encoding, a string encoding must also specify how to encode the length of the string.
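For instance, in an implementation whose character codes agree with ASCII for the first 128 code points (true of essentially all current implementations):

    (code-char 65)  ; => #\A
    (char-code #\A) ; => 65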

Three techniques are typically used in binary file formats. The simplest is to not encode the length at all but to let it be implicit in the position of the string in some larger structure: a particular element of a file may always be a string of some fixed length. The other two techniques can be used to encode variable-length strings without relying on context.

One is to encode the length of the string followed by the character data--the parser reads an integer value in some specified integer format and then reads that number of characters.

Another is to write the character data followed by a delimiter that can't appear in the string, such as a null character. Both of the latter two techniques are used in ID3 tags, as you'll see in the next chapter. The different representations have different advantages and disadvantages, but when you're dealing with already specified binary formats, you won't have any control over which encoding is used.

However, none of the encodings is particularly more difficult to read and write than any other. To write a string back out, you just need to translate the characters back to numeric values that can be written with WRITE-BYTE and then write the null terminator after the string contents. A sketch of both directions follows.
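This sketch assumes an ASCII-compatible character encoding; the helper names are conventions, not standard functions:

    (defconstant +null+ (code-char 0))

    (defun read-null-terminated-ascii (in)
      "Read characters up to a null byte, returning them as a string."
      (with-output-to-string (s)
        (loop for char = (code-char (read-byte in))
              until (char= char +null+) do (write-char char s))))

    (defun write-null-terminated-ascii (string out)
      "Write STRING one byte per character, then a terminating null byte."
      (loop for char across string do (write-byte (char-code char) out))
      (write-byte (char-code +null+) out))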

As these examples show, the main intellectual challenge--such as it is--of reading and writing primitive elements of binary files is understanding how exactly to interpret the bytes that appear in a file and how to map them to Lisp data types. If a binary file format is well specified, this should be a straightforward proposition. Actually writing functions to read and write a particular encoding is, as they say, a simple matter of programming. Now you can turn to the issue of reading and writing more complex on-disk structures and how to map them to Lisp objects.

Since binary formats are usually used to represent data in a way that makes it easy to map to in-memory data structures, it should come as no surprise that composite on-disk structures are usually defined in ways similar to the way programming languages define in-memory structures.

Usually a composite on-disk structure will consist of a number of named parts, each of which is itself either a primitive type such as a number or a string, another composite structure, or possibly a collection of such values.

For instance, an ID3 tag, as defined in the 2.2 version of the specification, consists of a header made up of a three-character ISO-8859-1 string, which is always "ID3"; two one-byte unsigned integers specifying the major version and revision of the specification; eight bits' worth of boolean flags; and four bytes encoding the size of the tag. Following the header is a list of frames, each of which has its own internal structure. After the frames are as many null bytes as are necessary to pad the tag out to the size specified in the header.

If you look at the world through the lens of object orientation, composite structures look a lot like classes. For instance, you could write a class to represent an ID3 tag. An instance of this class would make a perfect repository to hold the data needed to represent an ID3 tag. You could then write functions to read and write instances of this class.
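As a sketch, such a class might look like this; the slot names are assumptions chosen to mirror the parts of the tag just described:

    (defclass id3-tag ()
      ((identifier    :initarg :identifier    :accessor identifier)
       (major-version :initarg :major-version :accessor major-version)
       (revision      :initarg :revision      :accessor revision)
       (flags         :initarg :flags         :accessor flags)
       (size          :initarg :size          :accessor size)
       (frames        :initarg :frames        :accessor frames)))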

For example, assuming the existence of certain other functions for reading the appropriate primitive data types, a read-id3-tag function might look like this:
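(A sketch; the helpers read-iso-8859-1-string, read-u1, read-id3-encoded-size, and read-id3-frames are hypothetical stand-ins for the primitive and composite readers just discussed.)

    (defun read-id3-tag (in)
      (let ((tag (make-instance 'id3-tag)))
        (with-slots (identifier major-version revision flags size frames) tag
          (setf identifier    (read-iso-8859-1-string in :length 3)) ; always "ID3"
          (setf major-version (read-u1 in))
          (setf revision      (read-u1 in))
          (setf flags         (read-u1 in))
          (setf size          (read-id3-encoded-size in))
          (setf frames        (read-id3-frames in :tag-size size)))
        tag))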

It's not hard to see how you could write the appropriate classes to represent all the composite data structures in a specification along with read-foo and write-foo functions for each class and for necessary primitive types.

But it's also easy to tell that all the reading and writing functions are going to be pretty similar, differing only in the specifics of what types they read and the names of the slots they store them in. It's particularly irksome when you consider that in the ID3 specification it takes about four lines of text to specify the structure of an ID3 tag, while you've already written eighteen lines of code and haven't even written write-id3-tag yet. What you'd really like is a way to describe the structure of something like an ID3 tag in a form that's as compressed as the specification's pseudocode yet that can also be expanded into code that defines the id3-tag class and the functions that translate between bytes on disk and instances of the class.

Sounds like a job for a macro. Since you already have a rough idea what code your macros will need to generate, the next step, according to the process for writing a macro I outlined in Chapter 8, is to switch perspectives and think about what a call to the macro should look like. Since the goal is to be able to write something as compressed as the pseudocode in the ID3 specification, you can start there. The header of an ID3 tag is specified like this:
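(Paraphrasing the v2.2 specification's pseudocode; x marks a bit whose value isn't fixed by the spec.)

    ID3/file identifier   "ID3"
    ID3/version           $02 00
    ID3/flags             %xx000000
    ID3/size              4 * %0xxxxxxx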

The version consists of two bytes, the first of which--for this version of the specification--has the value 2 and the second of which--again for this version of the specification--is 0. The flags slot is eight bits, of which all but the first two are 0, and the size consists of four bytes, each of which has a 0 in the most significant bit. Some information isn't captured by this pseudocode.

For instance, exactly how the four bytes that encode the size are to be interpreted is described in a few lines of prose. Likewise, the spec describes in prose how the frames and subsequent padding are stored after this header. But most of what you need to know to be able to write code to read and write an ID3 tag is specified by this pseudocode.
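So a plausible target form, hedged as a design sketch (define-binary-class and the type names here are the conventions you'd be inventing, not an existing library):

    (define-binary-class id3-tag
      ((identifier     (iso-8859-1-string :length 3))
       (major-version  u1)
       (revision       u1)
       (flags          u1)
       (size           id3-encoded-size)
       (frames         (id3-frames :tag-size size))))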

It is sometimes useful to have a PED file that is tab-delimited, except that between the two alleles of a single genotype a space is used instead of a tab. A file formatted this way can be loaded into Excel, for example, as a tab-delimited file, but with one genotype per column instead of one allele per column. Use the option --tab as well as --recode or --recode12 to achieve this effect.
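For example (the output name is arbitrary):

    plink --file data --recode --tab --out mydata_tab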


Recode and reorder a sample

A basic, but often useful, feature is to output a dataset in recoded form, e.g. plink --file data --recode, which writes a new PED/MAP fileset after any specified filters have been applied. Also, if --output-missing-genotype is specified (which can be set independently of --missing-genotype), then this value will be used in the output instead of the original missing-genotype code.

The --make-bed option does the same as --recode but creates binary files; these can also be filtered, etc., as described below. In contrast, plink --file data --recode12 will recode the alleles as 1 and 2, and the missing genotype will always be 0.

Both these commands will create two new files, the recoded PED file and a corresponding MAP file. Unless manually specified, for all these options the usual filters for missingness and allele frequency are set so as not to exclude any SNPs or individuals; by explicitly including a filtering option (e.g. --maf 0.01), filtering can be applied as usual. The allele-coding flags --allele1234 and --alleleACGT, which convert between numeric and letter allele codes, should be used in conjunction with a data-generation command such as --recode; alleles other than A,C,G,T or 1,2,3,4 will be left unchanged. To make a new file in which non-founders without both parents in the same fileset are recoded as founders (i.e., their parental codes set to zero), add the flag --make-founders.

Transposed genotype files

When using either --recode or --recode12, you can obtain a transposed text genotype file by adding the --transpose option. This generates two files: plink.tped, with one SNP per row, and plink.tfam, with one individual per row. The order of individuals in the TFAM file is the same as the order across the columns of the TPED file.
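For example, a typical invocation (output name is arbitrary):

    plink --file data --recode --transpose --out trans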

Additive and dominance components

The following format is often useful if one wants to use a standard, non-genetic statistical package to analyse the data, as here genotypes are coded as a single allele-dosage number. To create a file with SNP genotypes recoded in terms of additive and dominance components, use the option --recodeAD. This produces both an additive coding (0, 1, or 2 copies of the counted allele) and a dominance coding (1 for heterozygotes, 0 otherwise), and it saves the data to a single file, plink.raw. The related --recodeA option produces the additive coding only.

This file can be easily loaded into R, for example with d <- read.table("plink.raw", header=TRUE). By default, the additive column counts the number of minor alleles. The behavior of the --recodeA and --recodeAD commands can be changed with the --recode-allele command. This allows the 0, 1, 2 count to reflect the number of copies of a pre-specified allele per SNP, rather than the number of minor alleles. This command takes as a single argument the name of a file that lists, one per row, a SNP name and the allele to count. If an allele is specified in --recode-allele that is not seen in the data, all individuals will receive a 0 count for that SNP (i.e., everybody is counted as carrying zero copies of it).
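For instance, a hypothetical invocation (the allele-list filename is a placeholder):

    plink --file data --recodeA --recode-allele mylist.txt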

NOTE: For SNPs whose two alleles each have a frequency of exactly 0.5, which allele is treated as the minor allele is effectively arbitrary.

Listing by minor allele count

The command --recode-rlist will generate a set of files, including plink.rlist, that list the individuals carrying the minor allele of each SNP. For example, consider a particular SNP whose minor allele G is seen twice, in two heterozygotes, while two individuals have a missing genotype and all other individuals are homozygous for the major allele. In this case, we would see two rows for that SNP in the plink.rlist file. This command could be used in conjunction with the --reference command and --freq to list all instances of rare non-reference alleles.

Adding the --with-reference flag will generate a fourth output file containing reference allele information.

Listing by genotype

Another format that might sometimes be useful is produced by the --list option, which generates a file, plink.list, in which each row corresponds to a single SNP and genotype class and lists all the individuals with that genotype.

For example, if we have a file with two SNPs, both on chromosome 1, plink.list will contain a block of rows for each SNP, one row per genotype class. This option is often useful in conjunction with --snp, if you want an easy breakdown of which individuals have which genotypes.

Update SNP information

To automatically update either the genetic or physical positions for some or all SNPs in a dataset, use the --update-map command, which takes a single parameter of a filename, where each row of the file gives a SNP name followed by its new position. To change genetic position (the 3rd column in the MAP file), add the flag --update-cm as well as --update-map.
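A typical invocation (file names are placeholders) that updates positions and saves a new fileset:

    plink --file data --update-map newpositions.txt --recode --out data2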

There is no way to change chromosome codes using this command alone. Normally, one would want to save a new fileset with the changed positions, as in the example above, although one could instead combine --update-map with other analysis commands. SNPs not in the update file will be left unchanged. If a SNP is listed more than once in the file, an error will be reported. If the update changes the relative order of SNPs, a message will be written to the LOG file.

Although the positions are updated, the order of SNPs is not changed internally: for example, if the new positions reverse which of two SNPs comes first, the data will still be stored in the original order. Only after saving the dataset and re-loading it (e.g., via --make-bed and then --bfile) will the SNPs be sorted by their new positions. This will only be an issue for commands that rely on relative SNP positions, e.g., sliding-window analyses. If the LOG file does not show a message that the order of SNPs has changed after using --update-map, one need not worry. The name and chromosome code of a SNP can also be changed, by adding the modifier --update-name or --update-chr alongside --update-map, with the update file giving each SNP's name and the new value. You cannot update more than one attribute at a time for SNPs.

Update allele information

To recode alleles, for example from A,B allele coding to A,C,G,T coding, use the command --update-alleles, which takes as its single parameter the name of a file listing, on each row, a SNP name, its old allele codes, and the new allele codes.

In ODBC, a column of long data retrieved with SQLGetData must generally have a higher column number than the last bound column. For this reason, applications should make sure to place long data columns at the end of the select list. For more information, see Using Block Cursors. Some drivers do not enforce these restrictions.

Setting the SQL_ATTR_MAX_LENGTH statement attribute restricts the number of bytes of data that will be returned for any character or binary column. For example, suppose a column contains long text documents. An application that browses the table containing this column might need to display only the first page of each document. Although this statement attribute can be simulated in the driver, there is no reason to do this. In particular, if an application wants to truncate character or binary data, it should bind a small buffer to the column with SQLBindCol and let the driver truncate the data.
