Adding (inserts) or changing (updates) data in BIAD is typically done via a MySQL GUI such as HeidiSQL or Sequel Pro. This can be achieved either with manual inserts or updates to tables one row at a time, or more usually with a batch change that imports data from a .csv spreadsheet of many hundreds of rows in a single import. Attempting to batch import other file formats (.xls, .xlsx, .odt, .odf, .numbers) will not work.
It is crucial to ensure data are formatted consistently at all stages of handling: specifically, in the import and export settings of your MySQL GUI, and in the import and export/save settings of whatever program you use when working on spreadsheets (e.g. Microsoft Excel, OpenOffice Calc, LibreOffice Calc). Particular care must be taken with Microsoft Excel, since simply double clicking on a spreadsheet will usually result in all manner of unwanted conversions of the raw data, including incorrectly guessing the encoding and converting some strings and numerics to dates. Excel provides no warning of these conversions, they are hard to spot (perhaps just a few changes in a long spreadsheet), and once saved they become irreversible. Excel does provide an import wizard that allows you to be explicit about your preferred formatting (open a blank worksheet, then import data into it), whilst OpenOffice Calc and LibreOffice Calc both default to an import wizard requiring fewer keystrokes. No program is entirely without unwanted idiosyncrasies; nevertheless, we suggest LibreOffice Calc is probably the best choice.
Since BIAD handles international characters, the encoding should always be set to UTF-8. Without setting this explicitly, programs will guess the encoding, and their success varies remarkably: Notepad++ is extremely reliable, whilst Microsoft Excel seems pathologically incapable of spotting UTF-8. On the other hand, Microsoft Excel provides the option of saving a .csv file explicitly as UTF-8 by adding a special Byte Order Mark to the file (CSV UTF-8, sometimes labelled CSV UTF8-BOM), which ensures that Excel will automatically get the encoding correct the next time the file is opened with a double click. Unfortunately, this still doesn't prevent Excel reformatting strings as dates (such as ‘dec1’). The BOM is also a Microsoft-specific feature, so it is not recognised by the Mac operating system.
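On the MySQL side of the connection, the character set can likewise be made explicit rather than guessed. A one-line sketch (utf8mb4 is MySQL's complete implementation of UTF-8, and most GUIs set this for you):

SET NAMES utf8mb4; -- make the client connection send and receive UTF-8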
The raw data stored in a .csv file is merely a continuous stream of bytes, and any program being fed this stream needs first to be told how to interpret those bytes. Correctly specifying the encoding as UTF-8 gets us halfway to reading our data correctly, but there are a few other key control characters that need to be specified. Frustratingly, these depend both on your operating system (Windows or Mac) and even on your geographical location (your locale settings). Ultimately, provided you ensure you are using the same settings both to import and to export files from your spreadsheet program and your MySQL GUI, it doesn't matter which control characters you actually use.
Control character | Description |
---|---|
Field separator | The character used to separate data into different columns; sometimes described as the ‘field terminator’. Since csv stands for comma separated values, you might suppose this should be a comma ( , ). However, continental Europe has adopted the bizarre habit of using a comma as a decimal separator, and instead uses a tab to separate fields (some call this .tsv). |
Enclose character | Assume we have already specified a comma as the field separator, but we have a notes field containing text that includes several commas. We certainly don't want that text broken up into different columns! Therefore text fields are usually enclosed by a double quotation mark ( " ) at each end. Of course, if the data includes double quotation marks as part of the text, you will need to choose something else! |
Escape character | Usually a backslash ( \ ) placed immediately before another character, telling the program not to interpret it as the usual character, but instead as having some other special meaning. |
Line ending | Tells the program when the stream of bytes needs to start a new line. Usually a carriage return (CR or \r) and/or a line feed (LF or \n). |
Nevertheless, you may find the following settings easiest:
Control character | HeidiSQL | Sequel Pro |
---|---|---|
Field separator | comma ( , ) | comma ( , ) |
Enclose character | double quotes, with the ‘optionally’ box ticked ( " ) | double quotes or backslash ( " or \ ) |
Escape character | backslash ( \ ) | backslash ( \ ) |
Line ending | CRLF ( \r\n ) | LF ( \n ) |
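For reference, these settings map directly onto the clauses of MySQL's own LOAD DATA syntax, which is essentially what the GUIs run behind the scenes. A minimal sketch using the HeidiSQL column above (the file name sites.csv, the target table and the header row are assumptions, and local_infile may need to be enabled on your server):

LOAD DATA LOCAL INFILE 'sites.csv' -- hypothetical file name
INTO TABLE `BIAD`.`Sites`
CHARACTER SET utf8mb4 -- UTF-8, as discussed above
FIELDS TERMINATED BY ',' -- field separator
OPTIONALLY ENCLOSED BY '"' -- enclose character
ESCAPED BY '\\' -- escape character (a single backslash)
LINES TERMINATED BY '\r\n' -- line ending
IGNORE 1 LINES; -- skip the header row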
If these settings don't match the .csv file on batch import, HeidiSQL is still fairly robust in trying to interpret the data, and will usually just throw an error about the first line having problems. In contrast, Sequel Pro tends to crash outright.
The core of the BIAD structure is based on the relationship between Citations, Sites, Phases, PhaseCitation, and PhaseTypes. All subsequent data tables are linked to these tables to allow grouping into spatio-temporal units.
This workflow illustrates the sequence of import steps needed to ensure a successful data import. If the structure of the imported .csv file is inconsistent with the tables already existing in the database, the import will fail.
If no information is available for a column, insert "\N" (the token MySQL interprets as NULL when importing text files).
- **Citations**: records the primary sources used for data entry.
- **Sites**: the highest level in the relationships, since all data must be associated with a site.
- **Phases**: the most useful level of aggregation, representing discrete archaeological phases at a site.
- **PhaseCitation**: the citations for each phase. Multiple citations per phase are allowed, as long as each receives a separate row.
- **PhaseTypes**: a broad description of the type of site, context or deposit. Before a type can be assigned it needs to be present in this table.
- A broadly considered description of the main architectural function or intended purpose of the site.
Imports require very few keystrokes. The .csv should have exactly the same columns as the BIAD table it is to be imported into, after which the file can easily be imported via the GUI's CSV import dialog (e.g. Tools > Import CSV file in HeidiSQL), with the control characters set to match the file.
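To check the expected column names and their order before building the .csv, the target table can be inspected first. A minimal sketch, using Sites as the example table:

SHOW COLUMNS FROM `BIAD`.`Sites`; -- lists every column the .csv must reproduce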
A batch update is a fundamentally different process to a batch import. Do not use replace! A batch replace deletes the existing entry in BIAD and then inserts the new data. This is wrong, because it removes the metadata recording when the data was first entered, and by which user.
Instead, batch updating requires you to write a MySQL query that explicitly states which column(s) to change. A simple example of a batch update is resolving a common typo, such as correcting all cases of Vitaceaei to Vitaceae as follows:
UPDATE `BIAD`.`zoptions_TaxaList` SET `Family`='Vitaceae' WHERE `Family`='Vitaceaei';
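Before running such an update, it can be prudent to preview the rows that will be affected; a minimal check:

SELECT * FROM `BIAD`.`zoptions_TaxaList` WHERE `Family`='Vitaceaei'; -- the rows the UPDATE will change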
A more complicated example is the problem of updating a column with many different values. For example, having spent time checking many latitudes and longitudes, we have a table of many rows, comprising just three columns of the Sites table:
SiteID | Latitude | Longitude |
---|---|---|
S123450 | 40 | 5 |
S123451 | 41 | 6 |
etc | etc | etc |
The first step is simply to get this .csv imported into a new table in BIAD. The easiest way to achieve this is to build a blank copy table of Sites (let's call it new_updates) without the data, indexes or foreign keys, then delete all columns that aren't in our .csv.
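A minimal sketch of one way to do this in a single statement, assuming the copy only needs the three columns above (CREATE TABLE ... AS SELECT copies neither indexes nor foreign keys, and WHERE 0 ensures no rows are copied):

CREATE TABLE `BIAD`.`new_updates` AS
SELECT `SiteID`, `Latitude`, `Longitude`
FROM `BIAD`.`Sites`
WHERE 0; -- copy the column definitions only, no data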
Next we batch import our .csv into this blank copy table called new_updates. Errors at this stage suggest a fundamental problem with the structure of the .csv.
Finally we run a MySQL command that explicitly updates only Latitude and Longitude:
UPDATE `BIAD`.`Sites`
INNER JOIN `new_updates` ON `Sites`.`SiteID` = `new_updates`.`SiteID`
SET
`Sites`.`Latitude` = `new_updates`.`Latitude`,
`Sites`.`Longitude` = `new_updates`.`Longitude`;
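As a hedged sanity check (assuming neither table holds NULL coordinates), the following should return 0 once the update has run:

SELECT COUNT(*) FROM `BIAD`.`Sites`
INNER JOIN `new_updates` ON `Sites`.`SiteID` = `new_updates`.`SiteID`
WHERE `Sites`.`Latitude` <> `new_updates`.`Latitude`
OR `Sites`.`Longitude` <> `new_updates`.`Longitude`; -- counts any rows that still differ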
A bigger challenge is the case where we have a large .csv with a very large number of columns to update, for example after a careful review of several other source databases that might have more detailed information on sites, phases or specific data (e.g. 14C). Perhaps, after a thorough comparison of the literature with many BIAD records (typically with the help of R, which can easily handle large datasets), you have created a set of several .csv tables with the same formatting and columns as BIAD.
Since we wish to update all data columns in the Sites table, it would be tedious and error-prone to write the full query for all columns. As in the example above, we create a blank copy table called new_updates, but this time we only delete the four metadata columns time_added, user_added, time_last_update and user_last_update.
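A minimal sketch of dropping those four columns, assuming new_updates was created as a full structural copy of Sites:

ALTER TABLE `BIAD`.`new_updates`
DROP COLUMN `time_added`,
DROP COLUMN `user_added`,
DROP COLUMN `time_last_update`,
DROP COLUMN `user_last_update`; -- keep every data column, drop only the metadata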
The following MySQL command will then update all columns that are still in the new_updates table:
-- Allow the concatenated SET clause to be long (the default group_concat_max_len is only 1024).
SET SESSION group_concat_max_len = 10000;
-- The static part of the UPDATE statement.
SET @part1 := "UPDATE `BIAD`.`Sites` INNER JOIN `new_updates` ON `Sites`.`SiteID`=`new_updates`.`SiteID` SET ";
-- Build the SET clause from every column remaining in new_updates.
SELECT GROUP_CONCAT( CONCAT("`Sites`.`",COLUMN_NAME,"`=`new_updates`.`", COLUMN_NAME,"`") SEPARATOR ", ") FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'biad' AND TABLE_NAME = 'new_updates' INTO @part2;
-- Assemble and execute the full statement.
SET @command := CONCAT(@part1,@part2);
PREPARE myquery FROM @command;
EXECUTE myquery;
DEALLOCATE PREPARE myquery;
Clearly, if you are only updating a few columns this last approach is not sensible, but it can be labour-saving for tables with many columns.
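Finally, once you are satisfied the update has worked, the helper table can be removed (an optional tidy-up, not part of the update itself):

DROP TABLE `BIAD`.`new_updates`; -- the staging table is no longer needed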