Summary
This article shows how researchers can use the open source data dictionary tool SchemaSpy to create professional, easy-to-understand documentation for their datasets. Specific instructions are provided for Oracle, Microsoft SQL Server, CSV, SQLite, R, and Python + Pandas.
Body
Summary
Having a data dictionary - a document explaining the structure of data and its transformations - is crucial for study reproducibility and for meeting NIH data management & sharing requirements. However, creating data dictionaries can be time-consuming and tedious without tools to automate as much of the process as possible. This article shows how researchers can use the open source data dictionary tool SchemaSpy to create professional, easy-to-understand documentation for their datasets. Specific instructions are provided for Oracle, Microsoft SQL Server, CSV, SQLite, R, and Python + Pandas.
Example site: automation.depressioncenter.org/datadict
Steps
Video Tutorial
Getting SchemaSpy and Database Connectors
- If using Windows, create the following directories:
c:\datadict
c:\datadict\schemaspy
c:\datadict\schemaspy\lib
c:\datadict\html
- For Linux or Mac, create the same structure under your home directory (~/datadict)
- Download the SchemaSpy Java JAR file from: https://schemaspy.org/
- Save schemaspy-<version>.jar to datadict\schemaspy\schemaspy-<version>.jar
- Optionally, download and install the open source Graphviz library, which is needed for high-resolution entity-relationship diagrams. (You can also use the -vizjs parameter in the command line in later steps to use the built-in graphics library.)
- Ensure you have Java SE or Java SDK installed. Verify installation by opening a command prompt (shell) and typing:
java --version
- Download the appropriate database drivers (JDBC jar files) and save them to datadict\schemaspy\lib:
- If connecting to SQLite databases (including the .db files created in later steps), download the Xerial SQLite JDBC driver (sqlite-jdbc-*.jar)
- If connecting to Oracle databases, either download the Oracle JDBC package or install the Oracle Client for your operating system. Be sure to note where the files are extracted, and that they match your system architecture (32-bit or 64-bit).
- A common location for Oracle Client drivers in Windows is: C:\Oracle\Product\19\client64\jdbc\lib\
- If you download the JDBC driver, extract all *.jar files to datadict\schemaspy\lib.
- If connecting to Microsoft SQL Server databases, download and install the JDBC Driver for SQL Server.
- Make note of the installation path, and ensure the lib directory contains both mssql-jdbc-*.jar and mssql-jdbc_auth*.dll (for SSO and Integrated Security). Optionally, copy those files to datadict\schemaspy\lib.
- If connecting to R datasets (.rds, .rdata, .rda), CSV files, or Pandas datasets:
- Download and install the Python csvs-to-sqlite package (pip install csvs-to-sqlite)
- Follow the instructions above for downloading the SQLite driver
- Optional, advanced configuration: if connecting to other database types, or to create your own connectors (e.g. SharePoint Online), download the appropriate JDBC files and refer to the SchemaSpy documentation (see "Create your own db type - super advanced").
- Additional requirements:
- You must have an account with schema-level, admin-level, or dbo-level access to the database
- Once everything has been installed, verify that your driver is available and ready for SchemaSpy to use:
java -Djava.library.path=c:\datadict\schemaspy\lib -jar "c:\datadict\schemaspy\schemaspy-6.2.4.jar" -dbhelp
Generate database documentation
Use the following examples to generate a data dictionary for your dataset. The output will be placed under datadict\html.
Oracle Databases
Edit and run this command:
java -jar "C:\datadict\schemaspy\schemaspy-6.2.4.jar" -dp "C:\Oracle\Product\19\client64\jdbc\lib\ojdbc10.jar" -t ora -desc "<database description>" -o "c:\datadict\html\<database>" -ahic -norows -lq -nologo -db <databse> -s <schema> -cat <schema> -host <server_hostname> -port 1521 -u <username> -p "<password>"
Microsoft SQL Server Databases
Create a custom properties file under datadict\schemaspy\mssql-custom.properties with the following contents:
extends=mssql08
connectionSpec=jdbc:sqlserver://<hostOptionalPort>;databaseName=<db>;integratedSecurity=true;
Edit and run this command:
java -Djava.library.path=c:\datadict\schemaspy\lib -jar "C:\datadict\schemaspy\schemaspy-6.2.4.jar" -dp "C:\datadict\schemaspy\lib\mssql-jdbc-11.2.3.jre11.jar" -t "C:\datadict\schemaspy\mssql-custom" -desc "<database description>" -o "c:\datadict\html\<database>" -ahic -norows -lq -nologo -db <database> -s "%" -cat <schema> -host <server_hostname> -port 1433 -connprops integratedSecurity\=true -sso -configFile "C:\datadict\schemaspy\mssql-custom.properties" -u <active_directory_username>
R or Pandas Datasets
- The first option is to use an R or Python library to first "infer" the data types of your dataset (e.g. pandas.DataFrame.infer_objects()), and then export directly to a SQLite database (c:\datadict\<dataset_name>.db); see the Python sketch at the end of this section
- The second option is to export your datasets to CSV, save them to c:\datadict\<dataset_name>\csv, and use the Python csvs-to-sqlite package to convert those CSVs to a SQLite database while inferring column data types automatically:
pip install csvs-to-sqlite
csvs-to-sqlite c:\datadict\<dataset_name>\csv c:\datadict\<dataset_name>.db
- Once the datasets are in a SQLite .db file, edit and run this command to generate the documentation:
java -Djava.library.path=c:\datadict\schemaspy\lib -jar "C:\datadict\schemaspy\schemaspy-6.2.4.jar" -dp "C:\datadict\schemaspy\lib\sqlite-jdbc-3.45.2.0.jar" -t sqlite-xerial -desc "<dataset description>" -o "c:\datadict\html\<dataset_name>" -ahic -norows -lq -nologo -db "c:\datadict\<dataset_name>.db" -all -sso
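To illustrate the first option, here is a minimal Python sketch, assuming a pandas DataFrame loaded from one of your data files; the file path, dataset name, and table name below are hypothetical placeholders:
import sqlite3
import pandas as pd

# Hypothetical example: load a dataset into a DataFrame (any pandas data source works)
df = pd.read_csv(r"c:\datadict\my_dataset\csv\baseline.csv")

# Let pandas infer more specific column data types before exporting
df = df.infer_objects()

# Write the DataFrame to a SQLite database file that SchemaSpy can document
con = sqlite3.connect(r"c:\datadict\my_dataset.db")
df.to_sql("baseline", con, if_exists="replace", index=False)
con.close()
In R, a comparable export can be done with the DBI and RSQLite packages (for example, DBI::dbWriteTable).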
SQLite
Edit and run this command:
java -Djava.library.path=c:\datadict\schemaspy\lib -jar "C:\datadict\schemaspy\schemaspy-6.2.4.jar" -dp "C:\datadict\schemaspy\lib\sqlite-jdbc-3.45.2.0.jar" -t sqlite-xerial -desc "<database description>" -o "c:\datadict\html\<database>" -ahic -norows -lq -nologo -db "c:\datadict\<database>.db" -all -sso
CSV
Follow the second option under R or Pandas Datasets above: save your CSV files to c:\datadict\<dataset_name>\csv, convert them to a SQLite database with csvs-to-sqlite, and then run the SchemaSpy command against the resulting .db file.
Document Tables & Columns
- Describing each table and column in your database allows others to understand what the data means and how to use it. This information is sometimes called "metadata." It is very important to provide a short description of each column, and if possible, the formulas used for calculating values or the steps taken to transform the underlying data. This allows future researchers to more easily validate or reproduce your study.
- While SchemaSpy does not have the ability to edit table and column descriptions online, it does provide other options:
- Document table and columns at the database level (which may require working with a database administrator). SchemaSpy is able to read database comments and will display them on the documentation. This also allows anyone with direct access to the database to see definitions in their SQL development tools.
- Alternatively, create a "schema metadata" XML file and pass it to the commands in the previous section by adding -meta <path_to_meta_file.xml>. Refer to the SchemaSpy documentation for details on creating such a file; a minimal example follows below.
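For illustration, here is a minimal sketch of a schema metadata file. The table and column names are hypothetical; consult the SchemaSpy documentation for the complete XML format and schema declaration.
<schemaMeta>
  <!-- Hypothetical table and column names; adjust to match your own database -->
  <comments>Example dataset description.</comments>
  <tables>
    <table name="participants" comments="One row per enrolled participant.">
      <column name="phq9_total" comments="Sum of PHQ-9 items 1 through 9 at the baseline visit."/>
    </table>
  </tables>
</schemaMeta>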
Publish the documentation
- The output of the previous commands will consist of one directory per database/schema, containing several HTML, CSS and JavaScript files, with an index.html as the entry point (home page) of each schema.
- The files must be published to a web server. Due to the relative paths and file names used, they will not work if published to OneDrive, Dropbox, Google Drive or SharePoint.
- U-M faculty, students and staff are eligible for free web hosting through ITS. This requires setting up an AFS file share to store the files, and a URL that will resolve to it. See https://its.umich.edu/computing/web-mobile/web-hosting/afs-websites or https://its.umich.edu/computing/web-mobile/web-hosting/virtual-hosting for more information.
- Once the web server is up and running, simply copy the contents of datadict\html to your public HTML folder (e.g. ~/www/html)
- For example, we published to our AFS file share at: /afs/umich.edu/group/e/efdcautomation/Public/html/dev/datadict/
- This gives us access via the following test web server URL: https://efdc-automation.web.itd.umich.edu/datadict/
- It is recommended that you create an index.html file that links to each individual schema. For an example, see the code used in: https://efdc-automation.web.itd.umich.edu/datadict/index.html
<html>
<head>
<title>Data Dictionaries</title>
</head>
<body>
<h1>Data Dictionaries</h1>
<p>Listing of data dictionaries generated by SchemaSpy.</p>
<ul>
<li><a href="IHS_2023/index.html">IHS Study (2023 Cohort)</a></li>
<li><a href="PROMPT/index.html">PROMPT Study</a></li>
</ul>
</body>
</html>
Notes
- Although SchemaSpy can be helpful for documenting datasets, we recommend using an enterprise-grade data dictionary tool whenever possible. This type of application normally requires institutional buy-in as part of a larger data governance and data stewardship program.
Resources
About the Author
Gabriel Mongefranco is a Mobile Data Architect at the University of Michigan Eisenberg Family Depression Center. Gabriel has over a decade of experience in data analytics, dashboard design, automation, back end software development, database design, middleware and API architecture, and technical writing.