Friday, April 7, 2017

Lab 3: Data Normalization, Geocoding, and Error Assessment (Sand Mining Suitability Project)

Goals and Objectives

The goal of this lab was to use ArcMap's geocoding capabilities to geocode the locations of 19 sand mines in Wisconsin and compare my results with the actual locations. This lab is part of a multipart project to build a suitability/risk model for sand mining in western Wisconsin. The focus of this lab is normalizing raw data, geocoding several addresses, and comparing the geocoding results. According to the downloaded DNR data, there are 128 sand mine locations in the state of Wisconsin. To reduce the workload, each student in the class was randomly assigned 19 mines, and each mine was assigned to four people so that results could be compared.

There are five objectives for this lab:
  1. In an Excel table normalize the address data for sand mines in Wisconsin
  2. In ArcMap connect to the ESRI geocoding service and geocode the assigned 19 mines
  3. Using the department ArcGIS server add the Public Land Survey System (PLSS) feature class
  4. Manually locate all 19 mines that use PLSS locations
  5. Compare personal results with those of classmates and with the actual locations given by the DNR coordinates

Methods

Data Normalization
When the Excel table of sand mine addresses was downloaded from the DNR website, the address data was found not to be normalized. The first step of this lab was to normalize the address data for the 19 assigned mines. Data normalization is a refinement process that organizes data into columns within a table in order to reduce redundancy and ensure data integrity. In the downloaded table, the main element that needed normalizing was the address column, which held many address components in a single field: house number, street name, street type, city, state, and postal code. Through normalization, the city, state, and postal code were split out into separate columns so that address matching would be more accurate.
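The splitting step can also be sketched in code. The lab did this by hand in Excel, so the following is only a minimal Python illustration. It assumes a hypothetical raw format of "street, city, state ZIP", which may not match the DNR table, and the function name `normalize_address` and the sample address are made up for the example.

```python
import re

# Assumed (hypothetical) raw format: "<street address>, <city>, <state> <zip>".
# The real DNR column may be laid out differently.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>[^,]+?)\s*,\s*(?P<city>[^,]+?)\s*,\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def normalize_address(raw):
    """Split one combined address string into separate fields.

    Returns None when the string does not fit the assumed pattern
    (for example, a PLSS description), so it can be handled by hand.
    """
    match = ADDRESS_PATTERN.match(raw)
    if match is None:
        return None
    return {
        "street": match.group("street"),
        "city": match.group("city").title(),
        "state": match.group("state").upper(),
        "zip": match.group("zip"),
    }


# Made-up street address, followed by a made-up PLSS-style description.
print(normalize_address("123 Main St, eau claire, wi 54701"))
print(normalize_address("NE 1/4 S12 T24N R9W"))
```

Returning None for strings that do not fit the pattern mirrors what happened in the lab: PLSS descriptions cannot be treated as street addresses and have to be located separately.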

Geocoding
After the address data was normalized, the assigned 19 mines were geocoded in ArcMap. First, ArcGIS Online had to be logged into using the university's enterprise account, which gave access to the geocoding service. A base map was added to the viewer to assist with geocoding, and the Excel sheet with the 19 mine addresses or PLSS locations was added as well. Two geocoding processes were used. The first used an address locator: the Geocoding toolbar was turned on and the Geocode Addresses option was selected. The World Geocode Service was chosen as the locator and the Excel sheet as the input table, and the locator's input fields were matched to the corresponding fields in the Excel sheet. Once geocoding finished, a window (Figure 1) opened reporting how well the addresses had been matched. Only one address could not be matched. The address locator works well for locations that have street addresses, but it cannot match Public Land Survey System (PLSS) descriptions. The unmatched address was inspected and was indeed a PLSS location.

Next, the second process was used: geocoding based on PLSS locations. A database connection to the WiDNR2014 server was added in ArcMap, which gave access to the townships and sections feature classes needed to interpret PLSS locations. Both feature classes were added to the viewer. Using the address inspection function of the Geocoding toolbar, each address was assessed for accuracy with the Zoom to Candidates function in the Interactive Rematch window. Each geocoded location was checked against the base map; if the point was not where the location was specified, the PLSS location was found and the new address was picked from the map. PLSS locations were found by using the township feature class to locate the North or South component and the sections feature class to locate the East or West component. Every geocoded address was moved at least a little. The point for each mine was placed where the driveway to the mine meets a major road, and none of the automatically geocoded addresses fell at that location. Once all addresses were properly located, the feature class containing the geocoded locations was saved and exported as a shapefile.
Figure 1. Geocoding addresses matching report.


Compare to Fellow Classmates Results
After the 19 mines were geocoded, the class combined its results so that students could compare their individual geocoded locations with those of their classmates. The Wisconsin State Cartographer's Office requires a local accuracy at a 95% confidence interval. It was decided to assess whether this standard could be upheld by checking the accuracy of each of the 19 geocoded mines. To begin, the shapefiles from all students who submitted their mines for comparison were added to ArcMap. Out of a class of 27 students, only 16 submitted their geocoded locations, so some data was missing and a full comparison was not possible. First, all of the shapefiles were merged using the Merge tool. Next, the attribute table of the merged addresses was checked for merge accuracy. The Mine Unique ID field was the most important: if any mine did not have its ID in this field, the Editor toolbar was used to move the ID manually into the correct field. Finally, both the shapefile containing my geocoded mines and the shapefile containing my fellow students' geocoded mines were projected to the same coordinate system, one measured in a physical distance unit rather than decimal degrees. A coordinate system measured in feet was used.

Next, the Point Distance analysis tool was used. It determines the distances from each input point feature to all points in a second feature class, called the near features, within a specified search radius. This tool was selected because it could not be assumed that the closest point would represent the correct mine: many sand mines in western Wisconsin are close together. My geocoded points were used as the input features and the merged feature class of my classmates' geocoded locations was used as the near features. No search radius was specified because the distance between two points for the same mine could vary widely. After the tool finished running, the attribute table of the merged classmates' feature class was exported as a table. In this exported table, all fields except the Mine Unique ID field were deleted using the Delete Field tool. This table was then joined to the output table of the Point Distance tool by matching the Near FID and Object ID fields. The join allowed the Mine Unique IDs to be compared, so that the distance between my geocoded location for a mine and other students' locations for the same mine could be identified. These distances for each of the 19 mines were recorded in an Excel table (Figure 4).
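The logic of this step (measure every input-to-near pair, then keep only the pairs that share a Mine Unique ID) can be sketched in plain Python. This is not the ArcGIS tool itself, only an illustration of the same idea. The tuple layout, the function names, and the coordinates are all hypothetical, and the mine ID is carried on each point directly instead of being attached later through a table join.

```python
import math


def point_distance(input_points, near_points):
    """Return one row per (input, near) pair, like an all-pairs distance table.

    Each point is a tuple (fid, mine_id, x, y) in a projected coordinate
    system, so distances come out in that system's units (feet in the lab).
    """
    rows = []
    for in_fid, in_mine, in_x, in_y in input_points:
        for near_fid, near_mine, near_x, near_y in near_points:
            rows.append({
                "input_fid": in_fid,
                "near_fid": near_fid,
                "input_mine": in_mine,
                "near_mine": near_mine,
                "distance": math.hypot(near_x - in_x, near_y - in_y),
            })
    return rows


def same_mine_rows(rows):
    """Keep only pairs where both points describe the same mine."""
    return [row for row in rows if row["input_mine"] == row["near_mine"]]


# Made-up coordinates in feet; mine IDs 202 and 250 are used only as labels.
my_points = [(0, 202, 0.0, 0.0), (1, 250, 1000.0, 0.0)]
classmate_points = [(0, 250, 1000.0, 300.0), (1, 202, 30.0, 40.0)]

for row in same_mine_rows(point_distance(my_points, classmate_points)):
    print(row["input_mine"], row["distance"])
```

Computing all pairs first and filtering by ID afterwards avoids relying on the nearest point, which matters when several mines sit close together.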

Compare to Actual Mine Coordinate Locations
The exact location of each mine was also given to the class, from coordinate data that had been left out of the originally supplied DNR table but was included in the download. This data was compared to my 19 geocoded points. First, the two feature classes were projected to the same coordinate system, one not in decimal degrees, so that distance could be calculated. Next, the Point Distance tool from the Proximity toolset of the Analysis toolbox was used, for the same reason as before: with many sand mines close together in western Wisconsin, it could not be assumed that the closest point would represent the correct mine. My geocoded points were used as the input features and the exact-location feature class was used as the near features. No search radius was specified because the distance between the two points for a mine could vary widely. The tool produced an output table that paired each input feature ID with every near feature ID and gave the distance separating the two points. The feature IDs were cross-referenced with their corresponding Mine Unique IDs, and the table was searched for the rows in which both IDs referred to the same mine. The distances for each of the 19 points were recorded in an Excel table (Figure 5).

Results

Below are the data table without normalization (Figure 2) and the data table with normalization (Figure 3). The main difference is in how the addresses are recorded. The normalized addresses have a separate field for each component of the address, while the addresses without normalization have the whole address lumped together in one field.
Figure 2. Data table without normalization.      Figure 3. Data table with normalization.

Below are the distance comparisons between my geocoded mine locations and the geocoded locations of my classmates (Figure 4), and my geocoded mine locations and the actual coordinate locations of the mines (Figure 5). 
Figure 4. Distance comparison between my geocoded mine locations and the geocoded locations of my classmates. 

Figure 5.  Distance comparison between my geocoded mine locations and the actual coordinate locations of the mines.

Below is a map with my estimates of mine locations, my classmates' estimated mine locations, and the correct locations of the mines (Figure 6).
Figure 6. Map comparing the locations where my classmates estimated mine locations to be, the correct mine locations, and my estimates of mine locations.  The mine unique IDs have been labeled.

Discussion

Most of my geocoded mine locations were generally correct. However, my data did not meet the 95% accuracy standard upheld by the Wisconsin State Cartographer's Office; there were many more errors than I expected. Several errors contributed to the distances between my geocoded points and the actual locations of the sand mines. The errors encountered included both inherent and operational errors. Inherent errors are errors that occur as a result of the spatial nature of geographic data. Operational errors are errors that occur during the procedures for collecting, managing, and using geographic data. Three mines had a significantly greater amount of error: mines 208, 229, and 289.
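One simple way to check a set of distances against a 95% standard is to count how many fall within a tolerance. The sketch below assumes that reading of the standard, which the lab does not spell out. The tolerance value and the distances are made up and do not come from Figure 5, and the function names are hypothetical.

```python
def share_within_tolerance(distances, tolerance):
    """Return the fraction of distances that are at or below the tolerance."""
    if not distances:
        raise ValueError("no distances supplied")
    within = sum(1 for d in distances if d <= tolerance)
    return within / len(distances)


def meets_standard(distances, tolerance, required_share=0.95):
    """True when at least required_share of the points are within tolerance."""
    return share_within_tolerance(distances, tolerance) >= required_share


# Made-up distances in feet with one large outlier, and a hypothetical
# tolerance of 100 feet.
sample_distances = [10.0, 20.0, 30.0, 5000.0]
print(share_within_tolerance(sample_distances, 100.0))
print(meets_standard(sample_distances, 100.0))
```

With only 19 mines, a single point outside the tolerance already drops the share below 95%, so a few large errors like those for mines 208, 229, and 289 are enough to fail a check of this kind.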

The inherent errors that occurred include attribute data input error. The addresses in the original downloaded DNR dataset might have been incorrect, or could have been jumbled by the DNR during data input. I think this is the error that occurred for mine 208. Mine 208 did not have a PLSS location, just a street address, so the geocoded location should have been more accurate. The WI DNR may have entered a street address that does not agree with the coordinate location. Mine 229 might have been an inherent error as well. The PLSS location of the correct coordinates is different from the PLSS location that was downloaded and used for geocoding, which would be an error on the part of the WI DNR in not entering the correct data. The same goes for mine 289: the downloaded data recorded the mine in township 24 North, but the coordinate location places it in township 25 North.

Operational errors accounted for more of the smaller errors in distance. In a few cases there were multiple mines near each other in the same PLSS area, and I chose a different mine instead of the correct one. An example of this was mine 250, where there were multiple mines in one location. My geocoded locations were closer to my classmates' locations than to the correct mine locations. I attribute this to the fact that the class received general rules for where to place geocoded points: where the driveway to the mine meets a major road, or at the main entrance to the mine. The coordinates of the correct locations were generally at the center of the mine. An example of this was mine 202, where the correct coordinate location was at the center of the mine while my classmates' geocoded locations and mine were at the end of the driveway. Geocoding also required many decisions by the user, such as exactly where to place the point and which entrance appeared to be the main one when there were several, which made error very likely. Protocols for data normalization and for how data tables should be organized could have prevented another operational error, which occurred when merging datasets: when users normalized their tables, some changed the column names, so fields could not be matched and merged correctly.

We can tell which points are correct and which are not by using the latitude and longitude values supplied by the DNR. This is the most accurate data in use, but it could still contain errors from the process of collecting the coordinates in the field. The only way to know whether the coordinates are correct is to go to the coordinate location and see if a sand mine is there.

Conclusion

This lab taught many important lessons about the complicated process of geocoding and the inaccuracies that can arise. It is important to be aware that errors may exist in datasets and may affect the final outcomes of a project. Error could have been minimized if protocols had been in place for normalization and for where geocoded points should be placed relative to the mine. Ultimately, errors always exist in datasets and are an unavoidable reality, but being able to minimize errors to acceptable standards and to recognize that data errors will occur are important skills.