Final report of ITS Center project: Identification of traffic patterns leading to crashes

UVA Center for Transportation Studies

A Research Project Report

For the Center for ITS Implementation Research

A U.S. DOT University Transportation Center

FREEWAY CRASH PREDICTIONS BASED ON REAL-TIME PATTERN CHANGES IN TRAFFIC FLOW CHARACTERISTICS

 

Principal Investigators:

Dr. Nicholas Garber
Lili Luo

            

 

Center for Transportation Studies

University of Virginia

Thornton Hall

351 McCormick Road, P.O. Box 400742

Charlottesville, VA 22904-4742

804.924.6362

 

January 2006
Research Report No. UVACTS-15-0-101
 

Disclaimer

The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein.  This document is disseminated under the sponsorship of the Department of Transportation, University Transportation Centers Program, in the interest of information exchange.  The U.S. Government assumes no liability for the contents or use thereof.

Research Report No. UVACTS-15-0-101
Date: 01-20-06

Freeway Crash Predictions Based on Real-Time Pattern Changes in Traffic Flow Characteristics

 

 

 

By: Lili Luo

Dr. Nicholas J. Garber

 

 

 

 

 

 

 

 

 

 

 

 

A Research Project Report for the Intelligent Transportation Systems Implementation Center (ITS)

A U.S. DOT University Transportation Center

 

 

Dr. Nicholas J. Garber

Department of Civil Engineering

Email: njg@virginia.edu

 

 

The Center for Transportation Studies (CTS) at the University of Virginia produces outstanding transportation professionals and innovative research results, and provides important public service. The Center is committed to academic excellence, multi-disciplinary research and the development of state-of-the-art facilities. Through a partnership with the Virginia Department of Transportation's (VDOT) Research Council (VTRC), CTS faculty hold joint appointments, VTRC research scientists teach specialized courses, and graduate student work is supported through a Graduate Research Assistantship Program. CTS receives substantial financial support from two federal University Transportation Center grants: the Mid-Atlantic Universities Transportation Center (MAUTC) and the National ITS Implementation Research Center (ITS Center). Other related research activities of the faculty are funded through FHWA, NSF, the US Department of Transportation, VDOT, other governmental agencies and private companies.

 


CTS Website: http://cts.virginia.edu

Center for Transportation Studies
University of Virginia
351 McCormick Road, P.O. Box 400742
Charlottesville, VA 22904-4742
434.924.6362

 


1. Report No.: UVACTS-15-0-101

2. Government Accession No.:

3. Recipient's Catalog No.:

4. Title and Subtitle: Freeway Crash Predictions Based on Real-Time Pattern Changes in Traffic Flow Characteristics

5. Report Date: January 20, 2006

6. Performing Organization Code:

7. Author(s): Lili Luo; Dr. Nicholas J. Garber (Academic Advisor)

8. Performing Organization Report No.:

9. Performing Organization and Address: Center for Transportation Studies, University of Virginia, PO Box 400742, Charlottesville, VA 22904-4742

10. Work Unit No. (TRAIS):

11. Contract or Grant No.:

12. Sponsoring Agencies' Name and Address: Office of University Programs, Research Innovation and Technology Administration, US Department of Transportation, 400 Seventh Street, SW, Washington DC 20590-0001

13. Type of Report and Period Covered: Final Report

14. Sponsoring Agency Code:

15. Supplementary Notes:

 

 

16. Abstract

In recent years, attempts have been made to develop crash prediction models based on real-time detector data.  Since studies in this field are primarily theoretical, improvements can be made in various aspects.  It is expected that the final product of this study will be a program that integrates with the Advanced Traffic Management System so that operators of Smart Travel Centers can take action to prevent or at least reduce the chances of crash occurrence.  At this first stage, efforts were made to identify the crash-leading patterns and the factors describing the patterns.

Crashes that occurred on basic segments of interstate highways in Northern Virginia between July 1, 2003 and June 30, 2004 were obtained from police crash reports.  The associated traffic conditions, as well as the normal non-crash conditions defined by the traffic parameters, were collected from the Smart Travel Lab.

Three pattern recognition techniques were applied: K-means clustering, the Naïve-Bayes method and discriminant analysis.  The overall classification error rate remained at about 50%, so these techniques were unable to identify the crash-leading patterns.

 

17. Key Words: Freeway Crash Predictions, Traffic Flow Characteristics

18. Distribution Statement: No restrictions. This document is available to the public.

19. Security Classif. (of this report): Unclassified

20. Security Classif. (of this page): Unclassified

21. No. of Pages:

22. Price: N/A

 

 

 

Abstract

In recent years, attempts have been made to develop crash prediction models based on real-time detector data.  Since studies in this field are primarily theoretical, improvements can be made in various aspects.  It is expected that the final product of this study will be a program that integrates with the Advanced Traffic Management System so that operators of Smart Travel Centers can take action to prevent or at least reduce the chances of crash occurrence.  At this first stage, efforts were made to identify the crash-leading patterns and the factors describing the patterns.

Crashes that occurred on basic segments of interstate highways in Northern Virginia between July 1, 2003 and June 30, 2004 were obtained from police crash reports.  The associated traffic conditions, as well as the normal non-crash conditions defined by the traffic parameters, were collected from the Smart Travel Lab.

Three pattern recognition techniques were applied: K-means clustering, the Naïve-Bayes method and discriminant analysis.  The overall classification error rate remained at about 50%, so these techniques were unable to identify the crash-leading patterns.

Possible reasons for the unsuccessful classification are discussed and recommendations are made for future studies.

Acknowledgements

Thanks are due to all who helped in this project.  I especially want to thank my advisor, Dr. Nicholas J. Garber, for his advice and support.  I also want to thank Dr. John Miller and Lewis Woodson of the Virginia Transportation Research Council for the crash data collection.  Special thanks go to the Smart Travel Lab staff, Xiaoning Lu and Guimin Zhang, who assisted in extracting the traffic data for this project.  I also want to thank Yuan Lu for his help in data collection and his constructive discussions.


Table of Contents

Abstract

Acknowledgements

Table of Contents

List of Figures

List of Tables

CHAPTER 1: INTRODUCTION

1.1 Project Purpose and Scope

1.2 Thesis Overview

CHAPTER 2: LITERATURE REVIEW

2.1 Study of Oh et al.

2.1.1 Overview

2.1.2 Data Collection and Reduction

2.1.3 Crash and Non-Crash Traffic Pattern Recognition

2.1.4 Comment

2.2 Study of Golob et al.

2.2.1 Overview

2.2.2 Data Collection and Reduction

2.2.3 Crash Traffic Pattern Recognition

2.2.4 Comment

2.3 Study of Lee et al.

2.3.1 Overview

2.3.2 Data Collection and Reduction

2.3.3 Crash and Non-Crash Traffic Pattern Recognition

2.3.4 Comment

2.4 Study of Abdel-Aty et al.

2.4.1 Overview

2.4.2 Data Collection and Reduction

2.4.3 Crash and Non-Crash Traffic Pattern Recognition

2.4.4 Comment

2.5 Summary

CHAPTER 3: DATA COLLECTION AND REDUCTION

3.1 Crash Data

3.2 Traffic Data

CHAPTER 4: METHODOLOGY

4.1 Preliminary Analysis

4.1.1 Correlations of the traffic variables

4.1.2 Distribution of the Variables

4.2 Pattern Recognition Methods

4.2.1 K-means Clustering Method

4.2.2 Naïve-Bayes Method

4.2.3 Discriminant Analysis

4.3 Notations

4.4 Summary

CHAPTER 5: ANALYSIS RESULTS

5.1 Classification results based on single variable

5.2 Classification results based on volume mean and speed variance

5.3 Classification results based on all variables

5.4 Classification results based on the changes of the variables

5.5 Classification results of day time traffic patterns

5.6 Summary

CHAPTER 6: RESULT DISCUSSION AND RECOMMENDATION

6.1 Reasons of unsuccessful identification of the crash leading patterns

6.2 Recommendations for future study

6.3 Summary

REFERENCES

APPENDIX: SUMMARY TABLES OF CLASSIFICATION RESULTS

 


List of Figures

Figure 1 Implication of accident by traffic dynamics

Figure 2 Determination of actual time of crash occurrence from speed profile

Figure 3 No significant changes in the traffic flow during a crash

Figure 4 Significant changes are shown during crash

Figure 5 Detector stop working

Figure 6 Observations of traffic parameters at interval 45

Figure 7 Data fitting of occupancy mean

Figure 8 Data fitting of occupancy variance

Figure 9 Data fitting of speed mean

Figure 10 Data fitting of speed variance

Figure 11 Data fitting of volume mean

Figure 12 Data fitting of volume variance

Figure 13 Input features of combination 30_5 in the case of crash occurred at 9:00

Figure 14 Classification results based on occupancy mean by using k-means clustering

Figure 15 Classification results based on occupancy mean by using Naïve-Bayes

Figure 16 Classification results based on occupancy mean by using Quadratic Discriminant Analysis

Figure 17 Classification results based on volume mean by using k-means clustering

Figure 18 Classification results based on volume mean by using Naïve-Bayes

Figure 19 Classification results based on volume mean by discriminant analysis

Figure 20 Classification results based on speed mean by k-means clustering

Figure 21 Classification results based on speed mean by Naïve-Bayes

Figure 22 Classification results based on speed mean by discriminant analysis

Figure 23 Classification results based on occupancy variance by k-means clustering

Figure 24 Classification results based on occupancy variance by Naïve-Bayes

Figure 25 Classification results based on occupancy variance by discriminant analysis

Figure 26 Classification results based on volume variance by k-means clustering

Figure 27 Classification results based on volume variance by Naïve-Bayes

Figure 28 Classification results based on volume variance by discriminant analysis

Figure 29 Classification results based on speed variance by k-means clustering

Figure 30 Classification results based on speed variance by Naïve-Bayes

Figure 31 Classification results based on speed variance by discriminant analysis

Figure 32 Error rate based on volume mean & speed variance by using k-means clustering

Figure 33 Error rate based on volume mean & speed variance by using Naïve-Bayes

Figure 34 Error rate based on volume mean & speed variance by using discriminant analysis

Figure 35 Error rate based on multiple variables by using k-means clustering

Figure 36 Error rate based on multiple variables by using Naïve-Bayes

Figure 37 Error rate based on multiple variables by using discriminant analysis

Figure 38 Classification error rate based on the changes of the variables by using k-means clustering

Figure 39 Classification error rate based on the changes of the variables by using Naïve-Bayes

Figure 40 Classification error rate based on the changes of the variables

Figure 41 Classification error rate of daytime pattern based on speed variance

Figure 42 Classification error rate of daytime pattern based on speed variance

Figure 43 Classification error rate of daytime pattern based on speed variance

Figure 44 Classification error rate of daytime pattern based on volume mean and speed variance

Figure 45 Classification error rate of daytime pattern based on volume mean and speed variance

Figure 46 Classification error rate of daytime pattern based on volume mean and speed variance

Figure 47 Classification error rate of daytime pattern based on all 6 variables

Figure 48 Classification error rate of daytime pattern based on all 6 variables

Figure 49 Classification error rate of daytime pattern based on all 6 variables

 


List of Tables

Table 1 Illustration of the imputation process

Table 2 Summary of the previous studies

Table 3 Summary of crashes on I-66

Table 4 Summary of crashes on I-95

Table 5 Summary of crashes on I-395

Table 6 Summary of crashes on I-395 HOV

Table 7 Crash Prediction Error Rate based on visual inspection of the change in speed & occupancy

Table 8 Visual identification of crash cases from the corresponding non-crash cases

Table 9 Crash Prediction Error Rate based on visual inspection of

Table 10 Crash Prediction Error Rate based on Occupancy Mean Using K-means Clustering

Table 11 Correlations of the variables of combination 45_5 at time interval 9

Table 12 Estimated parameters of Weibull Distribution for occupancy

Table 13 Estimated parameters of Log-logistic Distribution for occupancy variance

Table 14 Kolmogorov-Smirnov test of speed mean

Table 15 Estimated parameters of Log-logistic Distribution for speed variance

Table 16 Estimated parameters of Log-logistic Distribution for volume mean

Table 17 Kolmogorov-Smirnov test of volume variance

Table 18 Covariance matrix of crash pattern based on combination of 15_15

Table 19 Covariance matrix of non-crash pattern based on combination of 15_15

 


CHAPTER 1: INTRODUCTION

Every year since 2000, over 40,000 lives have been lost in motor vehicle traffic crashes in the United States, which makes such crashes the eighth leading cause of death overall and the third leading cause of death for young people aged 3 to 33.[1]  Highway safety may seem intangible to the public, yet it is one of the most challenging issues facing transportation researchers and traffic engineers.

Efforts have been made to relate traffic flow characteristics to crash occurrence.  However, crash prediction models based on annual average daily traffic (AADT), or even hourly volume, have limited performance.  Such highly aggregated traffic data are considered one of the factors that conceal the unique characteristics that lead to crashes.

On the other hand, studies based on so-called 'real-time' traffic data have been conducted ever since data from inductive loop detectors became available.  A good example is incident detection: algorithms and models were developed to help operators of the Smart Travel Center detect an incident quickly and accurately so that rescue teams could respond as soon as possible.  In this case, researchers try to improve highway safety in a reactive way.

Naturally, researchers have begun to approach the crash prediction problem using 'real-time' traffic data.  Only a few studies have been carried out since 2000.  Most researchers claimed it was likely that traffic patterns that lead to crashes differ from the patterns that do not lead to crashes.  This is viewed as a potentially feasible approach to real-time crash prediction and prevention.

1.1 Project Purpose and Scope

The purpose of this project is to identify the traffic patterns leading to crashes so that action can be taken to reduce the risk of a crash when hazardous patterns occur.  The goal is to integrate the crash prediction algorithms into the Traffic Management System in the Smart Travel Center.  When traffic patterns leading to crashes are detected, an alarm is issued to the operator in the Smart Travel Center.  The operator can then intervene to change the traffic patterns so that the probability of crash occurrence is reduced.  As the first stage of the study, this thesis focuses only on discriminating the traffic patterns leading to crashes from the traffic patterns not leading to crashes.  Since the existence of ramps and interchanges affects the traffic characteristics, only basic freeway segments were considered in this study.

Crash data of the study were extracted from the FR300 police crash reports at the Virginia Department of Transportation.  The corresponding traffic data under crash and non-crash condition were obtained from the Smart Travel Lab which stores loop detector data from Smart Travel Centers of both the Hampton Roads and Northern Virginia area.

It should be mentioned that the performance of classifying the traffic patterns is measured by three classification error rates: the overall misclassification rate, the rate of misclassifying crash conditions as non-crash conditions, and the rate of misclassifying non-crash conditions as crash conditions. Successful pattern recognition should have low error rates on all three measures.

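$$\text{Overall error rate} = \frac{\text{number of misclassified cases}}{\text{total number of cases}} \tag{1}$$

The error rate of misclassifying crash traffic pattern to non-crash traffic pattern and the error rate of misclassifying non-crash traffic pattern to crash traffic pattern are defined in Equations 2 and 3 respectively.

$$\text{Error rate (crash as non-crash)} = \frac{\text{number of crash cases classified as non-crash}}{\text{total number of crash cases}} \tag{2}$$

$$\text{Error rate (non-crash as crash)} = \frac{\text{number of non-crash cases classified as crash}}{\text{total number of non-crash cases}} \tag{3}$$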
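The three error rates can be computed directly from paired lists of actual and predicted labels. The following is a minimal sketch; the function name `error_rates` and the example labels are hypothetical and are not taken from the study data.

```python
def error_rates(actual, predicted):
    """Return (overall, crash_as_noncrash, noncrash_as_crash) error rates.

    Labels are booleans: True = crash pattern, False = non-crash pattern.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n_crash = sum(1 for a in actual if a)
    n_noncrash = len(actual) - n_crash
    if n_crash == 0 or n_noncrash == 0:
        raise ValueError("need at least one crash and one non-crash case")
    # Crash cases that the classifier labeled as non-crash.
    missed = sum(1 for a, p in zip(actual, predicted) if a and not p)
    # Non-crash cases that the classifier labeled as crash.
    false_alarm = sum(1 for a, p in zip(actual, predicted) if not a and p)
    overall = (missed + false_alarm) / len(actual)
    return overall, missed / n_crash, false_alarm / n_noncrash


# Hypothetical labels for eight cases (illustration only).
actual = [True, True, True, True, False, False, False, False]
predicted = [True, False, False, True, True, False, False, False]
overall, crash_err, noncrash_err = error_rates(actual, predicted)
print(overall, crash_err, noncrash_err)
```

Reporting all three rates matters because a classifier that labels every case as non-crash would have a zero error rate on non-crash cases while missing every crash.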

1.2 Thesis Overview

The remainder of this thesis consists of the following:

Chapter 2: Literature review - a review of past studies on crash prediction models based on loop detector data

Chapter 3: Data collection and reduction - a description of construction of the study data set

Chapter 4: Methodology - an overview of the pattern recognition techniques used in the project

Chapter 5: Analysis result - the results given by the pattern recognition techniques

Chapter 6: Result discussion and recommendations - summary of the conclusions and suggestions for future study.


CHAPTER 2: LITERATURE REVIEW

Freeway crash prediction is one of the most challenging research topics in transportation study today and only a few studies based on real-time detector data have been carried out since 2000.  Some prediction models were developed based on previous studies by using speed variance as the major input while others made efforts to identify the precursors before constructing the crash prediction models.  A review of these previous studies is detailed below.  Although the prediction models are also mentioned, the focus is given to the identification of traffic patterns leading to crashes.

2.1 Study of Oh et al

2.1.1 Overview

The first study was conducted by Oh, Ritchie and Chang in 2000.[2]  In this study, traffic conditions were classified into two patterns - disruptive and normal.  The disruptive traffic condition was the one that potentially led to an accident occurrence and a normal traffic condition was the one not involved in an accident.  It was assumed that the traffic dynamics index could represent an obvious difference between the two traffic conditions and Figure 1 illustrates this idea.

Figure 1 Implication of accident by traffic dynamics

(Source: Oh, Ritchie and Chang (2000))

Six variables - the mean and standard deviation of occupancy, flow and speed - were considered as candidate traffic dynamics indices and were used to form two data sets, one for the normal condition and one for the disruptive condition.  A t-test showed that the standard deviation of speed was the most statistically significant factor representing the change of traffic conditions.  The probability of accident occurrence at a given standard deviation of speed was then modeled based on Bayesian decision theory.
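The Bayesian step can be written with Bayes' rule. The form below is a sketch in notation assumed here, not taken from Oh et al.: $x$ is the 5-minute standard deviation of speed, $D$ and $N$ denote the disruptive and normal conditions, $p(x \mid \cdot)$ are the class-conditional densities and $P(\cdot)$ the prior probabilities. How the original study estimated the densities and priors is not described in this review.

```latex
P(D \mid x) = \frac{p(x \mid D)\,P(D)}{p(x \mid D)\,P(D) + p(x \mid N)\,P(N)}
```

Under this reading, an observed $x$ would be flagged as disruptive when $P(D \mid x)$ exceeds a chosen threshold.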

2.1.2 Data Collection and Reduction

The test bed for this research was a section of I-880 in Hayward, California, about 9.2 miles long with 4-5 lanes and a separate high-occupancy vehicle (HOV) lane.  Seventeen northbound and 18 southbound detector stations are located in the study freeway segment.  The 10-second real-time traffic flows, occupancies and speeds were collected by loop detectors from February 16 to March 19, 1993 during two daily periods (5 am-10 am and 2 pm-7 pm).  The accident profile was developed by running four probe vehicles on the study section throughout the data collection period.

A total of 91 accidents occurred during the data collection period but only 52 were selected for analysis.  The lane-specific 10-second data from upstream stations were extracted and averaged to station level.  The data were then reduced to generate two new data sets: one representing the normal traffic condition, taken as a 5-minute period 30 minutes before an accident, and the other representing the disruptive traffic condition, taken as the 5-minute period right before an accident.  Each of the two data sets therefore contains 52 groups of 10-second flow, occupancy and speed data covering a 5-minute period.

2.1.3 Crash and Non-Crash Traffic Pattern Recognition

Based on the two data sets, six measurements for both traffic conditions were considered to represent the traffic patterns and were investigated using the t-test.  These measurements were the mean and standard deviation of volume, occupancy and speed.  The results indicated that the 5-minute standard deviation of speed was the most statistically significant for differentiating normal and disruptive traffic conditions.  The 5-minute standard deviation of the 10-second speed was, therefore, chosen to discriminate between the normal and disruptive traffic conditions.
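As an illustration of this comparison, the sketch below computes the standard deviation of the 10-second speeds in a 5-minute window and a two-sample t statistic between the normal and disruptive groups. The review does not say which form of the t-test Oh et al. used, so Welch's unequal-variance statistic is an assumption here, as are the function names.

```python
import math
import statistics


def speed_std(speeds_10s):
    """Sample standard deviation of the 10-second speeds in one window."""
    return statistics.stdev(speeds_10s)


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (assumed variant, unequal variances)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / se
```

In use, `speed_std` would be applied to each of the 52 normal and 52 disruptive windows, and `welch_t` to the two resulting lists of standard deviations; a large absolute value suggests that the measure separates the two conditions.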

2.1.4 Comment

The study of Oh et al. provided a starting point for studying accident prediction algorithms based on real-time traffic data.  It presents a view of the future in which the crash prediction algorithm is integrated into the Advanced Traffic Management System (ATMS) or Advanced Traffic Information System (ATIS) to create a proactive traffic operation system.

However, some limitations exist in the model:

(1) Insufficient accidents impaired the robustness of the model

According to the study report, a total of 91 crashes occurred during the data collection period but only 52 of them were used.  The other 39 accidents were eliminated because they were "unmatched with real time traffic data".  It is unclear whether the traffic data for the 39 accident periods were bad or whether they simply did not show the same pattern as the 52 accident periods.  In any case, 52 crashes were insufficient to ensure the reliability and robustness of the derived probability function, and it was therefore suggested that new crash and traffic data be obtained for model validation.

(2) The impact of aggregated station-level speed

In Oh's study, the lane-specific speeds were aggregated to a station-level speed.  Traffic flow characteristics are believed to differ from lane to lane, so such aggregation is suspected to lose important traffic pattern information and to affect the accuracy of the proposed accident prediction model.

(3) Uncertainty of normal condition 30 minutes before an accident

Representing the normal traffic condition by traffic data 30 minutes prior to a crash is doubtful, since the traffic condition at that time may already be part of the disruptive condition.

(4) Potential violation of the t-test assumption

Oh did not indicate whether the variables were tested for normality.  If the variables do not follow a normal distribution, the use of the t-test is not appropriate.

2.2 Study of Golob et al

2.2.1 Overview

In early 2002, Golob and his research group published their findings about the relationship between different types of freeway crashes and real-time traffic flow.[3]  Based on their findings, they developed a real-time safety monitoring tool, Flow Impacts on Traffic Safety (FITS).  They collected 30-second interval volume and occupancy data and converted them into several traffic flow regimes.  Each regime corresponds to a specific pattern of crash types, determined through non-linear multivariate analyses.  When the 30-second traffic flow is observed to fall in one of the traffic flow regimes, the types of crashes that are most likely to occur are predicted.

 

2.2.2 Data Collection and Reduction

The study area of this research covered six major freeway routes (a total of 130 miles with a number of lanes between 3 and 6 in each direction) in Orange County, California in 1998.  Since only the single loop detectors were available in the study area, no speed data were collected.

Crash data were obtained from the Traffic Accident Surveillance and Analysis System (TASAS) database, which documents all police-reported crashes in the California Highway System.  A total of 1192 mainline crashes were used for the analysis.  Each of these had valid loop detector data for the full 30 minutes preceding the crash for three types of lanes (right, interior and left lane) at the station closest to the crash.

The 30-second interval volume and occupancy data were collected for 30 minutes prior to the crash.  The crash time was based on the reported crash time.  Noting that the reported crash times are typically rounded off to the nearest 5 minutes, the research team decided to discard traffic data 2.5 minutes preceding the reported crash time to reduce the inclusion of post-crash condition.

2.2.3 Crash Traffic Pattern Recognition

Golob focused only on the crash condition analysis.  In the first step, he classified the crashes by crash type (based on type of collision: rear end, side swipe or hit object); crash location (left lane, interior lane, right lane, within shoulder or off-road beyond the shoulder area); and crash severity (injuries or fatalities).

Then, using the volume and occupancy data, he constructed four groups of statistical measures for each of the left, interior and right lanes:

(1) The median of the ratio of volume to occupancy;

(2) The difference of the 90th percentile and 50th percentile in the ratio of volume to occupancy;

(3) The mean volume over the entire 27.5-minute period preceding the accident;

(4) The standard deviation of the 30-second volumes.
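The four measures can be sketched for one lane as follows. After the last 2.5 minutes are discarded, the 30-minute record leaves 27.5 minutes, or 55 thirty-second intervals. The function names, the linear-interpolation percentile rule and the skipping of zero-occupancy intervals are assumptions made for this illustration; the review does not state how Golob handled these details.

```python
import statistics

# 27.5 minutes of 30-second intervals remain after discarding 2.5 minutes.
N_INTERVALS = int(27.5 * 60 / 30)  # 55


def percentile(values, q):
    """Percentile by linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def lane_measures(volumes, occupancies):
    """Return the four measures for one lane from 30-second observations."""
    # Assumption: intervals with zero occupancy are skipped in the ratio.
    ratios = [v / o for v, o in zip(volumes, occupancies) if o > 0]
    if not ratios:
        raise ValueError("no interval with positive occupancy")
    p50 = percentile(ratios, 50)
    return {
        "median_ratio": p50,
        "p90_minus_p50": percentile(ratios, 90) - p50,
        "mean_volume": statistics.mean(volumes),
        "std_volume": statistics.stdev(volumes),
    }
```

Applying `lane_measures` to the left, interior and right lanes yields the 4 x 3 = 12 variables used in the next step.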

Next, Golob applied Principal Components Analysis (PCA) to all 12 variables (four measures for each of the three lane types) for the crashes during daylight or dusk-dawn on dry roads to identify the independent variables.  Simply put, PCA seeks an orthogonal transformation

$$Z = AX \tag{4}$$

of the original variables $X$, with $A$ an orthogonal matrix, such that the components of $Z$ are mutually uncorrelated and each component $Z_i$ has the maximum variance among all normalized linear combinations uncorrelated with $Z_1, Z_2, \ldots, Z_{i-1}$.

The central idea of this transformation is to form a series of mutually uncorrelated variables, called components, which are linear combinations of the original variables, and then to order these components by their contributions to the variance of the system input.  The components dominating the variance of the inputs are then identified.  It can be proved that the total variance of X is the same as the total variance of Z.
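The variance-preservation statement follows in one line. As a sketch, assume the transformation is written $Z = AX$ with $A$ orthogonal ($A^{\mathsf T}A = I$), let $\Sigma$ denote the covariance matrix of $X$, and read "variance" as total variance (the sum of the component variances). These symbols are notation assumed here.

```latex
\operatorname{Cov}(Z) = A \Sigma A^{\mathsf T},
\qquad
\sum_i \operatorname{Var}(Z_i)
  = \operatorname{tr}\!\left(A \Sigma A^{\mathsf T}\right)
  = \operatorname{tr}\!\left(\Sigma A^{\mathsf T} A\right)
  = \operatorname{tr}(\Sigma)
  = \sum_i \operatorname{Var}(X_i)
```

The share of variance explained by the first $k$ components is then the sum of their variances divided by $\operatorname{tr}(\Sigma)$.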


The first six principal components were found to account for 86.8% of the variance and were explained by the following original variables:

a. The median of the ratio of volume to occupancy for the interior lane;

b. The difference of the 90th percentile and 50th percentile in the ratio of volume to occupancy for the interior lane;

c. The difference of the 90th percentile and 50th percentile in the ratio of volume to occupancy for the right lane;

d. The mean volume over the entire 27.5-minute period preceding the accident for the left lane;

e. The standard deviation of the 30-second volumes for the interior lane;

f. The standard deviation of the 30-second volumes for the right lane.

With these six traffic variables, Golob performed K-means clustering analysis combined with Non-linear Canonical Correlation Analysis (NLCCA) to separate the observations into the optimal number of traffic flow clusters that can best be correlated with the different types of crashes.  It was found that eight traffic flow clusters best correspond to specific types of crashes during daylight on dry roads.
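The clustering step alone can be illustrated with a plain implementation of Lloyd's K-means algorithm. This is a generic sketch, not Golob's procedure: the NLCCA step and the choice of the number of clusters are not reproduced, and passing the initial centroids explicitly is a simplification assumed here.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Hypothetical two-dimensional points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

In a setting like Golob's, each point would be the vector of six traffic variables for one crash, and the number of centroids would be varied to find the clustering that best matches the crash types.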

2.2.4 Comment

By applying non-parametric discriminant analysis (K-means clustering) and NLCCA to traffic volume variables preceding the crash occurrence and to crash classes, Golob identified the correspondence between the traffic flow regimes and different types of crashes during daylight on dry roads.  However, the analysis results are difficult to implement in the field, since they only indicate that one type of crash is more likely than other types under certain traffic conditions.  Because the non-crash normal condition is not considered, the operator of a Smart Travel Center cannot tell whether a crash will occur or not.

2.3 Study of Lee et al

2.3.1 Overview

In 2002, Lee, Saccomanno and Hellinga developed a log-linear model to estimate crash frequency from the speed variation along a lane, the speed variation across lanes and the traffic density, for given roadway geometry, weather and time of day.[4]  One year later, in a follow-up study, the research team found that the impact of the speed variation across lanes was insignificant.[5]  As a result, this variable was dropped from the earlier model, and one more traffic factor, the traffic queue, was added.

2.3.2 Data Collection and Reduction

The data used for the model calibration were collected from a 10 km section of the Gardiner Expressway in Toronto, Canada with 3 or 4 lanes.  The study freeway section had 38 loop detector stations (19 were located upstream of ramps, 19 were located upstream of straight sections).  The real-time loop detector data were collected over a 13-month period from January 1998 to January 1999 for 24 hours each weekday.  Traffic data of speed, volume and occupancy were pooled every 20 seconds.  The incident logs maintained by the Traffic Control Center were used to create a crash profile.

A total of 234 crashes were used for the analysis.  To verify the actual time of the crashes, a visual speed profile analysis was performed for all 234 crashes.  As illustrated in Figure 2, a sudden drop in speed in all three lanes at the upstream detectors indicated the actual start time of a crash.  With the actual time of crash occurrence, the speed five minutes prior to the crash and the density at the time of the crash could be extracted.  This step is critical in developing crash prediction models.  In most cases, the recorded crash occurrence time is not exactly the actual crash occurrence time.  Failure to determine the actual crash occurrence time leads to the inadvertent inclusion of traffic patterns after the crash.

Figure 2 Determination of actual time of crash occurrence from speed profile

(Source: Lee, Saccomanno and Hellinga (2003))
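Lee et al. identified the crash time by visual inspection of the speed profiles. The sketch below shows one way such a check could be automated; it is not their procedure. The function name and the 20 km/hour drop threshold are hypothetical choices for illustration.

```python
def first_sudden_drop(lane_speeds, drop_kmh=20.0):
    """Return the index of the first polling interval at which every lane's
    speed fell by at least drop_kmh relative to the previous interval,
    or None if no such interval exists.

    lane_speeds holds one list of speeds (km/hour) per lane, all of equal length.
    """
    n_intervals = len(lane_speeds[0])
    for t in range(1, n_intervals):
        if all(lane[t - 1] - lane[t] >= drop_kmh for lane in lane_speeds):
            return t
    return None


# Hypothetical 20-second speeds for three lanes at an upstream detector.
lanes = [
    [90, 88, 60, 40],
    [95, 94, 70, 45],
    [85, 86, 62, 42],
]
print(first_sudden_drop(lanes))
```

Requiring the drop in every lane mirrors the description above, in which the speed fell in all three lanes at once; a drop in a single lane would not be flagged.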

 

2.3.3 Crash and Non-Crash Traffic Pattern Recognition

Unlike Oh's study, which applied the t-test to all six traffic dynamics indicators to determine the most appropriate one, Lee's study used no clear process to identify the traffic indicators that differentiate between crash and non-crash conditions.  The author stated that two basic principles were used to determine the traffic indicators: first, the indicators should have significantly different values for crash and non-crash cases; second, the most appropriate observation time slice is the one that maximizes the difference in indicator values between crash and non-crash cases.  However, no experimental methodology or result was documented in the report.  In the 2003 model, the author updated the effect of the traffic factors by generating three traffic indicators: the coefficient of variation of speed along the lane (CVS1); the traffic queue at crash occurrence, defined by the absolute difference of speed between the upstream and downstream ends of the road section; and the average station-level density at the instant the crash occurred (D).  The coefficient of variation of speed, the density and the queue indicator are expressed in Equations 5, 6 and 7 respectively:

                                  (5)

(Source: Lee, Saccomanno and Hellinga (2003))

where:

 

 = actual time of crash

 

 = observation time of crash

 = standard deviation of speed on lane computed over period

 = average speed on lane computed over period (km/hour)

 = polling interval of loop detectors (seconds)

 = speed on lane  at time t (km/hour)

 = total number of lanes

 

                                                                                             (6)

(Source: Lee, Saccomanno and Hellinga (2003))


where:

 = volume on lane  at the time of crash (vehicle/hour)

 = speed on lane  at the time of crash (km/hour)

 

                                          (7)

(Source: Lee, Saccomanno and Hellinga (2003))

where:

 = average speed difference between upstream and downstream locations (km/hour)

 = average speeds computed over period of upstream and downstream of a location respectively (km/hour)

 = time interval of observation of speed (seconds)

 = observation time slice duration (seconds)

 

 = time of crash occurrence
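
The three indicators can be illustrated numerically. The following is a minimal Python sketch, not Lee's implementation: the exact functional forms are assumptions inferred from the variable definitions above (coefficient of variation averaged over lanes, density taken as volume divided by speed and averaged over lanes, queue taken as the absolute difference of mean speeds), the population standard deviation is an arbitrary choice, and all function names are hypothetical.

```python
import statistics


def cvs(lane_speeds):
    """Mean over lanes of (std dev of speed / mean speed) in the observation period.

    lane_speeds: one list of speed readings (km/hour) per lane, one reading
    per polling interval. Assumed form, inferred from the definitions only.
    """
    ratios = [statistics.pstdev(s) / statistics.mean(s) for s in lane_speeds]
    return sum(ratios) / len(ratios)


def density(lane_volumes, lane_speeds_at_crash):
    """Mean over lanes of volume / speed (vehicles/km) at the time of the crash."""
    per_lane = [v / s for v, s in zip(lane_volumes, lane_speeds_at_crash)]
    return sum(per_lane) / len(per_lane)


def queue_indicator(upstream_speeds, downstream_speeds):
    """Absolute difference between mean upstream and mean downstream speed."""
    return abs(statistics.mean(upstream_speeds) - statistics.mean(downstream_speeds))
```

Because the density function needs the volume and speed at the moment of the crash, it cannot be evaluated before the crash happens, which is the practical limitation discussed in the comment that follows.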

2.3.4 Comment

Compared to Oh's study, Lee's study was based on more crash samples and a longer data collection period. By identifying the actual crash occurrence time, he reduced the error of including traffic characteristics after the crash occurrence. His effort focused mostly on developing the crash prediction model using the log-linear method. Lee used the term 'crash precursor' for the inputs of his prediction model, but he selected these crash precursors simply on the basis of previous studies. A more detailed procedure for identifying traffic characteristics is needed before constructing the crash prediction model. The established model takes into account a comprehensive set of crash-related factors, namely three traffic factors and three external factors (roadway geometry, weather and time of day), which suggests the necessity of controlling environmental factors. However, Lee's model requires the density at the moment of the crash, which is unrealistic for real-time prediction and limits its application in the Smart Travel Center.

2.4 Study of Abdel-Aty et al

2.4.1 Overview

Abdel-Aty and his research team have carried out a series of studies on crash prediction modeling since 2004. Unlike the previous researchers, he conducted a series of crash and non-crash traffic pattern classification analyses before developing crash prediction models.

2.4.2 Data Collection and Reduction

The test bed for this research was a divided 13.25-mile stretch of I-4 in Orlando, Florida, where 28 dual loop detectors are located. Traffic data including volume, speed and occupancy (30-second polling interval) between April 1, 1999 and November 30, 1999 were available. A total of 377 crashes with corresponding traffic data were used for the analysis. A crash profile was built from police crash reports.

Instead of extracting data only from the station adjacent to the crash location, Abdel-Aty also obtained data from six other stations, five upstream and one downstream, for a period of 30 minutes before the recorded crash time. Each 30-minute period was divided into 5-minute intervals. In each interval, six variables were considered: the mean and variance of speed, occupancy and volume. For the seven stations included in the study, a total of 252 traffic input features (7 stations × 6 intervals × 6 variables) were constructed. Traffic data for the corresponding non-crash conditions were also extracted, where "corresponding" means the same location, the same time of day and the same day of the week. External factors such as roadway geometry and time of day were therefore controlled.
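
The feature count can be made concrete with a short Python sketch. It is an illustration only, not Abdel-Aty's code: the nested-list input layout, the constant names and the use of the population variance are assumptions made for the example.

```python
from statistics import mean, pvariance

PARAMETERS = ("speed", "occupancy", "volume")
N_STATIONS = 7               # crash station, five upstream, one downstream
N_INTERVALS = 6              # 30 minutes split into 5-minute intervals
READINGS_PER_INTERVAL = 10   # ten 30-second readings per 5-minute interval


def build_features(station_data):
    """Build the mean/variance features for one crash or non-crash case.

    station_data[station][parameter] is a list of 60 thirty-second readings
    covering the 30 minutes before the recorded crash time (assumed layout).
    Returns a dict keyed by (station, interval, parameter, statistic).
    """
    features = {}
    for station, params in enumerate(station_data):
        for name in PARAMETERS:
            series = params[name]
            for k in range(N_INTERVALS):
                start = k * READINGS_PER_INTERVAL
                chunk = series[start:start + READINGS_PER_INTERVAL]
                features[(station, k, name, "mean")] = mean(chunk)
                features[(station, k, name, "var")] = pvariance(chunk)
    return features
```

With seven stations, six intervals, three parameters and two statistics, the returned dictionary holds 7 × 6 × 3 × 2 = 252 entries.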

Detector data were aggregated to station-level data for analysis. Since not all of the 377 crashes had good raw data at the seven stations, Abdel-Aty imputed the missing data by referring to the traffic data at the adjacent upstream station 5 minutes before, or at the downstream station 5 minutes after, the current time period of the studied station. The imputation process is shown in Table 1.

Table 1 Illustration of the imputation process

Before imputation:

             8:30-8:35   8:35-8:40   8:40-8:45       8:45-8:50   8:50-8:55   8:55-9:00
station D    30          25          21              27          25          34
station E    21          31          Missed          26          30          33
station F    37          32          20              29          33          27

After imputation:

             8:30-8:35   8:35-8:40   8:40-8:45       8:45-8:50   8:50-8:55   8:55-9:00
station D    30          25          21              27          25          34
station E    21          31          Imputed as 25   26          30          33
station F    37          32          20              29          33          27
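
The imputation rule illustrated in Table 1 can be sketched in Python as follows. This is a minimal sketch under assumptions: the function name and data layout are hypothetical, stations are listed from upstream to downstream, and the upstream value is tried before the downstream one (Table 1 shows the upstream value being used, but the source does not state an order of preference).

```python
def impute(table, stations):
    """Fill missing interval values from neighbouring stations.

    table: dict mapping station name -> list of values per 5-minute interval,
           with None marking a missing value.
    stations: station names ordered from upstream to downstream.
    """
    filled = {name: list(values) for name, values in table.items()}
    for idx, name in enumerate(stations):
        for k, value in enumerate(table[name]):
            if value is not None:
                continue
            # Candidate 1: adjacent upstream station, one interval earlier.
            if idx > 0 and k > 0 and table[stations[idx - 1]][k - 1] is not None:
                filled[name][k] = table[stations[idx - 1]][k - 1]
            # Candidate 2: adjacent downstream station, one interval later.
            elif (idx < len(stations) - 1 and k < len(table[name]) - 1
                  and table[stations[idx + 1]][k + 1] is not None):
                filled[name][k] = table[stations[idx + 1]][k + 1]
    return filled


table_1 = {
    "D": [30, 25, 21, 27, 25, 34],
    "E": [21, 31, None, 26, 30, 33],
    "F": [37, 32, 20, 29, 33, 27],
}
print(impute(table_1, ["D", "E", "F"])["E"])  # [21, 31, 25, 26, 30, 33]
```

The missing 8:40-8:45 value at station E is filled with the 8:35-8:40 value at station D, matching the table.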


2.4.3 Crash and Non-Crash Traffic Pattern Recognition

In his first study, Abdel-Aty used the Probabilistic Neural Network (PNN) method to classify traffic patterns into crash and non-crash classes. PNN is a distance-based classifier used to identify patterns. In the classical PNN method, the Euclidean distance is the measure applied to classify the inputs. Since the elements of the input vector are not independent, direct application of the Euclidean distance is questionable. Abdel-Aty therefore used Principal Component Analysis to transform the original input vectors into a new set of vectors whose elements are independent.
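
To show how a distance-based PNN assigns a class, here is a minimal Python sketch of the classical form with a Gaussian kernel on the Euclidean distance. It is not Abdel-Aty's model: the Principal Component Analysis step is omitted, and the function name, the `spread` parameter name and the toy training vectors are assumptions chosen for illustration.

```python
import math


def pnn_classify(x, training, spread):
    """Return the class whose training patterns give the highest average
    Gaussian kernel value at x.

    x: input vector (sequence of floats).
    training: dict mapping class label -> list of training vectors.
    spread: kernel width; smaller values weight the closest patterns more.
    """
    scores = {}
    for label, patterns in training.items():
        total = 0.0
        for p in patterns:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * spread ** 2))
        scores[label] = total / len(patterns)
    return max(scores, key=scores.get)


# Toy, made-up training patterns for the two classes.
training = {
    "crash": [(0.0, 0.0), (0.2, 0.1)],
    "non-crash": [(1.0, 1.0), (0.9, 1.2)],
}
print(pnn_classify((0.1, 0.1), training, spread=0.5))  # crash
```

As the spread shrinks, each class score is dominated by its single closest training pattern, so the decision approaches that of a nearest neighbor classifier.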

To identify the most significant factors as inputs to the PNN model, the author used matched case-control logistic regression as follows:

\[ \ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x \qquad (8) \]

where:

\pi(x) = E(Y \mid x) = conditional mean of dummy variable Y representing crash occurrence

x = one of the traffic variables such as average volume, speed and so on

He then defined the hazard ratio as the exponential of β1 to indicate the risk of crashes.[8] A high hazard ratio indicates a high probability of a crash. The average volume, the average occupancy and the logarithm of the coefficient of variation for each time interval and each station were fed into the matched case-control logistic regression model. It was shown that the logarithm of the coefficient of variation (standard deviation / average) of speed at two consecutive upstream stations plus the station where the crash occurred dominates the pattern changes. For the purpose of prediction, he chose the logarithm of the coefficient of variation of speed within the time period 10-15 minutes before the crash as the input of the PNN model.
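
The two quantities used in this step are simple to compute. The sketch below is illustrative only: the function names are hypothetical, the population standard deviation is an assumed choice, and the coefficient value passed in would come from a fitted regression that is not shown here.

```python
import math
from statistics import mean, pstdev


def log_cv(speeds):
    """Logarithm of the coefficient of variation (std dev / mean) of speed."""
    return math.log(pstdev(speeds) / mean(speeds))


def hazard_ratio(beta1):
    """Hazard ratio defined as the exponential of the regression coefficient."""
    return math.exp(beta1)
```

A coefficient of zero gives a hazard ratio of 1, meaning the variable does not change the crash risk; positive coefficients give ratios above 1.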

In the first study, Abdel-Aty not only evaluated the accuracy of crash prediction but also estimated the false alarm rate. However, the results of the classical PNN method (without Principal Component Analysis) were not significantly different from those of the modified PNN method: the classical PNN performed better than the modified PNN in terms of overall error rate but worse with respect to crash identification accuracy [6]. He also noted that a small spread value makes the PNN act as a nearest neighbor classifier.

In the second study, Abdel-Aty developed Generalized Estimating Equations (GEE) for crash prediction using the same data set as the previous study.[7] A different technique was used to identify the variables that have significant effects on crash occurrence. Because of the high correlation between occupancy and speed, he eliminated occupancy from the data set, which reduced the 252 traffic input features to 168. He also applied a one-to-one match between non-crash and crash cases, which gives the same number of non-crash cases as crash cases. Furthermore, eight geometric factors were considered in this study. He then used stepwise logistic regression analysis to eliminate the factors without a significant impact on crashes.

The general idea of forward stepwise regression is to begin with no model terms and, at each step, to add the most statistically significant term (the one with the highest F statistic) until no significant terms are left.[9] It was shown that the standard deviation of volume at station D within 15 minutes prior to the crash, the standard deviation of speed at station F within 15 minutes prior to the crash, the standard deviation of speed at station E within 5 minutes prior to the crash and the average speed at station E within 5 minutes prior to the crash were the most significant traffic characteristics for discriminating traffic patterns leading to crashes from patterns that do not lead to crashes. Here Station F was the station nearest to the crash, and Stations E and D were the two adjacent stations upstream of Station F. The stations were about 0.5 miles apart. With these traffic factors and three geometry factors identified, the input vectors for the GEE model were ready.

GEE is one of the generalized linear regression methods. Its central idea is to take the correlation within a cluster into account in model development. Unlike the previous studies, Abdel-Aty viewed traffic data collected at the same location as correlated. He therefore treated the 56 stations as 56 clusters in developing the Generalized Estimating Equations. Three different correlation structures (independent, exchangeable and autoregressive) were tested and compared in the model. The results showed that both the exchangeable and autoregressive structures performed better than the independent structure.[7]


He concluded that an increase in the speed variance over 15 minutes at a certain location increases the crash probability, that low variability in volume increases the crash probability 1 mile downstream, and that low average speed at a certain location increases the crash probability 0.5 miles downstream. These conclusions may hold when the traffic flow is in a congested regime. However, the explanation of crash probability based on a single variable, such as low variability in volume or low average speed, is doubtful.

2.4.4 Comment

Building on the experience and lessons of the pioneers, Abdel-Aty recognized the importance of identifying traffic patterns leading to crashes and performed a series of studies to explore the difference between crash and non-crash patterns. He expanded the spatial scope to seven stations, which provides information about the spatial variation in traffic patterns. The PNN-based classifier and the forward stepwise regression method he used provided innovative research methods for later studies, but his attempt to explain the cause of crashes by individual traffic parameters makes the conclusions questionable. Since traffic patterns on the freeway are mainly described by the mean and variance of volume, occupancy and speed, the traffic pattern leading to a crash is also described by these variables. It would therefore be more appropriate to use the joint effects of traffic variables, rather than a single variable, to predict the crash probability. Applying the results of Abdel-Aty's study may thus lead to an unnecessarily high false alarm rate and noise.

2.5 Summary

In this chapter, the studies of four representative researchers in the field of crash prediction modeling using real-time loop detector data were reviewed. Among them, Oh and Abdel-Aty conducted crash and non-crash traffic pattern recognition analyses before developing the prediction model. Lee's study did not show a clear procedure in this respect, and Golob based his research only on crash conditions. Data collection and reduction for each study were reviewed because they significantly affect the construction of input features. The critical issue in each study was also discussed at the end of the section for that study. Although some efforts were made to identify traffic patterns leading to crashes, implementing the findings of the above studies in the field is difficult because of unrealistic data requirements or a lack of consideration of the joint effects of the traffic variables. Table 2 summarizes the key factors of the previous studies:


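Table 2 Summary of Previous Studies

Oh
    Crash occurrence time: Visually determined
    Crash data: 5-minute period just before crash
    Non-crash traffic data: 5-minute period 30 minutes prior to crash
    Method to determine variables that separate the two conditions: t-test
    Variables used in prediction model: standard deviation of speed
    Data level in analysis: station level
    Aggregation level: 5 minutes
    Data polling interval: 10 seconds
    Sample size: 52

Golob
    Crash occurrence time: Removed the last 2.5 minutes
    Crash data: 30-minute period before crash
    Non-crash traffic data: N/A
    Method to determine variables that separate the two conditions: N/A
    Variables used in prediction model: 1. standard deviation of volume; 2. mean volume; 3. volume to occupancy ratio
    Data level in analysis: detector level
    Aggregation level: 30 seconds
    Data polling interval: 30 seconds
    Sample size: 1192

Lee
    Crash occurrence time: Visually determined
    Crash data: 5-minute period before crash
    Non-crash traffic data: Other 5-minute periods of the same data
    Method to determine variables that separate the two conditions: No detailed process
    Variables used in prediction model: 1. standard deviation of speed; 2. speed difference between upstream and downstream station; 3. density
    Data level in analysis: detector level
    Aggregation level: 5 minutes
    Data polling interval: 20 seconds
    Sample size: 234

Abdel-Aty
    Crash occurrence time: Based on crash report
    Crash data: 15-minute period just before crash
    Non-crash traffic data: The same time of day, day of week and station
    Method to determine variables that separate the two conditions: PNN, stepwise logistic regression
    Variables used in prediction model: 1. speed variance; 2. volume variance; 3. mean speed
    Data level in analysis: station level
    Aggregation level: 5 minutes
    Data polling interval: 30 seconds
    Sample size: 377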

 

To address the limitations of the previous studies, up to 75 combinations of time period and time interval were studied in this report to find an appropriate combination for prediction. Furthermore, the joint effects of two or more traffic variables were studied to determine the traffic patterns leading to a crash.

CHAPTER 3: DATA COLLECTION AND REDUCTION

The construction of the study data set is the fundamental task before performing any analysis.  The quality of the study data set has significant impacts on the analysis results.  Thus, a great amount of effort was spent on verifying the validity of the traffic data as well as extracting data from different sources.

The first attempt was to extract both crash and traffic data from the Smart Travel Lab (STL) for the Hampton Roads area because of the availability of over five years of records on a 13-mile stretch of Interstates 64, 264 and 564. Over 2000 crashes with occurrence time, date and approximate location were extracted from the STL database for the years 1999, 2000 and 2001. The corresponding traffic data for 45 minutes before and 45 minutes after the crashes were also retrieved. However, simple plots of the traffic data at the station level showed that only about 30% of the data were usable. Furthermore, the 1-minute interval traffic data for the Hampton Roads area had been obtained by dividing the 2-minute aggregated data by two, which essentially modified the original traffic data. Even worse, in most cases (although not all), speeds over 65 miles per hour were truncated and recorded as 65 miles per hour in this database. Because of these data quality issues in the Hampton Roads area, we turned our attention to the Northern Virginia area and were able to obtain the required data for the study.

3.1 Crash data

Before the end of 2004, no crash data from Northern Virginia were stored in the Smart Travel Lab, so other sources had to be found to obtain crash information. In this study, FR 300 crash reports were obtained from the Virginia Transportation Research Council and manually tabulated into Excel files. The information included the crash number, route, direction, date, day of week, recorded occurrence time, jurisdiction, occurrence location in terms of mile marker and description, number of vehicles involved in the crash, crash type, roadway alignment, weather, roadway surface condition, roadway defect, lighting condition, number of injured, injury type, number of fatalities, and vehicle maneuver. The total number of lanes at the crash location and the lane in which the crash occurred were also extracted by reading the crash diagram. Although not all of this information was used in this study, further studies can benefit from the data collection effort.

Due to time constraints, only one year of crash data, from July 1, 2003 to June 30, 2004, was obtained, containing a total of 2908 crash records from both directions of I-95, I-66 and I-395 over a stretch of 50 miles. Since no detectors were installed on the shoulder, crashes that occurred on the shoulder while it was open to traffic were eliminated from the data set, which left 2865 crashes. Tables 3 through 6 summarize these crashes. Because the traffic characteristics near interchanges and ramps are quite different from those on basic freeway segments, only the crashes on basic freeway segments were considered in this study. Basic freeway segments are those outside the influence areas of ramps and weaving sections, which extend 500 ft upstream and 2500 ft downstream of an on-ramp, 2500 ft upstream and 500 ft downstream of an off-ramp, and from 500 ft upstream of the merge point marking the beginning of a weaving area to 500 ft downstream of the diverge point forming its end (HCM 1985).

Table 3 Summary of Crashes on I-66

I-66             EB     WB     Subtotal   Fraction of total
Basic Segment    437    490    927        0.609868
Ramp             293    300    593        0.390132
Subtotal         730    790
Total                          1520

 

Table 4 Summary of Crashes on I-95

I-95             NB     SB     Subtotal   Fraction of total
Basic Segment    399    350    749        0.827624
Ramp             64     92     156        0.172376
Subtotal         463    442
Total                          905

 

Table 5 Summary of Crashes on I-395

I-395            NB     SB     Subtotal   Fraction of total
Basic Segment    70     54     124        0.300242
Ramp             139    150    289        0.699758
Subtotal         209    204
Total                          413

 

Table 6 Summary of Crashes on I-395 HOV

I-395 HOV        Southbound HOV   Reversible HOV   Northbound HOV
Crashes          3                17               7
Total            27
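
As a consistency check on Tables 3 through 6, the counts can be tallied with a few lines of Python. The dictionary layout and function names are hypothetical; the numbers are the subtotals reported in the tables.

```python
# Subtotals from Tables 3-5 (both directions combined) and the Table 6 total.
crashes = {
    "I-66": {"basic": 927, "ramp": 593},
    "I-95": {"basic": 749, "ramp": 156},
    "I-395": {"basic": 124, "ramp": 289},
}
hov_total = 27


def total_crashes(by_route, hov):
    """Sum basic-segment and ramp crashes over all routes, plus HOV crashes."""
    return sum(sum(counts.values()) for counts in by_route.values()) + hov


def basic_share(counts):
    """Fraction of a route's crashes that occurred on basic segments."""
    return counts["basic"] / (counts["basic"] + counts["ramp"])


print(total_crashes(crashes, hov_total))  # 2865
```

The total of 2865 agrees with the number of crashes remaining after the shoulder crashes were removed.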

 

3.2 Traffic Data

Because of time constraints, only traffic data from the closest upstream station of each crash were extracted. This is based on the assumption that the station closest to the crash on the upstream side records the traffic patterns right before the occurrence of the crash. In most cases this is not exact, because the crash location is always some distance from the station. However, since the stations in the Northern Virginia area are densely distributed (at least one station per mile, and in most cases 0.5 miles apart), the difference is relatively small and can be neglected. Data were obtained from a total of 97 mainline stations on the study freeway network. The number of detectors per station varies from 2 to 5 according to the total number of lanes at the location. The detectors on the mainline freeways in Northern Virginia are all double loop detectors, which provide accurate speed readings.

The first step in retrieving the traffic data was to locate the corresponding upstream station for each crash by matching the crash mile marker with the station mile marker. In this process, some of the crashes were found to have occurred too far from the last station of the study area (over 5 miles) and were therefore removed from the data set. After this process, 1172 crashes were left. Second, the station-level traffic data were plotted to determine the actual crash occurrence time and to eliminate the crashes for which the detector did not function during the crash occurrence period. Speed and occupancy over 1.5 hours (45 minutes before and 45 minutes after the crash) were plotted for each case. Figures 3 through 5 illustrate the plotted results. The x-axis is the time in minutes from the beginning of the day. The blue lines are the plots of occupancy and the red lines are the plots of speed.


Figure 3 No significant changes in traffic flow during crash

Note: recorded crash time is 15:00 which is 900 minutes away from the beginning of the day.

 


Figure 4 Significant changes are shown during crash


Note: Recorded Crash Time is 6:20, which is 380 minutes away from the beginning of the day.  The identified actual crash time is 385 minutes away from the beginning of the day, which is 6:25 in the morning.

 


Figure 5 Detector stopped working

The reason why some crashes show no significant change in the traffic parameters is not clear. It may result from the characteristics of both the traffic flow and the crash. For instance, when the volume on the freeway is low, a minor crash may not have a serious impact on the traffic flow, so no significant change can be observed in the traffic parameters.

After removing the cases where the detector was not working, only 586 crashes were left. For 50% of these, the crash time could not be identified from the plot, as shown in Figure 3, so the recorded crash time was taken as the actual crash time. For the remaining 50% of crashes, the actual crash times were identified and re-entered into the query to extract the corresponding traffic data.

Next, the detector-level 1-minute interval traffic data for all lanes of the corresponding station, 45 minutes before and 45 minutes after the corrected crash time, were obtained from the database. Although only the data for the 45 minutes before each crash are used in this study, the data after the crashes can be applied in later research.

Previous studies [4][5] found that the traffic on adjacent lanes has an insignificant effect on crash occurrence. Thus, only traffic data from the lane in which the crash occurred were used in this study.

Once the traffic data for the 45 minutes before each crash had been obtained, the next step was to extract the corresponding non-crash traffic data. These data, from the same station during the same time period, were extracted from the database for the whole year. A MATLAB program was written to filter both the crash and non-crash traffic data. The purpose of the filtering was to remove the records with missing data points during the 45-minute study period and to eliminate abnormal data. The abnormal data include:

1. Two out of three traffic parameters are zero when the other is not;

2. One of the three traffic parameters is zero when the other two are not;

3. Occupancy greater than 100;

4. Any of the traffic parameters less than zero.
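
The four rules above can be expressed compactly. The sketch below is written in Python for illustration and is not the MATLAB program used in the study; the function names, the tuple layout (volume, speed, occupancy) and the use of None for a missing minute are assumptions.

```python
def is_abnormal(volume, speed, occupancy):
    """Apply the four abnormal-data rules to one 1-minute reading."""
    values = (volume, speed, occupancy)
    if any(v < 0 for v in values):        # rule 4: negative value
        return True
    if occupancy > 100:                   # rule 3: occupancy above 100
        return True
    zeros = sum(1 for v in values if v == 0)
    # Rules 1 and 2: some, but not all, of the three parameters are zero.
    return zeros in (1, 2)


def is_valid_record(readings, expected_length=45):
    """Check one 45-minute record of (volume, speed, occupancy) readings.

    A record is rejected if any minute is missing (None), if it does not
    contain the expected number of minutes, or if any reading is abnormal.
    """
    if len(readings) != expected_length or any(r is None for r in readings):
        return False
    return not any(is_abnormal(*r) for r in readings)
```

Note that a reading with all three parameters equal to zero passes these rules, since it is consistent with no vehicles being detected during that minute.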

After this step, 446 crashes were left.  In some cases, crashes were removed from the study data set due to the lack of good quality non-crash data.


To ensure that the non-crash traffic condition was similar to the crash condition, the non-crash data were randomly selected from the same day of the week as the crash data. As a result, the same number of non-crash cases as crash cases was obtained. However, a close inspection of the 446 pairs of traffic data showed that another form of abnormal data had to be excluded from the data set: cases in which the readings of the traffic parameters remained the same for all 45 1-minute intervals.
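
The two operations described here, matching on day of the week and rejecting records whose readings never change, can be sketched as follows. This is a hedged Python illustration rather than the study's program; the function names and the (date, readings) candidate layout are hypothetical.

```python
import random
from datetime import date


def is_stuck(readings):
    """True when every 1-minute reading in the window is identical."""
    return len(set(readings)) <= 1


def pick_non_crash(candidates, crash_weekday, rng=random):
    """Randomly pick one non-crash record from the same day of the week.

    candidates: list of (date, readings) pairs for the same station and
                time of day on non-crash days.
    crash_weekday: weekday of the crash, 0 = Monday ... 6 = Sunday.
    Returns None when no usable candidate exists.
    """
    usable = [c for c in candidates
              if c[0].weekday() == crash_weekday and not is_stuck(c[1])]
    return rng.choice(usable) if usable else None


# Example with made-up readings: only the first candidate is a Monday.
candidates = [
    (date(2003, 7, 7), [52, 55, 49]),   # Monday
    (date(2003, 7, 8), [60, 61, 58]),   # Tuesday
]
print(pick_non_crash(candidates, crash_weekday=0)[0])  # 2003-07-07
```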

Finally, the traffic data of 391 crash cases and the same number of non-crash cases were obtained. Among these 391 crash cases, 123 showed a visual change in either occupancy or speed after the crash occurred. The result is summarized in Table 7:

Table 7 Crash Prediction Error Rate based on visual inspection of the change in speed and occupancy

Scenario                                                              After the crash
There is a visual change in speed and occupancy (Figure 4)            123 crashes
There is no visual change in speed and occupancy (Figure 3)           268 crashes
Mean Error Rate of determining if it is a crash based on visual
inspection of the change in speed and occupancy after crash occurred  68.5%
Total                                                                 391 crashes
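
The mean error rate in Table 7 is consistent with the share of crashes that showed no visual change. Assuming this is how the figure was obtained (the source does not state the calculation), the arithmetic is:

```latex
\[
\text{Mean error rate} \approx \frac{268}{391} \approx 0.685 = 68.5\%
\]
```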

By checking the traffic data 45 minutes before crashes, 114 crashes were visually identified to be different from the corresponding non-crash cases.  The criteria of visual identification are listed in Table 8:


Table 8 Visual identification of crash cases from corresponding non-crash cases

Criteria

Meaning

1.The plot of crash cases are generally above non-crash cases

The mean of traffic parameter is greater than the mean of the non-crash case

2.The plot of crash cases are generally below non-crash cases

The mean of parameter is less than the mean of the non-crash case