Final report of ITS Center project:
Identification of traffic patterns leading to crashes
UVA Center for
Transportation Studies
A Research Project
Report
For the Center for
ITS Implementation Research
A U.S. DOT University
Transportation Center
FREEWAY CRASH PREDICTIONS BASED ON REAL-TIME PATTERN CHANGES IN
TRAFFIC FLOW CHARACTERISTICS
Principal Investigators:
Dr. Nicholas Garber
Lili Luo
Center for Transportation Studies
University of Virginia
Thornton Hall
351 McCormick Road, P.O. Box 400742
Charlottesville, VA 22904-4742
804.924.6362
January 2006
Research Report No. UVACTS-15-0-101
The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. This document is disseminated under the sponsorship of the Department of Transportation, University Transportation Centers Program, in the interest of information exchange. The U.S. Government assumes no liability for the contents or use thereof.


Freeway Crash Predictions Based on Real-Time Pattern Changes
in Traffic Flow Characteristics
By: Lili Luo
Dr. Nicholas J. Garber
A Research Project Report for the Intelligent Transportation Systems Implementation Center (ITS)
A U.S. DOT University Transportation Center
Dr. Nicholas J. Garber
Department of Civil Engineering
Email: njg@virginia.edu
Center for
Transportation Studies at the University of Virginia produces outstanding
transportation professionals, innovative research results and provides
important public service. The Center for Transportation Studies is committed to
academic excellence, multi-disciplinary research and to developing
state-of-the-art facilities. Through a partnership with the Virginia Department
of Transportation's (VDOT) Research Council (VTRC), CTS faculty hold joint
appointments, VTRC research scientists teach specialized courses, and graduate
student work is supported through a Graduate Research Assistantship Program.
CTS receives substantial financial support from two federal University
Transportation Center Grants: the Mid-Atlantic Universities Transportation
Center (MAUTC), and through the National ITS Implementation Research Center
(ITS Center). Other related research activities of the faculty include funding
through FHWA, NSF, US Department of Transportation, VDOT, other governmental
agencies and private companies.
1. Report No.: UVACTS-15-0-101
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: Freeway Crash Predictions Based on Real-Time Pattern Changes in Traffic Flow Characteristics
5. Report Date: January 20, 2006
6. Performing Organization Code:
7. Author(s): Lili Luo; Dr. Nicholas J. Garber (Academic Advisor)
8. Performing Organization Report No.:
9. Performing Organization and Address: Center for Transportation Studies, University of Virginia, PO Box 400742, Charlottesville, VA 22904-7472
10. Work Unit No. (TRAIS):
11. Contract or Grant No.:
12. Sponsoring Agencies' Name and Address: Office of University Programs, Research Innovation and Technology Administration, US Department of Transportation, 400 Seventh Street, SW, Washington DC 20590-0001
13. Type of Report and Period Covered: Final Report
14. Sponsoring Agency Code:
15. Supplementary Notes:
16. Abstract: In recent years, attempts have been made to develop crash prediction models based on real-time detector data. Since studies in this field are primarily theoretical, improvements can be made in various aspects. It is expected that the final product of this study will be a program that integrates with the Advanced Traffic Management System so that operators of Smart Travel Centers can take action to prevent, or at least reduce the chances of, crash occurrence. At this first stage, efforts were made to identify the crash-leading patterns and the factors describing those patterns. Crashes that occurred on basic segments of interstate highways in Northern Virginia between July 1, 2003 and June 30, 2004 were obtained from police crash reports. The associated traffic conditions, as well as the normal non-crash conditions defined by the traffic parameters, were collected from the Smart Travel Lab. Three pattern recognition techniques were applied: K-means clustering, the Naïve-Bayes method and Discriminant Analysis. The overall classification error rate remained at about 50% with all three, so the crash-leading patterns could not be identified.
17. Key Words: Freeway Crash Predictions, Traffic Flow Characteristics
18. Distribution Statement: No restrictions. This document is available to the public.
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of Pages:
22. Price: N/A
In recent years, attempts have been made to develop crash prediction models based on real-time detector data. Since studies in this field are primarily theoretical, improvements can be made in various aspects. It is expected that the final product of this study will be a program that integrates with the Advanced Traffic Management System so that operators of Smart Travel Centers can take action to prevent, or at least reduce the chances of, crash occurrence. At this first stage, efforts were made to identify the crash-leading patterns and the factors describing those patterns.
Crashes that occurred on basic segments of interstate highways in Northern Virginia between July 1, 2003 and June 30, 2004 were obtained from police crash reports. The associated traffic conditions, as well as the normal non-crash conditions defined by the traffic parameters, were collected from the Smart Travel Lab.
Three pattern recognition techniques were applied: K-means clustering, the Naïve-Bayes method and Discriminant Analysis. The overall classification error rate remained at about 50% with all three, so the crash-leading patterns could not be identified. Possible reasons for the unsuccessful classification are discussed and recommendations are made for future studies.
Thanks should be given to all who have helped in this
project - I especially want to thank my advisor, Dr. Nicholas J. Garber, for
his advice and support. I also
want to thank Dr. John Miller and Lewis Woodson of Virginia Transportation
Research Council for the crash data collection. Special thanks should also be given to the Smart Travel Lab
staff, Xiaoning Lu and Guimin Zhang, who assisted in extracting the traffic
data for this project. I also want
to thank Yuan Lu for his help in data collection and his constructive
discussions.
2.1.2 Data Collection and Reduction
2.1.3 Crash and Non-Crash Traffic
Pattern Recognition
2.2.2 Data Collection and Reduction
2.2.3 Crash Traffic Pattern Recognition
2.3.2 Data Collection and Reduction
2.3.3 Crash and Non-Crash Traffic
Pattern Recognition
2.4.2 Data Collection and Reduction
2.4.3 Crash and Non-Crash Traffic
Pattern Recognition
CHAPTER 3: DATA COLLECTION AND
REDUCTION
4.1.1 Correlations of the traffic
variables
4.1.2 Distribution of the Variables
4.2 Pattern Recognition Methods
4.2.1 K-means clustering Method
5.1 Classification results based on
single variable
5.2 Classification results based on
volume mean and speed variance
5.3 Classification results based on
all variables
5.4 Classification results based on
the changes of the variables
5.5 Classification results of day
time traffic patterns
CHAPTER 6: RESULT DISCUSSION AND
RECOMMENDATION
6.1 Reasons of unsuccessful
identification of the crash leading patterns
6.2 Recommendations for future
study
APPENDIX: SUMMARY TABLES OF
CLASSIFICATION RESULTS
Figure 1 Implication of accident by traffic dynamics
Figure 2 Determination of actual
time of crash occurrence from speed profile
Figure 3 No significant changes in
the traffic flow during a crash
Figure 4 Significant changes are
shown during crash
Figure 5 Detector stop working
Figure 6 Observations of traffic
parameters at interval 45
Figure 7 Data fitting of occupancy
mean
Figure 8 Data fitting of occupancy
variance
Figure 9 Data fitting of speed mean
Figure 10 Data fitting of speed
variance
Figure 11 Data fitting of volume
mean
Figure 12 Data fitting of volume
variance
Figure 13 Input features of
combination 30_5 in the case of crash occurred at 9:00
Figure 14 Classification results
based on occupancy mean by using k-means clustering
Figure 15 Classification results
based on occupancy mean by using Naïve-Bayes
Figure 16 Classification results
based on occupancy mean by using Quadratic Discriminant Analysis
Figure 17 Classification results
based on volume mean by using k-means clustering
Figure 18 Classification results
based on volume mean by using Naïve-Bayes
Figure 19 Classification results
based on volume mean by discriminant analysis
Figure 20 Classification results
based on speed mean by k-means clustering
Figure 21 Classification results
based on speed mean by Naïve-Bayes
Figure 22 Classification results
based on speed mean by discriminant analysis
Figure 23 Classification results
based on occupancy variance by k-means clustering
Figure 24 Classification results
based on occupancy variance by Naïve-Bayes
Figure 25 Classification results
based on occupancy variance by discriminant analysis
Figure 26 Classification results
based on volume variance by k-means clustering
Figure 27 Classification results
based on volume variance by naïve-bayes
Figure 28 Classification results
based on volume variance by discriminant analysis
Figure 29 Classification results
based on speed variance by k-means clustering
Figure 30 Classification results
based on speed variance by Naïve-Bayes.
Figure 31 Classification results
based on speed variance by discriminant analysis
Figure 32 Error rate based on volume
mean & speed variance by using k-means clustering
Figure 33 Error rate based on
volume mean & speed variance by using naïve-bayes
Figure 34 Error rate based on
volume mean & speed variance by using discriminant analysis
Figure 35 Error rate based on
multiple variables by using k-means clustering
Figure 36 Error rate based on
multiple variables by using naïve-bayes
Figure 37 Error rate based on
multiple variables by using discriminant analysis
Figure 39 Classification error rate
based on the changes of the variables by using Naïve-Bayes
Figure 40 Classification error rate
based on the changes of the variables.
Figure 41 Classification error rate
of daytime pattern based on speed variance
Figure 42 Classification error rate
of daytime pattern based on speed variance
Figure 43 Classification error rate
of daytime pattern based on speed variance
Figure 44 Classification error rate
of daytime pattern based on volume mean and speed variance
Figure 45 Classification error rate
of daytime pattern based on volume mean and speed variance
Figure 46 Classification error rate
of daytime pattern based on volume mean and speed variance
Figure 47 Classification error rate
of daytime pattern based on all 6 variables
Figure 48 Classification error rate
of daytime pattern based on all 6 variables
Figure 49 Classification error rate
of daytime pattern based on all 6 variables
Table 1 Illustration of the
imputation process
Table 2 Summary of the previous
studies
Table 3 Summary of crashes on I-66
Table 4 Summary of crashes on I-95
Table 5 Summary of crashes on I-395
Table 6 Summary of crashes on I-395
HOV
Table 7 Crash Prediction Error Rate
based on visual inspection of the change in speed & occupancy
Table 8 Visual identification of
crash cases from the corresponding non-crash cases
Table 9 Crash Prediction Error Rate
based on visual inspection of
Table 10 Crash Prediction Error
Rate based on Occupancy Mean Using K-means Clustering
Table 11 Correlations of the
variables of combination 45_5 at time interval 9
Table 12 Estimated parameters of
Weibull Distribution for occupancy
Table 13 Estimated parameters of
Log-logistic Distribution for occupancy variance
Table 14 Kolmogorov-Smirnov test of
speed mean
Table 15 Estimated Parameters of
Log-logistic Distribution for speed variance
Table 16 Estimated parameters of
Log-logistic Distribution for volume mean
Table 17 Kolmogorov-Smirnov test of
Volume variance
Table 18 Covariance matrix of crash
pattern based on combination of 15_15.
Table 19 Covariance matrix of
non-crash pattern based on combination of 15_15
Every year since 2000, more than 40,000 lives have been lost in motor vehicle traffic crashes, making such crashes the eighth leading cause of death overall and the third leading cause of death for people aged 3 to 33 in the United States.[1] Highway safety may seem intangible to the public, but it is one of the most challenging issues facing transportation researchers and traffic engineers.
Efforts have been made to relate traffic flow characteristics to crash occurrence. However, crash prediction models based on annual average daily traffic (AADT), or even on hourly volume, have shown limited performance. Such highly aggregated traffic data are considered one of the factors that conceal the unique characteristics that lead to crashes.
On the other hand, studies based on so-called 'real-time' traffic data have been conducted ever since data from inductive loop detectors became available. A good example is incident detection: algorithms and models were developed to help operators of the Smart Travel Center detect an incident quickly and accurately so that rescue teams could respond as soon as possible. In this case, researchers try to improve highway safety in a reactive way.
Naturally, researchers then began to approach the crash prediction problem using 'real-time' traffic data. Only a few studies have been carried out since 2000. Most researchers claimed that traffic patterns that lead to crashes are likely to differ from patterns that do not. This is viewed as a potentially feasible approach to real-time crash prediction and prevention.
The purpose of this project is to identify the traffic patterns leading to crashes so that action can be taken to reduce the crash risk when hazardous patterns occur. The goal is to integrate the crash prediction algorithms into the Traffic Management System in the Smart Travel Center. When traffic patterns leading to crashes are detected, an alarm is issued to the operator in the Smart Travel Center. The operator can then take action to change the traffic patterns so that the probability of a crash is reduced. As the first stage of the study, this thesis focuses only on discriminating the traffic patterns leading to crashes from those not leading to crashes. Since ramps and interchanges affect traffic characteristics, only basic freeway segments were considered in this study.
Crash data for the study were extracted from the FR300 police crash reports at the Virginia Department of Transportation. The corresponding traffic data under crash and non-crash conditions were obtained from the Smart Travel Lab, which stores loop detector data from the Smart Travel Centers of both the Hampton Roads and Northern Virginia areas.
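The performance of the traffic pattern classification is measured by three classification error rates: the overall misclassification error rate, the error rate of misclassifying crash conditions as non-crash conditions and the error rate of misclassifying non-crash conditions as crash conditions. Successful pattern recognition should have low error rates for all three measurements. The overall error rate is defined in Equation 1.
$$\text{Overall error rate} = \frac{\text{number of misclassified cases}}{\text{total number of cases}} \qquad (1)$$
The error rate of misclassifying a crash traffic pattern as a non-crash traffic pattern and the error rate of misclassifying a non-crash traffic pattern as a crash traffic pattern are defined in Equations 2 and 3, respectively.
$$\text{Error rate (crash as non-crash)} = \frac{\text{number of crash cases classified as non-crash}}{\text{total number of crash cases}} \qquad (2)$$
$$\text{Error rate (non-crash as crash)} = \frac{\text{number of non-crash cases classified as crash}}{\text{total number of non-crash cases}} \qquad (3)$$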
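The three error rates described above can be computed directly from the true and predicted labels. The following is a minimal sketch, not the code used in the study; the names `ErrorRates` and `classification_error_rates` and the example labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorRates:
    overall: float            # share of all cases that were misclassified
    crash_as_noncrash: float  # share of crash cases labelled non-crash
    noncrash_as_crash: float  # share of non-crash cases labelled crash


def classification_error_rates(actual, predicted):
    """Return the three error rates for binary crash / non-crash labels.

    `actual` and `predicted` are equal-length sequences of booleans,
    where True means "crash" and False means "non-crash".
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n_crash = sum(1 for a in actual if a)
    n_noncrash = len(actual) - n_crash
    if n_crash == 0 or n_noncrash == 0:
        raise ValueError("need at least one crash and one non-crash case")
    missed = sum(1 for a, p in zip(actual, predicted) if a and not p)
    false_alarm = sum(1 for a, p in zip(actual, predicted) if not a and p)
    return ErrorRates(
        overall=(missed + false_alarm) / len(actual),
        crash_as_noncrash=missed / n_crash,
        noncrash_as_crash=false_alarm / n_noncrash,
    )


# Made-up labels: two crash cases followed by two non-crash cases.
rates = classification_error_rates(
    [True, True, False, False],
    [True, False, True, True],
)
```

Reporting all three rates matters because a classifier that labels every case as non-crash can have a modest overall error rate while missing every crash.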
The remainder of this thesis consists of the following:
Chapter 2: Literature review - a review of past studies on crash prediction models based on loop detector data
Chapter 3: Data collection and reduction - a description of the construction of the study data set
Chapter 4: Methodology - an overview of the pattern recognition techniques used in the project
Chapter 5: Analysis result - the results given by the pattern recognition techniques
Chapter 6: Result discussion and recommendations - summary of the conclusions and suggestions for future study.
Freeway crash prediction is one of the most challenging research topics in transportation study today and only a few studies based on real-time detector data have been carried out since 2000. Some prediction models were developed based on previous studies by using speed variance as the major input while others made efforts to identify the precursors before constructing the crash prediction models. A review of these previous studies is detailed below. Although the prediction models are also mentioned, the focus is given to the identification of traffic patterns leading to crashes.
The first study was conducted by Oh, Ritchie and Chang
in 2000.[2] In this
study, traffic conditions were classified into two patterns - disruptive and
normal. The disruptive traffic
condition was the one that potentially led to an accident occurrence and a
normal traffic condition was the one not involved in an accident. It was assumed that the traffic
dynamics index could represent an obvious difference between the two traffic
conditions and Figure 1 illustrates this idea.

Figure 1 Implication of accident by traffic dynamics
(Source: Oh, Ritchie and Chang (2000))
Six variables - the mean and standard deviation of occupancy, flow and speed - were considered as candidate traffic dynamics indices and formed two data sets, one for the normal condition and one for the disruptive condition. A t-test showed the standard deviation of speed to be the most statistically significant indicator of the change in traffic conditions. The probability of accident occurrence at a given standard deviation of speed was then modeled using Bayesian decision theory.
The test bed for this research was a section of I-880 in Hayward, California, about 9.2 miles long, with 4-5 lanes and a separate high occupancy vehicle (HOV) lane. The study segment contains 17 detector stations northbound and 18 southbound. Loop detectors collected real-time traffic flows, occupancies and speeds at 10-second intervals from February 16 to March 19, 1993 during two daily periods (5 am-10 am and 2 pm-7 pm). The accident profile was developed by running four probe vehicles on the study section throughout the data collection period.
A total of 91 accidents occurred during the data
collection period but only 52 were selected for analysis. The lane-specific 10-second data from
upstream stations were extracted and averaged to station level. The data were then reduced to generate
two new data sets; one for the normal traffic condition, which is a 5-minute
period 30 minutes before an accident occurrence while the other represented the
disruptive traffic condition which is the 5-minute period right before an
accident. Therefore, there are 52
groups of 10-second flow, occupancy and speed data for the 5-minute period for
each of the two data sets.
Based on the two data sets, six measurements describing the traffic pattern under each condition were examined using a t-test: the mean and variance of volume, occupancy and speed. The results indicated that the 5-minute standard deviation of the 10-second speeds was the most statistically significant measure for differentiating normal from disruptive traffic conditions, and it was therefore selected as the discriminating variable.
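The comparison described above can be sketched as follows: compute the standard deviation of the 10-second speeds within each 5-minute window, then compare the two groups of windows with a t statistic. This is an illustration only. The source does not say which variant of the t-test was used, so Welch's two-sample statistic is an assumption here, and the speed values are made up.

```python
import math
import statistics


def speed_std(speeds_10s):
    """Sample standard deviation of the 10-second speeds in one window."""
    return statistics.stdev(speeds_10s)


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return (mean_a - mean_b) / math.sqrt(
        var_a / len(sample_a) + var_b / len(sample_b)
    )


# Made-up, shortened windows (a real 5-minute window holds 30 readings).
normal_windows = [[60, 61, 59, 60], [55, 57, 55, 53], [62, 62, 61, 63]]
disruptive_windows = [[60, 50, 65, 45], [55, 40, 58, 47], [62, 48, 66, 52]]

normal_std = [speed_std(w) for w in normal_windows]
disruptive_std = [speed_std(w) for w in disruptive_windows]
t_stat = welch_t(disruptive_std, normal_std)
```

A large positive `t_stat` indicates that speed varies more in the windows just before a crash than in the windows 30 minutes earlier. The t-test also assumes approximately normal data, which would need to be checked separately.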
The study of Oh et al. laid the groundwork for accident prediction algorithms using real-time traffic data. It presents a view of a future in which the crash prediction algorithm is integrated into the Advanced Traffic Management System (ATMS) or Advanced Traveler Information System (ATIS) to create a proactive traffic operation system.
However, some limitations exist in the model:
(1) Insufficient accidents impaired the robustness of the model
According to the study report, a total of 91 crashes occurred during the data collection period but only 52 of them were used. The other 39 accidents were eliminated because they were "unmatched with real time traffic data". It is unclear whether the traffic data for these 39 accident periods were bad or simply did not show the same pattern as the other 52. In any case, 52 crashes are insufficient to ensure the reliability and robustness of the derived probability function, and it was therefore suggested that new crash and traffic data be obtained for model validation.
(2) The impact of aggregated station-level speed
In Oh's study, the lane-specific speeds were aggregated to a station-level speed. Traffic flow characteristics are believed to differ from lane to lane, so such aggregation may lose important traffic pattern information and reduce the accuracy of the proposed accident prediction model.
(3) Uncertainty of the normal condition 30 minutes before an accident
Representing the normal traffic condition by traffic data 30 minutes prior to a crash is doubtful, since the traffic 30 minutes before a crash may already be in the disruptive condition.
(4) Potential violation of the t-test assumption
Oh did not indicate whether the variables were tested for normality. If the variables do not follow a normal distribution, the use of the t-test is not appropriate.
In early 2002, Golob and his research group published their findings on the relationship between different types of freeway crashes and real-time traffic flow.[3] Based on these findings, they developed a real-time safety monitoring tool, Flow Impacts on Traffic Safety (FITS). They collected volume and occupancy data at 30-second intervals and converted them into several traffic flow regimes. Each regime corresponds to a specific pattern of crash types, determined through non-linear multivariate analyses. When the 30-second traffic flow is observed to fall in one of the traffic flow regimes, the types of crashes most likely to occur are predicted.
The study area covered six major freeway routes in Orange County, California in 1998 (a total of 130 miles, with 3 to 6 lanes in each direction). Since only single loop detectors were available in the study area, no speed data were collected.
Crash data were obtained from the Traffic Accident Surveillance and Analysis System (TASAS) database, which documents all police-reported crashes on the California highway system. A total of 1192 mainline crashes were used for the analysis: those with valid loop detector data for the full 30 minutes preceding the crash for three types of lanes (right, interior and left) at the station closest to the crash.
The 30-second volume and occupancy data were collected for the 30 minutes prior to the crash, with the crash time taken from the crash report. Noting that reported crash times are typically rounded to the nearest 5 minutes, the research team discarded the traffic data for the 2.5 minutes preceding the reported crash time to reduce the inclusion of post-crash conditions.
Golob focused only on the analysis of crash conditions. In the first step, he classified the crashes by crash type (type of collision: rear end, side swipe or hit object); crash location (left lane, interior lane, right lane, within the shoulder or off-road beyond the shoulder area); and crash severity (injuries or fatalities).
Then, using the volume and occupancy data, he constructed four groups of statistical measures for each of the left, interior and right lanes:
(1) The median of the ratio of volume to occupancy;
(2) The difference between the 90th percentile and the 50th percentile of the ratio of volume to occupancy;
(3) The mean volume over the entire 27.5-minute period preceding the accident;
(4) The standard deviation of the 30-second volumes.
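The four measures listed above can be sketched for a single lane as follows. This is an illustration under stated assumptions, not Golob's code: the function name `golob_lane_measures` is hypothetical, intervals with zero occupancy are skipped to avoid division by zero, percentiles use linear interpolation, and the example readings are made up.

```python
import statistics


def golob_lane_measures(volumes, occupancies, discard=5):
    """Four summary measures for one lane from 30-second detector records.

    `volumes` and `occupancies` cover the 30 minutes before the reported
    crash time (60 records, oldest first). The last `discard` records
    (5 x 30 s = 2.5 minutes) are dropped, leaving the 27.5-minute window.
    """
    if len(volumes) != len(occupancies):
        raise ValueError("volumes and occupancies must have the same length")
    keep = len(volumes) - discard
    vol = volumes[:keep]
    occ = occupancies[:keep]
    # Assumption: intervals with zero occupancy are left out of the ratio.
    ratios = [v / o for v, o in zip(vol, occ) if o > 0]
    # quantiles(n=10) returns the 9 decile cut points; index 8 is the 90th.
    p90 = statistics.quantiles(ratios, n=10, method="inclusive")[8]
    median = statistics.median(ratios)
    return {
        "median_ratio": median,
        "p90_minus_p50": p90 - median,
        "mean_volume": statistics.fmean(vol),
        "std_volume": statistics.stdev(vol),
    }


# Made-up example: 60 readings for one lane, oldest first.
example_volumes = list(range(1, 61))
example_occupancies = [10.0] * 60
measures = golob_lane_measures(example_volumes, example_occupancies)
```

Applying the function to the left, interior and right lanes gives the 12 variables (4 measures x 3 lanes) that enter the subsequent analysis.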
Next, Golob applied Principal Components Analysis (PCA) to all 12 variables for the crashes that occurred during daylight or dusk-dawn on dry roads in order to identify the independent variables. Simply put, PCA seeks an orthogonal transformation
$$Z = A^{T}X \qquad (4)$$
where $X$ is the vector of original variables and $A$ is an orthogonal matrix, such that the components of $Z$ are uncorrelated and each component $Z_i$ has the maximum variance among all normalized linear combinations of $X$ that are uncorrelated with $Z_1, Z_2, \ldots, Z_{i-1}$.
The central idea of this transformation is to form a series of mutually uncorrelated variables, or components, which are linear combinations of the original variables, and then to order these components by their contributions to the variance of the system input. The components dominating the variance of the inputs are then identified. It can be proved that the total variance of $X$ is the same as the total variance of $Z$.
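The variance-preservation claim can be checked in one line. As a sketch, write the transformation as $Z = A^{T}X$ with $A$ an orthogonal matrix, and let $\Sigma$ be the covariance matrix of $X$ (the symbols $A$ and $\Sigma$ are notation assumed here for illustration, not taken from the cited study):

```latex
\operatorname{Cov}(Z) = A^{T}\Sigma A,
\qquad
\sum_i \operatorname{Var}(Z_i)
  = \operatorname{tr}\!\left(A^{T}\Sigma A\right)
  = \operatorname{tr}\!\left(\Sigma A A^{T}\right)
  = \operatorname{tr}(\Sigma)
  = \sum_i \operatorname{Var}(X_i)
```

If the columns of $A$ are the eigenvectors of $\Sigma$, then $\operatorname{Cov}(Z)$ is diagonal with the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$ on its diagonal, so the share of total variance carried by the first $k$ components is $\sum_{i \le k} \lambda_i / \sum_i \lambda_i$. A statement that a set of leading components accounts for a given percentage of the variance refers to this ratio.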
In the PCA, the first six components were found to account for 86.8% of the variance and were explained by the following original variables:
a. The median of the ratio of volume to occupancy for the interior lane;
b. The difference between the 90th percentile and the 50th percentile of the ratio of volume to occupancy for the interior lane;
c. The difference between the 90th percentile and the 50th percentile of the ratio of volume to occupancy for the right lane;
d. The mean volume over the entire 27.5-minute period preceding the accident for the left lane;
e. The standard deviation of the 30-second volumes for the interior lane;
f. The standard deviation of the 30-second volumes for the right lane.
With these six traffic variables, Golob performed K-means clustering analysis combined with Non-linear Canonical Correlation Analysis (NLCCA) to separate the observations into the optimal number of traffic flow clusters, that is, the number that best correlates with the different types of crashes. It was found that eight traffic flow clusters best correspond to specific types of crashes during daylight on dry roads.
By applying non-parametric discriminant analysis (K-means clustering) and NLCCA to the traffic volume variables preceding the crash and to the crash classes, Golob identified the correspondence between traffic flow regimes and different types of crashes during daylight on dry roads. However, the results are difficult to implement in the field, since they only indicate that one type of crash is more likely than others under certain traffic conditions. Without considering the normal non-crash condition, the operator of a Smart Travel Center cannot tell whether a crash will occur.
In 2002, Lee, Saccomanno and Hellinga developed a log-linear model to estimate crash frequency from the speed variation along a lane, the speed variation across lanes and the traffic density, for given roadway geometry, weather and time of day.[4] In a follow-up study one year later, the research team found the impact of the speed variation across lanes to be insignificant.[5] As a result, this variable was removed from the model and one more traffic factor, the traffic queue, was added.
The data used for model calibration were collected from a 10 km section of the Gardiner Expressway in Toronto, Canada, with 3 or 4 lanes. The study section had 38 loop detector stations (19 located upstream of ramps and 19 upstream of straight sections). The real-time loop detector data were collected over a 13-month period from January 1998 to January 1999, for 24 hours each weekday. Speed, volume and occupancy were polled every 20 seconds. The incident logs maintained by the Traffic Control Center were used to create a crash profile.
A total of 234 crashes were used for the analysis. To verify the actual time of each crash, a visual analysis of the speed profile was performed for all 234 crashes. As illustrated in Figure 2, a sudden drop in speed in all three lanes at the upstream detectors indicates the actual start time of a crash. With the actual time of crash occurrence, the speed in the five minutes prior to the crash and the density at the time of the crash could be extracted. This step is critical in developing crash prediction models. In most cases, the recorded crash time is not exactly the actual crash time, and failure to determine the actual time leads to the inadvertent inclusion of traffic patterns from after the crash.

Figure 2 Determination of actual time of crash occurrence from speed profile
(Source: Lee, Saccomanno and Hellinga (2003))
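The study determined the crash time by visual inspection of the speed profile. As a rough automated stand-in for that inspection, the sketch below flags the first polling interval at which the speed drops sharply in every lane. The function name `find_crash_onset`, the 20 km/h threshold and the speed values are hypothetical choices for illustration, not taken from the cited study.

```python
def find_crash_onset(lane_speeds, drop_kmh=20.0):
    """Index of the first interval where every lane's speed falls sharply.

    `lane_speeds` is a list of per-lane speed series (km/h) from the
    upstream detector station, all of the same length. An interval is
    flagged when each lane's speed is at least `drop_kmh` lower than in
    the previous interval. Returns None if no such interval exists.
    """
    n = len(lane_speeds[0])
    if any(len(series) != n for series in lane_speeds):
        raise ValueError("all lanes must have the same number of readings")
    for t in range(1, n):
        if all(series[t - 1] - series[t] >= drop_kmh for series in lane_speeds):
            return t
    return None


# Made-up speeds for three lanes, one reading per polling interval.
lanes = [
    [95, 96, 94, 60, 30],
    [100, 99, 98, 70, 35],
    [90, 91, 92, 65, 40],
]
onset = find_crash_onset(lanes)
```

With a 20-second polling interval, the returned index multiplied by 20 gives the offset in seconds from the start of the series. Requiring the drop in all lanes follows the description above and avoids flagging a slowdown confined to a single lane.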
Unlike Oh's study, which applied the t-test to all six traffic dynamics indicators to determine the most appropriate one, Lee's study did not use a clear process to identify the traffic indicators that differentiate between crash and non-crash conditions. Although the authors stated that two basic principles were used to determine the traffic indicators (first, the indicators should have significantly different values for crash and non-crash cases; second, the most appropriate observation time slice is the one that maximizes the difference in indicator values between crash and non-crash cases), no experimental methodology or results were documented in the report. In the 2003 model, the authors updated the effect of the traffic factors by generating three traffic indicators: the Coefficient of Variation of Speed along the lane (CVS1), the traffic queue at the time of the crash, defined as the absolute difference in speed between the upstream and downstream ends of a road section, and the average station-level density at the instant the crash occurred (D). They are expressed in Equations 5, 6 and 7:
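$$CVS_1 = \frac{1}{n}\sum_{i=1}^{n}\frac{(\sigma_s)_i}{\bar{s}_i}, \qquad \bar{s}_i = \frac{\Delta t_0}{\Delta t}\sum_{t=t^{*}-\Delta t}^{t^{*}} s_i(t), \qquad (\sigma_s)_i = \sqrt{\frac{\sum_{t=t^{*}-\Delta t}^{t^{*}}\left(s_i(t)-\bar{s}_i\right)^{2}}{\Delta t/\Delta t_0 - 1}} \qquad (5)$$
(Source: Lee, Saccomanno and Hellinga (2003))
where:
$t^{*}$ = actual time of crash
$\Delta t$ = observation time period preceding the crash (seconds)
$(\sigma_s)_i$ = standard deviation of speed on lane $i$ computed over period $\Delta t$ (km/hour)
$\bar{s}_i$ = average speed on lane $i$ computed over period $\Delta t$ (km/hour)
$\Delta t_0$ = polling interval of loop detectors (seconds)
$s_i(t)$ = speed on lane $i$ at time $t$ (km/hour)
$n$ = total number of lanes
$$D = \frac{1}{n}\sum_{i=1}^{n}\frac{v_i(t^{*})}{s_i(t^{*})} \qquad (6)$$
(Source: Lee, Saccomanno and Hellinga (2003))
where:
$v_i(t^{*})$ = volume on lane $i$ at the time of crash (vehicle/hour)
$s_i(t^{*})$ = speed on lane $i$ at the time of crash (km/hour)
$$Q = \left|\bar{s}_u - \bar{s}_d\right|, \qquad \bar{s}_u = \frac{\Delta t_0}{\Delta t}\sum_{t=t^{*}-\Delta t}^{t^{*}} s_u(t), \qquad \bar{s}_d = \frac{\Delta t_0}{\Delta t}\sum_{t=t^{*}-\Delta t}^{t^{*}} s_d(t) \qquad (7)$$
(Source: Lee, Saccomanno and Hellinga (2003))
where:
$Q$ = average speed difference between upstream and downstream locations (km/hour)
$\bar{s}_u, \bar{s}_d$ = average speeds computed over period $\Delta t$ upstream and downstream of a location respectively (km/hour)
$\Delta t_0$ = time interval of observation of speed (seconds)
$\Delta t$ = observation time slice duration (seconds)
$t^{*}$ = time of crash occurrence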
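The three indicators defined above can be sketched from per-lane detector readings. This is a minimal illustration of the verbal definitions (coefficient of variation of speed averaged over lanes, lane-averaged ratio of volume to speed, and absolute upstream-downstream speed difference). The function names and example numbers are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics


def cvs(lane_speeds):
    """Coefficient of variation of speed, averaged over lanes.

    `lane_speeds[i]` holds the speeds (km/h) polled on lane i during the
    observation period that ends at the crash time.
    """
    return statistics.fmean(
        [statistics.stdev(s) / statistics.fmean(s) for s in lane_speeds]
    )


def density(volumes, speeds):
    """Lane-averaged volume / speed at the crash instant (vehicles/km).

    `volumes` are per-lane flows in vehicles/hour and `speeds` are the
    matching per-lane speeds in km/h.
    """
    return statistics.fmean([v / s for v, s in zip(volumes, speeds)])


def queue_indicator(upstream_speeds, downstream_speeds):
    """Absolute difference of mean upstream and downstream speed (km/h)."""
    return abs(
        statistics.fmean(upstream_speeds) - statistics.fmean(downstream_speeds)
    )


# Made-up readings for a two-lane section.
example_cvs = cvs([[100, 90, 110], [80, 80, 80]])
example_density = density([1800, 1200], [90, 60])
example_queue = queue_indicator([100, 90], [60, 50])
```

Note that `density` needs the volume and speed at the crash instant itself, so it can only be evaluated after the fact.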
Compared to Oh's study, Lee's study was based on more crash samples and a longer data collection period. By identifying the actual crash time, he reduced the error of including traffic characteristics from after the crash. His effort mostly focused on developing the crash prediction model using the log-linear method. Lee used the term 'crash precursor' for the inputs of his prediction model, but he selected these crash precursors simply on the basis of previous studies. A more detailed procedure for identifying traffic characteristics is needed before constructing the crash prediction model. The established model takes a comprehensive set of crash-causing factors into account: three traffic factors and three external factors (roadway geometry, weather and time of day), which suggests the need to control for environmental factors. Lee's model also requires the density at the moment of the crash, which is not available in real-time prediction and limits its application in the Smart Travel Center.
Abdel-Aty and his research team have carried out a series
of crash prediction modeling studies since 2004. Unlike the previous researchers, he conducted a
series of crash and non-crash traffic pattern classification analyses before
developing crash prediction models.
The test bed for this research was a divided
13.25-mile stretch of I-4 in Orlando, Florida, where 28 dual loop detectors are
located. Traffic data including
volume, speed and occupancy (30-second polling interval) between April 1, 1999
and November 30, 1999 were available.
A total of 377 crashes with corresponding traffic data were used for the
analysis. A crash profile was built
from police crash reports.
Instead of extracting data only from the station nearest to
the crash, Abdel-Aty also obtained data from six other
stations, five upstream and one downstream, for the 30 minutes before the
recorded crash time. Each
30-minute record was divided into six 5-minute intervals. In each interval, six variables were considered: the mean
and variance of speed, occupancy and volume. For the seven stations included in the study, a total of 252
traffic input features were constructed (7 stations x 6 intervals x 6 variables).
Meanwhile, traffic data for the corresponding non-crash conditions were
also extracted, where corresponding
means the same location, the same time of day and the same day of the week. External factors such as
roadway geometry and time of day were therefore controlled.
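
The feature construction described above can be sketched as follows. This is a minimal illustration, assuming 30-second samples (ten per 5-minute interval) stored in an array indexed by station, sample and variable. The function name `build_features` and the array layout are hypothetical and not taken from Abdel-Aty's studies.

```python
import numpy as np


def build_features(data, samples_per_interval=10):
    """Mean and variance per station, 5-minute interval and variable.

    data: array of shape (n_stations, n_samples, n_variables), e.g.
    (7, 60, 3) for 7 stations, 30 minutes of 30-second samples and the
    variables speed, occupancy and volume (layout is an assumption).
    """
    data = np.asarray(data, dtype=float)
    n_stations, n_samples, n_vars = data.shape
    n_intervals = n_samples // samples_per_interval
    usable = data[:, : n_intervals * samples_per_interval]
    blocks = usable.reshape(n_stations, n_intervals, samples_per_interval, n_vars)
    means = blocks.mean(axis=2)
    variances = blocks.var(axis=2)
    # Per (station, interval): [means of each variable, variances of each variable].
    return np.concatenate([means, variances], axis=2).ravel()
```

With seven stations, 60 samples and three variables, the returned vector has 7 x 6 x 6 = 252 entries, matching the count given in the text.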
Detector data were aggregated to the station level
for analysis. Since not all of the 377
crashes had good raw data at the seven stations, Abdel-Aty imputed the missing
data from the traffic data at the adjacent upstream station 5
minutes before, or at the downstream station 5 minutes after, the current time
period of the studied station. The
imputation process is shown in Table 1.
Table 1 Illustration of the imputation process

Before imputation:

| | 8:30-8:35 | 8:35-8:40 | 8:40-8:45 | 8:45-8:50 | 8:50-8:55 | 8:55-9:00 |
|---|---|---|---|---|---|---|
| station D | 30 | 25 | 21 | 27 | 25 | 34 |
| station E | 21 | 31 | Missed | 26 | 30 | 33 |
| station F | 37 | 32 | 20 | 29 | 33 | 27 |

After imputation:

| | 8:30-8:35 | 8:35-8:40 | 8:40-8:45 | 8:45-8:50 | 8:50-8:55 | 8:55-9:00 |
|---|---|---|---|---|---|---|
| station D | 30 | 25 | 21 | 27 | 25 | 34 |
| station E | 21 | 31 | Imputed as 25 | 26 | 30 | 33 |
| station F | 37 | 32 | 20 | 29 | 33 | 27 |

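
The imputation rule illustrated in Table 1 can be sketched in a few lines. This is a minimal illustration, assuming that the upstream value from the previous interval is tried first and the downstream value from the next interval is used only as a fallback; the source does not state the order of preference. The function name `impute_missing` and the dictionary layout are hypothetical.

```python
def impute_missing(table, stations):
    """Fill None entries using the rule described for Table 1.

    table: dict mapping station name -> list of 5-minute values (None = missing).
    stations: station names ordered from upstream to downstream.
    """
    filled = {name: list(values) for name, values in table.items()}
    for k, name in enumerate(stations):
        for j, value in enumerate(table[name]):
            if value is not None:
                continue
            candidate = None
            # Adjacent upstream station, one interval (5 minutes) earlier.
            if k > 0 and j > 0:
                candidate = table[stations[k - 1]][j - 1]
            # Fallback (assumed): adjacent downstream station, one interval later.
            if candidate is None and k + 1 < len(stations) and j + 1 < len(table[name]):
                candidate = table[stations[k + 1]][j + 1]
            filled[name][j] = candidate
    return filled


table_1 = {
    "D": [30, 25, 21, 27, 25, 34],
    "E": [21, 31, None, 26, 30, 33],
    "F": [37, 32, 20, 29, 33, 27],
}
result = impute_missing(table_1, ["D", "E", "F"])
print(result["E"][2])  # 25, the value of station D at 8:35-8:40
```
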
In his first study, Abdel-Aty used the Probabilistic Neural
Network (PNN) method to classify traffic patterns into non-crash and
crash classes. PNN is a distance-based
classifier for identifying patterns.
In the classical PNN method, the Euclidean distance is the measure
used to classify the inputs.
Because the elements of the input vector are not independent, direct
application of the Euclidean distance is questionable. Abdel-Aty therefore used Principal
Component Analysis to transform the original input vectors into a new set of
vectors whose
elements are uncorrelated.
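
The combination of Principal Component Analysis and a PNN can be sketched as below. This is a minimal sketch of a generic PNN (a Gaussian kernel averaged per class) applied to PCA scores, not a reproduction of Abdel-Aty's model. The function names and the `spread` parameter name are assumptions for illustration.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the column means and the leading principal directions of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the principal directions; the scores are uncorrelated."""
    return (np.asarray(X, dtype=float) - mean) @ components.T


def pnn_predict(X_train, y_train, X_new, spread=1.0):
    """Classify each row of X_new by the class with the largest mean kernel."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    # Squared Euclidean distance between every new and every training pattern.
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * spread ** 2))
    scores = np.stack(
        [kernel[:, y_train == c].mean(axis=1) for c in classes], axis=1
    )
    return classes[scores.argmax(axis=1)]
```

As `spread` shrinks, the kernel of the closest training pattern dominates each class score, so the classifier behaves like a nearest-neighbor rule.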
To identify the most significant factors as inputs to
the PNN model, the author used matched case-control logistic regression as
follows:
$$\ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x \qquad (8)$$
where:
$\pi(x)$ = conditional
mean of the dummy variable Y representing crash occurrence
x =
one of the traffic variables, such as average volume or speed
He then defined the hazard ratio as the exponential of
β1 to indicate the risk of a crash.[8] If the hazard ratio is high, the
probability of having a crash is also high. Average volume, occupancy and the logarithm of the coefficient of variation
for each time interval and each station were fed into the matched case-control
logistic regression model. It was shown
that the logarithm of the coefficient of variation (standard deviation / average) of
speed at two consecutive upstream stations plus the station where the crash occurred
dominates the pattern changes. For
the purpose of prediction, he chose the logarithm of the coefficient of variation of
speed within the period 10-15 minutes before the crash as the input of the
PNN model.
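
The two quantities used in this step can be computed as shown in the short sketch below. It assumes the coefficient of variation uses the population standard deviation and that the hazard ratio is simply the exponential of the fitted slope. Both function names are hypothetical, and the slope itself would come from a fitted matched case-control model that is not reproduced here.

```python
import math
import statistics


def log_cv(speeds):
    """Logarithm of the coefficient of variation (std / mean) of speed."""
    mean = statistics.fmean(speeds)
    std = statistics.pstdev(speeds)  # population std is an assumption
    return math.log(std / mean)  # raises ValueError if the speeds are constant


def hazard_ratio(beta1):
    """Hazard ratio for a one-unit increase in the traffic variable."""
    return math.exp(beta1)
```

A slope of zero gives a hazard ratio of 1, meaning the variable does not change the crash risk; positive slopes give ratios above 1.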
In the first study, Abdel-Aty not only evaluated the
accuracy of crash prediction but also estimated the false alarm rate. However, the results of the classical
PNN method (without Principal Component Analysis) did not differ
significantly from those of the modified PNN method: the classical PNN performed better
in terms of overall error rate but worse with respect
to crash identification accuracy.[6] He also noted in the study that a small spread value
makes the PNN act as a nearest-neighbor classifier.
In the second study, Abdel-Aty developed Generalized
Estimating Equations (GEE) for crash prediction using the same data set as
in the previous study.[7]
A different technique was used to identify the variables that have
significant effects on crash occurrence.
Because of the high correlation between occupancy and speed, he
eliminated occupancy from the data set. As a result, the 252 traffic input features were reduced to
168 (7 stations x 6 intervals x 4 variables). He also applied one-to-one
matching of non-crash and crash cases, which gives the same number of
non-crash cases as crash cases.
Furthermore, eight geometric factors were considered in this study. He then used stepwise logistic
regression analysis to eliminate the factors without a significant impact on
crashes.
The general idea of forward stepwise regression is to
begin with no model terms. At
each step, it adds the most statistically significant remaining term, that is, the one with the highest
F statistic, until no significant terms are left.[9] It was shown that the standard
deviation of volume at station D within 15 minutes prior to the crash, the
standard deviation of speed at station F within 15 minutes prior to the crash,
the standard deviation of speed at station E within 5 minutes prior to the crash
and the average speed at station E within 5 minutes prior to the crash were the
most significant traffic characteristics for discriminating traffic patterns
leading to crashes from patterns that do not lead to crashes. Here station F was the nearest station
to the crash, and stations E and D were the two adjacent stations upstream of
station F, each about 0.5
miles from the next. With
these traffic factors and the other three geometric factors identified, the
input vectors for the GEE model were ready.
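
Forward selection by F statistic can be sketched as below. For simplicity, this sketch swaps in ordinary least squares with a partial F-to-enter threshold in place of the stepwise logistic regression used in the study. The function names and the default threshold `f_to_enter=4.0` are assumptions for illustration only.

```python
import numpy as np


def _rss(design, y):
    """Residual sum of squares of a least-squares fit of y on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    return float(residual @ residual)


def forward_stepwise(X, y, f_to_enter=4.0):
    """Return the indices of the selected columns of X, in order of entry."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    selected = []
    design = np.ones((n, 1))  # start from the intercept-only model
    current_rss = _rss(design, y)
    while len(selected) < p:
        best = None
        for j in range(p):
            if j in selected:
                continue
            candidate = np.column_stack([design, X[:, j]])
            df = n - candidate.shape[1]
            if df <= 0:
                continue
            new_rss = _rss(candidate, y)
            # Partial F statistic for adding column j to the current model.
            f = np.inf if new_rss <= 0 else (current_rss - new_rss) / (new_rss / df)
            if best is None or f > best[0]:
                best = (f, j, new_rss)
        if best is None or best[0] < f_to_enter:
            break  # no remaining term is significant enough to enter
        selected.append(best[1])
        design = np.column_stack([design, X[:, best[1]]])
        current_rss = best[2]
    return selected
```
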
GEE is an extension of generalized linear regression
methods. The central idea is to
take the correlation within a cluster into account in model development. Unlike the previous studies,
Abdel-Aty viewed traffic data collected at the same location as correlated. He therefore treated the 56 stations as 56
clusters to develop the Generalized Estimating Equations. Three correlation structures
(independent, exchangeable and autoregressive) were tested and compared
in the model. The results showed
that both the exchangeable and autoregressive structures performed better than
the independent structure.[7]
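
The three working correlation structures differ only in how observations within a cluster are assumed to be related. The sketch below builds each matrix for a cluster of a given size. The function name and the single correlation parameter `alpha` are assumptions for illustration, and the autoregressive case is taken to be first order.

```python
import numpy as np


def working_correlation(structure, size, alpha=0.0):
    """Working correlation matrix for one cluster of `size` observations."""
    alpha = float(alpha)
    if structure == "independent":
        # No correlation between observations in the same cluster.
        return np.eye(size)
    if structure == "exchangeable":
        # Every pair of observations shares the same correlation alpha.
        matrix = np.full((size, size), alpha)
        np.fill_diagonal(matrix, 1.0)
        return matrix
    if structure == "autoregressive":
        # Correlation decays as alpha**lag (first order, assumed).
        idx = np.arange(size)
        return alpha ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(f"unknown structure: {structure}")
```
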
He concluded that an increase in the speed
variance over 15 minutes at a given location increases the crash probability, that low variability in volume increases
the crash probability 1 mile downstream,
and that low average speed at a given location increases the crash probability 0.5
miles downstream. These conclusions
may hold when the traffic flow is in a congested regime. However, explaining crash
probability by a single variable, such as low variability in volume or low
average speed, is doubtful.
Building on the experience and lessons of the earlier researchers,
Abdel-Aty recognized the importance of identifying traffic patterns leading to
crashes and performed a series of studies to explore the
difference between crash and non-crash patterns. He expanded the spatial scope
to seven stations, which captures information about the spatial variation in the
traffic pattern. The PNN-based
classifier and the forward stepwise regression method used in his studies
provided innovative research methods for later studies, but his attempt to
explain the cause of crashes by individual traffic parameters makes the
conclusions questionable. Since
traffic patterns on the freeway are mainly described by the mean and variance of
volume, occupancy and speed, the traffic pattern leading to a crash is also
described by these variables.
It would therefore be more appropriate to use the joint effects of
traffic variables, rather than a single variable, to predict the crash
probability. Consequently, applying
the results of Abdel-Aty's study may lead to an unnecessarily high false alarm rate and
noise.
In this chapter, the studies of four representative researchers
in the field of crash prediction modeling using real-time loop detector data were
reviewed. Among them, Oh and
Abdel-Aty conducted crash and non-crash traffic pattern recognition
analysis before developing the prediction model. Lee's study did not show a clear procedure in this respect, and
Golob based his research only on crash conditions. Data collection and reduction for each
of the studies were reviewed because they significantly affect the construction of input
features. The critical
issue in each study was discussed at the end of the section on that
study. Although some efforts were
made to identify traffic patterns leading to crashes, implementing the
findings of these studies in the field is difficult because of
unrealistic data requirements or a lack of consideration of the joint effects of the
traffic variables. Table 2
summarizes the key factors of the previous studies:
Table 2 Summary of Previous Studies

| Researcher | Crash occurrence time | Crash data | Non-crash traffic data | Methods to determine variables that separate two conditions | Variables used in prediction model | Data level in analysis | Aggregation level | Data polling interval | Sample size |
|---|---|---|---|---|---|---|---|---|---|
| Oh | Visually determined | 5-minute period just before crash | 5-minute period 30 minutes prior to crash | t-test | Standard deviation of speed | Station level | 5 minutes | 10 seconds | 52 |
| Golob | Removed the last 2.5 minutes | 30-minute period before crash | N/A | N/A | 1. Standard deviation of volume 2. Mean volume 3. Volume to occupancy ratio | Detector level | 30 seconds | 30 seconds | 1192 |
| Lee | Visually determined | 5-minute period before crash | Other 5-minute periods of the same data | No detailed process | 1. Standard deviation of speed 2. Speed difference between upstream and downstream station 3. Density | Detector level | 5 minutes | 20 seconds | 234 |
| Abdel-Aty | Based on crash report | 15-minute period just before crash | The same time of day, day of week and station | PNN, stepwise logistic regression | 1. Speed variance 2. Volume variance 3. Mean speed | Station level | 5 minutes | 30 seconds | 377 |

In order to address the limitations of the previous
studies, up to 75 combinations of time period and time interval were studied
in this report to find an appropriate combination for prediction. Furthermore, the joint effects of two
or more traffic variables were also studied to determine traffic patterns
leading to a crash.
The construction of the study data set is the
fundamental task before performing any analysis, and the quality of the data set has a significant impact on
the analysis results. Thus, a
great amount of effort was spent on verifying the validity of the traffic data
as well as extracting data from different sources.
The first attempt was to extract both crash and
traffic data from the Smart Travel Lab (STL) for the Hampton Roads area, because
over five years of records were available for a 13-mile stretch of
Interstates 64, 264 and 564. Over
2000 crashes with occurrence time, date and approximate location were extracted
from the STL database for the years 1999, 2000 and 2001. The corresponding traffic data for 45
minutes before and 45 minutes after each crash were also retrieved. However, plotting the traffic
data at the station level showed that only about 30% of the data could be considered
usable. Furthermore, the
1-minute interval traffic data for the Hampton Roads area had been obtained by
dividing the 2-minute aggregated data by two, which essentially modified the
original traffic data. Even worse,
in most cases (although not all), speeds over 65 miles per hour were truncated and
recorded as 65 miles per hour in this database. Because of these data quality issues in the Hampton Roads area, we
turned our attention to the Northern Virginia area and were able to obtain the
required data for the study.
Before the end of 2004, no crash data from Northern
Virginia were stored in the Smart Travel Lab, so other sources had to be found to obtain crash
information. In this study, FR 300
crash reports were obtained from the Virginia Transportation Research Council and
manually tabulated into Excel files.
The information included the crash number, route, direction, date, day
of week, recorded occurrence time, jurisdiction, occurrence location in terms
of mile marker and description, number of vehicles involved in the crash, crash
type, roadway alignment, weather, roadway surface condition, roadway defect,
lighting condition, number of injured, injury type, number of fatalities, and
vehicle maneuver. The total number
of lanes at the crash location and the lane in which the crash
occurred were also extracted by reading the crash diagram. Although not all of the information was
used in this study, further studies can benefit from the data collection
effort. Due to the time
constraint, only one year of crash data, from July 1, 2003 to June 30, 2004, was
obtained, containing a total of 2908 crash records from both directions of
I-95, I-66 and I-395 on a stretch of 50 miles. Since no detectors were installed on the shoulder, crashes that
occurred on the shoulder while it was open to
traffic were eliminated from the data set, which left 2865 crashes.
Tables 3 through 6 summarize the crashes. Because the traffic characteristics
near interchanges and ramps are quite different from those on freeway
basic segments, only the crashes on freeway basic segments were
considered in this study. Freeway
basic segments are the sections outside the following influence areas: from 500 ft upstream to 2500 ft downstream
of an on-ramp, from 2500 ft upstream to 500 ft downstream of an off-ramp, and from 500 ft
upstream of the merge point marking the beginning of a weaving area to 500
ft downstream of the diverge point marking its end (HCM
1985).
Table 3 Summary of Crashes on I-66

| I-66 | EB | WB | Subtotal | Fraction of total |
|---|---|---|---|---|
| Basic segment | 437 | 490 | 927 | 0.609868 |
| Ramp | 293 | 300 | 593 | 0.390132 |
| Subtotal | 730 | 790 | | |
| Total | | | 1520 | |

Table 4 Summary of Crashes on I-95

| I-95 | NB | SB | Subtotal | Fraction of total |
|---|---|---|---|---|
| Basic segment | 399 | 350 | 749 | 0.827624 |
| Ramp | 64 | 92 | 156 | 0.172376 |
| Subtotal | 463 | 442 | | |
| Total | | | 905 | |

Table 5 Summary of Crashes on I-395

| I-395 | NB | SB | Subtotal | Fraction of total |
|---|---|---|---|---|
| Basic segment | 70 | 54 | 124 | 0.300242 |
| Ramp | 139 | 150 | 289 | 0.699758 |
| Subtotal | 209 | 204 | | |
| Total | | | 413 | |

Table 6 Summary of Crashes on I-395 HOV

| Southbound HOV | Reversible HOV | Northbound HOV | Total |
|---|---|---|---|
| 3 | 17 | 7 | 27 |

Because of the time constraint, only traffic data from
the closest upstream station of each crash were extracted. This is based on the assumption that
the closest upstream station records the traffic pattern right
before the occurrence of the crash.
Strictly speaking, this is not exact, because the crash
location is always some distance from the station. However, since the stations in the
Northern Virginia area are densely distributed (at least one station within one
mile, and in most cases 0.5 miles apart), the difference is relatively small
and can be neglected. A total of
97 main line stations on the study freeway network were identified. The number of detectors at a station varies from 2
to 5 according to the total number of lanes at the location. The detectors on the main line
freeway in Northern Virginia are all double loop detectors, which provide
accurate speed readings.
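
The closest-upstream-station assumption amounts to a lookup by mile marker. The sketch below is a minimal illustration, assuming the station mile markers are sorted in ascending order and that the direction of travel is known. The function name and the parameters `increasing` and `max_gap` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def closest_upstream_station(crash_mm, station_mms, increasing=True, max_gap=None):
    """Return the mile marker of the closest upstream station, or None.

    station_mms: station mile markers sorted in ascending order.
    increasing: True if traffic travels toward increasing mile markers.
    max_gap: optional largest accepted crash-to-station distance (miles).
    """
    if increasing:
        # Upstream means a smaller (or equal) mile marker.
        i = bisect_right(station_mms, crash_mm) - 1
        if i < 0:
            return None
    else:
        # Upstream means a larger (or equal) mile marker.
        i = bisect_left(station_mms, crash_mm)
        if i == len(station_mms):
            return None
    station = station_mms[i]
    if max_gap is not None and abs(crash_mm - station) > max_gap:
        return None  # crash is too far from the station to be usable
    return station
```

A `max_gap` value can be used to drop crashes whose nearest upstream station is too far away to represent the conditions at the crash site.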
The first step in retrieving the traffic data was to
locate the corresponding upstream station for each crash by matching the crash
mile marker with the station mile markers.
In this process, some of the crashes were found to have occurred too far
from the last station of the study area (over 5 miles) and were therefore
removed from the data set. After
this process, 1172 crashes were left.
Second, the station-level traffic data were plotted to determine the
actual crash occurrence time and to eliminate the crashes for which the detector
did not function during the crash occurrence period. Speed and occupancy over 1.5 hours (45 minutes before and 45
minutes after the crash) were plotted for each case. Figures 3 through 5 illustrate the plotted results. The x-axis is the time in minutes from
the beginning of the day. The blue
lines are the plots of occupancy and the red lines are the plots of
speed.




Figure 3 No significant changes in traffic flow during crash
Note: recorded crash time is 15:00 which is 900
minutes away from the beginning of the day.




Figure 4 Significant changes are shown during crash
Note: Recorded Crash Time is 6:20, which is 380
minutes away from the beginning of the day. The identified actual crash time is 385 minutes away from
the beginning of the day, which is 6:25 in the morning.




Figure 5 Detector stopped working
The reasons why some crashes produced no significant change in the traffic
parameters are not clear. It may result from the characteristics of both the traffic flow
and the crash. For instance, when
the volume on the freeway is low, a minor crash may not have a serious
impact on the traffic flow,
so no significant change can be observed in the traffic
parameters.
After removing the cases where the detector was not
working, only 586 crashes were left. For about 50% of these, the actual crash time could
not be identified from the plot (as in Figure 3), so the recorded crash time was
taken as the actual crash time.
For the remaining 50% of crashes, the actual crash times were identified from the plots
and re-entered into the query to extract the corresponding traffic data.
Next, the detector-level 1-minute interval traffic
data for all lanes of the corresponding station, 45 minutes before and 45
minutes after the corrected crash time, were obtained from the database. Although only the data for the 45 minutes before
each crash are used in this study, the data after the crashes can be applied
to later research.
Previous studies[4][5] found that traffic on the adjacent lanes has an insignificant effect on crash occurrence. Thus, only traffic data from the lane in which the crash occurred were used in this study.
Once the traffic data for the 45 minutes before each crash were
obtained, the next step was to extract the corresponding non-crash traffic
data. These data, from the same
station during the same time period for the whole year, were extracted from the
database. A MATLAB program was
written to filter both the crash and non-crash traffic data. The purpose of the filtering was to
remove the records with missing data points during the 45-minute study period
and to eliminate abnormal data.
The abnormal data include:
1. Two out of three traffic parameters are zero when
the other is not;
2. One of the three traffic parameters is zero when
the other two are not;
3. Occupancy greater than 100;
4. Any of the traffic parameters less than zero.
After this step, 446 crashes were left. In some cases, crashes were removed
from the study data set because of the lack of good-quality non-crash data.
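
The four filtering rules can be expressed compactly. The sketch below is written in Python for illustration and is not the MATLAB program used in the study. The function names, the tuple layout (volume, speed, occupancy) and the choice to accept an all-zero reading as valid (no traffic) are assumptions.

```python
def is_abnormal(volume, speed, occupancy):
    """Apply the four abnormal-data rules to one 1-minute reading."""
    values = (volume, speed, occupancy)
    if any(v is None for v in values):
        return True  # missing data point
    if any(v < 0 for v in values):
        return True  # rule 4: negative parameter
    if occupancy > 100:
        return True  # rule 3: occupancy above 100 percent
    zeros = sum(v == 0 for v in values)
    # Rules 1 and 2: some, but not all, of the parameters are zero.
    return zeros in (1, 2)


def record_is_usable(readings, expected_length=45):
    """A record is usable if it is complete and has no abnormal reading."""
    if len(readings) != expected_length:
        return False
    return not any(is_abnormal(*reading) for reading in readings)
```
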
To ensure that the non-crash traffic condition was similar
to the crash condition, the non-crash data were randomly selected from the same
day of the week as the crash data.
As a result, the same number of non-crash cases as crash cases was
obtained. However, a close
scan of the 446 pairs of traffic data revealed another form of abnormal
data that should be excluded from the data set: records in which the readings of the traffic
parameters remained the same for all 45 1-minute intervals.
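
Both steps in this paragraph are simple to sketch: detecting a record whose readings never change, and randomly drawing a non-crash day with the same day of the week. The function names and data layout below are hypothetical, and excluding the crash date itself from the candidates is an assumption.

```python
import random
from datetime import date


def is_stuck(readings):
    """True if every reading in the record equals the first one."""
    return len(readings) > 1 and all(r == readings[0] for r in readings)


def pick_non_crash_day(crash_date, candidate_dates, rng):
    """Randomly pick a candidate date on the same weekday as the crash."""
    same_weekday = [
        d for d in candidate_dates
        if d != crash_date and d.weekday() == crash_date.weekday()
    ]
    return rng.choice(same_weekday) if same_weekday else None


rng = random.Random(0)
crash_day = date(2003, 7, 1)
candidates = [date(2003, 7, 8), date(2003, 7, 9), date(2003, 7, 15)]
match = pick_non_crash_day(crash_day, candidates, rng)
```
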
Finally, traffic data for 391 crash cases and the same
number of non-crash cases were obtained.
Among these 391 crash cases, 123 showed a visual change in either
occupancy or speed after the crash occurred. The result is summarized in Table 7:
Table 7 Crash Prediction Error Rate Based on Visual Inspection of the Change in Speed and Occupancy

| Scenario | After the crash |
|---|---|
| There is a visual change in speed and occupancy (Figure 4) | 123 crashes |
| There is no visual change in speed and occupancy (Figure 3) | 268 crashes |
| Mean error rate of determining whether it is a crash based on visual inspection of the change in speed and occupancy after the crash occurred | 68.5% |
| Total | 391 crashes |

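
As a check on Table 7, and assuming the mean error rate is the share of crashes that show no visible change (an interpretation, since the source does not define it explicitly), the arithmetic is:

```latex
\[
\text{error rate} = \frac{268}{391} \approx 0.685 = 68.5\%,
\qquad
\text{visible change} = \frac{123}{391} \approx 0.315 = 31.5\%
\]
```
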
By checking the traffic data for the 45 minutes before the
crashes, 114 crashes were visually identified as different from the
corresponding non-crash cases. The
criteria for visual identification are listed in Table 8:
Table 8 Visual identification of crash cases from corresponding non-crash cases
|
Criteria |
Meaning |
|
1.The plot of crash cases
are generally above non-crash cases |
The mean of traffic
parameter is greater than the mean of the non-crash case |
|
2.The plot of crash cases
are generally below non-crash cases |
The mean of parameter is
less than the mean of the non-crash case |