Final report of ITS Center project: Incident duration forecasting

A Research Project Report

For the National ITS Implementation Research Center

A U.S. DOT University Transportation Center

FORECASTING THE CLEARANCE TIME OF FREEWAY ACCIDENTS

 

Kevin Smith
Department of Civil Engineering

Dr. Brian L. Smith
Department of Civil Engineering
Email: briansmith@Virginia.EDU

 

Center for Transportation Studies
University of Virginia
CTS Website http://cts.virginia.edu
351 McCormick Road, P.O. Box 400742
Charlottesville, VA 22904-4742
434.924.6362

Smart Travel Lab Report No. STL-2001-01

The Center for Transportation Studies (CTS) at the University of Virginia produces outstanding transportation professionals and innovative research results, and provides important public service. The Center for Transportation Studies is committed to academic excellence, multi-disciplinary research and to developing state-of-the-art facilities. Through a partnership with the Virginia Department of Transportation’s (VDOT) Research Council (VTRC), CTS faculty hold joint appointments, VTRC research scientists teach specialized courses, and graduate student work is supported through a Graduate Research Assistantship Program. CTS receives substantial financial support from two federal University Transportation Center Grants: the Mid-Atlantic Universities Transportation Center (MAUTC) and the National ITS Implementation Research Center (ITS Center). Other related research activities of the faculty include funding through FHWA, NSF, US Department of Transportation, VDOT, other governmental agencies and private companies.

Disclaimer: The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. This document is disseminated under the sponsorship of the Department of Transportation, University Transportation Centers Program, in the interest of information exchange. The U.S. Government assumes no liability for the contents or use thereof.

Abstract

Freeway congestion is a major and costly problem in many U.S. metropolitan areas. From a traveler’s perspective, congestion has costs in terms of longer travel times and lost productivity. From the traffic manager’s perspective, congestion causes a freeway to operate inefficiently and below capacity. There are also environmental costs associated with congestion such as increased pollution and noise. Researchers have estimated that "non-recurring" congestion due to freeway incidents such as accidents, disabled vehicles, and weather events accounts for one-half to three-fourths of the total congestion on metropolitan freeways in this country.

The objective of this study is to develop a forecasting model that can predict the clearance time of a freeway accident. This can aid traffic managers in making decisions regarding the appropriate response to freeway incidents. Three models were investigated in this paper: a stochastic model, a nonparametric regression model, and a classification tree model. The stochastic model was not applied to forecasting future accidents due to the lack of a probabilistic distribution to fit the clearance time data. The Weibull and lognormal distributions have been applied to incident duration in the past, but were not applicable to the accident clearance time data used in this study. The other two models were developed but suffered from poor performance in predicting the clearance time of future accidents. However, the classification tree model appears to be well suited for forecasting the phases of incident duration given a database of incidents with reliable and informative characteristics.

 

Table of Contents

Chapter 1: Introduction

1.1 Project Definition

1.2 Problem Rationale

1.3 Project Purpose and Scope

1.4 Report Overview

Chapter 2: Review of Relevant Literature

2.1 Freeway Traffic Incidents

2.2 Past Research on Incident Duration Prediction

2.2.1 Probabilistic Distributions

2.2.2 Linear Regression Models

2.2.3 Conditional Probabilities

2.2.4 Time Sequential Models

2.2.5 Decision Trees

2.3 New Forecasting Techniques

2.4 Classification Trees

2.4.1 Tree Construction

2.4.2 Classification and Regression Tree (CART) Software

2.5 Non-parametric Regression

2.5.1 Neighborhood Definition

2.5.2 Forecast Generation

2.6 Summary

Chapter 3: Research Framework

3.1 Methodology

3.2 Data Source

3.3 Database Structure

3.3.1 Incident Table

3.3.2 Agency Table

3.3.3 Assist Table

3.3.4 Automobile Table

3.3.5 Roadway Table

3.3.6 Location Table

3.4 Data Collection

3.4.1 Data Reduction

3.5 Potential Independent Variables

3.5.1 Physical Independent Variables

3.5.2 Vehicle Independent Variables

3.5.3 Accident Response Independent Variables

3.6 ANOVA Significance Test

3.7 Model Selection

Chapter 4: Stochastic Model

4.1 Model Background

4.2 Probability Density Functions

4.2.1 Goodness-of-fit Test

4.3 Model Development

4.3.1 Model for All Accidents

4.3.2 Model for Accident Severity

4.3.3 Model for Accident Time of Day

4.4 Summary

Chapter 5: Nonparametric Regression Model

5.1 Model Development

5.1.1 Neighborhood Definition

5.1.2 Distance Metric

5.1.3 Forecast Generation

5.2 Model Algorithm

5.3 Measures of Effectiveness

5.4 Selection of Neighborhood Size

5.4 Model Results

5.5 Result Summary

Chapter 6: Classification Tree Model

6.1 Model Development

6.2 Measures of Effectiveness

6.3 Model Results

6.3.1 Prediction Accuracy

6.4 Result Summary

Chapter 7: Conclusion

7.1 Project Conclusions

7.2 Recommendations

7.2.1 Forecasting Models

7.2.2 Incident Databases

7.2.3 Data Entry Procedure

7.2.4 Needed Accident Information

7.2.5 Incident Duration

7.3 Future Research

7.5 Summary

References

Appendix A: ANOVA Significance Test Results

Appendix B: Visual Basic Code for Nonparametric Regression Model

Appendix C: Nonparametric Regression Model Results for all Neighborhood Sizes

Appendix D: Map of HRSTC Location Zones

 

List of Figures

Figure 2-1: The 4 phases of a freeway incident over time.

Figure 2-2: General shape of a lognormal probability distribution.

Figure 2-3: Decision tree for incident clearance time prediction (Ozbay and Kachroo, 1999).

Figure 2-4: Example of a mechanic’s classification tree.

Figure 2-5: Classification tree structure.

Figure 2-6: Example of a kernel neighborhood of size 6.

Figure 2-7: Example of a nearest neighbor neighborhood (k=8).

Figure 3-1: Map of the Hampton Roads region of Virginia.

Figure 3-2: Structure of incident database tables.

Figure 4-1: Clearance time histogram for all accidents.

Figure 4-2: Histogram and distribution overlay for all accidents.

Figure 4-3: Clearance time histogram for single vehicle accidents.

Figure 4-4: Clearance time histogram for two vehicle accidents.

Figure 4-5: Clearance time histogram for three or more vehicle accidents.

Figure 4-6: Histogram and distribution overlay for single vehicle accidents.

Figure 4-7: Histogram and distribution overlay for two vehicle accidents.

Figure 4-8: Histogram and distribution overlay for three or more vehicle accidents.

Figure 4-9: Clearance time histogram of peak weekday accidents.

Figure 4-10: Clearance time histogram of off-peak weekday accidents.

Figure 4-11: Clearance time histogram of weekend accidents.

Figure 4-12: Histogram and distribution overlay for peak weekday accidents.

Figure 4-13: Histogram and distribution overlay for off-peak weekday accidents.

Figure 4-14: Histogram and distribution overlay for weekend accidents.

Figure 5-1: Pseudo-code for nonparametric regression procedure.

Figure 5-2: Pseudo-code for nonparametric regression testing procedure

Figure 5-3: Mean absolute prediction error for range of neighborhood sizes.

Figure 5-4: Number of test accidents predictions within X minutes of actual.

Figure 5-5: Number of prediction errors less than or equal to 5 minutes.

Figure 5-6: Number of prediction errors less than or equal to 10 minutes.

Figure 5-7: Number of prediction errors less than or equal to 15 minutes.

Figure 5-8: Number of prediction errors less than or equal to 30 minutes.

Figure 5-9: Number of prediction errors less than or equal to 60 minutes.

Figure 6-1: Classification tree model diagram.

 

List of Tables

Table 3-1: Potential model independent variables.

Table 3-2: Independent variable significance test results.

Table 4-1: Distribution parameters for all accidents.

Table 4-2: Chi-square test for all accidents.

Table 4-3: Distribution parameters for single vehicle accidents.

Table 4-4: Distribution parameters for two vehicle accidents.

Table 4-5: Distribution parameters for three or more vehicle accidents.

Table 4-6: Chi-square test for single vehicle accidents.

Table 4-7: Chi-square test for two vehicle accidents.

Table 4-8: Chi-square test for three or more vehicle accidents.

Table 4-9: Distribution parameters for peak weekday accidents.

Table 4-10: Distribution parameters for off-peak weekday accidents.

Table 4-11: Distribution parameters for weekend accidents.

Table 4-12: Chi-square test for peak weekday accidents.

Table 4-13: Chi-square test for off-peak weekday accidents.

Table 4-14: Chi-square test for weekend accidents.

Table 5-1: Nonparametric regression independent variables.

Table 5-2: Example of distance metric.

Table 5-3: Nonparametric Regression Model Results.

Table 6-1: Classification tree model prediction accuracy.

 

Chapter 1: Introduction

1.1 Project Definition

Freeway congestion is a major and costly problem in many U.S. metropolitan areas. From a traveler’s perspective, congestion has costs in terms of longer travel times and lost productivity. From the traffic manager’s perspective, congestion causes a freeway to operate inefficiently and below capacity. There are also environmental costs associated with congestion such as increased pollution and noise. The type of congestion most people are familiar with is the "recurring" congestion patterns of rush hour. Traffic managers and politicians have been fighting this congestion for many years through the use of High-Occupancy Vehicle (HOV) lanes, ride-sharing programs, transit incentives, and ramp metering. However, the "non-recurring" congestion due to unpredictable incidents and events warrants immediate response. The actions taken by traffic managers require a full understanding of the nature and tendencies of freeway incidents.

 

1.2 Problem Rationale

Past researchers have estimated that "non-recurring" congestion due to freeway incidents such as accidents, disabled vehicles, and weather events accounts for one-half to three-fourths of the total congestion on metropolitan freeways in this country (Giuliano, 1989). Thus, the specific field of Incident Management has become an important component of traffic management. Incident Management involves the steps of clearing traffic incidents quickly and then minimizing the congestion effects on the traffic flow. Clearing an incident quickly involves managerial support among agencies, clear guidelines for action, and immediate identification of the incident. Minimizing the incident congestion involves the use of traveler information systems such as dynamic message signs and advisory radio, reversible direction lanes, and vehicle re-routing. Ideally, these techniques should be employed as soon as possible instead of waiting for the congestion to begin. The difficult task, from a traffic manager’s perspective, is estimating the duration of the incident to help decide on the appropriate course of action.

This situation can be illustrated with an example. A traffic manager observes a freeway accident near a major interchange that has resulted in the closure of one lane. The manager knows that if the queue of stopped vehicles reaches the major interchange, they will need to activate the variable message signs to alert motorists on both roadways about the stopped traffic at the interchange. Queuing models are in place currently that predict queue characteristics based on demand flows, speeds, and available capacity. However, the queue length depends on the length of time that the incident is active and the capacity is reduced by the one lane closure. In this case, if the manager can anticipate the clearance time of the accident, they can determine the length of the queue and make a decision on the activation of variable message signs.

At the start of a freeway incident, traffic managers have an impression of the nature of the incident. These managers interact constantly with traffic incidents and may be able to use past experience to predict the duration of the current incident. Only a few studies have been undertaken to provide a quantitative model to support the managers’ projections based on personal experience.

 

1.3 Project Purpose and Scope

The goal of this project is to develop methods to forecast the clearance time of a freeway accident based on its characteristics. The accident data to support model development will come from the Smart Travel Lab at the University of Virginia. The Smart Travel Lab receives traffic data from VDOT’s Smart Traffic Center in Virginia Beach, VA. It is anticipated that the forecasting models will be applicable to any freeway system.

It should be stressed that this project will attempt to forecast accident clearance time, which is the length of time that emergency and other personnel are present on the freeway. It is assumed that the clearance time is a good indication of the total duration of an incident. The importance of understanding incident duration is that it is a major factor in determining queues, delay, and other non-recurring congestion effects.

 

1.4 Report Overview

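The remainder of this report is composed of the following chapters:

Chapter 2: Review of Relevant Literature

Chapter 3: Research Framework

Chapter 4: Stochastic Model

Chapter 5: Nonparametric Regression Model

Chapter 6: Classification Tree Model

Chapter 7: Conclusion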

Chapter 2: Review of Relevant Literature

2.1 Freeway Traffic Incidents

A freeway incident is defined as any planned or unplanned event that affects the traffic flow on the roadway (Sethi, et al., 1994). Some examples of freeway incidents include accidents and crashes, disabled or abandoned vehicles, vehicle fires, weather events, road debris, construction, etc. The Highway Capacity Manual (TRB, 1994) states that incidents

Incidents reduce the level of service by lowering speeds (Giuliano, 1989). Other motorists slow down to allow emergency vehicles to respond, avoid debris and vehicles, and also to "rubberneck" or look at the incident. Capacity is also reduced during incidents due to lane closures or impediments. One study claims that a single lane blockage on a three lane roadway reduces capacity by fifty percent (TRB, 1994). Additionally, it should not be overlooked that freeway incidents do result in fatalities, personal injuries, and property damage.

The duration of an incident is composed of four important and distinct components: detection, response, clearance, and recovery (TRB, 1994), as shown in Figure 2-1.

Figure 2-1: The 4 phases of a freeway incident over time.

Together the four phases represent the total duration of the incident or the period of time from the occurrence of an incident to the return of normal traffic flow conditions.

Even though research has dissected incident duration into these four phases, it is possible for an incident to not exhibit all of the phases. For example, an incident may not have a response phase if police or response teams discover the incident while patrolling the area. Likewise, if an incident is observed occurring on a surveillance camera, there will not be a detection phase. Also, for minor incidents that have short detection, response, and clearance phases there might not be an effect on traffic flow and no noticeable recovery phase. Finally, it would appear that the clearance phase would always be present, but some minor incidents may not necessitate emergency vehicles or police and can be treated by the people involved without assistance.

 

2.2 Past Research on Incident Duration Prediction

There has been research in the past looking into predictive techniques applied to incident duration. The results from these studies have been mixed and comparisons between different methods are difficult due to data issues. Almost every study uses a different source of incident data with different descriptive variables and reporting techniques. Some studies suffer from a small sample size, and others from inaccurate data or data with missing values.

 

2.2.1 Probabilistic Distributions

A simple method to predict incident duration is to model the duration value as a random variable and attempt to fit a probability density function to the data. From this distribution, the traffic manager has an idea of the mean and variance of incident duration. Another useful piece of information is the ability to say there is an x probability of the incident lasting over y minutes.
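As a minimal sketch of the idea above, the following Python snippet computes the mean, the variance, and the probability of a duration exceeding y minutes, assuming the duration follows a lognormal distribution. The function names and the parameters mu and sigma (the mean and standard deviation of the logarithm of duration) are illustrative assumptions and are not taken from any of the cited studies.

```python
import math


def lognormal_mean_variance(mu, sigma):
    """Mean and variance of a lognormal duration (mu, sigma on the log scale)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    variance = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, variance


def lognormal_exceedance(y_minutes, mu, sigma):
    """Probability that the duration exceeds y_minutes (requires sigma > 0)."""
    if y_minutes <= 0:
        return 1.0
    z = (math.log(y_minutes) - mu) / sigma
    # Upper tail of the standard normal, written with the complementary
    # error function from the standard library.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical parameters chosen only for illustration.
mu, sigma = 3.0, 0.8
print(lognormal_mean_variance(mu, sigma))
print(lognormal_exceedance(45.0, mu, sigma))
```

With a fitted distribution in hand, the second function gives the statement "there is an x probability of the incident lasting over y minutes" directly.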

In 1987, Golob et al. analyzed freeway accidents that involved trucks. The data used in the analysis comprised 332 freeway and 193 ramp accidents around Los Angeles, California over a two-year period. The authors theorized that an incident is composed of the sequential phases listed in the previous section, but added that the length of each phase is influenced by the length of the preceding phases (Golob, et al., 1987). From this hypothesis, they were able to theorize that the total duration of an incident is modeled according to a lognormal distribution. Kolmogorov-Smirnov tests of the truck data supported the lognormal distribution of all incidents and each specific incident type (Golob, et al., 1987). Lacking in the analysis was a test of the assumption that each incident phase is time-dependent on the previous phases. This assumption could not be tested since the incident data only contained the total duration of the incident.

Other research studies by Giuliano in 1989, Garib et al. in 1997, and Sullivan in 1997 have supported the use of a lognormal distribution to describe freeway incident duration. In 1991, Jones et al. applied a similar distribution, the log-logistic distribution, to a specific data set from the Seattle area. Nam and Mannering in 2000 found that the Weibull distribution could also be used to describe some incident data. The common theme among all three of these distributions is a peak shifted to the left with a long right tail, which reflects a larger proportion of short-duration incidents (see Figure 2-2).

Figure 2-2: General shape of a lognormal probability distribution.

Recent research by Ozbay and Kachroo in 1999 went one step further in terms of the probabilistic distribution of freeway incidents. Using 650 incidents from Northern Virginia over a one-year period, they found that the incident duration had a shape similar to a lognormal distribution, but the lognormal fit was rejected by several statistical significance tests. However, they found that if the set of incidents is divided into subsets of incidents that have the same type and similar severity, a normal distribution of duration is found. This conclusion supports the theory that the durations of similar incidents are random variables.

 

2.2.2 Linear Regression Models

Another simple prediction model is a linear regression function. In terms of incident duration, the regression usually has a number of binary variables that represent certain incident characteristics. A 1991 unpublished paper from Northwestern University (Ozbay and Kachroo, 1999) studied incident clearance data of 121 incidents from the Chicago area and found 9 statistically significant variables: heavy wrecker (WRECKER), assistance from other response agencies (OTHER), sand/salt pavement operations (SAND), number of heavy vehicles involved (NTRUCK), heavy loading (HEAVY), liquid or uncovered broken loadings in heavy vehicles (NONCON), severe injuries in vehicles (SEVINJ), freeway facility damage caused by incident (RDSIDE), and extreme weather conditions (WX). Two other variables were deemed useful but not statistically significant: response time (RESP) and incident report (HAR). The regression model developed has the form

Clearance Time = 14.03 + 35.57(HEAVY) + 16.47(WX) + 18.84(SAND) – 2.31(HAR) + 0.69(RESP) + 27.97(OTHER) + 35.81(RDSIDE) + 18.44(NTRUCK) + 32.76(NONCON) + 22.90(SEVINJ) + 8.34(WRECKER)

Ozbay and Kachroo do not report on the validity of this regression model, such as r-squared values, or any testing techniques.
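The regression above can be evaluated with a few lines of Python. This is only a sketch: the coefficients are copied from the equation as reported, while the function name, the dictionary-based input, and the choice to treat unreported variables as zero are assumptions made for illustration. The source does not state the units, so the result is read here as minutes by assumption.

```python
# Coefficients as reported in the regression equation above.
INTERCEPT = 14.03
COEFFICIENTS = {
    "HEAVY": 35.57, "WX": 16.47, "SAND": 18.84, "HAR": -2.31,
    "RESP": 0.69, "OTHER": 27.97, "RDSIDE": 35.81, "NTRUCK": 18.44,
    "NONCON": 32.76, "SEVINJ": 22.90, "WRECKER": 8.34,
}


def predict_clearance_time(incident):
    """Evaluate the linear model; variables missing from `incident` count as 0."""
    return INTERCEPT + sum(
        coefficient * incident.get(name, 0)
        for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical incident: extreme weather and a heavy wrecker, nothing else.
print(predict_clearance_time({"WX": 1, "WRECKER": 1}))
```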

In 1997, Garib et al. also developed a linear regression model to predict incident duration. The analysis consisted of 205 incidents over a two-month period from Oakland, California, and found six significant variables: number of lanes affected (X1), number of vehicles involved (X2), binary variable for truck involvement (X5), binary variable for time of day (X6), natural logarithm of the police response time (X7), and a binary variable for weather conditions (X8). The log-based regression model is given by

Log(Duration) = 0.87 + 0.027 X1 X2 + 0.2 X5 – 0.17 X6 + 0.68 X7 – 0.24 X8

The adjusted R-square value of this regression model is 0.81 (Garib, et al., 1997). The authors conclude that the model is thus 81% accurate at predicting incident duration, without having performed any tests on incidents not used to develop the model.
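A short sketch of how the log-based model could be evaluated is given below. The coefficients come from the equation above. The function name and argument names are hypothetical, and the sketch assumes that "Log" denotes the natural logarithm, which the text does not state explicitly (only X7 is described as a natural logarithm).

```python
import math


def predict_log_duration(lanes, vehicles, truck, time_of_day,
                         response_time, weather):
    """Evaluate the log-based model; binary inputs are 0 or 1."""
    x7 = math.log(response_time)  # natural log of police response time
    return (0.87 + 0.027 * lanes * vehicles + 0.2 * truck
            - 0.17 * time_of_day + 0.68 * x7 - 0.24 * weather)


# Hypothetical incident: 2 lanes, 3 vehicles, truck involved, 10-unit response.
log_duration = predict_log_duration(2, 3, 1, 0, 10.0, 0)
# Back-transform, valid only under the natural-log assumption stated above.
print(math.exp(log_duration))
```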

 

2.2.3 Conditional Probabilities

Another use of probability in incident duration is to develop conditional probabilities. Traffic managers may be interested in the probability of an incident lasting 30 minutes given that it has already been active for 15 minutes, or similar cases. Most research has focused on unconditional probabilities such as the probability of an incident lasting exactly 30 minutes. Jones et al. reported on conditional probabilities in 1991. Nam and Mannering followed up on the concept by applying hazard-based models developed in the biometrics and industrial engineering fields to incident duration. Hazard-based models also use conditional probabilities to find the likelihood that an incident will end in the next short time period given its continuing duration (Nam and Mannering, 2000). The use of conditional probabilities is based on the theory developed by Golob et al. in 1987 that each incident phase is influenced by the length of previous phases of the incident. To date, these types of models have been used to find the accident characteristics that have the greatest influence on incident duration instead of explicitly forecasting the duration for empirical testing purposes.
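To make the conditional probability concrete, the sketch below assumes a Weibull duration model and reads "lasting 30 minutes" as "lasting at least 30 minutes". The conditional probability is then the ratio of two survival probabilities, and the hazard is the instantaneous rate of ending given survival so far. The shape and scale values are hypothetical and are not taken from Nam and Mannering or any other cited study.

```python
import math


def weibull_survival(t, shape, scale):
    """P(duration > t) for a Weibull duration model."""
    return math.exp(-((t / scale) ** shape))


def prob_exceeds_given_elapsed(t_total, t_elapsed, shape, scale):
    """P(duration > t_total | duration > t_elapsed), with t_total >= t_elapsed."""
    return (weibull_survival(t_total, shape, scale)
            / weibull_survival(t_elapsed, shape, scale))


def weibull_hazard(t, shape, scale):
    """Instantaneous rate at which an incident still active at time t ends."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Hypothetical parameters: shape > 1 means the incident becomes more likely
# to end the longer it has lasted.
print(prob_exceeds_given_elapsed(30.0, 15.0, shape=1.5, scale=25.0))
print(weibull_hazard(15.0, shape=1.5, scale=25.0))
```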

 

2.2.4 Time Sequential Models

A 1995 paper by Khattak et al. makes the statement that most incident duration prediction models have no operational value since they require knowledge about all incident variables. In the field, accident information is acquired sequentially and this progression should be reflected in the model.

To develop the time sequential model, the authors identified ten distinct stages of the incident duration based on the availability of information (Khattak, et al., 1995). The length of time for each stage differs for each incident, but it is truncated after a maximum of 10 minutes. Each stage has a separate truncated regression model, and the models progressively add more variables. The time sequential model was not tested or validated in the study due to a small sample size of 109 freeway accidents. This study intends to demonstrate the methodology of time sequential models rather than show its performance in traffic operations (Khattak, et al., 1995). It does not appear that this model approach was ever applied to a large sample size or used in any future study on forecasting incident duration.

 

2.2.5 Decision Trees

All of the methods of incident duration prediction discussed so far have had a probabilistic basis. This may be preferable in many cases because you can add confidence intervals or other probabilities to the forecasted output. For example, a model could tell a traffic operator that the current incident will last 20 minutes with a 95% confidence level. However, if these models do not produce accurate results, it is of no use to the operator to know that an incident will last 20 minutes with a 60% confidence level. In this case new models and methods for duration prediction are needed that can identify patterns in data without an underlying probabilistic distribution.

One such pattern-recognition model that has recently been applied to incident duration prediction is a decision tree. A specific type of decision tree, a classification tree, will be discussed thoroughly in the next section. In 1999, Ozbay and Kachroo used decision trees to predict incident clearance times in the Northern Virginia region. Their published work describes a comprehensive study in a step-by-step manner to show all of the data collection and analysis processes. After collecting a large sample of incident data, the authors followed a series of trial prediction methods with poor results. They first tried linear regression techniques, which gave a low R-square value (about 0.35), and then found the duration values did not follow either a lognormal or log-logistic distribution (Ozbay and Kachroo, 1999). The next step was to develop a decision tree similar to the classification and regression trees (CART) developed by Breiman et al., which will be defined later in this chapter.

Before constructing the decision tree, Ozbay and Kachroo first determined the significant independent variables using ANOVA tests of the data. For some types of incidents, such as HAZMAT and weather related incidents, the number of samples in the database was too small to make any conclusion on the variables’ importance. Thus, these variables were also excluded from the construction of the decision tree. It should also be noted that the intended output of the model is an average duration of past incidents that are similar to the current one. Other outputs could be a range of possible durations or a minimum and maximum duration value.

A portion of the final decision tree is included in Figure 2-3 (Ozbay and Kachroo, 1999).

Figure 2-3: Decision tree for incident clearance time prediction (Ozbay and Kachroo, 1999).

The decision tree above is the main tree where the first decision is based on incident type. For example, in the above tree an incident where the type is unknown is immediately given a mean duration of 45 minutes. A disabled truck is assumed to have a clearance time with a mean of 60 minutes. However, if more information is available about the use of a wrecker, the prediction is refined to 32 minutes for no wrecker or 76 minutes for a wrecker. The decision tree can therefore handle differing levels of knowledge about the current incident.
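The refinement behavior described above can be sketched as a small lookup function. Only the branches and mean durations quoted in the text are encoded. The function name, the string labels for incident types, and the use of None for "wrecker information not yet available" are illustrative assumptions, and the remaining branches of the published tree are not reproduced.

```python
def predict_clearance_minutes(incident_type, wrecker_used=None):
    """Mean clearance time for the branches quoted in the text.

    wrecker_used is None while the information is unavailable,
    otherwise True or False.
    """
    if incident_type == "unknown":
        return 45
    if incident_type == "disabled_truck":
        if wrecker_used is None:
            return 60  # coarse estimate at the incident-type level
        return 76 if wrecker_used else 32  # refined once wrecker use is known
    raise ValueError("branch not covered by this sketch: " + incident_type)


print(predict_clearance_minutes("disabled_truck"))        # early estimate
print(predict_clearance_minutes("disabled_truck", True))  # refined estimate
```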

Ozbay and Kachroo tested the decision tree and found a satisfactory performance where 44 out of 77 test incidents were predicted with less than 10 minutes of prediction error. One important finding was that there were a number of outliers that had a large difference between actual and predicted durations. These outliers have the potential to skew some measures of performance, such as the Mean Absolute Error (MAE), that average the difference between actual and predicted durations over all test incidents.
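The effect of outliers on an averaged error measure can be illustrated with a short sketch. The durations below are invented for illustration only and are not data from Ozbay and Kachroo. The two functions compute the mean absolute error and the number of predictions falling within a given tolerance of the actual value.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted durations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def count_within(actual, predicted, tolerance):
    """Number of predictions with absolute error strictly below the tolerance."""
    return sum(1 for a, p in zip(actual, predicted) if abs(a - p) < tolerance)


# Hypothetical durations in minutes; the last incident is an outlier.
actual = [20, 30, 40, 200]
predicted = [25, 28, 45, 50]

print(mean_absolute_error(actual, predicted))  # dominated by the outlier
print(count_within(actual, predicted, 10))     # unaffected by its magnitude
```

In this invented example three of four predictions are within 10 minutes, yet the single outlier pushes the average error above 40 minutes, which is why a count-based measure can be a useful complement.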

 

2.3 New Forecasting Techniques

The following sections will describe two fairly new forecasting techniques that will be used for this project. The techniques are classification trees and nonparametric regression.

 

2.4 Classification Trees

A classification tree is a type of decision tree that consists of a number of yes/no questions that sort objects into distinct classes. The difference between a classification tree and a decision tree similar to the one described above is that the classification tree assigns a class instead of a deterministic value. When a mechanic is inspecting a car to find why it doesn’t run, they progress through a checklist of different parts to inspect. The result of each question leads the mechanic down a different path. If the checklist were plotted as a number of nodes and links, a classification tree would be formed (see Figure 2-4).

Figure 2-4: Example of a mechanic’s classification tree.

The goal of a classification tree is to take a set of objects with characteristics or measurements and find a systematic method of assigning the objects to a number of distinct classes. This systematic method of splitting a sample into two sub-samples is referred to as a classifier (or classification rule), and the combination of numerous classifiers forms a classification tree (Breiman, et al., 1984).

Data is the most important key to constructing a classifier (Breiman, et al., 1984). As seen in the example above, the mechanic’s checklist is based on past experience with similar vehicles and problems. The data used in a classifier can have either numerical or categorical variables. An example of a numerical variable would be the number of miles on the vehicle, while a categorical variable would be the type of engine in the vehicle. The ability to find patterns in both numerical and categorical variables makes decision trees very useful for modeling real-world processes like diagnostic problems (Park, 1995).

To date, there has been no published study on the use of classification trees to forecast incident duration. Ozbay and Kachroo have shown that decision trees are a promising technique that needs to be developed further. Classification trees have the advantage of sorting classes of incident duration that can be defined by the user. In addition, the categorical output may be better suited for the categorical inputs that are used in describing the nature of a freeway incident.

 

2.4.1 Tree Construction

This section deals specifically with binary tree structured classifiers, which consist of simple yes or no splitting decisions. The classifier splits a sample into two descendant subsets. A classification tree is formed by repeated splits of descendant samples (Breiman, et al., 1984). For a binary tree structure, the splitting rule is of the nature, "Is x = y?" with a path for objects where this statement is true and another path for a false statement. At some point a tree must stop splitting and terminal subsets (or nodes) are declared. Each terminal subset is then assigned to a class, and it is possible for two or more subsets to belong to the same class (Breiman, et al., 1984). See Figure 2-5 for the structure of a classification tree.

Figure 2-5: Classification tree structure.
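The binary structure described above can be sketched in a few lines of Python. Each internal node holds a yes/no question and two descendants, and each terminal node holds a class label. The node layout, the example questions, and the class names "short", "medium", and "long" are all hypothetical and serve only to illustrate the mechanics.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    label: Optional[str] = None                        # set on terminal nodes
    question: Optional[Callable[[dict], bool]] = None  # set on internal nodes
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def classify(node, obj):
    """Follow yes/no answers from the root until a terminal node is reached."""
    while node.label is None:
        node = node.yes if node.question(obj) else node.no
    return node.label


# Hypothetical tree; note that two terminal nodes share the class "long".
tree = Node(
    question=lambda o: o["truck_involved"] == "yes",
    yes=Node(label="long"),
    no=Node(
        question=lambda o: o["lanes_blocked"] >= 2,
        yes=Node(
            question=lambda o: o["injuries"] == "yes",
            yes=Node(label="long"),
            no=Node(label="medium"),
        ),
        no=Node(label="short"),
    ),
)

print(classify(tree, {"truck_involved": "no", "lanes_blocked": 2, "injuries": "no"}))
```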

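The construction of a classification tree is dependent on three processes (Breiman, et al., 1984):

1. The selection of the splits

2. The decision of when to declare a node terminal or to continue splitting it

3. The assignment of each terminal node to a class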

2.4.2 Classification and Regression Tree (CART) Software

The CART software program is based on the decision tree methodology developed by Breiman, Friedman, Olshen, and Stone in 1984. The software program incorporates a binary recursive partitioning algorithm, where parent nodes are always split into two child nodes and each child node is considered a future parent node if it is determined that a split is needed (Salford Systems, 2000). One key feature of the CART program is that the classifiers are nonparametric and thus do not require prior knowledge about the probabilistic distribution of the underlying data (Salford Systems, 2000).

The first problem in constructing a classification tree is the method to determine the splits that will divide the parent node data into two smaller samples. CART is based on the fundamental idea that each split should be selected so that the data in each descendant subset are "purer" than the data in the parent node (Breiman, et al., 1984). This measure of impurity is based on the proportion of cases in the node belonging to each class. Node impurity is largest when all classes are equally mixed together and smallest when the node contains only one class (Breiman, et al., 1984). Consider a parent node t, which contains data belonging to J number of classes. The proportions for each class in the node are given by

$$ p(j \mid t), \quad j = 1, 2, \ldots, J, \qquad \sum_{j=1}^{J} p(j \mid t) = 1 $$

Based on the Gini diversity index, the impurity function i(t) for node t is given by

$$ i(t) = 1 - \sum_{j=1}^{J} p^2(j \mid t) $$

Thus, consider a parent node t that uses splitting rule d to split into two nodes tL and tR, where pR and pL are the respective proportions of cases from the parent node in each sub-sample. The original impurity of the parent node is given by i(t), and the impurity for the two new nodes by i(tL) and i(tR). The decrease in impurity of this split d is given by

$$ \Delta i(d, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R) $$

When selecting a node split it is important to note that there is a finite number of possible splits of the data (Breiman, et al., 1984). Thus, CART uses a brute force method that examines each possible split for each possible variable. Using the impurity function discussed above, the split with the largest decrease in impurity is chosen for that node. The assumption is that the impurity will never increase from a split. If the impurity cannot be decreased, then a terminal node is declared and that portion of the tree stops growing. The class assigned to the terminal node is based on which class became "purer" from the final split. It should be noted that using the Gini rule for class assignment is different from the plurality rule. Thus, it is possible that a terminal node will be assigned one class when there is a higher proportion of another class in the node.
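The brute force search described above can be illustrated with a minimal sketch. It assumes the standard Gini form, one minus the sum of squared class proportions, and it uses hypothetical names (`gini`, `impurity_decrease`, `best_split`) and a toy data layout that are not part of the CART software. Each candidate question has the form "is variable = value?".

```python
from collections import Counter


def gini(labels):
    """Gini diversity index of a node: 1 - sum_j p(j|t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Decrease in impurity: i(t) - pL * i(tL) - pR * i(tR)."""
    n = len(parent)
    p_left, p_right = len(left) / n, len(right) / n
    return gini(parent) - p_left * gini(left) - p_right * gini(right)


def best_split(rows, labels):
    """Try every 'variable == value' question and keep the largest decrease.

    rows is a list of dicts (variable name -> value); labels holds the class
    of each row. Returns (decrease, (variable, value)), or (0.0, None) when
    no question lowers the impurity, in which case the node is terminal.
    """
    best = (0.0, None)
    questions = sorted({(var, val) for row in rows for var, val in row.items()})
    for var, val in questions:
        left = [lab for row, lab in zip(rows, labels) if row[var] == val]
        right = [lab for row, lab in zip(rows, labels) if row[var] != val]
        if not left or not right:
            continue  # the question does not actually divide the node
        decrease = impurity_decrease(labels, left, right)
        if decrease > best[0]:
            best = (decrease, (var, val))
    return best


# Toy node: four accidents described by two binary variables (illustrative only).
rows = [
    {"TOW": 1, "PEAK": 0},
    {"TOW": 1, "PEAK": 1},
    {"TOW": 0, "PEAK": 0},
    {"TOW": 0, "PEAK": 1},
]
labels = ["long", "long", "short", "short"]
print(best_split(rows, labels))
```

In this toy node the PEAK question leaves both children as mixed as the parent, so only a TOW question produces a decrease in impurity.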

The conclusion of Breiman et al. was that the properties of the final classification tree are insensitive to the specific splitting rule used in development. A much more important criterion is the pruning method, which is used to determine the best size of tree. CART first constructs the largest possible tree, such that the impurity of the terminal nodes cannot be decreased any further. Pruning involves taking the large tree and recombining splits into parent nodes (Cios, et al., 1998). The pruning moves upward and produces a sequence of sub-trees of decreasing size. The sub-trees are then tested for their predictive accuracy using a separate test data set or cross-validation techniques (testing techniques will be discussed later). The sub-tree that has the lowest misclassification rate on the test data is selected as the optimal classification tree for that data (Breiman, et al., 1984). Pruning is widely accepted in the construction of classification trees as one method to avoid overfitting the learning data set (Cios, et al., 1998). Overfitting is the phenomenon in which the process of fitting a model or method to a data set goes too far and attempts to define or explain every instance in the data set.

Since classification trees are data-dependent, there must be adequate testing of the tree with data not used to develop the classifiers. CART uses two testing procedures: learning samples and cross validation (Salford Systems, 2000). When there is a large data set to develop a tree, the sample can be divided into learning and testing sub-samples. CART uses the learning sample to develop potential trees during the pruning process, while the testing sample is used to compare the tree performance and select the optimal tree. When there is a small data sample available, CART uses cross validation for testing. In this process the data set is divided into ten equal samples. Nine of the samples are then used as a learning sample, with the remaining one used as a testing sample. This process of growing a tree and testing continues until each sample has been used for testing. The results from these 10 trees are then combined to form error rates for trees of each possible size (Salford Systems, 2000). The optimal testing situation would be to use a learning and testing sample, and CART has historically performed about 10 to 15 percent better using testing samples than cross validation (Salford Systems, 2000).
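The ten-fold procedure can be sketched as follows. This is only an illustration of how the learning and testing indices rotate. The function name `k_fold_indices` and the interleaved assignment of cases to folds are assumptions, and the way CART itself forms the ten samples is not documented here.

```python
def k_fold_indices(n_cases, k=10):
    """Yield (learning, testing) index lists so each case is tested exactly once."""
    # Interleaved assignment: case i goes to fold i mod k (an assumption).
    folds = [list(range(start, n_cases, k)) for start in range(k)]
    for testing in folds:
        held_out = set(testing)
        learning = [i for i in range(n_cases) if i not in held_out]
        yield learning, testing


for learning, testing in k_fold_indices(25, k=10):
    # A tree would be grown on `learning` and its error measured on `testing`.
    print(len(learning), len(testing))
```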

 

2.5 Non-parametric Regression

Nonparametric regression is a forecasting technique that has been used in the past for predicting traffic flows in the short-term. The technique has provided positive results and is considered a viable choice for traffic condition forecasting for freeway management systems, and is especially important when there are difficulties developing parametric models (Smith, et al., 2001). The basis of nonparametric regression is to make current decisions based on past, similar experience. Thus, it relies heavily on data describing the relationship between dependent and independent variables. The basic approach is to locate the state of the current system (as defined by the independent variables) in a neighborhood of past, similar states. Once a neighborhood is defined, the past cases in the neighborhood are used to estimate the value of the current dependent variable (Smith, et al., 2001).

To date, there has been no published study of the use of nonparametric regression for predicting incident duration. For such an application, the system state of an incident can be described using a number of independent variables such as time of day and the number of vehicles involved. The dependent variable and forecasting output would be the duration of the incident. One attractive feature of nonparametric regression as a forecasting tool is that the knowledge of the relationship is in the data instead of the model (Smith, et al., 2001).

The key to the effective use of nonparametric regression is defining an appropriate neighborhood and then generating a forecast based on the cases within the given neighborhood.

 

2.5.1 Neighborhood Definition

The accuracy of nonparametric regression is dependent directly on the quality of the neighborhood and its ability to include similar cases (Smith, et al., 2001). The two basic approaches to defining neighborhoods are kernel and nearest neighbor (Altman, 1992). Kernel neighborhoods have a constant bandwidth, and thus occupy a specific range on the independent variable space (Smith, et al., 2001). Nearest neighbor neighborhoods are defined as containing a constant number of past cases. This is commonly referred to as k-nearest neighbor (KNN) nonparametric regression, where k is the number of past cases used to define the neighborhood (Smith, et al., 2001). The two methods of neighborhood definition are best seen on a graph. Figures 2-6 and 2-7 show a sample of data points with corresponding independent and dependent variable values. In this example the problem is how to define a neighborhood for an independent variable value of 35 (dashed line). Figure 2-6 uses a kernel size of 6 to return a neighborhood of 16 data points. Figure 2-7 uses a nearest neighbor approach with k=8 to return a neighborhood of 8 data points.

 

Figure 2-6: Example of a kernel neighborhood of size 6.

Figure 2-7: Example of a nearest neighbor neighborhood (k=8).

An issue with neighborhood definition is how to measure the distance between cases in the database. The basic method is to use the Euclidean distance to measure proximity. This method is applicable for a single numerical independent variable. When more than one independent variable is present, it may be beneficial to add weight factors to each variable to determine the total distance. Assigning weight factors to rank variables is heuristic in nature and requires careful consideration by the model developer (Smith, et al., 2001). A similar situation is present when independent variables are categorical as opposed to numerical in value.
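The two neighborhood definitions and a weighted distance can be compared in a short sketch. All names here (`weighted_distance`, `kernel_neighborhood`, `nearest_neighborhood`) and the sample cases are hypothetical. The kernel size is treated as a radius around the query point, which is one possible reading of the bandwidth.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance with one heuristic weight per independent variable."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def kernel_neighborhood(cases, query, bandwidth, weights):
    """All past cases within a fixed distance of the query (constant bandwidth)."""
    return [c for c in cases if weighted_distance(c[0], query, weights) <= bandwidth]


def nearest_neighborhood(cases, query, k, weights):
    """The k past cases closest to the query (constant neighborhood size)."""
    return sorted(cases, key=lambda c: weighted_distance(c[0], query, weights))[:k]


# Each case is (independent variable values, dependent variable value).
cases = [((30,), 20), ((33,), 25), ((36,), 40), ((50,), 90)]
query, weights = (35,), (1.0,)

print(kernel_neighborhood(cases, query, bandwidth=3, weights=weights))
print(nearest_neighborhood(cases, query, k=3, weights=weights))
```

The kernel neighborhood can return any number of cases, including none, while the nearest neighbor approach always returns k cases even when some of them are far from the query.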

 

2.5.2 Forecast Generation

Once a neighborhood of past cases has been defined, it is necessary to generate a prediction based on those neighbors. A straightforward method is to compute a simple average of the dependent variable values of the cases in the neighborhood. The weakness of this approach is that it ignores the distance information developed during the neighborhood creation (Smith, et al., 2001). A more appropriate approach may be to weight the average so that past cases nearer to the current case have more importance in generating the forecast. Other approaches involve linear regression of the dependent values of the cases in the neighborhood, and other weighting techniques. As with neighborhood definition, forecast generation has many possible approaches, and the final choice of a technique should be thoroughly tested and evaluated.
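The difference between the simple and weighted averages can be seen in a small sketch. Inverse-distance weighting is used here only as one plausible weighting scheme. The function names and the two sample neighbors are assumptions made for illustration.

```python
def simple_forecast(neighbors):
    """Plain average of the neighbors' dependent values; distances are ignored."""
    return sum(value for _, value in neighbors) / len(neighbors)


def weighted_forecast(neighbors, eps=1e-9):
    """Average weighted by inverse distance, so nearer cases count for more.

    eps avoids division by zero when a past case matches the query exactly.
    """
    weights = [1.0 / (distance + eps) for distance, _ in neighbors]
    total = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return total / sum(weights)


# Each neighbor is (distance to the current case, past duration in minutes).
neighbors = [(1.0, 30.0), (2.0, 60.0)]
print(simple_forecast(neighbors))
print(weighted_forecast(neighbors))
```

With these two neighbors the simple average is 45 minutes, while the weighted forecast is pulled toward the nearer case and is close to 40 minutes.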

 

2.6 Summary

The purpose of this chapter was to investigate past attempts at forecasting incident duration and present new methods that will be developed in this project. The next chapter will present the structure of the project and information on the data to be used for model development and testing.

 

 

Chapter 3: Research Framework

3.1 Methodology

The main goal of this study is to investigate models to forecast the clearance time of freeway accidents. The two main methods that will be discussed in depth are nonparametric regression and classification trees. Below is the methodology to construct and evaluate the forecasting methods.

 

3.2 Data Source

The majority of the forecasting models studied in the previous sections have had an empirical rather than theoretical basis for model development. Having a large sample of past accidents with reliable information is the most important key to producing accurate predictions.

The accident data used in this project was obtained from the Smart Travel Lab located at the University of Virginia in Charlottesville, Virginia. The Smart Travel Lab was created through a partnership of the Virginia Department of Transportation (VDOT) and the Department of Civil Engineering at U.Va. The Lab is a state-of-the-art facility for research and education in the field of Intelligent Transportation Systems (ITS). The Lab maintains a number of direct data connections with VDOT facilities around the state. One such VDOT facility that shares data with the Lab is the Hampton Roads Smart Traffic Center (HRSTC) in Virginia Beach, Virginia. The HRSTC is a freeway management system that monitors traffic in the Norfolk and Virginia Beach region using 203 detector stations and 38 surveillance cameras. This area encompasses the I-64 and I-264 corridors (see Figure 3-1).

Figure 3-1: Map of the Hampton Roads region of Virginia.

The HRSTC also serves as the headquarters for the Freeway Incident Response Team (FIRT) that patrols the freeways and assists motorists and emergency vehicles. The Smart Travel Lab receives video, station data, and incident data directly from the HRSTC.

The Hampton Roads Incident Database is maintained by the operators and personnel at the HRSTC. This includes the persons monitoring the freeway cameras and other devices and the supervisors. Freeway incidents are identified by the operators watching the cameras, the incident response team on the freeways, state police radio, phone calls from motorists, and other sources. The incident is then entered by hand into a database using a graphical user interface program at the HRSTC. The database began collecting incidents in January of 1997 and is still in use today, with all new entries being sent to the Smart Travel Lab.

3.3 Database Structure

The actual incident database in the Lab is built on 5 different tables. Each unique incident recorded at the HRSTC is given a unique ID number (named the TMS call number) that is used to join the tables in the database. The structure of the database is shown in Figure 3-2.

Figure 3-2: Structure of incident database tables.

 

3.3.1 Incident Table

The main table in the database is the incident table (hr.incident). This table contains important information on the beginning time and date of the incident along with the ending time and date. A single entry in the start and end fields contains both the date and time together, in the format MM/DD/YYYY HH24:MI. The duration of the incident is defined as the difference between the start and end times.
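As a minimal sketch of the duration calculation, the start and end entries can be parsed and subtracted. The function name `duration_minutes` and the sample timestamps are hypothetical. The format string is the Python equivalent of MM/DD/YYYY HH24:MI.

```python
from datetime import datetime

# Python equivalent of the MM/DD/YYYY HH24:MI entry format.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"


def duration_minutes(start, end):
    """Difference between the end and start entries, in minutes."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(start, TIMESTAMP_FORMAT)
    return delta.total_seconds() / 60.0


# An incident that spans midnight is handled because the date is part of the entry.
print(duration_minutes("01/15/1999 23:40", "01/16/1999 00:25"))
```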

The type of incident is also recorded in this table. The options for the incident type are

TEOC refers to VDOT’s Transportation Emergency Operations Center, a statewide coordinating unit that informs VDOT agencies and the public on significant weather conditions that may affect traffic conditions (VDOT, 2000). These TEOC entries only represent about 2 percent of the total entries in the database.

The next field in the table is the weather. The person entering the incident into the database has the choices of

This field is subjective from the operators’ perspective and is often left blank during data entry.

The next field is the detection source. This is the manner in which the traffic manager was informed of the incident. Some of the possible detection methods are

In the database there are a total of over 1000 unique entries for this field. Sometimes a specific police officer or camera operator is called by name in the field. Also, sometimes the camera number is recorded. Again, sometimes the entry person leaves this field blank.

The final field in the incident table is a text description of the incident. This field is considered optional and is only used by the operator to provide additional useful information that was not included in the other entries. Often, this description field is used to provide police codes on the specific type of incident, or the presence of personal injuries. Also, the database only stores a certain number of characters for this field, but the operator can enter as long a description as needed. Thus, some description entries are cut off in mid sentence and some information is not stored in the database. Overall, it is not possible to search this field for information due to the inconsistent nature of the entries.

 

3.3.2 Agency Table

This table (hr.inc_agency) lists the specific agencies that responded to the scene during the course of the incident. The entries are listed in alphabetical order in the database, so it is not clear which agency was the first on-scene. The amount of time each agency was on-scene is also not reported. This table is connected to the main incident table through the unique TMS call number. Some of the most frequently recorded responding agencies are

Overall, there are over 100 unique entries for the responding agency. In some cases the actual city is listed instead of local police. Other entries are unidentifiable or include specific names of police officers or FIRT personnel.

 

3.3.3 Assist Table

The next table in the incident database is the assist table (hr.inc_assist). This table lists the specific assistance given by the FIRT team on-scene or the traffic operators back at the HRSTC. Some of the most frequently recorded entries for the assistance are

This field also suffers from many different entries; there are over 1000 unique entries in the database. Many of these entries are similar, but each operator has their own personal phrases or spellings that they use to enter the assistance. It should be noted that the assistance listed applies only to FIRT and the HRSTC and not to the other responding agencies such as EMS or police.

 

3.3.4 Automobile Table

This table (hr.inc_automobile) lists the specific automobiles that were involved in the incident. Two fields record the make and model of each vehicle. In some cases, a tractor-trailer is listed as the vehicle make, but in other cases the truck is listed by the specific make and model such as a Volvo 5100. This situation is also present for other types of vehicles such as motorcycles, buses, and emergency equipment. The license plate number and originating state are recorded at the HRSTC, but the plate number is stripped when the data is passed along to the Smart Travel Lab. Also included in this table is the towing company that was used for each vehicle. Again, the entries in the automobile table are joined to the main incident table through the unique TMS call number.

 

3.3.5 Roadway Table

This table (hr.inc_roadway) is concerned with the location of the incident as opposed to the incident characteristics. However, the location entries in this table are joined to the incident characteristics in the main incident table through the unique TMS call number. The first field in the table is the specific road or interstate where the incident occurred. The choices for this entry are

The interstates are straightforward, but the others are not. In the Hampton Roads area, there are two interstate river crossings: the Hampton Roads Bridge Tunnel (HRBT) and the Monitor Merrimac Memorial Bridge Tunnel (MMMBT). The HRBT is used by I-64, while the MMMBT is used by I-664. However, if an incident occurs on either of these crossings, the road entry is given only as bridge/tunnel. Thus, it is impossible to differentiate between incidents on the HRBT and the MMMBT. The off-highway entry is unclear, but probably refers to incidents that occurred on major arterial roads in the region that interchange with the interstates, and thus may cause back-ups on some of the interstate exit ramps. It should also be noted that the road entry does not say whether the incident occurred on the main lanes or the HOV lanes of the road.

The road direction is also given in this table to differentiate between the two opposing travel lanes. The lane field states which travel lanes are affected by the incident. In this field, some entries state the lanes by name (left and center lanes) or by lane number (1 and 2) depending on the method preferred by the operator. In addition, the lane field includes shoulder lanes, ramps, and reversible lanes. The lane field is the place where main lines are differentiated from HOV lanes.

Perhaps the most important field in this table is the specific location of the incident. The HRSTC has defined specific sections of the interstates as different zones, using names such as W64-01 (see Appendix D for a complete map). The east or west part of the zone name does not refer to the direction of travel, but rather to whether the zone is located east or west of the large I-64/I-264 interchange in Norfolk. The zone boundaries are the interchanges along the roadway, so a location zone may be 1 to 2 miles in length and refers to both directions of travel. This is the most specific location of an incident that is available in the incident database.

 

3.3.6 Location Table

The location table is the only table in the incident database that is not joined to the main table through the unique TMS call number. This table instead is joined to the roadway table through the unique location zone name. The table lists some important information about each location zone. The road and corresponding city of the location is given in two fields. The text description field gives the name of the two interchanges that bound the location. The final field is for the HRSTC and tells which traffic cameras are located within the zone.

 

3.4 Data Collection

The Incident Database from the Smart Travel Lab includes all types of incidents from January of 1997 and is updated daily. This project uses accident data up to the end of December 2000, which gives a total of 7,396 unique freeway accidents.

 

3.4.1 Data Reduction

As with any project that includes a large amount of data, the first step in analysis is to determine which data is of use to the project. This involves reducing the data by eliminating records that cannot be used. A number of accidents had missing values in the database, especially in the automobile and weather fields, and were thus removed from the analysis. The focus of this project is on freeway accidents, so any other accidents were removed. Some entries in the incident database listed an 'off-highway' entry in the roadway field, and were excluded from the analysis. Other accidents were removed due to errors in the duration. Accidents that have a zero or negative value of duration are assumed to be entry errors by the HRSTC. Also, accidents with a duration greater than 12 hours were assumed to be operator error and removed from the analysis data. The 12 hour cutoff was used because it captures the case where the operator incorrectly enters the AM or PM part of the time.
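The reduction rules can be written as a single filter, sketched below. The record layout and field names (`weather`, `vehicles`, `roadway`, `duration_min`) are hypothetical and do not correspond to actual column names in the incident database. Note that an AM/PM slip shifts a time by exactly 12 hours, so a true 45 minute accident would be recorded as either 765 minutes or -675 minutes, and both are caught by the rules.

```python
MAX_DURATION_MIN = 12 * 60  # the 12 hour cutoff, in minutes


def keep_accident(record):
    """Return True when a record passes the reduction rules described above."""
    required = ("weather", "vehicles", "roadway", "duration_min")
    # Records with missing values are removed.
    if any(record.get(field) in (None, "", []) for field in required):
        return False
    # Only freeway accidents are of interest.
    if record["roadway"] == "off-highway":
        return False
    # Zero, negative, and over-12-hour durations are treated as entry errors.
    return 0 < record["duration_min"] <= MAX_DURATION_MIN


sample = {"weather": "Clear", "vehicles": ["Ford"], "roadway": "I-64", "duration_min": 45}
print(keep_accident(sample))
print(keep_accident({**sample, "duration_min": 45 + 720}))
```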

After data reduction, there are 6,828 accidents that are assumed to be valid in terms of the clearance time and characteristics. This population of accidents is divided into learning and testing samples. The accidents in the learning sample will be used for model development and calibration, while the testing sample accidents will be used to evaluate the performance of the forecasting model. The testing sample is comprised of one-quarter of the accident population or 1707 accidents. Thus, the learning sample consists of 5121 accidents. It should be noted that the accidents for the testing sample were chosen chronologically rather than randomly. The testing sample represents the most recent accidents of the total population. Normally, a random sample of the total population is used for the testing sample. However, the goal of the forecasting models used in this study is to predict the clearance time of future accidents using knowledge from past accidents. For this reason, a chronological division was used for the learning and testing samples.
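A chronological division of this kind can be sketched in a few lines. The function name `chronological_split` and the `start` field are assumptions. The sketch only shows that the most recent quarter of the records becomes the testing sample.

```python
def chronological_split(records, test_fraction=0.25):
    """Split records so the most recent ones form the testing sample."""
    ordered = sorted(records, key=lambda r: r["start"])
    n_test = int(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Stand-in records: integers play the role of accident start times.
records = [{"start": t} for t in range(6828)]
learning, testing = chronological_split(records)
print(len(learning), len(testing))
```

With 6,828 records this reproduces the sample sizes given above, 5121 for learning and 1707 for testing, and every testing accident starts after every learning accident.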

 

3.5 Potential Independent Variables

The goal of a forecasting model is to emulate a relationship between the dependent and independent variables. For this example, the dependent variable is the duration of the accident. Numerous independent variables are possible from the large amount of data recorded in the incident database for each accident. Table 3-1 gives a summary of the independent variables considered for the forecasting models. All of the independent variables are categorical with 2 or 3 possible values.

 

 

Table 3-1: Potential model independent variables.

| Category | Variable | Name | Value |
|---|---|---|---|
| Physical | Time of Day | PEAK | 1 = Peak (6-8am, 4-6pm); 0 = Off-peak |
| Physical | Day of the Week | WEEKDAY | 1 = Weekday; 0 = Weekend |
| Physical | Weather | WEATHER | 1 = Normal (Clear, Cloudy, Cool, Hot/Humid, Warm); 0 = Adverse (Cold/Ice, Fog, Natural Disaster, Rain, Sleet, Snow) |
| Response | EMS Response | EMS | 1 = Yes; 0 = No |
| Response | Fire Response | FIRE | 1 = Yes; 0 = No |
| Response | FIRT Response | FIRT | 1 = Yes; 0 = No |
| Response | Hazardous Material Agency | HAZMAT | 1 = Yes; 0 = No |
| Response | Police Response | POLICE | 1 = Yes; 0 = No |
| Response | VDOT Response | VDOT | 1 = Yes; 0 = No |
| Response | Tow Truck Response | TOW | 1 = Yes; 0 = No |
| Vehicle | Number of Vehicles | NUMVEH | 1 = Single Vehicle; 2 = Two Vehicles; 3 = Three or More Vehicles |
| Vehicle | Truck Involvement | TRUCK | 1 = Yes; 0 = No |
| Vehicle | Passenger Bus Involvement | BUS | 1 = Yes; 0 = No |

 

3.5.1 Physical Independent Variables

The physical independent variables describe the nature of the accident in terms of time and place. The first independent variable is the accident time of day. This variable has possible values of peak and off-peak. In this case, the peak hours are defined as 6am to 8am inclusive and 4pm to 6pm inclusive. Thus, off-peak hours are 8:01am to 3:59pm and 6:01pm to 5:59am. These peak hours were chosen because they correspond to the hours of operation for the High Occupancy Vehicle (HOV) reversible lanes that run along the median of I-64 in the region. The next variable is the day of the week. This variable has possible values of weekday (Monday to Friday) or weekend (Saturday and Sunday). For both the time of day and day of week, the variable value is based on the start time of the accident regardless of the duration. A final physical variable is the weather, which takes on values of normal or adverse. Adverse weather is defined as fog, rain, ice, snow, sleet, or natural disaster. Normal weather includes all other conditions.
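The coding of the two time-based variables from the accident start time can be sketched as below. The function names are hypothetical, and the start time is assumed to be recorded to the minute, as in the incident table.

```python
from datetime import datetime, time


def peak(start):
    """1 if the start time falls in 6-8am or 4-6pm (inclusive), else 0."""
    t = start.time()
    morning = time(6, 0) <= t <= time(8, 0)
    evening = time(16, 0) <= t <= time(18, 0)
    return int(morning or evening)


def weekday(start):
    """1 for Monday to Friday, 0 for Saturday and Sunday."""
    return int(start.weekday() < 5)  # Monday is 0, Sunday is 6


start = datetime(2000, 3, 6, 8, 0)  # a Monday, exactly 8:00am
print(peak(start), weekday(start))
```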

 

3.5.2 Vehicle Independent Variables

The vehicle independent variables attempt to provide information about the number and types of vehicles involved in the accident. The number of vehicles variable has three different values: single vehicle, two vehicles, and three or more vehicles. A number of other variables were used to reflect the types of vehicles involved in the accident. The involvement of a truck or tractor-trailer will give the truck variable a yes value. Similarly, the involvement of a passenger bus will give the bus variable a yes value. The assumption is that the accident only involved passenger automobiles unless otherwise noted by the truck and bus variables.

 

3.5.3 Accident Response Independent Variables

Another important independent variable related to accident clearance time is which emergency agencies responded to the scene. These variables give some sense of the severity of the accident. Binary variables were used to note the response of EMS, Fire Department, FIRT, Hazardous Material Agency, Police (local and state), Virginia Department of Transportation (VDOT) personnel, and tow-trucks. It should be noted that there are no variables to distinguish the response order or time of the above agencies.

 

3.6 ANOVA Significance Test

The above independent variables were identified from the available accident data. It is possible that some of the independent variables are not significant with regard to affecting accident clearance time. For example, some of the variables may have an influence on factors other than clearance time, such as accident frequency, accident severity, and accident detection time. Thus, it was necessary to perform statistical significance tests on the independent variables using ANOVA for the proposed dependent variable of accident clearance time.

Analysis of variance (ANOVA) refers to a collection of experimental situations and statistical procedures for the analysis of quantitative responses from experimental units (Devore, 1995). A single-factor ANOVA table analyzes data from two or more population samples where one factor is used to differentiate between the populations. The null hypothesis being tested is given below (Devore, 1995).

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_I $$

versus the alternative hypothesis

$$ H_a: \text{at least two of the } \mu_i \text{ are different} $$

where

$I$ = the number of samples being compared

$\mu_1$ = the mean of sample 1 when a single factor is applied to the population

$\mu_I$ = the mean of sample I when a single factor is applied to the population

This ANOVA is easily able to handle samples with different sample sizes. The major assumption with single-factor ANOVA is that each sample of the population is normally distributed with the same variance (Devore, 1995). However, Ozbay and Kachroo used ANOVA tables successfully to test variable significance and even found a normal distribution for incident duration. As data sets of incidents were divided into smaller data sets so that the incidents were all of the same severity and nature, a normal distribution trend was found and confirmed by statistical tests (Ozbay and Kachroo, 1999). Thus, the ANOVA assumption appears to be valid.

For this project, an ANOVA table was applied to each independent variable. For example, for the time of day independent variable, the single factor used was peak versus off-peak. The total population of accidents was divided into two samples, peak accidents and off-peak accidents. Each sample has a sample mean and sample variance. The ANOVA test compared the two samples to determine if the underlying mean of each sample was the same (null hypothesis) or significantly different (alternative hypothesis). The output of the ANOVA table is a p-value, which is the smallest level of significance (α) at which the null hypothesis (that the two sample means are the same) can be rejected (Devore, 1995). Common levels of significance are 0.05 and 0.01. If the p-value is less than or equal to the level of significance, the null hypothesis is rejected and we can say that the two samples have different means and the independent variable is significant in terms of clearance time. Thus, the clearance time of an accident is assumed to be dependent on all significant variables.
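The single-factor ANOVA computation can be sketched in plain Python. The function name `anova_f` and the sample durations are invented for illustration and are not taken from the accident data. The sketch stops at the F statistic, since converting it to a p-value requires the F distribution, which a statistics package can supply (for example, `scipy.stats.f_oneway` returns both the statistic and the p-value).

```python
def anova_f(samples):
    """Single-factor ANOVA: return (F, df_between, df_within).

    samples is a list of lists, one list of durations per factor level,
    and the lists may have different sizes.
    """
    values = [x for sample in samples for x in sample]
    n, groups = len(values), len(samples)
    grand_mean = sum(values) / n
    means = [sum(sample) / len(sample) for sample in samples]
    # Variation of the sample means around the grand mean.
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Variation of the observations around their own sample mean.
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_between, df_within = groups - 1, n - groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented clearance times in minutes for two factor levels.
peak_times = [1.0, 2.0, 3.0]
off_peak_times = [2.0, 3.0, 4.0]
print(anova_f([peak_times, off_peak_times]))
```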

The ANOVA table was applied to each independent variable and the corresponding p-values are given below. The full ANOVA results for each independent variable are given in Appendix A.

 

Table 3-2: Independent variable significance test results.

| Independent Variable | ANOVA p-value |
|---|---|
| PEAK | 2.30 × 10^-5 |
| WEEKDAY | 7.83 × 10^-6 |
| WEATHER | 0.235 |
| EMS | 1.49 × 10^-66 |
| FIRE | 1.31 × 10^-60 |
| FIRT | 0.958 |
| HAZMAT | 6.21 × 10^-28 |
| POLICE | 2.70 × 10^-31 |
| VDOT | 2.18 × 10^-11 |
| NUMVEH | 1.41 × 10^-44 |
| TRUCK | 1.95 × 10^-19 |
| BUS | 0.0440 |
| TOW | 1.80 × 10^-181 |

The ANOVA analysis shows that all of the independent variables are significant except for weather and FIRT response. The bus involvement variable is significant at a 0.05 level, but not at a 0.01 level. This borderline independent variable was nonetheless included in the forecasting models, because it intuitively appears to have a significant impact on accident clearance time. With a passenger bus, there is the possibility of numerous victims and a relatively large vehicle to remove from the accident scene.

 

3.7 Model Selection

The next step in the experimental framework is to select potential models for forecasting accident clearance time. There is a wide range of forecasting techniques that may be applicable to accident clearance time. This study will focus on three different forecasting models.

The first model to be evaluated is a stochastic model using probability density functions to describe clearance time. Past research on incident duration has shown that the duration of an incident can be modeled as a random variable using a lognormal or Weibull distribution. The second forecasting model is a nonparametric regression model. Nonparametric regression techniques have been used successfully to forecast other traffic conditions such as flow. The final forecasting model is a classification tree model. This model was chosen based on promising research performed recently using decision trees to predict incident clearance times. The next three chapters will outline the development of the three forecasting models and investigate the performance of each model.

 

 

Chapter 4: Stochastic Model

4.1 Model Background

Many events in nature are assumed to behave in some random manner. Even though the events are random, there may be tendencies and trends in the behavior of the events that can be used to describe the system as a whole. A stochastic model attempts to describe the randomness of the events (Higgins and Keller-McNulty, 1995). For example, flipping a coin produces two possible results in showing heads or tails. A single event of flipping the coin is not dependent on any factors, so there is a random outcome of heads or tails. However, over a long period of time and many trials, it is expected that the proportion of heads outcomes will be 50 percent. This is a simple example of a stochastic model of the random event of flipping a coin.

 

4.2 Probability Density Functions

One method to describe the behavior of a random event is through a probability density function. The probability distribution shows how probability density is distributed across the possible values of a random variable (Higgins and Keller-McNulty, 1995). The equation that describes a continuous random variable for a specific distribution is referred to as a probability density function.

Past research on incident duration has shown that the duration tends to show a Weibull or lognormal probabilistic distribution. One major deficiency with a number of these results is the relatively small sample size used to test different probability density functions. If such distributions are applicable, it may be stated that accident duration can be modeled as a random variable with a known distribution.

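A random variable is said to have a Weibull distribution if the probability density function of the random variable is given by (Devore, 1995)

$$ f(x; \alpha, \beta) = \frac{\alpha}{\beta^{\alpha}} \, x^{\alpha - 1} e^{-(x/\beta)^{\alpha}}, \qquad x \ge 0 $$

where $x$ is the value of the random variable, $\alpha$ is a shape parameter, and $\beta$ is a scale parameter.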

Likewise, a random variable is said to have a lognormal distribution if the natural logarithm of the variable has a normal distribution. The probability density function of a lognormal distribution is given by (Devore, 1995)

$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi} \, \sigma x} \, e^{-(\ln x - \mu)^{2} / (2\sigma^{2})}, \quad x > 0$$

where x is the value of the random variable, μ is a scale parameter, and σ is a shape parameter.
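To make the two density functions concrete, the following Python sketch evaluates the standard Weibull and lognormal densities with a shape and a scale parameter as described above. The function names are illustrative, the evaluation point of 30 is arbitrary, and the parameter values are the ones listed later in Table 4-1.

```python
import math


def weibull_pdf(x: float, alpha: float, beta: float) -> float:
    """Weibull density with shape alpha and scale beta."""
    if x <= 0:
        return 0.0  # simplification: the boundary point x = 0 is ignored
    return (alpha / beta**alpha) * x ** (alpha - 1) * math.exp(-((x / beta) ** alpha))


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    """Lognormal density, where ln(x) is normal with mean mu and std. dev. sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# Density of each candidate distribution at a clearance time of 30,
# using the Table 4-1 parameters for all accidents.
print(weibull_pdf(30.0, alpha=1.33, beta=43.3))
print(lognormal_pdf(30.0, mu=3.34, sigma=0.968))
```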

 

4.2.1 Goodness-of-fit Test

The assumption from past theoretical and quantitative research is that the Weibull or lognormal distributions can describe incident duration. This project has a large sample with which to verify or disprove this assumption using statistical goodness-of-fit tests (Ang and Tang, 1975). One common goodness-of-fit test, which will be applied to the above probability density functions, is the chi-square test.

The chi-square test compares the observed interval frequencies with the theoretical frequencies for the distribution to be tested (Ang and Tang, 1975). The test statistic is given as

$$\chi^{2} = \sum_{i=1}^{k} \frac{(n_i - e_i)^{2}}{e_i}$$

where k is the number of intervals, n_i is the observed frequency for the ith interval, and e_i is the theoretical frequency of the ith interval. The test statistic, χ², will approach the chi-square distribution χ²_f with f = k - 1 degrees of freedom (Ang and Tang, 1975). The critical value of the χ²_f distribution at the cumulative probability of 1 - α is denoted c_{1-α, f}, where α is referred to as the level of significance (Ang and Tang, 1975). Thus, the assumed distribution is an acceptable fit at the α significance level if χ² is less than c_{1-α, f}. Otherwise, the assumed theoretical distribution is not supported by the observed data (Ang and Tang, 1975).
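As a rough illustration of the chi-square test statistic described above, the following Python sketch counts observed frequencies in k intervals and compares them with the frequencies implied by a candidate cumulative distribution function. The interval layout (k - 1 cut points, with the first and last intervals open-ended so that the theoretical frequencies sum to the sample size), the function names, and the sample values are assumptions made for this example; the report does not state how its intervals were defined.

```python
import bisect
import math


def weibull_cdf(x: float, alpha: float, beta: float) -> float:
    """P(X <= x) for a Weibull variable with shape alpha and scale beta."""
    return 1.0 - math.exp(-((x / beta) ** alpha)) if x > 0 else 0.0


def chi_square_statistic(samples, cut_points, cdf) -> float:
    """Sum of (n_i - e_i)^2 / e_i over the k = len(cut_points) + 1 intervals.

    cut_points must be sorted; interval i covers (cut_points[i-1], cut_points[i]].
    Every interval is assumed to have a positive theoretical frequency.
    """
    n = len(samples)
    k = len(cut_points) + 1
    observed = [0] * k
    for x in samples:
        observed[bisect.bisect_left(cut_points, x)] += 1
    cumulative = [0.0] + [cdf(c) for c in cut_points] + [1.0]
    statistic = 0.0
    for i in range(k):
        expected = n * (cumulative[i + 1] - cumulative[i])
        statistic += (observed[i] - expected) ** 2 / expected
    return statistic


# Hypothetical clearance times and cut points, for illustration only.
samples = [5.0, 12.0, 18.0, 25.0, 33.0, 41.0, 58.0, 76.0]
cuts = [20.0, 40.0, 60.0]
print(chi_square_statistic(samples, cuts, lambda x: weibull_cdf(x, 1.33, 43.3)))
```

The resulting statistic would then be compared with the tabulated critical value for f = k - 1 degrees of freedom at the chosen significance level.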

 

4.3 Model Development

This chapter discusses a collection of stochastic models, each built for a particular accident characteristic. Two factors were tested to develop the models: accident severity and time of day. In addition, a stochastic model was developed for all accidents regardless of severity and time of day. The analysis emphasizes the Weibull and lognormal distributions. The ExpertFit program was used to select the optimal probability density function parameters for 30 possible distributions (Law, 2001). It is worth noting that either the lognormal or Weibull distribution was the best-fitting distribution in each case. Once the stochastic models were developed, the chi-square goodness-of-fit test was used to evaluate the fit of each probabilistic distribution.
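The report does not describe how ExpertFit selects parameters, so the following is not that program's procedure. As a minimal sketch of what parameter estimation can look like, the lognormal distribution has closed-form maximum-likelihood estimates: the scale parameter is the mean of the log-transformed sample, and the shape parameter is the 1/n standard deviation of the logs. The function name and the sample values below are hypothetical.

```python
import math
import statistics


def fit_lognormal(samples):
    """Return (mu, sigma) maximum-likelihood estimates for positive samples."""
    logs = [math.log(x) for x in samples]  # requires every sample > 0
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs, mu)  # population (1/n) form, as in the MLE
    return mu, sigma


clearance_times = [12.0, 25.0, 31.0, 47.0, 90.0]  # hypothetical values
mu_hat, sigma_hat = fit_lognormal(clearance_times)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

Weibull parameters have no comparable closed form and are usually found numerically, which is one reason to rely on a fitting package.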

 

4.3.1 Model for All Accidents

The simplest stochastic model for this project is one that models any accident, regardless of the accident characteristics. Figure 4-1 shows a histogram of the clearance time of all accidents.

 

Figure 4-1: Clearance time histogram for all accidents.

This histogram is concentrated toward the left, showing a definite tendency towards accidents with smaller clearance times. The ExpertFit program identified the Weibull distribution as the best candidate distribution. The parameters for the Weibull and lognormal probability density functions are given in Table 4-1.

 

Table 4-1: Distribution parameters for all accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 43.3           α = 1.33
Lognormal       μ = 3.34           σ = 0.968

Overlaying the two probability density functions on the original histogram gives a comparison between the two distributions.

Figure 4-2: Histogram and distribution overlay for all accidents.

It appears from the graph that the Weibull distribution is a better fit for the clearance time data. The chi-square test was used to test the assumption that the accident data follows the Weibull and lognormal distributions.

 

 

Table 4-2: Chi-square test for all accidents.

Number of samples, N            6,828
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      612.996
Lognormal test statistic, χ²    2,005.369

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

The chi-square results show that neither the Weibull nor the lognormal stochastic model adequately describes the clearance time values for all accidents. The next step was to introduce accident severity into the stochastic model.
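The decisions in Table 4-2 follow from applying the acceptance rule of Section 4.2.1 to the tabulated values. The following Python sketch restates that comparison; the numbers are copied from Table 4-2, while the variable and function names are illustrative.

```python
# Critical values for 39 degrees of freedom, keyed by significance level
# (copied from Table 4-2).
critical_values = {0.25: 44.539, 0.15: 48.126, 0.10: 50.660, 0.05: 54.572, 0.01: 62.428}

# Test statistics for all accidents (copied from Table 4-2).
test_statistics = {"Weibull": 612.996, "lognormal": 2005.369}


def is_acceptable_fit(statistic: float, critical_value: float) -> bool:
    """The assumed distribution is acceptable if the statistic is below the critical value."""
    return statistic < critical_value


for name, statistic in test_statistics.items():
    for level, critical in critical_values.items():
        verdict = "accept" if is_acceptable_fit(statistic, critical) else "reject"
        print(f"{name:9s} level={level:.2f} critical={critical:.3f} -> {verdict}")
```

Even at the 0.01 significance level, which has the largest critical value, the smaller of the two statistics exceeds the critical value by nearly a factor of ten.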

 

4.3.2 Model for Accident Severity

This stochastic model attempts to fit a probability density function to three different subcategories of accidents: single vehicle, two vehicle, and three or more vehicle accidents. The same procedure outlined for all accidents was used for these models. First, histograms were prepared for each category of accident severity.

Figure 4-3: Clearance time histogram for single vehicle accidents.

Figure 4-4: Clearance time histogram for two vehicle accidents.

Figure 4-5: Clearance time histogram for three or more vehicle accidents.

ExpertFit evaluated a large number of probabilistic distributions and found that the Weibull distribution was the best fit for all three of the histograms above. The parameters for the Weibull and lognormal distributions are given below.

 

Table 4-3: Distribution parameters for single vehicle accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 44.8           α = 1.30
Lognormal       μ = 3.37           σ = 0.980

 

 

Table 4-4: Distribution parameters for two vehicle accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 36.6           α = 1.27
Lognormal       μ = 3.15           σ = 1.00

 

Table 4-5: Distribution parameters for three or more vehicle accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 53.5           α = 1.72
Lognormal       μ = 3.64           σ = 0.779
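One way to read the Weibull parameters in Tables 4-3 to 4-5 is through the mean and median they imply. For a Weibull distribution with scale b and shape a, the mean is b Γ(1 + 1/a) and the median is b (ln 2)^(1/a). The following Python sketch applies these standard formulas to the tabulated values; the dictionary and function names are illustrative, and the results describe the fitted curves only, not the observed data.

```python
import math

# (scale, shape) pairs copied from Tables 4-3, 4-4, and 4-5.
weibull_params = {
    "single vehicle": (44.8, 1.30),
    "two vehicle": (36.6, 1.27),
    "three or more vehicle": (53.5, 1.72),
}


def weibull_mean(scale: float, shape: float) -> float:
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def weibull_median(scale: float, shape: float) -> float:
    """Median of a Weibull distribution: scale * (ln 2)^(1/shape)."""
    return scale * math.log(2.0) ** (1.0 / shape)


for category, (scale, shape) in weibull_params.items():
    print(f"{category}: mean={weibull_mean(scale, shape):.1f}, "
          f"median={weibull_median(scale, shape):.1f}")
```

The implied means follow the same order as the scale parameters: lowest for two vehicle accidents and highest for accidents involving three or more vehicles.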

For a visual comparison, the two distributions are overlaid on the histograms created above.

Figure 4-6: Histogram and distribution overlay for single vehicle accidents.

Figure 4-7: Histogram and distribution overlay for two vehicle accidents.

Figure 4-8: Histogram and distribution overlay for three or more vehicle accidents.

Using the distribution parameters, the models were evaluated with the chi-square test.

 

Table 4-6: Chi-square test for single vehicle accidents.

Number of samples, N            2,716
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      381.290
Lognormal test statistic, χ²    992.365

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

 

 

 

Table 4-7: Chi-square test for two vehicle accidents.

Number of samples, N            2,687
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      336.044
Lognormal test statistic, χ²    780.317

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

 

 

Table 4-8: Chi-square test for three or more vehicle accidents.

Number of samples, N            1,425
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      116.305
Lognormal test statistic, χ²    353.947

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

Again, these chi-square results show that the accident data does not support the Weibull or lognormal distribution for any of the three models. A full range of significance levels was tested, and in each case the data overwhelmingly rejected the assumed probabilistic distribution.

 

4.3.3 Model for Accident Time of Day

This stochastic model attempts to fit probabilistic distributions to accident clearance time based on the time of day. Three different categories of accidents were used: peak period weekday, off-peak period weekday, and weekend accidents. The histograms for the clearance times of each category are given below.

Figure 4-9: Clearance time histogram of peak weekday accidents.

Figure 4-10: Clearance time histogram of off-peak weekday accidents.

Figure 4-11: Clearance time histogram of weekend accidents.

Again, ExpertFit evaluated a number of different distributions and selected the Weibull distribution as the best candidate distribution for the three clearance time samples. The distribution parameters for the Weibull and lognormal distributions are given below.

 

Table 4-9: Distribution parameters for peak weekday accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 40.4           α = 1.34
Lognormal       μ = 3.26           σ = 0.973

 

 

Table 4-10: Distribution parameters for off-peak weekday accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 43.3           α = 1.32
Lognormal       μ = 3.34           σ = 0.958

 

Table 4-11: Distribution parameters for weekend accidents.

Distribution    Scale parameter    Shape parameter
Weibull         β = 46.7           α = 1.37
Lognormal       μ = 3.42           σ = 0.976

For a visual comparison, the two distributions are overlaid on the histograms created above.

Figure 4-12: Histogram and distribution overlay for peak weekday accidents.

Figure 4-13: Histogram and distribution overlay for off-peak weekday accidents.

Figure 4-14: Histogram and distribution overlay for weekend accidents.

Using these distribution parameters, the models were evaluated with the chi-square test.

 

 

Table 4-12: Chi-square test for peak weekday accidents.

Number of samples, N            1,797
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      195.810
Lognormal test statistic, χ²    494.130

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

 

Table 4-13: Chi-square test for off-peak weekday accidents.

Number of samples, N            3,384
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      337.631
Lognormal test statistic, χ²    874.770

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

 

 

Table 4-14: Chi-square test for weekend accidents.

Number of samples, N            1,647
Number of intervals, k          40
Degrees of freedom, f           39
Weibull test statistic, χ²      227.706
Lognormal test statistic, χ²    676.910

Significance level, α    Critical value, c_{1-α, f}    Accept Weibull distribution?    Accept lognormal distribution?
0.25                     44.539                        No                              No
0.15                     48.126                        No                              No
0.10                     50.660                        No                              No
0.05                     54.572                        No                              No
0.01                     62.428                        No                              No

As with the previous stochastic models, the Weibull and lognormal distributions are rejected based on the available clearance time data.

 

4.4 Summary

In this chapter, accident clearance time data was used to produce a number of stochastic models. Unfortunately, no stochastic models were applied to future accident scenarios, because no probabilistic distribution could be fit accurately to the accident data. It is possible that some of the variance in accident clearance time may be explained by more specific accident characteristics. The next chapter investigates two deterministic models that incorporate independent variables gathered from accident characteristics in the incident database.

 

 

Chapter 5: Nonparametric Regression Model

5.1 Model Development

The second forecasting model developed for this study was a nonparametric regression model. This model attempts to emulate a deterministic relationship between the accident characteristics and the clearance time. The nonparametric regression model was presented in gener