Algorithm construction of environmental perception and hazard prediction
Convolutional neural networks (CNN) and recurrent neural networks (RNN) are used to complete the task of environmental perception. The model, based on an LSTM variant, gives different weights to different features. It can not only adapt to complex backgrounds but also deal with multiple targets. In addition, the scene can be fully described by the end-to-end expression model proposed by Northwest University of Science and Technology in 2018.
Hazard prediction is divided into two parts, target detection and hazard degree prediction, of which target detection is an application scenario of deep learning. Compared with traditional algorithms, deep-learning-based algorithms have obvious advantages in detection accuracy and efficiency. An improved SSD-based target detection algorithm is proposed in this paper.
Extracting the feature information of important objects in the traffic scene is the starting point of the work. Based on supervised learning, the attribute set is trained by multi-label classification, and attribute prediction is carried out by training a deep convolutional neural network with the corresponding loss function.
The supplementary description of environmental perception belongs to the category of image semantic recognition, and the method used belongs to the 'end-to-end' category.
The work of feature extraction is completed by a CNN classification model. After classification, the result is represented by LSTM, which is an RNN variant model. It is particularly important to note that the LSTM model receives not only the extracted image features but also related information such as color, the focus range of attention, and location. The characteristic of this method lies in dividing attention by color and weighting attention regions appropriately. The so-called color attention weight detects areas of the image where the same color is relatively concentrated or where color changes sharply, especially red and other colors with significant contrast. The detection is realized by RGB color coding.
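The paper does not give the exact weighting rule, so the following is only a minimal pure-Python sketch of one plausible RGB-based rule, where a pixel's weight grows with how far its red channel dominates the other two channels; the function name and the normalization are our own assumptions.

```python
def red_attention_weights(pixels):
    """Toy color-attention weighting for a list of RGB pixels.

    Illustrative rule only: a pixel's "redness" is how far its red
    channel exceeds the larger of its green and blue channels;
    weights are normalized to [0, 1].
    """
    redness = [max(0, r - max(g, b)) for r, g, b in pixels]
    peak = max(redness)
    return [x / peak for x in redness] if peak > 0 else redness

# A red traffic-light pixel, a gray road pixel, and a dim red pixel.
weights = red_attention_weights([(255, 0, 0), (128, 128, 128), (180, 60, 60)])
```

With this rule the saturated red pixel receives the full weight, the gray pixel receives none, and the dim red pixel falls in between.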
Description of the model
LSTM is a special form of RNN whose structure has a storage unit for storing events with certain intervals and delays during the training process. The storage unit shown in Fig. 4 regularly balances its content, and the trade-off is controlled by four gates. A feature-based weight unit is generated during the gate-control phase. Besides, the hidden-layer state of the previous node and the image features extracted by the CNN are input to the unit, and the stimulation features are analyzed by machine vision.
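As an illustration of the gate-controlled trade-off described above, here is a minimal single-unit LSTM step in plain Python. The weights, dimensions, and inputs are hypothetical, not the paper's actual model; the four gate transforms correspond to the "four gates" controlling the storage unit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell (illustrative sketch).

    x: scalar input feature (e.g. a CNN image feature);
    h_prev, c_prev: previous hidden and cell (storage-unit) state.
    The dict `w` holds one (w_x, w_h, bias) triple per gate.
    """
    def gate(name, squash):
        wx, wh, b = w[name]
        return squash(wx * x + wh * h_prev + b)

    i = gate("input", sigmoid)     # input gate
    f = gate("forget", sigmoid)    # forget gate
    o = gate("output", sigmoid)    # output gate
    g = gate("cell", math.tanh)    # candidate content
    c = f * c_prev + i * g         # storage-unit trade-off
    h = o * math.tanh(c)           # new hidden state
    return h, c

# Hypothetical shared weights, just to run one step.
w = {k: (0.5, 0.5, 0.0) for k in ("input", "forget", "output", "cell")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=w)
```

The hidden state stays bounded in (−1, 1) because it is an output-gated tanh of the cell state.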
During the encoding phase, pictures and labels exist as vectors in the hidden-layer state. Features are extracted from each image with a trained VGG16 model. At the same time, the label vector is input into the LSTM model through a matrix transformation. During the decoding phase, the maximum probability is obtained by multiplying the feature layer of the last hidden layer by the seventh fully connected layer. After comparison, the model outputs the description it considers the best match.
Theoretical framework of SSD
In the original SSD paper, the following structure is presented. SSD detects using a feature pyramid structure that takes the feature maps of conv4_3, conv6_2, conv7, conv7_2, conv8_2 and conv9_2. At the same time, position regression and softmax classification are performed. Figure 5 demonstrates that SSD can use VGG16 as the base network. The feature extraction layers in the second half are also used for prediction. In addition, detection is performed not only on the additional feature maps but also on the underlying conv4_3 and conv7 feature maps, to achieve compatibility with small targets.
There are three core design concepts of SSD, as follows:

(a)
There are two kinds of feature maps in multi-scale feature mapping: large feature maps are responsible for small targets, and small feature maps are responsible for large targets.

(b)
The detection output is extracted from the feature map directly by convolution, so that a large feature map can be processed with a relatively small convolution kernel.

(c)
Setting prior boxes: each cell generates prior boxes with different sizes and aspect ratios. Serving as the baseline of the bounding box, the prior boxes are generated in multiple ways during the training process.
Taking VGG16 as the base model, SSD transforms the fully connected layers into a 3 × 3 convolution layer Conv6 and a 1 × 1 convolution layer Conv7, and changes pool5 from 2 × 2 to 3 × 3. Then the FC8 and dropout layers are replaced by a series of convolution layers, which are fine-tuned on the detection set. The conv4_3 layer of VGG16, with a size of 38 × 38, serves as the first feature map for detection. However, the activations of this layer are large, so an L2 normalization layer is applied to it.
Five feature maps are extracted from the new layers, namely Conv7, Conv8_2, Conv9_2, Conv10_2 and Conv11_2; together with the original conv4_3 layer, six feature maps are used in total. Their sizes are (38, 38), (19, 19), (10, 10), (5, 5), (3, 3) and (1, 1). They have prior boxes of different sizes and aspect ratios. What's more, as the size of the feature map increases, the prior box size decreases.
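The relationship between feature-map size and prior box size can be sketched with the common SSD scale rule, in which scales are linearly spaced across the feature maps. The s_min and s_max values (0.2 and 0.9) are the defaults from the original SSD paper and an assumption here, since this paper does not state its exact values.

```python
def prior_scales(m, s_min=0.2, s_max=0.9):
    """Prior-box scale for each of m feature maps, linearly spaced
    in [s_min, s_max]. Earlier (larger) maps get smaller scales."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]

# The six feature-map sizes used above, from (38, 38) down to (1, 1).
feature_sizes = [38, 19, 10, 5, 3, 1]
scales = prior_scales(len(feature_sizes))
```

The large 38 × 38 map pairs with the smallest scale (small objects), and the 1 × 1 map with the largest.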
The results, namely the category confidences and bounding box positions, are obtained by convolving the feature maps, each with a 3 × 3 convolution. The essence of SSD is dense sampling.
Algorithm training and improvement
Training
Prior box matching
Before training, the prior boxes matching a target or part of a target are retrieved, and the matched bounding boxes enter the prediction phase. The first step of prior box matching is to ensure that each target has at least one matching prior box; a prior box with a corresponding target becomes a positive sample, otherwise it is a negative sample. Secondly, if a remaining negative sample has a matching degree with some target greater than a threshold (generally 0.5), that sample also becomes a positive sample. Moreover, a target may have multiple prior boxes that are not necessarily perfectly matched, but one prior box cannot correspond to multiple targets.
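The two matching steps above can be sketched as follows. This is a simplified illustration; the (x1, y1, x2, y2) box format and the helper names are our own.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_priors(priors, targets, threshold=0.5):
    """Label each prior with a matched target index, or -1 (negative).

    Step 1: each target claims its best-overlapping prior.
    Step 2: any remaining prior with IoU > threshold to some target
    also becomes positive. A prior matches at most one target.
    """
    labels = [-1] * len(priors)
    for j, t in enumerate(targets):                      # step 1
        best = max(range(len(priors)), key=lambda i: iou(priors[i], t))
        labels[best] = j
    for i, p in enumerate(priors):                       # step 2
        if labels[i] == -1:
            ious = [iou(p, t) for t in targets]
            if max(ious) > threshold:
                labels[i] = ious.index(max(ious))
    return labels

priors = [(0, 0, 10, 10), (50, 50, 60, 60), (0, 0, 9, 9)]
targets = [(0, 0, 10, 10)]
labels = match_priors(priors, targets)
```

Here the first prior is claimed directly by the target, the third passes the 0.5 IoU threshold, and the distant second prior stays negative.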
Loss function
The loss function can be understood as the weighted sum of confidence and position error:
$$L\left(x,c,l,g\right)=\frac{1}{N}\left[L_{conf}\left(x,c\right)+\alpha L_{loc}\left(x,l,g\right)\right]$$
(8)
where N is the number of positive samples and \(x_{ij}^{p}\in \left\{1,0\right\}\) is an indicator parameter; when \(x_{ij}^{p}=1\), the i-th prior box matches the j-th target with category p. c is the category confidence prediction, l is the predicted position, i.e., the position of the boundary of the target selected in the prior box, and g represents the ground-truth position parameters. The position error in the loss function only considers positive samples, and is defined by the smooth L1 loss as follows:
$$L_{loc}\left(x,l,g\right)={\sum }_{i\in Pos}^{N}{\sum }_{m\in \left\{cx,cy,w,h\right\}}x_{ij}^{k}\,smooth_{L1}\left(l_{i}^{m}-\hat{g}_{j}^{m}\right)$$
(9)
$$\widehat{g}_{j}^{cx}=\frac{g_{j}^{cx}-d_{i}^{cx}}{d_{i}^{w}}$$
(10)
$$\widehat{g}_{j}^{cy}=\frac{g_{j}^{cy}-d_{i}^{cy}}{d_{i}^{h}}$$
(11)
$$\widehat{g}_{j}^{w}=\log\frac{g_{j}^{w}}{d_{i}^{w}}$$
(12)
$$\widehat{g}_{j}^{h}=\log\frac{g_{j}^{h}}{d_{i}^{h}}$$
(13)
$$smooth_{L_{1}}(x)=\begin{cases}0.5x^{2}, & \text{if } \left|x\right|<1\\ \left|x\right|-0.5, & \text{otherwise}\end{cases}$$
(14)
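Eq. (14) translates directly into code; the function below is a one-line-per-branch rendering of the piecewise definition.

```python
def smooth_l1(x):
    """Smooth L1 loss of Eq. (14): quadratic near zero (|x| < 1),
    linear beyond, so outliers do not dominate the gradient."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5
```

At |x| = 1 both branches agree (0.5), so the loss is continuous with a bounded gradient.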
The parameters are as follows:
$$\widehat{g}_{j}^{cx}=\frac{\left(g_{j}^{cx}-d_{i}^{cx}\right)/d_{i}^{w}}{\text{variance}}$$
(15)
$$\widehat{g}_{j}^{cy}=\frac{\left(g_{j}^{cy}-d_{i}^{cy}\right)/d_{i}^{h}}{\text{variance}}$$
(16)
$$\widehat{g}_{j}^{w}=\frac{\log\left(g_{j}^{w}/d_{i}^{w}\right)}{\text{variance}}$$
(17)
$$\widehat{g}_{j}^{h}=\frac{\log\left(g_{j}^{h}/d_{i}^{h}\right)}{\text{variance}}$$
(18)
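Eqs. (10)–(18) can be combined into one encoding routine. The concrete variance values used below (0.1 for the centers, 0.2 for the sizes) are the common SSD defaults and an assumption here, since the text writes only a generic 'variance'.

```python
import math

def encode_box(g, d, variances=(0.1, 0.2)):
    """Encode ground-truth box g against prior box d per Eqs. (10)-(18).

    g, d: boxes as (cx, cy, w, h). Center offsets are normalized by the
    prior's size, log-ratios encode the sizes, and both are divided by
    the variance terms.
    """
    gcx, gcy, gw, gh = g
    dcx, dcy, dw, dh = d
    return (
        (gcx - dcx) / dw / variances[0],   # Eq. (15)
        (gcy - dcy) / dh / variances[0],   # Eq. (16)
        math.log(gw / dw) / variances[1],  # Eq. (17)
        math.log(gh / dh) / variances[1],  # Eq. (18)
    )

enc = encode_box((1.0, 2.0, 4.0, 4.0), (0.0, 0.0, 2.0, 2.0))
```

Dividing by the variances rescales the regression targets so the four components have comparable magnitudes during training.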
For confidence error, it adopts softmax loss:
$$L_{conf}\left(x,c\right)=-{\sum }_{i\in Pos}^{N}x_{ij}^{p}\log\left(\frac{\exp\left(c_{i}^{p}\right)}{\sum_{p}\exp\left(c_{i}^{p}\right)}\right)-\sum_{i\in Neg}\log\left(\widehat{c}_{i}^{0}\right)$$
(19)
Improvement based on focal loss
The main reason why single-stage detection is not as accurate as two-stage detection is the imbalance of sample categories. Category imbalance brings too many negative samples, which account for most of the loss function. Therefore, focal loss is proposed as a new loss function. It modifies the standard cross-entropy loss shown in Fig. 6. This function down-weights the samples that are easy to classify by changing the evaluation method, so as to apply more weight to the samples that are difficult to classify during the training process. The formula is as follows:
$$FL\left(p_{t}\right)=-\alpha_{t}\left(1-p_{t}\right)^{\gamma}\log\left(p_{t}\right)$$
(20)
Firstly, a modulating factor is added to the original standard cross-entropy loss, thereby reducing the loss of easily classified samples. This makes the training pay more attention to difficult and misclassified samples. For example, with γ = 2, for a positive sample with a prediction result of 0.95, the value of the loss function becomes very small because the power of (1 − 0.95) is very small. However, for a sample with a predicted probability of only 0.3, the loss remains relatively large; the effect is achieved by suppressing the loss of easily classified samples.
Therefore, the new method pays more attention to these hard-to-distinguish samples. In this way, the influence of simple samples is reduced, and the effect becomes significant only when a large number of samples with low prediction probability are accumulated. Meanwhile, more penalty is applied to easily distinguishable negative samples. The actual formula is as follows:
$$L_{fl}=\begin{cases}-\alpha\left(1-y'\right)^{\gamma}\log y', & y=\text{positive sample}\\ -\left(1-\alpha\right)y'^{\gamma}\log\left(1-y'\right), & y=\text{negative sample}\end{cases}$$
(21)
In the experiments, γ = 2 and α = 0.25 give the best effect.
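A minimal per-sample implementation of Eq. (21), compared against plain α-balanced cross entropy, shows how the easy sample (prediction 0.95) is down-weighted far more strongly than the hard one (prediction 0.3).

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss of Eq. (21) for one sample.

    p: predicted probability of the positive class; y: 1 or 0.
    gamma=2 and alpha=0.25 are the values reported as best above.
    """
    if y == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)

def ce_loss(p, y, alpha=0.25):
    """Alpha-balanced cross entropy, for comparison."""
    return -alpha * math.log(p) if y == 1 else -(1 - alpha) * math.log(1 - p)

# Easy positive (p = 0.95): the (1 - p)^gamma factor crushes the loss.
easy_fl, easy_ce = focal_loss(0.95, 1), ce_loss(0.95, 1)
# Hard positive (p = 0.3): much less down-weighted.
hard_fl, hard_ce = focal_loss(0.3, 1), ce_loss(0.3, 1)
```

The hard/easy loss ratio is far larger under focal loss than under cross entropy, which is exactly the re-weighting effect described in the text.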
Improvement based on KL loss
The traditional bounding box regression loss (i.e., smooth L1 loss) does not take the deviation from the ground-truth boundary into consideration. When the classification score is very high, the regression is assumed to be accurate, but this is not always the case.
The bounding box prediction is modeled as a Gaussian distribution, and the ground-truth bounding box of positive samples is modeled as a Dirac delta function. The dissimilarity of these two distributions is measured by KL divergence; when the KL divergence approaches 0, the two distributions are very similar. KL loss minimizes the KL divergence between the Gaussian distribution predicted for the bounding box and the Dirac delta distribution of positive samples. In other words, KL loss makes the bounding box prediction approximate a Gaussian distribution close to the positive samples, and it converts the confidence into the standard deviation of the bounding box prediction.
For two probability distributions P and Q of a discrete or continuous random variable, the KL divergence is defined as:
$$D(P\|Q)={\sum }_{i\in X}P(i)\log\left(\frac{P(i)}{Q(i)}\right)$$
(22)
$$D(P\|Q)={\int }_{x}P(x)\log\left(\frac{P(x)}{Q(x)}\right)dx$$
(23)
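A small sketch of the discrete case, Eq. (22): identical distributions give zero divergence, while mismatched ones give a positive value.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(P || Q) of Eq. (22).

    Terms with P(i) = 0 contribute nothing, so they are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
same = kl_divergence(p, p)            # identical distributions
diff = kl_divergence(p, [0.9, 0.1])   # mismatched distributions
```

Note that D(P||Q) is not symmetric, which is why the text is careful about which distribution (prediction vs. ground truth) goes in which slot.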
Before calculating the KL divergence, the bounding box needs to be parameterized. \(\left(x_{1},y_{1},x_{2},y_{2}\right)\) are the upper-left and lower-right coordinates of the predicted bounding box, \(\left(x_{1}^{*},y_{1}^{*},x_{2}^{*},y_{2}^{*}\right)\) are the upper-left and lower-right coordinates of the ground-truth box, and \(\left(x_{1a},y_{1a},x_{2a},y_{2a},h_{a},w_{a}\right)\) is an anchor bounding box generated by aggregating all ground-truth boxes. Then the deviations of the predicted and ground-truth bounding boxes are as follows:
$$t_{x1}=\frac{x_{1}-x_{1a}}{w_{a}},\quad t_{x2}=\frac{x_{2}-x_{2a}}{w_{a}}$$
(24)
$$t_{y1}=\frac{y_{1}-y_{1a}}{h_{a}},\quad t_{y2}=\frac{y_{2}-y_{2a}}{h_{a}}$$
(25)
$$t_{x1}^{*}=\frac{x_{1}^{*}-x_{1a}}{w_{a}},\quad t_{x2}^{*}=\frac{x_{2}^{*}-x_{2a}}{w_{a}}$$
(26)
$$t_{y1}^{*}=\frac{y_{1}^{*}-y_{1a}}{h_{a}},\quad t_{y2}^{*}=\frac{y_{2}^{*}-y_{2a}}{h_{a}}$$
(27)
That is, a parameter without * indicates the deviation between the prediction and the anchor bounding box, and a parameter with * indicates the deviation between the ground truth and the anchor bounding box.
Assuming that the coordinates are independent, a univariate Gaussian function is used for simplicity:
$$P_{\Theta }(x)=\frac{1}{\sqrt{2\pi \sigma^{2}}}e^{-\frac{\left(x-x_{e}\right)^{2}}{2\sigma^{2}}}$$
(28)
where x_{e} is the estimated bounding box position and the standard deviation σ is the estimated uncertainty. When σ → 0, the position accuracy of the bounding box is very high.
The ground-truth bounding box can also be expressed by a Gaussian distribution, which becomes a Dirac delta function when σ → 0:
$$P_{D}(x)=\delta \left(x-x_{g}\right)$$
(29)
where x_{g} is the ground-truth bounding box position. At this point, we can construct a bounding box regression function with KL loss, and establish a formula to minimize the KL divergence of P_{θ}(x) and P_{D}(x) over N samples:
$$\widehat{\Theta }=\underset{\Theta }{\mathrm{argmin}}\,\frac{1}{N}\sum D_{KL}\left(P_{D}(x)\,\|\,P_{\Theta }(x)\right)$$
(30)
KL divergence is used as the loss function L_{reg} for bounding box regression, and the classification loss L_{cls} remains unchanged. For a single sample:
$$\begin{aligned} L_{reg} & = D_{KL}\left(P_{D}(x)\,\|\,P_{\Theta }(x)\right) \\ & = \int P_{D}(x)\log P_{D}(x)\,dx - \int P_{D}(x)\log P_{\Theta }(x)\,dx \\ & = \frac{\left(x_{g}-x_{e}\right)^{2}}{2\sigma^{2}} + \frac{\log\left(\sigma^{2}\right)}{2} + \frac{\log\left(2\pi\right)}{2} - H\left(P_{D}(x)\right) \end{aligned}$$
(31)
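The closed form of Eq. (31) can be evaluated directly; the sketch below drops the constant entropy term H(P_D(x)), which does not depend on the parameters. It shows the intended trade-off: an overconfident wrong prediction (small σ, large error) is penalized heavily, while an accurate prediction can lower its loss by shrinking σ.

```python
import math

def kl_reg_loss(x_g, x_e, sigma):
    """Closed-form KL regression loss of Eq. (31), without the constant
    entropy term of the Dirac delta (it does not affect gradients)."""
    return ((x_g - x_e) ** 2) / (2 * sigma ** 2) \
        + math.log(sigma ** 2) / 2 + math.log(2 * math.pi) / 2

# With a fixed prediction error, claiming low uncertainty (small sigma)
# costs far more than admitting high uncertainty.
overconfident = kl_reg_loss(1.0, 0.0, sigma=0.1)
cautious = kl_reg_loss(1.0, 0.0, sigma=1.0)
```

Conversely, with zero error the log(σ²)/2 term rewards a small σ, matching the discussion of accurate, low-variance predictions.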
Since a prediction closer to the ground-truth bounding box is more stable and has a small variance, making the variance as small as possible reduces L_{reg} when the prediction is accurate; when the prediction is inaccurate, a larger variance is preferred. After the variance of the predicted bounding box position is obtained, the candidate positions are voted on according to the known variances of adjacent bounding boxes. Besides, the candidate coordinate values with the largest scores are weighted to update the coordinates of the bounding box, so as to make the positioning more accurate. What's more, bounding boxes with closer positions and lower variances have higher weights. The new coordinates are calculated as follows:
$$p_{i}=e^{-\left(1-IoU\left(b_{i},b\right)\right)^{2}/\sigma_{t}},\qquad x=\frac{{\sum }_{i}p_{i}x_{i}/\sigma_{x,i}^{2}}{{\sum }_{i}p_{i}/\sigma_{x,i}^{2}}\qquad \text{subject to } IoU\left(b_{i},b\right)>0$$
(32)
where \(\sigma_{t}\) is an adjustable parameter for variance voting. When \(IoU\left(b_{i},b\right)\) is larger, \(p_{i}\) is larger and the two bounding boxes overlap each other more; the same is done for the remaining coordinate values. SSD computes the loss of the generated candidate boxes through focal loss classification and bounding box regression, and the bounding box regression of SSD is improved with the KL loss method. Boxes with large variance, and adjacent bounding boxes whose overlap with the selected box is too small, get low scores when voting. Moreover, by using variance voting instead of IoU overlap alone, the SSD algorithm can effectively avoid the above anomalies.
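Eq. (32)'s variance voting for a single coordinate can be sketched as follows. The (x1, y1, x2, y2) box format, the helper names, and the σ_t value are our own illustrative choices, not the paper's settings.

```python
import math

def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def variance_vote_x1(selected, boxes, variances, sigma_t=0.02):
    """Vote the x1 coordinate of `selected` per Eq. (32).

    boxes: candidate boxes; variances: predicted variance of each
    box's x1. Neighbors with higher IoU and lower variance weigh more;
    boxes with IoU <= 0 are excluded (the "subject to" constraint).
    """
    num = den = 0.0
    for b, var in zip(boxes, variances):
        o = iou_xyxy(selected, b)
        if o <= 0:
            continue
        p = math.exp(-((1 - o) ** 2) / sigma_t)
        num += p * b[0] / var
        den += p / var
    return num / den

boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 10.0, 10.0)]
# Equal variances: the voted x1 lands between the two candidates.
x1_equal = variance_vote_x1(boxes[0], boxes, variances=[0.5, 0.5])
# A more confident neighbor (lower variance) pulls the result toward it.
x1_conf = variance_vote_x1(boxes[0], boxes, variances=[0.5, 0.1])
```

This makes the weighting described above concrete: lowering a neighbor's variance increases its 1/σ² weight and shifts the voted coordinate toward that neighbor.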
Model testing and analysis
The environment perception is divided into two parts; the micro part is mainly the perception of the scene by machine vision, which is used to confirm and supplement the macro perception.
First of all, we tested the ROI weighting using live campus photos taken on May 7, 2020. The advantage of this algorithm is that the region of interest can be identified first, and further perception can then be completed. Therefore, the region-of-interest test was performed first, and the effect of attention weighting was significant.
Second, the environment perception test was carried out: because the region of interest had been weighted, the weighted region was described first. After testing, the algorithm can complete the perception of a simple traffic scene, recognizing the red light at the intersection, the bus, and the right-turn sign on the road, and it can supplement and confirm the environment perception part.
At the same time, different algorithms, such as Google NIC and Log Bilinear, are compared with our experiments on different databases, because the algorithm performs well on the Flickr8K, Flickr30K and MS COCO databases, validating the experimental results of the Northwestern Polytechnic University team. The experimental results on the Flickr8K database are shown in Table 5, on the Flickr30K database in Table 6, and on the MS COCO database in Table 7.
The focus of the hazard prediction section is target detection. First of all, vehicle testing is carried out using field test images and data-set pictures. Secondly, dynamic vehicles need to be detected, including their speed, distance and driving direction. The vehicle target detection is shown in Fig. 7a and b, the dynamic vehicle direction estimation in Fig. 7c and d, and the dynamic vehicle distance estimation in Fig. 7e. The vehicle speed detector is used to detect the speed of the dynamic vehicle in Fig. 7f.