Error Handler

 

Control and Mask register

The start word of the error sequence can be
  • hAA: the error is decoded properly and with the correct priority
  • hEE: the error is not decoded properly and its details are not used.
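
A minimal sketch of how software reading the error sequence might branch on the start word. It assumes that hAA and hEE denote the hexadecimal values 0xAA and 0xEE; the constant and function names are hypothetical and are not taken from the Router firmware or software.

```python
START_DECODED = 0xAA      # error decoded properly, with the correct priority
START_NOT_DECODED = 0xEE  # error not decoded properly, details not used


def details_are_valid(start_word):
    """Return True when the error details that follow the start word can be used."""
    if start_word == START_DECODED:
        return True
    if start_word == START_NOT_DECODED:
        return False
    # Any other value is outside the two cases described above.
    raise ValueError(f"unexpected start word: {start_word:#x}")
```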

Inside the Router the errors are organized so that each class has a global definition (1 bit only), defined as the OR combination of all the errors that belong to the class.
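
The OR combination can be illustrated in software as follows. This is only a sketch of the logic, not the firmware implementation; the function name and the class names in the usage example are hypothetical.

```python
def class_global_bits(errors_by_class):
    """Map each class name to 1 if any of its member errors is active, else 0."""
    return {name: int(any(members)) for name, members in errors_by_class.items()}


# Usage: two illustrative classes, each with a list of member error flags.
example = class_global_bits({
    "trigger": [0, 1, 0],   # one member active, so the global bit is 1
    "daq": [0],             # no member active, so the global bit is 0
})
```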

A dedicated register has been implemented within the Router to store all the settings needed for the error handling. This register is completely decoupled from the Router operation and from the Router reset signals, so the resets have no effect on it. This ensures that no additional operations (e.g. new settings) are needed after a Router reset. The register address (internal displacement) is ‘hf0’; the register format is the following:

BIT NUMBER    MASK ERROR
0             Enable / disable error handling
1             TTCRX and QPLL link error
2             Trigger errors from TTC
3             Timeout Bunch Crossing reset error
4             Trigger errors from Router FSM
5             Errors from DAQ state machine
6             Optical link errors (RxReady & RxError)
7             RX error (half-stave optical link)
8             Error format (HS error format communication)
9             Error data transfer (half-stave error optical data transfer not coherent)
10            Error control int (command not properly recognized by the MCM)
11            Error event number (error in MCM event number)
12            HS_0 global error (idle, busy violation, linkRx fatal errors, etc.)
13            HS_1 global error (idle, busy violation, linkRx fatal errors, etc.)
14            HS_2 global error (idle, busy violation, linkRx fatal errors, etc.)
15            HS_3 global error (idle, busy violation, linkRx fatal errors, etc.)
16            HS_4 global error (idle, busy violation, linkRx fatal errors, etc.)
17            HS_5 global error (idle, busy violation, linkRx fatal errors, etc.)
18            Half-stave timeout errors during acquisition
19            Error data format (from Router data format check)
20            Error Fast-OR in data stream
21            Longer busy error
22            High multiplicity error
23 .. 32      Don’t care
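
The bit assignments above can be captured in a small lookup table. The sketch below only converts between bit names and register values; whether a bit set to 1 enables or masks the corresponding error is not stated here, so the code makes no assumption about polarity. All names (`MASK_REGISTER_OFFSET`, `MASK_BIT_NAMES`, `mask_value`, `mask_names`) are hypothetical.

```python
MASK_REGISTER_OFFSET = 0xF0  # internal displacement 'hf0

# Index in this list = bit number in the register (bits 0..22).
MASK_BIT_NAMES = (
    ["error_handling", "ttcrx_qpll_link", "trigger_ttc", "timeout_bc_reset",
     "trigger_router_fsm", "daq_state_machine", "optical_link", "rx_error",
     "error_format", "error_data_transfer", "error_control_int",
     "error_event_number"]
    + [f"hs_{n}_global" for n in range(6)]          # bits 12..17
    + ["hs_timeout", "error_data_format", "fast_or",
       "longer_busy", "high_multiplicity"]          # bits 18..22
)


def mask_value(names):
    """Build the register value with the named bits set."""
    value = 0
    for name in names:
        value |= 1 << MASK_BIT_NAMES.index(name)
    return value


def mask_names(value):
    """List the names of the bits set in a register value."""
    return [name for bit, name in enumerate(MASK_BIT_NAMES) if (value >> bit) & 1]
```

For example, `mask_value(["error_handling", "hs_0_global"])` sets bits 0 and 12.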

 

 

0x1 - TTCRX and QPLL link error (CDH error 355)

Description
The TTC link is an optical link that comes from the TTC-LTU splitter in 20 parallel links, one for each Router. If this link is not ready, the TTCRX chip and the Router do not work properly (the 40 MHz Router clock is recovered by the local PLL starting from this optical link).
Defined in router_fpga_core.v
 
Actions to do
Hardware problem, no software action is possible.
Check the status of the TTC optical link and the QPLL lock status (LEDs on the Router front panel).
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 TTC_QPLL_Link_Error
  • TTCRX_QPLL_Link_error                    0000000001
  • TTCRX_QPLL_Link_error, removed    0000000000
Detail 1 (32 bits): TTCRX error counter (time, expressed in clock cycles, when the error was present)
Detail 2 (32 bits): QPLL error counter
 

 

0x2 - Trigger errors from CTP (CDH error 357, SPDmood 19)

Description
Trigger signals that are not consistent are received by the TTCRX on-board chip of the Routers. This may be due to different reasons:
  • invalid trigger pattern sent to the back-end electronics;
  • L0-L1 delay not properly set;
  • LTU not well configured;
  • TTCRX chip not well configured (as a consequence of a TTCInit command not sent to the chip).
Defined in the trigger sequencer module TSM.v
 
Actions to do
  • reset the TTCRX chip on all the Routers;
  • send a ttcInit command from the LTU client;
  • send a ttcFEEReset command from the LTU client.
 
Error details
Error number (10 bits)
0 0 0 0 L0_error L1_error L1_message_missing L2_message_missing L1_message_spurious L2_message_spurious
  • L0_error                            0000100000
  • L1_error                            0000010000
  • L1_message_missing         0000001000
  • L2_message_missing         0000000100
  • L1_message_spurious        0000000010
  • L2_message_spurious        0000000001
Detail 1 (32 bits): difference L0 - L1
Detail 2 (32 bits): difference L1 - L2
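
The 10-bit error number above can be decoded with a simple bit test. This is a sketch under the assumption that more than one flag may be set at the same time, which is not confirmed here; `TRIGGER_ERROR_BITS` and `decode_trigger_error` are hypothetical names.

```python
# Bit position -> flag name, taken from the bit patterns listed above.
TRIGGER_ERROR_BITS = {
    5: "L0_error",
    4: "L1_error",
    3: "L1_message_missing",
    2: "L2_message_missing",
    1: "L1_message_spurious",
    0: "L2_message_spurious",
}


def decode_trigger_error(error_number):
    """Return the names of the flags set in the 10-bit error number."""
    return [name for bit, name in TRIGGER_ERROR_BITS.items()
            if (error_number >> bit) & 1]
```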
 
 

0x4 - Timeout BCNT reset

Description
Timeout for the synchronization between the half-stave clock phase and the LHC BC reset. It occurs when the ttcFEEreset command is received by the Router but the BCreset signal is not correctly received by the TTCRX chip. As a consequence, the Reset clock phase is not propagated from the Router.
Defined in TTC.v
 
Actions to do
  • reset TTCRX chip on the Router;
  • send TTCInit command from LTU client;
  • send ttcFEEreset command from LTU client.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 Timeout_BCNT_reset
  • In timeout                    0000000001
  • Timeout removed         0000000000
Detail 1 (32 bits): NOT USED
Detail 2 (32 bits): NOT USED
 
 

0x8 - Trigger_errors from master trigger control FSM in router

Description
The trigger signals that come from the TTC are consistent, but some signals are lost inside the Routers; as a consequence, trigger signals do not propagate properly to the LinkRX and to the detector.
Defined in router_fpga_core.v
 
Actions to do
  • Router reset
  • Link Rx reset
  • DPI reset
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 L0_error_from_L0_control_FSM L1_error_from_L1_control_FSM L2_error_from_L2_control_FSM
  • L0_error_from_L0_control_FSM:   0000000100 (highest priority)
  • L1_error_from_L1_control_FSM:   0000000010
  • L2_error_from_L2_control_FSM:   0000000001 (lowest priority)
Detail 1 (32 bits): difference L0 – L1
Detail 2 (32 bits): difference L1 – L2
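
The priority ordering can be sketched as a first-match search from the highest to the lowest priority bit. Whether the firmware reports a single bit or several at once is not stated here; the sketch simply picks the highest-priority flag present. The names `FSM_ERRORS_BY_PRIORITY` and `highest_priority_fsm_error` are hypothetical.

```python
# Ordered from highest to lowest priority, as listed above.
FSM_ERRORS_BY_PRIORITY = [
    (0b100, "L0_error_from_L0_control_FSM"),
    (0b010, "L1_error_from_L1_control_FSM"),
    (0b001, "L2_error_from_L2_control_FSM"),
]


def highest_priority_fsm_error(error_number):
    """Return the name of the highest-priority flag set, or None if none is set."""
    for mask, name in FSM_ERRORS_BY_PRIORITY:
        if error_number & mask:
            return name
    return None
```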
 
 

0x10 - DAQ Link not ready

Description
The DAQ Optical Link (DDL) is not ready for the data acquisition.
 
Actions to do
Hardware problem, no software action is possible. Inform the DAQ support.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 DAQ status
  • In timeout                    0000000001
  • Timeout removed         0000000000
Detail 1 (32 bits): NOT USED
Detail 2 (32 bits): NOT USED
 
 

0x20 - Error optical link of half-staves (HS_Optical_Link_Status)

Description
This error indicates which half-stave optical link is missing as a consequence of a problem present on an optical fiber or because the half-stave is switched off.
During the data acquisition, the error also indicates whether a half-stave has switched off or has optical fiber problems.
Defined in RXLinkStatus.v
 
Actions to do
  • disable the half-stave from the Router manual control OR
  • switch ON the half-stave
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 Error_Optical_Link
  • Error_Optical_Link present          0000000001
  • Error_Optical_Link removed         0000000000
Detail 1 (6 bits): number of the half-stave inside the Router with the optical error
Detail 2 (12 bits): status of the G-Link, Serial and Clock connections
 
 

0x40 - Error optical connection

Description
Many errors are taken into account in this flag:
  • errors in the Agilent component of the LinkRX connection;
  • sequences CAV-DAV not coherent;
  • First, Last, Clear bit events not coherent;
  • undefined command decoded by the PILOT2003 deserializer;
  • MCM event counter number not coherent.
Defined in RXLinkStatus.v
 
Actions to do
Hardware problem, check the quality of the optical fiber transmission.
 
Error details
Error number (10 bits)
0 0 0 0 0 Error_eventnumber Error_control Error_data_trans Error_format Rx_error
  • Error_eventnumber    0000010000 (highest priority)
  • Error_control              0000001000
  • Error_data_trans       0000000100
  • Error_format              0000000010
  • Rx_error                    0000000001 (lowest priority)
Detail 1: number of the half-stave inside the Router with the optical error
Detail 2: error counter

NOTE:
Error_eventnumber = the MCM event number did not increase by 1 for each L1 received
Error_control = the MCM has decoded an unidentified command
Error_data_trans = sequence of first-last words not coherent
Error_format = CAV-DAV sequences not correct
Rx_error = the link is inactive or the half-stave is momentarily unlocked
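
The event number rule in the note (an increase of 1 for each L1 received) can be illustrated with a short consistency check. This is a software sketch, not the firmware logic; the counter width is not given here, so the wrap-around value is left as an optional parameter. The names `first_event_number_error` and `modulus` are hypothetical.

```python
def first_event_number_error(event_numbers, modulus=None):
    """Return the index of the first event number that is not previous + 1.

    event_numbers: event numbers in the order the L1 triggers were received.
    modulus: optional wrap-around value of the counter (assumption, not from
             the Router documentation); None means no wrap-around.
    Returns None when the whole sequence is coherent.
    """
    for i in range(1, len(event_numbers)):
        expected = event_numbers[i - 1] + 1
        if modulus is not None:
            expected %= modulus
        if event_numbers[i] != expected:
            return i
    return None
```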
 
 

0x80 .. 0x1000 - Error_HS_LinkRx_0 ... 5 (CDH error 355, SPDmood 18)

Description
This class takes into account the internal errors coming from the LinkRx or the detector.
It occurs when the half-stave is not properly configured.
Defined in slm_check.v
 
Actions to do
Global reset:
  • Router reset
  • LinkRx reset
  • DPI + Data Reset
 
Error details
Error number (10 bits):
  • bit 9: linkrx_fatal_error                  1000000000 (highest priority)
  • bit 8: idle_violation                        1100000000
  • bit 7: busy_violation                      1010000000
  • bit 6: fifo_read_overflow_reg        1001000000 (pixel level)
  • bit 5: fifo_write_overflow_reg       1000100000 (pixel level)
  • bit 4: pixel_fifo_full_reg                 1000010000 (pixel level)
  • bit 3: event_fifo_read_overflow    1000001000 (event level)
  • bit 2: event_fifo_write_overflow   1000000100 (event level)
  • bit 1: event_desc_full_violation    1000000010 (event level)
  • bit 0: linkrx_dpm_full                     1000000001
 
Detail 1: NOT USED
Detail 2: NOT USED

NOTE:
Idle violation = L2y or L2n received without the corresponding L1
Busy violation = L1 trigger received with the busy signal asserted
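
A sketch of how the class code and the error number of this class might be decoded. It assumes that the six class codes 0x80 .. 0x1000 map in order to half-staves 0 .. 5, and it reads bit 9 separately because every pattern listed above has bit 9 set together with the specific cause bit. All names are hypothetical.

```python
# 0x80 -> HS 0, 0x100 -> HS 1, ..., 0x1000 -> HS 5 (assumed ordering).
HS_CLASS_CODES = {0x80 << hs: hs for hs in range(6)}

# Bit position -> cause, from the list above (bit 9 is handled separately).
LINKRX_CAUSE_BITS = {
    8: "idle_violation",
    7: "busy_violation",
    6: "fifo_read_overflow_reg",
    5: "fifo_write_overflow_reg",
    4: "pixel_fifo_full_reg",
    3: "event_fifo_read_overflow",
    2: "event_fifo_write_overflow",
    1: "event_desc_full_violation",
    0: "linkrx_dpm_full",
}


def decode_hs_linkrx(class_code, error_number):
    """Return (half-stave index, fatal flag, list of causes)."""
    half_stave = HS_CLASS_CODES[class_code]
    fatal = bool((error_number >> 9) & 1)
    causes = [name for bit, name in LINKRX_CAUSE_BITS.items()
              if (error_number >> bit) & 1]
    return half_stave, fatal, causes
```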
 
 

0x2000 - Timeout HS

Description
The timeout is set during the data acquisition when a half-stave does not transmit any data within 1000 μs (maximum time, defined in the TimeoutReadyEvent register of the spdFED).
It normally occurs when the multi-event buffer is not properly set on the MCM.
Defined in slm_check.v
 
Actions to do
Check the value of the Multi Event Buffer on the DPI manual control.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 timeout_finished
  • timeout_finished present:           0000000001
  • timeout_finished not present:     0000000000
Detail 1 (6 bits): number of the half-stave inside the Router with the problem
Detail 2: NOT USED
 
 

0x4000 - Data error format

Description
This error is set during the data acquisition and takes into account all the errors present in the data stream before the sending operation to the DAQ.
Defined in data_error_checking.v
 
Actions to do
Normally the data acquisition is not stopped, but if the error is continuously present during a run, a global reset is needed:
  • Router reset
  • LinkRx reset
  • DPI reset
 
Error details
Error number (10 bits):
  • Bit 5: error_data_header_missing_flag: 0000100000
  • Bit 4: error_wrong_chip_number_flag: 0000010000
  • Bit 3: error_wrong_event_number_flag: 0000001000
  • Bit 2: error_data_missing_flag: 0000000100
  • Bit 1: error_data_trailer_missing_flag: 0000000010
  • Bit 0: error_fill_word_missing_flag: 0000000001

Detail 1: event number [6..0]
Detail 2: chip number
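
A sketch of decoding this error record. It assumes that the notation [6..0] means the event number occupies bits 6 to 0 of Detail 1 and that Detail 2 carries the chip number directly; the function and dictionary names are hypothetical.

```python
# Bit position -> flag name, from the list above.
DATA_FORMAT_BITS = {
    5: "error_data_header_missing_flag",
    4: "error_wrong_chip_number_flag",
    3: "error_wrong_event_number_flag",
    2: "error_data_missing_flag",
    1: "error_data_trailer_missing_flag",
    0: "error_fill_word_missing_flag",
}


def decode_data_format_error(error_number, detail1, detail2):
    """Return the flags set plus the event and chip numbers from the details."""
    flags = [name for bit, name in DATA_FORMAT_BITS.items()
             if (error_number >> bit) & 1]
    return {
        "flags": flags,
        "event_number": detail1 & 0x7F,  # bits [6..0] of Detail 1 (assumed)
        "chip_number": detail2,
    }
```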

 

0x8000 - FastOr missing in data

Description
The Fast-OR bit is not present in the data stream. The module checks if at least one hit is present inside the chip matrix and if the Fast-OR bit is present in the trailer field. If the Fast-OR bit is not there, the error is generated.
Defined in data_error_checking.v
 
Actions to do
Set the proper delay between the Fast-OR and the L1 signals inside the LinkRx.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 Error_FastOr_missing_in_data
  • Error_FastOr_missing_in_data present:           0000000001
  • Error_FastOr_missing_in_data not present:     0000000000
Detail 1: number of the half-stave inside the Router with the problem
Detail 2: NOT USED

NOTE:
The delay is 46 (??) with MCM stimuli and 83 at P2.
 
 

0x10000 - Error longer busy

Description
Defined in TTC.v
 
Actions to do
.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 Error_Longer_Busy
  • Error longer busy present:           0000000001
  • Error longer busy not present:     0000000000
Detail 1: [3..0] DAQ, Router, HS, triggers in L1 FIFO
Detail 2: L1 ID
 
 

0x20000 - TTCFEEReset during router busy

Description
.
Defined in TTC.v
 
Actions to do
.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 Error_TTCFEEReset_Router_Busy
  • Error TTCFEEReset during router busy present:           0000000001
  • Error TTCFEEReset during router busy not present:     0000000000
Detail 1: NOT USED
Detail 2: orbit number

 

0x40000 - High multiplicity

Description
There is an event with high multiplicity (>10% chip occupancy). The readout of the event is cut to guarantee the overall readout time of 256 μs. This is a warning, not a real error.
Defined in ErrorManager.v
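
The 10% occupancy criterion can be written as a simple comparison. How ErrorManager.v actually counts hits is not described here, so this is only an illustration of the threshold; the function name and the parameters `hits`, `pixels_per_chip` and `threshold_percent` are hypothetical, and the number of pixels per chip must be supplied by the caller.

```python
def is_high_multiplicity(hits, pixels_per_chip, threshold_percent=10):
    """Return True when the chip occupancy is strictly above the threshold.

    Integer arithmetic is used so that the comparison is exact.
    """
    return hits * 100 > threshold_percent * pixels_per_chip
```

With 1000 pixels per chip, 101 hits are above the threshold and 100 hits are not.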
 
Actions to do
None.
 
Error details
Error number (10 bits)
0 0 0 0 0 0 0 0 0 High_multiplicity
  • High_multiplicity present:           0000000001
  • High_multiplicity not present:     0000000000

Detail 1: 6 bits to indicate the HS with high multiplicity event
Detail 2: orbit number