Troubleshooting reasoning service initialization issues

This topic describes common issues that occur during reasoning service startup and initialization, with their symptoms, causes, and resolutions.

Container startup failures

Container fails to start with configuration file errors

The reasoning service container exits immediately after startup with errors about missing or invalid configuration files.

Check the container logs for specific error messages:

kubectl logs <pod-name> -c reasoning-service

Common causes are missing configuration files, invalid JSON syntax, and incorrect file paths. Verify that all required configuration files exist in the config/ directory: llm_config.json, thread_settings.json, mcp_settings.json, logging_config.txt, and cs_graphql_settings.json. Validate the syntax of each JSON file with a JSON validator. Check the file permissions to ensure that the container can read the configuration files. Compare the structure of each file against the examples in the configuration topics.
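The existence and syntax checks above can be scripted. The following is a minimal sketch, not part of the product: the file names come from the list above, while the function name check_config_dir and the assumption that logging_config.txt is not JSON (so it is only checked for existence and readability) are illustrative.

```python
import json
from pathlib import Path

# File names are taken from the list above.
JSON_FILES = [
    "llm_config.json",
    "thread_settings.json",
    "mcp_settings.json",
    "cs_graphql_settings.json",
]
# Assumption: this file is not JSON, so it is only checked for presence.
OTHER_FILES = ["logging_config.txt"]


def check_config_dir(config_dir):
    """Return a list of human-readable problems found in config_dir."""
    problems = []
    base = Path(config_dir)
    for name in JSON_FILES + OTHER_FILES:
        path = base / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            problems.append(f"{name}: cannot read ({exc})")
            continue
        if name in JSON_FILES:
            try:
                json.loads(text)
            except json.JSONDecodeError as exc:
                problems.append(
                    f"{name}: invalid JSON at line {exc.lineno}, "
                    f"column {exc.colno}: {exc.msg}"
                )
    return problems


if __name__ == "__main__":
    # Assumption: the script runs from the directory that contains config/.
    for problem in check_config_dir("config"):
        print(problem)
```

An empty result means that every listed file exists, is readable, and (for the JSON files) parses. It does not confirm that the settings inside the files are valid for the service.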

Container fails with environment variable errors

The container logs show errors about missing or invalid environment variables, particularly for API keys or credentials.

The required environment variables depend on your LLM provider. For Azure OpenAI, ensure that AZURE_API_KEY is set if the configuration references "${AZURE_API_KEY}". For WatsonX, ensure that WATSONX_API_KEY is set. For Content Services GraphQL, ensure that CS_GRAPHQL_URL is set if the URL is not specified in the configuration file.

Verify environment variables are set in your deployment manifest. Check that secret references are correct and secrets exist in the namespace. Ensure environment variable names match exactly (case-sensitive). Test environment variable substitution by checking the actual values the container receives.
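One way to test substitution is to list every "${NAME}" placeholder in a configuration file and report the names that are unset or empty in the environment. The sketch below assumes that placeholders always use the "${NAME}" form shown above; the function name unset_placeholders is hypothetical, and the check says nothing about how the service itself performs substitution.

```python
import os
import re

# Matches placeholders of the form ${NAME}, as in "${AZURE_API_KEY}".
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unset_placeholders(config_text, env=None):
    """Return the sorted placeholder names that are unset or empty in env."""
    env = os.environ if env is None else env
    names = set(PLACEHOLDER.findall(config_text))
    return sorted(name for name in names if not env.get(name))


if __name__ == "__main__":
    # Assumption: the script runs from the directory that contains config/.
    with open("config/llm_config.json", encoding="utf-8") as handle:
        for name in unset_placeholders(handle.read()):
            print(f"{name} is referenced in llm_config.json but is not set")
```

Run the check in the same environment as the container (for example, through kubectl exec, if a Python interpreter is available in the image) so that it sees the values the container actually receives.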

LLM provider connection failures

Cannot connect to Azure OpenAI endpoint

The service starts but fails when attempting to connect to Azure OpenAI, with errors like "Connection refused" or "Name resolution failed".

Verify the endpoint URL in llm_config.json is correct and includes the protocol (https://). Check that the Azure OpenAI resource exists and is accessible from your cluster. Verify network connectivity from the pod to the Azure endpoint. Check firewall rules and network policies that might block outbound connections. Verify DNS resolution is working correctly in the cluster. Test connectivity using a debug pod with curl or similar tools.
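The URL and DNS checks can be separated so that a malformed URL is not mistaken for a network problem. The following sketch uses only the Python standard library. The function names are hypothetical, and the sketch does not read llm_config.json, because the name of the field that holds the endpoint is not assumed here.

```python
import socket
from urllib.parse import urlsplit


def endpoint_problems(url):
    """Return structural problems with an endpoint URL (no network access)."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme != "https":
        problems.append("URL must start with https://")
    if not parts.hostname:
        problems.append("URL has no host name")
    return problems


def resolve_host(url):
    """Return the sorted IP addresses for the URL's host.

    Raises socket.gaierror when name resolution fails, which corresponds to
    the "Name resolution failed" symptom.
    """
    host = urlsplit(url).hostname
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

If endpoint_problems returns an empty list but resolve_host raises socket.gaierror inside the pod, the URL is well formed and the problem is more likely cluster DNS or a mistyped resource name.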

Azure OpenAI authentication failures

Connection succeeds but authentication fails with 401 Unauthorized or 403 Forbidden errors.

Verify the API key is correct and has not expired. Check that the API key has the necessary permissions for the Azure OpenAI resource. If using Entra ID authentication, verify the service principal has the correct role assignments. Ensure the API version specified in configuration is supported by your Azure OpenAI resource. Check that the model deployment name matches exactly (case-sensitive).

WatsonX connection failures

The service cannot connect to WatsonX and reports errors about an invalid URL or connection timeouts.

Verify the WatsonX URL in llm_config.json is correct for your deployment region. For SaaS deployments, ensure you are using the correct regional endpoint (for example, https://us-south.ml.cloud.ibm.com). For lightweight deployments, verify the OpenShift route or service URL is accessible. Check that the space_id or project_id is correct. Verify the instance_id for lightweight deployments. Test connectivity to the WatsonX endpoint from within the cluster.

WatsonX authentication failures

Connection succeeds but authentication fails with API key errors.

Verify the WatsonX API key is valid and has not expired. Check that the API key has access to the specified space or project. For lightweight deployments, verify the username and password are correct. Ensure the version parameter matches your WatsonX deployment version. Check that SSL verification settings are appropriate for your environment.

AWS Bedrock connection failures

The service cannot connect to AWS Bedrock and reports errors about credentials or region configuration.

Verify AWS credentials are configured correctly (access key, secret key, session token if applicable). Check that the AWS region is correct for your Bedrock deployment. Ensure the IAM role or user has the necessary Bedrock permissions. Verify the model ID is correct and available in your region. Check that Bedrock is enabled in your AWS account and region.

MCP server connection failures

Cannot connect to MCP server

The reasoning service starts but fails to connect to configured MCP servers, with connection timeout or refused errors.

Verify the MCP server URL in mcp_settings.json is correct and accessible. Check that the MCP server is running and healthy. Verify network connectivity from the reasoning service pod to the MCP server. Check firewall rules and network policies. Ensure the MCP server is listening on the expected port. Test connectivity using curl or similar tools from within the cluster.
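A refused connection and a timeout point to different causes, so it helps to tell them apart. The following sketch classifies a single TCP connection attempt with the Python standard library. The function name check_port is hypothetical, and the host and port must be taken from your own MCP server URL.

```python
import socket


def check_port(host, port, timeout=5.0):
    """Classify a TCP connection attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <detail>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all; often a firewall or network policy dropping traffic.
        return "timeout"
    except OSError as exc:
        # Includes DNS failures and unreachable networks.
        return f"error: {exc}"
```

A result of "refused" usually means that the server is not running or is listening on a different port, while "timeout" more often indicates a firewall rule or network policy between the two pods.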

MCP server SSL/TLS errors

Connection fails with SSL certificate verification errors or TLS handshake failures.

If the MCP server uses a self-signed certificate, you can set ssl.verify to false in mcp_settings.json, but only in development environments. For production, provide the path to the CA certificate bundle in ssl.ca_bundle. Verify that the certificate is valid and has not expired. Check that the certificate hostname matches the hostname in the MCP server URL. Ensure that the certificate chain is complete.
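To reproduce a handshake failure outside the service, you can build a TLS context that mirrors the two settings and attempt a connection. The sketch below uses the Python ssl module. How the service itself maps ssl.verify and ssl.ca_bundle to a TLS context is an assumption, and both function names are hypothetical.

```python
import socket
import ssl


def build_ssl_context(verify=True, ca_bundle=None):
    """Build a client TLS context.

    verify=False disables certificate and hostname checks (development only).
    ca_bundle is an optional path to a PEM file of trusted CA certificates.
    """
    if not verify:
        context = ssl.create_default_context()
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        return context
    return ssl.create_default_context(cafile=ca_bundle)


def try_handshake(host, port, context, timeout=5.0):
    """Attempt a TLS handshake and return a short description of the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return f"ok ({tls.version()})"
    except ssl.SSLCertVerificationError as exc:
        # verify_message names the reason, such as an expired certificate
        # or a hostname mismatch.
        return f"certificate verification failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        return f"handshake failed: {exc}"
```

If the handshake succeeds with the CA bundle but fails with the default context, the server certificate is probably signed by a private CA that the default trust store does not include.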

MCP server authentication failures

Connection succeeds but authentication fails with 401 or 403 errors.

Verify that authorization_required is set correctly in mcp_settings.json. If the MCP server requires authorization, ensure the reasoning service is configured to pass user JWT tokens. Check that the JWT token format and claims are correct. Verify the MCP server's authentication configuration accepts tokens from the reasoning service.
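To check the claims in a token, you can decode its payload locally. The following sketch assumes a signed JWT in the usual three-segment compact form and deliberately does not verify the signature, so use it only for inspection. The function names are hypothetical, and the claims that the MCP server requires depend on its own configuration.

```python
import base64
import json
import time


def jwt_claims(token):
    """Decode the payload of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    payload = parts[1]
    # base64url encoding in JWTs omits padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(claims, now=None):
    """Return True if the 'exp' claim is present and not in the future."""
    now = time.time() if now is None else now
    exp = claims.get("exp")
    return exp is not None and exp <= now
```

Compare the decoded issuer, audience, and expiry claims with what the MCP server expects. An expired token or an unexpected audience is a common reason for a 401 response.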

MCP server timeout errors

Connections to the MCP server time out before completing.

Increase the timeout value in mcp_settings.json (the default is 240 seconds). Check whether the MCP server is experiencing performance issues or high load. Verify the network latency between the reasoning service and the MCP server. Consider whether the MCP server operations are inherently slow and need longer timeouts. Monitor the MCP server logs for errors or performance issues.